[
https://issues.apache.org/jira/browse/FLINK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Feifan Wang updated FLINK-21986:
--------------------------------
Description:
I ran a regular join job on Flink 1.12.1 and found that TaskManager native
memory is not released promptly after the job restarts because the tolerable
checkpoint failure threshold was exceeded.
*Problem job information:*
# The job first restarted because the tolerable checkpoint failure threshold
was exceeded.
# After that, the TaskManagers were killed by YARN many times.
# In this case the TM heap is set to 7.68G, but the heap usage of every TM
stays under 4.2G.
!image-2021-03-25-15-53-44-214.png|width=496,height=103!
# Non-heap size increases after the restart, but still stays under 160M.
!https://km.sankuai.com/api/file/cdn/706284607/716474606?contentType=1&isNewContent=false&isNewContent=false|width=493,height=102!
# TaskManager process memory increases by 3-4G after the restart (this figure
shows one of the TaskManagers).
!image-2021-03-25-16-07-29-083.png|width=493,height=107!
*My guess:*
The [RocksDB
wiki|https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#memory-management]
says: "Many of the Java Objects used in the RocksJava API will be backed
by C++ objects for which the Java Objects have ownership. As C++ has no notion
of automatic garbage collection for its heap in the way that Java does, we must
explicitly free the memory used by the C++ objects when we are finished with
them."
So, is it possible that RocksDBStateBackend does not call
AbstractNativeReference#close() to release the memory used by RocksDB C++
objects?
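The ownership rule quoted above can be illustrated without RocksDB. The sketch below is only an illustration under stated assumptions: FakeNativeHandle and NativeOwnershipSketch are hypothetical names, not RocksJava or Flink classes, and the "native memory" is just a counter. It shows that memory owned by such an object is released deterministically only when close() is called, for example through try-with-resources.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a Java object that owns off-heap (C++) memory.
// It is NOT a RocksJava class; it only mimics the ownership contract.
final class FakeNativeHandle implements AutoCloseable {
    // Simulated number of native bytes that are allocated and not yet freed.
    static final AtomicLong OUTSTANDING_BYTES = new AtomicLong();

    private final long size;
    private boolean closed;

    FakeNativeHandle(long size) {
        this.size = size;
        OUTSTANDING_BYTES.addAndGet(size); // simulated native allocation
    }

    @Override
    public synchronized void close() {
        if (!closed) { // idempotent: a second close() is a no-op
            closed = true;
            OUTSTANDING_BYTES.addAndGet(-size); // simulated native free
        }
    }
}

final class NativeOwnershipSketch {
    // Deterministic release: close() runs when the try block exits.
    static long useAndClose(long size) {
        try (FakeNativeHandle handle = new FakeNativeHandle(size)) {
            // ... work with the handle ...
        }
        return FakeNativeHandle.OUTSTANDING_BYTES.get();
    }

    // No close(): the simulated native bytes stay outstanding even though
    // the Java object itself is immediately unreachable.
    static long useAndForget(long size) {
        new FakeNativeHandle(size);
        return FakeNativeHandle.OUTSTANDING_BYTES.get();
    }
}
```

If a code path behaves like useAndForget, the Java heap can look healthy while process memory keeps growing, which would be consistent with the heap and process figures reported above. Whether RocksDBStateBackend actually has such a path is not verified here.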
*I made a change:*
Actively call System.gc() and System.runFinalization() every minute.
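The issue does not show the code of this change, so the following is only a minimal sketch of how such a periodic call might be wired up. PeriodicGcSketch, its start method, and the RUNS counter are hypothetical names introduced for illustration; only System.gc(), System.runFinalization(), and the java.util.concurrent classes are real APIs.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of Flink.
final class PeriodicGcSketch {
    // Counts completed GC rounds so the behavior can be observed.
    static final AtomicLong RUNS = new AtomicLong();

    // Starts a daemon thread that requests a GC and runs pending
    // finalizers once per period. The caller shuts the scheduler down.
    static ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread thread = new Thread(runnable, "periodic-gc");
                    thread.setDaemon(true); // never blocks JVM shutdown
                    return thread;
                });
        scheduler.scheduleAtFixedRate(() -> {
            System.gc();              // a request; the JVM may ignore it
            System.runFinalization(); // deprecated in newer JDK releases
            RUNS.incrementAndGet();
        }, period, period, unit);
        return scheduler;
    }
}
```

For the interval described above, this would be called as PeriodicGcSketch.start(1, TimeUnit.MINUTES). Forcing GC this way only helps when the native memory is freed as a side effect of collecting or finalizing the owning Java objects, so it is better read as a diagnostic step than as a fix.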
*And ran the test again:*
# TaskManager process memory shows no obvious increase.
!image-2021-03-26-11-46-06-828.png|width=495,height=93!
# The job ran for several days and restarted many times, but no TaskManager
was killed by YARN as before.
*Summary:*
# First, some native memory is not released promptly after a restart in this
situation.
# I guess it may be RocksDB C++ objects, but I have not verified this in the
source code of RocksDBStateBackend.
> taskmanager native memory not release timely after restart
> ----------------------------------------------------------
>
> Key: FLINK-21986
> URL: https://issues.apache.org/jira/browse/FLINK-21986
> Project: Flink
> Issue Type: Bug
> Components: Runtime / State Backends
> Affects Versions: 1.12.1
> Environment: Flink version: 1.12.1
> run mode: YARN session
> job type: mock source -> regular join
>
> checkpoint interval: 3m
> Taskmanager memory : 16G
>
> Reporter: Feifan Wang
> Priority: Major
> Attachments: image-2021-03-25-15-53-44-214.png,
> image-2021-03-25-16-07-29-083.png, image-2021-03-26-11-46-06-828.png,
> image-2021-03-26-11-47-21-388.png