[jira] [Updated] (FLINK-21986) taskmanager native memory not release timely after restart

Feifan Wang (Jira) Thu, 25 Mar 2021 21:24:08 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Feifan Wang updated FLINK-21986:
--------------------------------
    Description: 
I ran a regular join job on Flink 1.12.1 and found that TaskManager native 
memory is not released promptly after a restart caused by exceeding the 
tolerable checkpoint failure threshold.

*Problem job information:*
 # The job first restarted because it exceeded the tolerable checkpoint failure threshold.
 # After that, the TaskManagers were killed by YARN many times.
 # In this case the TM heap is set to 7.68G, but heap usage on every TM stays under 4.2G.
 !image-2021-03-25-15-53-44-214.png|width=496,height=103!
 # Non-heap size increases after the restart, but still stays under 160M.
 
!https://km.sankuai.com/api/file/cdn/706284607/716474606?contentType=1&isNewContent=false&isNewContent=false|width=493,height=102!
 # TaskManager process memory increases by 3-4G after the restart (this figure shows one 
of the TaskManagers).
 !image-2021-03-25-16-07-29-083.png|width=493,height=107!

 

*My guess:*

The [RocksDB 
wiki|https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#memory-management]
 says: "Many of the Java Objects used in the RocksJava API will be backed 
by C++ objects for which the Java Objects have ownership. As C++ has no notion 
of automatic garbage collection for its heap in the way that Java does, we must 
explicitly free the memory used by the C++ objects when we are finished with 
them."

So, is it possible that RocksDBStateBackend does not call 
AbstractNativeReference#close() to release the memory used by the RocksDB C++ objects?
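
To make the ownership rule quoted above concrete, here is a minimal sketch of deterministic release with try-with-resources. It does not use RocksJava: FakeNativeHandle, OUTSTANDING_BYTES and useAndRelease are hypothetical stand-ins, and the "native" memory is only a counter. It assumes the real objects behave similarly, in that native memory stays allocated until close() is called, whatever the Java heap looks like.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a Java object that owns native (C++) memory. */
final class FakeNativeHandle implements AutoCloseable {
    /** Simulated number of native bytes held by handles that are still open. */
    static final AtomicLong OUTSTANDING_BYTES = new AtomicLong();

    private final long size;
    private boolean closed;

    FakeNativeHandle(long size) {
        this.size = size;
        OUTSTANDING_BYTES.addAndGet(size); // stands in for a native allocation
    }

    @Override
    public void close() {
        if (!closed) { // keep close() idempotent
            closed = true;
            OUTSTANDING_BYTES.addAndGet(-size); // stands in for the native free
        }
    }

    /** Opens a handle, reads the counter while it is open, then releases it. */
    static long useAndRelease(long size) {
        try (FakeNativeHandle handle = new FakeNativeHandle(size)) {
            return OUTSTANDING_BYTES.get(); // bytes held while the handle is open
        } // close() runs here, even if the body throws
    }
}
```

If a handle like this is dropped without close(), the counter never goes back down, which is the kind of growth outside the heap that the figures above would be consistent with.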

*I made a change:*

        Actively call System.gc() and System.runFinalization() every minute.
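
A minimal sketch of what such a change could look like, assuming a daemon thread inside the TaskManager JVM. The class name PeriodicGc and the method start are hypothetical, and this is not the patch that was actually tested. Note that System.gc() is only a request to the JVM, and System.runFinalization() is deprecated in recent JDKs.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that periodically asks the JVM to collect and finalize. */
final class PeriodicGc {
    private PeriodicGc() {}

    /** Starts a daemon thread that requests GC and finalization at a fixed rate. */
    static ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(runnable -> {
                    Thread thread = new Thread(runnable, "periodic-gc");
                    thread.setDaemon(true); // must not keep the JVM alive
                    return thread;
                });
        scheduler.scheduleAtFixedRate(() -> {
            System.gc();              // a hint; the JVM may ignore it
            System.runFinalization(); // run pending finalizers, if any
        }, period, period, unit);
        return scheduler;
    }
}
```

Calling PeriodicGc.start(1, TimeUnit.MINUTES) would match the one-minute interval described above. This is a diagnostic workaround: forced full collections add pause time, and it only helps if the native memory is held by objects that are released when they are garbage collected or finalized.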

 *Then I ran the test again:*
 # TaskManager process memory shows no obvious increase.
 !image-2021-03-26-11-46-06-828.png|width=495,height=93!
 # The job ran for several days and restarted many times, but no TaskManager was killed by 
YARN as before.

 

*Summary:*
 # First, some native memory is not released promptly after a restart in 
this situation.
 # I guess it may be RocksDB C++ objects, but I have not yet checked this in the source code 
of RocksDBStateBackend.

 


> taskmanager native memory not release timely after restart
> ----------------------------------------------------------
>
>                 Key: FLINK-21986
>                 URL: https://issues.apache.org/jira/browse/FLINK-21986
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.12.1
>         Environment: Flink version: 1.12.1
> Run mode: YARN session
> Job type: mock source -> regular join
>  
> Checkpoint interval: 3m
> TaskManager memory: 16G
>  
>            Reporter: Feifan Wang
>            Priority: Major
>         Attachments: image-2021-03-25-15-53-44-214.png, 
> image-2021-03-25-16-07-29-083.png, image-2021-03-26-11-46-06-828.png, 
> image-2021-03-26-11-47-21-388.png
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
