[jira] [Commented] (FLINK-21986) taskmanager native memory not release timely after restart

Yun Tang (Jira) Wed, 31 Mar 2021 23:59:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312920#comment-17312920
 ]


Yun Tang commented on FLINK-21986:
----------------------------------

[~Feifan Wang] I think this is really weird after looking at your prof svg. 
Could you make your example public so that I can run the job myself?

If it's not easy to extract the logic, could you please provide the RocksDB 
LOG? It needs to be at DEBUG level, which you can set in Flink 1.13: 
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#state-backend-rocksdb-log-level

> taskmanager native memory not release timely after restart
> ----------------------------------------------------------
>
>                 Key: FLINK-21986
>                 URL: https://issues.apache.org/jira/browse/FLINK-21986
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.12.1
>         Environment: flink version: 1.12.1
> run: yarn session
> job type: mock source -> regular join
>  
> checkpoint interval: 3m
> Taskmanager memory : 16G
>  
>            Reporter: Feifan Wang
>            Priority: Major
>         Attachments: 82544.svg, image-2021-03-25-15-53-44-214.png, 
> image-2021-03-25-16-07-29-083.png, image-2021-03-26-11-46-06-828.png, 
> image-2021-03-26-11-47-21-388.png
>
>
> I ran a regular join job with Flink 1.12.1 and found that TaskManager native 
> memory is not released in a timely manner after a restart caused by 
> exceeding the tolerable checkpoint failure threshold.
> *Problem job information:*
>  # The job first restarted because the tolerable checkpoint failure 
> threshold was exceeded.
>  # Then the TaskManagers were killed by YARN many times.
>  # In this case, the TM heap is set to 7.68G, but every TM's heap size stays 
> under 4.2G.
>  !image-2021-03-25-15-53-44-214.png|width=496,height=103!
>  # Non-heap size increases after the restart, but still stays under 160M.
>  
> !https://km.sankuai.com/api/file/cdn/706284607/716474606?contentType=1&isNewContent=false&isNewContent=false|width=493,height=102!
>  # TaskManager process memory increases by 3-4G after the restart (this 
> figure shows one of the TaskManagers).
>  !image-2021-03-25-16-07-29-083.png|width=493,height=107!
>  
> *My guess:*
> The [RocksDB 
> wiki|https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#memory-management]
>  mentions: Many of the Java Objects used in the RocksJava API will be backed 
> by C++ objects for which the Java Objects have ownership. As C++ has no 
> notion of automatic garbage collection for its heap in the way that Java 
> does, we must explicitly free the memory used by the C++ objects when we are 
> finished with them.
> So, is it possible that RocksDBStateBackend does not call 
> AbstractNativeReference#close() to release the memory used by RocksDB C++ 
> objects?
> *I made a change:*
>         Actively call System.gc() and System.runFinalization() every minute.
>  *And ran the test again:*
>  # TaskManager process memory shows no obvious increase.
>  !image-2021-03-26-11-46-06-828.png|width=495,height=93!
>  # The job ran for several days and restarted many times, but no TaskManager 
> was killed by YARN as before.
>  
> *Summary:*
>  # First, some native memory cannot be released in a timely manner after a 
> restart in this situation.
>  # I guess it may be RocksDB C++ objects, but I have not checked this against 
> the source code of RocksDBStateBackend.
>  
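
For anyone trying to reproduce the experiment quoted above (calling 
System.gc() and System.runFinalization() every minute), here is a minimal 
sketch of what such a periodic trigger could look like. This is not code from 
Flink or from the reporter's job: the class name PeriodicGcTrigger and its 
methods are hypothetical, and where it would be started inside a TaskManager 
is left open. It is a diagnostic aid only, not a proposed fix.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of Flink): periodically asks the JVM to
 * collect garbage and run pending finalizers.
 */
final class PeriodicGcTrigger implements AutoCloseable {

    // Single daemon thread so the trigger never keeps the JVM alive.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "periodic-gc-trigger");
                thread.setDaemon(true);
                return thread;
            });

    // Number of completed trigger runs, useful for logging.
    private final AtomicLong runs = new AtomicLong();

    /** Schedules runOnce() at a fixed rate, e.g. start(1, TimeUnit.MINUTES). */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::runOnce, period, period, unit);
    }

    /** One trigger run. Both calls are only requests to the JVM. */
    void runOnce() {
        System.gc();
        // Deprecated in recent JDKs, but it is the call named in the report.
        System.runFinalization();
        runs.incrementAndGet();
    }

    long runCount() {
        return runs.get();
    }

    /** Stops the background thread. */
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Calling start(1, TimeUnit.MINUTES) on an instance would match the one-minute 
interval from the report. If process memory stops growing with the trigger 
enabled, that suggests the native memory is owned by Java objects that are 
only cleaned up once they are collected, which is consistent with the guess 
in the description but does not prove it.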



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
