[jira] [Comment Edited] (FLINK-9693) Possible memory leak in jobmanager retaining archived checkpoints

Steven Zhen Wu (JIRA) Wed, 25 Jul 2018 22:16:36 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556325#comment-16556325
 ]


Steven Zhen Wu edited comment on FLINK-9693 at 7/26/18 5:15 AM:
----------------------------------------------------------------

We can actually reproduce the issue by killing the jobmanager node for very 
large jobs, e.g. with parallelism over 1,000. The issue starts to appear when 
the replacement jobmanager node comes up.

Another observation is that the ~10 GB memory leak seems to happen very quickly 
(within a few minutes).


was (Author: stevenz3wu):
We can actually reproduce the issue by killing the jobmanager node for very 
large jobs, e.g. with parallelism over 1,000. The issue starts to appear when 
the replacement jobmanager node comes up.

> Possible memory leak in jobmanager retaining archived checkpoints
> -----------------------------------------------------------------
>
>                 Key: FLINK-9693
>                 URL: https://issues.apache.org/jira/browse/FLINK-9693
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager, State Backends, Checkpointing
>    Affects Versions: 1.5.0, 1.6.0
>         Environment: !image.png!!image (1).png!
>            Reporter: Steven Zhen Wu
>            Assignee: Till Rohrmann
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.4.3, 1.5.1, 1.6.0
>
>         Attachments: 20180725_jm_mem_leak.png, 
> 41K_ExecutionVertex_objs_retained_9GB.png, ExecutionVertexZoomIn.png
>
>
> First, some context about the job
>  * Flink 1.4.1
>  * stand-alone deployment mode
>  * embarrassingly parallel: all operators are chained together
>  * parallelism is over 1,000
>  * stateless except for the Kafka source operators; checkpoint size is 8.4 MB
>  * set "state.backend.fs.memory-threshold" so that only the jobmanager writes 
> checkpoint data to S3
>  * internal checkpoint with 10 checkpoints retained in history
>  
> Summary of the observations
>  * 41,567 ExecutionVertex objects retained 9+ GB of memory
>  * Expanding one ExecutionVertex shows that it seems to be storing the Kafka 
> offsets for the source operator
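
As a rough sanity check on the figures quoted above, the average memory retained per ExecutionVertex can be estimated. This is only a back-of-the-envelope sketch: it assumes "9+ GB" means about 9 GiB and that the memory is spread evenly across the objects, neither of which is confirmed by the report.

```python
# Figures quoted in the issue description.
retained_bytes = 9 * 1024**3   # "9+ GB", treated as exactly 9 GiB (assumption)
vertex_count = 41_567          # retained ExecutionVertex objects

# Average retained size per object, assuming an even spread (assumption).
avg_bytes = retained_bytes / vertex_count
print(f"average retained per ExecutionVertex: {avg_bytes / 1024:.0f} KiB")
```

Under these assumptions the average comes to a few hundred KiB per object. That is far larger than a bare vertex would be expected to need, which fits the observation that each one is holding on to checkpoint state such as the Kafka offsets.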



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
