[jira] [Commented] (FLINK-11205) Task Manager Metaspace Memory Leak

William Leese (JIRA) Mon, 29 Jul 2019 07:37:30 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895306#comment-16895306
 ]


William Leese commented on FLINK-11205:
---------------------------------------

We're seeing some significant metaspace growth with flink 1.8.0 and job 
restarts. We can easily hit 1Gb metaspace usage within 15 minutes with rapid 
restarts.
That said, we attempted to troubleshoot this issue by looking at all the 
classes in metaspace using `jcmd <pid> GC.class_stats`.

Here we observed that after every job restart another entry is created for 
every class in our job. Where the old classes have InstBytes=0. So far so good, 
but moving to the Total column for these entries show that memory is still 
being used. Also, adding up all entries in the Total column indeed corresponds 
to our metaspace usage. So far we could only conclude that our job classes - 
none of them - were being unloaded.

Then we stumbled upon this ticket. Now here are our results running the 
SocketWindowWordCount jar in a flink 1.8.0 cluster with one taskmanager.

We achieve a class count by doing a jcmd 3052 GC.class_stats | grep -i 
org.apache.flink.streaming.examples.windowing.SessionWindowing | wc -l

*First* run: 
  Class Count: 1
  Metaspace: 30695K

After *800*~ runs:
  Class Count: 802
  Metaspace: 39406K


Interesting when we looked a bit later the class count slowly went down, 
slowly, step by step. If I had to guess it took about 15 minutes to go from 
800~ to 170~, with metaspace dropping to 35358K. In a sense we've seen this 
behavior, but with much much larger increases in metaspace usage over far fewer 
job restarts.

> Task Manager Metaspace Memory Leak 
> -----------------------------------
>
>                 Key: FLINK-11205
>                 URL: https://issues.apache.org/jira/browse/FLINK-11205
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.5, 1.6.2, 1.7.0
>            Reporter: Nawaid Shamim
>            Priority: Major
>         Attachments: Screenshot 2018-12-18 at 12.14.11.png, Screenshot 
> 2018-12-18 at 15.47.55.png
>
>
> Job Restarts causes task manager to dynamically load duplicate classes. 
> Metaspace is unbounded and grows with every restart. YARN aggressively kill 
> such containers but this affect is immediately seems on different task 
> manager which results in death spiral.
> Task Manager uses dynamic loader as described in 
> [https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html]
> {quote}
> *YARN*
> YARN classloading differs between single job deployments and sessions:
>  * When submitting a Flink job/application directly to YARN (via {{bin/flink 
> run -m yarn-cluster ...}}), dedicated TaskManagers and JobManagers are 
> started for that job. Those JVMs have both Flink framework classes and user 
> code classes in the Java classpath. That means that there is _no dynamic 
> classloading_ involved in that case.
>  * When starting a YARN session, the JobManagers and TaskManagers are started 
> with the Flink framework classes in the classpath. The classes from all jobs 
> that are submitted against the session are loaded dynamically.
> {quote}
> The above is not entirely true specially when you set {{-yD 
> classloader.resolve-order=parent-first}} . We also above observed the above 
> behaviour when submitting a Flink job/application directly to YARN (via 
> {{bin/flink run -m yarn-cluster ...}}).
> !Screenshot 2018-12-18 at 12.14.11.png!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

[jira] [Commented] (FLINK-11205) Task Manager Metaspace Memory Leak

Reply via email to