[
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176097#comment-17176097
]
Till Rohrmann commented on FLINK-16142:
---------------------------------------
Hi [~paguan], FLINK-16225 addresses the problem by terminating the
{{TaskManager}} process if a class leak is detected. The cluster remains
functional only if you have spare {{TaskManagers}} or Flink can start new
{{TaskManagers}} (e.g. if deployed on Yarn or K8s). The underlying problem of
class leaks cannot really be solved by this ticket because it is caused by some
third party dependencies (in your case Kinesis).
In order to properly solve this problem we need to understand where the class
leak is coming from. If it is caused by Kinesis, then one can either exclude the
part causing it or a fix in the Kinesis libraries is needed. As a first step,
you could take a look at the heap dumps of the failing {{TaskManager}} process
to see where the class leak is coming from. This could help us to decide on the
next steps.
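
Before digging through a heap dump, it can help to confirm that classes really are accumulating across job submissions. The following is a minimal sketch using only the standard JMX beans of the JDK; the class name MetaspaceProbe and its Snapshot type are made up for illustration and are not part of Flink. It assumes a HotSpot JVM where the memory pool is named "Metaspace".

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper (not a Flink class): samples class-loading counters. */
public class MetaspaceProbe {

    /** Immutable snapshot of the counters at one point in time. */
    public static final class Snapshot {
        public final int loadedClasses;       // classes currently loaded
        public final long unloadedClasses;    // classes unloaded since JVM start
        public final long metaspaceUsedBytes; // -1 if no "Metaspace" pool was found

        Snapshot(int loadedClasses, long unloadedClasses, long metaspaceUsedBytes) {
            this.loadedClasses = loadedClasses;
            this.unloadedClasses = unloadedClasses;
            this.metaspaceUsedBytes = metaspaceUsedBytes;
        }

        @Override
        public String toString() {
            return "loadedClasses=" + loadedClasses
                    + ", unloadedClasses=" + unloadedClasses
                    + ", metaspaceUsedBytes=" + metaspaceUsedBytes;
        }
    }

    /** Reads the current counters from the platform MXBeans. */
    public static Snapshot take() {
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        long used = -1L;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot reports the pool under the name "Metaspace".
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    used = usage.getUsed();
                }
            }
        }
        return new Snapshot(
                classLoading.getLoadedClassCount(),
                classLoading.getUnloadedClassCount(),
                used);
    }

    public static void main(String[] args) {
        System.out.println(take());
    }
}
```

If such a snapshot were logged inside the {{TaskManager}} after each submit/cancel cycle, a loaded class count that keeps growing while the unloaded count stays flat would suggest that the user code class loader of the cancelled job is still reachable. The heap dump can then be searched for the GC root that holds on to that class loader.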
> Memory Leak causes Metaspace OOM error on repeated job submission
> -----------------------------------------------------------------
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
> Issue Type: Bug
> Components: Client / Job Submission
> Affects Versions: 1.10.0
> Reporter: Thomas Wozniakowski
> Assignee: Andrey Zagrebin
> Priority: Blocker
> Attachments: Leak-GC-root.png, java_pid1.hprof, java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our
> use-case exactly (RocksDB state backend running in a containerised cluster).
> Unfortunately, it seems like there is a memory leak somewhere in the job
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom