[
https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043658#comment-17043658
]
Thomas Wozniakowski commented on FLINK-16142:
---------------------------------------------
Hey [~sewen] - we've applied the fix to our build by excluding:
{code:groovy}
exclude 'com/amazonaws/jmx/SdkMBeanRegistrySupport*'
exclude 'com/masabi/pattern/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*'
exclude 'org/apache/flink/kinesis/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*'
{code}
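For context, here's roughly how those excludes sit in our build script. This is a minimal sketch assuming the Gradle Shadow plugin (com.github.johnrengelman.shadow) and the relocation prefixes shown above; adapt the task name to whatever produces your fat jar:
{code:groovy}
// build.gradle - sketch only, not our full build
shadowJar {
    // Exclude the AWS SDK's JMX registry support class under every prefix it can
    // appear at in the fat jar (unshaded, our own relocation, and the copy bundled
    // inside flink-connector-kinesis), so it is never loaded by the job classloader.
    exclude 'com/amazonaws/jmx/SdkMBeanRegistrySupport*'
    exclude 'com/masabi/pattern/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*'
    exclude 'org/apache/flink/kinesis/shaded/com/amazonaws/jmx/SdkMBeanRegistrySupport*'
}
{code}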
The results have been interesting. It certainly seems to _help_ with the problem,
but we're still seeing OOM errors in builds where jobs are rapidly started
and stopped. Looking at the TaskManager metrics, I can see that the classes
actually *are* being unloaded now (after 10 runs: 60,000 classes loaded,
52,000 unloaded), but the unloading seems to happen on a delay, as if something
is still alive and holding onto the job classloader for a few seconds after the job exits.
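To illustrate what we suspect is going on (purely a sketch on my part; the MBean name below is an assumption, not something I've confirmed from the dump), the SDK registers an MBean with the JVM-wide platform MBeanServer from classes that live in the job classloader:
{code:groovy}
// Sketch: how a platform MBean registered by job code can pin the job classloader.
import java.lang.management.ManagementFactory
import javax.management.ObjectName

def server = ManagementFactory.getPlatformMBeanServer()
// Assumed/example name - the SDK registers its metric admin MBean under something like this.
def name = new ObjectName('com.amazonaws.management:type=AwsSdkMetrics')

// The MBean instance was created by classes loaded through the job's
// ChildFirstClassLoader, so the platform MBeanServer (which lives as long as the
// JVM) holds a strong reference chain to that classloader. Its classes can only be
// unloaded after the MBean is unregistered and a GC actually runs, which would
// explain the unloading happening on a delay rather than immediately at job exit.
if (server.isRegistered(name)) {
    server.unregisterMBean(name)
}
{code}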
I've attached another heap dump; could you help us track down what might be
causing this final wrinkle? It feels like we're close here!
> Memory Leak causes Metaspace OOM error on repeated job submission
> -----------------------------------------------------------------
>
> Key: FLINK-16142
> URL: https://issues.apache.org/jira/browse/FLINK-16142
> Project: Flink
> Issue Type: Bug
> Components: Client / Job Submission
> Affects Versions: 1.10.0
> Reporter: Thomas Wozniakowski
> Priority: Blocker
> Fix For: 1.10.1, 1.11.0
>
> Attachments: Leak-GC-root.png, java_pid1.hprof, java_pid1.hprof
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our
> use-case exactly (RocksDB state backend running in a containerised cluster).
> Unfortunately, it seems like there is a memory leak somewhere in the job
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change to the text above is the OPERATOR_NAME placeholder, where I
> removed some internal specifics of our system.)
> This will reliably happen on a fresh cluster after submitting and cancelling
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom