[https://issues.apache.org/jira/browse/FLINK-25023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466995#comment-17466995]
Kerem Ulutaş commented on FLINK-25023:
--------------------------------------
I am also investigating a metaspace OOM exception in my setup. I am using a
custom Flink image, which is:
- Based on flink:1.13.5,
- Includes the Hadoop 2.10.0 binary distribution, downloaded directly from
[https://hadoop.apache.org/releases.html]
- {{export HADOOP_CLASSPATH=$(hadoop classpath)}} added to docker-entrypoint.sh
What I observe is that, in a Flink session cluster, the requested metaspace
does not decrease after a job finishes: usage is around 38.5 MB with no jobs
and 77.0 MB after one job, regardless of whether the job succeeds or fails.
Eventually, after a number of job submissions, I get a metaspace OOM exception.
In the thread dump tab of the TaskManager page, I don't see any
{{org.apache.hadoop}} threads initially, but after the first job submission the
following thread appears and stays:
{{"org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" daemon prio=5 Id=63 WAITING on java.lang.ref.ReferenceQueue$Lock@4f32f212}}
{{    at [email protected]/java.lang.Object.wait(Native Method)}}
{{    - waiting on java.lang.ref.ReferenceQueue$Lock@4f32f212}}
{{    at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)}}
{{    at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)}}
{{    at app//org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3693)}}
{{    at [email protected]/java.lang.Thread.run(Unknown Source)}}
I believe the following issues are also caused by the leaking
StatisticsDataReferenceCleaner thread, which is spawned by a static code block
in the {{org.apache.hadoop.fs.FileSystem}} class:
- FLINK-15239
- FLINK-19916
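To illustrate the mechanism, here is a minimal, self-contained sketch (not Hadoop's actual code; class and thread names are illustrative): a thread spawned from a static initializer inherits the context classloader of whichever thread first triggers class loading, so a cleaner thread first touched from user code keeps a reference to the user classloader for the lifetime of the JVM.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class StaticThreadLeakDemo {
    // Mimics FileSystem$Statistics: a static initializer that spawns a
    // long-lived daemon thread. The new thread inherits the context
    // classloader of the thread that triggered class initialization.
    static class Holder {
        static final Thread CLEANER;
        static {
            CLEANER = new Thread(() -> {
                try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException ignored) { }
            }, "StatisticsDataReferenceCleaner-demo");
            CLEANER.setDaemon(true);
            CLEANER.start(); // context classloader inherited here
        }
    }

    /** Returns true if the cleaner thread pinned the user classloader. */
    static boolean runDemo() {
        // Stand-in for Flink's ChildFirstClassLoader.
        ClassLoader userLoader = new URLClassLoader(new URL[0],
                StaticThreadLeakDemo.class.getClassLoader());
        // "User code" thread triggers Holder's static initializer.
        Thread userCode = new Thread(() -> Holder.CLEANER.getName(), "user-code");
        userCode.setContextClassLoader(userLoader);
        userCode.start();
        try {
            userCode.join();
        } catch (InterruptedException e) {
            return false;
        }
        // The daemon thread now references userLoader for as long as it lives.
        return Holder.CLEANER.getContextClassLoader() == userLoader;
    }

    public static void main(String[] args) {
        System.out.println("leaked user classloader: " + runDemo());
    }
}
```

Because the cleaner never terminates, the user classloader (and every class it loaded) can never be garbage-collected, which matches the metaspace growth per job submission described above.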
I also tried placing the jars required for accessing HDFS into the
{{$FLINK_HOME/lib}} directory. I set the Hadoop dependencies to scope
{{provided}} in my pom.xml and the job finished successfully, but I still have
the leaking thread; it did not go away as [~lirui] said in this comment.
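For reference, the provided-scope declaration looks roughly like this in pom.xml (the coordinates shown are for the Hadoop 2.10.0 client artifact; the exact set of dependencies in my build may differ):

```xml
<!-- Mark Hadoop dependencies as provided so they are not bundled into the
     job jar; the cluster supplies them at runtime via HADOOP_CLASSPATH. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.10.0</version>
  <scope>provided</scope>
</dependency>
```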
Also worth mentioning is this Hadoop issue comment, again from [~lirui].
> ClassLoader leak on JM/TM through indirectly-started Hadoop threads out of
> user code
> ------------------------------------------------------------------------------------
>
> Key: FLINK-25023
> URL: https://issues.apache.org/jira/browse/FLINK-25023
> Project: Flink
> Issue Type: Bug
> Components: Connectors / FileSystem, Connectors / Hadoop
> Compatibility, FileSystems
> Affects Versions: 1.14.0, 1.12.5, 1.13.3
> Reporter: Nico Kruber
> Assignee: David Morávek
> Priority: Major
> Labels: pull-request-available
>
> If a Flink job is using HDFS through Flink's filesystem abstraction (either
> on the JM or TM), that code may actually spawn a few threads, e.g. from
> static class members:
> * {{org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner}}
> * {{IPC Parameter Sending Thread#*}}
> These threads are started as soon as the classes are loaded, which may happen
> in the context of the user code. In this specific scenario, however, the
> created threads may contain references to the context class loader (I did not
> see that, though) or, as happened here, they may inherit thread contexts such
> as the {{ProtectionDomain}} (from an {{AccessController}}).
> Hence user contexts and user class loaders are leaked into long-running
> threads that are run in Flink's (parent) classloader.
> Fortunately, it seems to only *leak a single* {{ChildFirstClassLoader}} in
> this concrete example, but that may depend on which code paths each client
> execution walks.
>
> A *proper solution* doesn't seem so simple:
> * We could try to proactively initialize the available file systems, in the
> hope of starting all threads in the parent classloader with the parent
> context.
> * We could create a default {{ProtectionDomain}} for spawned threads as
> discussed at [https://dzone.com/articles/javalangoutofmemory-permgen];
> however, the {{StatisticsDataReferenceCleaner}} isn't actively spawned from
> any callback but as a static variable, and thus with the class loading itself
> (but maybe this is still possible somehow).
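The first proposal above (proactive initialization) could be sketched roughly as follows. This is a hedged outline, not Flink's actual fix: the method name, thread name, and target class are assumptions for illustration. The idea is to force class loading from a thread whose context classloader is the parent, so any statically spawned threads inherit the parent context instead of a user classloader.

```java
public class ProactiveFsInit {
    /**
     * Force-loads a class via the given (parent) classloader so that any
     * threads spawned by its static initializers inherit the parent context.
     * Returns true if the class was found and initialized.
     */
    static boolean preInitialize(String className, ClassLoader parent) {
        final boolean[] found = {false};
        Thread init = new Thread(() -> {
            try {
                // 'true' forces static initialization, so any static
                // threads start now, inside this parent-context thread.
                Class.forName(className, true, parent);
                found[0] = true;
            } catch (ClassNotFoundException ignored) {
                // dependency not on the classpath; nothing to pre-initialize
            }
        }, "fs-preinit");
        init.setContextClassLoader(parent);
        init.start();
        try {
            init.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return found[0];
    }

    public static void main(String[] args) {
        ClassLoader parent = ProactiveFsInit.class.getClassLoader();
        // In Flink this would target e.g. "org.apache.hadoop.fs.FileSystem"
        // before any user code runs; a JDK class is used here as a stand-in.
        System.out.println(preInitialize("java.util.concurrent.ConcurrentHashMap", parent));
    }
}
```

This only helps for threads started during static initialization; it would not address threads Hadoop starts lazily later (e.g. on first RPC), which may still inherit whatever context is active at that point.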
--
This message was sent by Atlassian Jira
(v8.20.1#820001)