[ https://issues.apache.org/jira/browse/FLINK-25023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kerem Ulutaş updated FLINK-25023:
---------------------------------
Attachment:
flink--standalonesession-0-flink-jobmanager-589479f45b-p66k4-1.log
flink--taskexecutor-0-flink-taskmanager-9f6685b57-vfb2n-1.log
> ClassLoader leak on JM/TM through indirectly-started Hadoop threads out of
> user code
> ------------------------------------------------------------------------------------
>
> Key: FLINK-25023
> URL: https://issues.apache.org/jira/browse/FLINK-25023
> Project: Flink
> Issue Type: Bug
> Components: Connectors / FileSystem, Connectors / Hadoop
> Compatibility, FileSystems
> Affects Versions: 1.14.0, 1.12.5, 1.13.3
> Reporter: Nico Kruber
> Assignee: David Morávek
> Priority: Major
> Labels: pull-request-available
> Attachments: Screen Shot 2021-12-31 at 21.26.25.png, Screen Shot
> 2021-12-31 at 21.26.44.png,
> flink--standalonesession-0-flink-jobmanager-589479f45b-p66k4-1.log,
> flink--standalonesession-0-flink-jobmanager-589479f45b-p66k4.log,
> flink--taskexecutor-0-flink-taskmanager-9f6685b57-vfb2n-1.log,
> flink--taskexecutor-0-flink-taskmanager-9f6685b57-vfb2n.log,
> job_submission-1.log, job_submission-2.log, job_submission.log,
> taskmanager_thread_dump-1.log, taskmanager_thread_dump.log
>
>
> If a Flink job is using HDFS through Flink's filesystem abstraction (either
> on the JM or TM), that code may actually spawn a few threads, e.g. from
> static class members:
> * {{org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner}}
> * {{IPC Parameter Sending Thread#*}}
> These threads are started as soon as the classes are loaded, which may happen
> in the context of the user code. In this specific scenario, the created
> threads may hold references to the context class loader (I did not see
> that, though) or, as happened here, they may inherit thread context such as
> the {{ProtectionDomain}} (from an {{AccessController}}).
> Hence user contexts and user class loaders are leaked into long-running
> threads that run in Flink's (parent) classloader.
> Fortunately, it seems to *leak only a single* {{ChildFirstClassLoader}} in
> this concrete example, but that may depend on which code paths each client
> execution walks.
>
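
The inheritance described above can be reproduced without Hadoop. The following is a minimal sketch, not code from Flink or Hadoop: the class name InheritedContextDemo, the method loaderSeenBySpawnedThread, and the empty URLClassLoader standing in for a user-code class loader are all assumptions made for illustration. It only shows the context class loader path; the ProtectionDomain path mentioned in the report is not covered.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class InheritedContextDemo {

    /**
     * Starts a thread while {@code loader} is the caller's context class loader
     * and returns the context class loader observed inside that thread.
     */
    public static ClassLoader loaderSeenBySpawnedThread(ClassLoader loader)
            throws InterruptedException {
        Thread current = Thread.currentThread();
        ClassLoader original = current.getContextClassLoader();
        ClassLoader[] seen = new ClassLoader[1];
        current.setContextClassLoader(loader);
        try {
            // A new Thread copies the context class loader of the thread that creates it.
            Thread spawned = new Thread(
                    () -> seen[0] = Thread.currentThread().getContextClassLoader(),
                    "demo-cleaner");
            spawned.setDaemon(true);
            spawned.start();
            spawned.join(); // join() makes the write to seen[0] visible here
        } finally {
            current.setContextClassLoader(original);
        }
        return seen[0];
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a per-job user-code class loader.
        ClassLoader userLoader =
                new URLClassLoader(new URL[0], InheritedContextDemo.class.getClassLoader());
        System.out.println(loaderSeenBySpawnedThread(userLoader) == userLoader); // prints "true"
    }
}
```

If the spawned thread were long-running instead of exiting immediately, it would keep the stand-in loader reachable for as long as it lives, which is the shape of the leak reported here.
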
> A *proper solution* doesn't seem so simple:
> * We could try to proactively initialize the available file systems in the
> hope of starting all threads in the parent classloader with the parent context.
> * We could create a default {{ProtectionDomain}} for spawned threads as
> discussed at [https://dzone.com/articles/javalangoutofmemory-permgen].
> However, the {{StatisticsDataReferenceCleaner}} is not spawned from any
> callback; it is started from a static variable, i.e. during class loading
> itself (but maybe this is still possible somehow).
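
The first idea above (proactive initialization under the parent class loader) could look roughly like the sketch below. This is an illustration under stated assumptions, not the fix that was applied in Flink: EagerInitSketch, initializeUnder, and the nested ThreadSpawningStatics class are hypothetical names, and ThreadSpawningStatics merely stands in for a class such as the Hadoop one whose static initializer starts a thread. The sketch only controls the context class loader; it does not by itself reset an inherited ProtectionDomain.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class EagerInitSketch {

    /** Hypothetical stand-in for a class whose static initializer starts a long-lived thread. */
    public static class ThreadSpawningStatics {
        public static volatile ClassLoader capturedLoader;

        static {
            Thread cleaner = new Thread(() -> { }, "stand-in-cleaner");
            cleaner.setDaemon(true);
            // The new thread inherited this from whichever thread runs the static initializer.
            capturedLoader = cleaner.getContextClassLoader();
            cleaner.start();
        }
    }

    /** Forces static initialization of {@code className} with {@code parent} as the context class loader. */
    public static void initializeUnder(ClassLoader parent, String className)
            throws ClassNotFoundException {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(parent);
        try {
            Class.forName(className, true, parent); // 'true' runs the static initializer now
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) throws Exception {
        ClassLoader parent = EagerInitSketch.class.getClassLoader();
        // Simulate being called while a user-code class loader is the context class loader.
        ClassLoader userLoader = new URLClassLoader(new URL[0], parent);
        Thread.currentThread().setContextClassLoader(userLoader);

        initializeUnder(parent, ThreadSpawningStatics.class.getName());

        System.out.println(ThreadSpawningStatics.capturedLoader == parent);               // prints "true"
        System.out.println(Thread.currentThread().getContextClassLoader() == userLoader); // prints "true"
    }
}
```

Because a static initializer runs only once per class and class loader, this only helps if it happens before any user code touches the class, for example during process startup.
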
--
This message was sent by Atlassian Jira
(v8.20.1#820001)