[ 
https://issues.apache.org/jira/browse/FLINK-32203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726626#comment-17726626
 ] 

Oleksandr Nitavskyi commented on FLINK-32203:
---------------------------------------------

[~chesnay] thanks for looking into the PR 
(https://github.com/apache/flink/pull/22664). Attached is an example of the 
stack trace we get when a Log4jThread is created.

We ran the job and killed one of the JobManagers to rely on HA and trigger a 
job restart.
While debugging the Log4jThread creation we saw that the stack trace contains 
the Presto (for checkpoints) or Hadoop S3A (for writing output to S3) 
FileSystems, which are loaded from the plugin classloader (an example stack 
trace is attached).
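
For completeness, a minimal sketch of how the leftover threads can be spotted 
programmatically in a running JVM; the "Log4j" name filter is an assumption 
based on the thread names we see in the TaskManager dump, not an official 
Log4j or Flink identifier:

{code:java}
import java.util.Map;

public class Log4jThreadCheck {
    public static void main(String[] args) {
        // Walk all live threads and print the ones that look like Log4j watcher threads.
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (thread.getName().contains("Log4j")) {
                // Each leaked ChildFirstClassLoader keeps one such watcher thread alive.
                System.out.println(thread.getName()
                        + " -> context classloader: " + thread.getContextClassLoader());
            }
        }
    }
}
{code}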

Do you know whether a plugin classloader instance is created per job when the 
job is created? If so, this instance is probably passed to 
Log4jContextFactory, and thus a new Log4j subsystem is created.
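
A sketch of what I mean (not Flink code; the URLClassLoader below is just a 
stand-in for the plugin classloader, and whether a separate context is 
actually created depends on which Log4j ContextSelector is configured):

{code:java}
import java.net.URL;
import java.net.URLClassLoader;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.spi.LoggerContext;

public class PluginContextSketch {
    public static void main(String[] args) {
        ClassLoader appLoader = PluginContextSketch.class.getClassLoader();
        // Stand-in for a plugin classloader: an empty child of the application classloader.
        ClassLoader pluginLoader = new URLClassLoader(new URL[0], appLoader);

        // Ask the context factory for a context bound to each classloader.
        LoggerContext appCtx = LogManager.getContext(appLoader, false);
        LoggerContext pluginCtx = LogManager.getContext(pluginLoader, false);

        // Whether these are the same object depends on the configured ContextSelector;
        // with a selector that keys contexts purely by classloader, the plugin loader
        // would get its own LoggerContext, and thus its own configuration watcher
        // thread if monitorInterval is set.
        System.out.println("same context: " + (appCtx == pluginCtx));
    }
}
{code}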

> Potential ClassLoader memory leak due to log4j configuration
> ------------------------------------------------------------
>
>                 Key: FLINK-32203
>                 URL: https://issues.apache.org/jira/browse/FLINK-32203
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Oleksandr Nitavskyi
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: classloader_leak.png, 
> stack_trace_example_with_log4j_creation_on_job_reload.log
>
>
> *Context*
> We have encountered a memory leak related to ClassLoaders in Apache Flink: 
> the ChildFirstClassLoader is not properly garbage collected when a job is 
> restarted.
> A heap dump has shown that Log4j starts a configuration watch thread, which 
> holds a strong reference to the ChildFirstClassLoader via its 
> AccessControlContext. Since the thread is never stopped, the 
> ChildFirstClassLoader is never cleaned up. Removing the monitorInterval 
> introduced in FLINK-20510 helps to mitigate the issue; I believe this could 
> be applied to the Log4j config by default.
> *How to reproduce*
> Deploy a Flink job that uses the Hadoop file system (e.g. s3a). Redeploy the 
> job -> in a TaskManager thread dump you should see multiple Log4jThreads.
> *AC*
> With the default configuration, Flink users do not easily run into this 
> memory leak.
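
For reference, the setting the quoted description refers to: Flink's default 
conf/log4j.properties enables hot-reloading of the logging configuration via 
Log4j's monitorInterval, and it is this file watcher that spawns the 
Log4jThread. A minimal sketch of the relevant line (the 30-second value is the 
one I recall from the default config and may differ in your distribution):

{code}
# Allows the logging configuration to be reloaded at runtime; Log4j starts a
# file-watcher thread for this, which is what ends up pinning the classloader.
monitorInterval = 30

# Mitigation discussed above: drop (or comment out) monitorInterval so that no
# watcher thread is started for contexts created with the user-code classloader.
{code}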



