[jira] [Commented] (FLINK-28248) Metaspace memory is leaking when repeatedly submitting Beam batch pipelines via the REST API

Arkadiusz Gasinski (Jira) Tue, 05 Jul 2022 07:00:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-28248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562680#comment-17562680
 ]


Arkadiusz Gasinski commented on FLINK-28248:
--------------------------------------------

After spending the past few days in the Eclipse memory analyzer and testing 
various dependency setups, I can safely say that the references to the 
ChildFirstClassLoader leak through thread locals containing Jackson 
configurations.

Here's one example, where the jobmanager-io-thread-1 holds a reference to the 
ChildFirstClassLoader instance through a thread-local entry. Essentially, 
each job submission adds a new thread-local entry that holds a reference to 
the new class loader instance used to submit the job.

!image-2022-07-05-15-47-45-038.png!
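
To illustrate the mechanism described above, here is a minimal sketch in plain 
Java. It is not Flink or Jackson code: the class ThreadLocalLoaderLeak, the 
CACHE field and the loaderSurvives method are hypothetical names, an empty 
URLClassLoader stands in for the ChildFirstClassLoader, and a single-thread 
executor stands in for jobmanager-io-thread-1. It only shows that a value left 
in a thread local on a long-lived thread stays strongly reachable after the 
"job" is done.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLoaderLeak {

    // Hypothetical per-thread cache, standing in for a library-level thread local.
    private static final ThreadLocal<Object> CACHE = new ThreadLocal<>();

    /**
     * Runs one simulated job submission on a long-lived thread and reports
     * whether the per-job class loader is still reachable afterwards.
     */
    public static boolean loaderSurvives(boolean cleanUp) {
        // Assumption: the pool thread plays the role of a long-lived IO thread.
        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        try {
            Callable<WeakReference<ClassLoader>> submission = () -> {
                // A fresh loader per submission, like one loader per job.
                ClassLoader jobLoader = new URLClassLoader(new URL[0], null);
                // The thread-local value pins the loader to this thread.
                CACHE.set(jobLoader);
                if (cleanUp) {
                    CACHE.remove(); // explicit cleanup releases the entry
                }
                return new WeakReference<ClassLoader>(jobLoader);
            };
            WeakReference<ClassLoader> ref = ioThread.submit(submission).get();
            System.gc(); // only a hint; a strongly reachable loader is never collected
            return ref.get() != null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            ioThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("no cleanup, loader reachable: " + loaderSurvives(false));
        System.out.println("with cleanup, loader reachable: " + loaderSurvives(true));
    }
}
```

Without cleanup the loader remains reachable for as long as the thread lives. 
With cleanup it becomes eligible for collection, although whether System.gc() 
actually reclaims it at that moment is up to the JVM. In a real heap dump the 
thread-local value is usually an object whose class was loaded by the job's 
class loader rather than the loader itself, but the retention path is the same.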

Here is an example where the RMI Connection thread holds a reference to 
ChildFirstClassLoader via its contextClassLoader field:

!image-2022-07-05-15-51-05-840.png!
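
One way a thread can end up with a job's class loader as its contextClassLoader 
is inheritance: a new thread takes the context class loader of the thread that 
creates it. The sketch below shows only that JDK behaviour. Whether this is how 
the RMI Connection thread got its reference here is an assumption, and the 
names ContextLoaderInheritance and childInheritsJobLoader are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ContextLoaderInheritance {

    /**
     * Creates a thread while a per-job loader is installed as the context
     * class loader and reports whether the new thread kept that loader.
     */
    public static boolean childInheritsJobLoader() {
        Thread current = Thread.currentThread();
        ClassLoader original = current.getContextClassLoader();
        // Stand-in for the per-job ChildFirstClassLoader (assumption).
        ClassLoader jobLoader = new URLClassLoader(new URL[0], original);
        Thread helper;
        try {
            current.setContextClassLoader(jobLoader);
            // Any thread created now inherits jobLoader as its context loader.
            helper = new Thread(() -> { }, "helper-created-during-job");
        } finally {
            // Restoring the caller's loader does not reset the helper's.
            current.setContextClassLoader(original);
        }
        return helper.getContextClassLoader() == jobLoader;
    }

    public static void main(String[] args) {
        System.out.println("helper kept job loader: " + childInheritsJobLoader());
    }
}
```

If such a helper thread is long-lived (a pool worker or a connection handler), 
it keeps the job's loader reachable until its context class loader is reset or 
the thread terminates.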

Again, it's Jackson that's present somewhere in between.

In the next screenshot, it's a flink-akka actor thread that indirectly holds 
a reference to a ChildFirstClassLoader instance and, again, it's Jackson that 
is somewhere in the middle.

!image-2022-07-05-15-58-43-448.png!

I think the important bit of information is that I moved the Jackson 
libraries to Flink's lib folder, because I had also moved some other common 
libs there that depend on Jackson. If Jackson is not there, job submission 
fails with a ClassNotFound exception, even if Jackson is packaged in the 
job's jar.

> Metaspace memory is leaking when repeatedly submitting Beam batch pipelines 
> via the REST API
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-28248
>                 URL: https://issues.apache.org/jira/browse/FLINK-28248
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Core
>    Affects Versions: 1.14.4
>            Reporter: Arkadiusz Gasinski
>            Priority: Major
>         Attachments: image-2022-06-24-14-45-51-689.png, 
> image-2022-06-24-14-51-47-909.png, image-2022-06-24-15-07-43-035.png, 
> image-2022-07-05-15-47-45-038.png, image-2022-07-05-15-51-05-840.png, 
> image-2022-07-05-15-58-43-448.png
>
>
> We have a Flink cluster running on k8s/OpenShift in session mode running our 
> Apache Beam pipelines. Some of these pipelines are streaming pipelines and 
> run continuously; some are batch pipelines submitted periodically whenever 
> there is a load to be processed.
> We believe that the batch pipelines cause the issue. We submit 1 to several 
> batch jobs every 5 minutes. For each job, a new instance of the 
> ChildFirstClassLoader is instantiated and it looks like they are not closed 
> properly after the job finishes.
> Attached is the screenshot from the Eclipse memory analyzer - from the Leak 
> Suspects report. When the heap dump was captured, there were 2 streaming and 
> several batch jobs running plus over 100 finished batch jobs.
> !image-2022-06-24-14-45-51-689.png!
> In our current setup, we allocate 8GB for the metaspace:
> !image-2022-06-24-14-51-47-909.png!
>  
> And the top components from the mem analyzer:
> !image-2022-06-24-15-07-43-035.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
