[
https://issues.apache.org/jira/browse/FLINK-36876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17904403#comment-17904403
]
chenyuzhi edited comment on FLINK-36876 at 12/10/24 8:51 AM:
-------------------------------------------------------------
I have analyzed the heap dump of the Kubernetes Operator using MAT.
One leak-suspect hint is the memory retained by *java.lang.ref.Finalizer*.
!image-2024-12-10-16-09-22-749.png|width=670,height=193!
The dominator_tree summary:
!image-2024-12-10-16-11-21-830.png|width=711,height=188!
According to the dominator_tree summary, the main cause is
*PoolThreadCache*. After some research and testing, I think the problem lies in
the reconcile mechanism.
As shown below, in every reconcile operation:
# the Kubernetes Operator creates a new RestClusterClient/RestClient to observe
each FlinkDeployment;
# creating that RestClient in turn creates a new NioEventLoopGroup/PoolThreadCache
(a simplified sketch follows the screenshot below).
!screenshot-1.png|width=672,height=175!
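The following is a simplified, hypothetical sketch of that observe pattern (the class and method names are illustrative, not the operator's actual FlinkService code):

{code:java}
import org.apache.flink.client.program.rest.RestClusterClient;
import org.apache.flink.configuration.Configuration;

public class ObservePerReconcileSketch {

    // Called once per FlinkDeployment per reconcile interval.
    public void observe(Configuration effectiveConfig, String clusterId) throws Exception {
        // New RestClusterClient -> new RestClient -> new NioEventLoopGroup/PoolThreadCache.
        try (RestClusterClient<String> client =
                new RestClusterClient<>(effectiveConfig, clusterId)) {
            client.listJobs().get(); // example REST call used to observe the cluster
        }
        // close() releases the client, but its PoolThreadCache is only reclaimed
        // after the JVM finalizer thread processes it (see root-cause analysis below).
    }
}
{code}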
Root cause analysis:
PoolThreadCache implements the finalize() method.
A PoolThreadCache instance is therefore not released until the JVM finalizer
thread (a single thread) has processed it, but the Kubernetes Operator creates a
new RestClusterClient (and with it a new PoolThreadCache) for every
FlinkDeployment in each reconcile interval.
As the number of FlinkDeployments grows, PoolThreadCache instances are created
faster than the finalizer thread can release them.
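To illustrate the bottleneck, here is a minimal, stand-alone sketch (not operator or Netty code) of why objects that override finalize() can pile up on the heap when they are created faster than the single JVM finalizer thread can process them:

{code:java}
// Minimal demo: finalizable objects are queued for the single finalizer thread.
// If allocation outpaces finalization, instances stay reachable from
// java.lang.ref.Finalizer and survive even full GCs until finalize() has run.
public class FinalizerBacklogDemo {

    static class CacheLike {
        // Stand-in for a resource such as Netty's PoolThreadCache.
        private final byte[] payload = new byte[1024 * 1024];

        @Override
        protected void finalize() throws Throwable {
            try {
                // A slow finalize() throttles the one finalizer thread for
                // *all* finalizable objects in the JVM.
                Thread.sleep(10);
            } finally {
                super.finalize();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        while (true) {
            new CacheLike();   // becomes unreachable immediately...
            Thread.sleep(5);   // ...but is reclaimed only after finalize() runs
        }
    }
}
{code}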
Solution:
One solution is to make RestClient accept a shared
NioEventLoopGroup/PoolThreadCache passed in from the outside.
This has been tested in our environment and it resolves the problem.
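A rough, Netty-only sketch of the idea (illustrative only; the actual fix would require a corresponding constructor or factory in Flink's RestClient, which is not shown here):

{code:java}
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;

public final class SharedEventLoopGroupSketch {

    // Created once for the whole operator process; shut down only on exit.
    private static final EventLoopGroup SHARED_GROUP = new NioEventLoopGroup(4);

    // Each reconcile builds a lightweight Bootstrap on top of the shared group
    // instead of instantiating a fresh NioEventLoopGroup (and PoolThreadCache).
    public static Bootstrap newClientBootstrap() {
        return new Bootstrap()
                .group(SHARED_GROUP)
                .channel(NioSocketChannel.class)
                .handler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // REST pipeline handlers would be added here.
                    }
                });
    }

    public static void shutdown() {
        SHARED_GROUP.shutdownGracefully();
    }
}
{code}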
> Operator Heap Memory Leak
> -------------------------
>
> Key: FLINK-36876
> URL: https://issues.apache.org/jira/browse/FLINK-36876
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1, kubernetes-operator-1.10.0
> Environment:
> Flink Operator Version: 1.6.1 (I think the latest version 1.10 has the same
> problem)
>
> JDK: openjdk version "11.0.24"
>
> GC: G1
>
> FlinkDeployment amount: 3000+
>
> Reporter: chenyuzhi
> Priority: Major
> Attachments: image-2024-12-10-15-57-00-309.png,
> image-2024-12-10-16-09-22-749.png, image-2024-12-10-16-11-21-830.png,
> screenshot-1.png
>
>
> When the number of FlinkDeployments increases, the heap memory used by the
> Kubernetes Operator keeps increasing.
> Even after an Old GC, the used heap memory does not decrease as expected.
> !image-2024-12-10-15-57-00-309.png|width=443,height=541!
> Finally, the Kubernetes Operator was OOMKilled by the OS.