[
https://issues.apache.org/jira/browse/FLINK-36876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17904403#comment-17904403
]
chenyuzhi edited comment on FLINK-36876 at 12/10/24 8:51 AM:
-------------------------------------------------------------
I have analyzed the heap dump of the Kubernetes Operator using MAT.
One leak-suspect hint is the memory retained by *java.lang.ref.Finalizer*.
!image-2024-12-10-16-09-22-749.png|width=670,height=193!
The dominator_tree summary:
!image-2024-12-10-16-11-21-830.png|width=711,height=188!
According to the dominator_tree summary, the main cause is
*PoolThreadCache*. After some research and testing, I think the problem lies in
the reconcile mechanism.
As shown below, in every reconcile operation:
# the Kubernetes Operator creates a new RestClusterClient/RestClient to observe
each FlinkDeployment;
# creating that RestClient in turn creates a new NioEventLoopGroup/PoolThreadCache
(a simplified sketch follows the screenshot below).
!screenshot-1.png|width=672,height=175!
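The following is a simplified, hypothetical sketch of that observe pattern (the class and method names are illustrative, not the operator's actual FlinkService code):

{code:java}
import org.apache.flink.client.program.rest.RestClusterClient;
import org.apache.flink.configuration.Configuration;

public class ObservePerReconcileSketch {

    // Called once per FlinkDeployment per reconcile interval.
    public void observe(Configuration effectiveConfig, String clusterId) throws Exception {
        // New RestClusterClient -> new RestClient -> new NioEventLoopGroup/PoolThreadCache.
        try (RestClusterClient<String> client =
                new RestClusterClient<>(effectiveConfig, clusterId)) {
            client.listJobs().get(); // example REST call used to observe the cluster
        }
        // close() releases the client, but its PoolThreadCache is only reclaimed
        // after the JVM finalizer thread processes it (see root-cause analysis below).
    }
}
{code}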
Root cause analysis:
PoolThreadCache implements the finalize() method.
A PoolThreadCache instance is therefore not released until the JVM finalizer
thread (a single thread) has processed it, but the Kubernetes Operator creates a
new RestClusterClient (and with it a new PoolThreadCache) for every
FlinkDeployment in each reconcile interval.
As the number of FlinkDeployments grows, PoolThreadCache instances are created
faster than the finalizer thread can release them.
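To illustrate the bottleneck, here is a minimal, stand-alone sketch (not operator or Netty code) of why objects that override finalize() can pile up on the heap when they are created faster than the single JVM finalizer thread can process them:

{code:java}
// Minimal demo: finalizable objects are queued for the single finalizer thread.
// If allocation outpaces finalization, instances stay reachable from
// java.lang.ref.Finalizer and survive even full GCs until finalize() has run.
public class FinalizerBacklogDemo {

    static class CacheLike {
        // Stand-in for a resource such as Netty's PoolThreadCache.
        private final byte[] payload = new byte[1024 * 1024];

        @Override
        protected void finalize() throws Throwable {
            try {
                // A slow finalize() throttles the one finalizer thread for
                // *all* finalizable objects in the JVM.
                Thread.sleep(10);
            } finally {
                super.finalize();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        while (true) {
            new CacheLike();   // becomes unreachable immediately...
            Thread.sleep(5);   // ...but is reclaimed only after finalize() runs
        }
    }
}
{code}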
Solution:
One solution is to make RestClient accept a shared
NioEventLoopGroup/PoolThreadCache passed in from the outside.
This has been tested in our environment and it resolves the problem.
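A rough, Netty-only sketch of the idea (illustrative only; the actual fix would require a corresponding constructor or factory in Flink's RestClient, which is not shown here):

{code:java}
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;

public final class SharedEventLoopGroupSketch {

    // Created once for the whole operator process; shut down only on exit.
    private static final EventLoopGroup SHARED_GROUP = new NioEventLoopGroup(4);

    // Each reconcile builds a lightweight Bootstrap on top of the shared group
    // instead of instantiating a fresh NioEventLoopGroup (and PoolThreadCache).
    public static Bootstrap newClientBootstrap() {
        return new Bootstrap()
                .group(SHARED_GROUP)
                .channel(NioSocketChannel.class)
                .handler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // REST pipeline handlers would be added here.
                    }
                });
    }

    public static void shutdown() {
        SHARED_GROUP.shutdownGracefully();
    }
}
{code}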
> Operator Heap Memory Leak
> -------------------------
>
> Key: FLINK-36876
> URL: https://issues.apache.org/jira/browse/FLINK-36876
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1, kubernetes-operator-1.10.0
> Environment:
> Flink Operator Version: 1.6.1 (I think the latest version 1.10 has the same
> problem)
>
> JDK: openjdk version "11.0.24"
>
> GC: G1
>
> FlinkDeployment amount: 3000+
>
> Reporter: chenyuzhi
> Priority: Major
> Attachments: image-2024-12-10-15-57-00-309.png,
> image-2024-12-10-16-09-22-749.png, image-2024-12-10-16-11-21-830.png,
> screenshot-1.png
>
>
> When the number of FlinkDeployments increases, the heap memory used by the
> Kubernetes Operator keeps increasing.
> Even after an Old GC, the used heap memory does not decrease as expected.
> !image-2024-12-10-15-57-00-309.png|width=443,height=541!
> Finally, the Kubernetes Operator was OOMKilled by the OS.