[jira] [Commented] (FLINK-34726) Flink Kubernetes Operator has some room for optimizing performance.

Gyula Fora (Jira) Tue, 19 Mar 2024 00:44:12 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-34726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828210#comment-17828210
 ]


Gyula Fora commented on FLINK-34726:
------------------------------------

Thanks for the detailed analysis [~Fei Feng]. You are completely right that we 
don't optimise the rest client usage, and that this may add significant 
overhead. We did a similar optimisation in the past for config 
access/generation by using the FlinkResourceContext class. 

We could probably move the rest client generation logic there instead of hiding 
it completely under the FlinkService. However, this will be a bigger change, as 
it will also affect the methods of the FlinkService interface.
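
As a rough illustration of the idea, the sketch below creates a client lazily, 
once per reconciliation, and reuses it for every call made through the context. 
All names here (FakeRestClient, ReconcileContextSketch) are hypothetical 
stand-ins, not the operator's actual classes, and the sketch assumes a context 
is only used by a single reconcile thread.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a REST client that is expensive to create.
final class FakeRestClient implements AutoCloseable {
    // Counts how many clients were constructed, to make reuse observable.
    static final AtomicInteger CREATED = new AtomicInteger();
    private boolean closed;

    FakeRestClient() {
        CREATED.incrementAndGet();
    }

    String getClusterInfo() {
        ensureOpen();
        return "cluster-info";
    }

    String listJobs() {
        ensureOpen();
        return "jobs";
    }

    boolean isClosed() {
        return closed;
    }

    private void ensureOpen() {
        if (closed) {
            throw new IllegalStateException("client already closed");
        }
    }

    @Override
    public void close() {
        closed = true;
    }
}

// Hypothetical per-reconciliation context. Not thread-safe by design:
// it is assumed to be confined to one reconcile thread.
final class ReconcileContextSketch implements AutoCloseable {
    private final Supplier<FakeRestClient> factory;
    private FakeRestClient client; // created on first use

    ReconcileContextSketch(Supplier<FakeRestClient> factory) {
        this.factory = factory;
    }

    // Returns the same client for every call within this context.
    FakeRestClient restClient() {
        if (client == null) {
            client = factory.get();
        }
        return client;
    }

    // Closes the client once, at the end of the reconciliation.
    @Override
    public void close() {
        if (client != null) {
            client.close();
            client = null;
        }
    }
}
```

Wrapping one reconciliation in try-with-resources on such a context would mean 
that getClusterInfo and listJobs share one client, which is closed once at the 
end instead of after every request.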

It sounds a bit strange that getSecondaryResource is so expensive, as that 
should be served from a cache. We should look into why it is expensive in the 
first place, because passing the FlinkDeployment objects around will make the 
code a bit more complicated, but I guess that could also be hidden under the 
FlinkSessionJobContext.
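
A minimal sketch of hiding the lookup behind a context, assuming the secondary 
resource does not need to be re-read within one reconcile loop. 
SessionJobContextSketch and its lookup supplier are hypothetical names used for 
illustration only; the supplier stands in for whatever call fetches the 
FlinkDeployment.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical context that resolves the session cluster at most once
// per reconcile loop and hands out the remembered result afterwards.
final class SessionJobContextSketch<T> {
    private final Supplier<Optional<T>> lookup;
    private boolean resolved;
    private Optional<T> sessionCluster = Optional.empty();

    SessionJobContextSketch(Supplier<Optional<T>> lookup) {
        this.lookup = lookup;
    }

    // The first call performs the lookup; later calls reuse its result,
    // including the "not found" case.
    Optional<T> getSessionCluster() {
        if (!resolved) {
            sessionCluster = lookup.get();
            resolved = true;
        }
        return sessionCluster;
    }
}
```

With this shape, callers keep asking the context for the session cluster 
without passing the FlinkDeployment object around, and the underlying lookup 
runs once per context instance.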

> Flink Kubernetes Operator has some room for optimizing performance.
> -------------------------------------------------------------------
>
>                 Key: FLINK-34726
>                 URL: https://issues.apache.org/jira/browse/FLINK-34726
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.5.0, kubernetes-operator-1.6.0, 
> kubernetes-operator-1.7.0
>            Reporter: Fei Feng
>            Priority: Major
>         Attachments: operator_no_submit_no_kill.flamegraph.html
>
>
> When there is a huge number of FlinkDeployment and FlinkSessionJob resources 
> in a Kubernetes cluster, there is a significant delay between an event being 
> submitted to the reconcile thread pool and the event being processed. 
> This is our test: we gave the operator enough resources (CPU: 10 cores, 
> memory: 20g, reconcile thread pool size: 200) and first deployed 10000 jobs 
> (one FlinkDeployment and one FlinkSessionJob per job), then ran submit/delete 
> job tests. We found that:
> 1. It takes about 2 min from creating a new FlinkDeployment and 
> FlinkSessionJob CR in k8s until the Flink job is submitted to the JobManager.
> 2. It takes about 1 min from deleting a FlinkDeployment and FlinkSessionJob 
> CR until the Flink job and session cluster are cleaned up.
>  
> I used async-profiler to capture a flame graph while a huge number of 
> FlinkDeployment and FlinkSessionJob resources were present. I found two 
> obvious areas for optimization:
> 1. For FlinkDeployment: in the observe step, we call 
> AbstractFlinkService.getClusterInfo/listJobs/getTaskManagerInfo. Every time 
> we call one of these methods, we create a RestClusterClient, send the 
> request, and close the client. I think we should reuse the RestClusterClient 
> as much as possible to avoid frequent object creation and reduce GC pressure.
> 2. For FlinkSessionJob (this issue is more obvious): in the whole reconcile 
> loop, we call getSecondaryResource 5 times to get the FlinkDeployment 
> resource info. Based on my current understanding of the Flink Operator, I 
> think we do not need to call it 5 times in a single reconcile loop; calling 
> it once is enough. If so, we could save most of the roughly 30% CPU usage 
> that these calls account for (every getSecondaryResource call costs 6% CPU).
> [^operator_no_submit_no_kill.flamegraph.html]
> I hope we can discuss solutions to address this problem together. I'm very 
> willing to optimize and resolve this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
