Fei Feng created FLINK-34726:
--------------------------------

             Summary: Flink Kubernetes Operator has room for performance 
optimization.
                 Key: FLINK-34726
                 URL: https://issues.apache.org/jira/browse/FLINK-34726
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.7.0, kubernetes-operator-1.6.0, 
kubernetes-operator-1.5.0
            Reporter: Fei Feng
         Attachments: operator_no_submit_no_kill.flamegraph.html

When there is a huge number of FlinkDeployments and FlinkSessionJobs in a 
Kubernetes cluster, there is a significant delay between an event being 
submitted to the reconcile thread pool and that event being processed.

This is our test: we gave the operator ample resources (CPU: 10 cores, memory: 
20 GB, reconcile thread pool size: 200) and first deployed 10000 jobs (one 
FlinkDeployment and one FlinkSessionJob per job), then ran submit/delete tests. 
We found that:
1. It took about 2 minutes between creating a new FlinkDeployment and 
FlinkSessionJob CR in Kubernetes and the Flink job being submitted to the 
JobManager.
2. It took about 1 minute between deleting a FlinkDeployment and FlinkSessionJob 
CR and the Flink job and session cluster being cleaned up.

 

I used async-profiler to capture a flame graph while there was a huge number of 
FlinkDeployments and FlinkSessionJobs. I found two obvious areas for 
optimization:

1. For FlinkDeployment: in the observe step, we call 
AbstractFlinkService.getClusterInfo/listJobs/getTaskManagerInfo. Every time we 
call one of these methods we create a RestClusterClient, send the requests, and 
close it. I think we should reuse the RestClusterClient as much as possible to 
avoid frequently creating objects and to reduce GC pressure.

2. For FlinkSessionJob (this issue is more obvious): in the whole reconcile 
loop, we call getSecondaryResource 5 times to get the FlinkDeployment resource 
info. Based on my current understanding of the Flink Operator, I think we do 
not need to call it 5 times in a single reconcile loop; calling it once is 
enough. If so, we could save roughly 24% CPU usage (each getSecondaryResource 
call costs about 6% CPU, and four of the five calls would be eliminated).
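A minimal sketch of the fetch-once idea: resolve the secondary resource a single time at the top of the reconcile loop and pass it to each step, instead of re-querying in every step. The `getSecondaryResource` method here is a hypothetical stand-in for the operator SDK's context lookup, with a counter to show the effect.

```java
import java.util.Optional;

public class Main {
    static int lookups = 0;

    // Hypothetical stand-in for context.getSecondaryResource(FlinkDeployment.class),
    // which does non-trivial work on every call.
    static Optional<String> getSecondaryResource() {
        lookups++;
        return Optional.of("flink-deployment");
    }

    public static void main(String[] args) {
        // Fetch once per reconcile loop...
        Optional<String> deployment = getSecondaryResource();

        // ...then reuse the cached result in each of the five steps
        // (observe, validate, reconcile, status update, cleanup, etc.).
        for (int step = 0; step < 5; step++) {
            deployment.ifPresent(d -> {
                // each step uses the already-resolved FlinkDeployment info
            });
        }
        System.out.println(lookups); // prints 1: one lookup instead of five
    }
}
```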

[^operator_no_submit_no_kill.flamegraph.html]

I hope we can discuss solutions to this problem together. I'm very willing to 
work on optimizing and resolving this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
