[ https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291607#comment-17291607 ]
Zhu Zhu edited comment on FLINK-16069 at 2/26/21, 12:14 PM:
------------------------------------------------------------

I built a PoC to verify the idea of caching and reusing ShuffleDescriptors for the ALL-to-ALL connection pattern:
https://github.com/zhuzhurk/flink/commits/FLINK_16069_deployment_perf_improvement_poc3

With this change, the time spent deploying tasks in the main thread for an 8000x8000 job drops from ~90s to ~5s, which relieves the main thread.

However, the E2E deployment performance does not improve noticeably, because the bottleneck now becomes the submission of TaskDeploymentDescriptors. It was 1min 40s before the change and is still 1min 30s after it (see the key logs in the attached {{FLINK-16069-POC-results}}).

In theory, reducing the TDD creation time to unblock the main thread is good enough for the scope of this ticket. However, I also noticed that heartbeat timeout errors can sometimes happen with this improvement, so I still need some more time to look into why these timeouts happen and how to solve them.
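To illustrate the caching idea from the comment above, here is a minimal Java sketch. It is not the PoC code and does not use Flink's classes: `Descriptor`, `buildDescriptors`, `deployWithoutCache`, and `deployWithCache` are hypothetical names, and the only assumption taken from the comment is that, in an ALL-to-ALL pattern, every consumer reads the same set of producer partitions, so one descriptor list can be built once and shared.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ShuffleDescriptorCacheSketch {

    // Hypothetical stand-in for a shuffle descriptor; NOT Flink's ShuffleDescriptor.
    static final class Descriptor {
        final int partitionIndex;

        Descriptor(int partitionIndex) {
            this.partitionIndex = partitionIndex;
        }
    }

    // Counts how many Descriptor objects were created in the current run.
    static long created = 0;

    // Builds one descriptor per producer partition.
    static List<Descriptor> buildDescriptors(int producers) {
        List<Descriptor> result = new ArrayList<>(producers);
        for (int i = 0; i < producers; i++) {
            result.add(new Descriptor(i));
            created++;
        }
        return Collections.unmodifiableList(result);
    }

    // Without caching: every consumer rebuilds the full list (producers * consumers objects).
    static long deployWithoutCache(int producers, int consumers) {
        created = 0;
        for (int c = 0; c < consumers; c++) {
            buildDescriptors(producers);
        }
        return created;
    }

    // With caching: the list is built once and every consumer reuses it (producers objects).
    static long deployWithCache(int producers, int consumers) {
        created = 0;
        List<Descriptor> cached = null;
        for (int c = 0; c < consumers; c++) {
            if (cached == null) {
                cached = buildDescriptors(producers);
            }
            // Consumer c would reference the shared 'cached' list here.
        }
        return created;
    }

    public static void main(String[] args) {
        System.out.println("without cache: " + deployWithoutCache(800, 800));
        System.out.println("with cache:    " + deployWithCache(800, 800));
    }
}
```

The sketch only counts object creations; it says nothing about serialization or RPC submission cost, which the comment identifies as the remaining bottleneck.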
> Creation of TaskDeploymentDescriptor can block main thread for long time
> ------------------------------------------------------------------------
>
>                 Key: FLINK-16069
>                 URL: https://issues.apache.org/jira/browse/FLINK-16069
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huweihua
>            Priority: Major
>         Attachments: FLINK-16069-POC-results
>
>
> Deploying tasks takes a long time when we submit a high-parallelism job.
> Because Execution#deploy runs in the main thread, it blocks the JobMaster from
> processing other akka messages, such as heartbeats. Creating the
> TaskDeploymentDescriptor takes most of the time. We could move the creation
> into a future.
> For example, for a job [source(8000)->sink(8000)], moving the 16000 tasks from
> SCHEDULED to DEPLOYING took more than 1 minute. This caused the TaskManager
> heartbeats to time out, and the job never succeeded.
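The suggestion in the issue description to move descriptor creation into a future could look roughly like the following. This is a sketch under stated assumptions, not Flink's implementation: `buildDescriptor`, `deployAsync`, and the two executors are hypothetical, and a plain string stands in for the descriptor. The expensive build runs on a worker executor, and only the cheap follow-up step is handed back to the executor that plays the role of the main thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncDescriptorCreationSketch {

    // Hypothetical stand-in for the expensive descriptor creation.
    static String buildDescriptor(int taskIndex) {
        return "descriptor-" + taskIndex;
    }

    // Builds the descriptor on 'worker', then continues on 'mainThread'.
    static CompletableFuture<String> deployAsync(int taskIndex, Executor worker, Executor mainThread) {
        return CompletableFuture
                .supplyAsync(() -> buildDescriptor(taskIndex), worker)
                .thenApplyAsync(descriptor -> "submitted " + descriptor, mainThread);
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newFixedThreadPool(4);
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            System.out.println(deployAsync(0, worker, mainThread).join());
        } finally {
            worker.shutdown();
            mainThread.shutdown();
        }
    }
}
```

With this shape, the single-threaded executor stays free to process other messages while descriptors are being built elsewhere.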