[ 
https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291607#comment-17291607
 ] 

Zhu Zhu edited comment on FLINK-16069 at 2/26/21, 12:14 PM:
------------------------------------------------------------

I built a PoC to verify the idea of caching and reusing ShuffleDescriptors for 
the ALL-to-ALL connection pattern. 
https://github.com/zhuzhurk/flink/commits/FLINK_16069_deployment_perf_improvement_poc3
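
As a rough illustration of the caching idea (this is not the PoC code), the sketch below builds the descriptor list for an all-to-all edge once and hands the same list to every consumer. It assumes each consumer needs one descriptor per producer partition. `DescriptorSketch` and `AllToAllDescriptorCache` are hypothetical names, not Flink classes. Under that assumption, an 8000x8000 edge would create 8000 x 8000 = 64,000,000 descriptor objects without sharing and only 8000 with it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a shuffle descriptor; not Flink's real class.
final class DescriptorSketch {
    final int producerIndex;

    DescriptorSketch(int producerIndex) {
        this.producerIndex = producerIndex;
    }
}

// Hypothetical cache for one all-to-all edge. Intended for single-threaded use
// (for example, only from the main thread), so it has no synchronization.
final class AllToAllDescriptorCache {
    private final int numProducers;
    private List<DescriptorSketch> cached; // built lazily, shared by all consumers
    private int buildCount;                // number of descriptor objects created

    AllToAllDescriptorCache(int numProducers) {
        this.numProducers = numProducers;
    }

    // In an all-to-all edge every consumer reads from the same set of producer
    // partitions, so consumerIndex does not influence the returned list.
    List<DescriptorSketch> descriptorsForConsumer(int consumerIndex) {
        if (cached == null) {
            List<DescriptorSketch> list = new ArrayList<>(numProducers);
            for (int p = 0; p < numProducers; p++) {
                list.add(new DescriptorSketch(p));
                buildCount++;
            }
            cached = Collections.unmodifiableList(list);
        }
        return cached;
    }

    int buildCount() {
        return buildCount;
    }
}
```

Calling `descriptorsForConsumer` for any number of consumers returns the same immutable list, so the per-consumer cost drops from building a full list to a single reference lookup.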

With this change, the time spent deploying tasks in the main thread for an 
8000x8000 job is reduced from ~90s to ~5s.
This relieves the main thread. However, the E2E deployment performance 
does not improve noticeably, 
because the bottleneck now becomes the submission of the TaskDeploymentDescriptors. 
E2E deployment took 1min 40s before the improvement 
and still takes 1min 30s after it (see the key logs in the attached 
{{FLINK-16069-POC-results}}).

In principle, reducing the TDD creation time to unblock the main thread is good 
enough for the scope of this ticket.
However, I also noticed that heartbeat timeout errors can still happen 
occasionally with this improvement, 
so I need some more time to look into why these timeouts happen and how 
to solve them.



> Creation of TaskDeploymentDescriptor can block main thread for long time
> ------------------------------------------------------------------------
>
>                 Key: FLINK-16069
>                 URL: https://issues.apache.org/jira/browse/FLINK-16069
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huweihua
>            Priority: Major
>         Attachments: FLINK-16069-POC-results
>
>
> Deploying tasks takes a long time when we submit a high-parallelism 
> job. Execution#deploy runs in the main thread, so it blocks the JobMaster 
> from processing other Akka messages, such as heartbeats. The creation of the 
> TaskDeploymentDescriptor takes most of the time. We can move the creation into a future.
> For example, for a job [source(8000)->sink(8000)], moving all 16000 tasks from 
> SCHEDULED to DEPLOYING took more than 1 minute. This caused the TaskManager 
> heartbeats to time out, and the job never succeeded.
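
The suggestion in the description to move descriptor creation into a future could look roughly like the sketch below. It uses only the JDK's `CompletableFuture`. `OffloadedDeploymentSketch`, `buildDescriptor`, and `deployAsync` are hypothetical names, and the string result merely stands in for a real descriptor and its submission. The expensive build runs on a separate executor and the follow-up step is handed back to a main-thread executor, so the main thread stays free for other messages in between.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class OffloadedDeploymentSketch {

    // Hypothetical placeholder for an expensive descriptor build.
    static String buildDescriptor(int taskIndex) {
        return "descriptor-" + taskIndex;
    }

    // Builds the descriptor on ioExecutor, then runs the follow-up step
    // on mainThreadExecutor once the descriptor is ready.
    static CompletableFuture<String> deployAsync(
            int taskIndex, Executor ioExecutor, Executor mainThreadExecutor) {
        return CompletableFuture
                .supplyAsync(() -> buildDescriptor(taskIndex), ioExecutor)
                .thenApplyAsync(descriptor -> "submitted " + descriptor, mainThreadExecutor);
    }
}
```

A caller would pass a thread pool as `ioExecutor` and a single-threaded executor as `mainThreadExecutor`; any state that the build step reads must then be safe to access outside the main thread.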



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
