[ 
https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038976#comment-17038976
 ] 

Till Rohrmann commented on FLINK-16069:
---------------------------------------

Hi [~huwh], do you know what exactly is taking so long? Is it the creation of the 
{{TaskDeploymentDescriptors}}? If yes, is it the iteration over the input edges?

I think it is not as easy as moving the {{TaskDeploymentDescriptor}} creation 
into a future because we are accessing the {{ExecutionGraph}} through the 
passed result partitions. This means that in case of a concurrent recovery we 
might have a race condition where we read state from an already reset 
{{Execution}}, for example.
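
To illustrate the concern, here is a minimal sketch that uses hypothetical names rather than Flink's actual classes. {{FakeExecution}}, {{PartitionSnapshot}} and {{describe}} are assumptions made up for this example. It copies the needed state while still on the main thread and only then builds the descriptor in a future, so a later reset cannot be observed half-way by the background thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SnapshotThenBuild {

    /** Hypothetical stand-in for mutable per-attempt state; not a Flink class. */
    static final class FakeExecution {
        private int attemptNumber = 0;
        private String location = "tm-1";

        // Mimics a reset during recovery: new attempt, location cleared.
        void resetForNewAttempt() {
            attemptNumber++;
            location = null;
        }

        // Intended to be called from the main thread only, so the copy is consistent.
        PartitionSnapshot snapshot() {
            return new PartitionSnapshot(attemptNumber, location);
        }
    }

    /** Immutable copy of the state needed to build the descriptor. */
    static final class PartitionSnapshot {
        final int attemptNumber;
        final String location;

        PartitionSnapshot(int attemptNumber, String location) {
            this.attemptNumber = attemptNumber;
            this.location = location;
        }
    }

    // Stand-in for the expensive descriptor creation; reads only the snapshot.
    static String describe(PartitionSnapshot snapshot) {
        return "attempt=" + snapshot.attemptNumber + ", location=" + snapshot.location;
    }

    static String run() {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            FakeExecution execution = new FakeExecution();

            // Step 1 (main thread): copy the state before handing work off.
            PartitionSnapshot snapshot = execution.snapshot();

            // Step 2 (background thread): build the descriptor from the copy.
            CompletableFuture<String> descriptor =
                    CompletableFuture.supplyAsync(() -> describe(snapshot), ioExecutor);

            // A concurrent reset no longer affects what the future reads.
            execution.resetForNewAttempt();

            return descriptor.join();
        } finally {
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "attempt=0, location=tm-1"
    }
}
```

Note that the result may describe an attempt that has been reset in the meantime, so in this sketch the caller would still have to compare the attempt number back on the main thread before deploying.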

> Creation of TaskDeploymentDescriptor can block main thread for long time
> ------------------------------------------------------------------------
>
>                 Key: FLINK-16069
>                 URL: https://issues.apache.org/jira/browse/FLINK-16069
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huweihua
>            Priority: Major
>
> Deploying tasks takes a long time when we submit a high-parallelism 
> job. Execution#deploy runs in the main thread, so it blocks the JobMaster 
> from processing other akka messages, such as heartbeats. The creation of the 
> TaskDeploymentDescriptor takes most of the time. We can put the creation in a future.
> For example, for a job [source(8000)->sink(8000)], moving all 16000 tasks from 
> SCHEDULED to DEPLOYING took more than 1 minute. This caused the TaskManager 
> heartbeat to time out, and the job never succeeded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
