[ 
https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311960#comment-17311960
 ] 

Zhu Zhu commented on FLINK-16069:
---------------------------------

> From what I can see, heartbeat timeout happens because the scheduled 
> heartbeat-sending actions (HeartbeatManagerSenderImpl::run) are not executed 
> in time.
Sampling jstacks while tasks are being deployed shows that the JM main thread 
is not blocked in any particular call; it is processing incoming requests 
(mainly `heartbeatFromTaskManager` and `updateTaskExecutionState`, i.e. tasks 
switching to RUNNING). However, some samples show no active JM main thread at 
all. So I also suspect that messages are not put into the JM actor's mailbox 
in time, or that the mailbox events are not dispatched in time. Note that this 
problem happens during the deploying stage, when all the future-executor 
threads and akka remoting dispatcher threads are busy dealing with 
`submitTask` messages.

> Creation of TaskDeploymentDescriptor can block main thread for long time
> ------------------------------------------------------------------------
>
>                 Key: FLINK-16069
>                 URL: https://issues.apache.org/jira/browse/FLINK-16069
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huweihua
>            Priority: Major
>         Attachments: FLINK-16069-POC-results
>
>
> Deploying tasks takes a long time when we submit a high-parallelism job. 
> Execution#deploy runs in the main thread, so it blocks the JobMaster from 
> processing other akka messages, such as heartbeats. Creating the 
> TaskDeploymentDescriptor takes most of the time. We can move the creation 
> into a future.
> For example, for a job [source(8000)->sink(8000)], moving the 16000 tasks in 
> total from SCHEDULED to DEPLOYING took more than 1 minute. This caused the 
> TaskManager heartbeats to time out, and the job never succeeded.
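
To make the "put the creation in a future" idea from the description concrete, 
here is a minimal sketch in plain Java. It is not Flink's actual code: 
`DeployOffloadSketch`, `Descriptor`, `createDescriptor`, `deployAll` and 
`runDemo` are hypothetical names, and a single-threaded executor only stands 
in for the JobMaster main thread. The expensive step runs on a separate pool, 
and only the cheap hand-back step runs on the "main thread".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class DeployOffloadSketch {

    /** Hypothetical stand-in for an expensive-to-build deployment descriptor. */
    static final class Descriptor {
        final int taskIndex;
        final String builtOnThread;

        Descriptor(int taskIndex, String builtOnThread) {
            this.taskIndex = taskIndex;
            this.builtOnThread = builtOnThread;
        }
    }

    /** Simulates descriptor creation; records the thread that did the work. */
    static Descriptor createDescriptor(int taskIndex) {
        return new Descriptor(taskIndex, Thread.currentThread().getName());
    }

    /**
     * Builds each descriptor on ioPool, then hands it back to mainThread.
     * Returns one "builder->submitter" thread-name pair per task.
     */
    static List<String> deployAll(int taskCount,
                                  ExecutorService ioPool,
                                  ExecutorService mainThread) {
        // Only mutated from the single main thread, read after join().
        List<String> log = new ArrayList<>();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            final int taskIndex = i;
            pending.add(CompletableFuture
                    // Expensive part: off the main thread.
                    .supplyAsync(() -> createDescriptor(taskIndex), ioPool)
                    // Cheap part: back on the main thread.
                    .thenAcceptAsync(
                            d -> log.add(d.builtOnThread + "->"
                                    + Thread.currentThread().getName()),
                            mainThread));
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return log;
    }

    /** Wires up named executors, runs the sketch, and shuts them down. */
    static List<String> runDemo(int taskCount) {
        ExecutorService mainThread =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "main-thread"));
        ExecutorService ioPool =
                Executors.newFixedThreadPool(4, r -> new Thread(r, "io"));
        try {
            return deployAll(taskCount, ioPool, mainThread);
        } finally {
            ioPool.shutdown();
            mainThread.shutdown();
        }
    }
}
```

Calling `DeployOffloadSketch.runDemo(8)` returns eight entries, each showing 
that the descriptor was built on an "io" thread and handed over on 
"main-thread". Between those short hand-back steps the main thread stays free 
for other work, such as heartbeats.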



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
