[ 
https://issues.apache.org/jira/browse/FLINK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884956#comment-16884956
 ] 

Yun Gao commented on FLINK-13249:
---------------------------------

Hi [~till.rohrmann], I see the cause of this issue a little differently. 
I agree that it is caused by the deadlock between _requestSubpartition -> 
waitForChannel_ and _retriggerPartitionRequest_, but I think it might not be 
caused by [FLINK-13013|https://issues.apache.org/jira/browse/FLINK-13013]. 
Instead, I think it might be caused by 
[FLINK-12530|https://issues.apache.org/jira/browse/FLINK-12530], since that 
issue moved retriggerPartitionRequest from the Task#executor to the Netty 
IO thread. After that change, the Netty IO thread blocks when trying 
to acquire the request lock, so it cannot proceed to handle the connections of 
the other channels, and this causes _waitForChannel_ to wait forever.
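
To illustrate the suspected interaction, here is a minimal standalone sketch. 
It is not Flink code: the class, method, and variable names are hypothetical, 
a single-threaded executor stands in for the Netty IO thread, and a plain 
ReentrantLock stands in for the request lock. A timeout replaces the 
"wait forever" so that the program terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the suspected deadlock; none of these names exist in Flink.
public class IoThreadDeadlockSketch {

    /**
     * Returns true if the task thread saw its channel become ready before the timeout.
     * retriggerOnIoThread selects which thread runs the simulated retrigger.
     */
    static boolean channelConnected(boolean retriggerOnIoThread, long timeoutMillis)
            throws InterruptedException {
        ReentrantLock requestLock = new ReentrantLock();
        // Stand-in for the Netty IO thread: one thread, tasks run in FIFO order.
        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        // Stand-in for a separate executor such as Task#executor.
        ExecutorService taskExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch channelReady = new CountDownLatch(1);
        boolean[] connected = new boolean[1];

        // Models requestSubpartition -> waitForChannel: hold the lock and wait
        // for the IO thread to finish establishing the connection.
        Thread taskThread = new Thread(() -> {
            requestLock.lock();
            try {
                lockHeld.countDown();
                connected[0] = channelReady.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                requestLock.unlock();
            }
        });
        taskThread.start();
        lockHeld.await(); // make sure the lock is taken before the retrigger runs

        // Models retriggerPartitionRequest: it needs the same lock.
        Runnable retrigger = () -> {
            requestLock.lock();
            requestLock.unlock();
        };
        (retriggerOnIoThread ? ioThread : taskExecutor).execute(retrigger);

        // Models the connection completing; this event is always handled by the IO thread.
        ioThread.execute(channelReady::countDown);

        taskThread.join();
        ioThread.shutdown();
        taskExecutor.shutdown();
        return connected[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("retrigger on IO thread:  connected = " + channelConnected(true, 500));
        System.out.println("retrigger on other pool: connected = " + channelConnected(false, 500));
    }
}
```

In this model, when the retrigger runs on the IO thread it blocks on the lock, 
the connection event queued behind it is never processed in time, and the 
waiting thread times out. When the retrigger runs on a separate executor, the 
IO thread stays free to complete the connection and the wait returns.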

> Distributed Jepsen test fails with blocked TaskExecutor
> -------------------------------------------------------
>
>                 Key: FLINK-13249
>                 URL: https://issues.apache.org/jira/browse/FLINK-13249
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.0
>            Reporter: Till Rohrmann
>            Assignee: Stefan Richter
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.9.0
>
>         Attachments: jstack_25661_YarnTaskExecutorRunner
>
>
> The distributed Jepsen test which kills {{JobMasters}} started to fail 
> recently. From a first glance, it looks as if the {{TaskExecutor's}} main 
> thread is blocked by some operation. Further investigation is required.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
