[ https://issues.apache.org/jira/browse/FLINK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884956#comment-16884956 ]
Yun Gao commented on FLINK-13249:
---------------------------------
Hi [~till.rohrmann], I see the cause of this issue a little differently.
I agree that it is caused by a deadlock between _requestSubpartition ->
waitForChannel_ and _retriggerPartitionRequest_, but I think it might not be
caused by [FLINK-13013|https://issues.apache.org/jira/browse/FLINK-13013].
Instead, I think it might be caused by
[FLINK-12530|https://issues.apache.org/jira/browse/FLINK-12530], since that
issue moved the retriggerPartitionRequest call from the Task#executor to the
Netty IO thread. After that change, the Netty IO thread blocks while trying to
acquire the request lock, so it cannot go on to handle the connections of the
other channels, and this causes _waitForChannel_ to wait forever.
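
To illustrate the suspected pattern, here is a minimal standalone sketch in plain Java. It is not Flink code: the class name, the lock, the single-threaded executor standing in for the Netty IO thread, and the future standing in for the channel connection are all assumptions made for the illustration. A timeout replaces the indefinite wait so that the program terminates.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

public class RequestLockDeadlockSketch {

    /** Returns true if the simulated waitForChannel gave up because the IO thread was blocked. */
    public static boolean channelWaitTimesOut(long timeoutMillis) throws Exception {
        ReentrantLock requestLock = new ReentrantLock();                  // stand-in for the request lock
        ExecutorService ioThread = Executors.newSingleThreadExecutor();   // stand-in for the Netty IO thread
        CompletableFuture<String> channelReady = new CompletableFuture<>();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch ioTasksQueued = new CountDownLatch(1);
        CompletableFuture<Boolean> timedOut = new CompletableFuture<>();

        // Task thread: holds the request lock while waiting for the channel
        // (the requestSubpartition -> waitForChannel side).
        Thread taskThread = new Thread(() -> {
            requestLock.lock();
            try {
                lockHeld.countDown();
                ioTasksQueued.await();
                try {
                    channelReady.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    timedOut.complete(false);
                } catch (TimeoutException e) {
                    // Without the timeout this wait would never return.
                    timedOut.complete(true);
                }
            } catch (Exception e) {
                timedOut.completeExceptionally(e);
            } finally {
                requestLock.unlock();
            }
        });
        taskThread.start();
        lockHeld.await();

        // First IO task: the retrigger, which needs the request lock and therefore
        // blocks the only IO thread (the retriggerPartitionRequest side).
        ioThread.submit(() -> {
            requestLock.lock();
            try {
                // the retrigger work would happen here
            } finally {
                requestLock.unlock();
            }
        });
        // Second IO task: the connection event the task thread is waiting for.
        // It is queued behind the blocked retrigger and cannot run.
        ioThread.submit(() -> channelReady.complete("connected"));
        ioTasksQueued.countDown();

        boolean result = timedOut.get();
        taskThread.join();
        ioThread.shutdown();
        ioThread.awaitTermination(5, TimeUnit.SECONDS);
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("waitForChannel timed out: " + channelWaitTimesOut(200));
    }
}
```

In this sketch the wait always times out, because the only thread able to complete the connection is itself waiting for the lock held by the waiting thread. If the retrigger ran on a separate executor instead, the IO thread would stay free to complete the connection, which is how I understand the behavior before FLINK-12530.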
> Distributed Jepsen test fails with blocked TaskExecutor
> -------------------------------------------------------
>
> Key: FLINK-13249
> URL: https://issues.apache.org/jira/browse/FLINK-13249
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.9.0
> Reporter: Till Rohrmann
> Assignee: Stefan Richter
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.9.0
>
> Attachments: jstack_25661_YarnTaskExecutorRunner
>
>
> The distributed Jepsen test which kills {{JobMasters}} started to fail
> recently. At first glance, it looks as if the {{TaskExecutor's}} main
> thread is blocked by some operation. Further investigation is required.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)