[ https://issues.apache.org/jira/browse/FLINK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883969#comment-16883969 ]
Till Rohrmann commented on FLINK-13249:
---------------------------------------
FLINK-13013 introduces a deadlock when setting up the {{SingleInputGate}}. When
we set up the {{SingleInputGate}}, we try to request the partitions. Requesting
the partitions first acquires the {{requestLock}}. Next, while still holding
that lock, we try to create the {{PartitionRequestClient}} in a blocking
fashion. If the corresponding partition cannot be found, we ask the
{{JobMaster}} about the state of the producer via
{{PartitionProducerStateProvider#requestPartitionProducerState}}. The response
to this request arrives in a different thread, which tries to call
{{SingleInputGate#retriggerPartitionRequest}}. This method also requires the
{{SingleInputGate#requestLock}}, which the setup thread still holds, and this
creates the deadlock.
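
The lock pattern described above can be reproduced in isolation. The following is a minimal sketch, not Flink code: the class {{RequestLockDeadlockSketch}}, its methods, and the {{clientReady}} future are hypothetical stand-ins for the gate, the partition request, and the blocking client creation. A timeout replaces the indefinite wait so the example terminates. The second method shows one possible way to avoid the problem (waiting only after the lock is released), which is not necessarily the fix applied in Flink.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for the gate; this is not Flink's SingleInputGate. */
public class RequestLockDeadlockSketch {

    private final Object requestLock = new Object();

    // Completed only by the response path, which itself needs requestLock.
    private final CompletableFuture<String> clientReady = new CompletableFuture<>();

    /** Problematic shape: wait for the response while holding requestLock. */
    public boolean requestPartitionsBlocking(long timeoutMillis) {
        synchronized (requestLock) {
            startResponseThread();
            // The response thread cannot enter retriggerPartitionRequest()
            // because this thread still owns requestLock.
            return awaitClient(timeoutMillis);
        }
    }

    /** Alternative shape: release requestLock before waiting. */
    public boolean requestPartitionsOutsideLock(long timeoutMillis) {
        synchronized (requestLock) {
            startResponseThread();
        }
        // The lock is free here, so the response thread can make progress.
        return awaitClient(timeoutMillis);
    }

    /** Simulates the response arriving in a different thread. */
    private void startResponseThread() {
        Thread responder = new Thread(this::retriggerPartitionRequest, "response-thread");
        responder.setDaemon(true);
        responder.start();
    }

    /** Needs the same lock as the request path, as in the description above. */
    private void retriggerPartitionRequest() {
        synchronized (requestLock) {
            clientReady.complete("client");
        }
    }

    /** Returns true if the future completed in time, false on timeout. */
    private boolean awaitClient(long timeoutMillis) {
        try {
            clientReady.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // without the timeout this wait would never end
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiting inside lock completed: "
                + new RequestLockDeadlockSketch().requestPartitionsBlocking(200));
        System.out.println("waiting outside lock completed: "
                + new RequestLockDeadlockSketch().requestPartitionsOutsideLock(2000));
    }
}
```

In this sketch the first call can only time out, because the thread that would complete the future is blocked on the lock held by the waiting thread. The second call is expected to complete, since the response thread can acquire the lock once the request path has left the synchronized block.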
> Distributed Jepsen test fails with blocked TaskExecutor
> -------------------------------------------------------
>
> Key: FLINK-13249
> URL: https://issues.apache.org/jira/browse/FLINK-13249
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.0
> Reporter: Till Rohrmann
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.9.0
>
>
> The distributed Jepsen test which kills {{JobMasters}} started to fail
> recently. At first glance, it looks as if the {{TaskExecutor}}'s main
> thread is blocked by some operation. Further investigation is required.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)