[jira] [Commented] (FLINK-13249) Distributed Jepsen test fails with blocked TaskExecutor

Till Rohrmann (JIRA) Fri, 12 Jul 2019 09:38:40 -0700


    [ https://issues.apache.org/jira/browse/FLINK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883969#comment-16883969 ]


Till Rohrmann commented on FLINK-13249:
---------------------------------------

FLINK-13013 introduces a deadlock when setting up the {{SingleInputGate}}. When 
we set up the {{SingleInputGate}}, we try to request the partitions. Requesting 
the partitions first acquires the {{requestLock}}. Next, while still holding 
the lock, we try to create the {{PartitionRequestClient}} in a blocking 
fashion. If the corresponding partition cannot be found, we ask the 
{{JobMaster}} about the state of the producer via 
{{PartitionProducerStateProvider#requestPartitionProducerState}}. The response 
to this request arrives in a different thread, which tries to call 
{{SingleInputGate#retriggerPartitionRequest}}. This method also requires the 
{{SingleInputGate#requestLock}}, which creates the deadlock.
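
To make the lock pattern concrete, here is a minimal, self-contained Java 
sketch. It is not Flink code: the class {{RequestLockDeadlockSketch}}, its 
methods, and the single-threaded executor standing in for the thread that 
delivers the {{JobMaster}} response are all hypothetical. It also assumes, as 
a simplification, that the blocking step cannot finish before the callback has 
run. A timeout is used so that the program reports the blocked state instead 
of hanging forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the locking pattern; not the actual SingleInputGate.
class RequestLockDeadlockSketch {

    // Stand-in for SingleInputGate#requestLock.
    private final Object requestLock = new Object();

    // Completed once the callback has managed to enter the lock.
    private final CompletableFuture<Void> retriggered = new CompletableFuture<>();

    // Stand-in for the thread on which the producer-state response arrives.
    private final ExecutorService responseThread = Executors.newSingleThreadExecutor();

    // Models retriggerPartitionRequest: it needs the requestLock.
    void retriggerPartitionRequest() {
        synchronized (requestLock) {
            retriggered.complete(null);
        }
    }

    // Problematic pattern: block on the callback while holding the lock.
    // Returns true if the wait timed out, i.e. the callback was locked out.
    boolean waitWhileHoldingLock(long timeoutMillis) {
        synchronized (requestLock) {
            responseThread.execute(this::retriggerPartitionRequest);
            return awaitCallbackTimesOut(timeoutMillis);
        }
    }

    // Contrast: trigger under the lock, but wait only after releasing it.
    boolean waitOutsideLock(long timeoutMillis) {
        synchronized (requestLock) {
            responseThread.execute(this::retriggerPartitionRequest);
        }
        return awaitCallbackTimesOut(timeoutMillis);
    }

    // Returns true if the callback did not complete within the timeout.
    private boolean awaitCallbackTimesOut(long timeoutMillis) {
        try {
            retriggered.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        responseThread.shutdown();
    }

    public static void main(String[] args) {
        RequestLockDeadlockSketch holding = new RequestLockDeadlockSketch();
        System.out.println("timed out while holding lock: "
                + holding.waitWhileHoldingLock(200));
        holding.shutdown();

        RequestLockDeadlockSketch released = new RequestLockDeadlockSketch();
        System.out.println("timed out after releasing lock: "
                + released.waitOutsideLock(2000));
        released.shutdown();
    }
}
```

In the first call the callback can never enter {{requestLock}} while the 
caller waits inside it, so without the timeout both threads would wait on each 
other indefinitely. The second variant only illustrates the general idea of 
not blocking under the lock; it is not meant as the actual fix for this issue.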

> Distributed Jepsen test fails with blocked TaskExecutor
> -------------------------------------------------------
>
>                 Key: FLINK-13249
>                 URL: https://issues.apache.org/jira/browse/FLINK-13249
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Till Rohrmann
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.9.0
>
>
> The distributed Jepsen test which kills {{JobMasters}} started to fail 
> recently. At first glance, it looks as if the {{TaskExecutor}}'s main 
> thread is blocked by some operation. Further investigation is required.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
