[ https://issues.apache.org/jira/browse/FLINK-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091251#comment-17091251 ]
Zhu Zhu commented on FLINK-17330:
---------------------------------
>> Would it work to say that in the first version we don't support pipelined
>> regions which contain a blocking data exchange?
I think it's possible, but it may be hard for users to tell whether their job
has cyclic dependencies. Most users would then have to set all edges to
BLOCKING to be safe and lose the benefit of pipelined region scheduling. So if
we take this approach, I think it's better to do it automatically for users,
i.e. override GlobalDataExchangeMode to ALL_EDGES_BLOCKING if a cyclic
dependency is detected.
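The automatic fallback suggested above can be sketched as follows. This is a minimal Python sketch, not Flink code: the enum only mirrors two constant names of Flink's GlobalDataExchangeMode, and the region dependency map (region id to the regions whose blocking results it consumes) and the helpers `has_cycle` / `effective_exchange_mode` are hypothetical.

```python
from enum import Enum


# Python stand-in for Flink's GlobalDataExchangeMode; only two of its
# constant names are mirrored here, for illustration.
class GlobalDataExchangeMode(Enum):
    ALL_EDGES_BLOCKING = "ALL_EDGES_BLOCKING"
    ALL_EDGES_PIPELINED = "ALL_EDGES_PIPELINED"


_DONE = object()  # sentinel for an exhausted neighbor iterator


def has_cycle(region_deps):
    """Return True if the region dependency graph contains a cycle.

    region_deps maps each region id to the set of region ids whose blocking
    results it consumes (hypothetical representation).
    """
    white, gray, black = 0, 1, 2
    color = {}
    nodes = set(region_deps) | {u for ups in region_deps.values() for u in ups}
    for start in nodes:
        if color.get(start, white) != white:
            continue
        color[start] = gray
        stack = [(start, iter(region_deps.get(start, ())))]
        while stack:
            node, neighbors = stack[-1]
            nxt = next(neighbors, _DONE)
            if nxt is _DONE:
                color[node] = black  # all descendants explored
                stack.pop()
            elif color.get(nxt, white) == gray:
                return True  # back edge: nxt is still on the DFS path
            elif color.get(nxt, white) == white:
                color[nxt] = gray
                stack.append((nxt, iter(region_deps.get(nxt, ()))))
    return False


def effective_exchange_mode(requested, region_deps):
    """Fall back to ALL_EDGES_BLOCKING if the regions built under the
    requested mode would depend on each other cyclically."""
    if has_cycle(region_deps):
        return GlobalDataExchangeMode.ALL_EDGES_BLOCKING
    return requested
```

With ALL_EDGES_BLOCKING every vertex becomes its own region, so no cross-region cycle can remain, at the cost of losing pipelining everywhere.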
>> I think we would have to detect cyclic dependencies between pipelined
>> regions and merge all regions which are part of the cycle into the same
>> pipelined region
This sounds good to me. It is more flexible because it does not rely on
assumptions about how the logical topology maps to the execution topology. The
main concern is the computational complexity, but we can review that later.
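Merging all regions that are part of a cycle amounts to collapsing each strongly connected component of the region dependency graph into one region. A minimal sketch, assuming Kosaraju's SCC algorithm (our choice for illustration; the discussion does not name an algorithm) and hypothetical dict-based inputs; it runs in time linear in the number of regions plus dependency edges.

```python
from collections import defaultdict


def merge_cyclic_regions(regions, region_deps):
    """Merge all regions that lie on a common dependency cycle.

    regions:     region id -> set of execution vertex ids
    region_deps: region id -> set of region ids it consumes blocking results
                 from (every id must also be a key of `regions`)
    Returns a list of vertex sets, one per strongly connected component.
    """
    # Pass 1: iterative DFS recording regions in order of completion.
    visited, finish_order = set(), []
    for root in regions:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(region_deps.get(root, ())))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(region_deps.get(nxt, ()))))
                    break
            else:  # no unvisited neighbor left
                finish_order.append(node)
                stack.pop()

    # Reverse every dependency edge.
    reverse = defaultdict(set)
    for consumer, producers in region_deps.items():
        for producer in producers:
            reverse[producer].add(consumer)

    # Pass 2: walk the reversed graph in decreasing finish time; each walk
    # collects exactly one strongly connected component.
    assigned, merged = set(), []
    for root in reversed(finish_order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            node = stack.pop()
            component |= regions[node]
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        merged.append(component)
    return merged
```

Regions that are not on any cycle come back unchanged, each as its own component.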
I just realized my previous question actually consists of two parts:
1. Should regions with cyclic dependencies be merged into one? This seems to be
a must if we want to avoid resource deadlocks in both initial scheduling and
failure recovery.
2. How should cyclic dependencies be detected? Checking whether there are
intra-region all-to-all blocking edges can be a performance-efficient solution
but is not the only choice, and it also requires special attention to POINTWISE
edges. If we can find a general way to detect cyclic dependencies in O(V^2), I
think that would be even better. This question can be answered later, once we
have taken a deeper look at all the options.
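The cheaper check from point 2 can be sketched on the logical topology: form logical pipelined regions by union-find over pipelined edges, then report BLOCKING ALL_TO_ALL edges whose endpoints fall into the same region. This is a minimal sketch under a hypothetical tuple representation of edges; FORWARD edges are modeled as POINTWISE. As noted above, it says nothing about cycles formed by cross-region POINTWISE blocking edges.

```python
def intra_region_all_to_all_blocking_edges(vertices, edges):
    """Find BLOCKING ALL_TO_ALL edges whose endpoints share a logical
    pipelined region.

    edges: (source, target, result_type, pattern) tuples where result_type is
    "PIPELINED" or "BLOCKING" and pattern is "POINTWISE" or "ALL_TO_ALL"
    (simplified, hypothetical representation).
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Logical pipelined regions = components connected by pipelined edges.
    for source, target, result_type, _ in edges:
        if result_type == "PIPELINED":
            parent[find(source)] = find(target)

    return [
        (source, target, result_type, pattern)
        for source, target, result_type, pattern in edges
        if result_type == "BLOCKING"
        and pattern == "ALL_TO_ALL"
        and find(source) == find(target)
    ]


# The job from the issue description below.
job_edges = [
    ("A", "B", "PIPELINED", "POINTWISE"),  # FORWARD
    ("B", "D", "BLOCKING", "ALL_TO_ALL"),
    ("A", "C", "PIPELINED", "POINTWISE"),  # FORWARD
    ("C", "D", "PIPELINED", "POINTWISE"),  # FORWARD
]
print(intra_region_all_to_all_blocking_edges(["A", "B", "C", "D"], job_edges))
# [('B', 'D', 'BLOCKING', 'ALL_TO_ALL')]
```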
> Avoid scheduling deadlocks caused by cyclic input dependencies between regions
> ------------------------------------------------------------------------------
>
> Key: FLINK-17330
> URL: https://issues.apache.org/jira/browse/FLINK-17330
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Affects Versions: 1.11.0
> Reporter: Zhu Zhu
> Priority: Major
> Fix For: 1.11.0
>
>
> Imagine a job like this:
> A -- (pipelined FORWARD) --> B -- (blocking ALL-to-ALL) --> D
> A -- (pipelined FORWARD) --> C -- (pipelined FORWARD) --> D
> parallelism=2 for all vertices.
> We will have 2 execution pipelined regions:
> R1 = {A1, B1, C1, D1}
> R2 = {A2, B2, C2, D2}
> R1 has a cross-region input edge (B2->D1).
> R2 has a cross-region input edge (B1->D2).
> A scheduling deadlock will happen because we schedule a region only when all
> its inputs are consumable (i.e. its blocking input partitions are finished):
> R1 can be scheduled only after R2 finishes, while R2 can be scheduled only
> after R1 finishes.
> To avoid this, one solution is to force a logical pipelined region with
> intra-region ALL-to-ALL blocking edges to form a single execution pipelined
> region, so that there is no cyclic input dependency between regions.
> Besides that, we should also take care to avoid cyclic cross-region
> POINTWISE blocking edges.
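The deadlock in the issue's example can be reproduced with a small model: expand the logical job into execution vertices, group them into pipelined regions, collect each region's cross-region blocking inputs, and apply the rule that a region is schedulable only once all of them have finished. This is a minimal sketch, not Flink's scheduler; it assumes all vertices share one parallelism (so a POINTWISE edge connects subtask i to subtask i, like FORWARD), and the helpers `execution_region_deps` and `schedulable` are hypothetical.

```python
from collections import defaultdict


def execution_region_deps(parallelism, edges):
    """Return {region: set of upstream regions whose blocking results it
    consumes}, where each region is a frozenset of execution vertex names.

    Assumes every vertex has the same parallelism; subtask i of vertex "A"
    is named "A<i>" (e.g. "A1").
    """
    exec_edges = []
    for source, target, result_type, pattern in edges:
        for i in range(1, parallelism + 1):
            for j in range(1, parallelism + 1):
                if pattern == "ALL_TO_ALL" or i == j:
                    exec_edges.append((f"{source}{i}", f"{target}{j}", result_type))

    vertices = {v for s, t, _ in exec_edges for v in (s, t)}
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Execution pipelined regions = components joined by pipelined edges.
    for s, t, result_type in exec_edges:
        if result_type == "PIPELINED":
            parent[find(s)] = find(t)

    members = defaultdict(set)
    for v in vertices:
        members[find(v)].add(v)
    region_of = {v: frozenset(members[find(v)]) for v in vertices}

    deps = {region: set() for region in region_of.values()}
    for s, t, result_type in exec_edges:
        if result_type == "BLOCKING" and region_of[s] != region_of[t]:
            deps[region_of[t]].add(region_of[s])
    return deps


def schedulable(deps, finished):
    """Regions not yet finished whose blocking inputs have all finished."""
    return {r for r, ups in deps.items() if r not in finished and ups <= finished}


job_edges = [
    ("A", "B", "PIPELINED", "POINTWISE"),
    ("B", "D", "BLOCKING", "ALL_TO_ALL"),
    ("A", "C", "PIPELINED", "POINTWISE"),
    ("C", "D", "PIPELINED", "POINTWISE"),
]
deps = execution_region_deps(2, job_edges)
print(len(deps))                 # 2 regions: {A1, B1, C1, D1} and {A2, B2, C2, D2}
print(schedulable(deps, set()))  # set(): neither region can ever start
```

Each region lists the other as an upstream dependency (B2->D1 and B1->D2), so the schedulable set is empty from the start.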
--
This message was sent by Atlassian Jira
(v8.3.4#803005)