[
https://issues.apache.org/jira/browse/TEZ-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170491#comment-14170491
]
Rajesh Balamohan commented on TEZ-1649:
---------------------------------------
>>>
The min max = 0 case should still work when there are only SG edges. So maybe
the completed task checking should be for the non-SG edges
>>>
[~bikassaha] - Even when we have only SG edges, we need to check that at least
1 task has completed on each of the edges. Otherwise, we might have a scenario
wherein we get events from SG1 and change the downstream vertex's parallelism.
At a later point in time, SG2 could change its parallelism, causing an issue.
IMHO, it would be safe to check that 1 task per edge has completed, to avoid
the DAG getting into a hung state.
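
As a rough illustration of the check proposed above, here is a minimal Python
sketch. It is not the Tez implementation or the attached patch; the function
name `can_reduce_parallelism` and the dictionary layout are hypothetical and
only model the "at least 1 completed task per source edge" condition.

```python
def can_reduce_parallelism(completed_per_edge):
    """Return True only if every source edge has at least one completed task.

    completed_per_edge is a hypothetical mapping from a source vertex name
    (one entry per incoming edge) to the number of its completed tasks.
    An empty mapping is treated as "not ready".
    """
    return bool(completed_per_edge) and all(
        count >= 1 for count in completed_per_edge.values()
    )


# Events have arrived from SG1 only: hold off on changing parallelism.
ready_sg1_only = can_reduce_parallelism({"SG1": 3, "SG2": 0})

# Both edges have reported at least one completed task: safe to decide.
ready_both = can_reduce_parallelism({"SG1": 3, "SG2": 1})
```

With this guard, the downstream vertex would not act on events from SG1 alone,
so a later parallelism change in SG2 cannot invalidate a decision that was
already made.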
> ShuffleVertexManager auto reduce parallelism can cause jobs to hang
> indefinitely (with ScatterGather edges)
> -----------------------------------------------------------------------------------------------------------
>
> Key: TEZ-1649
> URL: https://issues.apache.org/jira/browse/TEZ-1649
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1649.1.patch, TEZ-1649.2.patch, TEZ-1649.3.patch,
> TEZ-1649.png
>
>
> Consider the following DAG
> M1, M2 --> R1
> M2, M3 --> R2
> R1 --> R2
> All edges are Scatter-Gather.
> 1. Set R1's (1000 parallelism) min/max setting to 0.25 - 0.5f
> 2. Set R2's (21 parallelism) min/max setting to 0.2 and 0.3f
> 3. Let M1 send some data from HDFS (test.txt)
> 4. Let M2 (50 parallelism) generate some data and send it to R2
> 5. Let M3 (500 parallelism) generate some data and send it to R2
> - Since R2's min/max can get satisfied by getting events from M3 itself, R2
> will change its parallelism more quickly than R1.
> - In the meantime, R1 changes its parallelism from 1000 to 20. This is not
> propagated to R2, and it would keep waiting.
> Tested this on a small scale (20 node) cluster and it happens consistently.
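
The arithmetic behind "R2's min/max can get satisfied by getting events from
M3 itself" can be sketched as follows. This assumes the completed-source
fraction is pooled over all of R2's source tasks (M2, M3 and R1 together),
which is an assumption about how the thresholds are evaluated rather than
something stated in the issue; the names below are illustrative only.

```python
# Task counts of R2's source vertices, taken from the scenario above.
SOURCE_TASKS = {"M2": 50, "M3": 500, "R1": 1000}

# R2's min/max settings from step 2.
MIN_FRACTION, MAX_FRACTION = 0.2, 0.3


def completed_fraction(completed):
    """Fraction of all source tasks that have completed (pooled assumption)."""
    total = sum(SOURCE_TASKS.values())
    done = sum(completed.get(name, 0) for name in SOURCE_TASKS)
    return done / total


# Only M3 has finished; M2 and R1 have not completed any task yet.
only_m3 = completed_fraction({"M3": 500})
m3_alone_exceeds_max = only_m3 > MAX_FRACTION
```

Under that assumption, 500 of 1550 source tasks is already above the 0.3
maximum, so R2 can settle its parallelism before R1 has completed a single
task.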