[
https://issues.apache.org/jira/browse/TEZ-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169746#comment-14169746
]
Bikas Saha commented on TEZ-1649:
---------------------------------
Can this be a local var? Or is it needed for testing? If so, a @VisibleForTesting
annotation would be good.
{code}+ int bipartiteSources = 0;{code}
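For illustration, a minimal sketch of both options (Guava's @VisibleForTesting is a documentation-only marker; everything here other than the field name is hypothetical):
{code}
import com.google.common.annotations.VisibleForTesting;

class ManagerSketch {
  // Option 1: keep the counter as a field so tests can read it,
  // and mark the intent explicitly.
  @VisibleForTesting
  int bipartiteSources = 0;

  // Option 2: if nothing outside this method reads it, a local is enough.
  int countBipartiteSources(boolean[] isScatterGatherEdge) {
    int count = 0;
    for (boolean sg : isScatterGatherEdge) {
      if (sg) {
        count++;
      }
    }
    return count;
  }
}
{code}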
Can we rename this to something like "sourceVerticesScheduled"? We have our
own scheduling logic which may not permit scheduling of tasks, so
"scheduleTasks" can get confusing.
{code}+ boolean scheduleTasks = false;{code}
These should probably now have "biPartite" or "SG" in the name:
{code}numSourceTasksCompleted and totalNumSourceTasks{code}
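For example (names purely illustrative; the final names are the committer's call):
{code}
// Illustrative renames only, marking that these counters track the
// ScatterGather (bipartite) sources rather than all sources:
int numBipartiteSourceTasksCompleted = 0;
int totalNumBipartiteSourceTasks = 0;
{code}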
The comment should probably be "no tasks scheduled".
For checking that scheduling has not happened, we should probably check that
pendingTasks still has all tasks.
{code}+ Assert.assertEquals(4, manager.pendingTasks.size()); // all tasks scheduled <== this comment
+
+ //1 task of every source vertex needs to be completed. Until then, we defer scheduling
+ manager.onSourceTaskCompleted(mockSrcVertexId2, new Integer(0));
+ verify(mockContext, times(0)).setVertexParallelism(anyInt(), any(VertexLocationHint.class), anyMap(), anyMap());{code}
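Something along these lines (a sketch reusing the mocks from the quoted test; never()/anyList() are plain Mockito, and scheduleVertexTasks is assumed to be the context call that actually hands tasks out):
{code}
// Fix the comment and make "nothing was scheduled" explicit by checking
// that pendingTasks still holds every task and no scheduling call was made.
Assert.assertEquals(4, manager.pendingTasks.size()); // no tasks scheduled yet
verify(mockContext, never()).scheduleVertexTasks(anyList());
{code}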
The min/max = 0 case should still work when there are only SG edges. So maybe
the completed-task check should apply only to the non-SG edges, and once those
are scheduled we continue the existing min/max based scheduling for the SG
edges. Otherwise this could be a perf issue in a resource-rich cluster where
all tasks fit at once and we want to schedule everything immediately.
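Roughly this ordering (a hedged sketch with invented names, not the patch's actual structure):
{code}
// Sketch: gate on the non-SG sources first, then keep the existing
// min/max slow start for the SG sources. With only SG edges and
// minFraction == 0, this still schedules everything immediately.
boolean readyToSchedule(java.util.Collection<Integer> completedPerNonSgSource,
    int numSgSourceTasksCompleted, int totalNumSgSourceTasks,
    float minFraction) {
  for (int completed : completedPerNonSgSource) {
    if (completed == 0) {
      return false; // some non-SG source has not produced anything yet
    }
  }
  return numSgSourceTasksCompleted >=
      (int) (totalNumSgSourceTasks * minFraction);
}
{code}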
{code} Assert.assertTrue(manager.numSourceTasksCompleted == 0);
+ //At least 1 task should be complete in all sources
+ manager.onSourceTaskCompleted(mockSrcVertexId1, new Integer(0));
+ manager.onSourceTaskCompleted(mockSrcVertexId2, new Integer(0));
+ manager.onSourceTaskCompleted(mockSrcVertexId3, new Integer(0));{code}
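i.e. the gate could be as simple as (illustrative sketch; only onSourceTaskCompleted's signature comes from the quoted test, the field and helper are invented):
{code}
// Track which source vertices have reported at least one completion and
// defer scheduling until all of them have.
java.util.Set<String> sourcesWithCompletions = new java.util.HashSet<String>();

void onSourceTaskCompleted(String srcVertexName, Integer taskId) {
  sourcesWithCompletions.add(srcVertexName);
  if (sourcesWithCompletions.size() == totalNumSourceVertices) { // invented field
    schedulePendingTasks(); // invented helper
  }
}
{code}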
A test timeout (say, 5 seconds) should be added here:
{code}+
+ @Test{code}
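i.e., in JUnit 4:
{code}
// Fail after 5 seconds instead of hanging the build if scheduling regresses:
@Test(timeout = 5000)
{code}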
Other than the min/max comment, the rest are minor. Any renames (if needed) can
be done before commit to reduce the patch size for review.
> ShuffleVertexManager auto reduce parallelism can cause jobs to hang indefinitely (with ScatterGather edges)
> -----------------------------------------------------------------------------------------------------------
>
> Key: TEZ-1649
> URL: https://issues.apache.org/jira/browse/TEZ-1649
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1649.1.patch, TEZ-1649.2.patch, TEZ-1649.3.patch, TEZ-1649.png
>
>
> Consider the following DAG
> M1, M2 --> R1
> M2, M3 --> R2
> R1 --> R2
> All edges are Scatter-Gather.
> 1. Set R1's (1000 parallelism) min/max setting to 0.25 - 0.5f
> 2. Set R2's (21 parallelism) min/max setting to 0.2 and 0.3f
> 3. Let M1 send some data from HDFS (test.txt)
> 4. Let M2 (50 parallelism) generate some data and send it to R2
> 5. Let M3 (500 parallelism) generate some data and send it to R2
> - Since R2's min/max can be satisfied by events from M3 alone, R2 will
> change its parallelism more quickly than R1.
> - In the meantime, R1 changes its parallelism from 1000 to 20. This is not
> propagated to R2, so R2 keeps waiting.
> Tested this on a small-scale (20-node) cluster; it happens consistently.
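> For concreteness (rough arithmetic from the numbers above, assuming the
> min fraction gates on completed source tasks vs. the expected total): R2's
> sources total 50 (M2) + 500 (M3) + 1000 (R1) = 1550 tasks, so min 0.2 is
> ~310 completions, which M3's 500 tasks can supply on their own. R2 therefore
> re-plans early while its expected total still counts R1 at 1000 tasks; once
> R1 drops to 20 tasks, the remaining completions R2 waits for never arrive.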