[
https://issues.apache.org/jira/browse/TEZ-1400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095816#comment-14095816
]
Bikas Saha commented on TEZ-1400:
---------------------------------
Can you confirm that ShuffleVertexManager is being explicitly enabled for
certain (or all) vertices by calling vertex.setVertexManager() and providing
a payload that sets TEZ_AM_SHUFFLE_VERTEX_MANAGER_ENABLE_AUTO_PARALLEL to
true?
This should not be turned on via the main job configuration, because it would
then be inadvertently turned on for vertices that should not change their
parallelism.
If it is being enabled explicitly via setVertexManager() with a payload, then
that path is where the bug should be. If it's not being explicitly turned on
via setVertexManager(), then that should change.
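To illustrate the difference between the two ways of enabling the flag, here
is a minimal sketch. It is a toy model, not the Tez API: the Vertex class,
set_vertex_manager(), the two lookup functions and the string used as the
config key are all hypothetical stand-ins, and R1/R2 are just the vertex names
from the issue description.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Stand-in key; the real property string in Tez is not assumed here.
AUTO_PARALLEL = "TEZ_AM_SHUFFLE_VERTEX_MANAGER_ENABLE_AUTO_PARALLEL"


@dataclass
class Vertex:
    """Hypothetical vertex that may carry its own vertex-manager payload."""
    name: str
    manager_payload: Optional[Dict[str, bool]] = None  # None: nothing set explicitly

    def set_vertex_manager(self, payload: Dict[str, bool]) -> None:
        # Copy so later changes to the caller's dict do not leak into the vertex.
        self.manager_payload = dict(payload)


def auto_parallel_from_job_conf(vertex: Vertex, job_conf: Dict[str, bool]) -> bool:
    """Global switch: every vertex sees the same value, wanted or not."""
    return job_conf.get(AUTO_PARALLEL, False)


def auto_parallel_from_payload(vertex: Vertex) -> bool:
    """Per-vertex switch: only vertices with an explicit payload opt in."""
    if vertex.manager_payload is None:
        return False
    return vertex.manager_payload.get(AUTO_PARALLEL, False)


if __name__ == "__main__":
    r1, r2 = Vertex("R1"), Vertex("R2")

    # Enabled in the main job configuration: applies to both vertices.
    job_conf = {AUTO_PARALLEL: True}
    print([auto_parallel_from_job_conf(v, job_conf) for v in (r1, r2)])

    # Enabled explicitly on R1 only: R2 keeps its parallelism.
    r1.set_vertex_manager({AUTO_PARALLEL: True})
    print([auto_parallel_from_payload(v) for v in (r1, r2)])
```

In this model the job-level flag cannot distinguish R1 from R2, while the
payload-based lookup only affects the vertex that was configured explicitly.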
One other thing you could try is to create a formal payload object for this
manager, with a configurer that can set up all its parameters. By default it
could pick up params from the client-side tez-site.xml. Also, remove the code
that creates a payload from the AM conf when no payload is supplied, so that
the payload becomes required.
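One possible shape for such a payload object and configurer is sketched below.
This is only an illustration under stated assumptions: ShuffleManagerPayload,
ShuffleManagerConfigurer, create_manager() and the key string are hypothetical
names, and the client-side tez-site.xml is modelled as a plain dict of strings.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Stand-in key; the real property string in Tez is not assumed here.
AUTO_PARALLEL_KEY = "TEZ_AM_SHUFFLE_VERTEX_MANAGER_ENABLE_AUTO_PARALLEL"


@dataclass(frozen=True)
class ShuffleManagerPayload:
    """Hypothetical typed payload handed to the vertex manager."""
    enable_auto_parallel: bool


class ShuffleManagerConfigurer:
    """Hypothetical builder; defaults come from the client-side configuration."""

    def __init__(self, client_conf: Optional[Mapping[str, str]] = None) -> None:
        conf = client_conf or {}
        # Config values are strings, as they would be in an XML site file.
        self._enable = conf.get(AUTO_PARALLEL_KEY, "false").lower() == "true"

    def set_auto_parallel(self, enabled: bool) -> "ShuffleManagerConfigurer":
        # An explicit setting overrides the client-side default.
        self._enable = enabled
        return self

    def build(self) -> ShuffleManagerPayload:
        return ShuffleManagerPayload(enable_auto_parallel=self._enable)


def create_manager(payload: Optional[ShuffleManagerPayload]) -> Dict[str, bool]:
    """The payload is required: there is no fallback to an AM-side configuration."""
    if payload is None:
        raise ValueError("ShuffleVertexManager payload is required")
    return {"auto_parallel": payload.enable_auto_parallel}


if __name__ == "__main__":
    client_conf = {AUTO_PARALLEL_KEY: "true"}  # stands in for tez-site.xml
    payload = ShuffleManagerConfigurer(client_conf).build()
    print(create_manager(payload))
```

Making the missing-payload case an error means a vertex can only get
auto-parallelism through a payload that was set on purpose.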
> Reducers stuck when enabling auto-reduce parallelism (MRR case)
> ---------------------------------------------------------------
>
> Key: TEZ-1400
> URL: https://issues.apache.org/jira/browse/TEZ-1400
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Labels: performance
> Attachments: TEZ-1400.1.patch, dag.dot
>
>
> In M -> R1 -> R2 case, if R1 is optimized by auto-parallelism R2 gets stuck
> waiting for events.
> e.g
> Map 1: 0/1 Map 2: -/- Map 5: 0/1 Map 6: 0/1 Map 7: 0/1
> Reducer 3: 0/23 Reducer 4: 0/1
> ...
> ...
> Map 1: 1/1 Map 2: 148(+13)/161 Map 5: 1/1 Map 6: 1/1 Map 7: 1/1
> Reducer 3: 0(+3)/3 Reducer 4: 0(+1)/1 <== Auto reduce parallelism kicks in
> ..
> Map 1: 1/1 Map 2: 161/161 Map 5: 1/1 Map 6: 1/1 Map 7: 1/1
> Reducer 3: 3/3 Reducer 4: 0(+1)/1
> Job is stuck waiting for events in Reducer 4.
> [fetcher [Reducer_3] #23]
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleScheduler:
> copy(3 of 23 at 0.02 MB/s)
> <=== Waiting for 20 more partitions, even though Reducer3 has been
> optimized to use 3 reducers