[ https://issues.apache.org/jira/browse/TEZ-1400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095101#comment-14095101 ]

Rajesh Balamohan commented on TEZ-1400:
---------------------------------------

Yes, tried this as part of debugging yesterday.  However, 
appContext.getAMConf().get("tez.am.shuffle-vertex-manager.min-src-fraction") 
returns different results when the job is executed: sometimes it returns 
null and sometimes "0.0".  Returning null is perfectly fine, as 
ShuffleVertexManager would default the min to 0.25f.  The issue is the 
"0.0" that turns up when Hive is not setting this property at all.
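
To illustrate why null is harmless but "0.0" is not, here is a minimal 
standalone sketch of the lookup pattern, assuming Hadoop's Configuration 
is queried directly and using the same 0.25f default that 
ShuffleVertexManager falls back to (the class name is illustrative):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class MinSrcFractionCheck {
  static final String KEY = "tez.am.shuffle-vertex-manager.min-src-fraction";

  public static void main(String[] args) {
    Configuration conf = new Configuration(false);

    // Key absent: get() returns null, so getFloat() falls back to the
    // 0.25f default that ShuffleVertexManager would apply.
    System.out.println(conf.getFloat(KEY, 0.25f)); // prints 0.25

    // Key present as "0.0": the default is ignored and slow-start
    // effectively waits for zero completed source tasks.
    conf.set(KEY, "0.0");
    System.out.println(conf.getFloat(KEY, 0.25f)); // prints 0.0
  }
}
{code}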

@sseth - Yes, Hive sets up auto parallelism for "Reducer 3".  The issue is 
that I was also able to see "0.0" from amConf for other vertices (e.g. 
Map 2 or Map 5), and it happens randomly.
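
For tracking down which vertices see the spurious value, a hypothetical 
per-vertex dump of the raw string (the class and method here are 
illustrative, not Tez source; "amConf" stands in for the Configuration 
returned by appContext.getAMConf()):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class PerVertexConfDump {
  // Prints null when the key is unset, or the spurious "0.0" when it
  // leaks into the AM configuration seen by a given vertex.
  public static void dump(Configuration amConf, String vertexName) {
    String raw = amConf.get("tez.am.shuffle-vertex-manager.min-src-fraction");
    System.out.println(vertexName + " -> min-src-fraction=" + raw);
  }
}
{code}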


> Reducers stuck when enabling auto-reduce parallelism (MRR case)
> ---------------------------------------------------------------
>
>                 Key: TEZ-1400
>                 URL: https://issues.apache.org/jira/browse/TEZ-1400
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: performance
>         Attachments: TEZ-1400.1.patch, dag.dot
>
>
> In the M -> R1 -> R2 case, if R1 is optimized by auto-parallelism, R2 gets 
> stuck waiting for events.
> e.g.
> Map 1: 0/1      Map 2: -/-      Map 5: 0/1      Map 6: 0/1      Map 7: 0/1      Reducer 3: 0/23 Reducer 4: 0/1
> ...
> ...
> Map 1: 1/1      Map 2: 148(+13)/161     Map 5: 1/1      Map 6: 1/1      Map 7: 1/1      Reducer 3: 0(+3)/3      Reducer 4: 0(+1)/1  <== Auto reduce parallelism kicks in
> ..
> Map 1: 1/1      Map 2: 161/161  Map 5: 1/1      Map 6: 1/1      Map 7: 1/1      Reducer 3: 3/3  Reducer 4: 0(+1)/1
> Job is stuck waiting for events in Reducer 4.
> [fetcher [Reducer_3] #23] org.apache.tez.runtime.library.common.shuffle.impl.ShuffleScheduler: copy(3 of 23 at 0.02 MB/s) <=== Waiting for 20 more partitions, even though Reducer 3 has been optimized to use 3 reducers
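
As a plain-arithmetic illustration of the log line above (values taken 
from the report): the shuffle still expects Reducer 3's original 
partition count, so after auto-reduce only 3 of the 23 expected 
partitions can ever arrive.

{code:java}
// Illustration of the stuck state; values come from the log above.
public class StuckShuffleExample {
  public static void main(String[] args) {
    int expectedPartitions = 23; // Reducer 3's pre-optimization parallelism
    int completedCopies = 3;     // tasks that actually ran after auto-reduce
    // The remaining 20 partitions can never arrive, so the fetcher
    // waits forever.
    System.out.println("Waiting for "
        + (expectedPartitions - completedCopies) + " more partitions");
  }
}
{code}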



