[ https://issues.apache.org/jira/browse/TEZ-1400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092728#comment-14092728 ]

Rajesh Balamohan commented on TEZ-1400:
---------------------------------------

In this case, Hive submitted the plan to Tez, and Hive does not set the min/max 
fractions anywhere.  Somewhere within the Tez code, min/max are being reset to 
"0.0" before landing in ShuffleVertexManager, which causes this issue. 

For example, I checked the min/max values when "Map 7" goes through 
VertexImpl->InitTransition()->transition()->setupVertex(); at that point, the 
min/max values in appContext.getConf() have already been reset to 0.0.  Need to 
check why this is happening.  IMO, appContext.getAMConf()'s values should not 
be modified during vertex setup.  Please correct me if this assumption is wrong.
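
If a shared Configuration object is in fact being mutated during vertex setup, 
a defensive copy would contain the damage.  A minimal sketch of that idea in 
Java (the property key and class names below are illustrative, not the actual 
Tez identifiers):

    import org.apache.hadoop.conf.Configuration;

    public class SharedConfMutation {

      // Illustrative key only; the actual Tez property name may differ.
      private static final String MIN_FRACTION =
          "tez.shuffle-vertex-manager.min-src-fraction";

      public static void main(String[] args) {
        Configuration amConf = new Configuration(false);
        amConf.setFloat(MIN_FRACTION, 0.25f);

        // Problem: per-vertex setup that mutates the shared AM conf leaks
        // the change to every later reader, e.g. ShuffleVertexManager.
        Configuration leaked = amConf;            // same object, no copy
        leaked.setFloat(MIN_FRACTION, 0.0f);
        System.out.println(amConf.getFloat(MIN_FRACTION, 0.25f)); // 0.0

        // Safer: copy-construct a per-vertex conf so mutations stay local.
        amConf.setFloat(MIN_FRACTION, 0.25f);     // restore
        Configuration perVertex = new Configuration(amConf);
        perVertex.setFloat(MIN_FRACTION, 0.0f);
        System.out.println(amConf.getFloat(MIN_FRACTION, 0.25f)); // 0.25
      }
    }

Hadoop's Configuration copy constructor clones the underlying properties, so 
per-vertex overrides would stay local to that vertex.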



> Reducers stuck when enabling auto-reduce parallelism (MRR case)
> ---------------------------------------------------------------
>
>                 Key: TEZ-1400
>                 URL: https://issues.apache.org/jira/browse/TEZ-1400
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: performance
>         Attachments: TEZ-1400.1.patch, dag.dot
>
>
> In the M -> R1 -> R2 case, if R1 is optimized by auto-parallelism, R2 gets 
> stuck waiting for events.
> e.g.
> Map 1: 0/1      Map 2: -/-      Map 5: 0/1      Map 6: 0/1      Map 7: 0/1      Reducer 3: 0/23  Reducer 4: 0/1
> ...
> ...
> Map 1: 1/1      Map 2: 148(+13)/161     Map 5: 1/1      Map 6: 1/1      Map 7: 1/1      Reducer 3: 0(+3)/3      Reducer 4: 0(+1)/1  <== Auto reduce parallelism kicks in
> ..
> Map 1: 1/1      Map 2: 161/161  Map 5: 1/1      Map 6: 1/1      Map 7: 1/1      Reducer 3: 3/3  Reducer 4: 0(+1)/1
> Job is stuck waiting for events in Reducer 4.
>  [fetcher [Reducer_3] #23] org.apache.tez.runtime.library.common.shuffle.impl.ShuffleScheduler: copy(3 of 23 at 0.02 MB/s) <=== Waiting for 20 more partitions, even though Reducer 3 has been optimized to use 3 reducers
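
The hang itself reduces to a counting mismatch: the downstream shuffle keeps 
waiting for the source vertex's original partition count even after 
auto-reduce has shrunk it.  A toy illustration (not actual Tez code) using the 
numbers from the log above:

    // Toy model of the stuck shuffle, not actual Tez code.
    public class StuckShuffleToy {
      public static void main(String[] args) {
        int expectedInputs = 23;  // Reducer 3's parallelism at init time
        int liveProducers  = 3;   // Reducer 3's parallelism after auto-reduce

        int copied = 0;
        // ShuffleScheduler-style wait loop: only 3 completion events ever
        // arrive, so "copy(3 of 23)" can never reach 23 unless the expected
        // count is re-based when the upstream parallelism changes.
        while (copied < expectedInputs) {
          if (copied == liveProducers) {
            System.out.println("stuck: copy(" + copied + " of "
                + expectedInputs + ")");
            break;  // the real job simply hangs here waiting for more events
          }
          copied++;  // one completion event per live producer
        }
      }
    }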



--
This message was sent by Atlassian JIRA
(v6.2#6252)
