[ 
https://issues.apache.org/jira/browse/TEZ-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335661#comment-14335661
 ] 

Rajesh Balamohan commented on TEZ-2001:
---------------------------------------

On 20 nodes, Reducer_2 had ~500 tasks.  The initial wave of 180+ tasks 
benefits from pipelining, since those tasks can pull data earlier while the 
previous stage keeps spilling.  However, the rest of the reducer tasks (i.e. 
the second wave) do not benefit from this.  This can be tested on a larger 
cluster to understand the benefits independently.  ShuffleVertexManager also 
needs changes so that tasks which have to pull a large amount of data are 
scheduled earlier than others.  Otherwise there can be a corner case where the 
scheduled downstream tasks end up getting empty partitions and do no useful 
work.  Will file a separate JIRA for this as an enhancement to 
ShuffleVertexManager.
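
A rough sketch of the ordering idea, for illustration only. This is not 
ShuffleVertexManager code and uses no Tez API. The function name and the 
per-partition byte list are hypothetical, and it assumes the sizes reported so 
far by the upstream stage are available per downstream task.

```python
def schedule_order(partition_bytes):
    """Return downstream task indices, largest reported input first.

    partition_bytes[i] is the number of bytes reported so far for the
    partition that task i will fetch (hypothetical input, not a Tez
    data structure). Tasks whose partitions are still empty end up last
    instead of occupying slots in the first wave.
    """
    indexed = list(enumerate(partition_bytes))
    # sorted() is stable, so tasks with equal sizes keep ascending index order.
    ranked = sorted(indexed, key=lambda item: -item[1])
    return [task for task, _size in ranked]


def first_wave(partition_bytes, slots):
    """Pick the tasks to launch when only `slots` containers are free."""
    return schedule_order(partition_bytes)[:slots]
```

For example, with reported sizes [0, 300, 0, 50, 300] and 3 free slots, 
first_wave picks tasks 1, 4 and 3, and the two tasks with empty partitions 
wait for a later wave.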

> Support pipelined data transfer for ordered output
> --------------------------------------------------
>
>                 Key: TEZ-2001
>                 URL: https://issues.apache.org/jira/browse/TEZ-2001
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2001.1.patch, TEZ-2001.2.patch, TEZ-2001.3.patch, 
> benchmark_q17_10TB.png, dag_plan.jpg
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
