[
https://issues.apache.org/jira/browse/TEZ-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335661#comment-14335661
]
Rajesh Balamohan commented on TEZ-2001:
---------------------------------------
On a 20-node cluster, Reducer_2 had ~500 tasks. The initial wave of 180+ tasks
would benefit from pipelining, as they were able to pull data earlier while the
previous stage kept spilling. However, the rest of the reducer tasks (i.e.,
those in the second wave) would not benefit from this. This can be tested on a
larger cluster to understand the benefits independently. We also need to change
ShuffleVertexManager so that tasks which have to pull a large amount of data
are scheduled earlier than others. Otherwise there can be a corner case where
the scheduled downstream tasks end up getting empty partitions and do no
useful work. Will file a separate JIRA for it as an enhancement in
ShuffleVertexManager.
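
As a rough illustration of the scheduling idea above, here is a minimal Python
sketch. It is not the ShuffleVertexManager API: the function names, the
per-task size map, and the numbers are all hypothetical. It assumes an
estimate of bytes per downstream partition is available, and it only shows the
ordering: largest expected input first, with currently empty partitions kept
out of the first wave.

```python
from typing import Dict, List


def order_tasks_by_input_size(partition_bytes: Dict[int, int]) -> List[int]:
    """Return task indices sorted by descending expected input size.

    Ties are broken by task index so the order is deterministic.
    """
    return sorted(partition_bytes, key=lambda task: (-partition_bytes[task], task))


def first_wave(partition_bytes: Dict[int, int], slots: int) -> List[int]:
    """Pick the tasks for the first wave, given the number of free slots.

    Assumption of this sketch: tasks whose partition is currently empty are
    left out of the first wave, since they would have nothing to pull yet.
    """
    ordered = order_tasks_by_input_size(partition_bytes)
    non_empty = [task for task in ordered if partition_bytes[task] > 0]
    return non_empty[:slots]


# Hypothetical per-task partition sizes in bytes (illustrative numbers only).
sizes = {0: 0, 1: 512, 2: 4096, 3: 0, 4: 1024}
print(first_wave(sizes, slots=2))
```

With two free slots, the sketch picks tasks 2 and 4, while the tasks with
empty partitions (0 and 3) wait for a later wave.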
> Support pipelined data transfer for ordered output
> --------------------------------------------------
>
> Key: TEZ-2001
> URL: https://issues.apache.org/jira/browse/TEZ-2001
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2001.1.patch, TEZ-2001.2.patch, TEZ-2001.3.patch,
> benchmark_q17_10TB.png, dag_plan.jpg
>
>