[
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367849#comment-14367849
]
Bikas Saha commented on TEZ-145:
--------------------------------
I know what you are talking about, but let me restate it to check that we are
on the same page.
Combining can happen at multiple levels: task, host, rack, etc.
In theory, doing these combines requires maintaining partition boundaries at
each combining level. However, if the boundaries are maintained by giving each
partition its own task, there is a task explosion
(== level-arity * partition count). Hence an efficient multi-level combine
operation needs to operate on multiple partitions per task at each level, so
that a reasonable number of tasks can process a large number of partitions.
This can be true even for the final reducer. That is partially what happens
with auto-reduce, except that the tasks lose their partition boundaries.
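To make the task-count argument concrete, here is a rough back-of-the-envelope
sketch in Python. It assumes "level-arity" means the number of combining groups
at one level (for example, the number of racks). The function names and the
partitions_per_task knob are hypothetical and are not part of Tez.

```python
import math


def tasks_one_partition_each(level_arity: int, partition_count: int) -> int:
    """Tasks needed at one level if every task owns exactly one partition."""
    return level_arity * partition_count


def tasks_multi_partition(level_arity: int, partition_count: int,
                          partitions_per_task: int) -> int:
    """Tasks needed at one level if each task handles several partitions."""
    # The last task in each group may hold fewer partitions, hence the ceiling.
    return level_arity * math.ceil(partition_count / partitions_per_task)


if __name__ == "__main__":
    # Hypothetical numbers: 20 racks, 1000 partitions, 50 partitions per task.
    print(tasks_one_partition_each(20, 1000))   # 20000
    print(tasks_multi_partition(20, 1000, 50))  # 400
```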
If the processor can find a way to process multiple partitions while keeping
them logically separate, then we could de-link physical tasks from physical
partitioning. If the processor supports that, the edge manager can be set up
to route N output/partition indices to the same task.
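A minimal sketch of that idea, in plain Python rather than the Tez edge manager
API: contiguous ranges of partition indices are routed to one task, and the
task combines values per partition without merging across partitions. The
names task_for_partition, route and combine are hypothetical, and the
sum-by-key combiner is only a stand-in for a real aggregate.

```python
from collections import defaultdict


def task_for_partition(partition: int, num_partitions: int, num_tasks: int) -> int:
    """Map a partition index to a task index using contiguous ranges."""
    if num_tasks <= 0 or num_partitions <= 0:
        raise ValueError("num_tasks and num_partitions must be positive")
    if not 0 <= partition < num_partitions:
        raise ValueError("partition index out of range")
    return partition * num_tasks // num_partitions


def route(records, num_partitions: int, num_tasks: int):
    """Group (partition, key, value) records by task, then by partition."""
    per_task = defaultdict(lambda: defaultdict(list))
    for partition, key, value in records:
        task = task_for_partition(partition, num_partitions, num_tasks)
        per_task[task][partition].append((key, value))
    return per_task


def combine(per_partition):
    """Sum values per key inside each partition; partitions stay separate."""
    result = {}
    for partition, rows in per_partition.items():
        sums = defaultdict(int)
        for key, value in rows:
            sums[key] += value
        result[partition] = dict(sums)
    return result


if __name__ == "__main__":
    records = [(0, "a", 1), (0, "a", 2), (1, "a", 5), (3, "b", 7)]
    routed = route(records, num_partitions=4, num_tasks=2)
    for task in sorted(routed):
        print(task, combine(routed[task]))
```

With 4 partitions and 2 tasks, partitions 0 and 1 land on task 0 and partitions
2 and 3 on task 1. Key "a" appears in both partitions 0 and 1, but its values
are combined separately, so the downstream partition boundaries survive.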
> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
> Key: TEZ-145
> URL: https://issues.apache.org/jira/browse/TEZ-145
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Tsuyoshi Ozawa
> Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit by running in multi-level trees,
> support of being able to run a combiner in a non-local mode would allow
> performance efficiencies to be gained by running a combiner at a rack-level.