[
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491620#comment-14491620
]
Bikas Saha commented on TEZ-145:
--------------------------------
Right, like I said in a previous comment, the transducer needs to maintain
partition boundaries while doing its work for this to be useful.
This would need a single vertex with its vertex manager (to do the rack-aware
grouping) and a single EdgeManager that does the custom routing from grouped
maps to their transducer. This would be a fairly asymmetric edge because of
arbitrary groupings.
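
To make the grouping and routing idea concrete, here is a minimal sketch in plain Java. It does not use the Tez VertexManager or EdgeManager APIs; the class name, method names, and the rack-per-task input list are all hypothetical, and it assumes one transducer task per rack. It only shows the routing table that such an edge would need: each map task index is mapped to the transducer task that serves its rack, so racks with different numbers of maps produce uneven fan-in.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Tez.
final class RackGroupingSketch {

    // Groups map task indices by rack name, keeping racks in first-seen order.
    static Map<String, List<Integer>> groupByRack(List<String> rackOfTask) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int task = 0; task < rackOfTask.size(); task++) {
            groups.computeIfAbsent(rackOfTask.get(task), r -> new ArrayList<>()).add(task);
        }
        return groups;
    }

    // Returns dest[i] = index of the transducer task that map task i sends to.
    // Assumption: exactly one transducer task is created per rack.
    static int[] routeToTransducers(List<String> rackOfTask) {
        int[] dest = new int[rackOfTask.size()];
        int transducer = 0;
        for (List<Integer> members : groupByRack(rackOfTask).values()) {
            for (int task : members) {
                dest[task] = transducer;
            }
            transducer++;
        }
        return dest;
    }
}
```

With maps on racks r1, r2, r1, r3, r2, the routing table is [0, 1, 0, 2, 1]. To keep partition boundaries intact, partition p of every map output in a group would go to partition p of that group's transducer; that part is not shown here.
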
Not sure why pipelining is required for this? Essentially we are introducing
another vertex that is doing some partial grouping. In fact, it could be done
today in user land without Tez changes and we should be able to accomplish that
in this jira. The completed map outputs are being aggregated transparently for
the next stage.
Where Tez support could be needed for efficiency is to be able to short circuit
this stage. Let's say the vertex manager figures out that the transducer stage
is going to be useless (given data distribution, size and latency). Then Tez
could allow removing this stage from the DAG so that the real consumer stage
can be started with no overhead.
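
As a rough sketch of the kind of check a vertex manager could make before deciding to skip the transducer stage, here is a small plain-Java example. Everything in it is an assumption for illustration: the class and method names, the inputs, and both thresholds are hypothetical, and no such hook exists in Tez today. It bypasses the stage when there is too little map output to be worth combining, or when the combiner is expected to shrink the data very little.

```java
// Hypothetical illustration only; not part of Tez.
final class TransducerBypassSketch {

    // expectedReductionRatio = estimated combined output size / input size,
    // so a value close to 1.0 means the combiner barely reduces the data.
    static boolean shouldBypass(long totalMapOutputBytes,
                                double expectedReductionRatio,
                                long minBytesWorthCombining,
                                double maxUsefulRatio) {
        // Too little data: the extra stage costs more than it saves.
        if (totalMapOutputBytes < minBytesWorthCombining) {
            return true;
        }
        // Poor expected reduction: the consumer may as well read map output directly.
        return expectedReductionRatio > maxUsefulRatio;
    }
}
```

The thresholds would have to come from configuration or measurement; the sketch only shows where data size and distribution could enter the decision.
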
> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
> Key: TEZ-145
> URL: https://issues.apache.org/jira/browse/TEZ-145
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Tsuyoshi Ozawa
> Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit by running in multi-level trees,
> support of being able to run a combiner in a non-local mode would allow
> performance efficiencies to be gained by running a combiner at a rack-level.