[
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100146#comment-14100146
]
Bikas Saha commented on TEZ-145:
--------------------------------
We can look at a combination of vertex manager/edge (possibly only
vertexmanager) that can do the following job. Given a set of distributed data
(typically produced by the vertex that reads input but can be any vertex), how
can it be made less distributed so that we have a balance of parallelism and
cost of computation. The cost of launching 10K tasks to process 10K small
pieces of data might be less than launching 1K tasks to process the same data
if it was in 1K pieces. If the disttibuted data is already small then it can be
aggregated to a smaller set of locations, via an aggregation tree, so that each
location has a meaningful size wrt the cost of computation. If the distributed
data is large but we are going to apply a reducing aggregating function on it
(eg. sum or count) then it may make sense to apply that aggreagtion function on
it while executing that aggregation tree because then the data volume will
decrease as it funnels down that tree. MR combiner is simulating 0 level trees
in some sense. However, if the data volume is large and there is no reducing
function then we should skip the aggregation tree and directly do the
subsequent operation because the aggregation tree provides no benefit. So
having the ability of to create these kind of aggregation trees via a vertex
manager would be one promising way to go about doing this. However we are not
ready for that yet because we dont support dynamic addition/removal of vertices
in the DAG yet. Still, we can make a first step where the used has the option
of statically creating the aggregation vertex with a rack level aggregation
manager that can do rack level aggregations of data. This compile time choice
can be made by the user when they are going to process a large volume of data
and apply a reducing function to it.
> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
> Key: TEZ-145
> URL: https://issues.apache.org/jira/browse/TEZ-145
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Tsuyoshi OZAWA
> Labels: TEZ-1
>
> For aggregate operators that can benefit by running in multi-level trees,
> support of being able to run a combiner in a non-local mode would allow
> performance efficiencies to be gained by running a combiner at a rack-level.
--
This message was sent by Atlassian JIRA
(v6.2#6252)