[ 
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100146#comment-14100146
 ] 

Bikas Saha commented on TEZ-145:
--------------------------------

We can look at a combination of vertex manager/edge (possibly only 
vertexmanager) that can do the following job. Given a set of distributed data 
(typically produced by the vertex that reads input but can be any vertex), how 
can it be made less distributed so that we have a balance of parallelism and 
cost of computation. The cost of launching 10K tasks to process 10K small 
pieces of data might be less than launching 1K tasks to process the same data 
if it was in 1K pieces. If the disttibuted data is already small then it can be 
aggregated to a smaller set of locations, via an aggregation tree, so that each 
location has a meaningful size wrt the cost of computation. If the distributed 
data is large but we are going to apply a reducing aggregating function on it 
(eg. sum or count) then it may make sense to apply that aggreagtion function on 
it while executing that aggregation tree because then the data volume will 
decrease as it funnels down that tree. MR combiner is simulating 0 level trees 
in some sense. However, if the data volume is large and there is no reducing 
function then we should skip the aggregation tree and directly do the 
subsequent operation because the aggregation tree provides no benefit. So 
having the ability of to create these kind of aggregation trees via a vertex 
manager would be one promising way to go about doing this. However we are not 
ready for that yet because we dont support dynamic addition/removal of vertices 
in the DAG yet. Still, we can make a first step where the used has the option 
of statically creating the aggregation vertex with a rack level aggregation 
manager that can do rack level aggregations of data. This compile time choice 
can be made by the user when they are going to process a large volume of data 
and apply a reducing function to it.

> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
>                 Key: TEZ-145
>                 URL: https://issues.apache.org/jira/browse/TEZ-145
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Tsuyoshi OZAWA
>              Labels: TEZ-1
>
> For aggregate operators that can benefit by running in multi-level trees, 
> support of being able to run a combiner in a non-local mode would allow 
> performance efficiencies to be gained by running a combiner at a rack-level. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to