[ 
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490468#comment-14490468
 ] 

Bikas Saha commented on TEZ-145:
--------------------------------

Taking a step back, let's figure out the scenarios for this. 
Do we agree that for small jobs (small data) this is not going to be helpful, 
because we would be adding the latency of an extra stage for small combiner 
benefits?
Large job (large data) with no data reduction in the map-side combiner - this 
is not going to be helpful because the extra combiner will not reduce the data 
further.
Large job (large data) with high data reduction in the map-side combiner - this 
is going to be useful because the extra combiner will reduce the data further 
and also decrease the number of data shards by aggregating small outputs from 
the map tasks into a smaller number of combiner tasks.
Large job (large data) with a lot of filtering (no combiner) - this may be 
useful, not because there is a combine operation, but because the combiner 
tasks consolidate the large number of small outputs produced by the map tasks 
into a smaller number of shards.
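
The four scenarios above can be sketched as a simple decision function. This 
is only an illustration of the reasoning, not Tez code: the JobProfile type, 
its fields, and all three thresholds are hypothetical, and nothing in this 
discussion specifies concrete numbers.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
LARGE_INPUT_BYTES = 100 * 1024**3  # assumed cutoff for "large data"
HIGH_REDUCTION_RATIO = 0.5         # combiner output/input at or below this counts as "high reduction"
MANY_SMALL_OUTPUTS = 10_000        # assumed map-output count worth consolidating


@dataclass
class JobProfile:
    """Hypothetical summary of a job; not a Tez data structure."""
    input_bytes: int
    has_combiner: bool
    combiner_output_ratio: float  # map-side combiner output size / input size; 1.0 means no reduction
    num_map_outputs: int


def extra_combiner_stage_helps(job: JobProfile) -> bool:
    """Return True when an extra, non-local combiner stage looks worthwhile."""
    # Small job: the extra stage adds latency for little benefit.
    if job.input_bytes < LARGE_INPUT_BYTES:
        return False
    # Large job with a combiner: useful only if the combiner reduces data a lot.
    if job.has_combiner:
        return job.combiner_output_ratio <= HIGH_REDUCTION_RATIO
    # Large job without a combiner (e.g. heavy filtering): possibly useful
    # just to merge many small map outputs into fewer shards.
    return job.num_map_outputs >= MANY_SMALL_OUTPUTS
```

In practice the reduction ratio and output counts would have to be estimated 
or observed at runtime, which this sketch does not attempt.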

> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
>                 Key: TEZ-145
>                 URL: https://issues.apache.org/jira/browse/TEZ-145
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Tsuyoshi Ozawa
>         Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit by running in multi-level trees, 
> support of being able to run a combiner in a non-local mode would allow 
> performance efficiencies to be gained by running a combiner at a rack-level. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
