[
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490468#comment-14490468
]
Bikas Saha commented on TEZ-145:
--------------------------------
Taking a step back, let's figure out the scenarios for this.

Small job (small data): do we agree that this is not going to be helpful,
because we would be adding the latency of an extra stage for only a small
combiner benefit?

Large job (large data) with no data reduction in the map-side combiner: this
is not going to be helpful, because the extra combiner will not reduce the data
further.

Large job (large data) with high data reduction in the map-side combiner: this
is going to be useful, because the extra combiner will reduce the data further
and also decrease the number of data shards by aggregating small outputs from
the map tasks into a smaller number of combiner tasks.

Large job (large data) with a lot of filtering (no combiner): this may be
useful, not because there is a combine operation, but because the combiner
tasks would consolidate the large number of small outputs produced by the map
tasks into a smaller number of shards.
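
To make the four scenarios above concrete, here is a minimal sketch of them
as a decision function. It is only an illustration of the reasoning, not
anything that exists in Tez: the JobProfile fields, the function name and all
three thresholds are hypothetical placeholders that would need tuning.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration (not Tez defaults).
SMALL_JOB_BYTES = 1 << 30      # inputs below this count as "small data"
MIN_USEFUL_REDUCTION = 0.5     # fraction of data the map-side combiner removes
SMALL_SHARD_BYTES = 1 << 20    # average map output below this is a "small shard"


@dataclass
class JobProfile:
    """Hypothetical summary of a job, as it might be estimated up front."""
    input_bytes: int
    map_output_bytes: int      # total bytes emitted by all map tasks
    num_map_tasks: int
    has_combiner: bool
    map_side_reduction: float  # 0.0 = no reduction, 1.0 = everything combined


def extra_combiner_stage_helps(job: JobProfile) -> bool:
    """Return True if a non-local combiner stage is likely worth its latency."""
    # Scenario 1: small job. The extra stage costs more than it saves.
    if job.input_bytes < SMALL_JOB_BYTES:
        return False

    if job.has_combiner:
        # Scenario 2 (no reduction) vs. scenario 3 (high reduction): if the
        # map-side combiner barely reduced the data, a second one will not either.
        return job.map_side_reduction >= MIN_USEFUL_REDUCTION

    # Scenario 4: heavy filtering, no combiner. The stage may still help by
    # merging many small map outputs into fewer shards.
    avg_output = job.map_output_bytes / max(job.num_map_tasks, 1)
    return avg_output < SMALL_SHARD_BYTES
```

The fourth branch returns True only as a "may be useful" hint, in line with
the weaker claim made for the filtering case.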
> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
> Key: TEZ-145
> URL: https://issues.apache.org/jira/browse/TEZ-145
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Tsuyoshi Ozawa
> Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit from running in multi-level trees,
> supporting a combiner that runs in non-local mode would allow performance
> gains from running the combiner at the rack level.