[
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490468#comment-14490468
]
Bikas Saha edited comment on TEZ-145 at 4/10/15 10:30 PM:
----------------------------------------------------------
Taking a step back, let's figure out the scenarios for this.
Do we agree that:
1) Small jobs (small data) - this is not going to be helpful because we will be
adding the latency of an extra stage for small combiner benefits.
2) Large job (large data) with no data reduction in the map side combiner -
this is not going to be helpful because the extra combiner will not reduce the
data further.
3) Large job (large data) with high data reduction in the map side combiner -
this is going to be useful because the extra combiner will reduce the data
further and also decrease the number of data shards by aggregating small
outputs from the map tasks into a smaller number of combiner tasks.
4) Large job (large data) with a lot of filtering (no combiner) - this may be
useful, not because there is a combine operation, but to consolidate the large
number of small outputs produced by the map tasks into a smaller number of
shards in the combiner tasks.
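To make the shard-count argument in 3 and 4 concrete, here is a rough sketch of how many shuffle fetches each layout needs. The function names and the numbers are hypothetical, and the model assumes that every reducer pulls one partition from every upstream task and that each map output is read once by exactly one combiner task. It is not how Tez actually schedules fetches.

```python
def direct_fetches(num_maps: int, num_reducers: int) -> int:
    """Fetches when every reducer pulls one partition from every map task."""
    return num_maps * num_reducers


def two_level_fetches(num_maps: int, num_combiners: int, num_reducers: int) -> int:
    """Fetches with an intermediate combiner stage between maps and reducers.

    Assumption: each map output is read once by exactly one combiner task,
    and each reducer then pulls one partition from every combiner task.
    """
    return num_maps + num_combiners * num_reducers


if __name__ == "__main__":
    # Illustrative sizes only: 10,000 map tasks, 50 combiner tasks, 100 reducers.
    print(direct_fetches(10_000, 100))         # 1000000
    print(two_level_fetches(10_000, 50, 100))  # 15000
```

Under these assumptions the final reducers see 50 sources instead of 10,000, which is the reduction in small shards described above, independent of whether any combine operation shrinks the data.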
For 3 and 4, this may be useful if we can run aggregating combiner tasks at the
rack level to coalesce the data within a rack (cheap), rather than having to
pull that data across racks in the final reducer. Even in these cases, given
better networks, we need to understand the trade-off between pulling the data
across racks to the final reducer and the cost of running the extra combiner
stage.
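One way to reason about that trade-off is a back-of-the-envelope cost model. The sketch below is only an illustration: the function name, the bandwidth figures, and the fixed stage overhead are all assumptions, and it treats transfer as purely bandwidth-bound with no overlap between stages.

```python
def rack_combiner_pays_off(
    map_output_bytes: float,
    reduction_ratio: float,      # fraction of data removed by the extra combiner (0..1)
    cross_rack_bw: float,        # bytes/second across racks (assumed)
    intra_rack_bw: float,        # bytes/second within a rack (assumed)
    stage_overhead_s: float,     # fixed latency of the extra stage (assumed)
) -> bool:
    """Return True if the extra rack-level combiner stage is estimated to be faster."""
    # Baseline: all map output is pulled across racks by the final reducers.
    baseline_s = map_output_bytes / cross_rack_bw

    # With the combiner: move data within the rack, pay the stage overhead,
    # then pull only the reduced data across racks.
    reduced_bytes = map_output_bytes * (1.0 - reduction_ratio)
    with_combiner_s = (
        map_output_bytes / intra_rack_bw
        + stage_overhead_s
        + reduced_bytes / cross_rack_bw
    )
    return with_combiner_s < baseline_s


if __name__ == "__main__":
    GB, TB = 1e9, 1e12
    # Scenario 1: small data, the fixed stage latency dominates.
    print(rack_combiner_pays_off(1 * GB, 0.9, 1e9, 1e10, 30.0))
    # Scenario 2: large data, no further reduction.
    print(rack_combiner_pays_off(1 * TB, 0.0, 1e9, 1e10, 30.0))
    # Scenario 3: large data, high reduction.
    print(rack_combiner_pays_off(1 * TB, 0.9, 1e9, 1e10, 30.0))
```

With these made-up numbers only the third case comes out ahead. As the gap between intra-rack and cross-rack bandwidth narrows, the benefit shrinks, which is the "better networks" concern.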
Essentially, what is the killer scenario for this?
was (Author: bikassaha):
Taking a step back, let's figure out the scenarios for this.
Do we agree that for small jobs (small data) - this is not going to be helpful
because we will be adding an extra stage latency for small combiner benefits.
Large job (large data) with no data reduction in the map side combiner - this
is not going to be helpful because the extra combiner will not reduce the data
further.
Large job (large data) with high data reduction in the map side combiner - this
is going to be useful because the extra combiner will reduce the data further
and also decrease the number of data shards by aggregating small outputs from
the map tasks into a smaller number of combiner tasks.
Large job (large data) with a lot of filtering (no combiner) - this may be
useful, not because there is a combine operation, but to consolidate the large
number of small outputs produced by the map tasks into a smaller number of
shards in the combiner tasks.
> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
> Key: TEZ-145
> URL: https://issues.apache.org/jira/browse/TEZ-145
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Tsuyoshi Ozawa
> Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit from running in multi-level trees,
> support for running a combiner in a non-local mode would allow performance
> gains from running the combiner at the rack level.