[
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490535#comment-14490535
]
Gopal V commented on TEZ-145:
-----------------------------
Combiner have only made sense in case of 3/4.
#3 is the true use case for this, because combiners are written only for those
scenarios.
The old MRv2 model did a re-merge + combine() only if there were > 3 spills per
task.
So tuning it to have no extra spills produced bad shuffle performance, which is
what the Tez approach is not vulnerable to, since it is meant to combine
host-local data (plus skip merges via pipelining).
The original scenario where I discovered a need for this was when I was trying
to find the first/last transaction of sessions across a time window, to look
for overlapped session-ids for the same user to detect multiple device usage or
stolen tokens.
> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
> Key: TEZ-145
> URL: https://issues.apache.org/jira/browse/TEZ-145
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Tsuyoshi Ozawa
> Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit by running in multi-level trees,
> support of being able to run a combiner in a non-local mode would allow
> performance efficiencies to be gained by running a combiner at a rack-level.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)