[ 
https://issues.apache.org/jira/browse/TEZ-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490647#comment-14490647
 ] 

Tsuyoshi Ozawa commented on TEZ-145:
------------------------------------

[~bikassaha] [~gopalv] As Gopal mentioned, this feature can target 3 and 4. 
Here is a benchmark result from the MAPREDUCE-4502 prototype: 
http://www.slideshare.net/ozax86/prestrata-hadoop-word-meetup/11
On MAPREDUCE-4502, I tried running the combiner after tasks spill: this creates a 
performance trade-off between aggregation ratio and disk I/O. So Gopal's comment 
below makes sense to me.

{quote}
So tuning it to have no extra spills produced bad shuffle performance, which is 
what the Tez approach is not vulnerable to, since it is meant to combine 
host-local data (plus skip merges via pipelining).
{quote}

If we can implement an in-memory combiner, or similar DAG-level support, in the 
Tez layer, we can improve performance further. However, we would need to change 
the fault-tolerance semantics. 
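
To make the aggregation ratio vs. disk I/O trade-off concrete, here is a rough sketch of what I mean by an in-memory combiner. It is only an illustration: the class name InMemoryCombiner, the maxKeys threshold, and the in-memory list standing in for spilled output are all hypothetical and are not Tez or MapReduce APIs. Values for the same key are summed in a hash map before anything is written out, so a larger buffer gives a better aggregation ratio and fewer spills.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not a Tez or MapReduce API.
public class InMemoryCombiner {
    // Assumed tuning knob: maximum number of distinct keys held in memory.
    private final int maxKeys;
    private final Map<String, Long> buffer = new HashMap<>();
    // Stand-in for spill files on disk; each entry models one spill.
    private final List<Map<String, Long>> spills = new ArrayList<>();

    public InMemoryCombiner(int maxKeys) {
        this.maxKeys = maxKeys;
    }

    // Combine a record into the buffer. If a new key would exceed the
    // limit, the current buffer is spilled first.
    public void add(String key, long value) {
        if (!buffer.containsKey(key) && buffer.size() >= maxKeys) {
            flush();
        }
        buffer.merge(key, value, Long::sum);
    }

    // Write out the already-combined buffer as one spill and reset it.
    public void flush() {
        if (!buffer.isEmpty()) {
            spills.add(new HashMap<>(buffer));
            buffer.clear();
        }
    }

    public List<Map<String, Long>> spills() {
        return spills;
    }
}
```

In this sketch, whatever sits in the buffer is lost if the task fails before flush(), which is one reason the fault-tolerance semantics would need to be revisited.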

> Support a combiner processor that can run non-local to map/reduce nodes
> -----------------------------------------------------------------------
>
>                 Key: TEZ-145
>                 URL: https://issues.apache.org/jira/browse/TEZ-145
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Tsuyoshi Ozawa
>         Attachments: TEZ-145.2.patch, WIP-TEZ-145-001.patch
>
>
> For aggregate operators that can benefit by running in multi-level trees, 
> support of being able to run a combiner in a non-local mode would allow 
> performance efficiencies to be gained by running a combiner at a rack-level. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
