[ 
https://issues.apache.org/jira/browse/HADOOP-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606147#action_12606147
 ] 

Doug Cutting commented on HADOOP-3594:
--------------------------------------

I spoke with Ben, who argued that Pig could address this by implementing its 
combiner inside its mapper. That would probably be faster too: a HashMap could 
buffer the tuples, so they would not need to be serialized and deserialized, 
as they are with a combiner.
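
A minimal sketch of that in-mapper approach for AVERAGE, assuming string keys 
and integer values. This is not Pig or Hadoop code: the class InMapperAverage 
and its map/close/reduce methods are hypothetical names standing in for the 
mapper's per-record call, its end-of-input hook, and the reducer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, framework-free sketch of in-mapper combining for AVERAGE.
class InMapperAverage {
    // key -> {running sum, running count}, held as plain in-memory objects,
    // so nothing is serialized until close() hands the partials over.
    private final Map<String, long[]> buffer = new HashMap<>();

    // Called once per input record, instead of emitting (key, value) directly.
    void map(String key, long value) {
        long[] partial = buffer.computeIfAbsent(key, k -> new long[2]);
        partial[0] += value; // sum
        partial[1] += 1;     // count
    }

    // Called once at end of input: returns one (sum, count) partial per key
    // and empties the buffer.
    Map<String, long[]> close() {
        Map<String, long[]> out = new HashMap<>(buffer);
        buffer.clear();
        return out;
    }

    // Reduce side: merges the partials from any number of mappers and applies
    // the final step. Yields NaN if no partials are supplied.
    static double reduce(Iterable<long[]> partials) {
        long sum = 0;
        long count = 0;
        for (long[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return (double) sum / count;
    }
}
```

Because every record passes through map(), the reducer always receives 
(sum, count) pairs, however many times a separate combiner would or would not 
have run. The sketch keeps the whole buffer in memory; a real mapper would 
presumably need to emit and clear it once it grows past some bound.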

> Guaranteeing that combiner is called at least once
> --------------------------------------------------
>
>                 Key: HADOOP-3594
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3594
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>             Fix For: 0.18.0
>
>
> In 0.18, Hadoop decides how many times to call the combiner on both the map 
> and reduce sides. The number of invocations can be anywhere between 0 and N. 
> While multiple invocations can be useful, not invoking the combiner at all 
> can have serious consequences for a range of functions called algebraic 
> (http://classweb.gmu.edu/kersch/inft864/Readings/Shoshani/DataCube/DataCubeTechReport.pdf).
>  The main properties of such functions are that the intermediate and final 
> computations are different and that the first invocation transforms the data 
> to a different form. The most common example is the AVERAGE function. While 
> it is possible to work around this issue by annotating each tuple, it seems 
> that it would be much easier and faster if Hadoop always guaranteed at least 
> a single invocation.
>  
> Not having this guarantee will break all sorts of existing combiners.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
