In pymapred.py we use an "in-mapper" equivalent of a combiner for aggregating the counts. Just as Doug suggests, it is based on a large hashtable and is probably more efficient than using a standard combiner.
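A minimal sketch of the in-mapper combining pattern described above, for a word-count job. The function name `combine_in_mapper` is hypothetical (not taken from pymapred.py); the point is that counts accumulate in a local hashtable and each key is emitted once per map task, so no intermediate (word, 1) pairs are serialized for a separate combiner pass.

```python
from collections import defaultdict

def combine_in_mapper(lines):
    """Aggregate word counts in a local hashtable (the in-mapper
    combiner) instead of emitting one (word, 1) pair per token."""
    counts = defaultdict(int)  # the "large hashtable"
    for line in lines:
        for word in line.split():
            counts[word] += 1
    # In a streaming mapper, each aggregated pair would be emitted
    # here, e.g. print(f"{word}\t{count}") once per distinct word.
    return counts
```

Compared with a standard combiner, this avoids serializing and deserializing each record between the map and combine phases, at the cost of holding the table in the mapper's memory.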
On 6/18/08 4:31 PM, "Doug Cutting (JIRA)" <[EMAIL PROTECTED]> wrote:

> [ https://issues.apache.org/jira/browse/HADOOP-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606147#action_12606147 ]
>
> Doug Cutting commented on HADOOP-3594:
> --------------------------------------
>
> I spoke with Ben, and he argued that Pig could implement its combiner in its
> mapper to address this, and that would probably be faster too, since a HashMap
> could be used to buffer tuples and they would not need to be serialized and
> deserialized, as they are with a combiner.
>
>> Guaranteeing that combiner is called at least once
>> --------------------------------------------------
>>
>> Key: HADOOP-3594
>> URL: https://issues.apache.org/jira/browse/HADOOP-3594
>> Project: Hadoop Core
>> Issue Type: Bug
>> Reporter: Olga Natkovich
>> Fix For: 0.18.0
>>
>>
>> In 18, Hadoop decides how many times to call the combiner on both the map
>> and reduce sides. The possible number is between 0 and N.
>> While having multiple invocations can be useful, not invoking the combiner
>> at all can have serious consequences for a range of functions called
>> algebraic (http://classweb.gmu.edu/kersch/inft864/Readings/Shoshani/DataCube/DataCubeTechReport.pdf).
>> The main property of such functions is that the intermediate and final
>> computations are different, and that the first invocation transforms the
>> data to a different form. The most common example of this is the AVERAGE
>> function. While it is possible to work around this issue by annotating each
>> tuple, it seems that it would be much easier and faster if Hadoop always
>> guaranteed at least a single invocation.
>>
>> Not having this guarantee will break all sorts of existing combiners.
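To make the AVERAGE example from the quoted issue concrete, here is a minimal sketch (function names are illustrative, not from any Hadoop or Pig code). The combiner's first invocation changes the data's form from raw values to (sum, count) pairs; the reducer only works on that intermediate form, which is why a zero-invocation combiner breaks it without per-tuple annotation.

```python
def average_combiner(values):
    """First invocation: transform raw values into the intermediate
    form, a single (sum, count) pair."""
    return [(sum(values), len(values))]

def average_reducer(pairs):
    """Final computation: expects (sum, count) pairs, NOT raw values.
    If Hadoop skipped the combiner entirely, raw values would arrive
    here in the wrong form and the result would be wrong."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count
```

With an at-least-once guarantee, the reducer can assume every record is already in (sum, count) form; without it, each tuple would need an annotation saying which form it is in, which is the workaround the issue mentions.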
