[ 
https://issues.apache.org/jira/browse/PIG-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-274.
--------------------------------

    Resolution: Fixed

This work has been done

> changing combiner behavior past hadoop 18
> -----------------------------------------
>
>                 Key: PIG-274
>                 URL: https://issues.apache.org/jira/browse/PIG-274
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>
> In hadoop 18, the way commbiners are handled is changing. The hadoop team 
> agreed to keep things backward compatible for now but will depricate the 
> current behavior in the future (likely in hadoop 19) so pig needs to adjust 
> to the new behavior. This should be done in the post 2.0 code base.
> Old behavior: combiner is called once and only once per map task
> New behavior: combiner can be run 0 or more times on both map and reduce 
> sides. 0 times happens if only a single <K, V> fits into sort buffer. 
> Multiple time can happen in case of a hierarchical merge.
> The main issue that causes problem for pig is that we would not know in 
> advance whether the combiner will run 0,1 or more times. This causes several 
> issues:
> (1) Lets assume that we compute count. If we enable combiner, reducer expects 
> to get numbers not values as its input. Hadoop team suggested that we could 
> annotate each tuple with a byte that tells if it want through combiner. This 
> could be expensive computatinally as well as will use extra memory. One 
> things to notice is that some algebraics (like SUM, MIN, MAX) don't care 
> whether the data was precombined as they always to the same thing. Perhaps we 
> can make algebaic functions declare if they care or not. Then we only anotate 
> the ones that need it.
> (2) Since combiner can be called 1 or more times, getInitial and 
> getIntermediate have to do the same thing. So again, we need to change the 
> interface to reflcat that.
> (3) current combiner code assumes that it only works with 1 input. When it 
> runs on the reduce side, it can be dealing with tuples from multiple inputs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to