[
https://issues.apache.org/jira/browse/PIG-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607610#action_12607610
]
Pi Song commented on PIG-274:
-----------------------------
Sounds reasonable. Please note that the contract when implementing "Interim"
will have to allow multi-level from now on.
All the algebraic functions that we've got should be alright. The new
covariance thing should be fine as well.
> changing combiner behavior past hadoop 18
> -----------------------------------------
>
> Key: PIG-274
> URL: https://issues.apache.org/jira/browse/PIG-274
> Project: Pig
> Issue Type: Bug
> Reporter: Olga Natkovich
>
> In hadoop 18, the way commbiners are handled is changing. The hadoop team
> agreed to keep things backward compatible for now but will depricate the
> current behavior in the future (likely in hadoop 19) so pig needs to adjust
> to the new behavior. This should be done in the post 2.0 code base.
> Old behavior: combiner is called once and only once per map task
> New behavior: combiner can be run 0 or more times on both map and reduce
> sides. 0 times happens if only a single <K, V> fits into sort buffer.
> Multiple time can happen in case of a hierarchical merge.
> The main issue that causes problem for pig is that we would not know in
> advance whether the combiner will run 0,1 or more times. This causes several
> issues:
> (1) Lets assume that we compute count. If we enable combiner, reducer expects
> to get numbers not values as its input. Hadoop team suggested that we could
> annotate each tuple with a byte that tells if it want through combiner. This
> could be expensive computatinally as well as will use extra memory. One
> things to notice is that some algebraics (like SUM, MIN, MAX) don't care
> whether the data was precombined as they always to the same thing. Perhaps we
> can make algebaic functions declare if they care or not. Then we only anotate
> the ones that need it.
> (2) Since combiner can be called 1 or more times, getInitial and
> getIntermediate have to do the same thing. So again, we need to change the
> interface to reflcat that.
> (3) current combiner code assumes that it only works with 1 input. When it
> runs on the reduce side, it can be dealing with tuples from multiple inputs.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.