[ https://issues.apache.org/jira/browse/PIG-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich resolved PIG-274.
--------------------------------

    Resolution: Fixed

This work has been done.

> changing combiner behavior past hadoop 18
> -----------------------------------------
>
>                 Key: PIG-274
>                 URL: https://issues.apache.org/jira/browse/PIG-274
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>
> In Hadoop 18, the way combiners are handled is changing. The Hadoop team
> agreed to keep things backward compatible for now but will deprecate the
> current behavior in the future (likely in Hadoop 19), so Pig needs to adjust
> to the new behavior. This should be done in the post-2.0 code base.
>
> Old behavior: the combiner is called once and only once per map task.
> New behavior: the combiner can be run 0 or more times on both the map and
> reduce sides. 0 times happens if only a single <K, V> fits into the sort
> buffer. Multiple times can happen in the case of a hierarchical merge.
>
> The main issue that causes problems for Pig is that we would not know in
> advance whether the combiner will run 0, 1, or more times. This causes
> several issues:
>
> (1) Let's assume that we compute a count. If we enable the combiner, the
> reducer expects to get partial counts, not raw values, as its input. The
> Hadoop team suggested that we could annotate each tuple with a byte that
> tells whether it went through the combiner. This could be computationally
> expensive and would also use extra memory. One thing to notice is that some
> algebraics (like SUM, MIN, MAX) don't care whether the data was precombined,
> as they always do the same thing. Perhaps we can make algebraic functions
> declare whether they care or not. Then we only annotate the ones that need
> it.
>
> (2) Since the combiner can be called 1 or more times, getInitial and
> getIntermediate have to do the same thing. So again, we need to change the
> interface to reflect that.
>
> (3) The current combiner code assumes that it only works with 1 input. When
> it runs on the reduce side, it can be dealing with tuples from multiple
> inputs.
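
To make points (1) and (2) of the quoted description concrete, here is a minimal sketch of the tagging idea. It is written in Python purely for illustration and is not Pig's Java code: the names RAW, COMBINED, tag_raw, combine_count, reduce_count and naive_reduce_count are all hypothetical, and a tuple (tag, payload) stands in for the suggested one-byte annotation.

```python
# Illustrative sketch only: Python stand-ins, not Pig's Java Algebraic interface.
RAW, COMBINED = 0, 1  # hypothetical values for the one-byte tag


def tag_raw(value):
    """Mark a map output value as not yet combined."""
    return (RAW, value)


def combine_count(records):
    """COUNT step that is safe to run zero or more times.

    Accepts any mix of raw and already-combined records, so the same
    function serves as both the initial and the intermediate step.
    """
    total = 0
    for tag, payload in records:
        # A raw value counts as 1; a combined record carries a partial count.
        total += payload if tag == COMBINED else 1
    return (COMBINED, total)


def reduce_count(records):
    """Final COUNT on the reduce side; records may be raw, combined, or mixed."""
    return combine_count(records)[1]


def naive_reduce_count(inputs):
    """Old assumption: every input is already a partial count."""
    return sum(inputs)


values = [5, 7, 9, 11]
records = [tag_raw(v) for v in values]

# Combiner ran 0 times: the naive reducer adds up the raw values instead.
print(naive_reduce_count(values))  # 32, not the true count of 4

# Tagged records give 4 whether the combiner ran 0, 1, or several times.
print(reduce_count(records))       # 4
once = [combine_count(records[:2]), combine_count(records[2:])]
print(reduce_count(once))          # 4
mixed = [combine_count(once[:1] + records[2:3]), records[3]]
print(reduce_count(mixed))         # 4
```

SUM needs no tag in this picture, because summing partial sums and summing raw values are the same operation. That is why the description suggests annotating only the functions that declare they care.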
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.