changing combiner behavior past hadoop 18
-----------------------------------------

                 Key: PIG-274
                 URL: https://issues.apache.org/jira/browse/PIG-274
             Project: Pig
          Issue Type: Bug
            Reporter: Olga Natkovich


In Hadoop 18, the way combiners are handled is changing. The Hadoop team 
agreed to keep things backward compatible for now but will deprecate the 
current behavior in the future (likely in Hadoop 19), so Pig needs to adjust to 
the new behavior. This should be done in the post-2.0 code base.

Old behavior: the combiner is called once and only once per map task.
New behavior: the combiner can be run 0 or more times, on both the map and 
reduce sides. It runs 0 times if only a single <K, V> fits into the sort 
buffer. Multiple runs can happen in the case of a hierarchical merge.
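
To make the new contract concrete, here is a minimal sketch in plain Java (no 
Hadoop or Pig classes) that simulates a job in which the combiner runs a 
variable number of times before the reducer. The class name CombinerPasses, 
the fixed chunk size of 2, and the helper methods are assumptions made for 
illustration; they are not part of Hadoop or Pig. It shows why a function like 
SUM is unaffected by the change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustrative simulation only; none of these names exist in Hadoop or Pig.
final class CombinerPasses {

    /** Applies the combiner to consecutive chunks of at most chunkSize values. */
    static List<Long> combineOnce(List<Long> values, int chunkSize,
                                  Function<List<Long>, Long> combiner) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < values.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, values.size());
            out.add(combiner.apply(values.subList(i, end)));
        }
        return out;
    }

    /** Runs the given number of combiner passes (possibly zero), then the reducer. */
    static long run(List<Long> values, int passes,
                    Function<List<Long>, Long> combiner,
                    Function<List<Long>, Long> reducer) {
        List<Long> current = values;
        for (int p = 0; p < passes; p++) {
            current = combineOnce(current, 2, combiner);
        }
        return reducer.apply(current);
    }

    /** SUM: the same operation works as combiner and as reducer. */
    static long sum(List<Long> xs) {
        long total = 0;
        for (long x : xs) {
            total += x;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(1L, 2L, 3L, 4L, 5L);
        // The result is 15 whether the combiner runs 0, 1, 2, or 3 times.
        for (int passes = 0; passes <= 3; passes++) {
            long result = run(values, passes, CombinerPasses::sum, CombinerPasses::sum);
            System.out.println(passes + " combiner pass(es): " + result);
        }
    }
}
```

Because SUM consumes and produces the same kind of value, the number of 
combiner passes does not change the final answer.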

The main issue that causes problems for Pig is that we would not know in 
advance whether the combiner will run 0, 1, or more times. This causes several 
issues:

(1) Let's assume that we compute a count. If we enable the combiner, the 
reducer expects to get numbers (partial counts), not raw values, as its input. 
The Hadoop team suggested that we could annotate each tuple with a byte that 
tells whether it went through the combiner. This could be computationally 
expensive and would also use extra memory. One thing to notice is that some 
algebraics (like SUM, MIN, MAX) don't care whether the data was precombined, as 
they always do the same thing. Perhaps we can make algebraic functions declare 
whether they care or not. Then we only annotate the ones that need it.
(2) Since the combiner can be called 1 or more times, getInitial and 
getIntermediate have to do the same thing. So again, we need to change the 
interface to reflect that.
(3) The current combiner code assumes that it only works with 1 input. When it 
runs on the reduce side, it can be dealing with tuples from multiple inputs.
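
As a rough illustration of issues (1) and (2), the sketch below shows a COUNT 
that tags each value with a flag saying whether it has already been combined, 
and uses one step for both the first and any later combiner pass. It is plain 
Java, not Pig code: the class TaggedCount, the Tagged type, and the methods 
raw, count, and naiveReduce are hypothetical names chosen for this example, 
and the boolean field stands in for the per-tuple byte suggested by the Hadoop 
team.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; these types are not part of Pig or Hadoop.
final class TaggedCount {

    /** A value plus a flag saying whether it is already a partial count. */
    static final class Tagged {
        final boolean combined; // stands in for the suggested annotation byte
        final long payload;     // partial count if combined, otherwise a raw value

        Tagged(boolean combined, long payload) {
            this.combined = combined;
            this.payload = payload;
        }
    }

    /** Wraps a raw input value that no combiner has seen yet. */
    static Tagged raw(long value) {
        return new Tagged(false, value);
    }

    /**
     * Single step usable as initial combiner, later combiner, and reducer:
     * a raw value contributes 1, a partial count contributes its payload.
     */
    static Tagged count(List<Tagged> in) {
        long n = 0;
        for (Tagged t : in) {
            n += t.combined ? t.payload : 1;
        }
        return new Tagged(true, n);
    }

    /** Untagged reducer that assumes every input is already a partial count. */
    static long naiveReduce(List<Long> in) {
        long n = 0;
        for (long x : in) {
            n += x;
        }
        return n;
    }

    public static void main(String[] args) {
        List<Tagged> rawInput = Arrays.asList(raw(10), raw(20), raw(30));

        // Combiner ran 0 times: the reducer sees only raw values.
        System.out.println("zero passes: " + count(rawInput).payload);

        // Combiner ran on part of the data: the reducer sees a mix.
        Tagged partial = count(rawInput.subList(0, 2));
        System.out.println("mixed input: " + count(Arrays.asList(partial, raw(30))).payload);

        // Without tags, raw values are mistaken for counts when the combiner is skipped.
        System.out.println("naive, zero passes: " + naiveReduce(Arrays.asList(10L, 20L, 30L)));
    }
}
```

The tagged version yields 3 in both cases, while the untagged reducer returns 
60 when the combiner is skipped. The cost is one extra flag per value, which is 
why limiting the annotation to functions that declare they need it is 
attractive.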

