iirc, the last time support for combiners was added, Utkarsh unearthed a bunch of bugs (hence the restricted use of combiners in pig) ... can't access the testcases in the patch, but hopefully those cases are also covered!

Regards,
Mridul

Pradeep Kamath (JIRA) wrote:
     [ https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-563:
-------------------------------

    Status: Patch Available  (was: Open)

Changes are in two main places:
1) CombinerOptimizer which decides whether to use the combiner and also 
modifies the Map/combine/reduce plans to use the combiner
2) Builtin Aggregate UDFs - SUM, MIN, MAX, AVG and their typed variants and 
COUNT

The CombinerOptimizer is changed as follows:
The combiner is used only in the case of a group by followed by foreach generate <simple
project>*, <algebraic udf>*, where <simple project> is a projection of the group by key
(not a nested project like group.$0). Two new foreachs are inserted, one in the combine plan
and one in the map plan, both based on the reduce foreach.

The map foreach will have one inner plan for each inner plan in the foreach we're duplicating.
For projections, the plan will be the same. For algebraic udfs, the plan will have the initial
version of the function.

The combine foreach will have one inner plan for each inner plan in the foreach we're
duplicating. For projections, each project operator will be changed to project the column
that matches its position in the foreach. For algebraic udfs, the plan will have the
intermediate version of the function.

In the inner plans of the reduce foreach, each project operator for a projection will likewise
be changed to project the column that matches its position in the foreach. For algebraic udfs,
the plan will have the final version of the function. The input to the udf will be a POProject
which projects the column corresponding to the position of the udf in the foreach.

The map plan is changed by replacing the existing local rearrange with a special operator,
POPreCombinerLocalRearrange, which behaves like the regular local rearrange in getNext() as far
as getting its input and constructing the "key" out of the input. It then returns a tuple
with two fields: the key in the first position and the "value" inside a bag in the second
position. This output format resembles the output of a Package, and it feeds the map foreach,
which expects this format. A normal local rearrange is then attached as the leaf of the map
plan, with a project as its input which projects the key from the map foreach.

The combine plan will have the POCombiner package (formerly POPostCombinerPackage), the combiner
foreach and a local rearrange. The reduce plan will have a POCombiner package and the modified
foreach at its root.
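
To make the (key, bag) tuple format described above concrete, here is a rough Java sketch. It does not use Pig's real classes: KeyedBag, preCombinerRearrange, packageForKey and sumBag are hypothetical stand-ins, assuming string keys and long values purely for illustration. The point is only that a foreach body written against the (key, bag) shape can consume both the map-side rearrange output and the output of a package.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PreCombineShapeSketch {
    // Hypothetical stand-in for a two-field tuple: (key, bag of values).
    static final class KeyedBag {
        final String key;
        final List<Long> bag;
        KeyedBag(String key, List<Long> bag) { this.key = key; this.bag = bag; }
    }

    // Mimics the described map-side output: the key in the first field and
    // the single value wrapped in a bag in the second field.
    static KeyedBag preCombinerRearrange(String key, long value) {
        return new KeyedBag(key, Collections.singletonList(Long.valueOf(value)));
    }

    // Mimics what a package step hands to the combine or reduce foreach:
    // the same (key, bag) shape, with every value for the key in the bag.
    static KeyedBag packageForKey(String key, List<Long> valuesForKey) {
        return new KeyedBag(key, new ArrayList<>(valuesForKey));
    }

    // A foreach body written against the (key, bag) shape; it does not care
    // whether the bag holds one value or many.
    static long sumBag(KeyedBag t) {
        long total = 0;
        for (long v : t.bag) {
            total += v;
        }
        return total;
    }
}
```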

The UDFs are changed to have correct implementations for Initial, Intermediate 
and Final. TestBuiltin has also been changed to test this new setup.
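
As an illustration of the Initial / Intermediate / Final split, here is a minimal Java sketch of an AVG-style aggregate. It is not the code in the patch and does not use Pig's UDF interfaces: the Partial class and the method names initial, intermediate and finalAvg are assumptions made for this example. The idea it shows is that the intermediate stage consumes and produces the same partial state, so it can be skipped or repeated without changing the final answer.

```java
import java.util.List;

public class AlgebraicAvgSketch {
    // Hypothetical partial state passed between stages: running sum and count.
    static final class Partial {
        final double sum;
        final long count;
        Partial(double sum, long count) { this.sum = sum; this.count = count; }
    }

    // Initial stage (map side): turn one raw value into a partial state.
    static Partial initial(double value) {
        return new Partial(value, 1L);
    }

    // Intermediate stage (combiner): merge partial states into one partial
    // state. Input and output have the same type, so this may run any
    // number of times, including zero.
    static Partial intermediate(List<Partial> partials) {
        double sum = 0.0;
        long count = 0L;
        for (Partial p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new Partial(sum, count);
    }

    // Final stage (reducer): merge whatever partial states arrive and
    // produce the average, or null when there was no input at all.
    static Double finalAvg(List<Partial> partials) {
        Partial merged = intermediate(partials);
        return merged.count == 0 ? null : merged.sum / merged.count;
    }
}
```

For example, feeding the four initial partials for 1, 2, 3 and 6 straight to finalAvg, or first merging them pairwise with intermediate, should both yield 3.0.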


PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner 
is used for a pig query
------------------------------------------------------------------------------------------------------

                Key: PIG-563
                URL: https://issues.apache.org/jira/browse/PIG-563
            Project: Pig
         Issue Type: Improvement
   Affects Versions: types_branch
           Reporter: Pradeep Kamath
           Assignee: Pradeep Kamath
            Fix For: types_branch


Currently Pig's use of the combiner assumes the combiner is called exactly once 
in Hadoop. With Hadoop 0.18, the combiner could be called 0, 1 or more times. 
This issue is to track changes needed in the CombinerOptimizer visitor and the 
builtin Algebraic UDFs (SUM, COUNT, MIN, MAX, AVG) to be able to work in this 
new model.
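
A small Java sketch of why the number of combiner calls matters, using COUNT as the example. This is a toy simulation, not Hadoop or Pig code: mapSide, combineOnce, reduceSide and the chunk size of 3 are all hypothetical. It assumes the map side emits a partial count of 1 per record and that every later stage sums partial counts (rather than counting its inputs), which is what keeps the result independent of how often the combiner runs.

```java
import java.util.ArrayList;
import java.util.List;

public class CombinerTimesSketch {
    // Map side: every input record contributes a partial count of 1.
    static List<Long> mapSide(int records) {
        List<Long> partials = new ArrayList<>();
        for (int i = 0; i < records; i++) {
            partials.add(1L);
        }
        return partials;
    }

    // One combiner pass: collapse partial counts in chunks by summing them.
    // Counting the inputs here instead would break when passes repeat.
    static List<Long> combineOnce(List<Long> partials, int chunkSize) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < partials.size(); i += chunkSize) {
            long sum = 0L;
            int end = Math.min(i + chunkSize, partials.size());
            for (int j = i; j < end; j++) {
                sum += partials.get(j);
            }
            out.add(sum);
        }
        return out;
    }

    // Reduce side: sum whatever partial counts arrive.
    static long reduceSide(List<Long> partials) {
        long total = 0L;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    // Run the toy pipeline with the combiner applied combinerRuns times.
    static long countWithCombinerRuns(int records, int combinerRuns) {
        List<Long> partials = mapSide(records);
        for (int r = 0; r < combinerRuns; r++) {
            partials = combineOnce(partials, 3);
        }
        return reduceSide(partials);
    }
}
```

Under these assumptions, countWithCombinerRuns(10, 0), countWithCombinerRuns(10, 1) and countWithCombinerRuns(10, 3) all return 10.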



