[ 
https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-563:
-------------------------------

    Status: Patch Available  (was: Open)

Changes are in two main places:
1) CombinerOptimizer, which decides whether to use the combiner and also 
modifies the map/combine/reduce plans to use it
2) Builtin aggregate UDFs: SUM, MIN, MAX, AVG and their typed variants, and 
COUNT

The CombinerOptimizer is changed as follows:
The combiner is used only for a group by followed by a foreach of the form 
generate <simple project>*, <algebraic udf>*, where <simple project> is a 
projection of the group by key itself (not a nested projection such as 
group.$0). Two new foreachs are inserted, one in the combine plan and one in 
the map plan, both based on the reduce foreach.

The map foreach will have one inner plan for each inner plan in the foreach 
being duplicated. For projections, the plan will be the same. For algebraic 
udfs, the plan will have the Initial version of the function.

The combine foreach will also have one inner plan for each inner plan in the 
foreach being duplicated. For projections, each project operator will be 
changed to project the column that matches its position in the foreach. For 
algebraic udfs, the plan will have the Intermediate version of the function.

In the inner plans of the reduce foreach, the project operators for 
projections will likewise be changed to project the column that matches their 
position in the foreach. For algebraic udfs, the plan will have the Final 
version of the function. The input to the udf will be a POProject which 
projects the column corresponding to the position of the udf in the foreach.
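
The rewrite rules above can be sketched as a small standalone Java program. 
This is only an illustration, not the CombinerOptimizer code: the class 
CombinerPlanSketch, the Expr type and the textual plan notation are all 
hypothetical, and the sketch assumes the POProject input rule applies to the 
combine foreach as well as the reduce foreach.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the foreach rewrite; not Pig's real classes. */
public class CombinerPlanSketch {

    /** One expression of the original (reduce-side) foreach. */
    static final class Expr {
        final String udf;    // null for a simple projection of the group-by key
        final String input;  // what the expression reads in the original foreach
        Expr(String udf, String input) {
            this.udf = udf;
            this.input = input;
        }
    }

    /** Map foreach: projections are unchanged, udfs use their Initial version. */
    static List<String> mapPlans(List<Expr> foreach) {
        List<String> plans = new ArrayList<>();
        for (Expr e : foreach) {
            String project = "project(" + e.input + ")";
            plans.add(e.udf == null ? project : e.udf + ".Initial(" + project + ")");
        }
        return plans;
    }

    /**
     * Combine and reduce foreachs: the expression at position i reads column $i.
     * The stage is "Intermediate" for the combine plan and "Final" for reduce.
     */
    static List<String> positionalPlans(List<Expr> foreach, String stage) {
        List<String> plans = new ArrayList<>();
        for (int i = 0; i < foreach.size(); i++) {
            String project = "project($" + i + ")";
            Expr e = foreach.get(i);
            plans.add(e.udf == null ? project : e.udf + "." + stage + "(" + project + ")");
        }
        return plans;
    }

    public static void main(String[] args) {
        // Stands for: foreach ... generate group, SUM(A.x), COUNT(A)
        List<Expr> foreach = Arrays.asList(
                new Expr(null, "group"),
                new Expr("SUM", "A.x"),
                new Expr("COUNT", "A"));
        System.out.println("map:     " + mapPlans(foreach));
        System.out.println("combine: " + positionalPlans(foreach, "Intermediate"));
        System.out.println("reduce:  " + positionalPlans(foreach, "Final"));
    }
}
```

In this notation the combine foreach for SUM becomes 
SUM.Intermediate(project($1)), because SUM sits at position 1 in the foreach.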
The map plan is changed by replacing the existing local rearrange with a 
special operator, POPreCombinerLocalRearrange. In getNext() it behaves like 
the regular local rearrange as far as getting its input and constructing the 
"key" out of that input. It then returns a tuple with two fields: the key in 
the first position and the "value" inside a bag in the second position. This 
output format resembles the output of a Package, which is the format the map 
foreach expects. A normal local rearrange will then be attached as the leaf of 
the map plan, with a project as its input which projects the key from the map 
foreach.

The combine plan will have the POCombiner package (formerly 
POPostCombinerPackage), the combiner foreach and a local rearrange. The reduce 
plan will have a POCombiner package and the modified foreach at its root.
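
To make the two-field output of the new rearrange concrete, here is a minimal 
sketch of its shape. It does not use Pig's Tuple or DataBag classes: plain 
java.util lists stand in for both, the class and method names are 
hypothetical, and it assumes the "value" is the whole input record.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of the (key, {value}) output shape. */
public class PreCombinerRearrangeSketch {

    /**
     * Returns a two-field "tuple": the key in position 0 and a
     * one-element "bag" holding the value in position 1.
     */
    static List<Object> rearrange(List<Object> record, int keyColumn) {
        Object key = record.get(keyColumn);
        // Assumption: the value placed in the bag is the entire input record.
        List<List<Object>> bag = Collections.singletonList(record);
        return Arrays.<Object>asList(key, bag);
    }

    public static void main(String[] args) {
        List<Object> record = Arrays.<Object>asList("alice", 3);
        // Same shape a Package would emit: key first, then a bag of values.
        System.out.println(rearrange(record, 0));
    }
}
```

Because the map foreach always sees a key followed by a bag, the Initial 
function can be written against the same input shape as a function that runs 
after a Package.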

The UDFs are changed to have correct implementations for Initial, Intermediate 
and Final. TestBuiltin has also been changed to test this new setup.
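
As an illustration of what a "correct" split into Initial, Intermediate and 
Final has to guarantee, the sketch below implements an average whose partial 
state is a (sum, count) pair and checks that the result does not depend on how 
many times the combiner runs. It is a standalone sketch, not the builtin AVG 
code; the class name, the double[] state and the pairwise merging used to 
simulate combiner runs are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical three-stage average; not Pig's builtin AVG. */
public class AlgebraicAvgSketch {

    /** Initial: a bag of raw values becomes one partial {sum, count}. */
    static double[] initial(List<Double> bag) {
        double sum = 0;
        for (double v : bag) sum += v;
        return new double[] {sum, bag.size()};
    }

    /** Intermediate: partials in, one partial of the same shape out. */
    static double[] intermediate(List<double[]> partials) {
        double sum = 0, count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new double[] {sum, count};
    }

    /** Final: partials in, the average out. */
    static double finalStage(List<double[]> partials) {
        double[] total = intermediate(partials);
        return total[0] / total[1];
    }

    /** Averages values with the combiner applied combinerRuns times (0 allowed). */
    static double run(List<Double> values, int combinerRuns) {
        List<double[]> partials = new ArrayList<>();
        for (Double v : values) {
            partials.add(initial(Arrays.asList(v)));  // one-element bag per record
        }
        for (int r = 0; r < combinerRuns; r++) {
            // Simulate one combiner pass by merging neighbouring partials.
            List<double[]> merged = new ArrayList<>();
            for (int i = 0; i < partials.size(); i += 2) {
                int end = Math.min(i + 2, partials.size());
                merged.add(intermediate(partials.subList(i, end)));
            }
            partials = merged;
        }
        return finalStage(partials);
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(1.0, 2.0, 3.0, 6.0);
        for (int runs = 0; runs <= 2; runs++) {
            System.out.println(runs + " combiner run(s): " + run(values, runs));
        }
    }
}
```

All three runs give 3.0. The property that matters is that Intermediate 
accepts and produces the same shape that Initial produces, so Final receives 
valid input whether the combiner ran zero, one or several times.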


> PERFORMANCE: enable combiner to be called 0 or more times whenever the 
> combiner is used for a pig query
> ------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-563
>                 URL: https://issues.apache.org/jira/browse/PIG-563
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>
> Currently Pig's use of the combiner assumes the combiner is called exactly 
> once in Hadoop. With Hadoop 18, the combiner could be called 0, 1 or more 
> times. This issue is to track changes needed in the CombinerOptimizer visitor 
> and the builtin Algebraic UDFs (SUM, COUNT, MIN, MAX, AVG) to be able to work 
> in this new model.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
