iirc, the last time support for combiners was added, Utkarsh unearthed
a bunch of bugs (hence the restricted use of combiners in pig) ... can't
access the testcases in the patch, but hopefully those cases are also covered!
Regards,
Mridul
Pradeep Kamath (JIRA) wrote:
[
https://issues.apache.org/jira/browse/PIG-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pradeep Kamath updated PIG-563:
-------------------------------
Status: Patch Available (was: Open)
Changes are in two main places:
1) CombinerOptimizer which decides whether to use the combiner and also
modifies the Map/combine/reduce plans to use the combiner
2) Builtin Aggregate UDFs - SUM, MIN, MAX, AVG and their typed variants and
COUNT
The CombinerOptimizer is changed as follows:
The combiner is used only for a group by followed by a foreach of the form
generate <simple project>*, <algebraic udf>*, where <simple project> is a projection of the
group by key itself (not a nested project like group.$0). Two new foreachs are inserted, one in
the map plan and one in the combine plan, both based on the reduce foreach.

The map foreach has one inner plan for each inner plan in the foreach being duplicated. For
projections, the plan is the same. For algebraic udfs, the plan has the initial version of the
function.

The combine foreach also has one inner plan for each inner plan in the foreach being duplicated.
For projections, each project operator is changed to project the column matching its position
in the foreach. For algebraic udfs, the plan has the intermediate version of the function.

In the inner plans of the reduce foreach, project operators are likewise changed to project the
column matching their position in the foreach. For algebraic udfs, the plan has the final version
of the function. The input to the udf is a POProject that projects the column corresponding to
the position of the udf in the foreach.
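
As a rough illustration of the rewrite described above (not Pig's actual operator code), the
Python sketch below models the three foreachs for a hypothetical query
`foreach B generate group, SUM(A.x), COUNT(A.x)`. Every function name and the list-based "bag"
representation are assumptions made for the example; only the Initial/Intermediate/Final
staging and the "column i feeds inner plan i" rule come from the description.

```python
from collections import defaultdict

# Hypothetical model: tuples are Python tuples, bags are Python lists.
# None of these names are Pig APIs.

def map_foreach(key, bag):
    # Inner plan 0: project the group key unchanged.
    # Inner plans 1 and 2: Initial stage of SUM and COUNT over the input bag.
    return (key, sum(bag), len(bag))

def combine_foreach(key, partials):
    # Inner plan i reads column i of the incoming partial tuples.
    return (key,
            sum(p[1] for p in partials),   # Intermediate stage of SUM on column 1
            sum(p[2] for p in partials))   # Intermediate stage of COUNT on column 2

def reduce_foreach(key, partials):
    # Same positional projection; the UDFs are now in their Final stage.
    return (key,
            sum(p[1] for p in partials),   # Final stage of SUM on column 1
            sum(p[2] for p in partials))   # Final stage of COUNT on column 2

def group_by_key(tuples):
    # Stand-in for the shuffle/package step: collect tuples sharing column 0.
    groups = defaultdict(list)
    for t in tuples:
        groups[t[0]].append(t)
    return groups

def run(records):
    # Each record reaches the map foreach as (key, bag holding one value).
    mapped = [map_foreach(k, [x]) for k, x in records]
    combined = [combine_foreach(k, ps) for k, ps in group_by_key(mapped).items()]
    return {k: reduce_foreach(k, ps) for k, ps in group_by_key(combined).items()}

result = run([("a", 1), ("a", 2), ("b", 5), ("a", 4)])
```

For SUM and COUNT the intermediate and final stages happen to do the same arithmetic in this
model; a function such as AVG needs a genuinely different final step.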
The map plan is changed by replacing the existing local rearrange with a special operator,
POPreCombinerLocalRearrange. In getNext() it behaves like the regular local rearrange as far
as getting its input and constructing the "key" out of the input. It then returns a tuple
with two fields: the key in the first position and the "value" inside a bag in the
second position. This output format resembles the output of a Package, which is the format
the map foreach expects. A normal local rearrange is then attached as the
leaf of the map plan, with a project as its input that projects the key from the map foreach. The
combine plan will have the POCombiner package (formerly POPOstCombinerPackage), the combiner
foreach and a local rearrange. The reduce plan will have a POCombiner package and the modified
foreach at its root.
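
The output shape described above for POPreCombinerLocalRearrange can be pictured with a small
Python sketch. This is only a model of the data layout, not the operator's code: the function
names are made up, and treating the whole input tuple as the "value" is an assumption, since the
description does not say which fields end up in the bag.

```python
# Hypothetical model of the tuple shapes in the rewritten map plan.

def pre_combiner_local_rearrange(record, key_index=0):
    # Build the key from the input, then wrap the value in a one-element bag
    # so the result looks like what a Package would emit: (key, {value}).
    key = record[key_index]
    return (key, [record])  # assumption: the value is the full input tuple

def local_rearrange(foreach_output):
    # Leaf of the map plan: the key is projected from column 0 of the
    # map foreach output, and that output becomes the value.
    return (foreach_output[0], foreach_output)

shaped = pre_combiner_local_rearrange(("a", 3))
leaf = local_rearrange(("a", 3, 1))
```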
The UDFs are changed to have correct implementations for Initial, Intermediate
and Final. TestBuiltin has also been changed to test this new setup.
PERFORMANCE: enable combiner to be called 0 or more times whenever the combiner
is used for a pig query
------------------------------------------------------------------------------------------------------
Key: PIG-563
URL: https://issues.apache.org/jira/browse/PIG-563
Project: Pig
Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Fix For: types_branch
Currently Pig's use of the combiner assumes the combiner is called exactly once
in Hadoop. With Hadoop 18, the combiner could be called 0, 1 or more times.
This issue is to track changes needed in the CombinerOptimizer visitor and the
builtin Algebraic UDFs (SUM, COUNT, MIN, MAX, AVG) to be able to work in this
new model.
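
To make the "0, 1 or more times" requirement concrete, here is a minimal Python sketch of an
AVG split into initial, intermediate and final stages. It is not the Pig UDF code; the function
names, the (sum, count) partial format and the chunking used to imitate combiner invocations
are all assumptions for illustration. The point is that the intermediate stage consumes and
produces the same partial format, so the result should not depend on how often it runs.

```python
# Hypothetical three-stage AVG; partial results are (sum, count) pairs.

def avg_initial(values):
    # Map side: turn raw values into one partial.
    return (sum(values), len(values))

def avg_intermediate(partials):
    # Combiner: merge partials into one partial of the same shape.
    return (sum(s for s, _ in partials), sum(c for _, c in partials))

def avg_final(partials):
    # Reduce side: merge whatever partials arrive, then divide once.
    total, count = avg_intermediate(partials)
    return total / count if count else None

def run(values, combiner_rounds, chunk=2):
    # One partial per input value, as if each arrived in its own bag.
    partials = [avg_initial([v]) for v in values]
    # Imitate the combiner being invoked combiner_rounds times on small chunks.
    for _ in range(combiner_rounds):
        partials = [avg_intermediate(partials[i:i + chunk])
                    for i in range(0, len(partials), chunk)]
    return avg_final(partials)

values = [1, 2, 3, 4, 6]
results = [run(values, rounds) for rounds in (0, 1, 3)]
```

With this layout all three runs give the same average. A combiner that emitted an average
instead of a (sum, count) pair would only be safe if it ran exactly once.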