[ 
https://issues.apache.org/jira/browse/PIG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-580:
-------------------------------

    Patch Info: [Patch Available]
       Summary: PERFORMANCE: Combiner should also be used when there are 
distinct aggregates in a foreach following a group provided there are no 
non-algebraics in the foreach   (was: Combiner should also be used when there 
are distinct aggregates in a foreach following a group provided there are no 
non-algebraics in the foreach )

The changes are mostly in CombinerOptimizer and an additional UDF - 
Distinct.java. In the combiner optimizer if the inner plan of the foreach 
following a group has a distinct feeding into a UDF (algebraic or not), that 
plan is considered to be "combineable" - it is marked as being of type 
ExprType.DISTINCT. If there is atleast one inner plan in the foreach of type -  
ExprType.DISTINCT or ExprType.ALGEBRAIC and no inner plan of type 
ExprType.NOT_ALGEBRAIC then the optimizer considers the foreach to be 
combineable. If an inner plan is of type ExprType.Distinct, then the leaf of 
the plan is a UDF whose input comes from a PODistinct with a Project[bag][*] in 
between the two. to take the tuples from the PODistinct and supply it as a bag 
to the UDF. This part of the plan is modified as explained in the code comment 
below:
{noformat}
                // A PODistinct in the plan will always have
                // a Project[bag](*) as its successor.
                // We will replace it with a POUserFunc with "Distinct" as 
                // the underlying UDF. 
                // In the map and combine, we will make this POUserFunc
                // the leaf of the plan by removing other operators which
                // are descendants up to the leaf.
                // In the reduce we will keep descendants intact. Further
                // down in fixProjectAndInputs we will change the inputs to
                // this POUserFunc in the combine and reduce plans to be
                // just projections of the column "i"
{noformat}

The idea is that the "distinct" portion of this plan is combined and then the 
combined result is fed to the UDF. So in the map and combine phases only 
partial distinct outputs are computed and the UDF which takes the distinct 
output as input is not called. Then in the reduce, the "combined" distinct 
output is fed as input to the UDF. 

All these changes are independent of the changes on other non-distinct agg 
inner plans which may have Algebraic UDFs - in those plans, the existing 
modifications will remain.



> PERFORMANCE: Combiner should also be used when there are distinct aggregates 
> in a foreach following a group provided there are no non-algebraics in the 
> foreach 
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-580
>                 URL: https://issues.apache.org/jira/browse/PIG-580
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>
> Currently Pig uses the combiner only when there is foreach following a group 
> when the elements in the foreach generate have the following characteristics:
> 1) simple project of the "group" column
> 2) Algebraic UDF
> The above conditions exclude use of the combiner for distinct aggregates - 
> the distinct operation itself is combinable (irrespective of whether it feeds 
> to an algebraic or non algebraic udf). So if the following foreach should 
> also be combinable:
> {code}
> ..
> b = group a by $0;
> c = foreach b generate { x = distinct a; generate group, COUNT(x), SUM(x.$1) }
> {code}
> The combiner optimizer should cause the distinct to be combined and the final 
> combine output should feed the COUNT() and SUM() in the reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to