[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Rao updated PIG-4709:
-----------------------------
    Attachment: PIG-4709-v1.patch

Outlining the approach here:
Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup (via 
POGlobalRearrange). When the grouped data is consumed by subsequent algebraic 
operations, this is sub-optimal because it generates a lot of shuffle traffic. 
Spark's "reduceBy/combineBy/aggregateBy" operators are recommended in such 
cases. For this patch, Spark's reduceBy operator was chosen over combineBy, 
primarily to enable splitting the plan into multiple operators rather than 
stuffing all the operations (map, combine, merge) into a single one. Hence, 
this patch introduces a new Spark physical operator, POReduceBySpark. This 
class extends POForEach, mostly to enable consuming from input plans. The 
Spark plan is processed by SparkCombinerOptimizer, which modifies it as 
follows:
{code}
 Checks for algebraic operations and if they exist.
    Replaces global rearrange (cogroup) with reduceBy as follows:
    Input:
    foreach (using algebraicOp)
       -> packager
          -> globalRearrange
              -> localRearrange
    Output:
     foreach (using algebraicOp.Final)
       -> reduceBy (uses algebraicOp.Intermediate)
          -> foreach (using algebraicOp.Initial)
              -> localRearrange
{code}
Example: suppose AVG(num) is the algebraic operation. The first foreach 
(Initial) emits (num, 1). Given two tuples (num1, 1) and (num2, 1), reduceBy 
(Intermediate) emits (num1 + num2, 2), and so on. The final foreach emits 
(num1 + num2 + ...) / count.
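The three phases of the AVG example can be sketched in plain Python (not Spark code; the function names mirror Pig's Algebraic interface phases, Initial/Intermediate/Final, and are illustrative only):

{code}
from functools import reduce

def initial(num):
    # foreach (Initial): each value becomes a (partial_sum, count) pair
    return (num, 1)

def intermediate(a, b):
    # reduceBy (Intermediate): associative, pairwise merge of partials,
    # so it can run map-side as a combiner before the shuffle
    return (a[0] + b[0], a[1] + b[1])

def final(acc):
    # foreach (Final): compute the average from the merged partials
    total, count = acc
    return total / count

nums = [4, 8, 6, 2]
partials = [initial(n) for n in nums]    # [(4, 1), (8, 1), (6, 1), (2, 1)]
merged = reduce(intermediate, partials)  # (20, 4)
print(final(merged))                     # 5.0
{code}

Because intermediate is associative, partial merges can happen on each node before the shuffle, which is exactly what makes reduceBy cheaper than shipping every raw tuple to the CoGroup.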
The existing CombinerOptimizerUtil and its methods have been used heavily to 
perform the optimization.

This optimization can be turned off by setting pig.exec.nocombiner to 'true'.
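For example, the switch can be set in pig.properties (or passed with -D on the command line); this is a config fragment, shown only to illustrate the property named above:

{code}
# pig.properties: disable the combiner/reduceBy optimization
pig.exec.nocombiner=true
{code}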

> Improve performance of GROUPBY operator on Spark
> ------------------------------------------------
>
>                 Key: PIG-4709
>                 URL: https://issues.apache.org/jira/browse/PIG-4709
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: Pallavi Rao
>              Labels: spork
>             Fix For: spark-branch
>
>         Attachments: PIG-4709-v1.patch, PIG-4709.patch
>
>
> Currently, the GROUPBY operator of PIG is mapped by Spark's CoGroup. When the 
> grouped data is consumed by subsequent operations to perform algebraic 
> operations, this is sub-optimal as there is a lot of shuffle traffic. 
> The Spark Plan must be optimized to use reduceBy, where possible, so that a 
> combiner is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)