[
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pallavi Rao updated PIG-4709:
-----------------------------
Attachment: PIG-4709-v1.patch
Outlining the approach here:
Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup (via
POGlobalRearrange). When the grouped data is consumed by subsequent operations
to perform algebraic operations, this is sub-optimal because there is a lot of
shuffle traffic. Spark's "reduceBy/combineBy/aggregateBy" operators are
recommended in such cases. For this patch, Spark's reduceBy operator was chosen
over combineBy, primarily to enable splitting the plan into multiple operators
rather than stuffing all the operations (map, combine, merge) into a single
operator. Hence, this patch introduces a new physical operator in Spark called
POReduceBySpark. This class extends POForEach, mostly to enable consuming from
input plans. The Spark plan is processed by SparkCombinerOptimizer, which
modifies the plan as follows:
{code}
Check whether the plan contains algebraic operations.
If it does, replace global rearrange (cogroup) with reduceBy:
Input:
    foreach (using algebraicOp)
        -> packager
            -> globalRearrange
                -> localRearrange
Output:
    foreach (using algebraicOp.Final)
        -> reduceBy (uses algebraicOp.Intermediate)
            -> foreach (using algebraicOp.Initial)
                -> localRearrange
{code}
Example: if AVG(num) is the algebraic operation, the first foreach (Initial)
will emit (num, 1). Given two tuples (num1, 1) and (num2, 1),
reduceBy (Intermediate) will emit (num1+num2, 2), and so on. The final foreach
will emit (num1+num2+...)/count.
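The three phases above can be sketched in plain Java. This is a minimal
illustration of the Initial/Intermediate/Final contract for AVG, not Pig's
actual Algebraic interface; the class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the three algebraic phases of AVG as used by the
// reduceBy-based plan: Initial runs per input tuple, Intermediate is the
// associative merge performed by reduceBy, Final produces the result.
public class AlgebraicAvgSketch {

    // Initial phase: each input number becomes a partial (sum, count) = (num, 1).
    static long[] initial(long num) {
        return new long[] { num, 1 };
    }

    // Intermediate phase: reduceBy merges two partial (sum, count) pairs.
    static long[] intermediate(long[] a, long[] b) {
        return new long[] { a[0] + b[0], a[1] + b[1] };
    }

    // Final phase: divide the accumulated sum by the accumulated count.
    static double finalAvg(long[] partial) {
        return (double) partial[0] / partial[1];
    }

    public static void main(String[] args) {
        List<Long> nums = Arrays.asList(3L, 5L, 10L);
        long[] acc = new long[] { 0, 0 };
        for (long n : nums) {
            acc = intermediate(acc, initial(n));
        }
        System.out.println(finalAvg(acc)); // prints 6.0
    }
}
```

Because the Intermediate merge is associative and commutative, Spark can apply
it map-side before the shuffle, which is what reduces the shuffle traffic.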
The existing CombinerOptimizerUtil and its methods have been used heavily to
perform the optimization.
This optimization can be turned off by setting pig.exec.nocombiner to 'true'.
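For reference, the property can be set from a Pig script (or the Grunt shell)
with the standard set command; a sketch:

{code}
-- disable the combiner-based GROUPBY optimization for this script
set pig.exec.nocombiner 'true';
{code}

The same key can also be placed in pig.properties to disable it globally.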
> Improve performance of GROUPBY operator on Spark
> ------------------------------------------------
>
> Key: PIG-4709
> URL: https://issues.apache.org/jira/browse/PIG-4709
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: Pallavi Rao
> Assignee: Pallavi Rao
> Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4709-v1.patch, PIG-4709.patch
>
>
> Currently, the GROUPBY operator of PIG is mapped to Spark's CoGroup. When the
> grouped data is consumed by subsequent operations to perform algebraic
> operations, this is sub-optimal as there is a lot of shuffle traffic.
> The Spark Plan must be optimized to use reduceBy, where possible, so that a
> combiner is used.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)