[ https://issues.apache.org/jira/browse/PIG-273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301521#comment-14301521 ]
Brian Johnson commented on PIG-273:
-----------------------------------
I created https://issues.apache.org/jira/browse/PIG-4406 for this issue.
> Need to optimize the ways splits are handled, both in the top level plan and
> in nested plans.
> ---------------------------------------------------------------------------------------------
>
> Key: PIG-273
> URL: https://issues.apache.org/jira/browse/PIG-273
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Alan Gates
> Priority: Minor
> Fix For: 0.2.0
>
>
> Currently, in the new pipeline rework (see PIG-157), splits in the data flow
> are not handled efficiently.
> In the top level plans, splits cause all the output data to be written to HDFS
> and then reread by each leg of the split. This forces both a write/read cycle
> and a new map/reduce pass when it is not always necessary. For example, consider:
> A = load 'myfile';
> split A into B if $0 < 100, C if $0 >= 100;
> B1 = group B by $0;
> ...
> C1 = group C by $1;
> ...
> In this case A will be loaded, and then immediately stored again. Then a
> plan will be executed that handles the B* part of the script, and then
> another executed that will handle the C* part of the script.
> In nested plans, each projection of the generate is computed separately, even
> if they share common steps in the plan. For example:
> B = group A by $0;
> C = foreach B {
>     C1 = distinct $1;
>     C2 = filter C1 by $1 > 0;
>     generate group, COUNT(C1), COUNT(C2);
> }
> That will currently be executed with two nested plans, distinct->COUNT(C1)
> and distinct->filter->COUNT(C2). The same distinct will be computed twice.
> Ideally we would like to compute the distinct once and then split the output.
> I suspect that optimizing the inner plan is more important because there are
> more situations where this occurs.
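
To illustrate the nested-plan case described above, here is a minimal Python sketch (not Pig code, and not how Pig implements its plans). It models one group's bag as a list of tuples and compares evaluating the two projections as separate plans against computing the distinct once and splitting its output. The names make_distinct, separate_plans and shared_plan are hypothetical and exist only for this illustration.

```python
def make_distinct():
    """Return a stand-in 'distinct' operator plus a counter of how often it ran."""
    stats = {"calls": 0}

    def distinct(bag):
        stats["calls"] += 1
        return set(bag)  # duplicate tuples collapse to one

    return distinct, stats


def separate_plans(bag, distinct):
    """Two independent nested plans: distinct->COUNT and distinct->filter->COUNT."""
    count_c1 = len(distinct(bag))
    # The second plan recomputes the same distinct before filtering on field $1.
    count_c2 = sum(1 for t in distinct(bag) if t[1] > 0)
    return count_c1, count_c2


def shared_plan(bag, distinct):
    """Compute the distinct once, then split its output to both consumers."""
    c1 = distinct(bag)
    c2 = [t for t in c1 if t[1] > 0]
    return len(c1), len(c2)


if __name__ == "__main__":
    bag = [(1, 5), (1, 5), (1, -2), (1, 0)]  # made-up sample data
    for plan in (separate_plans, shared_plan):
        distinct, stats = make_distinct()
        print(plan.__name__, plan(bag, distinct), "distinct calls:", stats["calls"])
```

Both functions return the same counts for the same bag; the only difference is that the shared version invokes the distinct operator once per group instead of twice.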