[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665669#action_12665669
 ] 

Alan Gates commented on PIG-627:
--------------------------------

I propose to implement this as follows.

Currently split works by dumping all of its input to disk, and then starting MR 
jobs for each of it's outputs.  So if you have a script like:

{code}
A = load ...
split A into B, beta ...
C = filter B ...
D = group C ...
E = foreach D ...
store E
gamma = filter beta ...
delta = group gamma ...
epsilon = foreach delta ...
store epsilon
{code}

then A will be loaded and immediately stored by the split.  This output will 
then be loaded before C, and run to the store E.  Then the output will again be
loaded for gamma and run to epsilon.

If instead split was changed to have inner plans like foreach, the the above 
could be executed as A is loaded and the input passed to split.  Each tuple it
received it would run through a pipeline that contained C and a separate 
pipeline that contained gamma.  Separate map reduce jobs would then be started, 
one to
handle D-E and one delta-epsilon.  This turns 3 reads of the data into one plus 
two partials (depending on how selective the two filters are).

The relevance to the current issue is that queries like:

{code}
A = load ..
B = filter A ...
store B ...
C = group B ...
D = foreach C ...
store D;
{code}

would be implicitly converted to:

{code}
A = load ..
B = filter A ...
split B into B1, B2;
store B1 ...
C = group B2 ...
D = foreach C ...
store D;
{code}

Changes needed to accomplish this:
 * Add an optimization pass that takes a plan with splits and rearranges it to 
be contained within the splits plus any subsequent MR jobs.  This may need to 
be split up between the logical to physical translator and the MR compiler.  It 
also needs to be able to handle diamonds in the plan, where split data comes 
back together, either as part of the same MR job or in a later job.
 * Implement a split operator that can contain inner plans.  This is basically 
a foreach without a generate, and so hopefully much of the code from foreach 
could be shared or at least stolen.  It will be somewhat different in that it 
will be able to contain any non-MR boundary forcing task (filter, foreach, 
dump, store) and not be able to contain distinct or order by.




> PERFORMANCE: multi-query optimization
> -------------------------------------
>
>                 Key: PIG-627
>                 URL: https://issues.apache.org/jira/browse/PIG-627
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: types_branch
>            Reporter: Olga Natkovich
>             Fix For: types_branch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in map-only job that generated output1 followed by a 
> map-reduce job that generated output2. As the resuld data is read, parsed and 
> filetered twice which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to