PERFORMANCE: multi-query optimization on multiple group bys following a join or 
cogroup
---------------------------------------------------------------------------------------

                 Key: PIG-983
                 URL: https://issues.apache.org/jira/browse/PIG-983
             Project: Pig
          Issue Type: Improvement
          Components: impl
            Reporter: Richard Ding


The current multi-query optimizer works well with pig scripts like this one:

{code}
data = LOAD 'input' AS (a:chararray, b:int, c:int);
A = GROUP data BY b;
B = GROUP data BY c;
C = FOREACH A GENERATE group, COUNT(data);
D = FOREACH B GENERATE group, SUM(data.b);
STORE C INTO 'output1';
STORE D INTO 'output2';
{code}

In this case the original three Map-Reduce jobs are merged into one MR job by 
the optimizer.

The current optimizer, however, won't reduce the number of MR jobs for the 
scripts in which multiple group bys follow a join or a cogroup, such as this 
one:

{code}
data1 = LOAD 'input1' AS (a1:chararray, b1:int, c1:int);
data2 = LOAD 'input2' AS (a2:chararray, b2:int, c2:int);
A = JOIN data1 BY a1, data2 BY a2;
B = GROUP A BY data1::b1;
C = GROUP B BY data2::c2;
D = FOREACH B GENERATE group, COUNT(A);
E = FOREACH C GENERATE group, SUM(A.data2::b2);
STORE D INTO 'output1';
STORE E INTO 'output2';                        
{code}

Three MR jobs are still needed to run this script.

Multi-query optimizer should work with this kind of scripts by merging the 
group bys and reducing the overall MR jobs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to