Gunther Hagleitner updated PIG-627:

    Attachment: fix_store_prob.patch

This patch addresses an issue with the way we deal with scripts that do:
store a into 'foo';
a = load 'foo';

In the logical plan this will end up as a split with one branch storing into 
'foo' and the other continuing the processing after the load. The actual load 
is removed.

This works well but has an unfortunate side effect. If the store/load pair 
marks the boundary between two map-reduce jobs, the MRCompiler has to insert a 
tmp store-load bridge, which means that we now end up with two stores.
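
As an illustration, here is a hypothetical script that should hit this case. 
Only 'foo' comes from the description above; the relation names, field names 
and the other paths are made up, and the job layout in the comments is what I 
would expect from the compiler rather than captured output:

```pig
-- Job 1: the first group forces a reduce phase.
a = load 'input' as (x, y);
b = group a by x;
c = foreach b generate group as x, COUNT(a) as n;

-- The store/load pair sits where job 1 ends and job 2 begins.
store c into 'foo';
c = load 'foo' as (x, n);

-- Job 2: the second group needs a map-reduce job of its own.
d = group c by n;
e = foreach d generate group, COUNT(c);
store e into 'bar';
```

Since the load of 'foo' is dropped from the plan, job 1 would write its reduce 
output to 'foo' and also to the tmp file that feeds job 2, so the same data is 
stored twice.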

This fix detects this case in the optimization phase that runs after 
compilation. It removes the unnecessary store and makes the following job load 
from the remaining one.

> PERFORMANCE: multi-query optimization
> -------------------------------------
>                 Key: PIG-627
>                 URL: https://issues.apache.org/jira/browse/PIG-627
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Olga Natkovich
>         Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
> merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
> multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed and 
> filtered twice, which is unnecessary and costly. 
