[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

Pradeep Kamath (JIRA) Mon, 23 Mar 2009 10:35:32 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688339#action_12688339
 ]


Pradeep Kamath commented on PIG-627:
------------------------------------

Comments for Richard's patch - multiquery-phase2_0313.patch

In MultiQueryOptimizer:
- What about the case where the MR operator is not map-only and has an MR 
splittee? Is this not handled for now?
- Are the single mapper case and the single map-reduce case the ones where the 
script has an explicit store 'file' and load 'file'? If so, then 
mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee() remove the store. 
Shouldn't the store remain?
- There is common code in mergeOnlyMapperSplittee() and 
mergeOnlyMapReduceSplittee() which should be moved into a shared function to 
reduce code duplication.

Just want to confirm that the multi-query optimization is only for map-reduce 
mode, since the optimizer is called in MapReduceLauncher.

In POForEach, when there is a POStatus.STATUS_ERR, it is returned to the caller. 
I noticed that in POSplit it causes an exception. I think POSplit should return 
the error, which would later be caught in map() or reduce(). A test to make 
sure errors do get caught and cause failures would be good.
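
To illustrate the error handling I have in mind, here is a minimal sketch. 
All names in it (SplitErrorSketch, Result, Operator, Split, runMap) are 
hypothetical stand-ins and not the actual Pig classes. It only shows the idea 
of handing an error status back to the caller and failing in one place.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not the real Pig classes.
public class SplitErrorSketch {

    enum Status { OK, ERR }

    // Minimal result holder: a status plus an optional payload.
    static final class Result {
        final Status status;
        final Object value;

        Result(Status status, Object value) {
            this.status = status;
            this.value = value;
        }
    }

    // A nested plan inside the split, reduced to a single method.
    interface Operator {
        Result getNext(Object input);
    }

    // Split-like operator: runs every nested plan and, on the first error,
    // returns that error result to the caller instead of throwing.
    static final class Split implements Operator {
        private final List<Operator> nested = new ArrayList<>();

        void add(Operator op) {
            nested.add(op);
        }

        @Override
        public Result getNext(Object input) {
            for (Operator op : nested) {
                Result r = op.getNext(input);
                if (r.status == Status.ERR) {
                    return r; // propagate the error status, do not throw here
                }
            }
            return new Result(Status.OK, input);
        }
    }

    // Stand-in for map(): the single place where an error status fails the task.
    static void runMap(Operator root, Object input) {
        Result r = root.getNext(input);
        if (r.status == Status.ERR) {
            throw new IllegalStateException("operator returned error: " + r.value);
        }
    }

    public static void main(String[] args) {
        Split split = new Split();
        split.add(in -> new Result(Status.OK, in));
        split.add(in -> new Result(Status.ERR, "bad record"));
        try {
            runMap(split, "row");
            System.out.println("no failure");
        } catch (IllegalStateException e) {
            System.out.println("task failed: " + e.getMessage());
        }
    }
}
```

A test along these lines would feed one bad record through the split and 
check that the job fails rather than silently dropping the error.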

spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of 
ReverseDependencyOrderWalker.
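
Roughly what I would expect, as a simplified sketch. The class hierarchy 
below is a hypothetical reduction (a plan is just a String here, and the 
constructors and the WalkerSketch wrapper are made up), so it is not the 
actual Pig code. It assumes the intent is that a child walker has the same 
type as its parent.

```java
// Hypothetical, simplified walker hierarchy for illustration only. The real
// Pig walkers operate on plans and have more methods than shown here.
public class WalkerSketch {

    static abstract class PlanWalker {
        final String plan;

        PlanWalker(String plan) {
            this.plan = plan;
        }

        // Creates a walker for a nested plan.
        abstract PlanWalker spawnChildWalker(String nestedPlan);
    }

    static class DependencyOrderWalker extends PlanWalker {
        DependencyOrderWalker(String plan) {
            super(plan);
        }

        @Override
        PlanWalker spawnChildWalker(String nestedPlan) {
            return new DependencyOrderWalker(nestedPlan);
        }
    }

    static class ReverseDependencyOrderWalker extends PlanWalker {
        ReverseDependencyOrderWalker(String plan) {
            super(plan);
        }

        @Override
        PlanWalker spawnChildWalker(String nestedPlan) {
            // Return the same walker type so nested plans are also
            // visited in reverse dependency order.
            return new ReverseDependencyOrderWalker(nestedPlan);
        }
    }

    public static void main(String[] args) {
        PlanWalker child =
            new ReverseDependencyOrderWalker("outer").spawnChildWalker("inner");
        System.out.println(child.getClass().getSimpleName());
    }
}
```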

The following comment in BinStorage needs to be clarified:
{noformat}
        if (!FileLocalizer.fileExists(fileName, storage)) {
            // At compile time in batch mode, the file may not exist
            // (such as intermediate file). Just return null - the
            // same way as we could's get a valid record from the input. --> does this actually mean "the same way as we would if we did not get a valid record"?
            return null;
        }
        

{noformat}


> PERFORMANCE: multi-query optimization
> -------------------------------------
>
>                 Key: PIG-627
>                 URL: https://issues.apache.org/jira/browse/PIG-627
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Olga Natkovich
>         Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
> multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, 
> multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, data is read, parsed and 
> filtered twice, which is unnecessary and costly. 

