[
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688339#action_12688339
]
Pradeep Kamath commented on PIG-627:
------------------------------------
Comments for Richard's patch - multiquery-phase2_0313.patch
In MultiQueryOptimizer:
- What about the case where the MR job is not map-only and has an MR splittee?
Is this not handled for now?
- Do the single-mapper case and the single map-reduce case arise when the
script has an explicit store 'file' followed by load 'file'? If so, the store
is removed in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee() -
shouldn't the store remain?
- There is common code in mergeOnlyMapperSplittee() and
mergeOnlyMapReduceSplittee() which should be moved to a function to reduce the
code duplication.
Just want to confirm that the multi-query optimization is only for map-reduce
mode, since the optimizer is being called in MapReduceLauncher.
In POForEach, when there is a POStatus.STATUS_ERR, it is returned to the
caller. I noticed that in POSplit it causes an exception. I think it should
return the error, which would later be caught in map() or reduce(). A test to
make sure errors do get caught and cause failures would be good.
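To illustrate the suggestion, here is a minimal Java sketch of a split-like operator that hands an error status back to its caller instead of throwing. Status, Result, Operator, Split and Driver are hypothetical stand-ins invented for this example. They are not Pig's POStatus, Result, POSplit or the real map()/reduce() code, and the actual patch may be structured differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for an operator status and result; not Pig's classes.
enum Status { OK, ERR }

final class Result {
    final Status status;
    final Object value;

    Result(Status status, Object value) {
        this.status = status;
        this.value = value;
    }
}

interface Operator {
    Result getNext(Object input);
}

// Simplified split: feeds the same input to every nested branch.
final class Split implements Operator {
    private final List<Operator> branches;

    Split(List<Operator> branches) {
        this.branches = branches;
    }

    @Override
    public Result getNext(Object input) {
        for (Operator branch : branches) {
            Result r = branch.getNext(input);
            if (r.status == Status.ERR) {
                // Return the error to the caller instead of throwing here.
                return r;
            }
        }
        return new Result(Status.OK, input);
    }
}

// Stand-in for map()/reduce(): the single place where an error status
// is turned into a failure.
final class Driver {
    static void process(Operator root, Object input) {
        Result r = root.getNext(input);
        if (r.status == Status.ERR) {
            throw new IllegalStateException("operator returned an error: " + r.value);
        }
    }
}

final class SplitErrorDemo {
    public static void main(String[] args) {
        Operator ok = in -> new Result(Status.OK, in);
        Operator bad = in -> new Result(Status.ERR, "bad record");
        try {
            Driver.process(new Split(Arrays.asList(ok, bad)), "row");
            System.out.println("no failure");
        } catch (IllegalStateException e) {
            System.out.println("job failed: " + e.getMessage());
        }
    }
}
```

A test along the lines requested would build a split with one failing branch and assert that the driver fails, as SplitErrorDemo does here.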
spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of
ReverseDependencyWalker.
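The point about spawnChildWalker() can be sketched in Java as follows, on the assumption that the concern is nested plans being walked in a different order from the outer plan. PlanWalker, ForwardWalker and ReverseWalker are hypothetical, simplified classes, and a plan is modelled as a list of operator names already in dependency order. None of this mirrors Pig's real walker classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified walker hierarchy; not Pig's plan walker API.
abstract class PlanWalker {
    protected final List<String> plan;

    PlanWalker(List<String> plan) {
        this.plan = plan;
    }

    // Order in which this walker visits the operators of its plan.
    abstract List<String> order();

    // Creates the walker used for a plan nested inside the current one.
    abstract PlanWalker spawnChildWalker(List<String> nestedPlan);
}

final class ForwardWalker extends PlanWalker {
    ForwardWalker(List<String> plan) {
        super(plan);
    }

    @Override
    List<String> order() {
        return new ArrayList<>(plan);
    }

    @Override
    PlanWalker spawnChildWalker(List<String> nestedPlan) {
        return new ForwardWalker(nestedPlan);
    }
}

final class ReverseWalker extends PlanWalker {
    ReverseWalker(List<String> plan) {
        super(plan);
    }

    @Override
    List<String> order() {
        List<String> reversed = new ArrayList<>(plan);
        Collections.reverse(reversed);
        return reversed;
    }

    @Override
    PlanWalker spawnChildWalker(List<String> nestedPlan) {
        // Returning a ForwardWalker here would flip the order for nested plans.
        return new ReverseWalker(nestedPlan);
    }
}
```

With this shape, a child spawned from a reverse walker also visits its nested plan in reverse order.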
The following comment in BinStorage needs to be clarified:
{noformat}
if (!FileLocalizer.fileExists(fileName, storage)) {
// At compile time in batch mode, the file may not exist
// (such as intermediate file). Just return null - the
// same way as we could's get a valid record from the input.
--> Does this actually mean "the same way as we would if we did not get a valid
record"?
return null;
}
{noformat}
> PERFORMANCE: multi-query optimization
> -------------------------------------
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 1.0.0
> Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, multi-store-0303.patch,
> multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch,
> multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a
> map-reduce job that generates output2. As a result, data is read, parsed and
> filtered twice, which is unnecessary and costly.