[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688339#action_12688339 ]
Pradeep Kamath commented on PIG-627:
------------------------------------

Comments for Richard's patch - multiquery-phase2_0313.patch

In MultiQueryOptimizer:
 - What about an mr that is not map-only and has an mr splittee? Is this not handled for now?
 - Are the single mapper case and the single map-reduce case the ones where the script has an explicit store 'file' and load 'file'? If so, the store is removed in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee(). Shouldn't the store remain?
 - There is common code in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee() which should be moved to a function to reduce the code duplication.

Just want to confirm that the multi-query optimization is only for map-reduce mode, since the optimizer is being called in MapReduceLauncher.

In POForEach, when there is a POStatus.STATUS_ERR, it is returned to the caller. I noticed that in POSplit it causes an exception. I think it should return the error, which would later be caught in map() or reduce(). A test to make sure errors do get caught and cause failures would be good.

spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of ReverseDependencyWalker.

The following comment in BinStorage needs to be clarified:
{noformat}
if (!FileLocalizer.fileExists(fileName, storage)) {
    // At compile time in batch mode, the file may not exist
    // (such as intermediate file). Just return null - the
    // same way as we could's get a valid record from the input.
    --> does this actually mean "the same way as we would if we did not get a valid record" ?
    return null;
}
{noformat}

> PERFORMANCE: multi-query optimization
> -------------------------------------
>
>                 Key: PIG-627
>                 URL: https://issues.apache.org/jira/browse/PIG-627
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Olga Natkovich
>         Attachments: file_cmds-0305.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1, followed by a map-reduce job that generates output2. As a result, the data is read, parsed and filtered twice, which is unnecessary and costly.
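
To make the cost described in the issue concrete, here is a minimal standalone Java sketch of the example script. It is not Pig code and does not reflect how MultiQueryOptimizer is implemented. The class SharedScanSketch, its Row and Result types, and both run methods are hypothetical names invented for illustration. It only contrasts two independent passes over the input with a single shared pass that feeds both stores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names exist in Pig.
public class SharedScanSketch {

    /** One input row with the fields (a, b, c) from the example script. */
    static final class Row {
        final int a;
        final String b;
        final String c;

        Row(int a, String b, String c) {
            this.a = a;
            this.b = b;
            this.c = c;
        }
    }

    /** Both outputs plus a count of how many rows were read and filtered. */
    static final class Result {
        final List<Row> output1 = new ArrayList<>();
        final Map<String, List<Row>> output2 = new TreeMap<>();
        int rowsScanned = 0;
    }

    /** Independent execution: each store re-reads and re-filters the input. */
    static Result runIndependently(List<Row> data) {
        Result result = new Result();
        // First job: load, filter, store into output1.
        for (Row row : data) {
            result.rowsScanned++;
            if (row.a > 5) {
                result.output1.add(row);
            }
        }
        // Second job: load and filter again, then group by b for output2.
        for (Row row : data) {
            result.rowsScanned++;
            if (row.a > 5) {
                result.output2.computeIfAbsent(row.b, key -> new ArrayList<>()).add(row);
            }
        }
        return result;
    }

    /** Shared execution: one pass over the input feeds both stores. */
    static Result runShared(List<Row> data) {
        Result result = new Result();
        for (Row row : data) {
            result.rowsScanned++;
            if (row.a > 5) {
                result.output1.add(row);
                result.output2.computeIfAbsent(row.b, key -> new ArrayList<>()).add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> data = List.of(
                new Row(3, "x", "p"),
                new Row(7, "x", "q"),
                new Row(9, "y", "r"));
        Result independent = runIndependently(data);
        Result shared = runShared(data);
        System.out.println("independent scans: " + independent.rowsScanned);
        System.out.println("shared scans: " + shared.rowsScanned);
    }
}
```

Both methods produce the same output1 and output2. The only difference in this sketch is the number of rows read and filtered, which is the overhead the issue aims to remove.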