[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-627:
-----------------------------

    Attachment: multiquery-phase2_0323.patch

Thanks for reviewing the patch.

In MultiQueryOptimizer:

* What about the case where the mr is not map only and has an mr splittee? Is this not handled for now?

_Yeah. There are two cases where splittees will not be merged into the splitter: (1) the splitter is not map only and the splittee has a reducer, and (2) the splittee has multiple roots (loads)._

* Are the single mapper case and the single map-reduce case the ones where the script has an explicit store 'file' and load 'file'? If so, then in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee() the store is removed. Shouldn't the store remain?

_An explicit store/load combination in a script is transformed into an implicit split, hence the store remains._

* There is common code in mergeOnlyMapperSplittee() and mergeOnlyMapReduceSplittee() which should be moved to a function to reduce the code duplication.

_Fixed_

Just want to confirm that the multi-query optimization is only for map-reduce mode, since the optimizer is being called in MapReduceLauncher.

_Yes_

In POForEach, when there is a POStatus.STATUS_ERR, it is returned to the caller. I noticed that in POSplit it causes an exception. I think it should return the error, which would later be caught in map() or reduce(). A test to make sure errors do get caught and cause failures would be good.

_Fixed_

spawnChildWalker() of ReverseDependencyOrderWalker should return an instance of ReverseDependencyWalker.
_Fixed_

> PERFORMANCE: multi-query optimization
> -------------------------------------
>
>                 Key: PIG-627
>                 URL: https://issues.apache.org/jira/browse/PIG-627
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Olga Natkovich
>         Attachments: file_cmds-0305.patch, multi-store-0303.patch,
> multi-store-0304.patch, multiquery-phase2_0313.patch,
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch,
> multiquery_0306.patch, multiquery_explain_fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1, followed by a
> map-reduce job that generates output2. As a result, the data is read, parsed, and
> filtered twice, which is unnecessary and costly.

-- 
This message is automatically generated by JIRA.
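
For readers following the review thread, the two exclusions given in the reply (a splittee is not merged when the splitter is not map only and the splittee has a reducer, or when the splittee has multiple roots) can be restated as a small predicate. This is only a sketch of the rule as worded in the comment, not code from the patch: the class MergeRuleSketch, the Job type, its fields, and the method names are all hypothetical and do not correspond to Pig's own classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MergeRuleSketch {

    /** Hypothetical stand-in for one map-reduce job in the plan; not a Pig class. */
    static final class Job {
        final String name;
        final boolean mapOnly;  // true when the job has no reduce phase
        final int rootCount;    // number of roots (loads) feeding the job

        Job(String name, boolean mapOnly, int rootCount) {
            this.name = name;
            this.mapOnly = mapOnly;
            this.rootCount = rootCount;
        }
    }

    /** Returns true unless one of the two exclusions from the reply applies. */
    static boolean canMerge(Job splitter, Job splittee) {
        // Case (2): a splittee with multiple roots (loads) is not merged.
        if (splittee.rootCount > 1) {
            return false;
        }
        // Case (1): the splitter is not map only and the splittee has a reducer.
        if (!splitter.mapOnly && !splittee.mapOnly) {
            return false;
        }
        return true;
    }

    /** Names of the splittees that pass the check, in input order. */
    static List<String> mergeable(Job splitter, List<Job> splittees) {
        List<String> names = new ArrayList<>();
        for (Job splittee : splittees) {
            if (canMerge(splitter, splittee)) {
                names.add(splittee.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Job splitter = new Job("splitter", true, 1);  // map-only splitter
        List<Job> splittees = Arrays.asList(
            new Job("map-only-splittee", true, 1),
            new Job("map-reduce-splittee", false, 1),
            new Job("two-load-splittee", false, 2));
        System.out.println(mergeable(splitter, splittees));
    }
}
```

With a map-only splitter, the sketch accepts the first two splittees and rejects the one with two roots. Changing the splitter to a map-reduce job would additionally reject the splittee that has a reducer.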