Pradeep Kamath commented on PIG-627:

Sorry about the misunderstanding; I think I looked at a different patch. Having 
reviewed the right patch, I have the following comments:

The patch throws generic Java exceptions such as IllegalStateException. These 
should be replaced with the appropriate exception class (such as 
MRCompilerException), as specified in 
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification. The 
exception should be created using the constructor that takes the error code, 
error source and error message. New error codes should be introduced only if 
none of the existing ones can be used. If new codes are introduced, the wiki 
table should be updated.
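
As a rough illustration of the constructor shape described above, here is a 
self-contained sketch. SketchCompilerException and the BUG constant are 
simplified stand-ins written for this example, not the actual Pig classes, and 
the error code 2999 is a placeholder rather than an entry from the wiki table.

```java
public class ErrorCodeSketch {
    // Hypothetical error-source marker; the value is a placeholder.
    static final byte BUG = 4;

    // Simplified stand-in for a compiler exception that carries an
    // error code and an error source in addition to the message.
    static class SketchCompilerException extends Exception {
        private final int errorCode;
        private final byte errorSource;

        SketchCompilerException(String message, int errorCode, byte errorSource) {
            super(message);
            this.errorCode = errorCode;
            this.errorSource = errorSource;
        }

        int getErrorCode() {
            return errorCode;
        }

        byte getErrorSource() {
            return errorSource;
        }
    }

    // Instead of throwing IllegalStateException, throw the typed exception
    // with an explicit code, source and message.
    static void checkPlanNotEmpty(int operatorCount) throws SketchCompilerException {
        if (operatorCount == 0) {
            int errCode = 2999; // placeholder error code
            String msg = "Unexpected empty plan during compilation.";
            throw new SketchCompilerException(msg, errCode, BUG);
        }
    }

    public static void main(String[] args) {
        try {
            checkPlanNotEmpty(0);
        } catch (SketchCompilerException e) {
            System.out.println("ERROR " + e.getErrorCode() + ": " + e.getMessage());
        }
    }
}
```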

The following can be used to check for file existence in 
BinStorage.determineSchema(). Null should be returned only in the case where 
the file does not exist:

    public static boolean fileExists(String filename, DataStorage store)
            throws IOException {
        ElementDescriptor elem = store.asElement(filename);
        return elem.exists() || globMatchesFiles(elem, store);
    }
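
To make the intended control flow concrete, here is a minimal sketch of the 
"return null only when the file is missing" rule. It is not the BinStorage 
code: it uses java.nio on the local file system as a stand-in for DataStorage, 
ignores glob matching, and treats the first line of the file as a placeholder 
"schema". Any other failure is left to propagate as an IOException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DetermineSchemaSketch {
    // Stand-in for fileExists(filename, store): local files only, no globs.
    static boolean fileExists(String filename) {
        return Files.exists(Paths.get(filename));
    }

    // Returns null only when the file does not exist. Read errors are not
    // swallowed; they surface to the caller as IOException.
    static String determineSchema(String filename) throws IOException {
        if (!fileExists(filename)) {
            return null;
        }
        Path path = Paths.get(filename);
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            // Placeholder for real schema detection: use the first line.
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("schema-sketch", ".txt");
        Files.write(tmp, "a,b,c\n".getBytes());
        System.out.println(determineSchema(tmp.toString()));
        System.out.println(determineSchema(tmp.toString() + ".missing"));
        Files.delete(tmp);
    }
}
```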

Instead of introducing a rootsFirst attribute in DependencyOrderWalker, I 
wonder if we should have a ReverseDependencyOrderWalker, since that is what the 
rootsFirst == false case amounts to. If we are not visiting from roots to 
leaves, we are not really visiting in dependency order, so the name no longer 
matches the behavior, which can be confusing. Explicitly naming the walker 
ReverseDependencyOrderWalker makes the intent of walking from leaves to roots 
clearer.
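
For illustration, a leaves-to-roots traversal can be sketched as a depth-first 
post-order walk over successor edges, so that every node is emitted only after 
all of its successors. This is a standalone sketch, not the Pig walker API: the 
graph is a plain map of hypothetical node names, and it assumes the plan is 
acyclic (there is no cycle detection).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ReverseDependencyOrderSketch {
    // Returns the nodes ordered so that each node appears after all of its
    // successors, i.e. leaves first and roots last. Assumes a DAG.
    static List<String> leavesToRoots(Map<String, List<String>> successors) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String node : successors.keySet()) {
            visit(node, successors, visited, order);
        }
        return order;
    }

    private static void visit(String node, Map<String, List<String>> successors,
                              Set<String> visited, List<String> order) {
        if (!visited.add(node)) {
            return; // already handled
        }
        for (String succ : successors.getOrDefault(node, Collections.<String>emptyList())) {
            visit(succ, successors, visited, order);
        }
        order.add(node); // emitted only after all successors
    }

    public static void main(String[] args) {
        // Hypothetical plan: load -> filter -> {store1, group -> store2}
        Map<String, List<String>> plan = new LinkedHashMap<>();
        plan.put("load", Arrays.asList("filter"));
        plan.put("filter", Arrays.asList("store1", "group"));
        plan.put("store1", Collections.<String>emptyList());
        plan.put("group", Arrays.asList("store2"));
        plan.put("store2", Collections.<String>emptyList());
        System.out.println(leavesToRoots(plan));
    }
}
```

A separate walker class built around a traversal like this keeps 
DependencyOrderWalker's roots-first contract unchanged.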

In POSplit there is currently a PhysicalPlan representing the merged inner 
plans (where all plans are mutually exclusive) and also a List<PhysicalPlan> 
that holds the same information as a list. In the rest of the Pig code, inner 
plans have always been modelled as List<PhysicalPlan>. For consistency, it is 
better to have just a List<PhysicalPlan> to represent the inner plans.

> PERFORMANCE: multi-query optimization
> -------------------------------------
>                 Key: PIG-627
>                 URL: https://issues.apache.org/jira/browse/PIG-627
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Olga Natkovich
>         Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
> multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, data is read, parsed and 
> filtered twice, which is unnecessary and costly.
