[jira] Commented: (PIG-143) Proposal for refactoring of parsing logic in Pig

Pi Song (JIRA) Thu, 17 Apr 2008 07:11:19 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590040#action_12590040
 ]


Pi Song commented on PIG-143:
-----------------------------

Dear Alan,

Inline answers:-

1) Make Validator have a "has a" relationship to Visitor instead of "is a" 
seems like it has some downsides. First, I suspect all validators are going to 
be the same boiler plate code: instantiate visitor, visit, translate any 
exceptions into message collector. Two, the visitors that are doing the actual 
validation are forced to throw exceptions rather than write directly to the 
message collector. This is inefficient. If instead Validator extended Visitor, 
then you avoid both these issues. I think this "is a" relationship is 
appropriate as there will never be a validator that isn't a visitor.

[Pi] Agree for "is-a". This will be fixed in the next patch. For the exception 
thing 1) Visitors are not forced to throw exceptions (See in 
InputOutputFileValidator for example). The contract to implement a validator 
here is "keeping all the error messages in the message collector as much as 
possible. Only throw exception to stop the logic when the problem is too bad or 
if the following messages are likely to be screwed due to the current problem"

2) In TypeCheckingVisitor, why is the function "visit(ExpressionOperator)", 
which is just a big switch statement to find the right type, necessary? Won't 
java automatically pick the right visit() based on the class type that is 
passed?

[Pi] Java picks the visit() based on the reference holder's type. For example, 
the output will be "Animal, Dog, Animal" in the following code:-
{noformat}
public class TestPoly {
    public static void main(String[] args) {
        
        TestPoly obj = new TestPoly() ;
        obj.doIt(new Animal()) ;
        obj.doIt(new Dog()) ;
        
        Animal an = new Dog() ;
        obj.doIt(an) ;
        
    }
    
    public void doIt(Animal an) {
        System.out.println("Animal") ;       
    }
    
    public void doIt(Dog d) {
        System.out.println("Dog") ;       
    }
    
    static class Animal {
        
    }

    static class Dog extends Animal {
        
    }

}
{noformat}
So on the safe side for the caller, that bit will be needed.

3) For arithmetic expressions, I think we also want to allow it if both sides 
are bytearray. In that case we cast both to double (since it's the most 
generic). In general, see the charts at 
http://wiki.apache.org/pig/PigTypesFunctionalSpec to determine what conversions 
we want to do.

[Pi] Add/Sub/Mul/Div for (ByteArray x ByteArray) is defined as "Error" in those 
tables. I already followed that Wiki. If you want it to work even if both sides 
are bytearray, I would suggest to have double as output.


4) On my to do list is to add a method to Schema to do schema merging.
This would handle exactly issues like this, where type promotion needs to be
done, and throwing errors in cases where types can't be promoted.

[Pi] What is the difference between that new method and Schema.reconcile() ?

5) Hopefully the existing walker will work in this case. I've used
similar code before to walk plans that were being altered as they were walked,
so I expect this to work, or at least work after a little tweaking.

[Pi] You're right. The walker should work fine for casting insert job. Though 
there are some other cases that the walker might be screwed. For example, if 
you swap parent and child relation. In such cases, a special walker would be 
needed.


> Proposal for refactoring of parsing logic in Pig
> ------------------------------------------------
>
>                 Key: PIG-143
>                 URL: https://issues.apache.org/jira/browse/PIG-143
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Pi Song
>            Assignee: Pi Song
>         Attachments: ParserDrawing.png, pigtype_cycle_check.patch, 
> validation_part1.patch, validation_part2.patch, validation_v2.patch
>
>
> h2. Pig Script Parser Refactor Proposal 
> This is my initial proposal on pig script parser refactor work. Please note 
> that I need your opinions for improvements.
> *Problem*
> The basic concept is around the fact that currently we do validation logics 
> in parsing stage (for example, file existence checking) which I think is not 
> clean and difficult to add new validation rules. In the future, we will need 
> to add more and more validation logics to improve usability.
> *My proposal:-*  (see [^ParserDrawing.png])
> - Only keep parsing logic in the parser and leave output of parsing logic 
> being unchecked logical plans. (Therefore the parser only does syntactic 
> checking)
> - Create a new class called LogicalPlanValidationManager which is responsible 
> for validations of the AST from the parser.
> - A new validation logic will be subclassing LogicalPlanValidator 
> - We can chain a set of LogicalPlanValidators inside 
> LogicalPlanValidationManager to do validation work. This allows a new 
> LogicalPlanValidator to be added easily like a plug-in. 
> - This greatly promotes modularity of the validation logics which  is 
> +particularly good when we have a lot of people working on different things+ 
> (eg. streaming may require a special validation logic)
> - We can set the execution order of validators
> - There might be some backend specific validations needed when we implement 
> new execution engines (For example a logical operation that one backend can 
> do but others can't).  We can plug-in this kind of validations on-the-fly 
> based on the backend in use.
> *List of LogicalPlanValidators extracted from the current parser logic:-*
> - File existence validator
> - Alias existence validator
> *Logics possibly be added in the very near future:-*
> - Streaming script test execution
> - Type checking + casting promotion + type inference
> - Untyped plan test execution
> - Logic to prevent reading and writing from/to the same file
> The common way to implement a LogicalPlanValidator will be based on Visitor 
> pattern. 
> *Cons:-*
>  - By having every validation logic traversing AST from the root node every 
> time, there is a performance hit. However I think this is neglectable due to 
> the fact that Pig is very expressive and normally queries aren't too big (99% 
> of queries contain no more than 1000 AST nodes).
> *Next Step:-*
> LogicalPlanFinalizer which is also a pipeline except that each stage can 
> modify the input AST. This component will generally do a kind of global 
> optimizations.
> *Further ideas:-*
> - Composite visitor can make validations more efficient in some cases but I 
> don't think we need
> - ASTs within the pipeline never change (read-only) so validations can be 
> done in parallel to improve responsiveness. But again I don't think we need 
> this unless we have so many I/O bound logics.
> - The same pipeline concept can also be applied in physical plan 
> validation/optimization.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-143) Proposal for refactoring of parsing logic in Pig

Reply via email to