[ 
https://issues.apache.org/jira/browse/PIG-143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pi Song updated PIG-143:
------------------------

    Attachment: validation_part1.patch

Here is the implementation of what I proposed. This patch doesn't compile as I 
have to wait for some parts of logical plan rework to finish first.

It includes:-
1. A simple validation framework.
2. Two validations: Input/output file existence checker, and Type checker 
(incomplete)

This framework starts from adding new classes to the plan package as I think 
validation logics are necessary for all the plan types (LogicalPlan (input 
validation), platform-specific PhysicalPlan (mostly for debugging)).

There are some issues with the type checker:-
- The current logical layer implementation has not incorporated inner plan 
concept yet. My type checker assumes some new methods to exist, thus it won't 
compile. 
- In TypeCheckingVisitor, the check logic of BinaryExpressionOperator, 
UnaryExpressionOperator, LOBinCond, LOCast are complete.
- In TypeCheckingVisitor for relational operators, LOSort/LOSplit/LOFilter are 
missing only inner plans in them. These can be an example of how visiting inner 
plan works (a lot of boilerplate code)
- Error messages are still not user-friendly

The procedure of visiting an LO in type checking is:-
1. For non-relational, just check based on tables in TypeDesignSpec. Call 
setType() of the current expression when done.
2. For relational operators:-
- Check operator specific properties
- Check if all the needed data in innerplan roots can be acquired from the 
input operator(s)
- Copy the types acquired from input operators to innerplan roots
- For each inner plan, recursively visit.
- Check properties of each inner plan output (leave(s))
- If needed, compose the output schema based on the operator's logic and inner 
plan outputs
- GetType should be DataType.BAG always.

> Proposal for refactoring of parsing logic in Pig
> ------------------------------------------------
>
>                 Key: PIG-143
>                 URL: https://issues.apache.org/jira/browse/PIG-143
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Pi Song
>            Assignee: Pi Song
>         Attachments: ParserDrawing.png, pigtype_cycle_check.patch, 
> validation_part1.patch
>
>
> h2. Pig Script Parser Refactor Proposal 
> This is my initial proposal on pig script parser refactor work. Please note 
> that I need your opinions for improvements.
> *Problem*
> The basic concept is around the fact that currently we do validation logics 
> in parsing stage (for example, file existence checking) which I think is not 
> clean and difficult to add new validation rules. In the future, we will need 
> to add more and more validation logics to improve usability.
> *My proposal:-*  (see [^ParserDrawing.png])
> - Only keep parsing logic in the parser and leave output of parsing logic 
> being unchecked logical plans. (Therefore the parser only does syntactic 
> checking)
> - Create a new class called LogicalPlanValidationManager which is responsible 
> for validations of the AST from the parser.
> - A new validation logic will be subclassing LogicalPlanValidator 
> - We can chain a set of LogicalPlanValidators inside 
> LogicalPlanValidationManager to do validation work. This allows a new 
> LogicalPlanValidator to be added easily like a plug-in. 
> - This greatly promotes modularity of the validation logics which  is 
> +particularly good when we have a lot of people working on different things+ 
> (eg. streaming may require a special validation logic)
> - We can set the execution order of validators
> - There might be some backend specific validations needed when we implement 
> new execution engines (For example a logical operation that one backend can 
> do but others can't).  We can plug-in this kind of validations on-the-fly 
> based on the backend in use.
> *List of LogicalPlanValidators extracted from the current parser logic:-*
> - File existence validator
> - Alias existence validator
> *Logics possibly be added in the very near future:-*
> - Streaming script test execution
> - Type checking + casting promotion + type inference
> - Untyped plan test execution
> - Logic to prevent reading and writing from/to the same file
> The common way to implement a LogicalPlanValidator will be based on Visitor 
> pattern. 
> *Cons:-*
>  - By having every validation logic traversing AST from the root node every 
> time, there is a performance hit. However I think this is neglectable due to 
> the fact that Pig is very expressive and normally queries aren't too big (99% 
> of queries contain no more than 1000 AST nodes).
> *Next Step:-*
> LogicalPlanFinalizer which is also a pipeline except that each stage can 
> modify the input AST. This component will generally do a kind of global 
> optimizations.
> *Further ideas:-*
> - Composite visitor can make validations more efficient in some cases but I 
> don't think we need
> - ASTs within the pipeline never change (read-only) so validations can be 
> done in parallel to improve responsiveness. But again I don't think we need 
> this unless we have so many I/O bound logics.
> - The same pipeline concept can also be applied in physical plan 
> validation/optimization.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to