[jira] Commented: (PIG-158) Rework logical plan

Santhosh Srinivasan (JIRA) Mon, 21 Apr 2008 11:24:48 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590987#action_12590987
 ]


Santhosh Srinivasan commented on PIG-158:
-----------------------------------------

Pi,

Thanks for the comments/questions. My replies are inline with [Santhosh]

1) So seems like every operator will handle its own output schema right? That 
makes it easier for me.

[Santhosh] Yes, each operator handles its own output schema.

2) in LOProject.getSchema() :-

mExp is an ExpressionOperator. From what you said before .getSchema() will 
always return null except if we're dealing with tuple operators. I think this 
will make more sense if mExp becomes RelationOperator (We only do projection 
operations on relational operators, don't we?)

[Santhosh]  The getSchema() method for expression operators will mostly return 
null except when tuple operations are performed. Projection operator is used on 
expression operators and not on relational operators. Projection can occur in 
multiple places, UDFs, aliases to expressions, etc. For example, $1.($0, $1, 
$2) will translate to LOProject(List<1>, LOProject(List<0,1,2>, null)).

In the example Pig Script given below, $0 and $1 are aliased as url and rank 
respectively. The aliases are associated with the schema of the column 
projection. In order to use these aliases in subsequent statements, LOProject 
needs to support schemas.

{code}
a = load 'a';
b = foreach a generate $0 as url, $1 as rank;
c = foreach b generate url; 
{code}

3) LOSplitOutput

shouldn't the output schema be the same as the input schema ? (So, use 
mPlan.getPredecessors()). Typing is done by induction so you're not supposed to 
use type from downstream operator to infer type in upstream operator.

[Santhosh] Thanks, I will fix that

4) In think LOSort should have "List<LogicalPlan> cols" because we should be 
able to sort by more than one field and each field can be nested expressions

[Santhosh] List<ExpressionOperators> cols seems to be a better fit than 
List<LogicalPlan> cols. Given that each field can be nested expressions, the 
expression operator can then be a LOProject which is a nested expression 
operator.

When I make changes for nested plans, I will come back to the discussion of 
including a list of LogicalPlan.


5) LOCross also uses mPlan.getSuccessors()

[Santhosh] Thanks, I will fix that

6) LOSplit should have "List<LogicalPlan> conds" because we should be able to 
split be nested expressions

[Santhosh] A split operator is similar to a filter operator except that there 
are multiple outputs. The condition on which the split occurs is hence an 
expression.

{code}
split a into b by $1 > '3', c by $1 < '1';
{code}

The above statement is equivalent to the following pig statements.

{code}
b = filter a by $1 > '3';
c = filter a by $1 < '1';
{code}

When I make changes for nested plans, I will come back to the discussion of 
including a list of LogicalPlan.

7) All "private Log log = LogFactory.getLog(getClass());" should be made static

[Santhosh] Good point. I spent a few moments thinking about static v/s member 
variable. Static is better.

8) LogicalOperator :-

{code}
* A boolean variable to remember if input has to be flattened Used only in
     * the context of generate
     */
    private boolean mIsFlatten = false;
{code}

So, shouldn't this be in LOGenerate ?

[Santhosh] Flatten is a property of the participating expressions and not an 
attribute of the operator.  In the example below, $1 is flattened but $2 is 
not. If flatten is an attribute of LOGenerate, we will not be able to figure 
out which column to flatten.

E.g.: 

{code}
a = load 'input1';
b = load 'input2
c =  group a by $1, b by $1;
d = foreach b generate flatten($1), $2;
{code}

9) I still think we should have LORelationalOperator superclass that has 
getSchema()/setSchema() and remove those two methods from ExpressionOperator.

[Santhosh] You made this suggestion last week too :) Let me get these things 
out and come back to it in a couple of days.

10) LOForEach should have "List<LogicalPlan> evals" because each output column 
should be calculated used nested expressions.

[Santhosh] LOForEach will have a nested LogicalPlan instead of a list of 
LogicalPlan. The idea is to incorporate the nested plan patch and then have 
LOForEach contain a nested Pig Script.

11) LOCogroup should have "ArrayList<LogicalPlan> mGroupByCols", same reason as 
above.

[Santhosh] I think the reasoning in my response to LOSort having 
List<ExpressionOperator> holds good here. When I make changes for nested plans, 
I will come back to the discussion of including a list of LogicalPlan.


12) What is LORegexp for? "Regex match" or "Regex replace" ? If it's "Regex 
match" so getType() should return DataType.BOOLEAN

[Santhosh] LORegexp is used for Regex Matches. I agree that getType should 
return DataType.BOOLEAN

> Rework logical plan
> -------------------
>
>                 Key: PIG-158
>                 URL: https://issues.apache.org/jira/browse/PIG-158
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: logical_operators.patch, logical_operators_rev_1.patch, 
> logical_operators_rev_2.patch, logical_operators_rev_3.patch, 
> parser_changes.patch, ParserErrors.txt, visitorWalker.patch
>
>
> Rework the logical plan in line with 
> http://wiki.apache.org/pig/PigExecutionModel

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-158) Rework logical plan

Reply via email to