[ 
https://issues.apache.org/jira/browse/PIG-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593776#action_12593776
 ] 

Pi Song commented on PIG-158:
-----------------------------

As I started looking at what exactly the nested data processing model is, I 
realized that the notion of a higher-order function is a better, more general 
form of inner plans.

Here are the definitions of all the operators we have, excluding expressions. 
We should be able to structure our LOs based on them. From what I can see here, 
we're getting very close now.
NOTE: This abstraction doesn't capture the sequence of operations within an 
operator. It only focuses on input and output.
{noformat}
FILTER : Bag x (f: Tuple -> Boolean ) -> Bag

COGroup :  [ Bag, f: Tuple -> Tuple ]^n -> Bag

FOREACH  :  Bag x (f: Tuple -> Tuple) -> Bag

GENERATE :  Tuple x (list of flatten flags) -> Tuple
Note: If we just forget the flatten flags of GENERATE, it can be reduced to 
"GENERATE: Tuple -> Tuple"

PROJECT : Tuple x (list of indices) -> Tuple

CROSS : Bag x Bag -> Bag

JOIN : This can be constructed by COGroup

ORDER : Bag x (f: Tuple x Tuple -> CompareResult) -> Bag

DISTINCT: Bag x (f: Tuple x Tuple -> CompareResult) -> Bag

UNION: Bag x Bag -> Bag

SPLIT: [ Bag, f: Tuple -> Boolean ]^n -> Bag^n
{noformat}
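
To make a few of these signatures concrete, here is a minimal Python sketch. It is not Pig's implementation. It assumes a Bag is modelled as a list of tuples and CompareResult as a negative/zero/positive int, and every function name below is hypothetical.

```python
from functools import cmp_to_key

# Assumption: a Bag is a plain list of tuples; a Tuple is a Python tuple.

def filter_op(bag, pred):
    """FILTER : Bag x (f: Tuple -> Boolean) -> Bag"""
    return [t for t in bag if pred(t)]

def foreach_op(bag, f):
    """FOREACH : Bag x (f: Tuple -> Tuple) -> Bag"""
    return [f(t) for t in bag]

def order_op(bag, cmp):
    """ORDER : Bag x (f: Tuple x Tuple -> CompareResult) -> Bag"""
    return sorted(bag, key=cmp_to_key(cmp))

def distinct_op(bag, cmp):
    """DISTINCT : Bag x (f: Tuple x Tuple -> CompareResult) -> Bag"""
    out = []
    for t in order_op(bag, cmp):
        # After sorting, equal tuples are adjacent, so compare with the last kept one.
        if not out or cmp(out[-1], t) != 0:
            out.append(t)
    return out

def split_op(bag_pred_pairs):
    """SPLIT : [ Bag, f: Tuple -> Boolean ]^n -> Bag^n"""
    return [filter_op(bag, pred) for bag, pred in bag_pred_pairs]

def union_op(a, b):
    """UNION : Bag x Bag -> Bag"""
    return a + b

def cross_op(a, b):
    """CROSS : Bag x Bag -> Bag (each output tuple concatenates one tuple from each side)"""
    return [x + y for x in a for y in b]

# Usage: the function argument is the only thing that varies per operator.
people = [("ann", 31), ("bob", 17), ("ann", 31)]
adults = filter_op(people, lambda t: t[1] >= 18)
names = foreach_op(adults, lambda t: (t[0],))
```

In this reading, each operator is an ordinary function over bags, and the "inner plan" is just the function argument it receives.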

1) As FILTER/SPLIT don't require FOREACH, COGroup shouldn't need FOREACH 
either, to keep them consistent. (I was wrong in the last post.)

2) Generally, a nested "f: Tuple -> Tuple" means n inner plans are required 
(where n is the number of fields in the output tuple). Alternatively, 
"f: Tuple -> Tuple" can be substituted by GENERATE, though the implementation 
of GENERATE itself will obviously need n inner plans.
 
- FOREACH requires n plans, n = number of fields in the output tuple. We 
normally attach GENERATE to FOREACH, *BUT* to make everything consistent in 
this case we would need inner plans of expressions nested in GENERATE, which 
is itself nested in FOREACH. I am starting to think this will be difficult to 
implement. Possibly we don't need GENERATE and can just keep n plans in FOREACH. 

- COGroup requires m plans, m = sum of the number of grouped fields over all 
input bags. Alternatively, if we put a GENERATE before every input bag, we 
will need only n GENERATEs, n = number of input bags. Introducing GENERATE 
here again seems to make it difficult to implement. I suggest using n*m plans 
instead.

- To conclude, to keep the implementation simple at this stage and still keep 
things consistent, I think we should drop the GENERATE operator. Intrinsically 
GENERATE is a custom Tuple -> Tuple function, so we will have to add it back 
when we start working on Tuple -> Tuple operators in the future. What is your 
opinion?
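
The plan counting above can be illustrated with a small Python sketch under stated assumptions: an inner plan is modelled as a function from a tuple to a single field value, flatten flags are ignored, and all names (`make_tuple_fn`, `foreach_op`, `cogroup_op`, `join_op`) are hypothetical rather than Pig classes. FOREACH keeps its n plans directly with no GENERATE, COGroup keeps one list of key plans per input bag, and JOIN is constructed from COGroup.

```python
from itertools import product

def make_tuple_fn(plans):
    """Combine n inner plans (each Tuple -> field) into one f: Tuple -> Tuple."""
    return lambda t: tuple(plan(t) for plan in plans)

def foreach_op(bag, plans):
    """FOREACH holding n plans directly, one per output field."""
    f = make_tuple_fn(plans)
    return [f(t) for t in bag]

def cogroup_op(inputs):
    """COGroup : [ Bag, f: Tuple -> Tuple ]^n -> Bag

    inputs is a list of (bag, key_plans) pairs. Each output tuple is
    (key, bag_0, ..., bag_{n-1}), where bag_i holds the tuples of input i
    whose key plans evaluate to that key.
    """
    groups = {}
    for i, (bag, key_plans) in enumerate(inputs):
        key_fn = make_tuple_fn(key_plans)
        for t in bag:
            slots = groups.setdefault(key_fn(t), [[] for _ in inputs])
            slots[i].append(t)
    return [(key, *slots) for key, slots in groups.items()]

def join_op(inputs):
    """JOIN built on COGroup: cross the per-input bags within each group."""
    out = []
    for key, *bags in cogroup_op(inputs):
        # product() yields nothing if any bag is empty, giving inner-join behaviour.
        for combo in product(*bags):
            out.append(sum(combo, ()))  # concatenate the matched tuples
    return out

# Usage: one key plan per input, grouping on the first field.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]
key = [lambda t: t[0]]
joined = join_op([(users, key), (orders, key)])
```

Here the total number of key plans held by COGroup is the sum of `len(key_plans)` over its inputs, which corresponds to m in the bullet above.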

> Rework logical plan
> -------------------
>
>                 Key: PIG-158
>                 URL: https://issues.apache.org/jira/browse/PIG-158
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: logical_operators.patch, logical_operators_rev_1.patch, 
> logical_operators_rev_2.patch, logical_operators_rev_3.patch, 
> parser_changes.patch, parser_changes_v1.patch, parser_changes_v2.patch, 
> parser_changes_v3.patch, ParserErrors.txt, visitorWalker.patch
>
>
> Rework the logical plan in line with 
> http://wiki.apache.org/pig/PigExecutionModel

