[
https://issues.apache.org/jira/browse/PIG-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593776#action_12593776
]
Pi Song commented on PIG-158:
-----------------------------
As I started looking at what exactly the nested data processing model is, I
realized that the notion of higher-order functions is a more general form
of inner plans.
Here are the definitions of all the operators we have, excluding expressions.
We should be able to structure our LOs (logical operators) based on them. From
what I can see here, we're getting very close now.
NOTE: This abstraction doesn't capture sequences of operations within
operators. It only focuses on input and output.
{noformat}
FILTER   : Bag x (f: Tuple -> Boolean) -> Bag
COGroup  : [ Bag, f: Tuple -> Tuple ]^n -> Bag
FOREACH  : Bag x (f: Tuple -> Tuple) -> Bag
GENERATE : Tuple x (list of flatten flags) -> Tuple
           Note: If we just forget the flatten flags of GENERATE, it can be
           reduced to "GENERATE : Tuple -> Tuple"
PROJECT  : Tuple x (list of indices) -> Tuple
CROSS    : Bag x Bag -> Bag
JOIN     : This can be constructed by COGroup
ORDER    : Bag x (f: Tuple x Tuple -> CompareResult) -> Bag
DISTINCT : Bag x (f: Tuple x Tuple -> CompareResult) -> Bag
UNION    : Bag x Bag -> Bag
SPLIT    : [ Bag, f: Tuple -> Boolean ]^n -> Bag^n
{noformat}
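To make the signatures above concrete, here is a minimal sketch of a few of them as plain higher-order functions. It assumes a Bag is a Python list and a Tuple is a Python tuple; the function names (filter_op, foreach_op, cross_op, cogroup_op, join_op) are illustrative only and are not Pig classes. It also shows one way JOIN could be built on top of COGroup, as an inner join over the per-key bags.

```python
from itertools import product

# Illustrative model only: Bag = list of tuples, Tuple = Python tuple.

def filter_op(bag, pred):
    """FILTER : Bag x (f: Tuple -> Boolean) -> Bag"""
    return [t for t in bag if pred(t)]

def foreach_op(bag, f):
    """FOREACH : Bag x (f: Tuple -> Tuple) -> Bag"""
    return [f(t) for t in bag]

def cross_op(left, right):
    """CROSS : Bag x Bag -> Bag"""
    return [l + r for l, r in product(left, right)]

def cogroup_op(*inputs):
    """COGroup : [ Bag, f: Tuple -> Tuple ]^n -> Bag

    Each input is a (bag, key_fn) pair. Every output tuple has the form
    (key, bag_1, ..., bag_n), where bag_i holds the tuples of input i
    whose key_fn result equals key.
    """
    groups = {}
    for i, (bag, key_fn) in enumerate(inputs):
        for t in bag:
            slots = groups.setdefault(key_fn(t), [[] for _ in inputs])
            slots[i].append(t)
    return [(key, *slots) for key, slots in groups.items()]

def join_op(*inputs):
    """JOIN expressed via COGroup: cross the per-key bags.

    A key with an empty bag on any input yields no output (inner join).
    """
    out = []
    for _key, *bags in cogroup_op(*inputs):
        out.extend(sum(combo, ()) for combo in product(*bags))
    return out

users = [(1, "ann"), (2, "bob")]
orders = [(1, "pen"), (1, "ink"), (3, "cup")]
by_first_field = lambda t: (t[0],)  # a Tuple -> Tuple key function

joined = join_op((users, by_first_field), (orders, by_first_field))
```

In this sketch the key function returns a tuple, matching the "f: Tuple -> Tuple" in the COGroup signature, so grouping on several fields needs no special case.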
1) Since FILTER and SPLIT don't require FOREACH, COGroup shouldn't need FOREACH
either, to keep them consistent. (I was wrong in the last post.)
2) Generally, a nested "f: Tuple -> Tuple" means n inner plans are required
(where n is the number of fields in the output tuple). Alternatively,
"f: Tuple -> Tuple" can be substituted by GENERATE, though the implementation
of GENERATE itself will obviously still need n inner plans.
- FOREACH requires n plans, n = number of fields in the output tuple. We
normally attach GENERATE to FOREACH, *BUT* to make everything consistent in
this case we would need inner plans of expressions nested in GENERATE, which is
in turn nested in FOREACH. I'm starting to think this will be difficult to
implement. Possibly we don't need GENERATE and can just keep n plans in FOREACH.
- COGroup requires m plans, m = sum of the number of grouped fields over all
input bags. Alternatively, if we put a GENERATE before every input bag, we will
need only n GENERATEs, n = number of input bags. Introducing GENERATE here
again seems to make it difficult to implement. I suggest using n*m plans instead.
- To conclude, to keep the implementation simple at this stage and still keep
things consistent, I think we should drop the GENERATE operator. Intrinsically
GENERATE is a custom Tuple -> Tuple function, so we will have to add it back
when we start working on Tuple -> Tuple operators in the future. What is your
opinion?
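As a rough illustration of point 2, the sketch below builds an "f: Tuple -> Tuple" directly from n inner plans, one per output field, with no separate GENERATE operator. It assumes each inner plan can be modelled as a function from the input tuple to a single field value; make_tuple_fn, foreach_with_plans and count_cogroup_plans are hypothetical names, not part of the patch.

```python
def make_tuple_fn(plans):
    """Combine n inner plans into one f: Tuple -> Tuple.

    Each plan maps the input tuple to the value of one output field,
    so the output tuple has exactly len(plans) fields.
    """
    def f(t):
        return tuple(plan(t) for plan in plans)
    return f

def foreach_with_plans(bag, plans):
    """FOREACH that keeps its n inner plans itself (no GENERATE)."""
    f = make_tuple_fn(plans)
    return [f(t) for t in bag]

def count_cogroup_plans(key_plans_per_input):
    """Total inner plans for COGroup: the sum, over all input bags,
    of the number of grouped fields of that bag."""
    return sum(len(plans) for plans in key_plans_per_input)

# Two output fields, so two inner plans.
plans = [lambda t: t[0], lambda t: t[1] * 2]
result = foreach_with_plans([(1, 2), (3, 4)], plans)

# Two input bags, each grouped on two fields.
first = lambda t: t[0]
second = lambda t: t[1]
total_plans = count_cogroup_plans([[first, second], [first, second]])
```

The same make_tuple_fn could serve as the key function of each COGroup input, which is what keeps FOREACH and COGroup consistent in this model.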
> Rework logical plan
> -------------------
>
> Key: PIG-158
> URL: https://issues.apache.org/jira/browse/PIG-158
> Project: Pig
> Issue Type: Sub-task
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: logical_operators.patch, logical_operators_rev_1.patch,
> logical_operators_rev_2.patch, logical_operators_rev_3.patch,
> parser_changes.patch, parser_changes_v1.patch, parser_changes_v2.patch,
> parser_changes_v3.patch, ParserErrors.txt, visitorWalker.patch
>
>
> Rework the logical plan in line with
> http://wiki.apache.org/pig/PigExecutionModel