[
https://issues.apache.org/jira/browse/PIG-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743348#action_12743348
]
Daniel Dai commented on PIG-922:
--------------------------------
Design for push up projection rule:
Presumption:
* Prune columns of loader, save time for record parsing
a = load 'a' as (n1:chararray, n2:chararray, n3:chararray);
b = foreach a generate n1, n2;
=> a = load 'a' (n1:chararray, n2:chararray)
We do not need to parse n3 in our loader.
* Prune columns across map-reduce boundary (between map-reduce jobs or
inter map-reduce jobs), save bandwidth
a = load 'a' as (n1:chararray, n2:chararray, n3:chararray);
b = group a by n1;
c = sort b by n2;
d = foreach c generate n2, n3;
=> a = load 'a' as (n1:chararray, n2:chararray, n3:chararray);
b = group a by n1;
b1 = foreach b generate n2, n3;
c = sort b1 by n2;
d = foreach c generate n2, n3;
* Prune column within map-reduce boundary does not seem to be helpful
store a into 'a';
b = filter a by n1='1';
c = foreach b generate n2;
dump c;
=> store a into 'a';
a1 = foreach a generate n1, n2;
b = filter a1 by n1='1';
c = foreach b generate n2;
dump c;
In this case, an extra foreach step is processed, but we gain no benefit.
Algorithm description:
1. Divide all logical operators into two categories: create map-reduce
boundary and not create map-reduce boundary.
boundary = true: LOCoGroup, LOCross, LOJoin, LODistinct, LOSort
boundary = false: LOFilter, LOForEach, LODefine, LOLoad,
LOStore, LOSplit, LOSplitOutput, LOStream, LOUnion
LOJoin can be boundary or not, depends on the type of join
2. We collect required fields from the bottom, a reverse dependency
order walker algorithm is required to do this
3. We do not actually start from the leaf. We start from the last
LOForEach. Only LOForEach prune columns. If there is no LOForEach in
the script, then we cannot prune anything.
4. From a required output, we need an algorithm to figure required input
<= require $0, $2, $3
b = foreach a generate $0, $2+$3;
<= require $0, $1
5. From the bottom LOForEach, we collect required fields all the way
up, if we move over a boundary operator, save the position because it is
possible to put projection there
......
=> projection here
x = CoGroup .....
......
=> projection here
y = order ......
Put the projection right before boundary to make sure fewer data cross
the boundary
However, we do not make this decision and do the actual prune now, we
will do the actual pruning top down
6. While we traversing up, if we see operator containing more than one
inputs, we trace required fields in all directions; We rely on the
output schema of this operator to figure out which required fields
belong to which input. If we see operator containing more than one
outputs, we collects required fields until all outputs has been traced
7. If we see LOStream, LOStore, we stop
8. If we see LOLoad, we stop and set required fields in LOLoad
9. From LOLoad, we do a top down traverse to decide whether we need to
put projection, and if yes, insert ForEach
10. We only add projection if it is necessary. It is only necessary when
the required fields of that boundary operator is more than output fields
of operator before it.
Filter ...... (output fields: n1, n2, n3)
<= we can prune n3 here
x = CoGroup .... (required fields: n1, n2)
11. It is possible that we create a foreach which can be combined into
previous foreach, however, we do not handle it in PushUpProject rule
ForEach......
<= we will add a ForEach anyway here
x = CoGroup .....
12. Everytime we insert a LOForEach, we need to adjust the projection
map all the way down
13. To fit the PushUpProject into current optimizor framework, we hook
the check rule to LOForEach. Everytime we start from LOForEach and we
never push up over another LOForEach. So we stop at LOForEach and save
required fields upto this point.
> Logical optimizer: push up project
> ----------------------------------
>
> Key: PIG-922
> URL: https://issues.apache.org/jira/browse/PIG-922
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Affects Versions: 0.3.0
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Fix For: 0.4.0
>
>
> This is a continuation work of
> [PIG-697|https://issues.apache.org/jira/browse/PIG-697]. We need to add
> another rule to the logical optimizer: Push up project, ie, prune columns as
> early as possible.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.