[
https://issues.apache.org/jira/browse/PIG-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162448#comment-13162448
]
Daniel Dai commented on PIG-1324:
---------------------------------
Hi, Jie,
It's certainly solvable, but we need a new data structure and algorithm.
Currently the algorithm works bottom-up, finding the required input columns of
each statement. But if an input column is a bag, we don't trace into the bag.
Here is an example:
A = load '1.txt' as (a0, a1, a2);
B = filter A by a0==1;
C = foreach B generate a1;
From the bottom up, we first see that C needs B.a1, and B needs A.a0 (plus the
A.a1 that C needs), so the loader in A infers that a2 is unnecessary. However,
in the group-by sample:
A = load '1.txt' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate group, SUM(A.a1);
From C, we figure out that the required fields are B.group and B.A, but we do
not further mark that we only need B.A.a1; the current data structure does not
support it.
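
To make the difference concrete, here is a minimal Python sketch of the
bottom-up pass over both samples. It is not Pig's implementation: the function
names and the representation of required fields as tuples of path components
are assumptions made only for illustration. It compares tracking the bag as a
whole (B.A) with tracking a path into the bag (B.A.a1).

```python
# Hypothetical sketch; none of these names come from Pig's code base.
SCHEMA = ["a0", "a1", "a2"]


def required_by_filter(needed_above, condition_cols):
    """A filter passes columns through and also reads its condition columns."""
    return set(needed_above) | set(condition_cols)


def required_by_group(needed_above, key_cols, input_alias, input_schema):
    """Map the fields required from a group's output back to its input.

    needed_above holds paths into the group output, e.g. ("group",),
    ("A",) for the whole bag, or ("A", "a1") for one sub-field of the bag.
    """
    required = set(key_cols)  # grouping keys are always read
    for path in needed_above:
        if path[0] != input_alias:
            continue  # "group" is already covered by the keys
        if len(path) == 1:
            required |= set(input_schema)  # whole bag: keep every column
        else:
            required.add(path[1])  # nested path: keep only that sub-field
    return required


def prunable(schema, required):
    """Columns the loader could drop, in schema order."""
    return [c for c in schema if c not in required]


# Filter sample: C needs a1, and B's condition reads a0.
flat_pruned = prunable(SCHEMA, required_by_filter({"a1"}, {"a0"}))

# Group sample, tracking only top-level fields (B.group, B.A).
coarse_pruned = prunable(
    SCHEMA, required_by_group({("group",), ("A",)}, ["a0"], "A", SCHEMA)
)

# Group sample, tracking the nested path B.A.a1.
nested_pruned = prunable(
    SCHEMA, required_by_group({("group",), ("A", "a1")}, ["a0"], "A", SCHEMA)
)

print(flat_pruned, coarse_pruned, nested_pruned)
```

With whole-bag tracking nothing can be pruned in the group sample, while the
path-based variant lets the loader drop a2, as in the filter sample.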
> Logical Optimizer: Nested column pruning
> ----------------------------------------
>
> Key: PIG-1324
> URL: https://issues.apache.org/jira/browse/PIG-1324
> Project: Pig
> Issue Type: Sub-task
> Components: impl
> Affects Versions: 0.7.0
> Reporter: Daniel Dai
> Assignee: Daniel Dai
>
> Currently, column pruning does not prune sub-fields inside a complex
> data-type. For example:
> A = load '1.txt' as (a0, a1, a2);
> B = group A by a0;
> C = foreach B generate group, SUM(A.a1);
> Because A is grouped into a bag and part of that bag is used in the
> following statement, none of the fields inside A can currently be pruned. We
> should keep track of sub-fields and figure out that a2 is not actually needed.