Hi Daniel,

Thanks for the example. Does the current pruning happen before each statement, or only after LOAD? I ask because the output shows only one-shot pruning for each table.
Besides the implementation, is there any semantic issue with the pruning? For example:

A = load '1.txt' as (a0, a1, a2);
B = group A by a0;
C = foreach B generate COUNT(A);

If we prune A.a1 and A.a2, then A becomes NULL if a0 is NULL. Maybe the COUNT operator is a little special.

Jie

On Sun, Dec 4, 2011 at 2:40 PM, Daniel Dai (Commented) (JIRA) <[email protected]> wrote:

> [ https://issues.apache.org/jira/browse/PIG-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162448#comment-13162448 ]
>
> Daniel Dai commented on PIG-1324:
> ---------------------------------
>
> Hi, Jie,
> It's certainly solvable, but we need a new data structure and algorithm.
> Currently the algorithm works from the bottom up, finding the required
> input columns of each statement. But if the input column is a bag, we
> don't trace into the bag. Here is an example:
>
> A = load '1.txt' as (a0, a1, a2);
> B = filter A by a0==1;
> C = foreach B generate a1;
>
> From the bottom up, we first find that C needs B.a1, and that B needs
> A.a0 (plus the A.a1 that C needs), so the loader in A infers that a2 is
> unnecessary. However, in the group-by sample:
>
> A = load '1.txt' as (a0, a1, a2);
> B = group A by a0;
> C = foreach B generate group, SUM(A.a1);
>
> From C, we figure out that the required fields are B.group and B.A. We
> do not further mark that we only need B.A.a1; the current data structure
> does not support it.
>
> > Logical Optimizer: Nested column pruning
> > ----------------------------------------
> >
> >                 Key: PIG-1324
> >                 URL: https://issues.apache.org/jira/browse/PIG-1324
> >             Project: Pig
> >          Issue Type: Sub-task
> >          Components: impl
> >    Affects Versions: 0.7.0
> >            Reporter: Daniel Dai
> >            Assignee: Daniel Dai
> >
> > Currently, column pruning does not prune sub-fields inside a complex
> > data-type. For example:
> >
> > A = load '1.txt' as (a0, a1, a2);
> > B = group A by a0;
> > C = foreach B generate group, SUM(A.a1);
> >
> > Currently, since we group A as a bag and some part of the bag is used
> > in the following statement, none of the fields inside A can be pruned.
> > We should keep track of sub-fields and figure out that a2 is not
> > actually needed.
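
To make the bottom-up pass in the quoted comment concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: the plan is linear, each statement is modelled as a plain dict, and the names `required_load_columns`, `FILTER_PLAN`, and `GROUP_PLAN` are hypothetical. None of this is Pig's actual optimizer code. The bag column `A` is deliberately treated as opaque, which is the limitation PIG-1324 describes.

```python
def required_load_columns(plan, final_outputs):
    """Walk a linear plan from the last statement back to the loader.

    plan: list of (alias, needs) in script order, loader excluded.
          needs maps each output column of the statement to the input
          columns it reads. The optional "*" entry lists input columns
          the statement always reads (filter condition, group key).
    Returns the set of loader columns that must be kept.
    """
    required = set(final_outputs)
    for _alias, needs in reversed(plan):
        upstream = set(needs.get("*", ()))
        for col in required:
            upstream |= set(needs[col])
        required = upstream
    return required


# B = filter A by a0==1;  C = foreach B generate a1;
FILTER_PLAN = [
    ("B", {"a0": {"a0"}, "a1": {"a1"}, "a2": {"a2"}, "*": {"a0"}}),
    ("C", {"a1": {"a1"}}),
]

# B = group A by a0;  C = foreach B generate group, SUM(A.a1);
# The bag B.A is opaque here: needing any part of it means needing
# every column of A, because sub-fields are not tracked.
GROUP_PLAN = [
    ("B", {"group": {"a0"}, "A": {"a0", "a1", "a2"}, "*": {"a0"}}),
    ("C", {"group": {"group"}, "sum": {"A"}}),
]

print(sorted(required_load_columns(FILTER_PLAN, ["a1"])))
print(sorted(required_load_columns(GROUP_PLAN, ["group", "sum"])))
```

In the filter plan only a0 and a1 survive, so a2 can be pruned at the loader. In the group plan all three columns survive, because the sketch cannot record that only A.a1 inside the bag is read.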
