Pig prunes columns after load/filter/sort/join/split. For example:

A = load '1.txt' as (a0, a1, a2);
B = filter A by a0==1;
-- assume a0 and a2 are no longer used after this point (only a1 is)
becomes:

A = load '1.txt' as (a0, a1, a2);
A1 = foreach A generate a0, a1; -- drop a2
B = filter A1 by a0==1;
B1 = foreach B generate a1; -- drop a0

In your sample, COUNT(A) is a UDF that consumes all fields in A, so we cannot prune anything even with a better column pruner. However, if we change it to COUNT(A.a1), then it becomes possible to prune A.a2, provided we have a better algorithm.

Daniel

On Sun, Dec 4, 2011 at 12:50 PM, Jie Li <[email protected]> wrote:

> Hi Daniel,
>
> Thanks for the example. Does the current pruning happen before each
> statement, or just after LOAD? I ask because I can only see one-shot
> pruning for each table in the output.
>
> Besides the implementation, is there any semantic issue with the pruning?
> For example,
>
> A = load '1.txt' as (a0, a1, a2);
> B = group A by a0;
> C = foreach B generate COUNT(A);
>
> If we prune A.a1 and A.a2, then A becomes NULL if a0 is NULL. Maybe the
> COUNT operator is a little special.
>
> Jie
>
> On Sun, Dec 4, 2011 at 2:40 PM, Daniel Dai (Commented) (JIRA) <
> [email protected]> wrote:
>
> >
> > [
> > https://issues.apache.org/jira/browse/PIG-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162448#comment-13162448
> > ]
> >
> > Daniel Dai commented on PIG-1324:
> > ---------------------------------
> >
> > Hi, Jie,
> > It's certainly solvable, but we need some new data structure and
> > algorithm.
> > Currently the algorithm works from the bottom up, finding the required
> > input columns of each statement. But if the input column is a bag, we
> > don't trace into the bag. Here is an example:
> >
> > A = load '1.txt' as (a0, a1, a2);
> > B = filter A by a0==1;
> > C = foreach B generate a1;
> >
> > From the bottom up, we first see that C needs B.a1, and B needs A.a0
> > (plus the A.a1 that C needs), so the loader in A infers a2 is unnecessary.
> > However, in the group by sample:
> >
> > A = load '1.txt' as (a0, a1, a2);
> > B = group A by a0;
> > C = foreach B generate group, SUM(A.a1);
> >
> > From C, we figure out the required fields B.group and B.A, but we don't
> > further mark that we only need B.A.a1; the current data structure does
> > not support it.
> >
> > > Logical Optimizer: Nested column pruning
> > > ----------------------------------------
> > >
> > > Key: PIG-1324
> > > URL: https://issues.apache.org/jira/browse/PIG-1324
> > > Project: Pig
> > > Issue Type: Sub-task
> > > Components: impl
> > > Affects Versions: 0.7.0
> > > Reporter: Daniel Dai
> > > Assignee: Daniel Dai
> > >
> > > Currently, column pruning does not prune sub-fields inside a complex
> > > data-type. For example:
> > > A = load '1.txt' as (a0, a1, a2);
> > > B = group A by a0;
> > > C = foreach B generate group, SUM(A.a1);
> > > Currently, since we group A as a bag and some part of the bag is used
> > > in the following statement, none of the fields inside A can be pruned.
> > > We should keep track of sub-fields and figure out that a2 is not
> > > actually needed.
> >
> > --
> > This message is automatically generated by JIRA.
> > If you think it was sent incorrectly, please contact your JIRA
> > administrators:
> > https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> > For more information on JIRA, see: http://www.atlassian.com/software/jira
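
The bottom-up analysis discussed in this thread can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Pig's optimizer code: the function `required_load_columns`, the `(kind, arg)` statement tuples, and the `trace_into_bags` flag are all hypothetical names invented here. With the flag off, a bag is treated as opaque (the current behavior described above); with it on, a reference such as `A.a1` keeps only that sub-field (the improvement PIG-1324 asks for).

```python
def required_load_columns(schema, statements, trace_into_bags):
    """Return the load columns still needed, walking the plan bottom-up.

    Hypothetical model, not Pig's actual classes.
    schema: column names produced by the load.
    statements: (kind, arg) pairs listed top-down, after the load.
    trace_into_bags: if False, any use of a bag keeps all of its fields.
    """
    required = None  # None: nothing downstream has narrowed the columns yet
    for kind, arg in reversed(statements):
        if kind == "foreach":
            # arg: input references read by the generate clause
            # (conservatively keep every reference it reads)
            required = set(arg)
        elif kind == "filter":
            # arg: columns read by the predicate; other columns pass through
            passed = set(schema) if required is None else required
            required = passed | set(arg)
        elif kind == "group":
            # arg: (bag alias, key columns); output is (group, <alias> bag)
            alias, keys = arg
            refs = {alias} if required is None else required
            needed = set(keys)  # keys are always read to form the groups
            for ref in refs:
                nested = ref.startswith(alias + ".")
                if ref == alias or (nested and not trace_into_bags):
                    needed |= set(schema)  # bag treated as a whole
                elif nested:
                    needed.add(ref.split(".", 1)[1])  # only this sub-field
            required = needed
        else:
            raise ValueError(f"unsupported statement kind: {kind}")
    if required is None:
        return list(schema)
    return [c for c in schema if c in required]


SCHEMA = ["a0", "a1", "a2"]

# B = filter A by a0==1; C = foreach B generate a1;
filter_plan = [("filter", ["a0"]), ("foreach", ["a1"])]

# B = group A by a0; C = foreach B generate group, SUM(A.a1);
sum_plan = [("group", ("A", ["a0"])), ("foreach", ["group", "A.a1"])]

# B = group A by a0; C = foreach B generate COUNT(A);
count_plan = [("group", ("A", ["a0"])), ("foreach", ["A"])]
```

In this model the filter plan drops a2 either way, the SUM(A.a1) plan drops a2 only when `trace_into_bags` is true, and the COUNT(A) plan never drops anything because the whole bag is referenced. The sketch does not address the NULL-handling question raised about COUNT.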
