Pig prunes columns after load/filter/sort/join/split. For example:

A = load '1.txt' as (a0, a1, a2);
B = filter A by a0==1;
-- assume a0 and a2 are no longer used after this point (only a1 is)

becomes:
A = load '1.txt' as (a0, a1, a2);
A1 = foreach A generate a0, a1; -- drop a2
B = filter A1 by a0==1;
B1 = foreach B generate a1; -- drop a0
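
To make the rewrite above concrete, here is a rough Python sketch of a bottom-up pass that works out which columns each statement has to read. It is only an illustration: the (alias, operator, columns) tuples and the required_columns function are made up for this mail and are not Pig's optimizer classes, and only load, filter and foreach are modelled. It assumes only a1 is read after B, as the rewritten script implies.

```python
# Simplified, hypothetical model of a linear Pig script (not Pig's optimizer classes).
# Each statement is (alias, operator, columns): the schema for load, the predicate
# columns for filter, the generated columns for foreach.

def required_columns(plan, final_needed):
    """Walk the plan bottom-up; return the input columns each statement needs."""
    needed = set(final_needed)
    required = {}
    for alias, op, cols in reversed(plan):
        if op == "foreach":
            # A projection fixes its own output, so only its inputs matter upstream.
            needed = set(cols)
        elif op == "filter":
            # A filter passes columns through and also reads its predicate columns.
            needed = needed | set(cols)
        if op == "load":
            # Keep the schema order and drop whatever nothing downstream needs.
            required[alias] = [c for c in cols if c in needed]
        else:
            required[alias] = sorted(needed)
    return required


plan = [
    ("A", "load", ["a0", "a1", "a2"]),
    ("B", "filter", ["a0"]),
]
# Assumption: only a1 is used after B.
result = required_columns(plan, final_needed=["a1"])
print(result)
```

In this sketch the loader for A ends up needing only a0 and a1, which is what the inserted "foreach A generate a0, a1" expresses, and a0 is only kept as far as the filter.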

In your sample, COUNT(A) is a UDF that consumes all fields in A, so we could
not prune anything even if we had a better column pruner. However, if we
change it to COUNT(A.a1), then it would be possible to prune A.a2, provided
we had a better algorithm.
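
The difference between the two cases can be sketched like this. Again this is a hypothetical Python model, not Pig code: loader_columns and its trace_into_bag flag are names invented here, with trace_into_bag=False standing for the current pruner and True for the "better algorithm" that tracks sub-fields of the bag.

```python
def loader_columns(schema, group_keys, bag_refs, trace_into_bag):
    """Columns the loader must keep for `group ... by` followed by a foreach.

    bag_refs lists the bag references made by the foreach: None stands for the
    whole bag (as in COUNT(A)), a column name for one sub-field (as in COUNT(A.a1)).
    """
    uses_whole_bag = any(ref is None for ref in bag_refs)
    if bag_refs and (uses_whole_bag or not trace_into_bag):
        # Either every field is consumed, or the pruner cannot see which ones are.
        return list(schema)
    needed = set(group_keys) | {ref for ref in bag_refs if ref is not None}
    return [col for col in schema if col in needed]


schema = ["a0", "a1", "a2"]
# COUNT(A): nothing can be pruned, with or without sub-field tracking.
count_whole = loader_columns(schema, ["a0"], [None], trace_into_bag=True)
# COUNT(A.a1) with the current pruner: the bag is opaque, so nothing is pruned.
count_a1_flat = loader_columns(schema, ["a0"], ["a1"], trace_into_bag=False)
# COUNT(A.a1) with sub-field tracking: a2 can go.
count_a1_nested = loader_columns(schema, ["a0"], ["a1"], trace_into_bag=True)
```

Here a None entry is just a convenient marker for "the whole bag"; a real implementation would need a proper nested representation of required fields.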

Daniel

On Sun, Dec 4, 2011 at 12:50 PM, Jie Li <[email protected]> wrote:

> Hi Daniel,
>
> Thanks for the example. Does the current pruning happen before each
> statement, or just after LOAD? Because I can only see one-shot pruning for
> each table from the output.
>
> Besides the implementation, is there any semantic issue about the pruning?
> For example,
>
> A = load '1.txt' as (a0, a1, a2);
> B = group A by a0;
> C = foreach B generate COUNT(A);
>
> If we prune A.a1 and A.a2, then A becomes NULL if a0 is NULL. Maybe the
> COUNT operator is a little special.
>
> Jie
>
> On Sun, Dec 4, 2011 at 2:40 PM, Daniel Dai (Commented) (JIRA) <
> [email protected]> wrote:
>
> >
> >    [ https://issues.apache.org/jira/browse/PIG-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162448#comment-13162448 ]
> >
> > Daniel Dai commented on PIG-1324:
> > ---------------------------------
> >
> > Hi, Jie,
> > It's certainly solvable, but we need a new data structure and algorithm.
> > Currently the algorithm works bottom-up, finding the required input
> > columns of each statement. But if an input column is a bag, we don't
> > trace into the bag. Here is an example:
> >
> > A = load '1.txt' as (a0, a1, a2);
> > B = filter A by a0==1;
> > C = foreach B generate a1;
> >
> > Going bottom-up, we first see that C needs B.a1, and B needs A.a0 (plus
> > the A.a1 that C needs), so the loader for A infers that a2 is
> > unnecessary. However, in the group-by sample:
> >
> > A = load '1.txt' as (a0, a1, a2);
> > B = group A by a0;
> > C = foreach B generate group, SUM(A.a1);
> >
> > From C, we figure out that the required fields are B.group and B.A, but
> > we don't go on to record that we only need B.A.a1; the current data
> > structure does not support it.
> >
> > > Logical Optimizer: Nested column pruning
> > > ----------------------------------------
> > >
> > >                 Key: PIG-1324
> > >                 URL: https://issues.apache.org/jira/browse/PIG-1324
> > >             Project: Pig
> > >          Issue Type: Sub-task
> > >          Components: impl
> > >    Affects Versions: 0.7.0
> > >            Reporter: Daniel Dai
> > >            Assignee: Daniel Dai
> > >
> > > Currently, column pruning does not prune sub-fields inside a complex
> > > data type. For example:
> > > A = load '1.txt' as (a0, a1, a2);
> > > B = group A by a0;
> > > C = foreach B generate group, SUM(A.a1);
> > > Currently, since we group A as a bag and some part of the bag is used
> > > in the following statement, none of the fields inside A can be pruned.
> > > We should keep track of sub-fields and figure out that a2 is not
> > > actually needed.
> >
>
