Alan Gates commented on PIG-966:

I must not be clear on what pushing down to a loader does. My interpretation 
was that it allows pushing down operations to the point where you don't read 
unnecessary data off disk. A classic example of filter pushdown would be 
filtering by a partition key (say, dt > sysdate - 30, where our data is stored 
in one file per day). An example of projection pushdown is a column store that 
simply avoids loading some of the columns.
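
To make the two cases above concrete, here is a minimal Python sketch. It is 
not Pig code: the function names, the dict-based "column store", and the file 
paths are all hypothetical, chosen only to illustrate pruning by partition key 
and skipping unneeded columns.

```python
from datetime import date, timedelta


def prune_partitions(files_by_day, today, max_age_days=30):
    """Keep only files whose partition date satisfies dt > today - max_age_days.

    files_by_day maps a date (the partition key) to a file path; files that
    fail the test are never opened at all.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [path for day, path in sorted(files_by_day.items()) if day > cutoff]


def project_columns(column_store, wanted):
    """Return only the requested columns from a (hypothetical) column store.

    column_store maps a column name to its list of values; columns that are
    not requested are never touched.
    """
    return {name: column_store[name] for name in wanted}


if __name__ == "__main__":
    today = date(2009, 9, 30)
    files = {today - timedelta(days=n): f"logs/{n}.dat" for n in (0, 10, 30, 45)}
    print(prune_partitions(files, today))

    store = {"name": ["al", "bo"], "age": [12, 30], "address": ["x", "y"]}
    print(project_columns(store, ["age"]))
```

In both functions the saving comes from never reading the excluded files or 
columns, which is the interpretation of pushdown described above.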

I don't see how a loader can push down a join. That seems to require reading 
and changing data. Is the idea that such a join can be performed without an MR 
step? That seems like a Pig thing, not a loader thing.

In any case, yes, I think something like this would require a new interface in 
the same namespace, since it's a drastically different capability.

Any thoughts on the advisability of simplifying projection pushdown to just 
work on an int array? I know it's limiting, but it's going to be a heck of a 
lot easier for users to implement.
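
As a rough illustration of the int-array idea, the Python sketch below builds a 
row reader from a list of required column indexes. The name 
make_projecting_reader and the tab-delimited input are assumptions made for the 
example, not part of any Pig interface.

```python
def make_projecting_reader(required_columns, delimiter="\t"):
    """Build a reader that keeps only the fields at the given indexes.

    required_columns is the plain int array: positions of the top-level
    fields the query actually needs.
    """
    def read_row(raw_line):
        fields = raw_line.rstrip("\n").split(delimiter)
        return tuple(fields[i] for i in required_columns)

    return read_row


if __name__ == "__main__":
    reader = make_projecting_reader([0, 2])
    print(reader("alice\t34\t12 Main St\n"))
```

This toy reader still parses the whole line before dropping fields; a columnar 
loader could use the same index list to skip reading the other columns 
entirely. An index list like this can only name top-level fields.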

Limiting the data you need to read off disk is partition pruning or, in the 
case of columnar stores, column pruning.  But this isn't the only case in which 
you might want to push down operators.  Consider data that has (name, age, 
address) and is partitioned on name.  A user might want to query only over 
adults (age > 17).  This isn't a partition field.  But if it's a columnar store 
and age is compressed with, say, run-length or offset encoding, the load 
function may be able to apply the filter on the compressed data.  This can be a 
huge win, as we avoid decompressing whole rows that we don't need.  To see 
another case where we might want to push operators to the loader, consider a 
user loading a set of Zebra files, all of which are sorted on one key.  Pig may 
want to keep those Zebra files sorted.  It will need a way to tell the loader 
to merge those files as it loads them rather than concatenate them and force 
Pig to re-sort the input.
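
Both cases can be sketched in a few lines of Python. This is an illustration 
under assumptions, not how Zebra or any Pig loader works: the (value, count) 
run-length layout and the function names are hypothetical.

```python
import heapq


def filter_run_length(runs, predicate):
    """Apply a filter directly to a run-length-encoded column.

    runs is a list of (value, count) pairs. The predicate is evaluated once
    per run rather than once per row, and the result is the list of row
    indexes that pass, so only those rows need to be decompressed.
    """
    matching = []
    row = 0
    for value, count in runs:
        if predicate(value):
            matching.extend(range(row, row + count))
        row += count
    return matching


def merge_sorted_inputs(inputs, key):
    """Merge inputs that are each already sorted on `key`.

    heapq.merge streams the combined rows in sorted order, so the consumer
    never has to re-sort the concatenated input.
    """
    return heapq.merge(*inputs, key=key)


if __name__ == "__main__":
    age_runs = [(12, 3), (18, 2), (17, 1), (40, 2)]
    print(filter_run_length(age_runs, lambda age: age > 17))

    file_a = [("al", 1), ("cy", 2)]
    file_b = [("bo", 3)]
    print(list(merge_sorted_inputs([file_a, file_b], key=lambda r: r[0])))
```

In the first function the age > 17 test runs four times for eight rows; in the 
second, the sort order of the individual files is preserved in the output.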

I understand your concern about making it difficult to pass down just 
projection, and you are not the only one to express it.  Even for full 
projections, though, we need more than a simple int array, so that we can 
handle things like map, bag, etc. projections.  But maybe we need a simpler 
option for users who just want to push projection, and then the full-blown 
option for power users who want to push selection, etc.  Beginner and advanced 
interfaces, I guess.
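
One way the beginner/advanced split might look, sketched in Python purely for 
illustration. The class and method names are hypothetical and do not come from 
the proposal; the choice to have the advanced class extend the simple one is 
also an assumption.

```python
class SimpleProjectionLoader:
    """Beginner interface: the loader is told which top-level columns to keep."""

    def __init__(self, rows):
        self.rows = rows
        self.columns = None  # None means "return every column"

    def push_projection(self, column_indexes):
        self.columns = list(column_indexes)

    def _project(self, row):
        if self.columns is None:
            return row
        return tuple(row[i] for i in self.columns)

    def load(self):
        for row in self.rows:
            yield self._project(row)


class FullPushdownLoader(SimpleProjectionLoader):
    """Advanced interface: additionally accepts a pushed-down row filter."""

    def __init__(self, rows):
        super().__init__(rows)
        self.row_filter = None

    def push_filter(self, predicate):
        # Returning True tells the caller that the loader will apply the
        # filter itself, so the caller does not need to apply it again.
        self.row_filter = predicate
        return True

    def load(self):
        for row in self.rows:
            # Filter on the full row first, then project.
            if self.row_filter is None or self.row_filter(row):
                yield self._project(row)


if __name__ == "__main__":
    rows = [("al", 12, "x"), ("bo", 30, "y")]
    loader = FullPushdownLoader(rows)
    loader.push_projection([0])
    loader.push_filter(lambda r: r[1] > 17)
    print(list(loader.load()))
```

A loader author who only cares about projection implements the first class and 
never sees the filter machinery; a power user opts into the second.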

> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>                 Key: PIG-966
>                 URL: https://issues.apache.org/jira/browse/PIG-966
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
> significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
> full details

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
