Alan Gates commented on PIG-966:
I must not be clear on what pushing down to a loader does. My interpretation
was that it allows pushing down operations to the point where you don't read
unnecessary data off disk. A classic example of filter pushdown would be
filtering by a partition key (say, dt > sysdate-30, where our data is stored in
one file per day). An example of projection pushdown is when we have a column
store that simply avoids loading some of the columns.
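To make the partition-key case concrete, here is a minimal sketch of a loader-side pruning step. It is not Pig's API: the class PartitionPruner, the method prune, and the assumption that each file is named "yyyy-MM-dd.dat" after its day are all hypothetical, chosen only to illustrate skipping files before any data is read.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from Pig.
class PartitionPruner {
    /**
     * Keeps only the files whose date partition is strictly after the cutoff.
     * Assumes (for illustration) one file per day, named "yyyy-MM-dd.dat".
     */
    static List<String> prune(List<String> files, LocalDate cutoff) {
        List<String> kept = new ArrayList<>();
        for (String file : files) {
            // The partition key is recovered from the file name alone,
            // so rejected files are never opened.
            LocalDate dt = LocalDate.parse(file.substring(0, 10));
            if (dt.isAfter(cutoff)) {
                kept.add(file);
            }
        }
        return kept;
    }
}
```

A filter such as dt > sysdate-30 would translate to a call like PartitionPruner.prune(files, LocalDate.now().minusDays(30)).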
I don't see how a loader can push down a join. That seems to require reading
and changing data. Is the idea that such a join can be performed without an MR
step? That seems like a Pig thing, not a loader thing.
In any case, yes, I think something like this would require a new interface in
the same namespace, since it's a drastically different capability.
Any thoughts on the advisability of simplifying projection pushdown to just work on
an int array? I know it's limiting, but it's going to be a heck of a lot easier
for users to implement.
Limiting the data you need to read off disk is partition pruning, or in the
case of columnar stores, column pruning. But this isn't the only case in which
you might want to push down operators. Consider data that has (name, age,
address) and is partitioned on name. A user might want to query only over
adults (age > 17). This isn't a partition field. But if it's a columnar store
and age is compressed with, say, run-length or offset encoding, the load
function may be able to apply the filter on the compressed data. This can be a
huge win, as we avoid decompressing whole rows that we don't need. To see
another case where we might want to push operators to the loader, consider the
case where a user is loading a set of Zebra files, all of which are sorted on
one key. Pig may want to keep those Zebra files sorted. It will need a way to
tell the loader to merge those files as it loads them rather than concatenate
them and force Pig to re-sort the input.
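As a rough illustration of the compressed-column case above, the sketch below applies an age filter to a run-length-encoded column by testing each run once instead of each row. The names RleFilter, Run, and matchingRowRanges are hypothetical, and the in-memory run list stands in for whatever encoding a real columnar store would use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not taken from Pig or Zebra.
class RleFilter {
    /** One run of a run-length-encoded column: `value` repeated `length` times. */
    static final class Run {
        final int value;
        final int length;

        Run(int value, int length) {
            this.value = value;
            this.length = length;
        }
    }

    /**
     * Returns half-open [start, end) row ranges whose value exceeds the
     * threshold. The predicate is evaluated once per run, so rows in
     * rejected runs never need to be decompressed.
     */
    static List<int[]> matchingRowRanges(List<Run> runs, int threshold) {
        List<int[]> ranges = new ArrayList<>();
        int row = 0; // index of the first row covered by the current run
        for (Run run : runs) {
            if (run.value > threshold) {
                ranges.add(new int[] {row, row + run.length});
            }
            row += run.length;
        }
        return ranges;
    }
}
```

With the returned ranges in hand, a loader would only need to materialize the other columns for rows inside those ranges.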
I understand your concern about making it difficult to pass down just
projection, and you are not the only one to express it. Even there, though,
full projection needs more than a simple int array, so that we can handle
projections into maps, bags, etc. But maybe we need a simpler option for users
who just want to push projection, and then the full-blown option for power
users who want to push selection, etc.
Beginner and advanced interfaces I guess.
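One possible shape for such a split is sketched below. Everything here is an assumption made for illustration: the interface names SimpleProjectionPushdown and NestedProjectionPushdown, their methods, and the toy loader are not part of the proposal or of Pig's API.

```java
import java.util.List;

// Hypothetical "beginner" interface: top-level columns by position only.
interface SimpleProjectionPushdown {
    void setRequiredColumns(int[] columnIndexes);
}

// Hypothetical "advanced" interface: each path names a top-level column
// followed by optional sub-fields (for example a map key), which a plain
// int array cannot express.
interface NestedProjectionPushdown {
    void setRequiredFields(List<String[]> fieldPaths);
}

// A toy loader that only opts in to the simple interface.
class ToyLoader implements SimpleProjectionPushdown {
    private int[] required; // null means "no projection pushed, load everything"

    @Override
    public void setRequiredColumns(int[] columnIndexes) {
        required = columnIndexes.clone();
    }

    /** Returns only the requested columns of a row, in the requested order. */
    Object[] load(Object[] fullRow) {
        if (required == null) {
            return fullRow;
        }
        Object[] projected = new Object[required.length];
        for (int i = 0; i < required.length; i++) {
            projected[i] = fullRow[required[i]];
        }
        return projected;
    }
}
```

The toy loader still reads the full row before projecting; a real columnar loader would use the same index list to skip the unwanted columns on disk.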
> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> Key: PIG-966
> URL: https://issues.apache.org/jira/browse/PIG-966
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces
> significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for
> full details