[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758089#action_12758089 ]
Alan Gates commented on PIG-966:
--------------------------------

{quote}
I must not be clear on what pushing down to a loader does. My interpretation was that it allows pushing down operations to the point where you don't read unnecessary data off disk. A classic example of filter projection would be filtering by a partition key (so, dt > sysdate-30, and our data is stored in files one per day). An example of projection pushdown is when we have a column store that simply avoids loading some of the columns.

I don't see how a loader can push down a join. That seems to require reading and changing data. Is the idea that such a join can be performed without an MR step? That seems like a Pig thing, not a loader thing. In any case, yes, I think something like this would require a new interface in the same namespace, since it's a drastically different capability.

Any thoughts on advisability of simplifying projection pushdown to just work on an int array? I know it's limiting, but it's going to be a heck of a lot easier for users to implement.
{quote}

Limiting the data you need to read off disk is partition pruning or, in the case of columnar stores, column pruning. But this isn't the only case in which you might want to push down operators. Consider data that has (name, age, address) and is partitioned on name. A user might want to query only over adults (age > 17). Age isn't a partition field. But if it's a columnar store and age is compressed with, say, run-length or offset encoding, the load function may be able to apply the filter to the compressed data. This can be a huge win, as we avoid decompressing whole rows that we don't need.

To see another case where we might want to push operators to the loader, consider a user loading a set of Zebra files, all of which are sorted on one key. Pig may want to keep those Zebra files sorted.
It will need a way to tell the loader to merge those files as it loads them, rather than concatenate them and force Pig to re-sort the input.

I understand your concern about making it difficult to push down just projection, and you are not the only one to express it. Even there, though, full projection needs more than a simple int array, so that we can handle things like map and bag projections. But maybe we need a simpler option for users who just want to push projection, and the full-blown option for power users who want to push selection, etc. Beginner and advanced interfaces, I guess.

> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>
>                 Key: PIG-966
>                 URL: https://issues.apache.org/jira/browse/PIG-966
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces
> significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for
> full details

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
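
For readers following the run-length example in the comment above (the filter age > 17 applied to compressed data), here is a rough sketch of what "filtering without decompressing" could look like. It is not Pig's or Zebra's API: the class RleFilterSketch, its Run and RowRange types, and the matchingRows method are all hypothetical names made up for illustration, and the column layout is an assumed simple list of (value, length) runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch: filter a run-length-encoded column without expanding it. */
public class RleFilterSketch {

    /** One run: `length` consecutive rows that all hold `value`. */
    public static final class Run {
        public final int value;
        public final int length;
        public Run(int value, int length) {
            this.value = value;
            this.length = length;
        }
    }

    /** Half-open range [start, end) of row indexes that passed the filter. */
    public static final class RowRange {
        public final long start;
        public final long end;
        public RowRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Tests the predicate once per run rather than once per row, and returns
     * the row ranges that match. Adjacent matching runs are merged into one range.
     */
    public static List<RowRange> matchingRows(List<Run> column, IntPredicate filter) {
        List<RowRange> out = new ArrayList<>();
        long row = 0;
        for (Run run : column) {
            long end = row + run.length;
            if (filter.test(run.value)) {
                int last = out.size() - 1;
                if (last >= 0 && out.get(last).end == row) {
                    // Extend the previous range instead of starting a new one.
                    out.set(last, new RowRange(out.get(last).start, end));
                } else {
                    out.add(new RowRange(row, end));
                }
            }
            row = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample data: 3 rows of age 12, 4 of 30, 2 of 45, 5 of 16.
        List<Run> age = Arrays.asList(
                new Run(12, 3), new Run(30, 4), new Run(45, 2), new Run(16, 5));
        for (RowRange r : matchingRows(age, a -> a > 17)) {
            System.out.println("rows [" + r.start + ", " + r.end + ")");
        }
    }
}
```

With the sample data, only rows 3 through 8 need to be materialized; the other eight rows are never decompressed. A loader that could accept the predicate would do this kind of work internally, which is the reason a plain int array of column indexes is not enough to express it.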