[ https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758037#action_12758037 ]
Dmitriy V. Ryaboy commented on PIG-966:
---------------------------------------

Hi Alan,

Responses to responses:

bq. Perhaps it's best to leave this as strings but look for a scheme at the beginning and interpret it as a URI if it has one (which is what Pig does now).

I understand the motivation more clearly now, thanks for the explanation. Agreed with the quoted approach.

bq. [regarding single schema for partitioned datasets] Agreed, that is what I was trying to say. Perhaps it wasn't clear.

Nope, it was clear, I just have a very verbose way of saying "yes".

Regarding merging the Schemas you said:

bq. No. It serves a different purpose, which is to define the content of data flows inside the logical plan. We should not tie these two together.

I don't really understand the difference, but accept your superior knowledge of the codebase and accept your decision :-).

bq. I'm not inclined to bend my programming style to match that of whoever wrote findbugs.

+9.3 from the Russian judge. Gleefully accepted.

bq. We need partition keys as part of this interface, as Pig will need to be able to pass partition keys to loaders that are capable of doing partition pruning. So we could add getPartitionKeys to the LoadMetadata interface.

That's precisely what I am suggesting -- take it out of Schema, put it in LoadMetadata (or MetadataReader, as I like to call it....).

bq. The problem with key/value set ups like this is it can be hard for people to understand what is already there. So they end up not using what already exists, or worse, re-inventing the wheel. My hope is that by versioning this we can get around the need for this key/value stuff.

Hm, I see your point. I am interested in being able to augment the set of available statistics without requiring changes to the base classes, however. I guess that's where inheritance comes in handy.

Any comments on how to handle missing data? Primitive types still don't work for that.

bq. So what happens tomorrow when some loaders can do merge joins on sorted data? Now we have to have another interface. I want this to be easily extensible.

I must not be clear on what pushing down to a loader does. My interpretation was that it allows pushing down operations to the point where you don't read unnecessary data off disk. A classic example of filter pushdown would be filtering by a partition key (so, dt > sysdate-30, and our data is stored in files one per day). An example of projection pushdown is when we have a column store that simply avoids loading some of the columns.

I don't see how a loader can push down a join. That seems to require reading and changing data. Is the idea that such a join can be performed without an MR step? That seems like a Pig thing, not a loader thing. In any case, yes, I think something like this would require a new interface in the same namespace, since it's a drastically different capability.

Any thoughts on the advisability of simplifying projection pushdown to just work on an int array? I know it's limiting, but it's going to be a heck of a lot easier for users to implement.

bq. I'm assuming that a given StoreFunc is tied to a particular metadata instance, so it would return its implementation of StoreMetadata.

I was assuming that Pig would have a preferred metadata store (such as Owl), and it would attempt to use it unless instructed otherwise. We could even try some cascading thing: if the user specifies a metadata store on the command line, use that; if not, see whether the loader suggests one; if not, use Owl; if Owl doesn't have anything, see if it's a file in a known scheme (hdfs, file, s3n...) and at least get some file-level metadata such as create date and size. StoreMetadata can do the same (except for the hdfs part).

I'll take another look at PIG-967.
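
For illustration only, the cascade described above (command-line store, then the loader's suggestion, then Owl, then file-level metadata for a location in a known scheme) might look roughly like the Java sketch below. Everything named here is an assumption made for the example: `MetadataSourceResolver`, its methods, and the returned string labels are hypothetical and not part of Pig or Owl, and the Owl lookup is reduced to a boolean flag.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration; not part of Pig or Owl.
final class MetadataSourceResolver {
    // Schemes named in the comment for which file-level metadata could be read.
    private static final List<String> KNOWN_SCHEMES = Arrays.asList("hdfs", "file", "s3n");

    private MetadataSourceResolver() {}

    // Returns the lower-cased URI scheme of a location string, or null if it
    // has no scheme or does not parse as a URI (it is then treated as a plain string).
    static String schemeOf(String location) {
        try {
            String scheme = URI.create(location).getScheme();
            return scheme == null ? null : scheme.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Picks a metadata source in the order suggested in the comment.
    // The returned labels ("owl", "file-level:<scheme>", "none") are made up here.
    static String resolve(Optional<String> commandLineStore,
                          Optional<String> loaderSuggestedStore,
                          boolean owlHasEntry,
                          String location) {
        if (commandLineStore.isPresent()) {
            return commandLineStore.get();       // 1. user's explicit choice wins
        }
        if (loaderSuggestedStore.isPresent()) {
            return loaderSuggestedStore.get();   // 2. then whatever the loader suggests
        }
        if (owlHasEntry) {
            return "owl";                        // 3. then the preferred store
        }
        String scheme = schemeOf(location);
        if (scheme != null && KNOWN_SCHEMES.contains(scheme)) {
            return "file-level:" + scheme;       // 4. fall back to create date, size, etc.
        }
        return "none";                           // nothing usable was found
    }
}
```

With this sketch, `MetadataSourceResolver.resolve(Optional.empty(), Optional.empty(), false, "hdfs://namenode/data/logs")` falls through to the file-level case, while a bare path such as `/data/logs` yields no metadata source at all.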

> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>
> Key: PIG-966
> URL: https://issues.apache.org/jira/browse/PIG-966
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
>
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for full details.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.