Alan Gates commented on PIG-966:
Responses to Dmitry's and Ashutosh's comments:
Can you explain why everything has a Load prefix? Seems like this limits the
interfaces unnecessarily, and is a bit inconsistent semantically (LoadMetadata
does not represent metadata associated with loading - it loads metadata.
LoadStatistics does not load statistics; it represents statistics, and is
loaded using LoadMetadata).
I don't claim to be a naming guru, so I'm open to other naming suggestions. I
chose to prefix all of the interfaces with Load or Store to show that they were
related to Load and Store. For example, by calling it LoadMetadata I
did intend to show explicitly that this is metadata associated with loading. I
agree that naming schemas and statistics something other than Load is good,
because they aren't used solely for loading.
In regards to the appropriate parameters for setURI - can you explain the
advantage of this over Strings in more detail? I think the current setLocation
approach is preferable; it gives users more flexibility. Plus Hadoop Paths are
constructed from strings, not URIs, so we are forcing a string->uri->string
conversion on the common case.
The real concern I have here is I want Pig to be able to distinguish when users
intend to refer to a filename and when they don't. This is important because
Pig sometimes munges file names. Consider the following Pig Latin:
A = load './bla';
Z = limit A 10;
By the time Pig evaluates Z for dumping, ./bla will have a different meaning
than it did when the user typed it. Pig understands that and transforms the
load statement to load '/user/gates/bla'. But it needs to know not
to mess with statements like:
A = load 'http://...';
By explicitly making the location a URI we encourage users and load function
writers to think this way. Your argument that Hadoop paths are by default
strings is persuasive. Perhaps its best to leave this as strings but look
for a scheme at the beginning and interpret it as a URI if it has one (which is
what Pig does now).
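That scheme-sniffing compromise could be sketched as follows. This is a hedged illustration, not Pig's actual code; the class and method names are assumptions:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hedged sketch: keep locations as plain strings, but treat a location
// that begins with a scheme (http://, hdfs://, ...) as a URI that Pig
// must not rewrite, while bare paths get resolved against the user's
// working directory. All names here are illustrative.
class LocationResolver {

    /** Returns true if the location string carries a scheme, e.g. "hdfs://" or "http://". */
    public static boolean hasScheme(String location) {
        try {
            return new URI(location).getScheme() != null;
        } catch (URISyntaxException e) {
            return false;  // not parseable as a URI; treat as a plain path
        }
    }

    /** Leaves scheme-qualified locations untouched; resolves bare paths against baseDir. */
    public static String resolve(String location, String baseDir) {
        if (hasScheme(location)) {
            return location;              // e.g. http://... : do not munge
        }
        if (location.startsWith("/")) {
            return location;              // already absolute
        }
        String relative = location.startsWith("./") ? location.substring(2) : location;
        return baseDir + "/" + relative;  // e.g. ./bla -> /user/gates/bla
    }
}
```

This keeps the common Hadoop case on the string path while still letting Pig know which locations it may rewrite.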
prepareToRead: does it need a finishReading() mate?
A good idea. Same for finishWriting() below.
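The symmetric lifecycle pairs being agreed to here might look like the following. Method names and signatures are assumptions for illustration, not the final API:

```java
import java.io.IOException;

// Hedged sketch of paired lifecycle hooks on the proposed interfaces.
interface LoadLifecycle {
    void prepareToRead() throws IOException;   // called once before the first record is read
    void finishReading() throws IOException;   // called once after the last record is read
}

interface StoreLifecycle {
    void prepareToWrite() throws IOException;  // called once before the first record is written
    void finishWriting() throws IOException;   // called once after the last record is written
}

// A trivial implementation that just tracks that each hook fired once.
class CountingLoader implements LoadLifecycle {
    int opens = 0;
    int closes = 0;
    public void prepareToRead() { opens++; }
    public void finishReading() { closes++; }
}
```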
I would like to see a "standard" method for getting the jobconf (or whatever it
is called in 20/21), both for LoadFunc and StoreFunc.
I agree, but I didn't take that on here. We need a standard way to move
configuration information (Hadoop and Pig) into Load, Store, and Eval Funcs.
But I viewed that as a separate issue that should be solved for all UDFs.
We think that the schema should be uniform for everything a single instance of
a loader is responsible for loading (and the loader can fill in null or
defaults where appropriate if some resources are missing fields).
Agreed, that is what I was trying to say. Perhaps it wasn't clear.
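The null-filling behavior described above could be sketched like this; the class name and the tuple-as-list representation are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: a loader pads short records with nulls so that every
// tuple it emits conforms to one uniform schema for that loader instance.
class SchemaPadding {

    /** Pads a record with trailing nulls until it has exactly fieldCount entries. */
    public static List<Object> padToSchema(List<Object> record, int fieldCount) {
        List<Object> out = new ArrayList<>(record);
        while (out.size() < fieldCount) {
            out.add(null);   // missing field -> null, per the uniform-schema rule
        }
        return out;
    }
}
```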
Should org.apache.pig.impl.logicalLayer.schema.Schema be changed to use this as
an internal representation?
No. It serves a different purpose, which is to define the content of data
flows inside the logical plan. We should not tie these two together.
PartitionKeys aren't really part of schema; they are a storage/distribution
property. This should go into the Metadata and refer to the schema.
We need partition keys as part of this interface, as Pig will need to be able
to pass partition keys to loaders that are capable of doing partition pruning.
So we could add getPartitionKeys to the LoadMetadata interface.
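Such an addition might look like the following. The interface and method names are assumptions sketched for illustration, not the committed API:

```java
// Hedged sketch: partition keys exposed through the metadata-reading
// interface rather than embedded in the schema itself.
interface LoadMetadataSketch {
    /** Field names, in order, that the storage partitions on; empty if unpartitioned. */
    String[] getPartitionKeys(String location);
}

// Example implementation for a (hypothetical) store laid out by date, then region.
class DatePartitionedMetadata implements LoadMetadataSketch {
    public String[] getPartitionKeys(String location) {
        return new String[] { "date", "region" };
    }
}
```

Pig could then hand these keys to a loader capable of partition pruning while the schema stays purely about field names and types.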
Why the public fields? Not that I am a huge fan of getters and setters but I
sense findbugs warnings heading our way.
LoadSchema and LoadStatistics as proposed are structs. I don't see any reason
to pretend otherwise. And I'm not inclined to bend my programming style to
match that of whoever wrote findbugs.
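The struct style being defended here might look like the following; field names are illustrative, not the actual LoadSchema/LoadStatistics definitions:

```java
// Hedged sketch of the struct style under discussion: plain public
// fields, no getters or setters, with a version field on the struct
// instead of an open-ended key/value bag.
class ResourceFieldSchemaSketch {
    public String name;
    public byte type;
}

class ResourceSchemaSketch {
    public int version;                          // bumped when fields are added
    public ResourceFieldSchemaSketch[] fields;   // one entry per field, in order
}
```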
I had envisioned statistics as more of a key-value thing, with some keys
predefined in a separate class, so that to get the stats we would call a
generic lookup by key. This allows us to be far more flexible in regards to
the things marked as "//probably more here."
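The code snippets from this quoted proposal did not survive; a hedged reconstruction of the key/value idea, with every class, key, and method name an assumption, might be:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged reconstruction of the quoted key/value statistics proposal.
// Keys are predefined constants in a separate class; new statistics can
// be added without changing the interface.
class StatsKeys {
    public static final String NUM_RECORDS = "num.records";
    public static final String SIZE_IN_BYTES = "size.in.bytes";
    // further keys could be added here over time
}

class KeyValueStatistics {
    private final Map<String, Long> stats = new HashMap<>();

    public void setStatistic(String key, long value) {
        stats.put(key, value);
    }

    /** Returns the value for a key, or -1 if the loader did not supply it. */
    public long getStatistic(String key) {
        Long v = stats.get(key);
        return v == null ? -1L : v;
    }
}
```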
The problem with key/value setups like this is that it can be hard for people to
understand what is already there. So they end up not using what already
exists, or worse, re-inventing the wheel. My hope is that by
versioning this we can get around the need for this key/value stuff.
As alluded to above, I am not sure this is a good interface. The idea is that
we allow users to define which operations can be pushed down to them; but the
concept of a push down is really a Pig concept, not a Load concept. I think
breaking this out into two interfaces would be more advisable.
So what happens tomorrow when some loaders can do merge joins on sorted data?
Now we have to have another interface. I want this to be easily extensible.
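One hedged way to get that extensibility without one Java interface per capability is a single feature-discovery method, so supporting merge joins on sorted data tomorrow means adding an enum constant rather than an interface. All names here are assumptions:

```java
import java.util.EnumSet;

// Hedged sketch: a loader declares which optimizations it supports from
// one extensible enum; Pig queries the set instead of instanceof-checking
// a growing family of interfaces.
enum LoaderFeature {
    PROJECTION_PUSHDOWN,
    SELECTION_PUSHDOWN,
    SORTED_DATA,          // would later enable merge join on sorted input
}

interface CapabilityAwareLoader {
    EnumSet<LoaderFeature> getFeatures();
}

class SortedColumnLoader implements CapabilityAwareLoader {
    public EnumSet<LoaderFeature> getFeatures() {
        return EnumSet.of(LoaderFeature.PROJECTION_PUSHDOWN, LoaderFeature.SORTED_DATA);
    }
}
```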
Where does one specify what MetadataWriter to use? Is it inside the StoreFunc?
In that case StoreFunc needs a method to return its implementation of
MetadataWriter. Is it global? Then we need to specify how it gets set. Same
applies to MetadataReader and LoadFunc.
I'm assuming that a given StoreFunc is tied to a particular metadata instance,
so it would return its implementation of StoreMetadata. This, and the related
proposal for a metadata interface (see PIG-967) seek to insulate Pig from the
metadata system. But I am not assuming that the
loader and store functions themselves will be insulated. Those, I'm asserting,
will be metadata system specific. I don't see how we can avoid it, as they'll
need to do schema and statistics translations, and possibly data type
translations as well.
For thoughts on having a default metadata repository, see PIG-967 and the
associated wiki page, which discusses that.
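The arrangement described above, where a store function hands back its own metadata implementation, might be sketched as follows. The interface names and the "Owl" backend are hypothetical:

```java
// Hedged sketch: each store function returns its own metadata writer, so
// Pig core never depends on a particular metadata system; only the
// functions themselves are metadata-system specific.
interface StoreMetadataSketch {
    void storeSchema(String schemaDescription, String location);
}

interface StoreFuncSketch {
    /** May return null if this store function records no metadata. */
    StoreMetadataSketch getStoreMetadata();
}

class OwlBackedStoreFunc implements StoreFuncSketch {
    public StoreMetadataSketch getStoreMetadata() {
        return new StoreMetadataSketch() {
            public void storeSchema(String schemaDescription, String location) {
                // would translate schemaDescription into the (hypothetical)
                // Owl metadata service's representation and record it here
            }
        };
    }
}
```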
I think the types table can be extended to support ArrayWritable and
MapWritable, as long as array members and key/value types are themselves among
the supported types.
Probably, I'll take a look.
As far as needing to do something special for loaders like BinStorage and
JSONLoader - can't they get an underlying inputformat on the front end the same
way the side files are proposed to be handled (new instance of IF, getSplits,
etc.)?
I can't claim to have a detailed understanding of the underlying issues, but it
seems to me that those things are just interfaces and can be divorced from HDFS
by creating implementations that deal with the local filesystem directly. Or is
the idea to be able to run completely without the Hadoop libraries?
I'm not proposing that Pig is able to run completely without Hadoop libraries.
And I'm guessing that we can use the HDFS implementations on the local file
system. But I don't know it for certain. That's a loose end we need to
tie up before we declare this to be the plan.
> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> Key: PIG-966
> URL: https://issues.apache.org/jira/browse/PIG-966
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces
> significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for
> full details