Alan Gates commented on PIG-966:

Responses to Dmitry's and Ashutosh's comments:

Can you explain why everything has a Load prefix? Seems like this limits the 
interfaces unnecessarily, and is a bit inconsistent semantically (LoadMetadata 
does not represent metadata associated with loading - it loads metadata. 
LoadStatistics does not load statistics; it represents statistics, and is 
loaded using LoadMetadata).
I don't claim to be a naming guru, so I'm open to other naming suggestions.  I 
chose to prefix all of the interfaces with Load or Store to show that they were 
related to Load and Store.  For example, by calling it LoadMetadata I
did intend to show explicitly that this is metadata associated with loading.  I 
agree that naming schemas and statistics something other than Load is good, 
because they aren't used solely for loading.

In regards to the appropriate parameters for setURI - can you explain the 
advantage of this over Strings in more detail? I think the current setLocation 
approach is preferable; it gives users more flexibility. Plus Hadoop Paths are 
constructed from strings, not URIs, so we are forcing a string->uri->string 
conversion on the common case.
The real concern I have here is I want Pig to be able to distinguish when users 
intend to refer to a filename and when they don't. This is important because 
Pig sometimes munges file names.  Consider the following Pig
Latin script:

cd '/user/gates';
A = load './bla';
Z = limit A 10;
cd '/tmp';
dump Z;

By the time Pig evaluates Z for dumping, ./bla will have a different meaning 
than it did when the user typed it.  Pig understands that and transforms the 
load statement to load '/user/gates/bla'.  But it needs to know not
to mess with statements like:

A = load 'http://...';

By explicitly making the location a URI we encourage users and load function 
writers to think this way.  Your argument that Hadoop paths are by default 
strings is persuasive.  Perhaps it's best to leave this as strings but look
for a scheme at the beginning and interpret it as a URI if it has one (which is 
what Pig does now).
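A minimal sketch of that scheme check (the class and method names here are hypothetical, not actual Pig code):

```java
// Hypothetical helper illustrating the "look for a scheme at the
// beginning" approach described above.
import java.util.regex.Pattern;

public class LocationCheck {
    // An RFC 3986 scheme: a letter, then letters/digits/'+'/'-'/'.',
    // terminated by a colon, anchored at the start of the string.
    private static final Pattern SCHEME =
            Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*:");

    // True if the location should be treated as a URI and left
    // untouched; false if Pig may rewrite it as a relative path.
    public static boolean hasScheme(String location) {
        return SCHEME.matcher(location).find();
    }
}
```

With this, `hdfs://nn/user/gates/bla` and `http://...` are left alone, while `./bla` is fair game for rewriting against the current directory.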

prepareToRead: does it need a finishReading() mate?
A good idea.  Same for finishWriting() below.
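An illustrative sketch of the paired setup/teardown hooks; the names finishReading() and finishWriting() come from the exchange above, but the exact signatures are assumptions:

```java
// Illustrative lifecycle pairing for the proposed read-side hooks.
interface ReadLifecycle {
    void prepareToRead();   // called once before the first record
    void finishReading();   // called once after the last record
}

// Toy reader demonstrating the intended call order.
class CountingReader implements ReadLifecycle {
    int recordsRead;
    boolean open;

    public void prepareToRead() { open = true; }

    // Stand-in for a getNext()-style call; fails if invoked outside
    // the prepareToRead()/finishReading() window.
    public int readOne() {
        if (!open) throw new IllegalStateException("not prepared");
        return ++recordsRead;
    }

    public void finishReading() { open = false; }
}
```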

I would like to see a "standard" method for getting the jobconf (or whatever it 
is called in Hadoop 0.20/0.21), both for LoadFunc and StoreFunc.
I agree, but I didn't take that on here.  We need a standard way to move 
configuration information (Hadoop and Pig) into Load, Store, and Eval Funcs.  
But I viewed that as a separate issue that should be solved for all UDFs.

We think that the schema should be uniform for everything a single instance of 
a loader is responsible for loading (and the loader can fill in null or 
defaults where appropriate if some resources are missing fields).
Agreed, that is what I was trying to say.  Perhaps it wasn't clear.

Should org.apache.pig.impl.logicalLayer.schema.Schema be changed to use this as 
an internal representation?
No.  It serves a different purpose, which is to define the content of data 
flows inside the logical plan.  We should not tie these two together.

PartitionKeys aren't really part of schema; they are a storage/distribution 
property. This should go into the Metadata and refer to the schema.
We need partition keys as part of this interface, as Pig will need to be able 
to pass partition keys to loaders that are capable of doing partition pruning.  
So we could add getPartitionKeys to the LoadMetadata interface.
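A sketch of what that addition might look like; the signature is an assumption, not the committed design:

```java
// Hypothetical sketch of adding partition-key access to the
// metadata interface, as suggested above.
interface LoadMetadata {
    // Keys the underlying storage is partitioned on, so Pig can
    // push filters on these columns down to the loader.
    String[] getPartitionKeys(String location);
}
```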

Why the public fields? Not that I am a huge fan of getters and setters but I 
sense findbugs warnings heading our way.
LoadSchema and LoadStatistics as proposed are structs.  I don't see any reason 
to pretend otherwise.  And I'm not inclined to bend my programming style to 
match that of whoever wrote findbugs.

I had envisioned statistics as more of a key-value thing, with some keys 
predefined in a separate class. So we would have:


and to get the stats we would call


This allows us to be far more flexible in regards to the things marked as 
"//probably more here."
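The code samples did not survive in this comment; a hypothetical reconstruction of the key-value idea, with invented key names and types:

```java
// Hypothetical reconstruction of the key-value statistics proposal;
// the key names and the long-valued map are illustrative only.
import java.util.HashMap;
import java.util.Map;

class StatsKeys {
    static final String NUM_RECORDS = "pig.stats.numRecords";
    static final String SIZE_IN_BYTES = "pig.stats.sizeInBytes";
}

class KeyValueStats {
    private final Map<String, Long> stats = new HashMap<String, Long>();

    void put(String key, long value) { stats.put(key, value); }

    // Returns -1 when the loader did not report this statistic.
    long get(String key) {
        Long v = stats.get(key);
        return v == null ? -1L : v;
    }
}
```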

The problem with key/value setups like this is that it can be hard for people to 
understand what is already there.  So they end up not using what already 
exists, or worse, re-inventing the wheel.  My hope is that by
versioning this we can get around the need for this key/value stuff.

As alluded to above, I am not sure this is a good interface. The idea is that 
we allow users to define which operations can be pushed down to them; but the 
concept of a push down is really a Pig concept, not a Load concept. I think 
breaking this out into two interfaces would be more advisable.
So what happens tomorrow when some loaders can do merge joins on sorted data?  
Now we have to have another interface.  I want this to be easily extensible.

Where does one specify what MetadataWriter to use? Is it inside the StoreFunc? 
In that case StoreFunc needs a method to return its implementation of 
MetadataWriter. Is it global? Then we need to specify how it gets set. Same 
applies to MetadataReader and LoadFunc.
I'm assuming that a given StoreFunc is tied to a particular metadata instance, 
so it would return its implementation of StoreMetadata.  This, and the related 
proposal for a metadata interface (see PIG-967), seek to insulate Pig from the 
metadata system.  But I am not assuming that the
loader and store functions themselves will be insulated.  Those, I'm asserting, 
will be metadata system specific.  I don't see how we can avoid it, as they'll 
need to do schema and statistics translations, possibly data type
translations, etc.
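Under the assumption stated above, the shape could be roughly this (getStoreMetadata() and the parameter lists are illustrative names, not the proposed API):

```java
// Sketch of tying each StoreFunc to a metadata-system-specific
// writer; names and signatures here are assumptions.
interface StoreMetadata {
    void storeSchema(String location);      // placeholder parameters
    void storeStatistics(String location);
}

interface StoreFunc {
    // Each StoreFunc returns the StoreMetadata for the metadata
    // system it is tied to, or null if it records no metadata.
    StoreMetadata getStoreMetadata();
}
```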

For thoughts on having a default metadata repository, see PIG-967 and the 
associated wiki page, which discusses that.

I think the types table can be extended to support ArrayWritable and 
MapWritable, as long as array members and key/value types are among the 
supported types.
Probably; I'll take a look.

As far as needing to do something special for loaders like BinStorage and 
JSONLoader - can't they get an underlying inputformat on the front end the same 
way the side files are proposed to be handled (new instance of IF, getSplits, 
RecordReader, read)?

I can't claim to have a detailed understanding of the underlying issues, but it 
seems to me that those things are just interfaces and can be divorced from HDFS 
by creating implementations that deal with the local filesystem directly. Or is 
the idea to be able to run completely without the Hadoop libraries?
I'm not proposing that Pig is able to run completely without Hadoop libraries.  
And I'm guessing that we can use the HDFS implementations on the local file 
system.  But I don't know it for certain.  That's a loose end we need to
tie up before we declare this to be the plan.

> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> ---------------------------------------------------------------
>                 Key: PIG-966
>                 URL: https://issues.apache.org/jira/browse/PIG-966
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
> significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
> full details

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
