Alan Gates commented on PIG-966:
Responses to Dmitry's and Ashutosh's comments:
Can you explain why everything has a Load prefix? Seems like this limits the
interfaces unnecessarily, and is a bit inconsistent semantically (LoadMetadata
does not represent metadata associated with loading - it loads metadata.
LoadStatistics does not load statistics; it represents statistics, and is
loaded using LoadMetadata).
I don't claim to be a naming guru, so I'm open to other naming suggestions. I
chose to prefix all of the interfaces with Load or Store to show that they were
related to Load and Store. For example, by calling it LoadMetadata I
did intend to show explicitly that this is metadata associated with loading. I
agree that naming schemas and statistics something other than Load is good,
because they aren't used solely for loading.
In regards to the appropriate parameters for setURI - can you explain the
advantage of this over Strings in more detail? I think the current setLocation
approach is preferable; it gives users more flexibility. Plus Hadoop Paths are
constructed from strings, not URIs, so we are forcing a string->uri->string
conversion on the common case.
The real concern I have here is I want Pig to be able to distinguish when users
intend to refer to a filename and when they don't. This is important because
Pig sometimes munges file names. Consider the following Pig Latin:
A = load './bla';
Z = limit A 10;
By the time Pig evaluates Z for dumping, ./bla will have a different meaning
than it did when the user typed it. Pig understands that and transforms the
load statement to load '/user/gates/bla'. But it needs to know not
to mess with statements like:
A = load 'http://...';
By explicitly making the location a URI we encourage users and load function
writers to think this way. Your argument that Hadoop paths are by default
strings is persuasive. Perhaps its best to leave this as strings but look
for a scheme at the beginning and interpret it as a URI if it has one (which is
what Pig does now).
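That scheme-sniffing compromise could be sketched as follows. This is a hedged illustration, not Pig's actual code; the class and method names are assumptions:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hedged sketch: keep locations as plain strings, but treat a location
// that begins with a scheme (http://, hdfs://, ...) as a URI that Pig
// must not rewrite, while bare paths get resolved against the user's
// working directory. All names here are illustrative.
class LocationResolver {

    /** Returns true if the location string carries a scheme, e.g. "hdfs://" or "http://". */
    public static boolean hasScheme(String location) {
        try {
            return new URI(location).getScheme() != null;
        } catch (URISyntaxException e) {
            return false;  // not parseable as a URI; treat as a plain path
        }
    }

    /** Leaves scheme-qualified locations untouched; resolves bare paths against baseDir. */
    public static String resolve(String location, String baseDir) {
        if (hasScheme(location)) {
            return location;              // e.g. http://... : do not munge
        }
        if (location.startsWith("/")) {
            return location;              // already absolute
        }
        String relative = location.startsWith("./") ? location.substring(2) : location;
        return baseDir + "/" + relative;  // e.g. ./bla -> /user/gates/bla
    }
}
```

This keeps the common Hadoop case on the string path while still letting Pig know which locations it may rewrite.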
prepareToRead: does it need a finishReading() mate?
A good idea. Same for finishWriting() below.
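The symmetric lifecycle pairs being agreed to here might look like the following. Method names and signatures are assumptions for illustration, not the final API:

```java
import java.io.IOException;

// Hedged sketch of paired lifecycle hooks on the proposed interfaces.
interface LoadLifecycle {
    void prepareToRead() throws IOException;   // called once before the first record is read
    void finishReading() throws IOException;   // called once after the last record is read
}

interface StoreLifecycle {
    void prepareToWrite() throws IOException;  // called once before the first record is written
    void finishWriting() throws IOException;   // called once after the last record is written
}

// A trivial implementation that just tracks that each hook fired once.
class CountingLoader implements LoadLifecycle {
    int opens = 0;
    int closes = 0;
    public void prepareToRead() { opens++; }
    public void finishReading() { closes++; }
}
```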
I would like to see a "standard" method for getting the jobconf (or whatever it
is called in 20/21), both for LoadFunc and StoreFunc.
I agree, but I didn't take that on here. We need a standard way to move
configuration information (Hadoop and Pig) into Load, Store, and Eval Funcs.
But I viewed that as a separate issue that should be solved for all UDFs.
We think that the schema should be uniform for everything a single instance of
a loader is responsible for loading (and the loader can fill in null or
defaults where appropriate if some resources are missing fields).
Agreed, that is what I was trying to say. Perhaps it wasn't clear.
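The null-filling behavior described above could be sketched like this; the class name and the tuple-as-list representation are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: a loader pads short records with nulls so that every
// tuple it emits conforms to one uniform schema for that loader instance.
class SchemaPadding {

    /** Pads a record with trailing nulls until it has exactly fieldCount entries. */
    public static List<Object> padToSchema(List<Object> record, int fieldCount) {
        List<Object> out = new ArrayList<>(record);
        while (out.size() < fieldCount) {
            out.add(null);   // missing field -> null, per the uniform-schema rule
        }
        return out;
    }
}
```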
Should org.apache.pig.impl.logicalLayer.schema.Schema be changed to use this as
an internal representation?
No. It serves a different purpose, which is to define the content of data
flows inside the logical plan. We should not tie these two together.
PartitionKeys aren't really part of schema; they are a storage/distribution
property. This should go into the Metadata and refer to the schema.
We need partition keys as part of this interface, as Pig will need to be able
to pass partition keys to loaders that are capable of doing partition pruning.
So we could add getPartitionKeys to the LoadMetadata interface.
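Such an addition might look like the following. The interface and method names are assumptions sketched for illustration, not the committed API:

```java
// Hedged sketch: partition keys exposed through the metadata-reading
// interface rather than embedded in the schema itself.
interface LoadMetadataSketch {
    /** Field names, in order, that the storage partitions on; empty if unpartitioned. */
    String[] getPartitionKeys(String location);
}

// Example implementation for a (hypothetical) store laid out by date, then region.
class DatePartitionedMetadata implements LoadMetadataSketch {
    public String[] getPartitionKeys(String location) {
        return new String[] { "date", "region" };
    }
}
```

Pig could then hand these keys to a loader capable of partition pruning while the schema stays purely about field names and types.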
Why the public fields? Not that I am a huge fan of getters and setters but I
sense findbugs warnings heading our way.
LoadSchema and LoadStatistics as proposed are structs. I don't see any reason
to pretend otherwise. And I'm not inclined to bend my programming style to
match that of whoever wrote findbugs.
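The struct style being defended here might look like the following; field names are illustrative, not the actual LoadSchema/LoadStatistics definitions:

```java
// Hedged sketch of the struct style under discussion: plain public
// fields, no getters or setters, with a version field on the struct
// instead of an open-ended key/value bag.
class ResourceFieldSchemaSketch {
    public String name;
    public byte type;
}

class ResourceSchemaSketch {
    public int version;                          // bumped when fields are added
    public ResourceFieldSchemaSketch[] fields;   // one entry per field, in order
}
```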
I had envisioned statistics as more of a key-value thing, with some keys
predefined in a separate class, so that to get the stats we would call a
generic lookup by key. This allows us to be far more flexible in regards to
the things marked as "//probably more here."
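The code snippets from this quoted proposal did not survive; a hedged reconstruction of the key/value idea, with every class, key, and method name an assumption, might be:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged reconstruction of the quoted key/value statistics proposal.
// Keys are predefined constants in a separate class; new statistics can
// be added without changing the interface.
class StatsKeys {
    public static final String NUM_RECORDS = "num.records";
    public static final String SIZE_IN_BYTES = "size.in.bytes";
    // further keys could be added here over time
}

class KeyValueStatistics {
    private final Map<String, Long> stats = new HashMap<>();

    public void setStatistic(String key, long value) {
        stats.put(key, value);
    }

    /** Returns the value for a key, or -1 if the loader did not supply it. */
    public long getStatistic(String key) {
        Long v = stats.get(key);
        return v == null ? -1L : v;
    }
}
```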
The problem with key/value setups like this is that it can be hard for people to
understand what is already there. So they end up not using what already
exists, or worse, re-inventing the wheel. My hope is that by
versioning this we can get around the need for this key/value stuff.
As alluded to above, I am not sure this is a good interface. The idea is that
we allow users to define which operations can be pushed down to them; but the
concept of a push down is really a Pig concept, not a Load concept. I think
breaking this out into two interfaces would be more advisable.
So what happens tomorrow when some loaders can do merge joins on sorted data?
Now we have to have another interface. I want this to be easily extensible.
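One hedged way to get that extensibility without one Java interface per capability is a single feature-discovery method, so supporting merge joins on sorted data tomorrow means adding an enum constant rather than an interface. All names here are assumptions:

```java
import java.util.EnumSet;

// Hedged sketch: a loader declares which optimizations it supports from
// one extensible enum; Pig queries the set instead of instanceof-checking
// a growing family of interfaces.
enum LoaderFeature {
    PROJECTION_PUSHDOWN,
    SELECTION_PUSHDOWN,
    SORTED_DATA,          // would later enable merge join on sorted input
}

interface CapabilityAwareLoader {
    EnumSet<LoaderFeature> getFeatures();
}

class SortedColumnLoader implements CapabilityAwareLoader {
    public EnumSet<LoaderFeature> getFeatures() {
        return EnumSet.of(LoaderFeature.PROJECTION_PUSHDOWN, LoaderFeature.SORTED_DATA);
    }
}
```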
Where does one specify what MetadataWriter to use? Is it inside the StoreFunc?
In that case StoreFunc needs a method to return its implementation of
MetadataWriter. Is it global? Then we need to specify how it gets set. Same
applies to MetadataReader and LoadFunc.
I'm assuming that a given StoreFunc is tied to a particular metadata instance,
so it would return its implementation of StoreMetadata. This, and the related
proposal for a metadata interface (see PIG-967) seek to insulate Pig from the
metadata system. But I am not assuming that the
loader and store functions themselves will be insulated. Those, I'm asserting,
will be metadata system specific. I don't see how we can avoid it, as they'll
need to do schema and statistics translations, and possibly data type
translations as well.
For thoughts on having a default metadata repository, see PIG-967 and the
associated wiki page, which discusses that.
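The arrangement described above, where a store function hands back its own metadata implementation, might be sketched as follows. The interface names and the "Owl" backend are hypothetical:

```java
// Hedged sketch: each store function returns its own metadata writer, so
// Pig core never depends on a particular metadata system; only the
// functions themselves are metadata-system specific.
interface StoreMetadataSketch {
    void storeSchema(String schemaDescription, String location);
}

interface StoreFuncSketch {
    /** May return null if this store function records no metadata. */
    StoreMetadataSketch getStoreMetadata();
}

class OwlBackedStoreFunc implements StoreFuncSketch {
    public StoreMetadataSketch getStoreMetadata() {
        return new StoreMetadataSketch() {
            public void storeSchema(String schemaDescription, String location) {
                // would translate schemaDescription into the (hypothetical)
                // Owl metadata service's representation and record it here
            }
        };
    }
}
```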
I think the types table can be extended to support ArrayWritable and
MapWritable, as long as array members and key/value types are themselves among
the supported types.
Probably, I'll take a look.
As far as needing to do something special for loaders like BinStorage and
JSONLoader - can't they get an underlying inputformat on the front end the same
way the side files are proposed to be handled (new instance of IF, getSplits,
etc.)?
I can't claim to have a detailed understanding of the underlying issues, but it
seems to me that those things are just interfaces and can be divorced from HDFS
by creating implementations that deal with the local filesystem directly. Or is
the idea to be able to run completely without the Hadoop libraries?
I'm not proposing that Pig is able to run completely without Hadoop libraries.
And I'm guessing that we can use the HDFS implementations on the local file
system. But I don't know it for certain. That's a loose end we need to
tie up before we declare this to be the plan.
> Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
> Key: PIG-966
> URL: https://issues.apache.org/jira/browse/PIG-966
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
> I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces
> significantly. See http://wiki.apache.org/pig/LoadStoreRedesignProposal for
> full details