pig-user  

Re: Tuple and Datum implementations

Alan Gates
Mon, 22 Sep 2008 17:48:50 -0700



Pete Wyckoff wrote:
So, my use case would be to use the Hive MetaStore and
Serializers/Deserializers to implement:

1. a new Pig storage class based on looking up the metadata from the
metastore
Our general direction in metadata is to support 4 possible types of metadata:

1) none, we assume everything is uninterpreted bytes.  This works.
2) user specified in script.  This works.
3) provided by the load function (for example, if it's reading JSON or XML or whatever and can tell the schema of the data). This is coded but not yet tested.
4) provided by an external source. In this scenario Pig would somehow be made aware of an external metadata source, and when it sees a load it would query that source for info on the file.

I think this is what you want. We hope to start design work on this in the next month or two. Any input you have on this design is certainly welcome.
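To make case 4 a bit more concrete, here is a rough sketch of what "query that source for info on the file" could look like, falling back to case 1 when nothing is known. None of this is Pig code: MetadataSource, SchemaResolver, the schema strings, and the order in which sources are consulted are all assumptions made up for illustration, since the design has not been done yet.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical interface: one place that may know the schema of a load location.
interface MetadataSource {
    // Returns a schema description, or empty if this source knows nothing about the location.
    Optional<String> schemaFor(String location);
}

// Hypothetical resolver: asks each source in turn, then falls back to "no metadata".
final class SchemaResolver {
    // Placeholder for case 1: everything is treated as uninterpreted bytes.
    static final String UNINTERPRETED = "(uninterpreted bytes)";

    private final List<MetadataSource> sources;

    SchemaResolver(List<MetadataSource> sources) {
        this.sources = sources;
    }

    String resolve(String location) {
        for (MetadataSource source : sources) {
            Optional<String> schema = source.schemaFor(location);
            if (schema.isPresent()) {
                return schema.get();
            }
        }
        return UNINTERPRETED;
    }
}

final class MetadataDemo {
    public static void main(String[] args) {
        // Stand-in for case 2: schemas the user declared in the script.
        Map<String, String> declaredInScript = new HashMap<>();
        declaredInScript.put("clicks", "(user:chararray, ts:long)");

        // Stand-in for case 4: an external catalog keyed by table name.
        Map<String, String> externalCatalog = new HashMap<>();
        externalCatalog.put("events", "(id:int, payload:chararray)");

        // Assumed order: script declarations first, then the external source.
        SchemaResolver resolver = new SchemaResolver(Arrays.<MetadataSource>asList(
                location -> Optional.ofNullable(declaredInScript.get(location)),
                location -> Optional.ofNullable(externalCatalog.get(location))));

        System.out.println(resolver.resolve("clicks"));
        System.out.println(resolver.resolve("events"));
        System.out.println(resolver.resolve("raw_logs"));
    }
}
```

In this sketch a Hive MetaStore lookup would be just one more MetadataSource implementation, with the load location treated as the table name.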
2. a new Tuple/datum/bag based on native Java list, bag, integer, ...
In the types branch we've already reimplemented what was DataAtom as Java String, Integer, Double, etc. So it's only Tuple and DataBag that you'd need to give new implementations for. But these interfaces are much more complex than just a List (for example, a DataBag has to be able to spill to disk if it runs out of memory). If you have different underlying data representations, then you might benefit from a re-implementation. But if you just try to base it on Java types, I think you'll end up reinventing what we have. Take a look at org.apache.pig.data.DefaultTuple and org.apache.pig.data.DefaultDataBag in the types branch to get an idea of what Tuples and Bags look like in Pig now.
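To illustrate why a bag is more than a List, here is a minimal sketch of a collection that spills to disk. It is not DefaultDataBag and does not use any Pig API: SpillableStringBag is a hypothetical class, it holds only Strings, and it spills when a fixed item count is reached, which is an assumption standing in for real memory-pressure detection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical class for illustration only; not Pig's DataBag implementation.
final class SpillableStringBag {
    private final int maxInMemory;
    private final List<String> inMemory = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();
    private long size = 0;

    // maxInMemory is an assumed stand-in for "runs out of memory".
    SpillableStringBag(int maxInMemory) {
        if (maxInMemory < 1) {
            throw new IllegalArgumentException("maxInMemory must be at least 1");
        }
        this.maxInMemory = maxInMemory;
    }

    void add(String item) {
        // Spill files hold one item per line, so embedded line breaks are rejected.
        if (item.indexOf('\n') >= 0 || item.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("items must not contain line breaks");
        }
        inMemory.add(item);
        size++;
        if (inMemory.size() >= maxInMemory) {
            spill();
        }
    }

    // Writes the in-memory items to a temporary file and frees the list.
    private void spill() {
        try {
            Path file = Files.createTempFile("bag-spill-", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, inMemory, StandardCharsets.UTF_8);
            spillFiles.add(file);
            inMemory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    long size() {
        return size;
    }

    int spillCount() {
        return spillFiles.size();
    }

    // Streams spilled items file by file, then the items still held in memory.
    void forEach(Consumer<String> action) {
        try {
            for (Path file : spillFiles) {
                try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        action.accept(line);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        inMemory.forEach(action);
    }
}
```

Even this toy version needs file management, a serialization format, and an iteration path that merges disk and memory, which is roughly the work you would be redoing on top of plain Java collections.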
3. add a "describe" command to the grunt shell.
Do you mean you'd like to do describe on a file (or more generically a data input) instead of just on an alias? If so, yes, I think that would be a great idea.
Here, I will assume that the "filename" passed in to bindTo in storage is
the name of the "table".

Is this a plausible implementation as I don't know much about the Pig
internals and would people find such an optional feature useful?

Thanks, pete
Alan.