Alan Gates
Mon, 22 Sep 2008 17:48:50 -0700
Pete Wyckoff wrote:
> So, my use case would be to use the Hive MetaStore and Serializers/Deserializers to implement:
> 1. a new Pig storage class based on looking up the metadata from the metastore

Our general direction in metadata is to support 4 possible types of metadata:
1) none, we assume everything is uninterpreted bytes. This works.
2) user specified in script. This works.
3) provided by the load function (for example, if it's reading JSON or XML or whatever and can tell the schema of the data). This is coded but not yet tested.
4) provided by an external source. In this scenario Pig would somehow be made aware of an external metadata source, and when it sees a load it would query that source for info on the file. I think this is what you want. We hope to start design work on this in the next month or two. Any input you have on this design is certainly welcome.
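
To make option 4 a bit more concrete, here is a rough sketch in Java of what a lookup at load time might look like. Everything in it is hypothetical: MetadataSource, MapMetadataSource, resolve and the "name:type" field strings are made up for illustration and are not Pig or Hive API. The order in which sources are consulted (script first, then external source, then no schema) is also only an assumption, since the design has not been done yet.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class MetadataLookupSketch {

    /** Hypothetical: an external catalog that may know the schema for a location. */
    interface MetadataSource {
        Optional<List<String>> schemaFor(String location);
    }

    /** Hypothetical in-memory stand-in for an external store such as a metastore. */
    static class MapMetadataSource implements MetadataSource {
        private final Map<String, List<String>> tables = new HashMap<>();

        void register(String location, List<String> fields) {
            tables.put(location, fields);
        }

        @Override
        public Optional<List<String>> schemaFor(String location) {
            return Optional.ofNullable(tables.get(location));
        }
    }

    /**
     * Resolve a schema when a load is seen. Assumed order: a schema declared in
     * the script wins, then the external source is queried, and an empty list
     * stands for "no metadata, treat everything as uninterpreted bytes".
     */
    static List<String> resolve(String location,
                                List<String> declaredInScript,
                                MetadataSource external) {
        if (declaredInScript != null) {
            return declaredInScript;
        }
        return external.schemaFor(location).orElse(Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        MapMetadataSource metastore = new MapMetadataSource();
        metastore.register("page_views", Arrays.asList("user:chararray", "ts:long"));

        // Known to the external source: its schema is returned.
        System.out.println(resolve("page_views", null, metastore));
        // Unknown everywhere: falls back to no schema.
        System.out.println(resolve("unknown_file", null, metastore));
    }
}
```

Keying the lookup on the location string matches the idea of treating the file name passed to the load as a table name, but that mapping is itself one of the open design questions.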

> 2. a new Tuple/datum/bag based on native Java list, bag, integer, ...

In the types branch we've already reimplemented what was DataAtom as Java String, Integer, Double, etc. So it's only Tuple and DataBag that you'd need to give new implementations for. But these interfaces are much more complex than just a List (for example, a DataBag has to be able to spill to disk if it runs out of memory). If you have different underlying data representations then you might benefit from a re-implementation. But if you're just trying to base it on Java types, I think you'll end up re-inventing what we have. Take a look at org.apache.pig.data.DefaultTuple and org.apache.pig.data.DefaultDataBag in the types branch to get an idea of what Tuples and Bags look like in Pig now.
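
To illustrate why a bag is more than a List, here is a minimal sketch of a collection that spills to disk. It is not how DefaultDataBag is implemented: SpillableBagSketch and all of its methods are hypothetical, it holds Strings rather than Tuples, and it spills on a fixed element count, which is a simplifying assumption standing in for real memory pressure.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SpillableBagSketch {
    private final int maxInMemory;                 // assumed trigger: element count
    private final List<String> memory = new ArrayList<>();
    private final List<File> spillFiles = new ArrayList<>();
    private long size = 0;

    SpillableBagSketch(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    /** Add an item; once the in-memory part reaches the limit, write it to disk. */
    void add(String item) {
        memory.add(item);
        size++;
        if (memory.size() >= maxInMemory) {
            spill();
        }
    }

    /** Write the in-memory items to a temp file (count first, then items) and clear them. */
    private void spill() {
        try {
            File f = File.createTempFile("bag-spill", ".tmp");
            f.deleteOnExit();
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(f)))) {
                out.writeInt(memory.size());
                for (String s : memory) {
                    out.writeUTF(s);
                }
            }
            spillFiles.add(f);
            memory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    long size() {
        return size;
    }

    int spillCount() {
        return spillFiles.size();
    }

    /**
     * Read spilled items back in spill order, then append what is still in memory.
     * This materializes everything for simplicity; a real bag would stream instead.
     */
    List<String> toList() {
        List<String> all = new ArrayList<>();
        for (File f : spillFiles) {
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(f)))) {
                int n = in.readInt();
                for (int i = 0; i < n; i++) {
                    all.add(in.readUTF());
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        all.addAll(memory);
        return all;
    }
}
```

Even this toy version needs serialization, temp file handling and a read-back path, which is the kind of work a plain java.util.List based replacement would have to redo.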

> 3. add a "describe" command to the grunt shell.

Do you mean you'd like to do describe on a file (or more generically a data input) instead of just on an alias? If so, yes, I think that would be a great idea.

> Here, I will assume that the "filename" passed in to bindTo in storage is the name of the "table". Is this a plausible implementation, as I don't know much about the Pig internals, and would people find such an optional feature useful?
>
> Thanks, pete

Alan.