pig-user  

Re: Tuple and Datum implementations

Pete Wyckoff
Tue, 23 Sep 2008 10:48:57 -0700

For #1.4, could I not implement a new storage implementation that, when given
the file name, chooses the deserialization/serialization mechanism? This
would not allow me to hide the location of the file from the user, but would
still have the benefit of the storage implementation hiding the details of
the deserialization.
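
A minimal sketch of that idea, using only the Java standard library. The
names DeserializerRegistry, register, deserialize and delimited are
hypothetical and are not part of Pig or Hive; the in-memory map merely stands
in for a metastore lookup keyed by the "file name" (table name) handed to the
storage implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical stand-in for a metastore: maps a table name to the
// deserializer that should be used for that table's rows.
final class DeserializerRegistry {
    private final Map<String, Function<byte[], List<Object>>> byTable = new HashMap<>();

    // Record which deserializer a table uses (a real metastore would store this).
    void register(String tableName, Function<byte[], List<Object>> deserializer) {
        byTable.put(tableName, deserializer);
    }

    // Called by the storage code with the name it was given; the caller
    // never needs to know which deserializer was chosen.
    List<Object> deserialize(String tableName, byte[] rawRow) {
        Function<byte[], List<Object>> d = byTable.get(tableName);
        if (d == null) {
            throw new IllegalArgumentException("no deserializer registered for " + tableName);
        }
        return d.apply(rawRow);
    }

    // Example deserializer: splits a UTF-8 row on a literal delimiter.
    static Function<byte[], List<Object>> delimited(String delimiter) {
        Pattern p = Pattern.compile(Pattern.quote(delimiter));
        return raw -> {
            String[] fields = p.split(new String(raw, StandardCharsets.UTF_8), -1);
            return new ArrayList<Object>(Arrays.asList(fields));
        };
    }
}
```

Registering "page_views" with delimited("\t") and then calling
deserialize("page_views", row) would hide the choice of format from the
script, though not the name itself.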

For #2, yes I see, I don't want to implement the full Bag API, just want to
construct a default data bag from a Set or a List native object.
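
Roughly what I have in mind, as a self-contained sketch. SimpleTuple and
SimpleBag are hypothetical stand-ins, not Pig's Tuple and DataBag (which, as
Alan notes, do much more, such as spilling to disk); the only point is that
each element of the native collection becomes a single-field tuple in the bag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a tuple: an ordered list of fields.
final class SimpleTuple {
    private final List<Object> fields;

    SimpleTuple(Object... fields) {
        this.fields = new ArrayList<>(Arrays.asList(fields));
    }

    Object get(int index) { return fields.get(index); }
    int size() { return fields.size(); }
}

// Hypothetical stand-in for a bag: an in-memory collection of tuples.
final class SimpleBag {
    private final List<SimpleTuple> tuples = new ArrayList<>();

    void add(SimpleTuple t) { tuples.add(t); }
    long size() { return tuples.size(); }
    List<SimpleTuple> tuples() { return Collections.unmodifiableList(tuples); }

    // Build a bag from a native List or Set by wrapping each element
    // in a tuple with exactly one field.
    static SimpleBag fromCollection(Collection<?> values) {
        SimpleBag bag = new SimpleBag();
        for (Object v : values) {
            bag.add(new SimpleTuple(v));
        }
        return bag;
    }
}
```

With the real classes, the equivalent would presumably go through whatever
default bag and tuple constructors the types branch exposes rather than a new
Bag implementation.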

As for Describe, I mean describe on a symbolic name - presumably a name
returned by a "show" command that I would also want to implement - both with
basically mysql semantics.

Thanks, pete


On 9/22/08 5:46 PM, "Alan Gates" <[EMAIL PROTECTED]> wrote:

> 
> 
> Pete Wyckoff wrote:
>> So, my use case would be to use the Hive MetaStore and
>> Serializers/Deserializers to implement:
>> 
>> 1. a new Pig storage class based on looking up the metadata from the
>> metastore
>>   
> Our general direction in metadata is to support 4 possible types of
> metadata:
> 
> 1) none, we assume everything is uninterpreted bytes.  This works.
> 2) user specified in script.  This works.
> 3) provided by load function (for example if it's reading JSON or XML or
> whatever and can tell the schema of the data).  This is coded but not yet
> tested.
> 4) provided by an external source.  In this scenario pig would somehow
> be made aware of an external metadata source, and when it sees a load it
> would query that source for info on the file.  I think this is what you
> want.  We hope to start design work on this in the next month or two.
> Any input you have to this design is certainly welcome.
>> 2. a new Tuple/datum/bag based on native Java list, bag, integer, ...
>>   
> In the types branch we've already reimplemented what was DataAtom as
> java String, Integer, Double, etc.  So it's only Tuple and DataBag that
> you'd need to give new implementations for.  But these interfaces are
> much more complex than just a List (for example, a DataBag has to be
> able to spill to disk if it runs out of memory).  If you have different
> underlying data representations then you might benefit from a
> re-implementation.  But just trying to base it on java types I think
> you'll end up re-inventing what we have.  Take a look at
> org.apache.pig.data.DefaultTuple and org.apache.pig.data.DefaultDataBag
> in the types branch to get an idea of what Tuples and Bags look like in
> pig now.
>> 3. add a "describe" command to the grunt shell.
>>   
> Do you mean you'd like to do describe on a file (or more generically a
> data input) instead of just on an alias?  If so, yes, I think that would
> be a great idea.
>> Here, I will assume that the "filename" passed in to bindTo in storage is
>> the name of the "table".
>> 
>> Is this a plausible implementation as I don't know much about the Pig
>> internals and would people find such an optional feature useful?
>> 
>> Thanks, pete
>>   
> Alan.