Re: [Pig Wiki] Update of "PigMetaData" by AlanGates

pi song Fri, 30 May 2008 07:13:47 -0700

Moreover, adding metadata is conceptually another way to parameterize
load/store functions. Parameterizing UDFs by other UDFs is therefore
also functionally possible, but I just couldn't think of any
good use cases.
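
As a rough illustration of what is being discussed, here is a minimal sketch
of the Map<String, Serializable> extension point Mathieu proposes further down
the thread, using his JDBC loader/store example. Everything in it is an
assumption made for illustration: MetadataSketch, FieldMetadata, the key names
("pig.type", "jdbc.columnType", "jdbc.primaryKey") and the "TEXT" fallback are
hypothetical and are not part of Pig's LoadFunc or StoreFunc interfaces.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: none of these types or keys exist in Pig.
class MetadataSketch {

    /** A field description carrying an open-ended, per-field metadata map. */
    static final class FieldMetadata {
        final String name;
        private final Map<String, Serializable> extra = new HashMap<>();

        FieldMetadata(String name) {
            this.name = name;
        }

        /** Attaches one annotation and returns this object for chaining. */
        FieldMetadata put(String key, Serializable value) {
            extra.put(key, value);
            return this;
        }

        /** Returns the annotation for the key, or null if there is none. */
        Serializable get(String key) {
            return extra.get(key);
        }

        /**
         * Models a pipeline step that changes the field's type. The loader's
         * annotations may no longer hold, so this sketch drops them all.
         */
        FieldMetadata castTo(String newType) {
            return new FieldMetadata(name).put("pig.type", newType);
        }
    }

    /** What a JDBC-style loader might record after inspecting the source table. */
    static FieldMetadata describeFromLoader() {
        return new FieldMetadata("customer_name")
                .put("pig.type", "chararray")
                .put("jdbc.columnType", "VARCHAR(42)")
                .put("jdbc.primaryKey", Boolean.FALSE);
    }

    /** What the matching store function might do with that annotation. */
    static String columnDefinition(FieldMetadata field) {
        Serializable exact = field.get("jdbc.columnType");
        // Fall back to a generic column type when no annotation survived.
        String type = (exact != null) ? exact.toString() : "TEXT";
        return field.name + " " + type;
    }

    public static void main(String[] args) {
        FieldMetadata loaded = describeFromLoader();
        System.out.println(columnDefinition(loaded));
        System.out.println(columnDefinition(loaded.castTo("int")));
    }
}
```

The castTo step stands in for the concern raised in the quoted reply below:
once processing between "load" and "store" changes a field, the loader's
annotations may no longer apply. The sketch drops them and the store falls
back to a default, but whether to drop or propagate them is an open design
choice.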


On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]> wrote:

> Just out of curiosity: suppose the UDF store in your example can somehow
> "learn" from the UDF load. That information still might not be useful, because
> between "load" and "store" you have processing logic that might or might
> not invalidate information transferred directly from "load" to
> "store". For example, I load a list of numbers and then convert them
> to strings. The information from the load is then no longer applicable on
> the UDF store side.
>
> Don't you think the cases where this concept is useful are very rare?
>
> Pi
>
>
>
> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]>
> wrote:
>
>> Pi,
>>
>> Well... I was thinking... the three of them actually. Alan's list is quite
>> comprehensive, so it is not that easy to find a convincing example, but I'm
>> sure UDF developers may need to communicate some additional
>> metadata from one UDF to another.
>>
>> It does not make sense if you think "one UDF function", but it is a way to
>> have two coordinated UDFs communicate.
>>
>> For instance, the developer of a JDBC Pig "connector" will typically write
>> a UDF load and a UDF store. What if he wants the loader to discover the
>> field collection (case 3, Self describing data in Alan's page) from JDBC and
>> propagate the exact column type of a given field (as in "VARCHAR(42)"), to
>> create it the right way in the UDF store? Or the table name? Or the fact
>> that a column is indexed, a primary key, a foreign key constraint, some
>> encoding info... He may also want to develop a UDF pipeline function that
>> would perform some foreign key validation against the database at some point
>> in his script. Having the information in the metadata may be useful.
>>
>> Some other fields of application we can not think of today may need some
>> completely different metadata. My whole point is: Pig should provide some
>> metadata extension point.
>>
>> On 30 May 08 at 13:54, pi song wrote:
>>
>>
>>> I don't get it Mathieu.  UDF is a very broad term. It could be UDF Load,
>>> UDF Store, or UDF as function in pipeline.  Can you explain a bit more?
>>>
>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>  All,
>>>>
>>>> Looking at the very extensive list of types of file specific metadata, I
>>>> think (from experience) that a UDF may need to attach some
>>>> information (any information, actually) to a given field (or file) to be
>>>> retrieved by another UDF downstream.
>>>>
>>>> What about adding a Map<String, Serializable> to each file and each
>>>> field?
>>>>
>>>> --
>>>> Mathieu
>>>>
>>>> On 30 May 08 at 01:24, pi song wrote:
>>>>
>>>>
>>>> Alan,
>>>>
>>>>>
>>>>> I will start thinking about this as well. When do you want to start the
>>>>> implementation?
>>>>>
>>>>> Pi
>>>>>
>>>>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>
>>>>>> Dear Wiki user,
>>>>>>
>>>>>> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
>>>>>> change notification.
>>>>>>
>>>>>> The following page has been changed by AlanGates:
>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> information, histograms, etc.
>>>>>>
>>>>>> == Pig Interface to File Specific Metadata ==
>>>>>> - Pig should support four options with regard to file specific
>>>>>> metadata:
>>>>>> + Pig should support four options with regard to reading file specific
>>>>>> metadata:
>>>>>> 1.  No file specific metadata available.  Pig uses the file as input
>>>>>> with
>>>>>> no knowledge of its content.  All data is assumed to be !ByteArrays.
>>>>>> 2.  User provides schema in the script.  For example, `A = load
>>>>>> 'myfile'
>>>>>> as (a: chararray, b: int);`.
>>>>>> 3.  Self describing data.  Data may be in a format that describes the
>>>>>> schema, such as JSON.  Users may also have other proprietary ways to
>>>>>> store
>>>>>> information about the data in a file either in the file itself or in
>>>>>> an
>>>>>> associated file.  Changes to the !LoadFunc interface made as part of
>>>>>> the
>>>>>> pipeline rework support this for data type and column layout only.  It
>>>>>> will
>>>>>> need to be expanded to support other types of information about the
>>>>>> file.
>>>>>> 4.  Input from a data catalog.  Pig needs to be able to query an
>>>>>> external
>>>>>> data catalog to acquire information about a file.  All the same
>>>>>> information
>>>>>> available in option 3 should be available via this interface.  This
>>>>>> interface does not yet exist and needs to be designed.
>>>>>>
>>>>>> + It should support options 3 and 4 for writing file specific metadata
>>>>>> as
>>>>>> well.
>>>>>> +
>>>>>> == Pig Interface to Global Metadata ==
>>>>>> - An interface will need to be designed for pig to interface to an
>>>>>> external
>>>>>> data catalog.
>>>>>> + An interface will need to be designed for pig to read from and write
>>>>>> to
>>>>>> an external data catalog.
>>>>>>
>>>>>> == Architecture of Pig Interface to External Data Catalog ==
>>>>>> Pig needs to be able to connect to various types of external data
>>>>>> catalogs
>>>>>> (databases, catalogs stored in flat files, web services, etc.).  To
>>>>>> facilitate this
>>>>>> - pig will develop a generic interface that allows it to make specific
>>>>>> types of queries to a data catalog.  Drivers will then need to be
>>>>>> written
>>>>>> to
>>>>>> implement
>>>>>> + pig will develop a generic interface that allows it to query and
>>>>>> update
>>>>>> a
>>>>>> data catalog.  Drivers will then need to be written to implement
>>>>>> that interface and connect to a specific type of data catalog.
>>>>>>
>>>>>> == Types of File Specific Metadata Pig Will Use ==
>>>>>> - Pig should be able to acquire the following types of information
>>>>>> about
>>>>>> a
>>>>>> file via either self description or an external data catalog.  This is
>>>>>> not
>>>>>> to say
>>>>>> + Pig should be able to acquire and record the following types of
>>>>>> information about a file via either self description or an external
>>>>>> data
>>>>>> catalog.  This is not to say
>>>>>> that every self describing file or external data catalog must support
>>>>>> every
>>>>>> one of these items.  This is a list of items pig may find useful and
>>>>>> should
>>>>>> be
>>>>>> - able to query for.  If the metadata source cannot provide the
>>>>>> information, pig will simply not make use of it.
>>>>>> + able to query for and create.  If the metadata source cannot provide
>>>>>> or
>>>>>> store the information, pig will simply not make use of it or record
>>>>>> it.
>>>>>> * Field layout (already supported)
>>>>>> * Field types (already supported)
>>>>>> * Sortedness of the data, both key and direction
>>>>>> (ascending/descending)
>>>>>> @@ -52, +54 @@
>>>>>>
>>>>>>
>>>>>> == Priorities ==
>>>>>> Given that the usage for global metadata is unclear, the priority will
>>>>>> be
>>>>>> placed on supporting file specific metadata.  The first step should be
>>>>>> to
>>>>>> define the
>>>>>> - interface changes in !LoadFunc and the interface to external data
>>>>>> catalogs.
>>>>>> + interface changes in !LoadFunc, !StoreFunc and the interface to
>>>>>> external
>>>>>> data catalogs.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>
>
