Re: [Pig Wiki] Update of "PigMetaData" by AlanGates

pi song Fri, 30 May 2008 04:54:52 -0700

I don't get it, Mathieu.  UDF is a very broad term. It could be a Load UDF, a
Store UDF, or a UDF used as a function in the pipeline.  Can you explain a bit more?


On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:

> All,
>
> Looking at the very extensive list of types of file-specific metadata, I
> think (from experience) that a UDF may need to attach some
> information (any information, actually) to a given field (or file) to be
> retrieved by another UDF downstream.
>
> What about adding a Map<String, Serializable> to each file and each field?
>
> --
> Mathieu
>
> On 30 May 08, at 01:24, pi song wrote:
>
>
>  Alan,
>>
>> I will start thinking about this as well. When do you want to start the
>> implementation?
>>
>> Pi
>>
>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>
>>>
>>> Dear Wiki user,
>>>
>>> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
>>> change notification.
>>>
>>> The following page has been changed by AlanGates:
>>> http://wiki.apache.org/pig/PigMetaData
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> information, histograms, etc.
>>>
>>> == Pig Interface to File Specific Metadata ==
>>> - Pig should support four options with regard to file specific metadata:
>>> + Pig should support four options with regard to reading file specific
>>> metadata:
>>>  1.  No file specific metadata available.  Pig uses the file as input
>>> with
>>> no knowledge of its content.  All data is assumed to be !ByteArrays.
>>>  2.  User provides schema in the script.  For example, `A = load 'myfile'
>>> as (a: chararray, b: int);`.
>>>  3.  Self describing data.  Data may be in a format that describes the
>>> schema, such as JSON.  Users may also have other proprietary ways to
>>> store
>>> information about the data in a file either in the file itself or in an
>>> associated file.  Changes to the !LoadFunc interface made as part of the
>>> pipeline rework support this for data type and column layout only.  It
>>> will
>>> need to be expanded to support other types of information about the file.
>>>  4.  Input from a data catalog.  Pig needs to be able to query an
>>> external
>>> data catalog to acquire information about a file.  All the same
>>> information
>>> available in option 3 should be available via this interface.  This
>>> interface does not yet exist and needs to be designed.
>>>
>>> + It should support options 3 and 4 for writing file specific metadata as
>>> well.
>>> +
>>> == Pig Interface to Global Metadata ==
>>> - An interface will need to be designed for pig to interface to an
>>> external
>>> data catalog.
>>> + An interface will need to be designed for pig to read from and write to
>>> an external data catalog.
>>>
>>> == Architecture of Pig Interface to External Data Catalog ==
>>> Pig needs to be able to connect to various types of external data
>>> catalogs
>>> (databases, catalogs stored in flat files, web services, etc.).  To
>>> facilitate this
>>> - pig will develop a generic interface that allows it to make specific
>>> types of queries to a data catalog.  Drivers will then need to be written
>>> to
>>> implement
>>> + pig will develop a generic interface that allows it to query and update
>>> a
>>> data catalog.  Drivers will then need to be written to implement
>>> that interface and connect to a specific type of data catalog.
>>>
>>> == Types of File Specific Metadata Pig Will Use ==
>>> - Pig should be able to acquire the following types of information about
>>> a
>>> file via either self description or an external data catalog.  This is
>>> not
>>> to say
>>> + Pig should be able to acquire and record the following types of
>>> information about a file via either self description or an external data
>>> catalog.  This is not to say
>>> that every self describing file or external data catalog must support
>>> every
>>> one of these items.  This is a list of items pig may find useful and
>>> should
>>> be
>>> - able to query for.  If the metadata source cannot provide the
>>> information, pig will simply not make use of it.
>>> + able to query for and create.  If the metadata source cannot provide or
>>> store the information, pig will simply not make use of it or record it.
>>>  * Field layout (already supported)
>>>  * Field types (already supported)
>>>  * Sortedness of the data, both key and direction (ascending/descending)
>>> @@ -52, +54 @@
>>>
>>>
>>> == Priorities ==
>>> Given that the usage for global metadata is unclear, the priority will be
>>> placed on supporting file specific metadata.  The first step should be to
>>> define the
>>> - interface changes in !LoadFunc and the interface to external data
>>> catalogs.
>>> + interface changes in !LoadFunc, !StoreFunc and the interface to
>>> external
>>> data catalogs.
>>>
>>>
>>>
>
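
For reference, the Map<String, Serializable> idea quoted above could look
roughly like the following. This is only a minimal sketch: the class names
(Annotations, FieldInfo, FileInfo, UdfAnnotationSketch), the "unit" and
"cleaned" keys, and the upstream/downstream methods are all hypothetical
and are not part of Pig's LoadFunc or StoreFunc interfaces.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for free-form annotations; not part of Pig's API.
class Annotations {
    private final Map<String, Serializable> values = new HashMap<>();

    void put(String key, Serializable value) {
        values.put(key, value);
    }

    // Returns null when no annotation was recorded under this key.
    Serializable get(String key) {
        return values.get(key);
    }
}

// Hypothetical field descriptor: a name plus its annotations.
class FieldInfo {
    final String name;
    final Annotations annotations = new Annotations();

    FieldInfo(String name) {
        this.name = name;
    }
}

// Hypothetical file descriptor: a path, file-level annotations, and fields.
class FileInfo {
    final String path;
    final Annotations annotations = new Annotations();
    private final Map<String, FieldInfo> fields = new HashMap<>();

    FileInfo(String path) {
        this.path = path;
    }

    // Looks up a field by name, creating its descriptor on first use.
    FieldInfo field(String name) {
        return fields.computeIfAbsent(name, FieldInfo::new);
    }
}

class UdfAnnotationSketch {
    // Stands in for an upstream UDF that records what it knows.
    static void upstream(FileInfo file) {
        file.field("b").annotations.put("unit", "seconds");
        file.annotations.put("cleaned", Boolean.TRUE);
    }

    // Stands in for a downstream UDF that reads the annotation back.
    static String downstream(FileInfo file) {
        Serializable unit = file.field("b").annotations.get("unit");
        return unit == null ? "unknown" : unit.toString();
    }

    public static void main(String[] args) {
        FileInfo file = new FileInfo("myfile");
        upstream(file);
        System.out.println("unit of b: " + downstream(file));
    }
}
```

The map keeps the metadata open-ended, at the cost that the two UDFs have
to agree on key names and value types out of band.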
