Re: [Pig Wiki] Update of "PigMetaData" by AlanGates

pi song Thu, 29 May 2008 16:24:51 -0700

Alan,

I will start thinking about this as well. When do you want to start the
implementation?


Pi

On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
> change notification.
>
> The following page has been changed by AlanGates:
> http://wiki.apache.org/pig/PigMetaData
>
>
> ------------------------------------------------------------------------------
> information, histograms, etc.
>
> == Pig Interface to File Specific Metadata ==
> - Pig should support four options with regard to file specific metadata:
> + Pig should support four options with regard to reading file specific
> metadata:
>   1.  No file specific metadata available.  Pig uses the file as input with
> no knowledge of its content.  All data is assumed to be !ByteArrays.
>   2.  User provides schema in the script.  For example, `A = load 'myfile'
> as (a: chararray, b: int);`.
>   3.  Self describing data.  Data may be in a format that describes the
> schema, such as JSON.  Users may also have other proprietary ways to store
> information about the data in a file either in the file itself or in an
> associated file.  Changes to the !LoadFunc interface made as part of the
> pipeline rework support this for data type and column layout only.  It will
> need to be expanded to support other types of information about the file.
>   4.  Input from a data catalog.  Pig needs to be able to query an external
> data catalog to acquire information about a file.  All the same information
> available in option 3 should be available via this interface.  This
> interface does not yet exist and needs to be designed.
>
> + It should support options 3 and 4 for writing file specific metadata as
> well.
> +
> == Pig Interface to Global Metadata ==
> - An interface will need to be designed for pig to interface to an external
> data catalog.
> + An interface will need to be designed for pig to read from and write to
> an external data catalog.
>
> == Architecture of Pig Interface to External Data Catalog ==
> Pig needs to be able to connect to various types of external data catalogs
> (databases, catalogs stored in flat files, web services, etc.).  To
> facilitate this
> - pig will develop a generic interface that allows it to make specific
> types of queries to a data catalog.  Drivers will then need to be written to
> implement
> + pig will develop a generic interface that allows it to query and update a
> data catalog.  Drivers will then need to be written to implement
> that interface and connect to a specific type of data catalog.
>
> == Types of File Specific Metadata Pig Will Use ==
> - Pig should be able to acquire the following types of information about a
> file via either self description or an external data catalog.  This is not
> to say
> + Pig should be able to acquire and record the following types of
> information about a file via either self description or an external data
> catalog.  This is not to say
> that every self describing file or external data catalog must support every
> one of these items.  This is a list of items pig may find useful and should
> be
> - able to query for.  If the metadata source cannot provide the
> information, pig will simply not make use of it.
> + able to query for and create.  If the metadata source cannot provide or
> store the information, pig will simply not make use of it or record it.
>   * Field layout (already supported)
>   * Field types (already supported)
>   * Sortedness of the data, both key and direction (ascending/descending)
> @@ -52, +54 @@
>
>
> == Priorities ==
> Given that the usage for global metadata is unclear, the priority will be
> placed on supporting file specific metadata.  The first step should be to
> define the
> - interface changes in !LoadFunc and the interface to external data
> catalogs.
> + interface changes in !LoadFunc, !StoreFunc and the interface to external
> data catalogs.
>
>
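
As a starting point for discussion, the "generic interface plus drivers" idea in the quoted Architecture section could be sketched in Java roughly as follows. This is only a minimal sketch: FileMetadata, DataCatalog and InMemoryCatalog are hypothetical names, none of them exist in Pig, and only a few of the listed metadata items are modelled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these types are part of Pig.

/** A few of the per-file metadata items from the wiki list; each may be absent. */
final class FileMetadata {
    final Optional<String> schema;      // field layout and types
    final Optional<String> sortKey;     // key the data is sorted on
    final Optional<Boolean> ascending;  // sort direction

    FileMetadata(String schema, String sortKey, Boolean ascending) {
        this.schema = Optional.ofNullable(schema);
        this.sortKey = Optional.ofNullable(sortKey);
        this.ascending = Optional.ofNullable(ascending);
    }
}

/** Generic interface Pig would code against to query and update a catalog. */
interface DataCatalog {
    Optional<FileMetadata> query(String path);
    void update(String path, FileMetadata metadata);
}

/** Toy driver backed by a map; a real driver would wrap a database, flat file, or web service. */
final class InMemoryCatalog implements DataCatalog {
    private final Map<String, FileMetadata> entries = new HashMap<>();

    @Override
    public Optional<FileMetadata> query(String path) {
        return Optional.ofNullable(entries.get(path));
    }

    @Override
    public void update(String path, FileMetadata metadata) {
        entries.put(path, metadata);
    }
}
```

Every item is optional, so a source that cannot provide or store a given item simply leaves it empty, and a caller that gets nothing back for a file can fall back to treating it as having no file specific metadata.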
