Re: [Pig Wiki] Update of "PigMetaData" by AlanGates

Mathieu Poumeyrol Fri, 30 May 2008 06:45:15 -0700

Pi,

Well... I was thinking of all three of them, actually. Alan's list is quite comprehensive, so it is not that easy to find a convincing example, but I'm sure UDF developers may need some additional information to communicate metadata from one UDF to another.

It does not make sense if you think of a single UDF function, but it is a way to have two coordinated UDFs communicate.

For instance, the developer of a JDBC Pig "connector" will typically write a UDF load and a UDF store. What if he wants the loader to discover the field collection (case 3, "Self describing data", in Alan's page) from JDBC and propagate the exact column type of a given field (as in "VARCHAR(42)"), so that the UDF store can create it the right way? Or the table name? Or the fact that a column is indexed, a primary key, a foreign key constraint, some encoding info... He may also want to develop a UDF pipeline function that performs some foreign key validation against the database at some point in his script. Having the information in the metadata may be useful.

Some other fields of application that we cannot think of today may need some completely different metadata. My whole point is: Pig should provide some metadata extension point.
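
To make the JDBC case concrete, here is a rough Java sketch of what such an extension point could look like. Everything in it is an assumption made for illustration: the FieldInfo class, the "jdbc.columnType" and "jdbc.primaryKey" keys, and both helper methods are hypothetical and are not part of Pig or of Alan's proposal. It only shows a loader-side function attaching free-form entries to a field and a store-side function reading them back.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: none of these types or keys exist in Pig.
public class MetadataExtensionSketch {

    /** A field description carrying an open-ended map of serializable annotations. */
    static final class FieldInfo implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        private final Map<String, Serializable> extensions = new HashMap<>();

        FieldInfo(String name) { this.name = name; }

        void put(String key, Serializable value) { extensions.put(key, value); }

        Serializable get(String key) { return extensions.get(key); }
    }

    /** What a JDBC load UDF might record after inspecting the source table. */
    static FieldInfo describeColumn(String column, String sqlType, boolean primaryKey) {
        FieldInfo field = new FieldInfo(column);
        field.put("jdbc.columnType", sqlType);                     // assumed key name
        field.put("jdbc.primaryKey", Boolean.valueOf(primaryKey)); // assumed key name
        return field;
    }

    /** What a JDBC store UDF might do downstream to recreate the column. */
    static String columnDefinition(FieldInfo field) {
        Serializable type = field.get("jdbc.columnType");
        // Arbitrary fallback when no upstream UDF supplied a column type.
        String sqlType = (type != null) ? type.toString() : "VARCHAR(255)";
        String definition = field.name + " " + sqlType;
        if (Boolean.TRUE.equals(field.get("jdbc.primaryKey"))) {
            definition += " PRIMARY KEY";
        }
        return definition;
    }

    public static void main(String[] args) {
        FieldInfo field = describeColumn("username", "VARCHAR(42)", true);
        System.out.println(columnDefinition(field)); // prints: username VARCHAR(42) PRIMARY KEY
    }
}
```

The point of the map is that Pig itself never needs to understand the keys: it would only have to carry them from the loader to whatever UDF runs downstream.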


On 30 May 08 at 13:54, pi song wrote:

I don't get it Mathieu. UDF is a very broad term. It could be UDFLoad, UDF
Store, or UDF as function in pipeline.  Can you explain a bit more?
On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:
All,
Looking at the very extensive list of types of file specific metadata, I
think (from experience) that a UDF function may need to attach some
information (any information, actually) to a given field (or file) to be
retrieved by another UDF downstream.
What about adding a Map<String, Serializable> to each file and each field?
--
Mathieu

On 30 May 08 at 01:24, pi song wrote:


Alan,
I will start thinking about this as well. When do you want to start the
implementation?

Pi

On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Pig Wiki" for
change notification.

The following page has been changed by AlanGates:
http://wiki.apache.org/pig/PigMetaData



------------------------------------------------------------------------------
information, histograms, etc.

== Pig Interface to File Specific Metadata ==
- Pig should support four options with regard to file specific metadata:
+ Pig should support four options with regard to reading file specific metadata:
1. No file specific metadata available. Pig uses the file as input with
no knowledge of its content. All data is assumed to be !ByteArrays.
2. User provides schema in the script. For example, `A = load 'myfile'
as (a: chararray, b: int);`.
3. Self describing data. Data may be in a format that describes the
schema, such as JSON. Users may also have other proprietary ways to store
information about the data in a file either in the file itself or in an
associated file. Changes to the !LoadFunc interface made as part of the
pipeline rework support this for data type and column layout only. It will
need to be expanded to support other types of information about the file.
4. Input from a data catalog. Pig needs to be able to query an external
data catalog to acquire information about a file. All the same information
available in option 3 should be available via this interface. This
interface does not yet exist and needs to be designed.
+ It should support options 3 and 4 for writing file specific metadata as well.
+
== Pig Interface to Global Metadata ==
- An interface will need to be designed for pig to interface to an external data catalog.
+ An interface will need to be designed for pig to read from and write to an external data catalog.

== Architecture of Pig Interface to External Data Catalog ==
Pig needs to be able to connect to various types of external data catalogs
(databases, catalogs stored in flat files, web services, etc.).  To
facilitate this
- pig will develop a generic interface that allows it to make specific types of queries to a data catalog. Drivers will then need to be written to implement
+ pig will develop a generic interface that allows it to query and update a data catalog.  Drivers will then need to be written to implement
that interface and connect to a specific type of data catalog.

== Types of File Specific Metadata Pig Will Use ==
- Pig should be able to acquire the following types of information about a file via either self description or an external data catalog. This is not to say
+ Pig should be able to acquire and record the following types of information about a file via either self description or an external data catalog.  This is not to say
that every self describing file or external data catalog must support every
one of these items. This is a list of items pig may find useful and should be
- able to query for.  If the metadata source cannot provide the information, pig will simply not make use of it.
+ able to query for and create. If the metadata source cannot provide or store the information, pig will simply not make use of it or record it.
* Field layout (already supported)
* Field types (already supported)
* Sortedness of the data, both key and direction (ascending/descending)
@@ -52, +54 @@


== Priorities ==
Given that the usage for global metadata is unclear, the priority will be
placed on supporting file specific metadata. The first step should be to
define the
- interface changes in !LoadFunc and the interface to external data catalogs.
+ interface changes in !LoadFunc, !StoreFunc and the interface to external data catalogs.
