Well, it adds a way to *dynamically* parameterize UDFs without changing the pig script itself.

I guess it comes back to the question of "how big a pig script is". If we are only considering 5-line pig scripts, where you load exactly what you need to compute, crunch the numbers and dump them, I agree it does not make much sense.

If one starts thinking about something more ETL-ish (which I understand is not exactly the main purpose of pig), then one could want to use pig to "move" data around: load data from somewhere, do something "heavy" that ETL software just cannot cope with efficiently enough (building indexes, processing images, whatever), and store the results somewhere else. In that scenario there can be fields that pig simply forwards without touching.

I admit that my background, where we were using the same software for ETL-like stuff and for heavy processing (that is, mostly building indexes), may give me a very biased opinion about pig and what it should be. But I would definitely like to use pig for what it is, or will be, excellent at, as well as for stuff where it will be just OK.

So I still think the extension point is worth having. Half my brain is already thinking about ways of cheating and using Alan's fields list to pass other stuff around...

One more concrete example and then I'll stop bothering you all :) In our tools, we use field metadata to denote that a field's content is a primary key to a record. When we copy this field's values somewhere else, we automatically tag them as a foreign key (instead of primary). When we dump the data to disk (to an end-user CD-ROM image in most cases), the fact that the column refers to a table also present on the disk can be automagically stored, as it is a feature of our final format: without the application developer re-specifying the relations, the "UDF store equivalent" is clever enough to store the information.

The script of the application developer who prepares a CD-ROM can be several screens long, with bits spread over separate files. The data model can be quite complex too. In this context, it is important that things like "this field acts as a record key" are stated once.
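To make that concrete, here is a minimal sketch in plain Java of the tagging behaviour I just described; every name here is invented for illustration (the class and the "key.role" entry are not existing pig API):

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical per-field metadata holder.
    class FieldMetadata {
        private final Map<String, Serializable> entries =
            new HashMap<String, Serializable>();

        void put(String key, Serializable value) { entries.put(key, value); }
        Serializable get(String key) { return entries.get(key); }

        // When a field's values are copied somewhere else, the derived
        // field keeps the link but is downgraded from primary to foreign
        // key, so a store func can later record the relation on its own.
        FieldMetadata copyForDerivedField() {
            FieldMetadata copy = new FieldMetadata();
            copy.entries.putAll(this.entries);
            if ("primary".equals(copy.get("key.role"))) {
                copy.put("key.role", "foreign");
            }
            return copy;
        }
    }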

On 30 May 08 at 16:13, pi song wrote:

Moreover, adding metadata is conceptually adding another way to parameterize load/store functions. Making UDFs parameterizable by other UDFs is therefore also functionally possible, but I just couldn't think of any good use cases.

On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]> wrote:

Just out of curiosity. Even if, as you say, the UDF store in your example can somehow "learn" from the UDF load, that information still might not be useful, because between "load" and "store" you've got processing logic which might or might not preserve the validity of information transferred directly from "load" to "store". An example: I load a list of numbers and then convert them to strings. The information on the UDF store side is then no longer applicable.

Don't you think the cases where this concept can be useful are very rare?

Pi



On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:

Pi,

Well... I was thinking... all three of them, actually. Alan's list is quite comprehensive, so it is not that easy to find a convincing example, but I'm sure UDF developers may need some additional information to communicate metadata from one UDF to another.

It does not make sense if you think in terms of a single UDF, but it is a way to have two coordinated UDFs communicate.

For instance, the developer of a jdbc pig "connector" will typically write a UDF load and a UDF store. What if he wants the loader to discover the field collection from jdbc (case 3, self describing data, on Alan's page) and propagate the exact column type of a given field (as in "VARCHAR(42)") so the UDF store can create it the right way? Or the table name? Or the fact that a column is indexed, a primary key, a foreign key constraint, some encoding info? He may also want to develop a UDF pipeline function that performs some foreign key validation against the database at some point in his script. Having the information in the metadata may be useful.
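As a sketch of that loader side (the metadata map and the "jdbc.*" keys are invented for illustration; only the java.sql calls are real API):

    import java.io.Serializable;
    import java.sql.Connection;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;
    import java.util.Map;

    class JdbcLoaderSketch {
        // Record each column's exact declaration so that a coordinated
        // UDF store could later issue a faithful CREATE TABLE.
        static void describeColumns(Connection conn, String table,
                Map<String, Serializable> fieldMeta) throws Exception {
            Statement st = conn.createStatement();
            ResultSetMetaData md = st.executeQuery(
                    "SELECT * FROM " + table + " WHERE 1=0").getMetaData();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                String name = md.getColumnName(i);
                // e.g. "VARCHAR(42)"; a real driver would also handle
                // types whose precision is not meaningful.
                fieldMeta.put("jdbc.type." + name, md.getColumnTypeName(i)
                        + "(" + md.getPrecision(i) + ")");
            }
            st.close();
        }
    }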

Some other fields of application that we cannot think of today may need completely different metadata. My whole point is: pig should provide a metadata extension point.

On 30 May 08 at 13:54, pi song wrote:


I don't get it, Mathieu. UDF is a very broad term. It could be UDF load, UDF store, or UDF as a function in the pipeline. Can you explain a bit more?

On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:

All,

Looking at the very extensive list of types of file specific metadata, I think (from experience) that a UDF function may need to attach some information (any information, actually) to a given field (or file), to be retrieved by another UDF downstream.

What about adding a Map<String, Serializable> to each file and each field?
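Roughly like this, as a sketch (class and method names made up; the real hook would have to live on pig's schema and LoadFunc/StoreFunc types):

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    // One opaque bag of metadata for the file, plus one per field; pig
    // itself would just carry them from UDF to UDF without interpreting.
    class FileDescriptor {
        final Map<String, Serializable> fileMeta =
            new HashMap<String, Serializable>();
        private final Map<String, Map<String, Serializable>> fieldMeta =
            new HashMap<String, Map<String, Serializable>>();

        Map<String, Serializable> metaFor(String field) {
            Map<String, Serializable> m = fieldMeta.get(field);
            if (m == null) {
                m = new HashMap<String, Serializable>();
                fieldMeta.put(field, m);
            }
            return m;
        }
    }

A load func would fill it in; any UDF downstream could then read, say, metaFor("price").get("jdbc.type.price").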

--
Mathieu

On 30 May 08 at 01:24, pi song wrote:


Alan,


I will start thinking about this as well. When do you want to start the implementation?

Pi

On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:


Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by AlanGates:
http://wiki.apache.org/pig/PigMetaData




------------------------------------------------------------------------------
information, histograms, etc.

== Pig Interface to File Specific Metadata ==
- Pig should support four options with regard to file specific metadata:
+ Pig should support four options with regard to reading file specific metadata:
1. No file specific metadata available. Pig uses the file as input with no knowledge of its content. All data is assumed to be !ByteArrays.
2. User provides schema in the script. For example, `A = load 'myfile' as (a: chararray, b: int);`.
3. Self describing data. Data may be in a format that describes the schema, such as JSON. Users may also have other proprietary ways to store information about the data in a file, either in the file itself or in an associated file. Changes to the !LoadFunc interface made as part of the pipeline rework support this for data type and column layout only. It will need to be expanded to support other types of information about the file.
4. Input from a data catalog. Pig needs to be able to query an external data catalog to acquire information about a file. All the same information available in option 3 should be available via this interface. This interface does not yet exist and needs to be designed.

+ It should support options 3 and 4 for writing file specific metadata as well.
+
== Pig Interface to Global Metadata ==
- An interface will need to be designed for pig to interface to an external data catalog.
+ An interface will need to be designed for pig to read from and write to an external data catalog.

== Architecture of Pig Interface to External Data Catalog ==
Pig needs to be able to connect to various types of external data catalogs (databases, catalogs stored in flat files, web services, etc.). To facilitate this
- pig will develop a generic interface that allows it to make specific types of queries to a data catalog. Drivers will then need to be written to implement
+ pig will develop a generic interface that allows it to query and update a data catalog. Drivers will then need to be written to implement
that interface and connect to a specific type of data catalog.

== Types of File Specific Metadata Pig Will Use ==
- Pig should be able to acquire the following types of information about a file via either self description or an external data catalog. This is not to say
+ Pig should be able to acquire and record the following types of information about a file via either self description or an external data catalog. This is not to say
that every self describing file or external data catalog must support every one of these items. This is a list of items pig may find useful and should be
- able to query for. If the metadata source cannot provide the information, pig will simply not make use of it.
+ able to query for and create. If the metadata source cannot provide or store the information, pig will simply not make use of it or record it.
* Field layout (already supported)
* Field types (already supported)
* Sortedness of the data, both key and direction (ascending/descending)
@@ -52, +54 @@


== Priorities ==
Given that the usage for global metadata is unclear, the priority will be placed on supporting file specific metadata. The first step should be to define the
- interface changes in !LoadFunc and the interface to external data catalogs.
+ interface changes in !LoadFunc, !StoreFunc and the interface to external data catalogs.







