Mathieu, let me make sure I understand what you're trying to say. Some file level metadata is about the file as a whole, such as how many records are in the file. Some is about individual columns (such as column cardinality or value distribution histograms). You would like to see each stored in a map (one map for the file-wide metadata and one per column). You could then "cheat" in the load functions you write for yourself and add values specific to your application into those maps. Is that correct?

We will need to decide on a canonical set of metadata entries that the pig engine can ask for. But allowing for optional settings in addition to these canonical values seems like a good idea. The pig engine itself will only utilize the canonical set. But user contributed load, store, and eval functions are free to communicate with each other via the optional set.

To address Pi's point about columns being transformed, my assumption would be that all this file level metadata will be stored in (or at least referenced from) the Schema object. This can be arranged so that metadata associated with a particular field survives projection, but does not survive being passed to an eval UDF, being used in an arithmetic expression, etc. As the eval UDF can generate a schema for its output, it could set any optional (or canonical) values it wanted, thus facilitating Mathieu's communication.
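
To make that concrete, here is a rough sketch of the shape I am imagining. All class and member names below are invented for illustration; this is not the actual Schema code:

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Invented sketch; not the real pig Schema API.
public class FieldSchemaSketch {
    public final String alias;
    public final byte type;
    // Canonical entries the pig engine asks for, plus any optional entries
    // that user load, store, and eval functions agree on among themselves.
    public final Map<String, Serializable> metadata =
            new HashMap<String, Serializable>();

    public FieldSchemaSketch(String alias, byte type) {
        this.alias = alias;
        this.type = type;
    }

    // Projection keeps the field as-is, so its metadata survives.
    public FieldSchemaSketch project() {
        FieldSchemaSketch copy = new FieldSchemaSketch(alias, type);
        copy.metadata.putAll(metadata);
        return copy;
    }

    // Passing the field through an eval UDF or arithmetic expression yields
    // a fresh field with empty metadata; the UDF can set entries on its
    // output schema if it wants to pass something downstream.
    public FieldSchemaSketch transform(String newAlias, byte newType) {
        return new FieldSchemaSketch(newAlias, newType);
    }
}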

Alan.

pi song wrote:
I love discussing new ideas, Mathieu. This is not bothering but interesting. My colleague spent some time working on a Microsoft SSIS job that breaks whenever there is a schema change and then requires a manual script change. It seems you are trying to go beyond that.

On Sat, May 31, 2008 at 12:53 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:

Well, it adds a way to *dynamically* parameterize UDFs, without changing the pig script itself.

I guess it comes back to the question of "how big a pig script is". If we are only considering 5-line pig scripts, where you load exactly what you need to compute, crunch the numbers and dump them, I agree it does not make much sense.

If one starts thinking about something more ETL-ish (which I understand is not exactly the main purpose of pig), then one could want to use pig to "move" data around or load data from somewhere, do something "heavy" that ETL software just cannot cope with efficiently enough (build indexes, process images, whatever), and store the results somewhere else: a scenario where there can be fields that pig will just forward without touching them.

I admit my background, where we were using the same software for ETL-like stuff and heavy processing (that is, mostly building indexes), may give me a very biased opinion about pig and what it should be. But I would definitely like to use pig for what it is/will be excellent at, as well as for stuff where it will be just OK.

So I still think the extension point is worth having. Half my brain is already thinking about ways of cheating and using Alan's fields list to pass other stuff around...

Another concrete example, and then I will stop bothering you all :) In our tools, we use some field metadata to denote that a field's content is the primary key of a record. When we copy these field values somewhere else, we automatically tag them as a foreign key (instead of primary). When we dump the data to disk (to an end-user CDROM image in most cases), the fact that the column refers to a table also present on the disk can be automagically stored, as it is a feature of our final format: without the application developer re-specifying the relations, the "UDF store equivalent" is clever enough to store the information.

The script written by the application developer who prepares a CDROM can be several screens long, with bits spread over separate files. The data model can be quite complex too. In this context, it is important that things like "this field acts as a record key" are said only once.
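
Just to illustrate (everything below is invented for the example, it is not real code from our tools):

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Invented sketch of the primary/foreign key tagging described above.
public class KeyTagging {
    public static void main(String[] args) {
        // Tag the field once, when the data enters the script.
        Map<String, Serializable> idField = new HashMap<String, Serializable>();
        idField.put("key.role", "primary");
        idField.put("key.table", "products");

        // Copying the field's values elsewhere re-tags the copy as a
        // foreign key automatically.
        Map<String, Serializable> copy = new HashMap<String, Serializable>(idField);
        copy.put("key.role", "foreign");

        // The "UDF store equivalent" sees the tag and records the relation
        // in the final format, without the developer restating it.
        if ("foreign".equals(copy.get("key.role"))) {
            System.out.println("store: foreign key into table " + copy.get("key.table"));
        }
    }
}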

On 30 May 08, at 16:13, pi song wrote:


Moreover, adding metadata is conceptually adding another way to parameterize load/store functions. Making UDFs parameterizable by other UDFs is therefore also functionally possible, but I just couldn't think of any good use cases.

On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]> wrote:

Just out of curiosity: you say the UDF store in your example can somehow "learn" from the UDF load. That information still might not be useful, because between "load" and "store" you've got processing logic which might or might not preserve the validity of information transferred directly from "load" to "store". An example would be: I load a list of numbers and then convert them to strings. The information on the UDF store side is then not applicable.

Don't you think the cases where this concept can be useful are very rare?

Pi



On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:

Pi,

Well... I was thinking... the three of them, actually. Alan's list is quite comprehensive, so it is not that easy to find a convincing example, but I'm sure UDF developers may need some additional information to communicate metadata from one UDF to another.

It does not make sense if you think of "one UDF function", but it is a way to have two coordinated UDFs communicating.

For instance, the developer of a jdbc pig "connector" will typically write a UDF load and a UDF store. What if he wants the loader to discover the field collection (case 3, self describing data, on Alan's page) from jdbc and propagate the exact column type of a given field (as in "VARCHAR(42)") so the UDF store can create it the right way? Or the table name? Or the fact that a column is indexed, a primary key, a foreign key constraint, some encoding info... He may also want to develop a UDF pipeline function that would perform some foreign key validation against the database at some point in his script. Having the information in the metadata may be useful.
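
Very roughly, something like this (the key names and classes are invented for the example):

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Invented sketch of a jdbc load/store pair coordinating through
// per-field metadata.
public class JdbcMetadata {
    public static void main(String[] args) {
        // The UDF load discovers column details from the database and
        // records them on the field it emits.
        Map<String, Serializable> field = new HashMap<String, Serializable>();
        field.put("jdbc.sqlType", "VARCHAR(42)");
        field.put("jdbc.table", "customers");
        field.put("jdbc.indexed", Boolean.TRUE);

        // The UDF store reads the same keys back to create the column
        // the right way on the other side.
        String ddl = "CREATE TABLE customers_copy (name "
                + field.get("jdbc.sqlType") + ")";
        System.out.println(ddl);
    }
}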

Some other fields of application that we cannot think of today may need some completely different metadata. My whole point is: Pig should provide some metadata extension point.

On 30 May 08, at 13:54, pi song wrote:


I don't get it, Mathieu. UDF is a very broad term. It could be UDF Load, UDF Store, or UDF as a function in the pipeline. Can you explain a bit more?

On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:

All,

Looking at the very extensive list of types of file specific metadata, I think (from experience) that a UDF may need to attach some information (any information, actually) to a given field (or file) to be retrieved by another UDF downstream.

What about adding a Map<String, Serializable> to each file and each field?
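
Roughly (the interface name is invented, just to show the shape):

import java.io.Serializable;
import java.util.Map;

// Invented sketch of the extension point.
public interface MetadataCarrier {
    // Free-form metadata about the file as a whole.
    Map<String, Serializable> getFileMetadata();

    // Free-form metadata attached to one named field.
    Map<String, Serializable> getFieldMetadata(String fieldName);
}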

--
Mathieu

On 30 May 08, at 01:24, pi song wrote:


Alan,

I will start thinking about this as well. When do you want to start the implementation?

Pi

On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:


Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by AlanGates:
http://wiki.apache.org/pig/PigMetaData





------------------------------------------------------------------------------
information, histograms, etc.

== Pig Interface to File Specific Metadata ==
- Pig should support four options with regard to file specific metadata:
+ Pig should support four options with regard to reading file specific metadata:
 1.  No file specific metadata available.  Pig uses the file as input with no knowledge of its content.  All data is assumed to be !ByteArrays.
 2.  User provides schema in the script.  For example, `A = load 'myfile' as (a: chararray, b: int);`.
 3.  Self describing data.  Data may be in a format that describes the schema, such as JSON.  Users may also have other proprietary ways to store information about the data in a file either in the file itself or in an associated file.  Changes to the !LoadFunc interface made as part of the pipeline rework support this for data type and column layout only.  It will need to be expanded to support other types of information about the file.
 4.  Input from a data catalog.  Pig needs to be able to query an external data catalog to acquire information about a file.  All the same information available in option 3 should be available via this interface.  This interface does not yet exist and needs to be designed.

+ It should support options 3 and 4 for writing file specific metadata as well.
+
== Pig Interface to Global Metadata ==
- An interface will need to be designed for pig to interface to an external data catalog.
+ An interface will need to be designed for pig to read from and write to an external data catalog.

== Architecture of Pig Interface to External Data Catalog ==
Pig needs to be able to connect to various types of external data catalogs (databases, catalogs stored in flat files, web services, etc.).  To facilitate this
- pig will develop a generic interface that allows it to make specific types of queries to a data catalog.  Drivers will then need to be written to implement
+ pig will develop a generic interface that allows it to query and update a data catalog.  Drivers will then need to be written to implement
that interface and connect to a specific type of data catalog.

== Types of File Specific Metadata Pig Will Use ==
- Pig should be able to acquire the following types of information about a file via either self description or an external data catalog.  This is not to say
+ Pig should be able to acquire and record the following types of information about a file via either self description or an external data catalog.  This is not to say
that every self describing file or external data catalog must support every one of these items.  This is a list of items pig may find useful and should be
- able to query for.  If the metadata source cannot provide the information, pig will simply not make use of it.
+ able to query for and create.  If the metadata source cannot provide or store the information, pig will simply not make use of it or record it.
 * Field layout (already supported)
 * Field types (already supported)
 * Sortedness of the data, both key and direction (ascending/descending)
@@ -52, +54 @@

== Priorities ==
Given that the usage for global metadata is unclear, the priority will be placed on supporting file specific metadata.  The first step should be to define the
- interface changes in !LoadFunc and the interface to external data catalogs.
+ interface changes in !LoadFunc, !StoreFunc and the interface to external data catalogs.





