From my understanding, we are trying to create something like plan-scoped shared properties, right? Potentially, it's good to have.
On Tue, Jun 3, 2008 at 2:20 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:

> Alan, Pi,
>
> This overall summary sounds good to me, yes.
>
> But I merely ask for a map to be propagated when it looks possible. I'm not
> too sure about the "canonical" metadata having to be stored in the map,
> actually. I think I would keep the canonical metadata as properties of a
> Schema bean, merely adding a Map<String, Serializable> UDMetadata to the
> list. Putting the canonical metadata inside the map would just make pig's
> internal code more difficult to maintain and could lead to weird bugs when
> keys are overwritten... I prefer to let the UDF developer play in his
> sandbox.
>
> On Jun 2, 08, at 17:56, Alan Gates wrote:
>
>> Mathieu, let me make sure I understand what you're trying to say. Some
>> file-level metadata is about the file as a whole, such as how many records
>> are in the file. Some is about individual columns (such as column
>> cardinality or value distribution histograms). You would like to see each
>> stored in a map (one map for the file as a whole and one for each column).
>> You could then "cheat" in the load functions you write for yourself and
>> add values specific to your application into those maps. Is that correct?
>>
>> We will need to decide on a canonical set of metadata entries that the pig
>> engine can ask for. But allowing for optional settings in addition to
>> these canonical values seems like a good idea. The pig engine itself will
>> only utilize the canonical set. But user-contributed load, store, and eval
>> functions are free to communicate with each other via the optional set.
>>
>> To address Pi's point about columns being transformed, my assumption would
>> be that all this file-level metadata will be stored in (or at least
>> referenced from) the Schema object. This can be set so that metadata
>> associated with a particular field survives projection, but not being
>> passed to an eval UDF, being used in an arithmetic expression, etc.
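Mathieu's proposal above (canonical metadata as typed bean properties, plus a separate `Map<String, Serializable>` sandbox for UDF authors) could be sketched roughly like this. The class shape and property names are illustrative assumptions, not Pig's actual Schema API:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the Schema bean shape Mathieu describes:
// canonical metadata lives in typed properties the engine understands,
// while UDF-specific metadata goes into its own map so stray keys
// cannot clobber engine state.
public class FieldSchema {
    private final String name;   // canonical: field name
    private final byte type;     // canonical: field type code

    // Sandbox for UDF developers; the engine never reads these keys.
    private final Map<String, Serializable> udMetadata = new HashMap<>();

    public FieldSchema(String name, byte type) {
        this.name = name;
        this.type = type;
    }

    public String getName() { return name; }

    public byte getType() { return type; }

    public Map<String, Serializable> getUDMetadata() { return udMetadata; }
}
```

The point of the split is the one Mathieu makes: a UDF author can put anything into `getUDMetadata()` without any risk of overwriting a key the engine relies on.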
>> As the eval UDF can generate a schema for its output, it could set any
>> optional (or canonical) values it wanted, thus facilitating Mathieu's
>> communication.
>>
>> Alan.
>>
>> pi song wrote:
>>
>>> I love discussing new ideas, Mathieu. This is not bothering but
>>> interesting. My colleague had spent some time doing a Microsoft SSIS
>>> thing that always breaks once there is a schema change and requires a
>>> manual script change. Seems like you are trying to go beyond that.
>>>
>>> On Sat, May 31, 2008 at 12:53 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Well, it adds a way to *dynamically* parameterize UDFs, without
>>>> changing the pig script itself.
>>>>
>>>> I guess it comes back to the questions about "how big a pig script
>>>> is". If we are only considering 5-line pig scripts, where you load
>>>> exactly what you need to compute, crunch numbers and dump them, I
>>>> agree it does not make much sense.
>>>>
>>>> If one starts thinking about something more ETL-ish (which I
>>>> understand is not exactly the main purpose of pig) then one could want
>>>> to use pig to "move" data around or load data from somewhere, do
>>>> something "heavy" that ETL software can just not cope with efficiently
>>>> enough (build indexes, process images, whatever) and store the results
>>>> somewhere else, a scenario where there can be fields that pig will
>>>> just forward, without playing with them.
>>>>
>>>> I admit my background, where we were using the same software for
>>>> ETL-like stuff and heavy processing (that is, mostly building
>>>> indexes), may give me a very biased opinion about pig and what it
>>>> should be. But I would definitely like to use pig for what it is/will
>>>> be excellent for, as well as for stuff where it will be just ok.
>>>>
>>>> So I still think the extension point is worth having.
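Alan's propagation rule (metadata attached to a field survives projection, but is dropped through an eval UDF unless the UDF's output schema re-publishes it) can be illustrated with a small sketch. This is not Pig's planner code; the method names and the plain maps standing in for schema metadata are assumptions:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the rule Alan describes: projection carries a
// field's metadata map along unchanged, while an eval UDF starts from
// an empty output map and must explicitly set any entries it wants
// downstream UDFs to see.
public class MetadataPropagation {

    // Projection: the output field is the same field, so its metadata
    // survives intact.
    static Map<String, Serializable> project(Map<String, Serializable> in) {
        return new HashMap<>(in);
    }

    // Eval UDF: the output is a new value, so metadata survives only if
    // the UDF generates an output schema that sets it again.
    static Map<String, Serializable> throughEvalUdf(
            Map<String, Serializable> in, boolean udfRepublishes) {
        Map<String, Serializable> out = new HashMap<>();
        if (udfRepublishes) {
            out.putAll(in); // the UDF chose to forward the optional values
        }
        return out;
    }
}
```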
>>>> Half my brain is already thinking about ways of cheating and using
>>>> Alan's fields list to pass other stuff around...
>>>>
>>>> Another concrete example and I stop bothering you all, then :) In our
>>>> tools, we are using some field metadata to denote that a field's
>>>> content is a primary key to a record. When we copy these field values
>>>> to somewhere else, we automatically tag them as a foreign key (instead
>>>> of primary). When we dump the data on disk (to a final-user CDROM
>>>> image in most cases) the fact that the column refers to a table
>>>> present on the disk too can be automagically stored, as it is a
>>>> feature of our final format: without the application developer
>>>> re-specifying the relations, the "UDF store equivalent" is clever
>>>> enough to store the information.
>>>>
>>>> The script of the application developer who prepares a CDROM can be
>>>> several screens long, with bits spread over separate files. The data
>>>> model could be quite complex too. In this context, it is important
>>>> that things like "this field acts as a record key" are said once.
>>>>
>>>> On May 30, 08, at 16:13, pi song wrote:
>>>>
>>>>> More, adding metadata is conceptually adding another way to
>>>>> parameterize load/store functions. Making UDFs be parameterized by
>>>>> other UDFs is therefore also possible functionally, but I just
>>>>> couldn't think of any good use cases.
>>>>>
>>>>> On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Just out of curiosity. If you say somehow the UDF store in your
>>>>>> example can "learn" from the UDF load, that information still might
>>>>>> not be useful, because between "load" and "store" you've got
>>>>>> processing logic which might or might not alter the validity of
>>>>>> information directly transferred from "load" to "store".
>>>>>> An example would be: I load a list of numbers and then convert them
>>>>>> to strings. Then the information on the UDF store side is no longer
>>>>>> applicable.
>>>>>>
>>>>>> Don't you think the cases where this concept can be useful are very
>>>>>> rare?
>>>>>>
>>>>>> Pi
>>>>>>
>>>>>> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol
>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Pi,
>>>>>>>
>>>>>>> Well... I was thinking... the three of them actually. Alan's list
>>>>>>> is quite comprehensive, so it is not that easy to find a convincing
>>>>>>> example, but I'm sure UDF developers may need some additional
>>>>>>> information to communicate metadata from one UDF to another.
>>>>>>>
>>>>>>> It does not make sense if you think "one UDF function", but it is a
>>>>>>> way to have two coordinated UDFs communicating.
>>>>>>>
>>>>>>> For instance, the developer of a jdbc pig "connector" will
>>>>>>> typically write a UDF load and a UDF store. What if he wants the
>>>>>>> loader to discover the field collection (case 3, self-describing
>>>>>>> data in Alan's page) from jdbc and propagate the exact column type
>>>>>>> of a given field (as in "VARCHAR(42)"), to create it the right way
>>>>>>> in the UDF store? Or the table name? Or the fact that a column is
>>>>>>> indexed, a primary key, a foreign key constraint, some encoding
>>>>>>> info... He may also want to develop a UDF pipeline function that
>>>>>>> would perform some foreign key validation against the database at
>>>>>>> some point in his script. Having the information in the metadata
>>>>>>> may be useful.
>>>>>>>
>>>>>>> Some other fields of application we cannot think of today may need
>>>>>>> some completely different metadata. My whole point is: Pig should
>>>>>>> provide some metadata extension point.
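Mathieu's jdbc scenario could be sketched as below. Both classes and the key names ("jdbc.sqltype", "jdbc.pk") are invented for illustration; nothing here is part of a real Pig interface, it only shows a loader recording catalog details in a field metadata map and a store reading them back to emit exact DDL:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical jdbc connector pair, coordinated through per-field
// metadata rather than through extra script parameters.
public class JdbcConnectorSketch {

    // The loader discovers column details from the database catalog and
    // records them in the per-field metadata map.
    static Map<String, Serializable> loadSideMetadata() {
        Map<String, Serializable> m = new HashMap<>();
        m.put("jdbc.sqltype", "VARCHAR(42)");
        m.put("jdbc.pk", Boolean.TRUE);
        return m;
    }

    // The store side reads the same keys to emit an exact DDL column
    // definition, instead of guessing a width from pig's chararray.
    static String columnDdl(String name, Map<String, Serializable> m) {
        String type = (String) m.getOrDefault("jdbc.sqltype", "VARCHAR(255)");
        boolean pk = Boolean.TRUE.equals(m.get("jdbc.pk"));
        return name + " " + type + (pk ? " PRIMARY KEY" : "");
    }
}
```

A pipeline UDF doing foreign key validation would consume the same keys, which is the "two coordinated UDFs communicating" point.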
>>>>>>> On May 30, 08, at 13:54, pi song wrote:
>>>>>>>
>>>>>>>> I don't get it, Mathieu. UDF is a very broad term. It could be UDF
>>>>>>>> Load, UDF Store, or UDF as a function in the pipeline. Can you
>>>>>>>> explain a bit more?
>>>>>>>>
>>>>>>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol
>>>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> All,
>>>>>>>>>
>>>>>>>>> Looking at the very extensive list of types of file-specific
>>>>>>>>> metadata, I think (from experience) that a UDF function may need
>>>>>>>>> to attach some information (any information, actually) to a given
>>>>>>>>> field (or file), to be retrieved by another UDF downstream.
>>>>>>>>>
>>>>>>>>> What about adding a Map<String, Serializable> to each file and
>>>>>>>>> each field?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mathieu
>>>>>>>>>
>>>>>>>>> On May 30, 08, at 01:24, pi song wrote:
>>>>>>>>>
>>>>>>>>>> Alan,
>>>>>>>>>>
>>>>>>>>>> I will start thinking about this as well. When do you want to
>>>>>>>>>> start the implementation?
>>>>>>>>>>
>>>>>>>>>> Pi
>>>>>>>>>>
>>>>>>>>>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear Wiki user,
>>>>>>>>>>>
>>>>>>>>>>> You have subscribed to a wiki page or wiki category on
>>>>>>>>>>> "Pig Wiki" for change notification.
>>>>>>>>>>>
>>>>>>>>>>> The following page has been changed by AlanGates:
>>>>>>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>   information, histograms, etc.
>>>>>>>>>>>
>>>>>>>>>>>   == Pig Interface to File Specific Metadata ==
>>>>>>>>>>> - Pig should support four options with regard to file specific
>>>>>>>>>>>   metadata:
>>>>>>>>>>> + Pig should support four options with regard to reading file
>>>>>>>>>>>   specific metadata:
>>>>>>>>>>>   1. No file specific metadata available. Pig uses the file as
>>>>>>>>>>>   input with no knowledge of its content. All data is assumed
>>>>>>>>>>>   to be !ByteArrays.
>>>>>>>>>>>   2. User provides schema in the script. For example,
>>>>>>>>>>>   `A = load 'myfile' as (a: chararray, b: int);`.
>>>>>>>>>>>   3. Self describing data. Data may be in a format that
>>>>>>>>>>>   describes the schema, such as JSON. Users may also have other
>>>>>>>>>>>   proprietary ways to store information about the data in a
>>>>>>>>>>>   file, either in the file itself or in an associated file.
>>>>>>>>>>>   Changes to the !LoadFunc interface made as part of the
>>>>>>>>>>>   pipeline rework support this for data type and column layout
>>>>>>>>>>>   only. It will need to be expanded to support other types of
>>>>>>>>>>>   information about the file.
>>>>>>>>>>>   4. Input from a data catalog. Pig needs to be able to query
>>>>>>>>>>>   an external data catalog to acquire information about a file.
>>>>>>>>>>>   All the same information available in option 3 should be
>>>>>>>>>>>   available via this interface. This interface does not yet
>>>>>>>>>>>   exist and needs to be designed.
>>>>>>>>>>>
>>>>>>>>>>> + It should support options 3 and 4 for writing file specific
>>>>>>>>>>>   metadata as well.
>>>>>>>>>>> +
>>>>>>>>>>>   == Pig Interface to Global Metadata ==
>>>>>>>>>>> - An interface will need to be designed for pig to interface to
>>>>>>>>>>>   an external data catalog.
>>>>>>>>>>> + An interface will need to be designed for pig to read from
>>>>>>>>>>>   and write to an external data catalog.
>>>>>>>>>>>
>>>>>>>>>>>   == Architecture of Pig Interface to External Data Catalog ==
>>>>>>>>>>>   Pig needs to be able to connect to various types of external
>>>>>>>>>>>   data catalogs (databases, catalogs stored in flat files, web
>>>>>>>>>>>   services, etc.). To facilitate this
>>>>>>>>>>> - pig will develop a generic interface that allows it to make
>>>>>>>>>>>   specific types of queries to a data catalog. Drivers will
>>>>>>>>>>>   then need to be written to implement
>>>>>>>>>>> + pig will develop a generic interface that allows it to query
>>>>>>>>>>>   and update a data catalog. Drivers will then need to be
>>>>>>>>>>>   written to implement
>>>>>>>>>>>   that interface and connect to a specific type of data
>>>>>>>>>>>   catalog.
>>>>>>>>>>>
>>>>>>>>>>>   == Types of File Specific Metadata Pig Will Use ==
>>>>>>>>>>> - Pig should be able to acquire the following types of
>>>>>>>>>>>   information about a file via either self description or an
>>>>>>>>>>>   external data catalog. This is not to say
>>>>>>>>>>> + Pig should be able to acquire and record the following types
>>>>>>>>>>>   of information about a file via either self description or an
>>>>>>>>>>>   external data catalog. This is not to say
>>>>>>>>>>>   that every self describing file or external data catalog must
>>>>>>>>>>>   support every one of these items. This is a list of items pig
>>>>>>>>>>>   may find useful and should be
>>>>>>>>>>> - able to query for. If the metadata source cannot provide the
>>>>>>>>>>>   information, pig will simply not make use of it.
>>>>>>>>>>> + able to query for and create.
>>>>>>>>>>> + If the metadata source cannot provide or store the
>>>>>>>>>>>   information, pig will simply not make use of it or record it.
>>>>>>>>>>>   * Field layout (already supported)
>>>>>>>>>>>   * Field types (already supported)
>>>>>>>>>>>   * Sortedness of the data, both key and direction
>>>>>>>>>>>   (ascending/descending)
>>>>>>>>>>> @@ -52, +54 @@
>>>>>>>>>>>
>>>>>>>>>>>   == Priorities ==
>>>>>>>>>>>   Given that the usage for global metadata is unclear, the
>>>>>>>>>>>   priority will be placed on supporting file specific metadata.
>>>>>>>>>>>   The first step should be to define the
>>>>>>>>>>> - interface changes in !LoadFunc and the interface to external
>>>>>>>>>>>   data catalogs.
>>>>>>>>>>> + interface changes in !LoadFunc, !StoreFunc and the interface
>>>>>>>>>>>   to external data catalogs.
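The wiki page says the external data catalog interface "does not yet exist and needs to be designed", but its described shape (a generic query-and-update interface with per-catalog drivers) could look roughly like this. All names here are assumptions, including the trivial in-memory driver standing in for a database- or web-service-backed one:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the driver shape the wiki page calls for: pig would
// program against the generic interface, with one driver per kind of
// catalog (database, flat files, web service, etc.).
public interface DataCatalogDriver {

    // Query side: fetch whatever file-level metadata the catalog has.
    Map<String, Serializable> getFileMetadata(String fileName);

    // Update side: record metadata when pig writes a file (the revised
    // text extends options 3 and 4 to writing as well as reading).
    void putFileMetadata(String fileName, Map<String, Serializable> metadata);
}

// Minimal in-memory driver used only to make the sketch concrete.
class InMemoryCatalog implements DataCatalogDriver {
    private final Map<String, Map<String, Serializable>> store = new HashMap<>();

    public Map<String, Serializable> getFileMetadata(String fileName) {
        // Missing entries mean "no metadata": pig simply does not use it.
        return store.getOrDefault(fileName, new HashMap<>());
    }

    public void putFileMetadata(String fileName, Map<String, Serializable> metadata) {
        store.put(fileName, new HashMap<>(metadata));
    }
}
```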
