Interestingly, somebody just requested a feature which is a sample use case: https://issues.apache.org/jira/browse/PIG-255
On Tue, Jun 3, 2008 at 9:15 PM, pi song <[EMAIL PROTECTED]> wrote:

> From my understanding, we are trying to create something like plan-scoped
> shared properties, right? Potentially, it's good to have.
>
> On Tue, Jun 3, 2008 at 2:20 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]>
> wrote:
>
>> Alan, Pi,
>>
>> This overall summary sounds good to me, yes.
>>
>> But I merely ask for a map to be propagated when it looks possible. I'm
>> not too sure about the "canonical" metadata having to be stored in the
>> map, actually. I think I would keep the canonical metadata as properties
>> of a Schema bean, merely adding a Map<String, Serializable> UDMetadata
>> to the list. Putting the canonical metadata inside the map would just
>> make pig's internal code more difficult to maintain and could lead to
>> weird bugs when keys are overwritten... I prefer to let the UDF
>> developer play in his sandbox.
>>
>> On 2 Jun 08 at 17:56, Alan Gates wrote:
>>
>>> Mathieu, let me make sure I understand what you're trying to say. Some
>>> file-level metadata is about the file as a whole, such as how many
>>> records are in the file. Some is about individual columns (such as
>>> column cardinality or value distribution histograms). You would like
>>> to see each stored in a map (one map for the whole file and one for
>>> each column). You could then "cheat" in the load functions you write
>>> for yourself and add values specific to your application into those
>>> maps. Is that correct?
>>>
>>> We will need to decide on a canonical set of metadata entries that the
>>> pig engine can ask for. But allowing for optional settings in addition
>>> to these canonical values seems like a good idea. The pig engine
>>> itself will only utilize the canonical set, but user-contributed load,
>>> store, and eval functions are free to communicate with each other via
>>> the optional set.
>>> To address Pi's point about columns being transformed, my assumption
>>> would be that all this file-level metadata will be stored in (or at
>>> least referenced from) the Schema object. This can be set up so that
>>> metadata associated with a particular field survives projection, but
>>> not being passed to an eval UDF, being used in an arithmetic
>>> expression, etc. As the eval UDF can generate a schema for its output,
>>> it could set any optional (or canonical) values it wanted, thus
>>> facilitating Mathieu's communication.
>>>
>>> Alan.
>>>
>>> pi song wrote:
>>>
>>>> I love discussing new ideas, Mathieu. This is not bothering but
>>>> interesting. My colleague had spent some time doing a Microsoft SSIS
>>>> thing that always breaks once there is a schema change and requires a
>>>> manual script change. It seems like you are trying to go beyond that.
>>>>
>>>> On Sat, May 31, 2008 at 12:53 AM, Mathieu Poumeyrol
>>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Well, it adds a way to *dynamically* parameterize UDFs, without
>>>>> changing the pig script itself.
>>>>>
>>>>> I guess it comes back to the question of "how big a pig script is".
>>>>> If we are only considering 5-line pig scripts, where you load
>>>>> exactly what you need to compute, crunch numbers and dump them, I
>>>>> agree it does not make much sense.
>>>>>
>>>>> If one starts thinking about something more ETL-ish (which I
>>>>> understand is not exactly the main purpose of pig), then one could
>>>>> want to use pig to "move" data around or load data from somewhere,
>>>>> do something "heavy" that ETL software just cannot cope with
>>>>> efficiently enough (build indexes, process images, whatever) and
>>>>> store the results somewhere else: a scenario where there can be
>>>>> fields that pig will just forward, without touching them.
>>>>> I admit my background, where we were using the same software for
>>>>> ETL-like stuff and heavy processing (that is, mostly building
>>>>> indexes), may give me a very biased opinion about pig and what it
>>>>> should be. But I would definitely like to use pig for what it is and
>>>>> will be excellent at, as well as for stuff where it will be just OK.
>>>>>
>>>>> So I still think the extension point is worth having. Half my brain
>>>>> is already thinking about ways of cheating and using Alan's fields
>>>>> list to pass other stuff around...
>>>>>
>>>>> One more concrete example and then I'll stop bothering you all :) In
>>>>> our tools, we use some field metadata to denote that a field's
>>>>> content is a primary key to a record. When we copy that field's
>>>>> values somewhere else, we automatically tag them as a foreign key
>>>>> (instead of a primary key). When we dump the data to disk (to a
>>>>> final-user CD-ROM image in most cases), the fact that the column
>>>>> refers to a table also present on the disk can be stored
>>>>> automagically, as it is a feature of our final format: without the
>>>>> application developer re-specifying the relations, the "UDF store
>>>>> equivalent" is clever enough to store the information.
>>>>>
>>>>> The script of the application developer who prepares a CD-ROM can be
>>>>> several screens long, with bits spread across separate files. The
>>>>> data model can be quite complex too. In this context, it is
>>>>> important that things like "this field acts as a record key" are
>>>>> said once.
>>>>>
>>>>> On 30 May 08 at 16:13, pi song wrote:
>>>>>
>>>>>> More, adding metadata is conceptually adding another way to
>>>>>> parameterize load/store functions. Making UDFs parameterized by
>>>>>> other UDFs is therefore also possible functionally, but I just
>>>>>> couldn't think of any good use cases.
>>>>>> On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]>
>>>>>> wrote:
>>>>>>
>>>>>>> Just out of curiosity: you say that somehow the UDF store in your
>>>>>>> example can "learn" from the UDF load. That information still
>>>>>>> might not be useful, because between "load" and "store" you've
>>>>>>> got processing logic which might or might not alter the validity
>>>>>>> of information transferred directly from "load" to "store". An
>>>>>>> example would be: I load a list of numbers and then convert them
>>>>>>> to strings. The information on the UDF store side is then no
>>>>>>> longer applicable.
>>>>>>>
>>>>>>> Don't you think the cases where this concept can be useful are
>>>>>>> very rare?
>>>>>>>
>>>>>>> Pi
>>>>>>>
>>>>>>> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol <
>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Pi,
>>>>>>>>
>>>>>>>> Well... I was thinking of all three of them, actually. Alan's
>>>>>>>> list is quite comprehensive, so it is not that easy to find a
>>>>>>>> convincing example, but I'm sure UDF developers may need some
>>>>>>>> additional information to communicate metadata from one UDF to
>>>>>>>> another.
>>>>>>>>
>>>>>>>> It does not make sense if you think "one UDF function", but it
>>>>>>>> is a way to have two coordinated UDFs communicating.
>>>>>>>>
>>>>>>>> For instance, the developer of a jdbc pig "connector" will
>>>>>>>> typically write a UDF load and a UDF store. What if he wants the
>>>>>>>> loader to discover the field collection (case 3, "Self
>>>>>>>> describing data" on Alan's page) from jdbc and propagate the
>>>>>>>> exact column type of a given field (as in "VARCHAR(42)") to
>>>>>>>> create it the right way in the UDF store? Or the table name?
>>>>>>>> Or the fact that a column is indexed, a primary key, a foreign
>>>>>>>> key constraint, some encoding info... He may also want to
>>>>>>>> develop a UDF pipeline function that would perform some foreign
>>>>>>>> key validation against the database at some point in his script.
>>>>>>>> Having the information in the metadata may be useful.
>>>>>>>>
>>>>>>>> Some other fields of application we cannot think of today may
>>>>>>>> need some completely different metadata. My whole point is: Pig
>>>>>>>> should provide some metadata extension point.
>>>>>>>>
>>>>>>>> On 30 May 08 at 13:54, pi song wrote:
>>>>>>>>
>>>>>>>>> I don't get it, Mathieu. UDF is a very broad term. It could be
>>>>>>>>> a UDF load, a UDF store, or a UDF as a function in the
>>>>>>>>> pipeline. Can you explain a bit more?
>>>>>>>>>
>>>>>>>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <
>>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>> All,
>>>>>>>>>>
>>>>>>>>>> Looking at the very extensive list of types of file-specific
>>>>>>>>>> metadata, I think (from experience) that a UDF may need to
>>>>>>>>>> attach some information (any information, actually) to a given
>>>>>>>>>> field (or file), to be retrieved by another UDF downstream.
>>>>>>>>>>
>>>>>>>>>> What about adding a Map<String, Serializable> to each file and
>>>>>>>>>> each field?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Mathieu
>>>>>>>>>>
>>>>>>>>>> On 30 May 08 at 01:24, pi song wrote:
>>>>>>>>>>
>>>>>>>>>>> Alan,
>>>>>>>>>>>
>>>>>>>>>>> I will start thinking about this as well. When do you want to
>>>>>>>>>>> start the implementation?
>>>>>>>>>>> Pi
>>>>>>>>>>>
>>>>>>>>>>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear Wiki user,
>>>>>>>>>>>>
>>>>>>>>>>>> You have subscribed to a wiki page or wiki category on "Pig
>>>>>>>>>>>> Wiki" for change notification.
>>>>>>>>>>>>
>>>>>>>>>>>> The following page has been changed by AlanGates:
>>>>>>>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>   information, histograms, etc.
>>>>>>>>>>>>
>>>>>>>>>>>>   == Pig Interface to File Specific Metadata ==
>>>>>>>>>>>> - Pig should support four options with regard to file
>>>>>>>>>>>>   specific metadata:
>>>>>>>>>>>> + Pig should support four options with regard to reading
>>>>>>>>>>>>   file specific metadata:
>>>>>>>>>>>>   1. No file specific metadata available. Pig uses the file
>>>>>>>>>>>>   as input with no knowledge of its content. All data is
>>>>>>>>>>>>   assumed to be !ByteArrays.
>>>>>>>>>>>>   2. User provides schema in the script. For example, `A =
>>>>>>>>>>>>   load 'myfile' as (a: chararray, b: int);`.
>>>>>>>>>>>>   3. Self describing data. Data may be in a format that
>>>>>>>>>>>>   describes the schema, such as JSON. Users may also have
>>>>>>>>>>>>   other proprietary ways to store information about the data
>>>>>>>>>>>>   in a file, either in the file itself or in an associated
>>>>>>>>>>>>   file. Changes to the !LoadFunc interface made as part of
>>>>>>>>>>>>   the pipeline rework support this for data type and column
>>>>>>>>>>>>   layout only. It will need to be expanded to support other
>>>>>>>>>>>>   types of information about the file.
>>>>>>>>>>>>   4. Input from a data catalog. Pig needs to be able to
>>>>>>>>>>>>   query an external data catalog to acquire information
>>>>>>>>>>>>   about a file. All the same information available in option
>>>>>>>>>>>>   3 should be available via this interface. This interface
>>>>>>>>>>>>   does not yet exist and needs to be designed.
>>>>>>>>>>>>
>>>>>>>>>>>> + It should support options 3 and 4 for writing file
>>>>>>>>>>>>   specific metadata as well.
>>>>>>>>>>>> +
>>>>>>>>>>>>   == Pig Interface to Global Metadata ==
>>>>>>>>>>>> - An interface will need to be designed for pig to interface
>>>>>>>>>>>>   to an external data catalog.
>>>>>>>>>>>> + An interface will need to be designed for pig to read from
>>>>>>>>>>>>   and write to an external data catalog.
>>>>>>>>>>>>
>>>>>>>>>>>>   == Architecture of Pig Interface to External Data Catalog ==
>>>>>>>>>>>>   Pig needs to be able to connect to various types of
>>>>>>>>>>>>   external data catalogs (databases, catalogs stored in flat
>>>>>>>>>>>>   files, web services, etc.). To facilitate this
>>>>>>>>>>>> - pig will develop a generic interface that allows it to
>>>>>>>>>>>>   make specific types of queries to a data catalog. Drivers
>>>>>>>>>>>>   will then need to be written to implement
>>>>>>>>>>>> + pig will develop a generic interface that allows it to
>>>>>>>>>>>>   query and update a data catalog. Drivers will then need to
>>>>>>>>>>>>   be written to implement
>>>>>>>>>>>>   that interface and connect to a specific type of data
>>>>>>>>>>>>   catalog.
>>>>>>>>>>>>
>>>>>>>>>>>>   == Types of File Specific Metadata Pig Will Use ==
>>>>>>>>>>>> - Pig should be able to acquire the following types of
>>>>>>>>>>>>   information about a file via either self description or an
>>>>>>>>>>>>   external data catalog. This is not to say
>>>>>>>>>>>> + Pig should be able to acquire and record the following
>>>>>>>>>>>>   types of information about a file via either self
>>>>>>>>>>>>   description or an external data catalog. This is not to say
>>>>>>>>>>>>   that every self describing file or external data catalog
>>>>>>>>>>>>   must support every one of these items. This is a list of
>>>>>>>>>>>>   items pig may find useful and should be
>>>>>>>>>>>> - able to query for. If the metadata source cannot provide
>>>>>>>>>>>>   the information, pig will simply not make use of it.
>>>>>>>>>>>> + able to query for and create. If the metadata source
>>>>>>>>>>>>   cannot provide or store the information, pig will simply
>>>>>>>>>>>>   not make use of it or record it.
>>>>>>>>>>>>   * Field layout (already supported)
>>>>>>>>>>>>   * Field types (already supported)
>>>>>>>>>>>>   * Sortedness of the data, both key and direction
>>>>>>>>>>>>   (ascending/descending)
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -52, +54 @@
>>>>>>>>>>>>
>>>>>>>>>>>>   == Priorities ==
>>>>>>>>>>>>   Given that the usage for global metadata is unclear, the
>>>>>>>>>>>>   priority will be placed on supporting file specific
>>>>>>>>>>>>   metadata. The first step should be to define the
>>>>>>>>>>>> - interface changes in !LoadFunc and the interface to
>>>>>>>>>>>>   external data catalogs.
>>>>>>>>>>>> + interface changes in !LoadFunc, !StoreFunc and the
>>>>>>>>>>>>   interface to external data catalogs.
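To make the proposal in this thread concrete: below is a minimal Java sketch of what the `Map<String, Serializable>` extension point could look like, following Mathieu's JDBC connector example (a loader records the exact SQL column type, a coordinated store function reads it back). This is not Pig's actual API; `FieldSchema`, `udMetadata`, and the `jdbc.*` keys are all illustrative names invented for this sketch.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class UdfMetadataSketch {

    // Stand-in for a field entry in Pig's Schema bean. As suggested in the
    // thread, canonical metadata stays as typed bean properties, while
    // user-defined metadata lives in a separate "sandbox" map so UDF keys
    // can never collide with (or overwrite) engine-managed values.
    static class FieldSchema {
        final String alias;
        final Map<String, Serializable> udMetadata =
                new HashMap<String, Serializable>();

        FieldSchema(String alias) {
            this.alias = alias;
        }
    }

    public static void main(String[] args) {
        // A hypothetical JDBC loader discovers the field collection and
        // records information Pig itself has no canonical slot for...
        FieldSchema field = new FieldSchema("name");
        field.udMetadata.put("jdbc.sqlType", "VARCHAR(42)");
        field.udMetadata.put("jdbc.isPrimaryKey", Boolean.TRUE);

        // ...and a coordinated store function downstream reads it back to
        // create the target column the right way.
        System.out.println(field.udMetadata.get("jdbc.sqlType"));
    }
}
```

Because the values are `Serializable`, such a map could survive being shipped with the plan to the backend; a downstream eval UDF that emits its own output schema would copy or rewrite entries (for example, turning a `primaryKey` tag into a `foreignKey` tag), addressing Pi's concern that metadata blindly forwarded from load to store may no longer be valid.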
