Mathieu, can you give us some more practical use cases?

Pi

On Fri, Jun 6, 2008 at 12:16 AM, pi song <[EMAIL PROTECTED]> wrote:
> Interestingly, somebody just requested a feature which is a sample use case:
>
> https://issues.apache.org/jira/browse/PIG-255
>
> On Tue, Jun 3, 2008 at 9:15 PM, pi song <[EMAIL PROTECTED]> wrote:
>> From my understanding, we are trying to create something like plan-scoped shared properties, right? Potentially, it's good to have.
>>
>> On Tue, Jun 3, 2008 at 2:20 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:
>>> Alan, Pi,
>>>
>>> This overall summary sounds good to me, yes.
>>>
>>> But I merely ask for a map to be propagated when it looks possible. I'm not too sure about the "canonical" metadata having to be stored in the map, actually. I think I would keep the canonical metadata as properties of a Schema bean, merely adding a Map<String, Serializable> UDMetadata to the list. Putting the canonical metadata inside the map would just make Pig's internal code more difficult to maintain and could lead to weird bugs when keys are overwritten... I prefer to let the UDF developer play in his own sandbox.
>>>
>>> Le 2 juin 08 à 17:56, Alan Gates a écrit :
>>>
>>>> Mathieu, let me make sure I understand what you're trying to say. Some file-level metadata is about the file as a whole, such as how many records are in the file. Some is about individual columns (such as column cardinality or value distribution histograms). You would like to see each stored in a map (one map for the file as a whole and one for each column). You could then "cheat" in the load functions you write for yourself and add values specific to your application to those maps. Is that correct?
>>>>
>>>> We will need to decide on a canonical set of metadata entries that the Pig engine can ask for. But allowing for optional settings in addition to these canonical values seems like a good idea. The Pig engine itself will only use the canonical set, but user-contributed load, store, and eval functions are free to communicate with each other via the optional set.
>>>>
>>>> To address Pi's point about columns being transformed, my assumption is that all this file-level metadata will be stored in (or at least referenced from) the Schema object. This can be set up so that metadata associated with a particular field survives projection, but does not survive being passed to an eval UDF, being used in an arithmetic expression, etc. As an eval UDF can generate a schema for its output, it could set any optional (or canonical) values it wanted, thus facilitating Mathieu's communication.
>>>>
>>>> Alan.
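To make the split being discussed here concrete, a field-level schema bean could look roughly like the sketch below: canonical metadata as typed properties the engine understands, plus an open UDMetadata map for cooperating UDFs. The class and member names are illustrative only; this is not Pig's actual Schema API.

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch only, not Pig's Schema class: canonical metadata as
    // typed bean properties, user-defined metadata in an open map.
    public class FieldSchema implements Serializable {

        // Canonical metadata the Pig engine itself knows how to use.
        private String alias;          // field name
        private byte type;             // one of Pig's type constants
        private long cardinality = -1; // -1 means "unknown"

        // Open extension point: anything a load/store/eval UDF wants to attach.
        private final Map<String, Serializable> udMetadata =
            new HashMap<String, Serializable>();

        public FieldSchema(String alias, byte type) {
            this.alias = alias;
            this.type = type;
        }

        public Map<String, Serializable> getUDMetadata() {
            return udMetadata;
        }

        // Getters and setters for the canonical properties elided.
    }

A loader could then call field.getUDMetadata().put("sqlType", "VARCHAR(42)") and a cooperating storer (or an eval UDF building its output schema) could read the value back, while the engine itself only ever consults the typed properties.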
>>>> pi song wrote:
>>>>> I love discussing new ideas, Mathieu. This is not bothering but interesting. My colleague spent some time doing a Microsoft SSIS thing that always breaks once there is a schema change and requires a manual script change. It seems like you are trying to go beyond that.
>>>>>
>>>>> On Sat, May 31, 2008 at 12:53 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:
>>>>>> Well, it adds a way to *dynamically* parameterize UDFs, without changing the pig script itself.
>>>>>>
>>>>>> I guess it comes back to the question of "how big a pig script is". If we are only considering 5-line pig scripts, where you load exactly what you need to compute, crunch the numbers and dump them, I agree it does not make much sense.
>>>>>>
>>>>>> If one starts thinking about something more ETL-ish (which I understand is not exactly the main purpose of pig), then one could want to use pig to "move" data around or load data from somewhere, do something "heavy" that ETL software just cannot cope with efficiently enough (build indexes, process images, whatever) and store the results somewhere else: a scenario where there can be fields that pig will just forward, without playing with them.
>>>>>>
>>>>>> I admit that my background, where we were using the same software for ETL-like stuff and heavy processing (that is, mostly building indexes), may give me a very biased opinion about pig and what it should be. But I would definitely like to use pig for what it is/will be excellent at, as well as for stuff where it will be just OK.
>>>>>>
>>>>>> So I still think the extension point is worth having. Half my brain is already thinking about ways of cheating and using Alan's field list to pass other stuff around...
>>>>>>
>>>>>> Another concrete example, and then I'll stop bothering you all :) In our tools, we use some field metadata to denote that a field's content is a primary key to a record. When we copy these field values somewhere else, we automatically tag them as a foreign key (instead of a primary key). When we dump the data to disk (to a final-user CD-ROM image in most cases), the fact that the column refers to a table that is also present on the disk can be stored automagically, as it is a feature of our final format: without the application developer having to re-specify the relations, the "UDF store equivalent" is clever enough to store the information.
>>>>>>
>>>>>> The script of the application developer who prepares a CD-ROM can be several screens long, with bits spread over separate files. The data model can be quite complex too. In this context, it is important that things like "this field acts as a record key" are said once.
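Read in terms of that per-field map, the CD-ROM example might look roughly like this on the store side: a cooperating loader (or an intermediate UDF) tags a field's role once, and the storer records the relation without the script author restating it. The helper class and the map keys below are purely hypothetical, not an existing Pig or ETL interface.

    import java.io.Serializable;
    import java.util.Map;

    // Hypothetical store-side helper: inspects the per-field metadata map and
    // records key relationships in the output format.
    public class KeyAwareStoreHelper {

        // Tags a cooperating loader or UDF might have set (illustrative keys).
        public static final String KEY_ROLE = "keyRole";   // "primary" or "foreign"
        public static final String REF_TABLE = "refTable"; // table a foreign key points to

        public void describeField(String fieldName, Map<String, Serializable> udMetadata) {
            Serializable role = udMetadata.get(KEY_ROLE);
            if ("primary".equals(role)) {
                // Mark this column as the record key in the output format.
                System.out.println(fieldName + " is a primary key");
            } else if ("foreign".equals(role)) {
                // Record the relation to the referenced table automatically.
                System.out.println(fieldName + " references table " + udMetadata.get(REF_TABLE));
            }
        }
    }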
>>>>>> Le 30 mai 08 à 16:13, pi song a écrit :
>>>>>>
>>>>>>> Moreover, adding metadata is conceptually adding another way to parameterize load/store functions. Making UDFs be parameterized by other UDFs is therefore also possible functionally, but I just couldn't think of any good use cases.
>>>>>>>
>>>>>>> On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]> wrote:
>>>>>>>> Just out of curiosity: if you say the UDF store in your example can somehow "learn" from the UDF load, that information still might not be useful, because between "load" and "store" you've got processing logic which might or might not alter the validity of information transferred directly from "load" to "store". An example: I load a list of numbers and then convert them to strings. The information on the UDF store side is then no longer applicable.
>>>>>>>>
>>>>>>>> Don't you think the cases where this concept can be useful are very rare?
>>>>>>>>
>>>>>>>> Pi
>>>>>>>>
>>>>>>>> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:
>>>>>>>>> Pi,
>>>>>>>>>
>>>>>>>>> Well... I was thinking of the three of them, actually. Alan's list is quite comprehensive, so it is not that easy to find a convincing example, but I'm sure UDF developers may need some additional information to communicate metadata from one UDF to another.
>>>>>>>>>
>>>>>>>>> It does not make sense if you think of a single UDF, but it is a way to have two coordinated UDFs communicating.
>>>>>>>>>
>>>>>>>>> For instance, the developer of a JDBC pig "connector" will typically write a UDF load and a UDF store. What if he wants the loader to discover the field collection (case 3, self-describing data, on Alan's page) from JDBC and propagate the exact column type of a given field (as in "VARCHAR(42)"), so the UDF store can create it the right way? Or the table name? Or the fact that a column is indexed, a primary key, a foreign key constraint, some encoding info... He may also want to develop a UDF pipeline function that would perform some foreign key validation against the database at some point in his script. Having the information in the metadata may be useful.
>>>>>>>>>
>>>>>>>>> Some other fields of application we cannot think of today may need some completely different metadata. My whole point is: Pig should provide some metadata extension point.
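Sketched in code, the JDBC scenario could work roughly as follows: the load side reads the declared column types from JDBC metadata and stashes them in the per-field maps, and the store side uses them when building the DDL for the target table. The class, the "sqlType" key, and the way the per-field maps are handed around are assumptions for illustration; no such connector exists in Pig.

    import java.io.Serializable;
    import java.sql.Connection;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.List;
    import java.util.Map;

    // Illustrative JDBC load/store pair communicating through per-field metadata.
    public class JdbcMetadataExample {

        static final String SQL_TYPE = "sqlType"; // hypothetical key, e.g. "VARCHAR(42)"

        // Load side: record the declared type of each column of the source table.
        static void recordColumnTypes(Connection conn, String table,
                                      List<Map<String, Serializable>> perFieldMetadata)
                throws SQLException {
            Statement st = conn.createStatement();
            try {
                ResultSetMetaData md =
                    st.executeQuery("SELECT * FROM " + table + " WHERE 1=0").getMetaData();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    String declared = md.getColumnTypeName(i);        // e.g. "VARCHAR"
                    if ("VARCHAR".equalsIgnoreCase(declared)) {
                        declared += "(" + md.getPrecision(i) + ")";   // e.g. "VARCHAR(42)"
                    }
                    perFieldMetadata.get(i - 1).put(SQL_TYPE, declared);
                }
            } finally {
                st.close();
            }
        }

        // Store side: fall back to a generic type only when the loader left no hint.
        static String columnDdl(String fieldName, Map<String, Serializable> udMetadata) {
            Object declared = udMetadata.get(SQL_TYPE);
            return fieldName + " " + (declared != null ? declared : "VARCHAR(255)");
        }
    }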
>>>>>>>>> Le 30 mai 08 à 13:54, pi song a écrit :
>>>>>>>>>
>>>>>>>>>> I don't get it, Mathieu. UDF is a very broad term. It could be a UDF load, a UDF store, or a UDF used as a function in the pipeline. Can you explain a bit more?
>>>>>>>>>>
>>>>>>>>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>> All,
>>>>>>>>>>>
>>>>>>>>>>> Looking at the very extensive list of types of file specific metadata, I think (from experience) that a UDF may need to attach some information (any information, actually) to a given field (or file), to be retrieved by another UDF downstream.
>>>>>>>>>>>
>>>>>>>>>>> What about adding a Map<String, Serializable> to each file and each field?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Mathieu
>>>>>>>>>>>
>>>>>>>>>>> Le 30 mai 08 à 01:24, pi song a écrit :
>>>>>>>>>>>
>>>>>>>>>>>> Alan,
>>>>>>>>>>>>
>>>>>>>>>>>> I will start thinking about this as well. When do you want to start the implementation?
>>>>>>>>>>>>
>>>>>>>>>>>> Pi
>>>>>>>>>>>>
>>>>>>>>>>>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>>>> Dear Wiki user,
>>>>>>>>>>>>>
>>>>>>>>>>>>> You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The following page has been changed by AlanGates:
>>>>>>>>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>>>>>>>>
>>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>>   information, histograms, etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   == Pig Interface to File Specific Metadata ==
>>>>>>>>>>>>> - Pig should support four options with regard to file specific metadata:
>>>>>>>>>>>>> + Pig should support four options with regard to reading file specific metadata:
>>>>>>>>>>>>>    1. No file specific metadata available. Pig uses the file as input with no knowledge of its content. All data is assumed to be !ByteArrays.
>>>>>>>>>>>>>    2. User provides schema in the script. For example, `A = load 'myfile' as (a: chararray, b: int);`.
>>>>>>>>>>>>>    3. Self describing data. Data may be in a format that describes the schema, such as JSON. Users may also have other proprietary ways to store information about the data in a file, either in the file itself or in an associated file. Changes to the !LoadFunc interface made as part of the pipeline rework support this for data type and column layout only. It will need to be expanded to support other types of information about the file.
>>>>>>>>>>>>>    4. Input from a data catalog. Pig needs to be able to query an external data catalog to acquire information about a file. All the same information available in option 3 should be available via this interface. This interface does not yet exist and needs to be designed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> + It should support options 3 and 4 for writing file specific metadata as well.
>>>>>>>>>>>>> +
>>>>>>>>>>>>>   == Pig Interface to Global Metadata ==
>>>>>>>>>>>>> - An interface will need to be designed for pig to interface to an external data catalog.
>>>>>>>>>>>>> + An interface will need to be designed for pig to read from and write to an external data catalog.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   == Architecture of Pig Interface to External Data Catalog ==
>>>>>>>>>>>>>   Pig needs to be able to connect to various types of external data catalogs (databases, catalogs stored in flat files, web services, etc.). To facilitate this
>>>>>>>>>>>>> - pig will develop a generic interface that allows it to make specific types of queries to a data catalog. Drivers will then need to be written to implement
>>>>>>>>>>>>> + pig will develop a generic interface that allows it to query and update a data catalog. Drivers will then need to be written to implement
>>>>>>>>>>>>>   that interface and connect to a specific type of data catalog.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   == Types of File Specific Metadata Pig Will Use ==
>>>>>>>>>>>>> - Pig should be able to acquire the following types of information about a file via either self description or an external data catalog. This is not to say
>>>>>>>>>>>>> + Pig should be able to acquire and record the following types of information about a file via either self description or an external data catalog. This is not to say
>>>>>>>>>>>>>   that every self describing file or external data catalog must support every one of these items. This is a list of items pig may find useful and should be
>>>>>>>>>>>>> - able to query for. If the metadata source cannot provide the information, pig will simply not make use of it.
>>>>>>>>>>>>> + able to query for and create. If the metadata source cannot provide or store the information, pig will simply not make use of it or record it.
>>>>>>>>>>>>>    * Field layout (already supported)
>>>>>>>>>>>>>    * Field types (already supported)
>>>>>>>>>>>>>    * Sortedness of the data, both key and direction (ascending/descending)
>>>>>>>>>>>>> @@ -52, +54 @@
>>>>>>>>>>>>>
>>>>>>>>>>>>>   == Priorities ==
>>>>>>>>>>>>>   Given that the usage for global metadata is unclear, the priority will be placed on supporting file specific metadata. The first step should be to define the
>>>>>>>>>>>>> - interface changes in !LoadFunc and the interface to external data catalogs.
>>>>>>>>>>>>> + interface changes in !LoadFunc, !StoreFunc and the interface to external data catalogs.
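For the "generic interface that allows it to query and update a data catalog" mentioned in the quoted wiki page, a driver contract might look something like the sketch below. It is only meant to make the discussion concrete; the interface name and methods are assumptions, not an existing or planned Pig API.

    import java.io.IOException;
    import java.io.Serializable;
    import java.util.Map;

    // Hypothetical contract for pluggable data catalog drivers (database-backed
    // catalog, flat-file catalog, web service, ...). Pig would code against this
    // interface; each backend would provide its own implementation.
    public interface DataCatalogDriver {

        // Open a connection to the catalog, e.g. from a URL plus properties.
        void connect(String catalogUrl, Map<String, String> properties) throws IOException;

        // Everything the catalog knows about one file: schema, sortedness, statistics, ...
        Map<String, Serializable> getFileMetadata(String fileName) throws IOException;

        // Per-column information such as cardinality or value distribution histograms.
        Map<String, Serializable> getColumnMetadata(String fileName, String columnName)
            throws IOException;

        // Record metadata back into the catalog when pig writes a file
        // (the "writing" half added in the diff above).
        void putFileMetadata(String fileName, Map<String, Serializable> metadata)
            throws IOException;

        // Release any resources held by the driver.
        void close() throws IOException;
    }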
