Interestingly, somebody just requested a feature which is a sample use case: https://issues.apache.org/jira/browse/PIG-255
On Tue, Jun 3, 2008 at 9:15 PM, pi song <[EMAIL PROTECTED]> wrote:

> From my understanding, we are trying to create something like plan-scoped
> shared properties, right? Potentially, it's good to have.
>
> On Tue, Jun 3, 2008 at 2:20 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]>
> wrote:
>
>> Alan, Pi,
>>
>> This overall summary sounds good to me, yes.
>>
>> But I merely ask for a map to be propagated when it looks possible. I'm
>> not too sure about the "canonical" metadata having to be stored in the
>> map, actually. I think I would keep the canonical metadata as properties
>> of a Schema bean, merely adding a Map<String, Serializable> UDMetadata
>> to the list. Putting the canonical metadata inside the map would just
>> make pig's internal code more difficult to maintain and could lead to
>> weird bugs when keys are overwritten... I prefer to let the UDF
>> developer play in his sandbox.
>>
>> On 2 Jun 08 at 17:56, Alan Gates wrote:
>>
>>> Mathieu, let me make sure I understand what you're trying to say. Some
>>> file-level metadata is about the file as a whole, such as how many
>>> records are in the file. Some is about individual columns (such as
>>> column cardinality or value distribution histograms). You would like
>>> to see each stored in a map (one map for the whole file and one for
>>> each column). You could then "cheat" in the load functions you write
>>> for yourself and add values specific to your application into those
>>> maps. Is that correct?
>>>
>>> We will need to decide on a canonical set of metadata entries that the
>>> pig engine can ask for. But allowing for optional settings in addition
>>> to these canonical values seems like a good idea. The pig engine
>>> itself will only utilize the canonical set, but user-contributed load,
>>> store, and eval functions are free to communicate with each other via
>>> the optional set.
>>> To address Pi's point about columns being transformed, my assumption
>>> would be that all this file-level metadata will be stored in (or at
>>> least referenced from) the Schema object. This can be set up so that
>>> metadata associated with a particular field survives projection, but
>>> not being passed to an eval UDF, being used in an arithmetic
>>> expression, etc. As the eval UDF can generate a schema for its output,
>>> it could set any optional (or canonical) values it wanted, thus
>>> facilitating Mathieu's communication.
>>>
>>> Alan.
>>>
>>> pi song wrote:
>>>
>>>> I love discussing new ideas, Mathieu. This is not bothering but
>>>> interesting. My colleague had spent some time doing a Microsoft SSIS
>>>> thing that always breaks once there is a schema change and requires a
>>>> manual script change. It seems like you are trying to go beyond that.
>>>>
>>>> On Sat, May 31, 2008 at 12:53 AM, Mathieu Poumeyrol
>>>> <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Well, it adds a way to *dynamically* parameterize UDFs, without
>>>>> changing the pig script itself.
>>>>>
>>>>> I guess it comes back to the question of "how big a pig script is".
>>>>> If we are only considering 5-line pig scripts, where you load
>>>>> exactly what you need to compute, crunch numbers and dump them, I
>>>>> agree it does not make much sense.
>>>>>
>>>>> If one starts thinking about something more ETL-ish (which I
>>>>> understand is not exactly the main purpose of pig), then one could
>>>>> want to use pig to "move" data around or load data from somewhere,
>>>>> do something "heavy" that ETL software just cannot cope with
>>>>> efficiently enough (build indexes, process images, whatever) and
>>>>> store the results somewhere else: a scenario where there can be
>>>>> fields that pig will just forward, without touching them.
>>>>> I admit my background, where we were using the same software for
>>>>> ETL-like stuff and heavy processing (that is, mostly building
>>>>> indexes), may give me a very biased opinion about pig and what it
>>>>> should be. But I would definitely like to use pig for what it is and
>>>>> will be excellent at, as well as for stuff where it will be just OK.
>>>>>
>>>>> So I still think the extension point is worth having. Half my brain
>>>>> is already thinking about ways of cheating and using Alan's fields
>>>>> list to pass other stuff around...
>>>>>
>>>>> One more concrete example and then I'll stop bothering you all :) In
>>>>> our tools, we use some field metadata to denote that a field's
>>>>> content is a primary key to a record. When we copy that field's
>>>>> values somewhere else, we automatically tag them as a foreign key
>>>>> (instead of a primary key). When we dump the data to disk (to a
>>>>> final-user CD-ROM image in most cases), the fact that the column
>>>>> refers to a table also present on the disk can be stored
>>>>> automagically, as it is a feature of our final format: without the
>>>>> application developer re-specifying the relations, the "UDF store
>>>>> equivalent" is clever enough to store the information.
>>>>>
>>>>> The script of the application developer who prepares a CD-ROM can be
>>>>> several screens long, with bits spread across separate files. The
>>>>> data model can be quite complex too. In this context, it is
>>>>> important that things like "this field acts as a record key" are
>>>>> said once.
>>>>>
>>>>> On 30 May 08 at 16:13, pi song wrote:
>>>>>
>>>>>> More, adding metadata is conceptually adding another way to
>>>>>> parameterize load/store functions. Making UDFs parameterized by
>>>>>> other UDFs is therefore also possible functionally, but I just
>>>>>> couldn't think of any good use cases.
>>>>>> On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]>
>>>>>> wrote:
>>>>>>
>>>>>>> Just out of curiosity: you say that somehow the UDF store in your
>>>>>>> example can "learn" from the UDF load. That information still
>>>>>>> might not be useful, because between "load" and "store" you've
>>>>>>> got processing logic which might or might not alter the validity
>>>>>>> of information transferred directly from "load" to "store". An
>>>>>>> example would be: I load a list of numbers and then convert them
>>>>>>> to strings. The information on the UDF store side is then no
>>>>>>> longer applicable.
>>>>>>>
>>>>>>> Don't you think the cases where this concept can be useful are
>>>>>>> very rare?
>>>>>>>
>>>>>>> Pi
>>>>>>>
>>>>>>> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol <
>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Pi,
>>>>>>>>
>>>>>>>> Well... I was thinking of all three of them, actually. Alan's
>>>>>>>> list is quite comprehensive, so it is not that easy to find a
>>>>>>>> convincing example, but I'm sure UDF developers may need some
>>>>>>>> additional information to communicate metadata from one UDF to
>>>>>>>> another.
>>>>>>>>
>>>>>>>> It does not make sense if you think "one UDF function", but it
>>>>>>>> is a way to have two coordinated UDFs communicating.
>>>>>>>>
>>>>>>>> For instance, the developer of a jdbc pig "connector" will
>>>>>>>> typically write a UDF load and a UDF store. What if he wants the
>>>>>>>> loader to discover the field collection (case 3, "Self
>>>>>>>> describing data" on Alan's page) from jdbc and propagate the
>>>>>>>> exact column type of a given field (as in "VARCHAR(42)") to
>>>>>>>> create it the right way in the UDF store? Or the table name?
>>>>>>>> Or the fact that a column is indexed, a primary key, a foreign
>>>>>>>> key constraint, some encoding info... He may also want to
>>>>>>>> develop a UDF pipeline function that would perform some foreign
>>>>>>>> key validation against the database at some point in his script.
>>>>>>>> Having the information in the metadata may be useful.
>>>>>>>>
>>>>>>>> Some other fields of application we cannot think of today may
>>>>>>>> need some completely different metadata. My whole point is: Pig
>>>>>>>> should provide some metadata extension point.
>>>>>>>>
>>>>>>>> On 30 May 08 at 13:54, pi song wrote:
>>>>>>>>
>>>>>>>>> I don't get it, Mathieu. UDF is a very broad term. It could be
>>>>>>>>> a UDF load, a UDF store, or a UDF as a function in the
>>>>>>>>> pipeline. Can you explain a bit more?
>>>>>>>>>
>>>>>>>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <
>>>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>>>
>>>>>>>>>> All,
>>>>>>>>>>
>>>>>>>>>> Looking at the very extensive list of types of file-specific
>>>>>>>>>> metadata, I think (from experience) that a UDF may need to
>>>>>>>>>> attach some information (any information, actually) to a given
>>>>>>>>>> field (or file), to be retrieved by another UDF downstream.
>>>>>>>>>>
>>>>>>>>>> What about adding a Map<String, Serializable> to each file and
>>>>>>>>>> each field?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Mathieu
>>>>>>>>>>
>>>>>>>>>> On 30 May 08 at 01:24, pi song wrote:
>>>>>>>>>>
>>>>>>>>>>> Alan,
>>>>>>>>>>>
>>>>>>>>>>> I will start thinking about this as well. When do you want to
>>>>>>>>>>> start the implementation?
>>>>>>>>>>> Pi
>>>>>>>>>>>
>>>>>>>>>>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear Wiki user,
>>>>>>>>>>>>
>>>>>>>>>>>> You have subscribed to a wiki page or wiki category on "Pig
>>>>>>>>>>>> Wiki" for change notification.
>>>>>>>>>>>>
>>>>>>>>>>>> The following page has been changed by AlanGates:
>>>>>>>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>   information, histograms, etc.
>>>>>>>>>>>>
>>>>>>>>>>>>   == Pig Interface to File Specific Metadata ==
>>>>>>>>>>>> - Pig should support four options with regard to file
>>>>>>>>>>>>   specific metadata:
>>>>>>>>>>>> + Pig should support four options with regard to reading
>>>>>>>>>>>>   file specific metadata:
>>>>>>>>>>>>   1. No file specific metadata available. Pig uses the file
>>>>>>>>>>>>   as input with no knowledge of its content. All data is
>>>>>>>>>>>>   assumed to be !ByteArrays.
>>>>>>>>>>>>   2. User provides schema in the script. For example, `A =
>>>>>>>>>>>>   load 'myfile' as (a: chararray, b: int);`.
>>>>>>>>>>>>   3. Self describing data. Data may be in a format that
>>>>>>>>>>>>   describes the schema, such as JSON. Users may also have
>>>>>>>>>>>>   other proprietary ways to store information about the data
>>>>>>>>>>>>   in a file, either in the file itself or in an associated
>>>>>>>>>>>>   file. Changes to the !LoadFunc interface made as part of
>>>>>>>>>>>>   the pipeline rework support this for data type and column
>>>>>>>>>>>>   layout only. It will need to be expanded to support other
>>>>>>>>>>>>   types of information about the file.
>>>>>>>>>>>>   4. Input from a data catalog. Pig needs to be able to
>>>>>>>>>>>>   query an external data catalog to acquire information
>>>>>>>>>>>>   about a file. All the same information available in option
>>>>>>>>>>>>   3 should be available via this interface. This interface
>>>>>>>>>>>>   does not yet exist and needs to be designed.
>>>>>>>>>>>>
>>>>>>>>>>>> + It should support options 3 and 4 for writing file
>>>>>>>>>>>>   specific metadata as well.
>>>>>>>>>>>> +
>>>>>>>>>>>>   == Pig Interface to Global Metadata ==
>>>>>>>>>>>> - An interface will need to be designed for pig to interface
>>>>>>>>>>>>   to an external data catalog.
>>>>>>>>>>>> + An interface will need to be designed for pig to read from
>>>>>>>>>>>>   and write to an external data catalog.
>>>>>>>>>>>>
>>>>>>>>>>>>   == Architecture of Pig Interface to External Data Catalog ==
>>>>>>>>>>>>   Pig needs to be able to connect to various types of
>>>>>>>>>>>>   external data catalogs (databases, catalogs stored in flat
>>>>>>>>>>>>   files, web services, etc.). To facilitate this
>>>>>>>>>>>> - pig will develop a generic interface that allows it to
>>>>>>>>>>>>   make specific types of queries to a data catalog. Drivers
>>>>>>>>>>>>   will then need to be written to implement
>>>>>>>>>>>> + pig will develop a generic interface that allows it to
>>>>>>>>>>>>   query and update a data catalog. Drivers will then need to
>>>>>>>>>>>>   be written to implement
>>>>>>>>>>>>   that interface and connect to a specific type of data
>>>>>>>>>>>>   catalog.
>>>>>>>>>>>>
>>>>>>>>>>>>   == Types of File Specific Metadata Pig Will Use ==
>>>>>>>>>>>> - Pig should be able to acquire the following types of
>>>>>>>>>>>>   information about a file via either self description or an
>>>>>>>>>>>>   external data catalog. This is not to say
>>>>>>>>>>>> + Pig should be able to acquire and record the following
>>>>>>>>>>>>   types of information about a file via either self
>>>>>>>>>>>>   description or an external data catalog. This is not to say
>>>>>>>>>>>>   that every self describing file or external data catalog
>>>>>>>>>>>>   must support every one of these items. This is a list of
>>>>>>>>>>>>   items pig may find useful and should be
>>>>>>>>>>>> - able to query for. If the metadata source cannot provide
>>>>>>>>>>>>   the information, pig will simply not make use of it.
>>>>>>>>>>>> + able to query for and create. If the metadata source
>>>>>>>>>>>>   cannot provide or store the information, pig will simply
>>>>>>>>>>>>   not make use of it or record it.
>>>>>>>>>>>>   * Field layout (already supported)
>>>>>>>>>>>>   * Field types (already supported)
>>>>>>>>>>>>   * Sortedness of the data, both key and direction
>>>>>>>>>>>>   (ascending/descending)
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -52, +54 @@
>>>>>>>>>>>>
>>>>>>>>>>>>   == Priorities ==
>>>>>>>>>>>>   Given that the usage for global metadata is unclear, the
>>>>>>>>>>>>   priority will be placed on supporting file specific
>>>>>>>>>>>>   metadata. The first step should be to define the
>>>>>>>>>>>> - interface changes in !LoadFunc and the interface to
>>>>>>>>>>>>   external data catalogs.
>>>>>>>>>>>> + interface changes in !LoadFunc, !StoreFunc and the
>>>>>>>>>>>>   interface to external data catalogs.
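To make the proposal in this thread concrete: below is a minimal Java sketch of what the `Map<String, Serializable>` extension point could look like, following Mathieu's JDBC connector example (a loader records the exact SQL column type, a coordinated store function reads it back). This is not Pig's actual API; `FieldSchema`, `udMetadata`, and the `jdbc.*` keys are all illustrative names invented for this sketch.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class UdfMetadataSketch {

    // Stand-in for a field entry in Pig's Schema bean. As suggested in the
    // thread, canonical metadata stays as typed bean properties, while
    // user-defined metadata lives in a separate "sandbox" map so UDF keys
    // can never collide with (or overwrite) engine-managed values.
    static class FieldSchema {
        final String alias;
        final Map<String, Serializable> udMetadata =
                new HashMap<String, Serializable>();

        FieldSchema(String alias) {
            this.alias = alias;
        }
    }

    public static void main(String[] args) {
        // A hypothetical JDBC loader discovers the field collection and
        // records information Pig itself has no canonical slot for...
        FieldSchema field = new FieldSchema("name");
        field.udMetadata.put("jdbc.sqlType", "VARCHAR(42)");
        field.udMetadata.put("jdbc.isPrimaryKey", Boolean.TRUE);

        // ...and a coordinated store function downstream reads it back to
        // create the target column the right way.
        System.out.println(field.udMetadata.get("jdbc.sqlType"));
    }
}
```

Because the values are `Serializable`, such a map could survive being shipped with the plan to the backend; a downstream eval UDF that emits its own output schema would copy or rewrite entries (for example, turning a `primaryKey` tag into a `foreignKey` tag), addressing Pi's concern that metadata blindly forwarded from load to store may no longer be valid.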
