From my understanding, we are trying to create something like plan-scoped shared properties, right? Potentially, it's good to have.
On Tue, Jun 3, 2008 at 2:20 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:

> Alan, Pi,
>
> This overall summary sounds good to me, yes.
>
> But I merely ask for a map to be propagated when it looks possible. I'm not
> too sure about the "canonical" metadata having to be stored in the map,
> actually. I think I would keep the canonical metadata as properties of a
> Schema bean, merely adding a Map<String, Serializable> UDMetadata to the
> list. Putting the canonical metadata inside the map would just make pig's
> internal code more difficult to maintain and could lead to weird bugs when
> keys are overwritten... I prefer to let the UDF developer play in his
> sandbox.
>
> On Jun 2, 08, at 17:56, Alan Gates wrote:
>
>> Mathieu, let me make sure I understand what you're trying to say. Some
>> file-level metadata is about the file as a whole, such as how many records
>> are in the file. Some is about individual columns (such as column
>> cardinality or value distribution histograms). You would like to see each
>> stored in a map (one map for the file as a whole and one for each column).
>> You could then "cheat" in the load functions you write for yourself and
>> add values specific to your application into those maps. Is that correct?
>>
>> We will need to decide on a canonical set of metadata entries that the pig
>> engine can ask for. But allowing for optional settings in addition to
>> these canonical values seems like a good idea. The pig engine itself will
>> only utilize the canonical set. But user-contributed load, store, and eval
>> functions are free to communicate with each other via the optional set.
>>
>> To address Pi's point about columns being transformed, my assumption would
>> be that all this file-level metadata will be stored in (or at least
>> referenced from) the Schema object. This can be set so that metadata
>> associated with a particular field survives projection, but not being
>> passed to an eval UDF, being used in an arithmetic expression, etc.
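Mathieu's proposal above (canonical metadata as typed bean properties, plus a separate `Map<String, Serializable>` sandbox for UDF authors) could be sketched roughly like this. The class shape and property names are illustrative assumptions, not Pig's actual Schema API:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the Schema bean shape Mathieu describes:
// canonical metadata lives in typed properties the engine understands,
// while UDF-specific metadata goes into its own map so stray keys
// cannot clobber engine state.
public class FieldSchema {
    private final String name;   // canonical: field name
    private final byte type;     // canonical: field type code

    // Sandbox for UDF developers; the engine never reads these keys.
    private final Map<String, Serializable> udMetadata = new HashMap<>();

    public FieldSchema(String name, byte type) {
        this.name = name;
        this.type = type;
    }

    public String getName() { return name; }

    public byte getType() { return type; }

    public Map<String, Serializable> getUDMetadata() { return udMetadata; }
}
```

The point of the split is the one Mathieu makes: a UDF author can put anything into `getUDMetadata()` without any risk of overwriting a key the engine relies on.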
>> As the eval UDF can generate a schema for its output, it could set any
>> optional (or canonical) values it wanted, thus facilitating Mathieu's
>> communication.
>>
>> Alan.
>>
>> pi song wrote:
>>
>>> I love discussing new ideas, Mathieu. This is not bothering but
>>> interesting. My colleague had spent some time doing a Microsoft SSIS
>>> thing that always breaks once there is a schema change and requires a
>>> manual script change. Seems like you are trying to go beyond that.
>>>
>>> On Sat, May 31, 2008 at 12:53 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Well, it adds a way to *dynamically* parameterize UDFs, without
>>>> changing the pig script itself.
>>>>
>>>> I guess it comes back to the questions about "how big a pig script
>>>> is". If we are only considering 5-line pig scripts, where you load
>>>> exactly what you need to compute, crunch numbers and dump them, I
>>>> agree it does not make much sense.
>>>>
>>>> If one starts thinking about something more ETL-ish (which I
>>>> understand is not exactly the main purpose of pig) then one could want
>>>> to use pig to "move" data around or load data from somewhere, do
>>>> something "heavy" that ETL software can just not cope with efficiently
>>>> enough (build indexes, process images, whatever) and store the results
>>>> somewhere else, a scenario where there can be fields that pig will
>>>> just forward, without playing with them.
>>>>
>>>> I admit my background, where we were using the same software for
>>>> ETL-like stuff and heavy processing (that is, mostly building
>>>> indexes), may give me a very biased opinion about pig and what it
>>>> should be. But I would definitely like to use pig for what it is/will
>>>> be excellent for, as well as for stuff where it will be just ok.
>>>>
>>>> So I still think the extension point is worth having.
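Alan's propagation rule (metadata attached to a field survives projection, but is dropped through an eval UDF unless the UDF's output schema re-publishes it) can be illustrated with a small sketch. This is not Pig's planner code; the method names and the plain maps standing in for schema metadata are assumptions:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the rule Alan describes: projection carries a
// field's metadata map along unchanged, while an eval UDF starts from
// an empty output map and must explicitly set any entries it wants
// downstream UDFs to see.
public class MetadataPropagation {

    // Projection: the output field is the same field, so its metadata
    // survives intact.
    static Map<String, Serializable> project(Map<String, Serializable> in) {
        return new HashMap<>(in);
    }

    // Eval UDF: the output is a new value, so metadata survives only if
    // the UDF generates an output schema that sets it again.
    static Map<String, Serializable> throughEvalUdf(
            Map<String, Serializable> in, boolean udfRepublishes) {
        Map<String, Serializable> out = new HashMap<>();
        if (udfRepublishes) {
            out.putAll(in); // the UDF chose to forward the optional values
        }
        return out;
    }
}
```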
>>>> Half my brain is already thinking about ways of cheating and using
>>>> Alan's fields list to pass other stuff around...
>>>>
>>>> Another concrete example and I stop bothering you all, then :) In our
>>>> tools, we are using some field metadata to denote that a field's
>>>> content is a primary key to a record. When we copy these field values
>>>> to somewhere else, we automatically tag them as a foreign key (instead
>>>> of primary). When we dump the data on disk (to a final-user CDROM
>>>> image in most cases) the fact that the column refers to a table
>>>> present on the disk too can be automagically stored, as it is a
>>>> feature of our final format: without the application developer
>>>> re-specifying the relations, the "UDF store equivalent" is clever
>>>> enough to store the information.
>>>>
>>>> The script of the application developer who prepares a CDROM can be
>>>> several screens long, with bits spread over separate files. The data
>>>> model could be quite complex too. In this context, it is important
>>>> that things like "this field acts as a record key" are said once.
>>>>
>>>> On May 30, 08, at 16:13, pi song wrote:
>>>>
>>>>> More, adding metadata is conceptually adding another way to
>>>>> parameterize load/store functions. Making UDFs be parameterized by
>>>>> other UDFs is therefore also possible functionally, but I just
>>>>> couldn't think of any good use cases.
>>>>>
>>>>> On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Just out of curiosity. If you say somehow the UDF store in your
>>>>>> example can "learn" from the UDF load, that information still might
>>>>>> not be useful, because between "load" and "store" you've got
>>>>>> processing logic which might or might not alter the validity of
>>>>>> information directly transferred from "load" to "store".
>>>>>> An example would be: I load a list of numbers and then convert them
>>>>>> to strings. Then the information on the UDF store side is no longer
>>>>>> applicable.
>>>>>>
>>>>>> Don't you think the cases where this concept can be useful are very
>>>>>> rare?
>>>>>>
>>>>>> Pi
>>>>>>
>>>>>> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol
>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Pi,
>>>>>>>
>>>>>>> Well... I was thinking... the three of them actually. Alan's list
>>>>>>> is quite comprehensive, so it is not that easy to find a convincing
>>>>>>> example, but I'm sure UDF developers may need some additional
>>>>>>> information to communicate metadata from one UDF to another.
>>>>>>>
>>>>>>> It does not make sense if you think "one UDF function", but it is a
>>>>>>> way to have two coordinated UDFs communicating.
>>>>>>>
>>>>>>> For instance, the developer of a jdbc pig "connector" will
>>>>>>> typically write a UDF load and a UDF store. What if he wants the
>>>>>>> loader to discover the field collection (case 3, self-describing
>>>>>>> data in Alan's page) from jdbc and propagate the exact column type
>>>>>>> of a given field (as in "VARCHAR(42)"), to create it the right way
>>>>>>> in the UDF store? Or the table name? Or the fact that a column is
>>>>>>> indexed, a primary key, a foreign key constraint, some encoding
>>>>>>> info... He may also want to develop a UDF pipeline function that
>>>>>>> would perform some foreign key validation against the database at
>>>>>>> some point in his script. Having the information in the metadata
>>>>>>> may be useful.
>>>>>>>
>>>>>>> Some other fields of application we cannot think of today may need
>>>>>>> some completely different metadata. My whole point is: Pig should
>>>>>>> provide some metadata extension point.
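Mathieu's jdbc scenario could be sketched as below. Both classes and the key names ("jdbc.sqltype", "jdbc.pk") are invented for illustration; nothing here is part of a real Pig interface, it only shows a loader recording catalog details in a field metadata map and a store reading them back to emit exact DDL:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical jdbc connector pair, coordinated through per-field
// metadata rather than through extra script parameters.
public class JdbcConnectorSketch {

    // The loader discovers column details from the database catalog and
    // records them in the per-field metadata map.
    static Map<String, Serializable> loadSideMetadata() {
        Map<String, Serializable> m = new HashMap<>();
        m.put("jdbc.sqltype", "VARCHAR(42)");
        m.put("jdbc.pk", Boolean.TRUE);
        return m;
    }

    // The store side reads the same keys to emit an exact DDL column
    // definition, instead of guessing a width from pig's chararray.
    static String columnDdl(String name, Map<String, Serializable> m) {
        String type = (String) m.getOrDefault("jdbc.sqltype", "VARCHAR(255)");
        boolean pk = Boolean.TRUE.equals(m.get("jdbc.pk"));
        return name + " " + type + (pk ? " PRIMARY KEY" : "");
    }
}
```

A pipeline UDF doing foreign key validation would consume the same keys, which is the "two coordinated UDFs communicating" point.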
>>>>>>> On May 30, 08, at 13:54, pi song wrote:
>>>>>>>
>>>>>>>> I don't get it, Mathieu. UDF is a very broad term. It could be UDF
>>>>>>>> Load, UDF Store, or UDF as a function in the pipeline. Can you
>>>>>>>> explain a bit more?
>>>>>>>>
>>>>>>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol
>>>>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> All,
>>>>>>>>>
>>>>>>>>> Looking at the very extensive list of types of file-specific
>>>>>>>>> metadata, I think (from experience) that a UDF function may need
>>>>>>>>> to attach some information (any information, actually) to a given
>>>>>>>>> field (or file), to be retrieved by another UDF downstream.
>>>>>>>>>
>>>>>>>>> What about adding a Map<String, Serializable> to each file and
>>>>>>>>> each field?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mathieu
>>>>>>>>>
>>>>>>>>> On May 30, 08, at 01:24, pi song wrote:
>>>>>>>>>
>>>>>>>>>> Alan,
>>>>>>>>>>
>>>>>>>>>> I will start thinking about this as well. When do you want to
>>>>>>>>>> start the implementation?
>>>>>>>>>>
>>>>>>>>>> Pi
>>>>>>>>>>
>>>>>>>>>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear Wiki user,
>>>>>>>>>>>
>>>>>>>>>>> You have subscribed to a wiki page or wiki category on
>>>>>>>>>>> "Pig Wiki" for change notification.
>>>>>>>>>>>
>>>>>>>>>>> The following page has been changed by AlanGates:
>>>>>>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>   information, histograms, etc.
>>>>>>>>>>>
>>>>>>>>>>>   == Pig Interface to File Specific Metadata ==
>>>>>>>>>>> - Pig should support four options with regard to file specific
>>>>>>>>>>>   metadata:
>>>>>>>>>>> + Pig should support four options with regard to reading file
>>>>>>>>>>>   specific metadata:
>>>>>>>>>>>   1. No file specific metadata available. Pig uses the file as
>>>>>>>>>>>   input with no knowledge of its content. All data is assumed
>>>>>>>>>>>   to be !ByteArrays.
>>>>>>>>>>>   2. User provides schema in the script. For example,
>>>>>>>>>>>   `A = load 'myfile' as (a: chararray, b: int);`.
>>>>>>>>>>>   3. Self describing data. Data may be in a format that
>>>>>>>>>>>   describes the schema, such as JSON. Users may also have other
>>>>>>>>>>>   proprietary ways to store information about the data in a
>>>>>>>>>>>   file, either in the file itself or in an associated file.
>>>>>>>>>>>   Changes to the !LoadFunc interface made as part of the
>>>>>>>>>>>   pipeline rework support this for data type and column layout
>>>>>>>>>>>   only. It will need to be expanded to support other types of
>>>>>>>>>>>   information about the file.
>>>>>>>>>>>   4. Input from a data catalog. Pig needs to be able to query
>>>>>>>>>>>   an external data catalog to acquire information about a file.
>>>>>>>>>>>   All the same information available in option 3 should be
>>>>>>>>>>>   available via this interface. This interface does not yet
>>>>>>>>>>>   exist and needs to be designed.
>>>>>>>>>>>
>>>>>>>>>>> + It should support options 3 and 4 for writing file specific
>>>>>>>>>>>   metadata as well.
>>>>>>>>>>> +
>>>>>>>>>>>   == Pig Interface to Global Metadata ==
>>>>>>>>>>> - An interface will need to be designed for pig to interface to
>>>>>>>>>>>   an external data catalog.
>>>>>>>>>>> + An interface will need to be designed for pig to read from
>>>>>>>>>>>   and write to an external data catalog.
>>>>>>>>>>>
>>>>>>>>>>>   == Architecture of Pig Interface to External Data Catalog ==
>>>>>>>>>>>   Pig needs to be able to connect to various types of external
>>>>>>>>>>>   data catalogs (databases, catalogs stored in flat files, web
>>>>>>>>>>>   services, etc.). To facilitate this
>>>>>>>>>>> - pig will develop a generic interface that allows it to make
>>>>>>>>>>>   specific types of queries to a data catalog. Drivers will
>>>>>>>>>>>   then need to be written to implement
>>>>>>>>>>> + pig will develop a generic interface that allows it to query
>>>>>>>>>>>   and update a data catalog. Drivers will then need to be
>>>>>>>>>>>   written to implement
>>>>>>>>>>>   that interface and connect to a specific type of data
>>>>>>>>>>>   catalog.
>>>>>>>>>>>
>>>>>>>>>>>   == Types of File Specific Metadata Pig Will Use ==
>>>>>>>>>>> - Pig should be able to acquire the following types of
>>>>>>>>>>>   information about a file via either self description or an
>>>>>>>>>>>   external data catalog. This is not to say
>>>>>>>>>>> + Pig should be able to acquire and record the following types
>>>>>>>>>>>   of information about a file via either self description or an
>>>>>>>>>>>   external data catalog. This is not to say
>>>>>>>>>>>   that every self describing file or external data catalog must
>>>>>>>>>>>   support every one of these items. This is a list of items pig
>>>>>>>>>>>   may find useful and should be
>>>>>>>>>>> - able to query for. If the metadata source cannot provide the
>>>>>>>>>>>   information, pig will simply not make use of it.
>>>>>>>>>>> + able to query for and create.
>>>>>>>>>>> + If the metadata source cannot provide or store the
>>>>>>>>>>>   information, pig will simply not make use of it or record it.
>>>>>>>>>>>   * Field layout (already supported)
>>>>>>>>>>>   * Field types (already supported)
>>>>>>>>>>>   * Sortedness of the data, both key and direction
>>>>>>>>>>>   (ascending/descending)
>>>>>>>>>>> @@ -52, +54 @@
>>>>>>>>>>>
>>>>>>>>>>>   == Priorities ==
>>>>>>>>>>>   Given that the usage for global metadata is unclear, the
>>>>>>>>>>>   priority will be placed on supporting file specific metadata.
>>>>>>>>>>>   The first step should be to define the
>>>>>>>>>>> - interface changes in !LoadFunc and the interface to external
>>>>>>>>>>>   data catalogs.
>>>>>>>>>>> + interface changes in !LoadFunc, !StoreFunc and the interface
>>>>>>>>>>>   to external data catalogs.
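The wiki page says the external data catalog interface "does not yet exist and needs to be designed", but its described shape (a generic query-and-update interface with per-catalog drivers) could look roughly like this. All names here are assumptions, including the trivial in-memory driver standing in for a database- or web-service-backed one:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the driver shape the wiki page calls for: pig would
// program against the generic interface, with one driver per kind of
// catalog (database, flat files, web service, etc.).
public interface DataCatalogDriver {

    // Query side: fetch whatever file-level metadata the catalog has.
    Map<String, Serializable> getFileMetadata(String fileName);

    // Update side: record metadata when pig writes a file (the revised
    // text extends options 3 and 4 to writing as well as reading).
    void putFileMetadata(String fileName, Map<String, Serializable> metadata);
}

// Minimal in-memory driver used only to make the sketch concrete.
class InMemoryCatalog implements DataCatalogDriver {
    private final Map<String, Map<String, Serializable>> store = new HashMap<>();

    public Map<String, Serializable> getFileMetadata(String fileName) {
        // Missing entries mean "no metadata": pig simply does not use it.
        return store.getOrDefault(fileName, new HashMap<>());
    }

    public void putFileMetadata(String fileName, Map<String, Serializable> metadata) {
        store.put(fileName, new HashMap<>(metadata));
    }
}
```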
