Mathieu, can you give us some more practical use cases?

Pi

On Fri, Jun 6, 2008 at 12:16 AM, pi song <[EMAIL PROTECTED]> wrote:
> Interestingly, somebody just requested a feature which is a sample use case:
>
> https://issues.apache.org/jira/browse/PIG-255
>
> On Tue, Jun 3, 2008 at 9:15 PM, pi song <[EMAIL PROTECTED]> wrote:
>> From my understanding, we are trying to create something like plan-scoped shared properties, right? Potentially, it's good to have.
>>
>> On Tue, Jun 3, 2008 at 2:20 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:
>>> Alan, Pi,
>>>
>>> This overall summary sounds good to me, yes.
>>>
>>> But I merely ask for a map to be propagated when it looks possible. I'm not too sure about the "canonical" metadata having to be stored in the map, actually. I think I would keep the canonical metadata as properties of a Schema bean, merely adding a Map<String, Serializable> UDMetadata to the list. Putting the canonical metadata inside the map would just make Pig's internal code more difficult to maintain and could lead to weird bugs when keys are overwritten... I prefer to let the UDF developer play in his own sandbox.
>>>
>>> Le 2 juin 08 à 17:56, Alan Gates a écrit :
>>>
>>>> Mathieu, let me make sure I understand what you're trying to say. Some file-level metadata is about the file as a whole, such as how many records are in the file. Some is about individual columns (such as column cardinality or value distribution histograms). You would like to see each stored in a map (one map for the file as a whole and one for each column). You could then "cheat" in the load functions you write for yourself and add values specific to your application to those maps. Is that correct?
>>>>
>>>> We will need to decide on a canonical set of metadata entries that the Pig engine can ask for. But allowing for optional settings in addition to these canonical values seems like a good idea. The Pig engine itself will only use the canonical set, but user-contributed load, store, and eval functions are free to communicate with each other via the optional set.
>>>>
>>>> To address Pi's point about columns being transformed, my assumption is that all this file-level metadata will be stored in (or at least referenced from) the Schema object. This can be set up so that metadata associated with a particular field survives projection, but does not survive being passed to an eval UDF, being used in an arithmetic expression, etc. As an eval UDF can generate a schema for its output, it could set any optional (or canonical) values it wanted, thus facilitating Mathieu's communication.
>>>>
>>>> Alan.
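To make the split being discussed here concrete, a field-level schema bean could look roughly like the sketch below: canonical metadata as typed properties the engine understands, plus an open UDMetadata map for cooperating UDFs. The class and member names are illustrative only; this is not Pig's actual Schema API.

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch only, not Pig's Schema class: canonical metadata as
    // typed bean properties, user-defined metadata in an open map.
    public class FieldSchema implements Serializable {

        // Canonical metadata the Pig engine itself knows how to use.
        private String alias;          // field name
        private byte type;             // one of Pig's type constants
        private long cardinality = -1; // -1 means "unknown"

        // Open extension point: anything a load/store/eval UDF wants to attach.
        private final Map<String, Serializable> udMetadata =
            new HashMap<String, Serializable>();

        public FieldSchema(String alias, byte type) {
            this.alias = alias;
            this.type = type;
        }

        public Map<String, Serializable> getUDMetadata() {
            return udMetadata;
        }

        // Getters and setters for the canonical properties elided.
    }

A loader could then call field.getUDMetadata().put("sqlType", "VARCHAR(42)") and a cooperating storer (or an eval UDF building its output schema) could read the value back, while the engine itself only ever consults the typed properties.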
>>>> pi song wrote:
>>>>> I love discussing new ideas, Mathieu. This is not bothering but interesting. My colleague spent some time doing a Microsoft SSIS thing that always breaks once there is a schema change and requires a manual script change. It seems like you are trying to go beyond that.
>>>>>
>>>>> On Sat, May 31, 2008 at 12:53 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:
>>>>>> Well, it adds a way to *dynamically* parameterize UDFs, without changing the pig script itself.
>>>>>>
>>>>>> I guess it comes back to the question of "how big a pig script is". If we are only considering 5-line pig scripts, where you load exactly what you need to compute, crunch the numbers and dump them, I agree it does not make much sense.
>>>>>>
>>>>>> If one starts thinking about something more ETL-ish (which I understand is not exactly the main purpose of pig), then one could want to use pig to "move" data around or load data from somewhere, do something "heavy" that ETL software just cannot cope with efficiently enough (build indexes, process images, whatever) and store the results somewhere else: a scenario where there can be fields that pig will just forward, without playing with them.
>>>>>>
>>>>>> I admit that my background, where we were using the same software for ETL-like stuff and heavy processing (that is, mostly building indexes), may give me a very biased opinion about pig and what it should be. But I would definitely like to use pig for what it is/will be excellent at, as well as for stuff where it will be just OK.
>>>>>>
>>>>>> So I still think the extension point is worth having. Half my brain is already thinking about ways of cheating and using Alan's field list to pass other stuff around...
>>>>>>
>>>>>> Another concrete example, and then I'll stop bothering you all :) In our tools, we use some field metadata to denote that a field's content is a primary key to a record. When we copy these field values somewhere else, we automatically tag them as a foreign key (instead of a primary key). When we dump the data to disk (to a final-user CD-ROM image in most cases), the fact that the column refers to a table that is also present on the disk can be stored automagically, as it is a feature of our final format: without the application developer having to re-specify the relations, the "UDF store equivalent" is clever enough to store the information.
>>>>>>
>>>>>> The script of the application developer who prepares a CD-ROM can be several screens long, with bits spread over separate files. The data model can be quite complex too. In this context, it is important that things like "this field acts as a record key" are said once.
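Read in terms of that per-field map, the CD-ROM example might look roughly like this on the store side: a cooperating loader (or an intermediate UDF) tags a field's role once, and the storer records the relation without the script author restating it. The helper class and the map keys below are purely hypothetical, not an existing Pig or ETL interface.

    import java.io.Serializable;
    import java.util.Map;

    // Hypothetical store-side helper: inspects the per-field metadata map and
    // records key relationships in the output format.
    public class KeyAwareStoreHelper {

        // Tags a cooperating loader or UDF might have set (illustrative keys).
        public static final String KEY_ROLE = "keyRole";   // "primary" or "foreign"
        public static final String REF_TABLE = "refTable"; // table a foreign key points to

        public void describeField(String fieldName, Map<String, Serializable> udMetadata) {
            Serializable role = udMetadata.get(KEY_ROLE);
            if ("primary".equals(role)) {
                // Mark this column as the record key in the output format.
                System.out.println(fieldName + " is a primary key");
            } else if ("foreign".equals(role)) {
                // Record the relation to the referenced table automatically.
                System.out.println(fieldName + " references table " + udMetadata.get(REF_TABLE));
            }
        }
    }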
>>>>>> Le 30 mai 08 à 16:13, pi song a écrit :
>>>>>>
>>>>>>> Moreover, adding metadata is conceptually adding another way to parameterize load/store functions. Making UDFs be parameterized by other UDFs is therefore also possible functionally, but I just couldn't think of any good use cases.
>>>>>>>
>>>>>>> On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]> wrote:
>>>>>>>> Just out of curiosity: if you say the UDF store in your example can somehow "learn" from the UDF load, that information still might not be useful, because between "load" and "store" you've got processing logic which might or might not alter the validity of information transferred directly from "load" to "store". An example: I load a list of numbers and then convert them to strings. The information on the UDF store side is then no longer applicable.
>>>>>>>>
>>>>>>>> Don't you think the cases where this concept can be useful are very rare?
>>>>>>>>
>>>>>>>> Pi
>>>>>>>>
>>>>>>>> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:
>>>>>>>>> Pi,
>>>>>>>>>
>>>>>>>>> Well... I was thinking of the three of them, actually. Alan's list is quite comprehensive, so it is not that easy to find a convincing example, but I'm sure UDF developers may need some additional information to communicate metadata from one UDF to another.
>>>>>>>>>
>>>>>>>>> It does not make sense if you think of a single UDF, but it is a way to have two coordinated UDFs communicating.
>>>>>>>>>
>>>>>>>>> For instance, the developer of a JDBC pig "connector" will typically write a UDF load and a UDF store. What if he wants the loader to discover the field collection (case 3, self-describing data, on Alan's page) from JDBC and propagate the exact column type of a given field (as in "VARCHAR(42)"), so the UDF store can create it the right way? Or the table name? Or the fact that a column is indexed, a primary key, a foreign key constraint, some encoding info... He may also want to develop a UDF pipeline function that would perform some foreign key validation against the database at some point in his script. Having the information in the metadata may be useful.
>>>>>>>>>
>>>>>>>>> Some other fields of application we cannot think of today may need some completely different metadata. My whole point is: Pig should provide some metadata extension point.
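Sketched in code, the JDBC scenario could work roughly as follows: the load side reads the declared column types from JDBC metadata and stashes them in the per-field maps, and the store side uses them when building the DDL for the target table. The class, the "sqlType" key, and the way the per-field maps are handed around are assumptions for illustration; no such connector exists in Pig.

    import java.io.Serializable;
    import java.sql.Connection;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.List;
    import java.util.Map;

    // Illustrative JDBC load/store pair communicating through per-field metadata.
    public class JdbcMetadataExample {

        static final String SQL_TYPE = "sqlType"; // hypothetical key, e.g. "VARCHAR(42)"

        // Load side: record the declared type of each column of the source table.
        static void recordColumnTypes(Connection conn, String table,
                                      List<Map<String, Serializable>> perFieldMetadata)
                throws SQLException {
            Statement st = conn.createStatement();
            try {
                ResultSetMetaData md =
                    st.executeQuery("SELECT * FROM " + table + " WHERE 1=0").getMetaData();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    String declared = md.getColumnTypeName(i);        // e.g. "VARCHAR"
                    if ("VARCHAR".equalsIgnoreCase(declared)) {
                        declared += "(" + md.getPrecision(i) + ")";   // e.g. "VARCHAR(42)"
                    }
                    perFieldMetadata.get(i - 1).put(SQL_TYPE, declared);
                }
            } finally {
                st.close();
            }
        }

        // Store side: fall back to a generic type only when the loader left no hint.
        static String columnDdl(String fieldName, Map<String, Serializable> udMetadata) {
            Object declared = udMetadata.get(SQL_TYPE);
            return fieldName + " " + (declared != null ? declared : "VARCHAR(255)");
        }
    }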
>>>>>>>>> Le 30 mai 08 à 13:54, pi song a écrit :
>>>>>>>>>
>>>>>>>>>> I don't get it, Mathieu. UDF is a very broad term. It could be a UDF load, a UDF store, or a UDF used as a function in the pipeline. Can you explain a bit more?
>>>>>>>>>>
>>>>>>>>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>> All,
>>>>>>>>>>>
>>>>>>>>>>> Looking at the very extensive list of types of file specific metadata, I think (from experience) that a UDF may need to attach some information (any information, actually) to a given field (or file), to be retrieved by another UDF downstream.
>>>>>>>>>>>
>>>>>>>>>>> What about adding a Map<String, Serializable> to each file and each field?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Mathieu
>>>>>>>>>>>
>>>>>>>>>>> Le 30 mai 08 à 01:24, pi song a écrit :
>>>>>>>>>>>
>>>>>>>>>>>> Alan,
>>>>>>>>>>>>
>>>>>>>>>>>> I will start thinking about this as well. When do you want to start the implementation?
>>>>>>>>>>>>
>>>>>>>>>>>> Pi
>>>>>>>>>>>>
>>>>>>>>>>>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>>>>>>>>>>>> Dear Wiki user,
>>>>>>>>>>>>>
>>>>>>>>>>>>> You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The following page has been changed by AlanGates:
>>>>>>>>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>>>>>>>>
>>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>>   information, histograms, etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   == Pig Interface to File Specific Metadata ==
>>>>>>>>>>>>> - Pig should support four options with regard to file specific metadata:
>>>>>>>>>>>>> + Pig should support four options with regard to reading file specific metadata:
>>>>>>>>>>>>>    1. No file specific metadata available. Pig uses the file as input with no knowledge of its content. All data is assumed to be !ByteArrays.
>>>>>>>>>>>>>    2. User provides schema in the script. For example, `A = load 'myfile' as (a: chararray, b: int);`.
>>>>>>>>>>>>>    3. Self describing data. Data may be in a format that describes the schema, such as JSON. Users may also have other proprietary ways to store information about the data in a file, either in the file itself or in an associated file. Changes to the !LoadFunc interface made as part of the pipeline rework support this for data type and column layout only. It will need to be expanded to support other types of information about the file.
>>>>>>>>>>>>>    4. Input from a data catalog. Pig needs to be able to query an external data catalog to acquire information about a file. All the same information available in option 3 should be available via this interface. This interface does not yet exist and needs to be designed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> + It should support options 3 and 4 for writing file specific metadata as well.
>>>>>>>>>>>>> +
>>>>>>>>>>>>>   == Pig Interface to Global Metadata ==
>>>>>>>>>>>>> - An interface will need to be designed for pig to interface to an external data catalog.
>>>>>>>>>>>>> + An interface will need to be designed for pig to read from and write to an external data catalog.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   == Architecture of Pig Interface to External Data Catalog ==
>>>>>>>>>>>>>   Pig needs to be able to connect to various types of external data catalogs (databases, catalogs stored in flat files, web services, etc.). To facilitate this
>>>>>>>>>>>>> - pig will develop a generic interface that allows it to make specific types of queries to a data catalog. Drivers will then need to be written to implement
>>>>>>>>>>>>> + pig will develop a generic interface that allows it to query and update a data catalog. Drivers will then need to be written to implement
>>>>>>>>>>>>>   that interface and connect to a specific type of data catalog.
>>>>>>>>>>>>>
>>>>>>>>>>>>>   == Types of File Specific Metadata Pig Will Use ==
>>>>>>>>>>>>> - Pig should be able to acquire the following types of information about a file via either self description or an external data catalog. This is not to say
>>>>>>>>>>>>> + Pig should be able to acquire and record the following types of information about a file via either self description or an external data catalog. This is not to say
>>>>>>>>>>>>>   that every self describing file or external data catalog must support every one of these items. This is a list of items pig may find useful and should be
>>>>>>>>>>>>> - able to query for. If the metadata source cannot provide the information, pig will simply not make use of it.
>>>>>>>>>>>>> + able to query for and create. If the metadata source cannot provide or store the information, pig will simply not make use of it or record it.
>>>>>>>>>>>>>    * Field layout (already supported)
>>>>>>>>>>>>>    * Field types (already supported)
>>>>>>>>>>>>>    * Sortedness of the data, both key and direction (ascending/descending)
>>>>>>>>>>>>> @@ -52, +54 @@
>>>>>>>>>>>>>
>>>>>>>>>>>>>   == Priorities ==
>>>>>>>>>>>>>   Given that the usage for global metadata is unclear, the priority will be placed on supporting file specific metadata. The first step should be to define the
>>>>>>>>>>>>> - interface changes in !LoadFunc and the interface to external data catalogs.
>>>>>>>>>>>>> + interface changes in !LoadFunc, !StoreFunc and the interface to external data catalogs.
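For the "generic interface that allows it to query and update a data catalog" mentioned in the quoted wiki page, a driver contract might look something like the sketch below. It is only meant to make the discussion concrete; the interface name and methods are assumptions, not an existing or planned Pig API.

    import java.io.IOException;
    import java.io.Serializable;
    import java.util.Map;

    // Hypothetical contract for pluggable data catalog drivers (database-backed
    // catalog, flat-file catalog, web service, ...). Pig would code against this
    // interface; each backend would provide its own implementation.
    public interface DataCatalogDriver {

        // Open a connection to the catalog, e.g. from a URL plus properties.
        void connect(String catalogUrl, Map<String, String> properties) throws IOException;

        // Everything the catalog knows about one file: schema, sortedness, statistics, ...
        Map<String, Serializable> getFileMetadata(String fileName) throws IOException;

        // Per-column information such as cardinality or value distribution histograms.
        Map<String, Serializable> getColumnMetadata(String fileName, String columnName)
            throws IOException;

        // Record metadata back into the catalog when pig writes a file
        // (the "writing" half added in the diff above).
        void putFileMetadata(String fileName, Map<String, Serializable> metadata)
            throws IOException;

        // Release any resources held by the driver.
        void close() throws IOException;
    }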
