I love discussing new ideas, Mathieu. This is not bothering but interesting. A colleague of mine spent some time on a Microsoft SSIS job that breaks whenever there is a schema change and then requires a manual script change. It seems like you are trying to go beyond that.
On Sat, May 31, 2008 at 12:53 AM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:
> Well, it adds a way to *dynamically* parameterize UDFs, without changing
> the pig script itself.
>
> I guess it comes back to the question of "how big a pig script is". If
> we are only considering 5-line pig scripts, where you load exactly what
> you need to compute, crunch numbers and dump them, I agree it does not
> make much sense.
>
> If one starts thinking about something more ETL-ish (which I understand
> is not exactly the main purpose of pig), then one could want to use pig
> to "move" data around, or to load data from somewhere, do something
> "heavy" that ETL software just cannot cope with efficiently enough
> (build indexes, process images, whatever) and store the results
> somewhere else: a scenario where there can be fields that pig will just
> forward without touching them.
>
> I admit my background, where we were using the same software for
> ETL-like stuff and heavy processing (that is, mostly building indexes),
> may give me a very biased opinion about pig and what it should be. But I
> would definitely like to use pig for what it is/will be excellent at, as
> well as for stuff where it will be just OK.
>
> So I still think the extension point is worth having. Half my brain is
> already thinking about ways of cheating and using Alan's fields list to
> pass other stuff around...
>
> Another concrete example and then I stop bothering you all :) In our
> tools, we use some field metadata to denote that a field's content is a
> primary key of a record. When we copy this field's values somewhere
> else, we automatically tag them as a foreign key (instead of primary).
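The primary-key-to-foreign-key retagging Mathieu describes could be sketched roughly as below. All class, method, and metadata-key names here are invented for illustration; none of this is Pig's actual API, it only shows what a per-field `Map<String, Serializable>` would make possible.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a schema field extended with an opaque metadata
// map, as proposed in this thread. Names are invented, not Pig's API.
public class FieldMetadataSketch {
    static class Field {
        final String name;
        final Map<String, Serializable> metadata = new HashMap<>();
        Field(String name) { this.name = name; }
    }

    // When a field's values are copied into another relation, a key that
    // was primary in the source is retagged as a foreign key in the copy,
    // remembering which table it refers to.
    static Field copyField(Field src) {
        Field dst = new Field(src.name);
        dst.metadata.putAll(src.metadata);
        if ("primary".equals(src.metadata.get("key.role"))) {
            dst.metadata.put("key.role", "foreign");
            dst.metadata.put("key.refers.to", src.metadata.get("key.table"));
        }
        return dst;
    }

    public static void main(String[] args) {
        Field id = new Field("id");
        id.metadata.put("key.role", "primary");
        id.metadata.put("key.table", "users");

        Field copied = copyField(id);
        System.out.println(copied.metadata.get("key.role"));      // foreign
        System.out.println(copied.metadata.get("key.refers.to")); // users
    }
}
```

A store UDF could then inspect `key.role` and `key.refers.to` to record the relation without the script author re-specifying it.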
When we dump
> the data to disk (to an end-user CD-ROM image in most cases), the fact
> that the column refers to a table that is also present on the disk can
> be automagically stored, as it is a feature of our final format: without
> the application developer re-specifying the relations, the "UDF store
> equivalent" is clever enough to store the information.
>
> The script of the application developer who prepares a CD-ROM can be
> several screens long, with bits spread across separate files. The data
> model can be quite complex too. In this context, it is important that
> things like "this field acts as a record key" are said once.
>
> On 30 May 2008, at 16:13, pi song wrote:
>
>> Moreover, adding metadata is conceptually adding another way to
>> parameterize load/store functions. Making UDFs parameterized by other
>> UDFs is therefore also functionally possible, but I just couldn't think
>> of any good use cases.
>>
>> On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]> wrote:
>>
>>> Just out of curiosity: if you say the UDF store in your example can
>>> somehow "learn" from the UDF load, that information still might not be
>>> useful, because between "load" and "store" you have processing logic
>>> which might or might not alter the validity of information transferred
>>> directly from "load" to "store". An example: I load a list of numbers
>>> and then convert them to strings. The information on the UDF store
>>> side is then no longer applicable.
>>>
>>> Don't you think the cases where this concept can be useful are very
>>> rare?
>>>
>>> Pi
>>>
>>> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Pi,
>>>>
>>>> Well... I was thinking... all three of them, actually. Alan's list is
>>>> quite comprehensive, so it is not that easy to find a convincing
>>>> example, but I'm sure UDF developers may need some additional
>>>> information to communicate metadata from one UDF to another.
>>>> It does not make sense if you think "one UDF function", but it is a
>>>> way to have two coordinated UDFs communicating.
>>>>
>>>> For instance, the developer of a jdbc pig "connector" will typically
>>>> write a UDF load and a UDF store. What if he wants the loader to
>>>> discover the field collection (case 3, self-describing data on Alan's
>>>> page) from jdbc and propagate the exact column type of a given field
>>>> (as in "VARCHAR(42)") to create it the right way in the UDF store? Or
>>>> the table name? Or the fact that a column is indexed, a primary key,
>>>> a foreign key constraint, some encoding info... He may also want to
>>>> develop a UDF pipeline function that would perform some foreign key
>>>> validation against the database at some point in his script. Having
>>>> the information in the metadata may be useful.
>>>>
>>>> Some other fields of application we cannot think of today may need
>>>> some completely different metadata. My whole point is: Pig should
>>>> provide some metadata extension point.
>>>>
>>>> On 30 May 2008, at 13:54, pi song wrote:
>>>>
>>>>> I don't get it, Mathieu. UDF is a very broad term. It could be UDF
>>>>> Load, UDF Store, or UDF as a function in the pipeline. Can you
>>>>> explain a bit more?
>>>>>
>>>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> All,
>>>>>>
>>>>>> Looking at the very extensive list of types of file specific
>>>>>> metadata, I think (from experience) that a UDF may need to attach
>>>>>> some information (any information, actually) to a given field (or
>>>>>> file), to be retrieved by another UDF downstream.
>>>>>>
>>>>>> What about adding a Map<String, Serializable> to each file and
>>>>>> each field?
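The jdbc load/store scenario above could use such a map like this. This is a sketch under invented names (`loadSideMetadata`, `columnDdl`, the `sql.*` keys); it does not reflect Pig's real LoadFunc/StoreFunc interfaces, only the kind of load-to-store propagation being proposed.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a jdbc-style loader records the exact column type
// it discovered, and a paired store function reads it back when emitting
// DDL, instead of guessing a type. Names are invented for illustration.
public class JdbcMetadataSketch {

    // Load side: per-field metadata, keyed by field name, as the loader
    // might discover it from the source database.
    static Map<String, Map<String, Serializable>> loadSideMetadata() {
        Map<String, Map<String, Serializable>> meta = new LinkedHashMap<>();
        Map<String, Serializable> title = new HashMap<>();
        title.put("sql.type", "VARCHAR(42)");   // exact type from the source DB
        title.put("sql.indexed", Boolean.TRUE);
        meta.put("title", title);
        return meta;
    }

    // Store side: use the propagated type when present, fall back otherwise.
    static String columnDdl(String field, Map<String, Serializable> meta) {
        Serializable type = meta.get("sql.type");
        return field + " " + (type != null ? type : "VARCHAR(255)");
    }

    public static void main(String[] args) {
        Map<String, Map<String, Serializable>> meta = loadSideMetadata();
        System.out.println(columnDdl("title", meta.get("title")));
        // prints: title VARCHAR(42)
    }
}
```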
>>>>>>
>>>>>> --
>>>>>> Mathieu
>>>>>>
>>>>>> On 30 May 2008, at 01:24, pi song wrote:
>>>>>>
>>>>>>> Alan,
>>>>>>>
>>>>>>> I will start thinking about this as well. When do you want to
>>>>>>> start the implementation?
>>>>>>>
>>>>>>> Pi
>>>>>>>
>>>>>>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Dear Wiki user,
>>>>>>>>
>>>>>>>> You have subscribed to a wiki page or wiki category on "Pig Wiki"
>>>>>>>> for change notification.
>>>>>>>>
>>>>>>>> The following page has been changed by AlanGates:
>>>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> information, histograms, etc.
>>>>>>>>
>>>>>>>> == Pig Interface to File Specific Metadata ==
>>>>>>>> - Pig should support four options with regard to file specific
>>>>>>>> metadata:
>>>>>>>> + Pig should support four options with regard to reading file
>>>>>>>> specific metadata:
>>>>>>>>  1. No file specific metadata available. Pig uses the file as
>>>>>>>> input with no knowledge of its content. All data is assumed to
>>>>>>>> be !ByteArrays.
>>>>>>>>  2. User provides schema in the script. For example, `A = load
>>>>>>>> 'myfile' as (a: chararray, b: int);`.
>>>>>>>>  3. Self describing data. Data may be in a format that describes
>>>>>>>> the schema, such as JSON. Users may also have other proprietary
>>>>>>>> ways to store information about the data in a file, either in the
>>>>>>>> file itself or in an associated file. Changes to the !LoadFunc
>>>>>>>> interface made as part of the pipeline rework support this for
>>>>>>>> data type and column layout only. It will need to be expanded to
>>>>>>>> support other types of information about the file.
>>>>>>>>  4. Input from a data catalog.
Pig needs to be able to query an
>>>>>>>> external data catalog to acquire information about a file. All
>>>>>>>> the same information available in option 3 should be available
>>>>>>>> via this interface. This interface does not yet exist and needs
>>>>>>>> to be designed.
>>>>>>>>
>>>>>>>> + It should support options 3 and 4 for writing file specific
>>>>>>>> metadata as well.
>>>>>>>> +
>>>>>>>> == Pig Interface to Global Metadata ==
>>>>>>>> - An interface will need to be designed for pig to interface to
>>>>>>>> an external data catalog.
>>>>>>>> + An interface will need to be designed for pig to read from and
>>>>>>>> write to an external data catalog.
>>>>>>>>
>>>>>>>> == Architecture of Pig Interface to External Data Catalog ==
>>>>>>>> Pig needs to be able to connect to various types of external
>>>>>>>> data catalogs (databases, catalogs stored in flat files, web
>>>>>>>> services, etc.). To facilitate this
>>>>>>>> - pig will develop a generic interface that allows it to make
>>>>>>>> specific types of queries to a data catalog. Drivers will then
>>>>>>>> need to be written to implement
>>>>>>>> + pig will develop a generic interface that allows it to query
>>>>>>>> and update a data catalog. Drivers will then need to be written
>>>>>>>> to implement
>>>>>>>> that interface and connect to a specific type of data catalog.
>>>>>>>>
>>>>>>>> == Types of File Specific Metadata Pig Will Use ==
>>>>>>>> - Pig should be able to acquire the following types of
>>>>>>>> information about a file via either self description or an
>>>>>>>> external data catalog. This is not to say
>>>>>>>> + Pig should be able to acquire and record the following types
>>>>>>>> of information about a file via either self description or an
>>>>>>>> external data catalog.
This is not to say
>>>>>>>> that every self describing file or external data catalog must
>>>>>>>> support every one of these items. This is a list of items pig
>>>>>>>> may find useful and should be
>>>>>>>> - able to query for. If the metadata source cannot provide the
>>>>>>>> information, pig will simply not make use of it.
>>>>>>>> + able to query for and create. If the metadata source cannot
>>>>>>>> provide or store the information, pig will simply not make use
>>>>>>>> of it or record it.
>>>>>>>>  * Field layout (already supported)
>>>>>>>>  * Field types (already supported)
>>>>>>>>  * Sortedness of the data, both key and direction
>>>>>>>> (ascending/descending)
>>>>>>>> @@ -52, +54 @@
>>>>>>>>
>>>>>>>> == Priorities ==
>>>>>>>> Given that the usage for global metadata is unclear, the
>>>>>>>> priority will be placed on supporting file specific metadata.
>>>>>>>> The first step should be to define the
>>>>>>>> - interface changes in !LoadFunc and the interface to external
>>>>>>>> data catalogs.
>>>>>>>> + interface changes in !LoadFunc, !StoreFunc and the interface
>>>>>>>> to external data catalogs.
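The "generic interface plus per-backend drivers" architecture the wiki page describes could look something like the following. The interface name, its methods, and the metadata keys are all invented here; the wiki page states the interface does not yet exist and needs to be designed.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a generic data catalog interface with pluggable
// drivers. All names are invented; this is not a proposed final design.
public class CatalogSketch {
    // Generic queries/updates pig would issue; each backend (database,
    // flat files, web service, ...) supplies its own driver.
    interface DataCatalogDriver {
        Map<String, Serializable> fileMetadata(String fileName);
        void recordFileMetadata(String fileName, Map<String, Serializable> meta);
    }

    // A trivial in-memory driver standing in for a real backend.
    static class InMemoryCatalog implements DataCatalogDriver {
        private final Map<String, Map<String, Serializable>> store = new HashMap<>();

        public Map<String, Serializable> fileMetadata(String fileName) {
            // Unknown file: no metadata, pig simply makes no use of it.
            return store.getOrDefault(fileName, new HashMap<>());
        }

        public void recordFileMetadata(String fileName, Map<String, Serializable> meta) {
            store.put(fileName, meta);
        }
    }

    public static void main(String[] args) {
        DataCatalogDriver catalog = new InMemoryCatalog();
        Map<String, Serializable> meta = new HashMap<>();
        meta.put("sorted.by", "userid");          // e.g. sortedness metadata
        meta.put("sort.direction", "ascending");
        catalog.recordFileMetadata("myfile", meta);
        System.out.println(catalog.fileMetadata("myfile").get("sorted.by"));
    }
}
```

A driver backed by a real database or web service would implement the same two calls, which matches the page's note that a source unable to provide or store an item simply leaves it unused.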
