Moreover, adding metadata is conceptually just another way to parameterize load/store functions. Parameterizing UDFs by other UDFs is therefore also functionally possible, but I just couldn't think of any good use cases.
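For concreteness, here is a minimal sketch of what the Map<String, Serializable> extension point proposed in the quoted thread might look like. The names `FieldMetadata`, `MetadataSketch` and the `jdbc.*` keys are hypothetical and are not part of any Pig API; the sketch only assumes that a load UDF writes entries and a downstream UDF reads them back.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for free-form metadata attached to one field (or file).
// Not a Pig class: the name and methods are assumptions for illustration only.
final class FieldMetadata implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String fieldName;
    private final HashMap<String, Serializable> properties = new HashMap<>();

    FieldMetadata(String fieldName) {
        this.fieldName = fieldName;
    }

    String fieldName() {
        return fieldName;
    }

    // A load UDF would call this to record whatever it discovered.
    FieldMetadata put(String key, Serializable value) {
        properties.put(key, value);
        return this;
    }

    // A downstream UDF reads the value back; returns null if the key was never set.
    Serializable get(String key) {
        return properties.get(key);
    }

    // Read-only view of all entries.
    Map<String, Serializable> asMap() {
        return Collections.unmodifiableMap(properties);
    }
}

class MetadataSketch {
    public static void main(String[] args) {
        // What a JDBC-style loader might attach to a column (keys are made up).
        FieldMetadata name = new FieldMetadata("name")
            .put("jdbc.columnType", "VARCHAR(42)")
            .put("jdbc.primaryKey", Boolean.FALSE);

        // What a matching store UDF might read to create the column.
        System.out.println(name.fieldName() + " -> " + name.get("jdbc.columnType"));
    }
}
```

Note that the sketch does nothing about the objection raised below: any processing between load and store (converting a number to a string, say) can make a carried value such as the column type stale, and nothing here detects that.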

On Sat, May 31, 2008 at 12:09 AM, pi song <[EMAIL PROTECTED]> wrote:

> Just out of curiosity: suppose the UDF store in your example can somehow
> "learn" from the UDF load. That information still might not be useful,
> because between "load" and "store" you've got processing logic which might
> or might not alter the validity of information transferred directly from
> "load" to "store". An example would be that I load a list of numbers and
> then convert them to strings. The information on the UDF store side is then
> no longer applicable.
>
> Don't you think the cases where this concept can be useful are very rare?
>
> Pi
>
> On Fri, May 30, 2008 at 11:44 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]>
> wrote:
>
>> Pi,
>>
>> Well... I was thinking... the three of them actually. Alan's list is quite
>> comprehensive, so it is not that easy to find a convincing example, but I'm
>> sure UDF developers may need some additional information to communicate
>> metadata from one UDF to another.
>>
>> It does not make sense if you think "one UDF function", but it is a way to
>> have two coordinated UDFs communicating.
>>
>> For instance, the developer of a JDBC Pig "connector" will typically write
>> a UDF load and a UDF store. What if he wants the loader to discover the
>> field collection (case 3, Self describing data in Alan's page) from JDBC
>> and propagate the exact column type of a given field (as in "VARCHAR(42)"),
>> to create it the right way in the UDF store? Or the table name? Or the fact
>> that a column is indexed, a primary key, a foreign key constraint, some
>> encoding info... He may also want to develop a UDF pipeline function that
>> would perform some foreign key validation against the database at some
>> point in his script. Having the information in the metadata may be useful.
>>
>> Some other fields of application we cannot think of today may need some
>> completely different metadata. My whole point is: Pig should provide some
>> metadata extension point.
>>
>> On 30 May 2008 at 13:54, pi song wrote:
>>
>>> I don't get it, Mathieu. UDF is a very broad term. It could be UDF Load,
>>> UDF Store, or UDF as a function in the pipeline. Can you explain a bit
>>> more?
>>>
>>> On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> All,
>>>>
>>>> Looking at the very extensive list of types of file specific metadata,
>>>> I think (from experience) that a UDF function may need to attach some
>>>> information (any information, actually) to a given field (or file) to be
>>>> retrieved by another UDF downstream.
>>>>
>>>> What about adding a Map<String, Serializable> to each file and each
>>>> field?
>>>>
>>>> --
>>>> Mathieu
>>>>
>>>> On 30 May 2008 at 01:24, pi song wrote:
>>>>
>>>>> Alan,
>>>>>
>>>>> I will start thinking about this as well. When do you want to start the
>>>>> implementation?
>>>>>
>>>>> Pi
>>>>>
>>>>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Dear Wiki user,
>>>>>>
>>>>>> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
>>>>>> change notification.
>>>>>>
>>>>>> The following page has been changed by AlanGates:
>>>>>> http://wiki.apache.org/pig/PigMetaData
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>>   information, histograms, etc.
>>>>>>
>>>>>>   == Pig Interface to File Specific Metadata ==
>>>>>> - Pig should support four options with regard to file specific metadata:
>>>>>> + Pig should support four options with regard to reading file specific metadata:
>>>>>>    1. No file specific metadata available. Pig uses the file as input with no knowledge of its content. All data is assumed to be !ByteArrays.
>>>>>>    2. User provides schema in the script. For example, `A = load 'myfile' as (a: chararray, b: int);`.
>>>>>>    3. Self describing data. Data may be in a format that describes the schema, such as JSON. Users may also have other proprietary ways to store information about the data in a file either in the file itself or in an associated file. Changes to the !LoadFunc interface made as part of the pipeline rework support this for data type and column layout only. It will need to be expanded to support other types of information about the file.
>>>>>>    4. Input from a data catalog. Pig needs to be able to query an external data catalog to acquire information about a file. All the same information available in option 3 should be available via this interface. This interface does not yet exist and needs to be designed.
>>>>>>
>>>>>> + It should support options 3 and 4 for writing file specific metadata as well.
>>>>>> +
>>>>>>   == Pig Interface to Global Metadata ==
>>>>>> - An interface will need to be designed for pig to interface to an external data catalog.
>>>>>> + An interface will need to be designed for pig to read from and write to an external data catalog.
>>>>>>
>>>>>>   == Architecture of Pig Interface to External Data Catalog ==
>>>>>>   Pig needs to be able to connect to various types of external data catalogs (databases, catalogs stored in flat files, web services, etc.). To facilitate this
>>>>>> - pig will develop a generic interface that allows it to make specific types of queries to a data catalog. Drivers will then need to be written to implement
>>>>>> + pig will develop a generic interface that allows it to query and update a data catalog. Drivers will then need to be written to implement
>>>>>>   that interface and connect to a specific type of data catalog.
>>>>>>
>>>>>>   == Types of File Specific Metadata Pig Will Use ==
>>>>>> - Pig should be able to acquire the following types of information about a file via either self description or an external data catalog. This is not to say
>>>>>> + Pig should be able to acquire and record the following types of information about a file via either self description or an external data catalog. This is not to say
>>>>>>   that every self describing file or external data catalog must support every one of these items. This is a list of items pig may find useful and should be
>>>>>> - able to query for. If the metadata source cannot provide the information, pig will simply not make use of it.
>>>>>> + able to query for and create. If the metadata source cannot provide or store the information, pig will simply not make use of it or record it.
>>>>>>    * Field layout (already supported)
>>>>>>    * Field types (already supported)
>>>>>>    * Sortedness of the data, both key and direction (ascending/descending)
>>>>>> @@ -52, +54 @@
>>>>>>
>>>>>>   == Priorities ==
>>>>>>   Given that the usage for global metadata is unclear, the priority will be placed on supporting file specific metadata. The first step should be to define the
>>>>>> - interface changes in !LoadFunc and the interface to external data catalogs.
>>>>>> + interface changes in !LoadFunc, !StoreFunc and the interface to external data catalogs.
