I don't get it, Mathieu. "UDF" is a very broad term. It could mean a load UDF, a store UDF, or a UDF used as a function in the pipeline. Can you explain a bit more?
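
For concreteness, here is a minimal sketch of one possible reading of the proposal quoted below: a Map<String, Serializable> attached to a field, written by one UDF and read by another downstream. Every name in it (Annotations, AnnotatedField, AnnotationDemo, the "unit" key) is hypothetical and made up for illustration; none of it is existing Pig API, and it does not settle whether the writer would be a load function, a store function, or a pipeline function.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical free-form metadata holder; not part of Pig.
class Annotations {
    private final Map<String, Serializable> values =
            new HashMap<String, Serializable>();

    void put(String key, Serializable value) {
        values.put(key, value);
    }

    // Returns null when nothing was recorded under this key.
    Serializable get(String key) {
        return values.get(key);
    }
}

// Hypothetical field description: a name plus its annotations.
class AnnotatedField {
    final String name;
    final Annotations annotations = new Annotations();

    AnnotatedField(String name) {
        this.name = name;
    }
}

class AnnotationDemo {
    // Stands in for an upstream UDF that records something about the field.
    static void upstream(AnnotatedField field) {
        field.annotations.put("unit", "milliseconds");
    }

    // Stands in for a downstream UDF that reads what was recorded, if anything.
    static String downstream(AnnotatedField field) {
        Serializable unit = field.annotations.get("unit");
        return unit == null ? "unknown" : unit.toString();
    }

    public static void main(String[] args) {
        AnnotatedField latency = new AnnotatedField("latency");
        upstream(latency);
        System.out.println(latency.name + " is in " + downstream(latency));
    }
}
```

The open question is which of the three kinds of UDF would play the upstream and downstream roles here.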
On Fri, May 30, 2008 at 9:14 PM, Mathieu Poumeyrol <[EMAIL PROTECTED]> wrote:

> All,
>
> Looking at the very extensive list of types of file specific metadata, I
> think (from experience) that a UDF function may need to attach some
> information (any information, actually) to a given field (or file) to be
> retrieved by another UDF downstream.
>
> What about adding a Map<String, Serializable> to each file and each field?
>
> --
> Mathieu
>
> On 30 May 08 at 01:24, pi song wrote:
>
>> Alan,
>>
>> I will start thinking about this as well. When do you want to start the
>> implementation?
>>
>> Pi
>>
>> On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>>
>>> Dear Wiki user,
>>>
>>> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
>>> change notification.
>>>
>>> The following page has been changed by AlanGates:
>>> http://wiki.apache.org/pig/PigMetaData
>>>
>>> ------------------------------------------------------------------------------
>>>   information, histograms, etc.
>>>
>>>   == Pig Interface to File Specific Metadata ==
>>> - Pig should support four options with regard to file specific metadata:
>>> + Pig should support four options with regard to reading file specific
>>>   metadata:
>>>    1. No file specific metadata available. Pig uses the file as input
>>>       with no knowledge of its content. All data is assumed to be
>>>       !ByteArrays.
>>>    2. User provides schema in the script. For example,
>>>       `A = load 'myfile' as (a: chararray, b: int);`.
>>>    3. Self describing data. Data may be in a format that describes the
>>>       schema, such as JSON. Users may also have other proprietary ways
>>>       to store information about the data in a file either in the file
>>>       itself or in an associated file. Changes to the !LoadFunc
>>>       interface made as part of the pipeline rework support this for
>>>       data type and column layout only. It will need to be expanded to
>>>       support other types of information about the file.
>>>    4. Input from a data catalog. Pig needs to be able to query an
>>>       external data catalog to acquire information about a file. All
>>>       the same information available in option 3 should be available
>>>       via this interface. This interface does not yet exist and needs
>>>       to be designed.
>>>
>>> + It should support options 3 and 4 for writing file specific metadata
>>>   as well.
>>> +
>>>   == Pig Interface to Global Metadata ==
>>> - An interface will need to be designed for pig to interface to an
>>>   external data catalog.
>>> + An interface will need to be designed for pig to read from and write
>>>   to an external data catalog.
>>>
>>>   == Architecture of Pig Interface to External Data Catalog ==
>>>   Pig needs to be able to connect to various types of external data
>>>   catalogs (databases, catalogs stored in flat files, web services,
>>>   etc.). To facilitate this
>>> - pig will develop a generic interface that allows it to make specific
>>>   types of queries to a data catalog. Drivers will then need to be
>>>   written to implement
>>> + pig will develop a generic interface that allows it to query and
>>>   update a data catalog. Drivers will then need to be written to
>>>   implement
>>>   that interface and connect to a specific type of data catalog.
>>>
>>>   == Types of File Specific Metadata Pig Will Use ==
>>> - Pig should be able to acquire the following types of information
>>>   about a file via either self description or an external data
>>>   catalog. This is not to say
>>> + Pig should be able to acquire and record the following types of
>>>   information about a file via either self description or an external
>>>   data catalog. This is not to say
>>>   that every self describing file or external data catalog must
>>>   support every one of these items. This is a list of items pig may
>>>   find useful and should be
>>> - able to query for. If the metadata source cannot provide the
>>>   information, pig will simply not make use of it.
>>> + able to query for and create. If the metadata source cannot provide
>>>   or store the information, pig will simply not make use of it or
>>>   record it.
>>>    * Field layout (already supported)
>>>    * Field types (already supported)
>>>    * Sortedness of the data, both key and direction
>>>      (ascending/descending)
>>> @@ -52, +54 @@
>>>
>>>   == Priorities ==
>>>   Given that the usage for global metadata is unclear, the priority
>>>   will be placed on supporting file specific metadata. The first step
>>>   should be to define the
>>> - interface changes in !LoadFunc and the interface to external data
>>>   catalogs.
>>> + interface changes in !LoadFunc, !StoreFunc and the interface to
>>>   external data catalogs.
