Alan, I will start thinking about this as well. When do you want to start the implementation?
Pi

On 5/29/08, Apache Wiki <[EMAIL PROTECTED]> wrote:
>
> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
> change notification.
>
> The following page has been changed by AlanGates:
> http://wiki.apache.org/pig/PigMetaData
>
> ------------------------------------------------------------------------------
>   information, histograms, etc.
>
>   == Pig Interface to File Specific Metadata ==
> - Pig should support four options with regard to file specific metadata:
> + Pig should support four options with regard to reading file specific
>   metadata:
>   1. No file specific metadata available. Pig uses the file as input with
>      no knowledge of its content. All data is assumed to be !ByteArrays.
>   2. User provides schema in the script. For example, `A = load 'myfile'
>      as (a: chararray, b: int);`.
>   3. Self describing data. Data may be in a format that describes the
>      schema, such as JSON. Users may also have other proprietary ways to
>      store information about the data in a file either in the file itself
>      or in an associated file. Changes to the !LoadFunc interface made as
>      part of the pipeline rework support this for data type and column
>      layout only. It will need to be expanded to support other types of
>      information about the file.
>   4. Input from a data catalog. Pig needs to be able to query an external
>      data catalog to acquire information about a file. All the same
>      information available in option 3 should be available via this
>      interface. This interface does not yet exist and needs to be designed.
>
> + It should support options 3 and 4 for writing file specific metadata as
>   well.
> +
>   == Pig Interface to Global Metadata ==
> - An interface will need to be designed for pig to interface to an external
>   data catalog.
> + An interface will need to be designed for pig to read from and write to
>   an external data catalog.
>
>   == Architecture of Pig Interface to External Data Catalog ==
>   Pig needs to be able to connect to various types of external data catalogs
>   (databases, catalogs stored in flat files, web services, etc.). To
>   facilitate this
> - pig will develop a generic interface that allows it to make specific
>   types of queries to a data catalog. Drivers will then need to be written
>   to implement
> + pig will develop a generic interface that allows it to query and update a
>   data catalog. Drivers will then need to be written to implement
>   that interface and connect to a specific type of data catalog.
>
>   == Types of File Specific Metadata Pig Will Use ==
> - Pig should be able to acquire the following types of information about a
>   file via either self description or an external data catalog. This is not
>   to say
> + Pig should be able to acquire and record the following types of
>   information about a file via either self description or an external data
>   catalog. This is not to say
>   that every self describing file or external data catalog must support
>   every one of these items. This is a list of items pig may find useful and
>   should be
> - able to query for. If the metadata source cannot provide the
>   information, pig will simply not make use of it.
> + able to query for and create. If the metadata source cannot provide or
>   store the information, pig will simply not make use of it or record it.
>   * Field layout (already supported)
>   * Field types (already supported)
>   * Sortedness of the data, both key and direction (ascending/descending)
> @@ -52, +54 @@
>
>   == Priorities ==
>   Given that the usage for global metadata is unclear, the priority will be
>   placed on supporting file specific metadata. The first step should be to
>   define the
> - interface changes in !LoadFunc and the interface to external data
>   catalogs.
> + interface changes in !LoadFunc, !StoreFunc and the interface to external
>   data catalogs.
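
For illustration only, the driver architecture described in the quoted page (a generic interface that can query and update a catalog, with drivers behind it, and metadata that may simply be unavailable) could be sketched in Java roughly as follows. Every name here (MetadataCatalog, InMemoryCatalog, CatalogDemo, the "sortKey" and "sortDirection" keys) is hypothetical and is not part of Pig or of the wiki proposal; the map-backed driver is a stand-in for a real database, flat-file, or web-service driver.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical generic interface; none of these names come from Pig itself.
interface MetadataCatalog {
    // Returns the value if the catalog has it; empty means "not available".
    Optional<String> query(String file, String key);

    // Returns true if the catalog stored the value, false if it cannot store this key.
    boolean record(String file, String key, String value);
}

// Stand-in driver backed by a map. A real driver would connect to a
// database, a flat-file catalog, or a web service.
class InMemoryCatalog implements MetadataCatalog {
    private final Map<String, String> entries = new HashMap<>();

    @Override
    public Optional<String> query(String file, String key) {
        return Optional.ofNullable(entries.get(file + "#" + key));
    }

    @Override
    public boolean record(String file, String key, String value) {
        // This toy driver only knows how to store sort information.
        if (!key.equals("sortKey") && !key.equals("sortDirection")) {
            return false;
        }
        entries.put(file + "#" + key, value);
        return true;
    }
}

class CatalogDemo {
    // Uses sort metadata when both pieces are present, and otherwise
    // proceeds without it, as the quoted page suggests.
    static String describeSort(MetadataCatalog catalog, String file) {
        Optional<String> key = catalog.query(file, "sortKey");
        Optional<String> dir = catalog.query(file, "sortDirection");
        if (key.isPresent() && dir.isPresent()) {
            return "sorted on " + key.get() + " " + dir.get();
        }
        return "no sort information";
    }
}
```

Returning an empty Optional on reads and false on writes is one way to model "the metadata source cannot provide or store the information" without forcing every driver to support every item; other designs (exceptions, capability flags) would work as well.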
