The following page has been changed by AlanGates:
http://wiki.apache.org/pig/PigMetaData

------------------------------------------------------------------------------
  information, histograms, etc.
  
  == Pig Interface to File Specific Metadata ==
- Pig should support four options with regard to file specific metadata:
+ Pig should support four options with regard to reading file specific metadata:
   1.  No file specific metadata available.  Pig uses the file as input with no knowledge of its content.  All data is assumed to be !ByteArrays.
   2.  User provides schema in the script.  For example, `A = load 'myfile' as (a: chararray, b: int);`.
   3.  Self describing data.  Data may be in a format that describes the schema, such as JSON.  Users may also have other proprietary ways to store information about the data in a file either in the file itself or in an associated file.  Changes to the !LoadFunc interface made as part of the pipeline rework support this for data type and column layout only.  It will need to be expanded to support other types of information about the file.
   4.  Input from a data catalog.  Pig needs to be able to query an external data catalog to acquire information about a file.  All the same information available in option 3 should be available via this interface.  This interface does not yet exist and needs to be designed.
  
+ It should support options 3 and 4 for writing file specific metadata as well.
+ 
  == Pig Interface to Global Metadata ==
- An interface will need to be designed for pig to interface to an external data catalog.
+ An interface will need to be designed for pig to read from and write to an external data catalog.
  
  == Architecture of Pig Interface to External Data Catalog ==
  Pig needs to be able to connect to various types of external data catalogs (databases, catalogs stored in flat files, web services, etc.).  To facilitate this
- pig will develop a generic interface that allows it to make specific types of queries to a data catalog.  Drivers will then need to be written to implement
+ pig will develop a generic interface that allows it to query and update a data catalog.  Drivers will then need to be written to implement
  that interface and connect to a specific type of data catalog.
  
  == Types of File Specific Metadata Pig Will Use ==
- Pig should be able to acquire the following types of information about a file via either self description or an external data catalog.  This is not to say
+ Pig should be able to acquire and record the following types of information about a file via either self description or an external data catalog.  This is not to say
  that every self describing file or external data catalog must support every one of these items.  This is a list of items pig may find useful and should be
- able to query for.  If the metadata source cannot provide the information, pig will simply not make use of it.
+ able to query for and create.  If the metadata source cannot provide or store the information, pig will simply not make use of it or record it.
   * Field layout (already supported)
   * Field types (already supported)
   * Sortedness of the data, both key and direction (ascending/descending)
@@ -52, +54 @@

  
  == Priorities ==
  Given that the usage for global metadata is unclear, the priority will be placed on supporting file specific metadata.  The first step should be to define the
- interface changes in !LoadFunc and the interface to external data catalogs.
+ interface changes in !LoadFunc, !StoreFunc and the interface to external data catalogs.
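
As a rough illustration of the driver architecture described in the page (a generic interface that Pig codes against, with one driver per type of catalog), here is a minimal Java sketch.  All names in it (`FileMetadata`, `DataCatalogDriver`, `InMemoryCatalogDriver`, `NoMetadataDriver`, `CatalogSketch`) are hypothetical and are not part of Pig; the page itself notes that this interface does not yet exist and needs to be designed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical record of file specific metadata; fields mirror items listed on the page.
final class FileMetadata {
    final String[] fieldNames;  // field layout
    final String[] fieldTypes;  // field types, e.g. "chararray", "int"
    final String sortKey;       // null when sortedness is unknown
    final boolean ascending;    // sort direction, meaningful only if sortKey != null

    FileMetadata(String[] fieldNames, String[] fieldTypes, String sortKey, boolean ascending) {
        this.fieldNames = fieldNames;
        this.fieldTypes = fieldTypes;
        this.sortKey = sortKey;
        this.ascending = ascending;
    }
}

// Hypothetical generic interface: Pig would query and update any catalog through it.
interface DataCatalogDriver {
    // Empty when the catalog has nothing for this file; Pig then proceeds without metadata.
    Optional<FileMetadata> query(String path);

    // Returns false when the catalog cannot store the metadata; Pig then skips recording it.
    boolean update(String path, FileMetadata metadata);
}

// Toy driver backed by a map, standing in for a database, flat file, or web service.
final class InMemoryCatalogDriver implements DataCatalogDriver {
    private final Map<String, FileMetadata> entries = new HashMap<>();

    @Override
    public Optional<FileMetadata> query(String path) {
        return Optional.ofNullable(entries.get(path));
    }

    @Override
    public boolean update(String path, FileMetadata metadata) {
        entries.put(path, metadata);
        return true;
    }
}

// Driver for a source that can neither provide nor store metadata.
final class NoMetadataDriver implements DataCatalogDriver {
    @Override
    public Optional<FileMetadata> query(String path) {
        return Optional.empty();
    }

    @Override
    public boolean update(String path, FileMetadata metadata) {
        return false;
    }
}

class CatalogSketch {
    public static void main(String[] args) {
        DataCatalogDriver catalog = new InMemoryCatalogDriver();
        FileMetadata metadata = new FileMetadata(
                new String[] {"a", "b"}, new String[] {"chararray", "int"}, "a", true);
        catalog.update("myfile", metadata);
        catalog.query("myfile").ifPresent(m -> System.out.println(
                "sorted on " + m.sortKey + (m.ascending ? " ascending" : " descending")));
    }
}
```

Returning an empty `Optional` from `query` and `false` from `update` is one way to model the rule that a source which cannot provide or store an item is simply not used for it; a real design might instead report capabilities per metadata item.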
  
