Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "LoadStoreRedesignProposal" page has been changed by AlanGates:
http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=4&rev2=5

  interface LoadFunc {
  
      /**
-      * Communicate to the loader the URIs used in Pig Latin to refer to the 
+      * Communicate to the loader the load string used in Pig Latin to refer to the
       * object(s) being loaded.  That is, if the PL script is
       * <b>A = load 'bla'</b>
-      * then 'bla' is the URI.  Load functions should assume that if no
-      * scheme is provided in the URI it is an hdfs file.  This will be 
+      * then 'bla' is the load string.  In general Pig expects these to be
+      * a path name, a glob, or a URI.  If there is no URI scheme present,
+      * Pig will assume it is a file name.  This will be 
       * called during planning on the front end, not during execution on
       * the backend.
-      * @param uri URIs referenced in load statement.
+      * @param location Location indicated in load statement.
+      * @throws IOException if the location is not valid.
       */
-     void setURI(URI[] uri);
+     void setLocation(String location) throws IOException;
      
      /**
       * Return the InputFormat associated with this loader.  This will be
       * called during planning on the front end.  The LoadFunc need not
       * carry the InputFormat information to the backend, as it will
-      * be provided with the appropriate RecordReader there.
+      * be provided with the appropriate RecordReader there.  This is the
+      * instance of InputFormat (rather than the class name) because the 
+      * load function may need to instantiate the InputFormat in order 
+      * to control how it is constructed.
       */
      InputFormat getInputFormat();
  
@@ -77, +82 @@

  
      /**
      * Initializes LoadFunc for reading data.  This will be called during execution
-      * before any calls to getNext.
+      * before any calls to getNext.  The RecordReader needs to be passed here because
+      * it has been instantiated for a particular InputSplit.
       * @param reader RecordReader to be used by this instance of the LoadFunc
       */
      void prepareToRead(RecordReader reader);
@@ -100, +106 @@

  }}}
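The default-scheme rule described in the setLocation comment above can be sketched concretely.  `schemeOf` below is a hypothetical helper, not part of the proposed interface, and assumes standard `java.net.URI` parsing:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class LocationSketch {

    /**
     * Return the URI scheme of a load location, defaulting to "file"
     * when no scheme is present, as the proposed setLocation contract
     * describes.  Plain paths and globs therefore come back as "file".
     */
    public static String schemeOf(String location) {
        try {
            URI uri = new URI(location);
            return uri.getScheme() == null ? "file" : uri.getScheme();
        } catch (URISyntaxException e) {
            // Locations that are not parseable URIs are treated as
            // plain file paths.
            return "file";
        }
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("hdfs://namenode/data/bla")); // hdfs
        System.out.println(schemeOf("/user/alan/bla"));           // file
    }
}
```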
  
  Open questions for !LoadFunc:
-  1. Should setURI instead be setLocation and just take a String?  The advantage of a URI is we know exactly what users are trying to communicate with, and we can define what Pig does in default cases (when a scheme is not given).  The disadvantage is forcing more structure on users and their load functions.  I'm still pretty strongly on the side of using URI.
+  1. Should setLocation instead be setURI and take a URI?  The advantage of a URI is we know exactly what users are trying to communicate with, and we can define what Pig does in default cases (when a scheme is not given).  The disadvantage is forcing more structure on users and their load functions.
  
  The '''!LoadCaster''' interface will include the bytesToInt, bytesToLong, etc. functions
  currently in !LoadFunc.  !Utf8StorageConverter will implement this interface.
@@ -121, +127 @@

       * not possible to return a schema that represents all returned data,
       * then null should be returned.
       */
-     LoadSchema getSchema();
+     ResourceSchema getSchema();
  
      /**
       * Get statistics about the data to be loaded.  If no statistics are
       * available, then null should be returned.
       */
-     LoadStatistics getStatistics();
+     ResourceStatistics getStatistics();
+ 
+     /**
+      * Find what columns are partition keys for this input.
+      * This function assumes that setLocation has already been called.
+      * @return array of field names of the partition keys.
+      */
+     String[] getPartitionKeys();
+ 
+     /**
+      * Set the filter for partitioning.  It is assumed that this filter
+      * will only contain references to fields given as partition keys in
+      * getPartitionKeys.
+      * @param plan OperatorPlan that describes the filter for partitioning
+      * @throws IOException if the filter is not compatible with the storage
+      * mechanism or contains non-partition fields.
+      */
+     void setPartitionFilter(OperatorPlan plan) throws IOException;
  
  }
  
  }}}
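To make the setPartitionFilter contract above concrete, here is a hypothetical validation sketch.  Walking an !OperatorPlan to collect the referenced field names is elided; the names are passed in directly:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PartitionFilterCheck {

    /**
     * Reject a filter that references fields outside the partition keys,
     * mirroring the IOException contract of setPartitionFilter.
     */
    public static void validateFilterFields(String[] partitionKeys,
                                            Set<String> referencedFields)
            throws IOException {
        Set<String> keys = new HashSet<String>(Arrays.asList(partitionKeys));
        for (String field : referencedFields) {
            if (!keys.contains(field)) {
                throw new IOException(
                    "Filter references non-partition field: " + field);
            }
        }
    }
}
```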
  
- '''!LoadSchema''' will be a top level object (`org.apache.pig.LoadSchema`) used to communicate information about data to be loaded or that is being
+ '''!ResourceSchema''' will be a top level object (`org.apache.pig.ResourceSchema`) used to communicate information about data to be loaded or that is being
  stored.  It is not the same as the existing `org.apache.pig.impl.logicalLayer.schema.Schema`.
  
  {{{
-     public class LoadSchema {
+     public class ResourceSchema {
  
          int version;
  
-         public class LoadFieldSchema {
+         public class ResourceFieldSchema {
              public String name;
              public DataType type;
              public String description;
-             public LoadFieldSchema schema; // nested tuples and bags will have their own schema
+             public ResourceFieldSchema schema; // nested tuples and bags will have their own schema
          }
  
-         public LoadFieldSchema[] fields;
+         public ResourceFieldSchema[] fields;
          public Map<String, Integer> byName;
  
          enum Order { ASCENDING, DESCENDING }
          public int[] sortKeys; // each entry is an offset into the fields array.
          public Order[] sortKeyOrders; 
-         public int[] partitionKeys; // each entry is an offset into the fields array.
      }
  }}}
  
  Feedback from Pradeep:  We must fix the two-level access issues with the schema of bags in the current schema before we make these changes, otherwise that same contagion will afflict us here.
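As an illustration of how the byName map in !ResourceSchema relates to the fields array, a minimal self-contained sketch (the nested types here are stubs, not the real Pig classes):

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaSketch {

    // Stand-in for Pig's DataType; not the real class.
    public enum DataType { INTEGER, CHARARRAY }

    // Stripped-down stand-in for the proposed ResourceFieldSchema.
    public static class ResourceFieldSchema {
        public String name;
        public DataType type;
    }

    /**
     * Build the byName index over the fields array: field name mapped to
     * its offset, the same offsets that sortKeys entries refer to.
     */
    public static Map<String, Integer> indexByName(ResourceFieldSchema[] fields) {
        Map<String, Integer> byName = new HashMap<String, Integer>();
        for (int i = 0; i < fields.length; i++) {
            byName.put(fields[i].name, i);
        }
        return byName;
    }
}
```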
  
- '''!LoadStatistics'''
+ '''!ResourceStatistics'''
  {{{
-     public class LoadStatistics {
+     public class ResourceStatistics {
  
-         public class LoadFieldStatistics {
+         public class ResourceFieldStatistics {
  
              int version;
  
@@ -184, +206 @@

  
          public long mBytes; // size in megabytes
          public long numRecords;  // number of records
-         public LoadFieldStatistics[] fields;
+         public ResourceFieldStatistics[] fields;
  
          // Probably more in here
      }
  }}}
  
- At this point, !LoadStatistics is poorly understood.  In initial versions we may choose not to implement it.  In additions to questions
+ At this point, !ResourceStatistics is poorly understood.  In initial versions we may choose not to implement it.  In addition to questions
  on what should be in the statistics, there are questions on how statistics should be communicated in relation to partitions.  For example,
  when loading from a table that is stored in owl or hive, one or more partitions may be being loaded.  Assuming statistics are kept at the
  partition level, how are these statistics then communicated to Pig?  Is it the loader's job to combine the statistics for all of the
- partitions being read?  Or does it return an array of !LoadStatistics?  But if so, what does Pig do with it since it does not know which
+ partitions being read?  Or does it return an array of !ResourceStatistics?  But if so, what does Pig do with it since it does not know which
  tuples belong to which partitions (and doesn't want to know)?  Even worse, on store any statistics Pig has to report are across all data
  being stored.  But the storage function underneath may choose to partition the data.  How does it then separate those statistics for the
  different partitions?  In these cases should store functions be in charge of calculating statistics?  Perhaps some statistics that can be
@@ -259, +281 @@

      OutputFormat getOutputFormat();
  
      /**
-      * Communicate to the store function the URIs used in Pig Latin to refer 
+      * Communicate to the store function the location used in Pig Latin to refer 
       * to the object(s) being stored.  That is, if the PL script is
       * <b>store A into 'bla'</b>
-      * then 'bla' is the URI.  Store functions should assume that if no
-      * scheme is provided in the URI it is an hdfs file.  This will be 
+      * then 'bla' is the location.  This location should be either a file name
+      * or a URI.  If it does not have a URI scheme Pig will assume it is a 
+      * filename.  This will be 
       * called during planning on the front end, not during execution on
       * the backend.
-      * @param uri URIs referenced in store statement.
+      * @param location Location indicated in store statement.
+      * @throws IOException if the location is not valid.
       */
-     void setURI(URI[] uri);
+     void setLocation(String location) throws IOException;
   
      /**
       * Set the schema for data to be stored.  This will be called on the
@@ -284, +308 @@

      * @throws IOException if this schema is not acceptable.  It should include
       * a detailed error message indicating what is wrong with the schema.
       */
-     void setSchema(LoadSchema s) throws IOException;
+     void setSchema(ResourceSchema s) throws IOException;
  
      /**
       * Initialize StoreFunc to write data.  This will be called during
@@ -324, +348 @@

      /**
       * Set statistics about the data being written.
       */
-     void setStatistics(LoadStatistics stats);
+     void setStatistics(ResourceStatistics stats);
  
  }
  
  }}}
  
- Given the uncertainly noted above under !LoadStatistics on how statistics should be stored, it is not clear that this interface makes sense.
+ Given the uncertainty noted above under !ResourceStatistics on how statistics should be stored, it is not clear that this interface makes sense.
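One option raised under !ResourceStatistics is that the loader itself combines per-partition statistics before handing them to Pig.  A sketch under that assumption, with the statistics stubbed down to the two summary fields shown earlier:

```java
public class StatsSketch {

    // Minimal stand-in for the proposed ResourceStatistics summary fields.
    public static class ResourceStatistics {
        public long mBytes;      // size in megabytes
        public long numRecords;  // number of records
    }

    /**
     * Combine per-partition statistics into a single summary, under the
     * assumption (one of the options discussed above) that the loader,
     * not Pig, does the combining.
     */
    public static ResourceStatistics combine(ResourceStatistics[] perPartition) {
        ResourceStatistics total = new ResourceStatistics();
        for (ResourceStatistics p : perPartition) {
            total.mBytes += p.mBytes;
            total.numRecords += p.numRecords;
        }
        return total;
    }
}
```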
  
  == LoadFunc and InputFormat Interaction ==
  
@@ -434, +458 @@

  Open Questions:
  1. Does all this force us to switch to Hadoop for local mode as well?  We aren't opposed to using Hadoop for local mode; it just needs to get reasonably fast.  Can we use !InputFormat ''et al.'' on local files without using the whole HDFS structure?
  
+ == Changes ==
+ Sept 23 2009, Gates
+  * Changed setURI to setLocation in !LoadFunc and !StoreFunc.  Also changed it to throw IOException in cases where the passed-in location is not valid for this load or store mechanism.
+  * Changed LoadSchema to ResourceSchema and LoadStatistics to ResourceStatistics.
+  * Added getPartitionKeys and setPartitionFilter to LoadMetadata.
+ 
