Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "LoadStoreRedesignProposal" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=8&rev2=9

--------------------------------------------------

  '''!LoadFunc'''
  
  {{{
+ 
  /**
   * This interface is used to implement functions to parse records
   * from a dataset.
   */
- interface LoadFunc {
+ public interface LoadFunc {
+     /**
+      * This method is called by the Pig runtime in the front end to convert the
+      * input location to an absolute path if the location is relative. The
+      * LoadFunc implementation is free to choose how it converts a relative 
+      * location to an absolute location since this may depend on what the location
+      * string represents (an HDFS path or some other data source).
+      * 
+      * @param location location as provided in the "load" statement of the script
+      * @param curDir the current working directory based on any "cd" statements
+      * in the script before the "load" statement
+      * @return the absolute location based on the arguments passed
+      * @throws IOException if the conversion is not possible
+      */
+     String relativeToAbsolutePath(String location, String curDir) throws IOException;
  
      /**
       * Communicate to the loader the load string used in Pig Latin to refer to the 
-      * object(s) being loaded.  That is, if the PL script is
-      * <b>A = load 'bla'</b>
-      * then 'bla' is the load string.  In general Pig expects these to be
-      * a path name, a glob, or a URI.  If there is no URI scheme present,
-      * Pig will assume it is a file name.  This will be 
-      * called during planning on the front end, not during execution on
-      * the backend.
-      * @param location Location indicated in load statement.
+      * object(s) being loaded.  The location string passed to the LoadFunc here 
+      * is the return value of {@link LoadFunc#relativeToAbsolutePath(String, String)}.
+      * 
+      * This method will be called in the backend multiple times. Implementations
+      * should bear in mind that this method is called multiple times and should
+      * ensure there are no inconsistent side effects due to the multiple calls.
+      * 
+      * @param location Location as returned by 
+      * {@link LoadFunc#relativeToAbsolutePath(String, String)}.
+      * @param job the {@link Job} object
       * @throws IOException if the location is not valid.
       */
-     void setLocation(String location) throws IOException;
+     void setLocation(String location, Job job) throws IOException;
      
      /**
+      * This will be called during planning on the front end. This is the
-      * Return the InputFormat associated with this loader.  This will be
-      * called during planning on the front end.  The LoadFunc need not
-      * carry the InputFormat information to the backend, as it will
-      * be provided with the appropriate RecordReader there.  This is the
       * instance of InputFormat (rather than the class name) because the 
       * load function may need to instantiate the InputFormat in order 
       * to control how it is constructed.
+      * @return the InputFormat associated with this loader.
+      * @throws IOException if there is an exception during InputFormat 
+      * construction
       */
-     InputFormat getInputFormat();
+     InputFormat getInputFormat() throws IOException;
  
      /**
+      * This will be called on the front end during planning and not on the back 
+      * end during execution.
-      * Return the LoadCaster associated with this loader.  Returning
+      * @return the {@link LoadCaster} associated with this loader. Returning null 
-      * null indicates that casts from byte array are not supported
+      * indicates that casts from byte array are not supported for this loader. 
-      * for this loader.  This will be called on the front end during
-      * planning and not on the back end during execution.
+      * @throws IOException if there is an exception during LoadCaster 
+      * construction
       */
-     LoadCaster getLoadCaster();
+     LoadCaster getLoadCaster() throws IOException;
  
      /**
       * Initializes LoadFunc for reading data.  This will be called during execution
       * before any calls to getNext.  The RecordReader needs to be passed here because
       * it has been instantiated for a particular InputSplit.
-      * @param reader RecordReader to be used by this instance of the LoadFunc
+      * @param reader {@link RecordReader} to be used by this instance of the LoadFunc
+      * @param split The input {@link PigSplit} to process
+      * @throws IOException if there is an exception during initialization
       */
+     void prepareToRead(RecordReader reader, PigSplit split) throws IOException;
-     void prepareToRead(RecordReader reader);
- 
-     /**
-      * Called after all reading is finished.
-      */
-     void doneReading();
  
      /**
       * Retrieves the next tuple to be processed.
       * @return the next tuple to be processed or null if there are no more tuples
       * to be processed.
-      * @throws IOException
+      * @throws IOException if there is an exception while retrieving the next
+      * tuple
       */
      Tuple getNext() throws IOException;
  
@@ -119, +136 @@

   * If a given loader does not implement this interface, it will be assumed that it
   * is unable to provide metadata about the associated data.
   */
- interface LoadMetadata {
+ public interface LoadMetadata {
  
      /**
+      * Get a schema for the data to be loaded.  
+      * @param location Location as returned by 
+      * {@link LoadFunc#relativeToAbsolutePath(String, String)}
+      * @param conf The {@link Configuration} object 
-      * Get a schema for the data to be loaded.  This schema should represent
+      * @return schema for the data to be loaded. This schema should represent
       * all tuples of the returned data.  If the schema is unknown or it is
       * not possible to return a schema that represents all returned data,
       * then null should be returned.
+      * @throws IOException if an exception occurs while determining the schema
       */
-     ResourceSchema getSchema();
+     ResourceSchema getSchema(String location, Configuration conf) throws 
+     IOException;
  
      /**
       * Get statistics about the data to be loaded.  If no statistics are
       * available, then null should be returned.
+      * @param location Location as returned by 
+      * {@link LoadFunc#relativeToAbsolutePath(String, String)}
+      * @param conf The {@link Configuration} object
+      * @return statistics about the data to be loaded.  If no statistics are
+      * available, then null should be returned.
+      * @throws IOException if an exception occurs while retrieving statistics
       */
-     ResourceStatistics getStatistics();
+     ResourceStatistics getStatistics(String location, Configuration conf) 
+     throws IOException;
  
      /**
       * Find what columns are partition keys for this input.
-      * This function assumes that setLocation has already been called.
+      * @param location Location as returned by 
+      * {@link LoadFunc#relativeToAbsolutePath(String, String)}
+      * @param conf The {@link Configuration} object
       * @return array of field names of the partition keys.
+      * @throws IOException if an exception occurs while retrieving partition keys
       */
-     String>[] getPartitionKeys();
+     String[] getPartitionKeys(String location, Configuration conf) 
+     throws IOException;
  
      /**
       * Set the filter for partitioning.  It is assumed that this filter
@@ -484, +518 @@

  Sept 29 2009, Gates
   * Added answer for open question 1.  Added and answered open questions 2 and 3.
  
+ Nov 2 2009, Pradeep Kamath
+ 
+ In LoadFunc:
+  * Added relativeToAbsolutePath() method in LoadFunc per http://issues.apache.org/jira/browse/PIG-879?focusedCommentId=12768818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12768818
+  * Changed comments in setLocation regarding the location passed - the location will now be the return value of relativeToAbsolutePath()
+  * setLocation() now also takes a Job argument since the main purpose of this call is to give the LoadFunc implementation an opportunity to communicate the input location to the underlying InputFormat. InputFormat implementations in turn seem to store this information in the Job. For example, FileInputFormat has the following static method to set the input location: setInputPaths(JobConf conf, String commaSeparatedPaths);
+  * All methods can now throw IOException - this keeps the interface more flexible for exception cases
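
As a rough illustration of the LoadFunc changes listed above, the Java sketch below resolves a relative location against the current directory and then records it in a job object in a way that is safe to repeat. `LocationSketch`, `FakeJob`, and the `sketch.input.location` key are hypothetical stand-ins invented for this example, not Pig or Hadoop classes. A real loader would implement the proposed `LoadFunc` interface and would more likely hand the location to its InputFormat.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from Pig or Hadoop.
class LocationSketch {
    /** Stand-in for the Hadoop Job; it only holds string properties. */
    static class FakeJob {
        final Map<String, String> props = new HashMap<>();
    }

    // Assumed property key, chosen for illustration only.
    static final String INPUT_KEY = "sketch.input.location";

    /** Treats a leading "/" or a URI scheme as "already absolute". */
    static boolean isAbsolute(String location) {
        return location.startsWith("/") || location.contains("://");
    }

    /** Resolves a relative location against curDir; absolute ones pass through. */
    static String relativeToAbsolutePath(String location, String curDir) throws IOException {
        if (location == null || location.isEmpty()) {
            throw new IOException("empty location");
        }
        if (isAbsolute(location)) {
            return location;
        }
        String base = curDir.endsWith("/") ? curDir : curDir + "/";
        return base + location;
    }

    /** Overwrites rather than appends, so repeated calls leave the job unchanged. */
    static void setLocation(String location, FakeJob job) throws IOException {
        if (!isAbsolute(location)) {
            throw new IOException("location is not absolute: " + location);
        }
        job.props.put(INPUT_KEY, location);
    }
}
```

Because `setLocation` overwrites a single key instead of appending to it, calling it several times with the same location has no cumulative effect, which is the property the javadoc asks implementations to preserve.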
+ 
+ In LoadMetadata:
+  * getSchema(), getStatistics() and getPartitionKeys() methods now take a location and Configuration argument so that the implementation can use them to produce the requested information.
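
The LoadMetadata change can be pictured with a similar sketch: a metadata lookup that depends only on the location and configuration passed in, and returns null when nothing is known. Here a plain `Map` stands in for Hadoop's `Configuration` and a `String[]` of field names stands in for `ResourceSchema`. Both substitutions, the class name `MetadataSketch`, and the `sketch.schema.` key prefix are assumptions made for illustration only.

```java
import java.io.IOException;
import java.util.Map;

// Hypothetical sketch: the Map stands in for Configuration and String[]
// stands in for ResourceSchema; neither is the proposed API.
class MetadataSketch {
    // Assumed key prefix, for illustration only.
    static final String SCHEMA_PREFIX = "sketch.schema.";

    /** Returns the field names recorded for the location, or null if none are known. */
    static String[] getSchema(String location, Map<String, String> conf) throws IOException {
        if (location == null) {
            throw new IOException("location must not be null");
        }
        String fields = conf.get(SCHEMA_PREFIX + location);
        if (fields == null || fields.isEmpty()) {
            return null; // schema unknown, as the javadoc permits
        }
        return fields.split(",");
    }
}
```

Passing the location and configuration explicitly means the lookup does not rely on an earlier `setLocation` call having stored state in the loader.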
+ 
