[Pig Wiki] Update of "Pig070LoadStoreHowTo" by PradeepKamath

Apache Wiki Mon, 22 Mar 2010 15:06:54 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The "Pig070LoadStoreHowTo" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=13&rev2=14

--------------------------------------------------

  
  The !LoadFunc abstract class is the main class to extend when implementing a 
loader. The methods that need to be overridden are explained below:
   * getInputFormat(): This method is called by Pig to get the !InputFormat 
used by the loader. Pig calls the methods of the !InputFormat (and of the 
underlying !RecordReader) in the same manner (and in the same context) as 
Hadoop does in a map-reduce Java program. If the !InputFormat is one packaged 
with Hadoop, the implementation should use the new-API version under 
org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should likewise 
be implemented against the new API in org.apache.hadoop.mapreduce. If a custom 
loader using a text-based or file-based !InputFormat needs to read files in all 
subdirectories under a given input directory recursively, it should use the 
!PigFileInputFormat and !PigTextInputFormat classes provided in 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. This works around 
the current limitation in Hadoop's !TextInputFormat and !FileInputFormat, which 
read only one level down from the provided input directory. For example, if the 
input in the load statement is 'dir1' and there are subdirectories 'dir2' and 
'dir2/dir3' underneath dir1, then with Hadoop's !TextInputFormat or 
!FileInputFormat only the files directly under 'dir1' can be read. With 
!PigFileInputFormat or !PigTextInputFormat (or classes extending them), files in 
all of these directories can be read.
-       
- 
- Changes to custom Load Functions
-       
- 
- Low to medium
-       
- 
- 
-       
- 
- This is to get around the problem of MAPREDUCE-1577. 
   * setLocation(): This method is called by Pig to communicate the load 
location to the loader. The loader should use this method to communicate the 
same information to the underlying !InputFormat. Pig calls this method multiple 
times, so implementations should ensure that repeated calls have no 
inconsistent side effects.
   * prepareToRead(): Through this method the !RecordReader associated with 
the !InputFormat provided by the !LoadFunc is passed to the !LoadFunc. The 
implementation can then use the !RecordReader in getNext() to return a tuple 
representing a record of data back to Pig.
   * getNext(): The meaning of getNext() has not changed: it is called by the 
Pig runtime to get the next tuple in the data. In this method the 
implementation should use the underlying !RecordReader to construct the tuple 
to return.
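
The subdirectory example given under getInputFormat() can be modelled with plain Java, without Hadoop or Pig on the classpath. The sketch below is only an illustration of the two listing behaviours (one level versus recursive). It is not how !PigFileInputFormat or Hadoop's !FileInputFormat is implemented, and the class name, method names and file names (InputListingDemo, listOneLevel, listRecursive, a.txt, b.txt, c.txt) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-alone model; none of these names come from Pig or Hadoop.
class InputListingDemo {

    // Models the one-level behaviour: only regular files directly under dir.
    static List<String> listOneLevel(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile)
                          .map(p -> dir.relativize(p).toString().replace('\\', '/'))
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    // Models the recursive behaviour: regular files at any depth under dir.
    static List<String> listRecursive(Path dir) throws IOException {
        try (Stream<Path> entries = Files.walk(dir)) {
            return entries.filter(Files::isRegularFile)
                          .map(p -> dir.relativize(p).toString().replace('\\', '/'))
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    // Builds dir1/a.txt, dir1/dir2/b.txt and dir1/dir2/dir3/c.txt in a temp directory.
    static Path buildSampleTree() throws IOException {
        Path dir1 = Files.createTempDirectory("dir1");
        Path dir2 = dir1.resolve("dir2");
        Path dir3 = Files.createDirectories(dir2.resolve("dir3"));
        Files.write(dir1.resolve("a.txt"), "a\n".getBytes());
        Files.write(dir2.resolve("b.txt"), "b\n".getBytes());
        Files.write(dir3.resolve("c.txt"), "c\n".getBytes());
        return dir1;
    }

    public static void main(String[] args) throws IOException {
        Path dir1 = buildSampleTree();
        System.out.println("one level: " + listOneLevel(dir1));
        System.out.println("recursive: " + listRecursive(dir1));
    }
}
```

With the sample tree, the one-level listing contains only a.txt, while the recursive listing also contains dir2/b.txt and dir2/dir3/c.txt. This is the difference a loader gets by returning !PigTextInputFormat or !PigFileInputFormat from getInputFormat().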
