[Pig Wiki] Update of "Pig070LoadStoreHowTo" by PradeepKamath

Apache Wiki Mon, 22 Mar 2010 15:06:12 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Pig070LoadStoreHowTo" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=12&rev2=13

--------------------------------------------------

* [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadCaster.java?view=markup|LoadCaster]]
has methods to convert byte arrays to specific types. A loader
implementation should implement this if casts (implicit or explicit) from
!DataByteArray fields to other types need to be supported.

The !LoadFunc abstract class is the main class to extend for implementing a
loader. The methods which need to be overridden are explained below:
- * getInputFormat() :This method will be called by Pig to get the
!InputFormat used by the loader. The methods in the !InputFormat (and
underlying !RecordReader) will be called by pig in the same manner (and in the
same context) as by Hadoop in a map-reduce java program. If the !InputFormat is
a hadoop packaged one, the implementation should use the new API based one
under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be
implemented using the new API in org.apache.hadoop.mapreduce.
+ * getInputFormat() :This method will be called by Pig to get the
!InputFormat used by the loader. The methods in the !InputFormat (and
underlying !RecordReader) will be called by pig in the same manner (and in the
same context) as by Hadoop in a map-reduce java program. If the !InputFormat is
a hadoop packaged one, the implementation should use the new API based one
under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be
implemented using the new API in org.apache.hadoop.mapreduce. If a custom
loader using a text-based !InputFormat or a file-based !InputFormat needs
to read files in all subdirectories under a given input directory recursively,
it should use the !PigFileInputFormat and !PigTextInputFormat classes
provided in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. This
works around the current limitation in Hadoop's !TextInputFormat and
!FileInputFormat, which read only one level down from the provided input
directory. For example, if the input in the load statement is 'dir1' and there
are subdirs 'dir2' and 'dir2/dir3' underneath dir1, then with Hadoop's
!TextInputFormat or !FileInputFormat only files directly under 'dir1' can be
read. With !PigFileInputFormat or !PigTextInputFormat (or classes extending
them), files in all the directories can be read.
+ Changes to custom Load Functions
+ Low to medium
+ This is to get around the problem of MAPREDUCE-1577.
* setLocation() :This method is called by Pig to communicate the load
location to the loader. The loader should use this method to communicate the
same information to the underlying !InputFormat. This method is called multiple
times by Pig, so implementations should ensure that repeated calls have no
inconsistent side effects.
* prepareToRead() : Through this method the !RecordReader associated with
the !InputFormat provided by the !LoadFunc is passed to the !LoadFunc. The
!RecordReader can then be used by the implementation in getNext() to return a
tuple representing a record of data back to Pig.
* getNext() :The meaning of getNext() has not changed: it is called by the Pig
runtime to get the next tuple in the data. In this method the implementation
should use the underlying !RecordReader to construct the tuple to return.
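
The difference between reading one level of an input directory and reading it
recursively, as described under getInputFormat() above, can be illustrated with
a small stand-alone sketch. This is not Pig or Hadoop code: it uses only
java.nio.file on a local temporary directory, and the class name ListingDemo
and the helper names listOneLevel and listRecursive are hypothetical, chosen
for this illustration. It mirrors the 'dir1', 'dir2', 'dir2/dir3' layout from
the example in the text.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustration only: local-filesystem analogue of one-level vs. recursive
// input listing. It does not use Hadoop's or Pig's InputFormat classes.
class ListingDemo {

    // Returns regular files directly under dir, without descending
    // into subdirectories.
    static List<Path> listOneLevel(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile)
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    // Returns regular files under dir at any depth.
    static List<Path> listRecursive(Path dir) throws IOException {
        try (Stream<Path> entries = Files.walk(dir)) {
            return entries.filter(Files::isRegularFile)
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build dir1/a.txt, dir1/dir2/b.txt and dir1/dir2/dir3/c.txt.
        Path dir1 = Files.createTempDirectory("dir1");
        Path dir3 = Files.createDirectories(dir1.resolve("dir2").resolve("dir3"));
        Files.write(dir1.resolve("a.txt"), "a".getBytes());
        Files.write(dir1.resolve("dir2").resolve("b.txt"), "b".getBytes());
        Files.write(dir3.resolve("c.txt"), "c".getBytes());

        System.out.println(listOneLevel(dir1).size());  // prints 1
        System.out.println(listRecursive(dir1).size()); // prints 3
    }
}
```

In this analogy, listOneLevel corresponds to the behaviour the text attributes
to Hadoop's !TextInputFormat and !FileInputFormat, and listRecursive to what a
loader gets by returning !PigTextInputFormat or !PigFileInputFormat from
getInputFormat().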
