[Pig Wiki] Update of "Pig070LoadStoreHowTo" by PradeepKamath

Apache Wiki Fri, 05 Mar 2010 11:42:57 -0800

The "Pig070LoadStoreHowTo" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=10&rev2=11

--------------------------------------------------

  
  = How to implement a Loader =
  
[[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup
 | LoadFunc]]  abstract class has the main methods for loading data, and for 
most use cases it might suffice to extend it. There are 3 other optional 
interfaces which can be implemented to achieve extended functionality:
-  * !LoadMetadata has methods to deal with metadata - most implementations of 
loaders don't need to implement this unless they interact with some metadata 
system. The getSchema() method in this interface provides a way for loader 
implementations to communicate the schema of the data back to Pig. If a loader 
implementation returns data comprised of fields of real types (rather than 
!DataByteArray fields), it should provide the schema describing the data 
returned through the getSchema() method. The other methods are concerned with 
other types of metadata like partition keys and statistics. Implementations can 
return null from these methods if they are not applicable for that 
implementation.
+  * 
[[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadMetadata.java?view=markup
 | LoadMetadata]] has methods to deal with metadata - most implementations of 
loaders don't need to implement this unless they interact with some metadata 
system. The getSchema() method in this interface provides a way for loader 
implementations to communicate the schema of the data back to Pig. If a loader 
implementation returns data comprised of fields of real types (rather than 
!DataByteArray fields), it should provide the schema describing the data 
returned through the getSchema() method. The other methods are concerned with 
other types of metadata like partition keys and statistics. Implementations can 
return null from these methods if they are not applicable for that 
implementation.
-  * !LoadPushDown has methods to push operations from the Pig runtime into 
loader implementations - currently only projections are supported, i.e. the 
pushProjection() method is called by Pig to tell the loader exactly which 
fields are required in the Pig script. The loader implementation can choose to 
honor the request or respond that it will not honor the request and return all 
fields in the data. If a loader implementation is able to efficiently return 
only required fields, it should implement !LoadPushDown to improve query 
performance.
+  * 
[[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadPushDown.java?view=markup
 | LoadPushDown]] has methods to push operations from the Pig runtime into 
loader implementations - currently only projections are supported, i.e. the 
pushProjection() method is called by Pig to tell the loader exactly which 
fields are required in the Pig script. The loader implementation can choose to 
honor the request or respond that it will not honor the request and return all 
fields in the data. If a loader implementation is able to efficiently return 
only required fields, it should implement !LoadPushDown to improve query 
performance.
-  * !LoadCaster has methods to convert byte arrays to specific types. A loader 
implementation should implement this if casts (implicit or explicit) from 
!DataByteArray fields to other types need to be supported.
+  * 
[[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadCaster.java?view=markup
 | LoadCaster]] has methods to convert byte arrays to specific types. A loader 
implementation should implement this if casts (implicit or explicit) from 
!DataByteArray fields to other types need to be supported.
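
The projection behaviour described for !LoadPushDown can be sketched roughly as follows. This is a minimal, self-contained illustration and not the real Pig API: the class ProjectingLoaderSketch, its boolean-returning pushProjection(List<Integer>) and its parse() helper are hypothetical stand-ins, and the actual interface uses Pig's own request and response types. It assumes comma-delimited text records purely for the sake of the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a loader that may honor a projection request.
// The real LoadPushDown interface uses Pig's own request/response classes,
// which are deliberately not reproduced here.
class ProjectingLoaderSketch {
    // Column indexes requested by the script; null means "return all fields".
    private List<Integer> requiredColumns = null;
    private final boolean canProject;

    ProjectingLoaderSketch(boolean canProject) {
        this.canProject = canProject;
    }

    // Called with the columns the script needs. Returns true if the loader
    // will honor the request, false if it will keep returning every field.
    boolean pushProjection(List<Integer> columns) {
        if (!canProject) {
            return false;
        }
        requiredColumns = new ArrayList<>(columns);
        return true;
    }

    // Splits one comma-delimited record and keeps only the required fields.
    List<String> parse(String line) {
        String[] fields = line.split(",", -1);
        if (requiredColumns == null) {
            return Arrays.asList(fields);
        }
        List<String> out = new ArrayList<>();
        for (int idx : requiredColumns) {
            // A column missing from a short record becomes null.
            out.add(idx < fields.length ? fields[idx] : null);
        }
        return out;
    }
}

class ProjectionDemo {
    public static void main(String[] args) {
        ProjectingLoaderSketch loader = new ProjectingLoaderSketch(true);
        boolean honored = loader.pushProjection(Arrays.asList(0, 2));
        System.out.println(honored + " " + loader.parse("a,b,c"));
    }
}
```

A loader constructed with canProject set to false declines the request and continues to return every field, which corresponds to the "will not honor the request" case above.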
  
  The !LoadFunc abstract class is the main class to extend to implement a 
loader. The methods which need to be overridden are explained below:
   * getInputFormat(): This method will be called by Pig to get the 
!InputFormat used by the loader. The methods in the !InputFormat (and 
underlying !RecordReader) will be called by Pig in the same manner (and in the 
same context) as by Hadoop in a map-reduce Java program. If the !InputFormat is 
a Hadoop packaged one, the implementation should use the new API based one 
under org.apache.hadoop.mapreduce. If it is a custom !InputFormat, it should be 
implemented using the new API in org.apache.hadoop.mapreduce.
@@ -144, +144 @@

  
  = How to implement a Storer =
  
[[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/StoreFunc.java?view=markup
 | StoreFunc]]  abstract class has the main methods for storing data, and for 
most use cases it might suffice to extend it. There is an optional interface 
which can be implemented to achieve extended functionality:
-  * storeMetadata: This interface has methods to interact with metadata 
systems to store schema and store statistics. This interface is truly optional 
and should only be implemented if metadata needs to be stored.
+  * 
[[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/StoreMetadata.java?view=markup
 | StoreMetadata]]: This interface has methods to interact with metadata 
systems to store schema and store statistics. This interface is truly optional 
and should only be implemented if metadata needs to be stored.
  
  The methods which need to be overridden in !StoreFunc are explained below:
   * getOutputFormat(): This method will be called by Pig to get the 
!OutputFormat used by the storer. The methods in the !OutputFormat (and 
underlying !RecordWriter and !OutputCommitter) will be called by Pig in the 
same manner (and in the same context) as by Hadoop in a map-reduce Java 
program. If the !OutputFormat is a Hadoop packaged one, the implementation 
should use the new API based one in org.apache.hadoop.mapreduce. If it is a 
custom !OutputFormat, it should be implemented using the new API under 
org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the !OutputFormat 
will be called by Pig to check the output location up-front. This method will 
also be called as part of the Hadoop call sequence when the job is launched. So 
implementations should ensure that this method can be called multiple times 
without inconsistent side effects.
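
The requirement that checkOutputSpecs() be safe to call more than once can be illustrated with a read-only check. This is only a sketch under stated assumptions: OutputSpecCheckSketch is a hypothetical stand-in that works on java.nio.file paths, whereas a real !OutputFormat receives a Hadoop job context and would normally go through the Hadoop file system API. The point is simply that the check inspects the location and changes nothing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for an output check. It uses java.nio.file instead of
// Hadoop's file system and job context so that the sketch runs on its own.
class OutputSpecCheckSketch {
    // Read-only validation: it inspects the location but never creates,
    // deletes, or modifies anything, so repeated calls behave identically.
    static void checkOutputSpecs(Path output) throws IOException {
        if (Files.exists(output)) {
            throw new IOException("Output location already exists: " + output);
        }
    }
}

class OutputSpecCheckDemo {
    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("storer-sketch").resolve("out");
        OutputSpecCheckSketch.checkOutputSpecs(out); // up-front check
        OutputSpecCheckSketch.checkOutputSpecs(out); // same check at job launch
        System.out.println("created by check: " + Files.exists(out));
    }
}
```

A check that created the output directory as a side effect would pass the first call and fail the second, which is the kind of inconsistency the text warns about.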
