Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "LoadStoreMigrationGuide" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreMigrationGuide?action=diff&rev1=20&rev2=21

--------------------------------------------------

  This page describes how to migrate from the old !LoadFunc and !StoreFunc interfaces (Pig 0.1.0 through Pig 0.6.0) to the new interfaces proposed in http://wiki.apache.org/pig/LoadStoreRedesignProposal and planned to be released in Pig 0.7.0. Besides the example on this page, users can also look at the !LoadFunc and !StoreFunc implementations in the piggybank codebase (contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage) for examples of migration. For example, !MultiStorage implements a custom !OutputFormat.
  
- A general note applicable to both !LoadFunc and !StoreFunc implementations is 
that the implementation should use the new Hadoop 20 API based on 
org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred 
package.
+ '''A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation should use the new Hadoop 20 API based on the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''
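+ As a rough illustration (not part of the original proposal text), a migrated implementation would typically import the new-API classes rather than their old-API counterparts:
+ {{{
+ // New Hadoop 20 API - use classes from org.apache.hadoop.mapreduce
+ import org.apache.hadoop.mapreduce.InputFormat;
+ import org.apache.hadoop.mapreduce.OutputFormat;
+ import org.apache.hadoop.mapreduce.Job;
+ 
+ // Old API - classes from org.apache.hadoop.mapred should no longer be used
+ // import org.apache.hadoop.mapred.InputFormat;
+ }}}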
  
- The main motivation for these changes is to move closer to using !Hadoop's 
!InputFormat and !OutputFormat classes. This way pig users/developers can 
create new !LoadFunc and !StoreFunc implementation based on existing !Hadoop 
!InputFormat and !OutputFormat classes with minimal code. The complexity of 
reading the data and creating a record will now lie in the !InputFormat and 
likewise on the writing end, the complexity of writing will lie in the 
!OutputFormat. This enables !Pig to easily read/write data in new storage 
formats as and when an !Hadoop !InputFormat and !OutputFormat is available for 
them.
+ The main motivation for these changes is to move closer to using Hadoop's !InputFormat and !OutputFormat classes. This way Pig users/developers can create new !LoadFunc and !StoreFunc implementations based on existing Hadoop !InputFormat and !OutputFormat classes with minimal code. The complexity of reading the data and creating a record will now lie in the !InputFormat; likewise, on the writing end, the complexity of writing will lie in the !OutputFormat. This enables !Pig to easily read/write data in new storage formats as and when a Hadoop !InputFormat and !OutputFormat are available for them.
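+ To make this concrete, here is a minimal sketch of what a loader built on an existing Hadoop !InputFormat might look like under the new API. It delegates all reading to Hadoop's TextInputFormat (from the org.apache.hadoop.mapreduce package) and turns each line into a single-field tuple; the class name and the one-column layout are illustrative choices, not part of the proposal:
+ {{{
+ import java.io.IOException;
+ 
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapreduce.InputFormat;
+ import org.apache.hadoop.mapreduce.Job;
+ import org.apache.hadoop.mapreduce.RecordReader;
+ import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+ import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
+ import org.apache.pig.LoadFunc;
+ import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
+ import org.apache.pig.data.Tuple;
+ import org.apache.pig.data.TupleFactory;
+ 
+ public class SimpleTextLoader extends LoadFunc {
+     private RecordReader reader;
+     private final TupleFactory tupleFactory = TupleFactory.getInstance();
+ 
+     @Override
+     public void setLocation(String location, Job job) throws IOException {
+         // Tell the underlying InputFormat where to read from
+         FileInputFormat.setInputPaths(job, location);
+     }
+ 
+     @Override
+     public InputFormat getInputFormat() throws IOException {
+         // All the complexity of reading splits/records lives in the InputFormat
+         return new TextInputFormat();
+     }
+ 
+     @Override
+     public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
+         this.reader = reader;
+     }
+ 
+     @Override
+     public Tuple getNext() throws IOException {
+         try {
+             if (!reader.nextKeyValue()) {
+                 return null;               // no more records
+             }
+             Text line = (Text) reader.getCurrentValue();
+             Tuple t = tupleFactory.newTuple(1);
+             t.set(0, line.toString());     // one chararray field per line
+             return t;
+         } catch (InterruptedException e) {
+             throw new IOException(e);
+         }
+     }
+ }
+ }}}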
  
  
  = LoadFunc Migration =
@@ -17, +17 @@

   
  == Table mapping old API calls to new API calls in rough order of call 
sequence ==
  || '''Old Method in !LoadFunc''' || '''Equivalent New Method''' || '''New 
Class/Interface in which method is present''' || '''Explanation''' ||
- || No equivalent method || setUDFContextSignature() || !LoadFunc || This 
method will be called by Pig both in the front end and back end to pass a 
unique signature to the Loader. The signature can be used to store into the 
UDFContext} any information which the Loader needs to store between various 
method invocations in the front end and back end. A use case is to store 
!RequiredFieldList passed to it in 
!LoadPushDown.pushProjection(!RequiredFieldList) for use in the back end before 
returning tuples in getNext()||
+ || No equivalent method || setUDFContextSignature() || !LoadFunc || This method will be called by Pig both in the front end and back end to pass a unique signature to the Loader. The signature can be used to store into the !UDFContext any information which the Loader needs to keep between various method invocations in the front end and back end. A use case is to store the !RequiredFieldList passed to it in !LoadPushDown.pushProjection(!RequiredFieldList) for use in the back end before returning tuples in getNext() (a sketch of this pattern is shown after this table)||
  || No equivalent method || relativeToAbsolutePath() || !LoadFunc || Pig runtime will call this method to allow the Loader to convert a relative load location to an absolute location. The default implementation provided in !LoadFunc handles this for hdfs files and directories. If the load source is something else, the loader implementation may choose to override this.||
  || determineSchema() || getSchema() || !LoadMetadata || determineSchema() was 
used by old code to ask the loader to provide a schema for the data returned by 
it - the same semantics are now achieved through getSchema() of the 
!LoadMetadata interface. !LoadMetadata is an optional interface for loaders to 
implement - if a loader does not implement it, this will indicate to the pig 
runtime that the loader cannot return a schema for the data ||
  || fieldsToRead() || pushProjection() || !LoadPushDown || fieldsToRead() was used by old code to convey to the loader the exact fields required by the pig script - the same semantics are now achieved through pushProjection() of the !LoadPushDown interface. !LoadPushDown is an optional interface for loaders to implement - if a loader does not implement it, this will indicate to the pig runtime that the loader is not capable of returning just the required fields and will return all fields in the data. If a loader implementation is able to efficiently return only required fields, it should implement !LoadPushDown to improve query performance||
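 As an illustration of the setUDFContextSignature()/!UDFContext pattern described above, the two methods involved might look roughly like this. This is only a fragment: the enclosing class is assumed to extend !LoadFunc and implement !LoadPushDown, and the property key used here is purely illustrative:
 {{{
 // Fragment: assumed to sit inside a class that extends LoadFunc and
 // implements LoadPushDown; the property key below is an illustrative choice.
 import java.io.IOException;
 import java.util.Properties;
 
 import org.apache.pig.LoadPushDown.RequiredFieldList;
 import org.apache.pig.LoadPushDown.RequiredFieldResponse;
 import org.apache.pig.impl.logicalLayer.FrontendException;
 import org.apache.pig.impl.util.ObjectSerializer;
 import org.apache.pig.impl.util.UDFContext;
 
 private String signature;
 
 @Override
 public void setUDFContextSignature(String signature) {
     // Remember the signature so that front-end and back-end calls can share state
     this.signature = signature;
 }
 
 @Override
 public RequiredFieldResponse pushProjection(RequiredFieldList requiredFieldList)
         throws FrontendException {
     // Store the projection in the UDFContext keyed by this loader's signature;
     // getNext() can read it back on the back end before emitting tuples.
     Properties props = UDFContext.getUDFContext()
             .getUDFProperties(this.getClass(), new String[] { signature });
     try {
         props.setProperty("requiredFieldList",
                 ObjectSerializer.serialize(requiredFieldList));
     } catch (IOException e) {
         throw new FrontendException("could not serialize RequiredFieldList: "
                 + e.getMessage());
     }
     return new RequiredFieldResponse(true);
 }
 }}}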
@@ -35, +35 @@

  
  == Table mapping old API calls to new API calls in rough order of call 
sequence ==
  || '''Old Method in !StoreFunc''' || '''Equivalent New Method''' || '''New 
Class/Interface in which method is present''' || '''Explanation''' ||
- || No equivalent method || setStoreFuncUDFContextSignature() || !StoreFunc || 
This method will be called by Pig both in the front end and back end to pass a 
unique signature to the Storer. The signature can be used to store into the 
UDFContext} any information which the Storer needs to store between various 
method invocations in the front end and back end.||
+ || No equivalent method || setStoreFuncUDFContextSignature() || !StoreFunc || This method will be called by Pig both in the front end and back end to pass a unique signature to the Storer. The signature can be used to store into the !UDFContext any information which the Storer needs to keep between various method invocations in the front end and back end.||
- || No equivalent method || relToAbsPathForStoreLocation() || !StoreFunc || 
Pig runtime will call this method to allow the Storer to convert a relative 
load location to an absolute location. An implementation is provided in 
!LoadFunc (as a static method) which handles this for hdfs files and 
directories.||
+ || No equivalent method || relToAbsPathForStoreLocation() || !StoreFunc || 
Pig runtime will call this method to allow the Storer to convert a relative 
store location to an absolute location. An implementation is provided in 
!LoadFunc (as a static method) which handles this for hdfs files and 
directories.||
- || No equivalent method || checkSchema() || !StoreFunc || A Store function 
should implement this function to check that a given schema is acceptable to it 
||
+ || No equivalent method || checkSchema() || !StoreFunc || A Store function 
should implement this function to check that a given schema describing the data 
to be written is acceptable to it ||
  || No equivalent method || setStoreLocation() || !StoreFunc || This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying !OutputFormat. This method is called multiple times by Pig - implementations should bear this in mind and ensure there are no inconsistent side effects due to the multiple calls.||
- || getStorePreparationClass() || getOutputFormat() || !StoreFunc ||In the old 
API, getStorePreparationClass() was the means by which the implementation could 
communicate to Pig the !OutputFormat to use for writing - this is now achieved 
through getOutputFormat(). getOutputFormat() is NOT an optional method and 
implementation SHOULD provide an !OutputFormat to use. The methods in the 
!OutputFormat (and underlying !RecordWriter and !OutputCommitter) will be 
called by pig in the same manner (and in the same context) as by Hadoop in a 
map-reduce java program.||
+ || getStorePreparationClass() || getOutputFormat() || !StoreFunc || In the old API, getStorePreparationClass() was the means by which the implementation could communicate to Pig the !OutputFormat to use for writing - this is now achieved through getOutputFormat(). getOutputFormat() is NOT an optional method and implementations SHOULD provide an !OutputFormat to use. The methods in the !OutputFormat (and underlying !RecordWriter and !OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. The checkOutputSpecs() method of the !OutputFormat will be called by Pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched. So implementations should ensure that this method can be called multiple times without inconsistent side effects. (A minimal !StoreFunc sketch built on an existing !OutputFormat is shown after this table.)||
  || bindTo() || prepareToWrite() || !StoreFunc || bindTo() was the old method 
which would provide an !OutputStream among other things to the !StoreFunc. The 
!StoreFunc implementation would then write to the !OutputStream in putNext(). 
In the new API, writing of the data is through the !OutputFormat provided by 
the !StoreFunc. So the equivalent call is prepareToWrite() wherein the 
!RecordWriter associated with the !OutputFormat provided by the !StoreFunc is 
passed to the !StoreFunc. The !RecordWriter can then be used by the 
implementation in putNext() to write a tuple representing a record of data in a 
manner expected by the !RecordWriter. ||
  || putNext() || putNext() || !StoreFunc || The meaning of putNext() has not changed and is called by Pig runtime to write the next tuple of data - in the new API, this is the method wherein the implementation will use the underlying !RecordWriter to write the Tuple out ||
  || finish() || no equivalent method in !StoreFunc - implementations can use close() in !RecordWriter or commitTask() in !OutputCommitter || !RecordWriter or !OutputCommitter || finish() has been removed from !StoreFunc since the same semantics can be achieved by !RecordWriter.close() or !OutputCommitter.commitTask() - in the latter case !OutputCommitter.needsTaskCommit() should return true.||
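 To tie the above together, here is a minimal sketch of what a storer built on an existing Hadoop !OutputFormat might look like under the new API. It delegates all writing to Hadoop's TextOutputFormat and writes each tuple as one tab-delimited line; the class name and the output formatting are illustrative choices, not part of the proposal:
 {{{
 import java.io.IOException;
 
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.OutputFormat;
 import org.apache.hadoop.mapreduce.RecordWriter;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 import org.apache.pig.StoreFunc;
 import org.apache.pig.data.Tuple;
 
 public class SimpleTextStorer extends StoreFunc {
     private RecordWriter writer;
 
     @Override
     public OutputFormat getOutputFormat() throws IOException {
         // All the complexity of writing records lives in the OutputFormat
         return new TextOutputFormat<Text, Text>();
     }
 
     @Override
     public void setStoreLocation(String location, Job job) throws IOException {
         // Pass the store location on to the underlying OutputFormat
         FileOutputFormat.setOutputPath(job, new Path(location));
     }
 
     @Override
     public void prepareToWrite(RecordWriter writer) throws IOException {
         this.writer = writer;
     }
 
     @Override
     public void putNext(Tuple tuple) throws IOException {
         try {
             // Write the tuple as a single tab-delimited line (illustrative only)
             writer.write(null, new Text(tuple.toDelimitedString("\t")));
         } catch (InterruptedException e) {
             throw new IOException(e);
         }
     }
 }
 }}}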
