[Pig Wiki] Update of "LoadStoreMigrationGuide" by PradeepKamath

Apache Wiki Wed, 10 Feb 2010 13:54:48 -0800

The "LoadStoreMigrationGuide" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreMigrationGuide?action=diff&rev1=9&rev2=10

--------------------------------------------------

  
  The main change is that the new !LoadFunc API is based on an !InputFormat to 
read the data. Implementations can choose to use existing !InputFormats like 
!TextInputFormat or implement a new one.
   
- == Table mapping old API calls to new API calls ==
+ == Table mapping old API calls to new API calls in rough order of call 
sequence ==
  || '''Old Method in !LoadFunc''' || '''Equivalent New Method''' || '''New 
Class/Interface in which method is present''' || '''Explanation''' ||
+ || No equivalent method || setUDFContextSignature() || !LoadFunc || This 
method will be called by Pig both in the front end and back end to pass a 
unique signature to the Loader. The signature can be used to store into the 
UDFContext any information which the Loader needs to keep between various 
method invocations in the front end and back end. A use case is to store the 
!RequiredFieldList passed to it in 
!LoadPushDown.pushProjection(!RequiredFieldList) for use in the back end before 
returning tuples in getNext() ||
+ || No equivalent method || relativeToAbsolutePath() || !LoadFunc || Pig 
runtime will call this method to allow the Loader to convert a relative load 
location to an absolute location. The default implementation provided in 
!LoadFunc handles this for hdfs files and directories. If the load source is 
something else, loader implementation may choose to override this.||
+ || determineSchema() || getSchema() || !LoadMetadata || determineSchema() was 
used by old code to ask the loader to provide a schema for the data returned by 
it - the same semantics are now achieved through getSchema() of the 
!LoadMetadata interface. !LoadMetadata is an optional interface for loaders to 
implement - if a loader does not implement it, this will indicate to the pig 
runtime that the loader cannot return a schema for the data ||
+ || fieldsToRead() || pushProjection() || !LoadPushDown || fieldsToRead() was 
used by old code to convey to the loader the exact fields required by the pig 
script - the same semantics are now achieved through pushProjection() of the 
!LoadPushDown interface. !LoadPushDown is an optional interface for loaders to 
implement - if a loader does not implement it, this will indicate to the pig 
runtime that the loader is not capable of returning just the required fields 
and will return all fields in the data. If a loader implementation is able to 
efficiently return only required fields, it should implement !LoadPushDown to 
improve query performance ||
+ || No equivalent method || getInputFormat() ||!LoadFunc ||  This method will 
be called by Pig to get the !InputFormat used by the loader. The methods in the 
!InputFormat (and underlying !RecordReader) will be called by pig in the same 
manner (and in the same context) as by Hadoop in a map-reduce java program.||
+ || No equivalent method || setLocation() || !LoadFunc || This method is 
called by Pig to communicate the load location to the loader. The loader should 
use this method to communicate the same information to the underlying 
!InputFormat. Pig calls this method multiple times, so implementations should 
ensure that repeated calls have no inconsistent side effects. ||
  || bindTo() || prepareToRead() || !LoadFunc || bindTo() was the old method 
which would provide an !InputStream among other things to the !LoadFunc. The 
!LoadFunc implementation would then read from the !InputStream in getNext(). In 
the new API, reading of the data is through the !InputFormat provided by the 
!LoadFunc. So the equivalent call is prepareToRead() wherein the !RecordReader 
associated with the !InputFormat provided by the !LoadFunc is passed to the 
!LoadFunc. The !RecordReader can then be used by the implementation in 
getNext() to return a tuple representing a record of data back to pig. ||
  || getNext() || getNext() || !LoadFunc || The meaning of getNext() has not 
changed: it is called by the Pig runtime to get the next tuple in the data ||
  || bytesToInteger(),...bytesToBag() ||  bytesToInteger(),...bytesToBag() || 
!LoadCaster || The meaning of these methods has not changed: they are called by 
the Pig runtime to cast !DataByteArray fields to the right type when needed. In 
the new API, a !LoadFunc implementation should give a !LoadCaster object back 
to pig as the return value of the getLoadCaster() method so that it can be used 
for casting. If null is returned, casting from !DataByteArray to any other 
type (implicitly or explicitly) in the pig script will not be possible ||
- || fieldsToRead() || pushProject() || !LoadPushDown || fieldsToRead() was 
used by old code to convey to the loader the exact fields required by the pig 
script -the same semantics are now achieved through pushProject() of the 
!LoadPushDown interface. !LoadPushDown is an optional interface for loaders to 
implement - if a loader does not implement it, this will indicate to the pig 
runtime that the loader is not capable of returning just the required fields 
and will return all fields in the data. If a loader implementation is able to 
efficiently return only required fields, it should implement !LoadPushDown to 
improve query performance||
- || determineSchema() || getSchema() || !LoadMetadata || determineSchema() was 
used by old code to ask the loader to provide a schema for the data returned by 
it - the same semantics are now achieved through getSchema() of the 
!LoadMetadata interface. !LoadMetadata is an optional interface for loaders to 
implement - if a loader does not implement it, this will indicate to the pig 
runtime that the loader cannot return a schema for the data ||
- || No equivalent method || relativeToAbsolutePath() || Pig runtime will call 
this method to allow the Loader to convert a relative load location to an 
absolute location. The default implementation provided in !LoadFunc handles 
this for hdfs files and directories. If the load source is something else, 
loader implementation may choose to override this.||
- || No equivalent method || getInputFormat() || This method will be called by 
Pig to get the !InputFormat used by the loader. The methods in the !InputFormat 
(and underlying !RecordReader) will be called by pig in the same manner (and in 
the same context) as by Hadoop in a map-reduce java program.||
- || No equivalent method || setLocation() || This method is called by Pig to 
communicate the load location to the loader. The loader should use this method 
to communicate the same information to the underlying !InputFormat. This method 
is called multiple times by pig - implementations should bear in mind that this 
method is called multiple times and should ensure there are no inconsistent 
side effects due to the multiple calls.||
- || No equivalent method || setUDFContextSignature() || This method will be 
called by Pig both in the front end and back end to pass a unique signature to 
the Loader. The signature can be used to store into the UDFContext} any 
information which the Loader needs to store between various method invocations 
in the front end and back end. A use case is to store RequiredFieldList passed 
to it in LoadPushDown.pushProjection(RequiredFieldList) for use in the back end 
before returning tuples in getNext()||
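
The rough call order in the table can be sketched with a stand-in class that only records which methods were invoked. This is a minimal illustration, not Pig's real API: `TracingLoader` and `CallSequenceDemo` are hypothetical names, the signatures are simplified to plain strings (the real methods take Hadoop and Pig types), and the optional getSchema(), pushProjection() and getInputFormat() steps are left out for brevity. It also shows a setLocation() that is safe to call more than once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a loader; it does NOT extend Pig's real LoadFunc.
// Signatures are simplified to Strings so the sketch needs no Pig or Hadoop jars.
class TracingLoader {
    final List<String> calls = new ArrayList<>();
    private String signature;
    private String location;

    void setUDFContextSignature(String signature) {
        this.signature = signature;
        calls.add("setUDFContextSignature");
    }

    String relativeToAbsolutePath(String location, String currentDir) {
        calls.add("relativeToAbsolutePath");
        // Simplified assumption: anything not starting with '/' is relative.
        return location.startsWith("/") ? location : currentDir + "/" + location;
    }

    void setLocation(String location) {
        // Idempotent: a repeated call overwrites the same field with the same value.
        this.location = location;
        calls.add("setLocation");
    }

    void prepareToRead() {
        calls.add("prepareToRead");
    }

    String getSignature() { return signature; }

    String getLocation() { return location; }
}

class CallSequenceDemo {
    // Drives the stand-in loader in the rough order listed in the table.
    static TracingLoader run(String relativePath) {
        TracingLoader loader = new TracingLoader();
        loader.setUDFContextSignature("loader-1");
        String abs = loader.relativeToAbsolutePath(relativePath, "/user/alice");
        loader.setLocation(abs);
        loader.setLocation(abs); // the runtime may call this more than once
        loader.prepareToRead();
        return loader;
    }

    public static void main(String[] args) {
        TracingLoader loader = run("data/input.txt");
        System.out.println(loader.calls);
        System.out.println(loader.getLocation());
    }
}
```

Because setLocation() here only assigns a field, calling it twice leaves the loader in the same state as calling it once, which is the property the table asks implementations to preserve.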
  
-  An example of how a simple !LoadFunc implementation based on old interface 
can be converted to the new interfaces will be shown below. The loader 
implementation in the example is a loader for text data with line delimiter as 
'\n' and '\t' as default field delimiter (which can be overridden by passing a 
different field delimiter in the constructor) - this is similar to current 
!PigStorage loader in Pig.
+  An example of how a simple !LoadFunc implementation based on the old interface 
can be converted to the new interfaces is shown below. The loader 
implementation in the example is a loader for text data with '\n' as the line 
delimiter and '\t' as the default field delimiter (which can be overridden by 
passing a different field delimiter in the constructor) - this is similar to the 
current !PigStorage loader in Pig. The new implementation uses an existing 
Hadoop-supported !InputFormat, !TextInputFormat, as the underlying !InputFormat.
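
Independent of either API version, the per-line work of such a loader is splitting one line of text into fields on a configurable delimiter. The following is a minimal sketch of that step only, assuming a hypothetical helper class `DelimitedLineParser` that is not part of Pig and returns plain strings rather than a Pig tuple.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: splits one line (already stripped of
// its trailing '\n') into fields on a single-character delimiter.
class DelimitedLineParser {
    private final char fieldDelimiter;

    // Default field delimiter is a tab, as in the loader described above.
    DelimitedLineParser() {
        this('\t');
    }

    // The delimiter can be overridden through the constructor.
    DelimitedLineParser(char fieldDelimiter) {
        this.fieldDelimiter = fieldDelimiter;
    }

    List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == fieldDelimiter) {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        // The text after the last delimiter is the final field (possibly empty).
        fields.add(line.substring(start));
        return fields;
    }
}
```

For instance, `new DelimitedLineParser().parse("a\tb\tc")` yields three fields, and `new DelimitedLineParser(',').parse("1,,3")` keeps the empty middle field instead of dropping it.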
  
  == Old Implementation ==
  {{{
