Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The "Pig070LoadStoreHowTo" page has been changed by PradeepKamath. http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=7&rev2=8
--------------------------------------------------

= Overview =
This page describes how to write load functions and store functions using the API available in Pig 0.7.0.

The main motivation for the changes in the Pig 0.7.0 load/store API is to move closer to using Hadoop's !InputFormat and !OutputFormat classes. This way Pig users/developers can create new !LoadFunc and !StoreFunc implementations based on existing Hadoop !InputFormat and !OutputFormat classes with minimal code. The complexity of reading the data and creating a record now lies in the !InputFormat; likewise, on the writing end, the complexity of writing lies in the !OutputFormat. This enables !Pig to easily read/write data in new storage formats as and when a Hadoop !InputFormat and !OutputFormat become available for them.
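To make the intended division of labor concrete, the sketch below shows a minimal loader that delegates all input parsing to Hadoop's new-API TextInputFormat and only converts each line into a single-field tuple. This is an illustration, not code from this page: the class name MyLineLoader is made up, and it assumes the Pig 0.7.0 LoadFunc API described below.

{{{
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Illustrative loader: all split computation and record reading is delegated
// to Hadoop's TextInputFormat; the LoadFunc only turns each line into a Tuple.
public class MyLineLoader extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public InputFormat getInputFormat() throws IOException {
        // A new-API (org.apache.hadoop.mapreduce) InputFormat, per the note below.
        return new TextInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, new Path(location));
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        // Pig hands us the RecordReader created from getInputFormat().
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // end of input
            }
            Text line = (Text) reader.getCurrentValue();
            return tupleFactory.newTuple(line.toString());
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
}}}

The store-side sketch is symmetric: a !StoreFunc would return an !OutputFormat and write tuples through its !RecordWriter, as described in the storer section below.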
'''A general note applicable to both !LoadFunc and !StoreFunc implementations is that the implementation should use the new Hadoop 20 API based classes (!InputFormat/!OutputFormat and related classes) under the org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package.'''

= How to implement a Loader =
A loader implementation should extend the [[http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup|LoadFunc]] abstract class, which has the main methods for loading data; for most use cases extending it should suffice. There are 3 other optional interfaces which can be implemented to achieve extended functionality:

@@ -147, +147 @@

 * storeMetadata: This interface has methods to interact with metadata systems to store schema and statistics. This interface is truly optional and should only be implemented if metadata needs to be stored.

The methods which need to be overridden in !StoreFunc are explained below:

 * getOutputFormat(): This method will be called by Pig to get the !OutputFormat used by the storer. The methods in the !OutputFormat (and the underlying !RecordWriter and !OutputCommitter) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. If the !OutputFormat is a Hadoop-packaged one, the implementation should use the new API based one in org.apache.hadoop.mapreduce. If it is a custom !OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the !OutputFormat will be called by Pig to check the output location up front. This method will also be called as part of the Hadoop call sequence when the job is launched, so implementations should ensure that it can be called multiple times without inconsistent side effects.
 * setStoreLocation(): This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying !OutputFormat. This method is called multiple times by Pig; implementations should bear this in mind and ensure there are no inconsistent side effects due to the multiple calls.
 * prepareToWrite(): In the new API, writing of the data is through the !OutputFormat provided by the !StoreFunc. In prepareToWrite() the !RecordWriter associated with the !OutputFormat provided by the !StoreFunc is passed to the !StoreFunc. The !RecordWriter can then be used by the implementation in putNext() to write a tuple representing a record of data in a manner expected by the !RecordWriter.
 * putNext(): The meaning of putNext() has not changed: it is called by the Pig runtime to write the next tuple of data. In the new API, this is the method wherein the implementation will use the underlying !RecordWriter to write the Tuple out.

The following methods have default implementations in !StoreFunc and should be overridden only if necessary:

 * setStoreFuncUDFContextSignature(): This method will be called by Pig both in the front end and back end to pass a unique signature to the storer. The signature can be used to store into the UDFContext any information which the storer needs to keep between various method invocations in the front end and back end. The default implementation in !StoreFunc has an empty body. This method will be called before other methods.
 * relToAbsPathForStoreLocation(): The Pig runtime will call this method to allow the storer to convert a relative store location to an absolute location. An implementation is provided in !StoreFunc which handles this for FileSystem based locations.
 * checkSchema(): A store function should implement this function to check that a given schema describing the data to be written is acceptable to it. The default implementation in !StoreFunc has an empty body. This method will be called before any calls to setStoreLocation().
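The record serialization that putNext() performs can be sketched without any Pig or Hadoop types. The plain-Java class below (FieldJoiner is a made-up name) joins a record's fields with a configurable field delimiter, '\t' by default as in this page's example, rendering null fields as empty strings:

{{{
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the serialization a putNext() implementation
// performs before handing the line to the RecordWriter: join a record's
// fields with a field delimiter, rendering nulls as empty strings.
public class FieldJoiner {
    private final char fieldDel;

    public FieldJoiner(char fieldDel) {
        this.fieldDel = fieldDel;
    }

    public String serialize(List<Object> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(fieldDel);
            }
            Object f = fields.get(i);
            sb.append(f == null ? "" : f.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FieldJoiner j = new FieldJoiner('\t');
        // "alice\t7\t" -- the trailing null field is rendered as empty
        System.out.println(j.serialize(Arrays.asList("alice", 7, null)));
    }
}
}}}

In a real storer this string would be wrapped in a Text and passed to the !RecordWriter obtained in prepareToWrite(); only the delimiter handling is shown here.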
== Example Implementation ==
The storer implementation in the example is a storer for text data with '\n' as the line delimiter and '\t' as the default field delimiter (which can be overridden by passing a different field delimiter in the constructor); this is similar to the current !PigStorage storer in Pig. The new implementation uses an existing Hadoop-supplied !OutputFormat, TextOutputFormat, as the underlying !OutputFormat.

{{{
public class SimpleTextStorer extends StoreFunc {