Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "LoadStoreRedesignProposal" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=22&rev2=23

--------------------------------------------------

  ==== Changes to work with Hadoop OutputFormat model ====
  Hadoop has the notion of a single !OutputFormat per job. !PigOutputFormat is 
the class indicated by Pig as the !OutputFormat for map reduce jobs compiled 
from pig scripts. 
  
- In !PigOutputFormat.checkOutputSpecs(), the implementation iterates over 
POStore(s) in the map and reduce phases and for each such store does the 
following:
+ In !PigOutputFormat.checkOutputSpecs(), the implementation iterates over 
!POStore(s) in the map and reduce phases and for each such store does the 
following:
-  * Instantiate the !StoreFunc associated with the POStore
+  * Instantiate the !StoreFunc associated with the !POStore
-  * Make a clone of the JobContext passed in 
!PigOutputFormat.checkOutputSpecs() call and then invoke 
!StoreFunc.setStoreLocation() using the clone. The reason a clone is necessary 
here is because generally in the setStorelocation() method, the StoreFunc would 
communicate the location to its underlying !OutputFormat. Typically 
!OutputFormats store the location into the Configuration for use in the 
checkOutputSpecs() call. For example, !FileOutputFormat does this through 
!FileOutputFormat.setOutputPath(Job job, Path location). We don't want updates 
to the Configuration for different outputs to over-write each other - hence the 
clone.
+  * Make a clone of the !JobContext passed in the !PigOutputFormat.checkOutputSpecs() call and then invoke !StoreFunc.setStoreLocation() using the clone. A clone is necessary here because in the setStoreLocation() method the !StoreFunc generally communicates the location to its underlying !OutputFormat. Typically !OutputFormats store the location into the Configuration for use in the checkOutputSpecs() call - for example, !FileOutputFormat does this through !FileOutputFormat.setOutputPath(Job job, Path location). We do not want updates to the Configuration for different outputs to overwrite each other - hence the clone.
   * Call getOutputFormat() on the !StoreFunc and then checkOutputSpecs() on 
the !OutputFormat returned. Note that the above setStoreLocation call needs to 
happen *before* the checkOutputSpecs() call and the checkOutputSpecs() call 
needs to be given the "updated (with location)" cloned JobContext.
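The per-store iteration above can be sketched with a self-contained toy model. Everything below is an illustrative stand-in (a plain !HashMap for Hadoop's Configuration, a static method for the !StoreFunc call), not the actual Pig/Hadoop API - the point is only the clone-before-setStoreLocation pattern:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CloneDemo {
    // Stand-in for StoreFunc.setStoreLocation(): like
    // FileOutputFormat.setOutputPath(), it records the output location
    // in the (cloned) configuration.
    static void setStoreLocation(String location, Map<String, String> conf) {
        conf.put("output.dir", location);
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        List<Map<String, String>> perStoreConfs = new ArrayList<>();
        for (String location : Arrays.asList("/out/a", "/out/b")) {
            // Clone first, so one store's location cannot clobber another's.
            Map<String, String> clone = new HashMap<>(jobConf);
            setStoreLocation(location, clone);
            // ...the real code would now call checkOutputSpecs() on the
            // StoreFunc's OutputFormat with the updated, cloned context.
            perStoreConfs.add(clone);
        }
        // Each store sees only its own location.
        System.out.println(perStoreConfs.get(0).get("output.dir"));
        System.out.println(perStoreConfs.get(1).get("output.dir"));
    }
}
```

In the real implementation the clone is of the !JobContext (whose Configuration carries the location), but the overwrite hazard avoided is the same.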
+ 
+ !PigOutputFormat.getOutputCommitter() returns a !PigOutputCommitter object. The !PigOutputCommitter internally keeps a list of !OutputCommitters corresponding to the !OutputFormats of the !StoreFunc(s) in the !POStore(s) in the map and reduce phases. It delegates all calls in the !OutputCommitter class invoked by Hadoop to calls on the appropriate underlying committers.
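The delegation described above is a simple fan-out. A minimal self-contained sketch of the pattern, with illustrative names rather than the real Pig/Hadoop classes and only one lifecycle call modelled:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommitterDemo {
    // Stand-in for Hadoop's OutputCommitter, reduced to one call.
    interface Committer {
        void commitTask(String taskAttemptId);
    }

    // Stand-in for PigOutputCommitter: holds one committer per store and
    // forwards every call from the framework to all of them.
    static class DelegatingCommitter implements Committer {
        private final List<Committer> delegates;

        DelegatingCommitter(List<Committer> delegates) {
            this.delegates = delegates;
        }

        @Override
        public void commitTask(String taskAttemptId) {
            for (Committer c : delegates) {
                c.commitTask(taskAttemptId);
            }
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Committer storeA = id -> log.add("storeA:" + id);
        Committer storeB = id -> log.add("storeB:" + id);
        Committer composite = new DelegatingCommitter(Arrays.asList(storeA, storeB));
        composite.commitTask("attempt_1"); // one framework call, both stores served
        System.out.println(log);
    }
}
```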
+ 
+ The other method in !OutputFormat is the getRecordWriter() method. In the 
single store case !PigOutputFormat.getRecordWriter() does the following:
+  * Instantiate the !StoreFunc associated with the single !POStore.
+  * Invoke !StoreFunc.setStoreLocation()
+  * Call getOutputFormat() on the !StoreFunc and then getRecordWriter() on the 
!OutputFormat returned. Note that the above setStoreLocation call needs to 
happen *before* the getRecordWriter() call and the getRecordWriter() call needs 
to be given a !TaskAttemptContext which has the "updated (with location)" 
Configuration.
+  * Wrap the !RecordWriter returned above in the !PigRecordWriter class, which is returned to Hadoop as the !RecordWriter. !PigRecordWriter has !WritableComparable as its key type (the key is always sent with a null value when we write, since in Pig we really do not have a key to store in the output) and Tuple as its value type (which is the output tuple).
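The wrapping step above can be sketched as follows. All names are illustrative stand-ins (tuples modelled as lists of strings), not the real Pig API; the point is that the wrapper satisfies a (key, value) write interface while the key is always null and only the tuple matters:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WriterDemo {
    // Stand-in for Hadoop's RecordWriter interface.
    interface RecordWriter<K, V> {
        void write(K key, V value);
    }

    // Stand-in for the writer returned by the StoreFunc's OutputFormat:
    // it only ever receives tuples.
    static class TupleWriter implements RecordWriter<Object, List<String>> {
        final List<List<String>> written = new ArrayList<>();

        @Override
        public void write(Object key, List<String> tuple) {
            written.add(tuple);
        }
    }

    // Stand-in for PigRecordWriter: the key slot exists only to satisfy
    // the (key, value) interface; the tuple is the value.
    static class PigStyleRecordWriter implements RecordWriter<Object, List<String>> {
        private final TupleWriter wrapped;

        PigStyleRecordWriter(TupleWriter wrapped) {
            this.wrapped = wrapped;
        }

        @Override
        public void write(Object key, List<String> tuple) {
            wrapped.write(null, tuple); // key is always null on write
        }
    }

    public static void main(String[] args) {
        TupleWriter underlying = new TupleWriter();
        PigStyleRecordWriter writer = new PigStyleRecordWriter(underlying);
        writer.write(null, Arrays.asList("alice", "42"));
        System.out.println(underlying.written);
    }
}
```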
+ 
+ For the multi-query optimized multi-store case, there are multiple !POStores in the same map reduce job. In this case, the data is written out in the Pig map or reduce pipeline itself through the !POStore operator. Details of this can be found in http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - "Internal Changes" section - "Store Operator" subsection. So from the Pig runtime code, we never call Context.write() (which would have internally called !PigRecordWriter.write()). The handling of writing data out for multiple stores is therefore unchanged by this redesign.
  
  
  === Remaining Tasks ===
