Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "LoadStoreRedesignProposal" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=19&rev2=20

--------------------------------------------------

  == Implementation details and status ==
  
  === Current status ===
+ A branch, 'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign), has been created to undertake work on this proposal. As of today (Nov 2, 2009) this branch has simple load-store working for !PigStorage and !BinStorage. Joins on multiple inputs and multi-store queries with multi-query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2, 2009 in the Changes section below) have not been incorporated. A (possibly incomplete) list of remaining tasks is given in a subsection below.
  
  === Notes on implementation details ===
  This section documents changes at a high level, to give an overall connected picture that code comments may not provide. 
  
  ==== Changes to work with Hadoop InputFormat model ====
+ Hadoop has the notion of a single !InputFormat per job. This is restrictive, since Pig processes multiple inputs in the same map-reduce job (in the case of Join, Union or Cogroup). This is handled by !PigInputFormat, which is the !InputFormat Pig communicates to Hadoop as the job's !InputFormat. In !PigInputFormat.getSplits(), the implementation processes each input in the following manner:
+ 
+  * Instantiate the !LoadFunc associated with the input.
+  * Make a clone of the Configuration passed in the getSplits() call and then invoke !LoadFunc.setLocation() using the clone. The clone is necessary because, in the setLocation() method, the !LoadFunc generally communicates the location to its underlying !InputFormat, and !InputFormats typically store the location into the Configuration for use in the getSplits() call. For example, !FileInputFormat does this through !FileInputFormat.setInputPaths(Job job, String commaSeparatedPaths). We don't want updates to the Configuration for different inputs to overwrite each other - hence the clone.
+  * Call getInputFormat() on the !LoadFunc and then getSplits() on the !InputFormat returned. Note that the above setLocation() call needs to happen *before* the getSplits() call, and the getSplits() call needs to be given a !JobContext built out of the cloned Configuration updated with the location.
+  * Wrap each returned !InputSplit in a !PigSplit to store information like the list of target operators (the pipeline) for this input and the index of the split in the list of splits returned by getSplits() (this is used during merge join index creation); the comments in !PigSplit explain the members.
+ 
+ The list of target operators helps Pig hand the tuples from an input to the correct part of the pipeline in a multi-input pipeline (as in join, cogroup or union). The whole per-input sequence is sketched below.
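+ 
+ To make the above sequence concrete, here is a minimal sketch in Java against the Hadoop 0.20 API. The numInputs()/getLoadFunc()/getLocation()/getTargetOps() helpers and the !PigSplit stand-in are simplified placeholders for illustration, not the actual signatures in the branch:
+ 
+ {{{
+ import java.io.IOException;
+ import java.util.ArrayList;
+ import java.util.List;
+ 
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.mapreduce.InputFormat;
+ import org.apache.hadoop.mapreduce.InputSplit;
+ import org.apache.hadoop.mapreduce.Job;
+ import org.apache.hadoop.mapreduce.JobContext;
+ import org.apache.pig.LoadFunc;
+ 
+ public abstract class GetSplitsSketch {
+ 
+     // Hypothetical helpers standing in for PigInputFormat's bookkeeping
+     // about the inputs of the job.
+     abstract int numInputs();
+     abstract LoadFunc getLoadFunc(int input);
+     abstract String getLocation(int input);
+     abstract List<String> getTargetOps(int input);
+ 
+     public List<InputSplit> getSplits(JobContext jobContext)
+             throws IOException, InterruptedException {
+         List<InputSplit> splits = new ArrayList<InputSplit>();
+         for (int i = 0; i < numInputs(); i++) {
+             // 1. Instantiate the LoadFunc associated with this input.
+             LoadFunc loadFunc = getLoadFunc(i);
+ 
+             // 2. Clone the Configuration so entries written by setLocation()
+             //    for one input do not overwrite another input's entries.
+             Configuration clone = new Configuration(jobContext.getConfiguration());
+             Job clonedJob = new Job(clone);
+             loadFunc.setLocation(getLocation(i), clonedJob);
+ 
+             // 3. Ask the LoadFunc's InputFormat for splits, using a JobContext
+             //    built from the clone updated with the location.
+             InputFormat inputFormat = loadFunc.getInputFormat();
+             JobContext inputContext = new JobContext(
+                     clonedJob.getConfiguration(), jobContext.getJobID());
+             @SuppressWarnings("unchecked")
+             List<InputSplit> inputSplits = inputFormat.getSplits(inputContext);
+ 
+             // 4. Wrap each split, recording the target operators and the
+             //    split's index (used during merge join index creation).
+             for (int s = 0; s < inputSplits.size(); s++) {
+                 splits.add(new PigSplit(inputSplits.get(s), i, getTargetOps(i), s));
+             }
+         }
+         return splits;
+     }
+ 
+     // Simplified stand-in for Pig's real PigSplit; shown here only to carry
+     // the recorded fields (the real class also handles serialization, and
+     // its target operators are operator keys rather than strings).
+     static class PigSplit extends InputSplit {
+         final InputSplit wrappedSplit;
+         final int inputIndex;
+         final List<String> targetOps;
+         final int splitIndex;
+ 
+         PigSplit(InputSplit wrappedSplit, int inputIndex,
+                  List<String> targetOps, int splitIndex) {
+             this.wrappedSplit = wrappedSplit;
+             this.inputIndex = inputIndex;
+             this.targetOps = targetOps;
+             this.splitIndex = splitIndex;
+         }
+ 
+         public long getLength() throws IOException, InterruptedException {
+             return wrappedSplit.getLength();
+         }
+ 
+         public String[] getLocations() throws IOException, InterruptedException {
+             return wrappedSplit.getLocations();
+         }
+     }
+ }
+ }}}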
+ 
+ The other method in !InputFormat is createRecordReader(), which needs to be given a !TaskAttemptContext. The Configuration present in the !TaskAttemptContext needs to have any information that might have been put into it as a result of the above !LoadFunc.setLocation() call. However, !PigInputFormat.getSplits() is called in the front-end by Hadoop while !PigInputFormat.createRecordReader() is called in the back-end, so we would need to somehow pass a map from each input to its input-specific Configuration (updated with the location and other information from the relevant !LoadFunc.setLocation() call) from the front-end to the back-end. One way to pass this map would be in the Configuration of the !JobContext passed to !PigInputFormat.getSplits(). However, in Hadoop 0.20.1 the Configuration present in the !JobContext passed to !PigInputFormat.getSplits() is a copy of the Configuration which is serialized to the back-end and used to create the !TaskAttemptContext passed to !PigInputFormat.createRecordReader(), so passing the map this way is not possible. Instead, we re-create the side effects of the !LoadFunc.setLocation() call made in !PigInputFormat.getSplits() inside !PigInputFormat.createRecordReader(), by the following sequence:
+ 
+  * Instantiate the !LoadFunc associated with the input represented by the !PigSplit passed into !PigInputFormat.createRecordReader()
+  * Invoke !LoadFunc.setLocation()
+  * Call getInputFormat() on the !LoadFunc and then createRecordReader() on the !InputFormat returned. Note that the above setLocation() call needs to happen *before* the createRecordReader() call, and the createRecordReader() call needs to be given a !TaskAttemptContext built out of the Configuration updated with the location, as shown in the sketch below.
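+ 
+ A minimal sketch of this back-end sequence, under the same assumptions as the earlier one (hypothetical getLoadFunc()/getLocation() helpers and a simplified !PigSplit stand-in rather than the branch's actual signatures):
+ 
+ {{{
+ import java.io.IOException;
+ 
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapreduce.InputSplit;
+ import org.apache.hadoop.mapreduce.Job;
+ import org.apache.hadoop.mapreduce.RecordReader;
+ import org.apache.hadoop.mapreduce.TaskAttemptContext;
+ import org.apache.pig.LoadFunc;
+ import org.apache.pig.data.Tuple;
+ 
+ public abstract class CreateRecordReaderSketch {
+ 
+     // Hypothetical helpers standing in for PigInputFormat's bookkeeping.
+     abstract LoadFunc getLoadFunc(int input);
+     abstract String getLocation(int input);
+ 
+     @SuppressWarnings("unchecked")
+     public RecordReader<Text, Tuple> createRecordReader(InputSplit split,
+             TaskAttemptContext context) throws IOException, InterruptedException {
+         PigSplit pigSplit = (PigSplit) split;
+ 
+         // 1. Instantiate the LoadFunc for the input this split belongs to.
+         LoadFunc loadFunc = getLoadFunc(pigSplit.inputIndex);
+ 
+         // 2. Invoke setLocation() to re-create in the back-end Configuration
+         //    the side effects of the front-end setLocation() call.
+         Job job = new Job(context.getConfiguration());
+         loadFunc.setLocation(getLocation(pigSplit.inputIndex), job);
+ 
+         // 3. Call createRecordReader() on the LoadFunc's InputFormat, with a
+         //    TaskAttemptContext built from the updated Configuration.
+         TaskAttemptContext updated = new TaskAttemptContext(
+                 job.getConfiguration(), context.getTaskAttemptID());
+         return loadFunc.getInputFormat().createRecordReader(
+                 pigSplit.wrappedSplit, updated);
+     }
+ 
+     // Simplified PigSplit stand-in (see the earlier sketch); only the parts
+     // used here are shown.
+     static class PigSplit extends InputSplit {
+         InputSplit wrappedSplit;
+         int inputIndex;
+ 
+         public long getLength() throws IOException, InterruptedException {
+             return wrappedSplit.getLength();
+         }
+ 
+         public String[] getLocations() throws IOException, InterruptedException {
+             return wrappedSplit.getLocations();
+         }
+     }
+ }
+ }}}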
  
  ==== Changes to work with Hadoop OutputFormat model ====
  
