[Pig Wiki] Update of "PigMultiQueryPerformanceSpecification" by RichardDing

Apache Wiki Tue, 05 May 2009 15:47:39 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by RichardDing:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

------------------------------------------------------------------------------
     * Creating a split operator in the map or reduce and setting the splittee 
plans as nested plans of the split
     * If it needs to merge combiners it will introduce a Demux operator to 
route the input from mixed split branches in the mapper to the right combine 
plan. The separate combiner plans are the nested plans of the Demux operator   
     * If it needs to merge reduce plans, it will do so using the Demux 
operator the same way the combiner is merged.
-    * In the cases where some splittees have combiners and some do not have 
combiners, the optimizer chooses either the subset of splittees with combiners 
or the subset of splittees without combiners--depending on which subset is 
larger--and merges these splittees into the splitter.
+    * In the cases where some splittees have combiners and some do not have 
combiners, the optimizer chooses either the subset of splittees with combiners 
or the subset of splittees without combiners--depending on which subset is 
larger--and merges the splittees in the chosen subset into the splitter. The 
other subset--if not empty--will not be merged.
  
  Note: As an end result this merging will result in Split or Demux operators 
with multiple stores tucked away in their nested plans.
  
@@ -638, +638 @@

  
  The demux operator is used in combiners and reducers where the input is a mix 
of different split plans of the mapper. The outputs of split plans are indexed 
and based on the index, the demux operator will decide which of it's nested 
plans a record belongs to and then attach it to that particular plan. 
  
+ More precisely, these are the steps to merge a map-reduce splittee into the 
splitter:
  
+    1. Add the map plan of the splittee to the inner plan list of the split 
operator. 
+    2. Set the index on the leaf operator of the map plan based on the order 
this map plan on the inner plan list.
+    3. Add the reduce plan of the splittee to the inner plan list of the demux 
operator in the same order as the corresponding map plan.
+    4. The outputs of merged map plan of the splitter are indexed key/value 
pairs and are sent to the reduce tasks.
+    5. The demux operator extracts the index from the key/values it receives 
and attaches them to the corresponding reduce plan in its inner plan list.
+    6. The chosen reduce plan consumes the key/values data.
+  
+ [[Anchor(PartitionScheme)]]
+ ===== Partition Scheme =====
+ 
+ What is the parallelism (the number of reduce tasks requested) of the merged 
splitter job? How do we partition the keys of the merged inner plans?
+ 
+ After considering several partition schemes, we settled on this one:
+ 
+    * The parallelism of the merged splitter job is the maximum of the 
parallelisms of all splittee jobs.
+    * The keys from inner plans are partitioned into all the buckets via the 
default hash partitioner.
+ 
+ To avoid the key collision of different inner plans with this scheme, the 
PigNullableWritable class is modified to take into account of the indexes when 
two keys are compared (hashed). 
+   
  [[Anchor(Local_Execution_engine)]]
  ==== Local Execution Engine ====
  The local engine has not changed as much as the map reduce engine. The local 
engine executes the physical plan directly. The main changes were:

[Pig Wiki] Update of "PigMultiQueryPerformanceSpecification" by RichardDing

Reply via email to