Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by RichardDing:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

------------------------------------------------------------------------------
   * Creating a split operator in the map or reduce and setting the splittee plans as nested plans of the split
   * If it needs to merge combiners it will introduce a Demux operator to route the input from mixed split branches in the mapper to the right combine plan. The separate combiner plans are the nested plans of the Demux operator
   * If it needs to merge reduce plans, it will do so using the Demux operator the same way the combiner is merged.
-  * In the cases where some splittees have combiners and some do not have combiners, the optimizer chooses either the subset of splittees with combiners or the subset of splittees without combiners--depending on which subset is larger--and merges these splittees into the splitter.
+  * In the cases where some splittees have combiners and some do not have combiners, the optimizer chooses either the subset of splittees with combiners or the subset of splittees without combiners--depending on which subset is larger--and merges the splittees in the chosen subset into the splitter. The other subset--if not empty--will not be merged.

  Note: The end result of this merging is Split or Demux operators with multiple stores tucked away in their nested plans.

@@ -638, +638 @@

  The demux operator is used in combiners and reducers where the input is a mix of different split plans of the mapper. The outputs of split plans are indexed and, based on the index, the demux operator will decide which of its nested plans a record belongs to and then attach it to that particular plan.

+ More precisely, these are the steps to merge a map-reduce splittee into the splitter:
+
+  1. Add the map plan of the splittee to the inner plan list of the split operator.
+  2. Set the index on the leaf operator of the map plan based on the position of this map plan in the inner plan list.
+  3. Add the reduce plan of the splittee to the inner plan list of the demux operator in the same order as the corresponding map plan.
+  4. The outputs of the merged map plan of the splitter are indexed key/value pairs and are sent to the reduce tasks.
+  5. The demux operator extracts the index from the key/values it receives and attaches them to the corresponding reduce plan in its inner plan list.
+  6. The chosen reduce plan consumes the key/value data.
+
+ [[Anchor(PartitionScheme)]]
+ ===== Partition Scheme =====
+
+ What is the parallelism (the number of reduce tasks requested) of the merged splitter job? How do we partition the keys of the merged inner plans?
+
+ After considering several partition schemes, we settled on this one:
+
+  * The parallelism of the merged splitter job is the maximum of the parallelisms of all splittee jobs.
+  * The keys from inner plans are partitioned into all the buckets via the default hash partitioner.
+
+ To avoid key collisions between different inner plans with this scheme, the PigNullableWritable class is modified to take the indexes into account when two keys are compared (hashed).
+
  [[Anchor(Local_Execution_engine)]]
  ==== Local Execution Engine ====

  The local engine has not changed as much as the map reduce engine. The local engine executes the physical plan directly. The main changes were:
