PigCookBook use of PARALLEL keyword

                 Key: PIG-1081
                 URL: https://issues.apache.org/jira/browse/PIG-1081
             Project: Pig
          Issue Type: Bug
          Components: documentation
    Affects Versions: 0.5.0
            Reporter: Viraj Bhat
             Fix For: 0.5.0

Hi all,
 I am looking at some of the tips for optimizing Pig programs (the Pig
Cookbook), specifically the guidance on the PARALLEL keyword.

We know that Pig 0.5 currently uses Hadoop 0.20 as its default, which launches
1 reducer for every job unless the user requests otherwise.
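
For example, the PARALLEL clause can override that default on any operator
that forces a reduce phase. A minimal sketch (the relation names and the value
10 are illustrative, not from the Cookbook):

    A = LOAD 'input' AS (name:chararray, age:int);
    -- Without PARALLEL, this GROUP runs with Hadoop's default of 1 reducer;
    -- PARALLEL 10 requests 10 reducers for this reduce phase.
    B = GROUP A BY name PARALLEL 10;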

In this documentation we state that the number of reducers should be set to:
<num machines> * <num reduce slots per machine> * 0.9. This guidance was valid
for HoD (Hadoop on Demand), where you create your own Hadoop clusters, but if
you are using either the Capacity Scheduler
(http://hadoop.apache.org/common/docs/current/capacity_scheduler.html) or the
Fair Share Scheduler
(http://hadoop.apache.org/common/docs/current/fair_scheduler.html), this
formula could mean that you are requesting around 90% of the reducer slots in
the entire shared cluster.

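To make that concrete (the cluster size here is hypothetical): on a shared
cluster of 100 machines with 4 reduce slots each, the formula gives

    100 * 4 * 0.9 = 360 reducers

which claims 90% of all 400 reduce slots, regardless of how many other users'
jobs the scheduler is serving.
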
We should change this to something like:
The number of reducers you may need for a particular construct in Pig that
forms a Map-Reduce boundary depends entirely on your data and the number of
intermediate keys you are generating in your mappers. In the best cases we
have seen, a reducer processing about 500 MB of data behaves efficiently.
Additionally, it is hard to define the optimum number of reducers, since it
depends completely on the partitioner and the distribution of the map
(combiner) output.
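
A sketch of how a user might apply that 500 MB-per-reducer rule of thumb (the
relation names and data sizes below are hypothetical):

    -- Suppose roughly 100 GB of intermediate map output reaches this GROUP:
    -- 100 GB / 500 MB per reducer ~= 200 reducers.
    grouped = GROUP logs BY user_id PARALLEL 200;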

