Use of default parallelism

                 Key: PIG-729
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 1.0.1
         Environment: Hadoop 0.20
            Reporter: Santhosh Srinivasan
             Fix For: 1.0.1

Currently, if the user does not specify the number of reduce slots using the 
parallel keyword, Pig lets Hadoop decide on the default number of reducers. 
This model worked well with dynamically allocated clusters using HOD and for 
static clusters where the default number of reduce slots was explicitly set. 
With Hadoop 0.20, a single static cluster will be shared amongst a number of 
queues. As a result, a common scenario is to end up with the default number of
reducers set to one (1).

When users migrate to Hadoop 0.20, they might see a dramatic change in the 
performance of their queries if they had not used the parallel keyword to 
specify the number of reducers. In order to mitigate such circumstances, Pig 
can support one of the following:

1. Specify a default parallelism for the entire script.

This option will allow users to use the same parallelism for all operators that
do not have an explicit parallel keyword. This will ensure that scripts use
more than the default of one reducer. On the downside, because of data
transformations, operations performed towards the end of the script usually
need fewer reducers than operators that appear at the beginning, so a single
default will not suit every operator.

2. Display a warning message for each reduce-side operator that does not use
the explicit parallel keyword. Proceed with the execution.

3. Display an error message identifying the operator that does not use the
explicit parallel keyword. Stop the execution.
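
To make the three options easier to compare, here is a rough sketch in Python
of how each one would treat a reduce-side operator that has no parallel
keyword. It is only an illustration under stated assumptions: the function
resolve_parallelism, the policy names, and the (alias, parallel) operator
representation are all hypothetical and do not correspond to any existing Pig
code.

```python
import warnings

# Hypothetical policy names mirroring options 1-3 above; not part of Pig.
DEFAULT, WARN, ERROR = "default", "warn", "error"


def resolve_parallelism(operators, policy, default_parallel=None):
    """Return {alias: reducer count or None} for reduce-side operators.

    operators is a list of (alias, parallel) pairs, where parallel is None
    when the script gave no parallel keyword for that operator. A value of
    None in the result means "let Hadoop decide".
    """
    resolved = {}
    for alias, parallel in operators:
        if parallel is not None:
            # An explicit parallel keyword always wins.
            resolved[alias] = parallel
        elif policy == DEFAULT:
            # Option 1: fall back to a script-wide default.
            if default_parallel is None:
                raise ValueError("policy 'default' needs default_parallel")
            resolved[alias] = default_parallel
        elif policy == WARN:
            # Option 2: warn, then proceed with Hadoop's own default.
            warnings.warn(
                f"operator '{alias}' has no parallel keyword; "
                "Hadoop will choose the number of reducers"
            )
            resolved[alias] = None
        elif policy == ERROR:
            # Option 3: report the offending operator and stop.
            raise ValueError(f"operator '{alias}' has no parallel keyword")
        else:
            raise ValueError(f"unknown policy: {policy!r}")
    return resolved
```

For example, resolve_parallelism([("grouped", 20), ("joined", None)], DEFAULT,
default_parallel=10) keeps 20 reducers for "grouped" and assigns 10 to
"joined", while the same call with ERROR stops at "joined".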

Other suggestions/thoughts/solutions are welcome.
