Pig users might not know enough to decide on a good default parallelism,
especially when running ad hoc queries.

Instead of defaulting to 1, if a user does not specify the parallelism, we
should default to a higher number that does not hurt the throughput of the
system.

Hadoop-dev might be able to guide us on the extent to which Hadoop scales
linearly with an increasing number of reducers. For example, if we are able to
scale linearly up to x reducers, we can use a default of
min(max_reducers_possible, max_reducers_linear), where max_reducers_linear = x.
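
To make the proposal concrete, here is a rough sketch in Python. It is not Pig
code: the function name and the sample numbers are made up for illustration,
and the linear-scaling limit is a value we would still need to get from
Hadoop-dev.

```python
def default_parallelism(max_reducers_possible, max_reducers_linear):
    """Pick a default reducer count for a script that specifies none.

    max_reducers_possible: reduce slots the job could get (assumed known).
    max_reducers_linear:   largest reducer count up to which throughput is
                           assumed to scale linearly (the "x" above).
    """
    if max_reducers_possible < 1 or max_reducers_linear < 1:
        raise ValueError("both limits must be at least 1")
    # Never ask for more reducers than exist, and never go past the point
    # where adding reducers stops paying off.
    return min(max_reducers_possible, max_reducers_linear)


# Hypothetical numbers, for illustration only.
print(default_parallelism(200, 64))
print(default_parallelism(10, 64))
```

With these sample inputs the default is capped by the linear-scaling limit in
the first call and by the available reducers in the second.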

On 3/23/09 3:09 PM, "Santhosh Srinivasan (JIRA)" <j...@apache.org> wrote:

> Use of default parallelism
> --------------------------
>                  Key: PIG-729
>                  URL: https://issues.apache.org/jira/browse/PIG-729
>              Project: Pig
>           Issue Type: Bug
>           Components: impl
>     Affects Versions: 1.0.1
>          Environment: Hadoop 0.20
>             Reporter: Santhosh Srinivasan
>              Fix For: 1.0.1
> Currently, if the user does not specify the number of reduce slots using the
> parallel keyword, Pig lets Hadoop decide on the default number of reducers.
> This model worked well with dynamically allocated clusters using HOD and for
> static clusters where the default number of reduce slots was explicitly set.
> With Hadoop 0.20, a single static cluster will be shared amongst a number of
> queues. As a result, a common scenario is to end up with the default number of
> reducers set to one (1).
> When users migrate to Hadoop 0.20, they might see a dramatic change in the
> performance of their queries if they had not used the parallel keyword to
> specify the number of reducers. In order to mitigate such circumstances, Pig
> can support one of the following:
> 1. Specify a default parallelism for the entire script.
> This option will allow users to use the same parallelism for all operators
> that do not have the explicit parallel keyword. This will ensure that the
> scripts utilize more reducers than the default of one reducer. On the
> downside, operations performed towards the end of a script usually need fewer
> reducers than the operators at the beginning, because of the intervening data
> transformations.
> 2. Display a warning message for each reduce-side operator that does not have
> the explicit parallel keyword. Proceed with the execution.
> 3. Display an error message indicating the operator that does not have the
> explicit use of the parallel keyword. Stop the execution.
> Other suggestions/thoughts/solutions are welcome.
