[jira] Commented: (PIG-729) Use of default parallelism

David Ciemiewicz (JIRA) Thu, 09 Apr 2009 13:28:35 -0700

    [ https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697638#action_12697638 ]


David Ciemiewicz commented on PIG-729:
--------------------------------------


Adding switches on the command line versus parameters is harder to maintain for 
whom?

As a user, it is certainly no harder for me to pass -parallelism on the 
command line than to litter my Pig scripts with PARALLEL=n statements and then 
still have to set the number of reducers in other ways.

In fact, doing this properly with command line options makes my life as a 
developer MUCH easier.

If it is "harder to maintain for the Pig development team", well, I think it 
scales far better for someone on the development team to carry that 
maintenance burden than for hundreds or thousands of Pig programmers to carry it.

There's a maxim in business -- the customer is always right. :^)


> Use of default parallelism
> --------------------------
>
>                 Key: PIG-729
>                 URL: https://issues.apache.org/jira/browse/PIG-729
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.1
>         Environment: Hadoop 0.20
>            Reporter: Santhosh Srinivasan
>             Fix For: 0.2.1
>
>
> Currently, if the user does not specify the number of reduce slots using the 
> parallel keyword, Pig lets Hadoop decide on the default number of reducers. 
> This model worked well with dynamically allocated clusters using HOD and for 
> static clusters where the default number of reduce slots was explicitly set. 
> With Hadoop 0.20, a single static cluster will be shared amongst a number of 
> queues. As a result, a common scenario is to end up with the default number 
> of reducers set to one (1).
> When users migrate to Hadoop 0.20, they might see a dramatic change in the 
> performance of their queries if they had not used the parallel keyword to 
> specify the number of reducers. In order to mitigate such circumstances, Pig 
> can support one of the following:
> 1. Specify a default parallelism for the entire script.
> This option will allow users to use the same parallelism for all operators 
> that do not have the explicit parallel keyword. This will ensure that the 
> scripts utilize more reducers than the default of one reducer. On the 
> downside, because of data transformations, operations performed towards the 
> end of the script usually need a smaller number of reducers than the 
> operators that appear at the beginning of the script.
> 2. Display a warning message for each reduce side operator that does not have 
> the explicit parallel keyword. Proceed with the execution.
> 3. Display an error message indicating the operator that does not have the 
> explicit use of the parallel keyword. Stop the execution.
> Other suggestions/thoughts/solutions are welcome.
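
The three options in the issue description can be compared with a small sketch. 
This is an illustration only, not Pig's implementation: the names ReduceOp, 
resolve_parallelism, MissingParallelError and the policy strings "default", 
"warn" and "error" are all hypothetical. It models each reduce-side operator as 
an alias with an optional explicit parallel value, and returns None where the 
choice is left to Hadoop.

```python
import warnings
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReduceOp:
    """Hypothetical stand-in for a reduce-side operator in a script."""
    alias: str
    parallel: Optional[int] = None  # explicit parallel value, if the user gave one


class MissingParallelError(Exception):
    """Raised under the 'error' policy (option 3 in the issue)."""


def resolve_parallelism(
    ops: List[ReduceOp],
    policy: str,
    default_parallel: Optional[int] = None,
) -> Dict[str, Optional[int]]:
    """Map each operator alias to a reducer count; None means 'let Hadoop decide'."""
    resolved: Dict[str, Optional[int]] = {}
    for op in ops:
        if op.parallel is not None:
            # An explicit parallel value always wins.
            resolved[op.alias] = op.parallel
        elif policy == "default":
            # Option 1: one script-wide default for every unannotated operator.
            if default_parallel is None:
                raise ValueError("policy 'default' needs default_parallel")
            resolved[op.alias] = default_parallel
        elif policy == "warn":
            # Option 2: warn, then proceed with Hadoop's own default.
            warnings.warn(f"operator '{op.alias}' has no explicit parallel value")
            resolved[op.alias] = None
        elif policy == "error":
            # Option 3: name the offending operator and stop.
            raise MissingParallelError(
                f"operator '{op.alias}' has no explicit parallel value"
            )
        else:
            raise ValueError(f"unknown policy: {policy}")
    return resolved


if __name__ == "__main__":
    ops = [ReduceOp("grouped", parallel=40), ReduceOp("joined")]
    print(resolve_parallelism(ops, "default", default_parallel=10))
```

Under the "default" policy the unannotated operator inherits the script-wide 
value, which is also where the drawback noted in option 1 shows up: late 
operators get the same count as early ones unless they are annotated 
individually.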

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
