[ https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695599#action_12695599 ]
David Ciemiewicz commented on PIG-729:
--------------------------------------
I've been through this battle before. And I write LOTS of Pig scripts.
Here's what I want:
1) Use default parallelism of 1 reducer. BUT WARN ME that I've got a default
parallelism of 1 reducer. (I'd actually prefer whatever works on a single
node).
2) Allow me a command line option such as -parallel # or -mappers # -reducers #.
3) Allow me a set parameter inside my Pig scripts such as:
set parallel #
set mappers #
set reducers #
4) DO NOT require me to add a PARALLEL clause to each and every one of my
reducer statements.
PARALLEL clauses are a code maintenance nightmare.
Sometimes the grid is fat on available nodes and so I want to take advantage of
this and run my job across as many nodes as possible.
Sometimes the grid is scarce on available nodes and so I want to back off on the
parallelism.
I DO NOT WANT to change EVERY PARALLEL clause in my code each time I run my
script.
I DO NOT WANT to change parameter values for the PARALLEL clause each time I
run my script.
I really, really, really want to make this a run-time decision on the execution
of the script at the time that I invoke the script and I want this to be the
default behavior in Pig.
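
To make the precedence I'm asking for concrete, here is a minimal sketch. It is
written in Python purely for illustration (Pig itself is Java), and the function
name, argument names, and warning text are all made up for this example; none of
this is existing Pig behavior. The assumed order is: command line value first,
then a script-level set, then the default of 1 reducer with a warning.

```python
import warnings

DEFAULT_REDUCERS = 1  # item 1 above: fall back to a single reducer


def resolve_reducers(cli_value=None, script_value=None):
    """Pick the reducer count for a run (hypothetical helper, not a Pig API).

    Assumed precedence: command-line option, then a script-level `set`,
    then the default of 1 reducer, which is reported with a warning.
    """
    for source, value in (("command line", cli_value), ("script", script_value)):
        if value is not None:
            if value < 1:
                raise ValueError(
                    f"reducer count from {source} must be >= 1, got {value}"
                )
            return value
    # Nothing was specified anywhere: use the default, but say so loudly.
    warnings.warn(
        "no parallelism specified; using default of 1 reducer", stacklevel=2
    )
    return DEFAULT_REDUCERS
```

With this, resolve_reducers(cli_value=40, script_value=10) gives 40, so the same
script can run wide on a fat grid and narrow on a scarce one without touching the
script itself.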
> Use of default parallelism
> --------------------------
>
> Key: PIG-729
> URL: https://issues.apache.org/jira/browse/PIG-729
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.2.1
> Environment: Hadoop 0.20
> Reporter: Santhosh Srinivasan
> Fix For: 0.2.1
>
>
> Currently, if the user does not specify the number of reduce slots using the
> parallel keyword, Pig lets Hadoop decide on the default number of reducers.
> This model worked well with dynamically allocated clusters using HOD and for
> static clusters where the default number of reduce slots was explicitly set.
> With Hadoop 0.20, a single static cluster will be shared amongst a number of
> queues. As a result, a common scenario is to end up with the default number of
> reducers set to one (1).
> When users migrate to Hadoop 0.20, they might see a dramatic change in the
> performance of their queries if they had not used the parallel keyword to
> specify the number of reducers. In order to mitigate such circumstances, Pig
> can support one of the following:
> 1. Specify a default parallelism for the entire script.
> This option will allow users to use the same parallelism for all operators
> that do not have the explicit parallel keyword. This will ensure that the
> scripts utilize more reducers than the default of one reducer. On the down
> side, due to data transformations, usually operations that are performed
> towards the end of the script will need a smaller number of reducers compared
> to the operators that appear at the beginning of the script.
> 2. Display a warning message for each reduce side operator that does not have the
> explicit use of the parallel keyword. Proceed with the execution.
> 3. Display an error message indicating the operator that does not have the
> explicit use of the parallel keyword. Stop the execution.
> Other suggestions/thoughts/solutions are welcome.