[ https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697641#action_12697641 ]
David Ciemiewicz commented on PIG-729:
--------------------------------------

Ah wait, I just read what Olga wrote again.

I think there might be a hybrid solution that handles both cases without having to use -param.

We should add to Pig a -set option that lets us set values for things that we would "set" in our scripts.

pig -set parallelism=5

is equivalent to the following idiom in my Pig script:

set parallelism 5;

Command-line -set options should override explicit set statements in the Pig script, with a warning about the override.

I think this generalized mechanism would satisfy both my desires as a developer and Olga's desire to reduce the Pig development team's code maintenance headaches.

> Use of default parallelism
> --------------------------
>
>                 Key: PIG-729
>                 URL: https://issues.apache.org/jira/browse/PIG-729
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.2.1
>         Environment: Hadoop 0.20
>            Reporter: Santhosh Srinivasan
>             Fix For: 0.2.1
>
>
> Currently, if the user does not specify the number of reduce slots using the
> parallel keyword, Pig lets Hadoop decide on the default number of reducers.
> This model worked well with dynamically allocated clusters using HOD and for
> static clusters where the default number of reduce slots was explicitly set.
> With Hadoop 0.20, a single static cluster will be shared amongst a number of
> queues. As a result, a common scenario is to end up with the default number of
> reducers set to one (1).
> When users migrate to Hadoop 0.20, they might see a dramatic change in the
> performance of their queries if they had not used the parallel keyword to
> specify the number of reducers. In order to mitigate such circumstances, Pig
> can support one of the following:
> 1. Specify a default parallelism for the entire script.
> This option will allow users to use the same parallelism for all operators
> that do not have the explicit parallel keyword. This will ensure that the
> scripts utilize more reducers than the default of one reducer. On the down
> side, due to data transformations, operations that are performed towards the
> end of the script will usually need a smaller number of reducers compared
> to the operators that appear at the beginning of the script.
> 2. Display a warning message for each reduce-side operator that does not have
> the explicit parallel keyword. Proceed with the execution.
> 3. Display an error message indicating the operator that does not have the
> explicit parallel keyword. Stop the execution.
> Other suggestions/thoughts/solutions are welcome.
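
As a rough illustration of the -set proposal in the comment above, combined with option 1 from the issue description, the precedence could look something like the sketch below. This is not Pig code: the function names merge_settings and resolve_parallelism, the "parallelism" key, and the fallback of one reducer are all assumptions made for the example (the fallback simply mirrors the "default of one reducer" scenario described in the issue).

```python
import warnings


def merge_settings(script_settings, cli_settings):
    """Combine script-level `set` values with command-line -set values.

    Hypothetical helper: command-line values win, and a warning is issued
    whenever a command-line value replaces one set in the script.
    """
    effective = dict(script_settings)
    for key, cli_value in cli_settings.items():
        if key in script_settings:
            warnings.warn(
                f"-set {key}={cli_value} overrides script setting "
                f"{key}={script_settings[key]}"
            )
        effective[key] = cli_value
    return effective


def resolve_parallelism(operator_parallel, settings, hadoop_default=1):
    """Pick the reducer count for a single reduce-side operator.

    Assumed precedence: an explicit per-operator parallel value, then the
    script-wide default, then whatever the cluster would choose
    (hadoop_default=1 is an assumption for this sketch).
    """
    if operator_parallel is not None:
        return operator_parallel
    if "parallelism" in settings:
        return int(settings["parallelism"])
    return hadoop_default
```

With a script containing "set parallelism 5;" and a command line of "pig -set parallelism=20", merge_settings({"parallelism": "5"}, {"parallelism": "20"}) would warn about the override, and an operator without an explicit parallel keyword would then resolve to 20 reducers. An operator that does carry its own parallel value keeps it, which leaves room for the smaller reducer counts that later stages of a script tend to need.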