[ https://issues.apache.org/jira/browse/PIG-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539597 ]
Benjamin Reed commented on PIG-16:
----------------------------------
"set parallelism" would probably be better than "set reduce_parallelism" because
parallelism can be used in other places besides reduce. For example, we will
hopefully soon detect that a dataset is already sorted and do the group by in
the map phase rather than the reduce phase.
It would probably also be better to take a number rather than TRUE or FALSE so
that you can deviate from the default when needed. I would think it should be:
set parallelism number|DEFAULT
In reality, the Hadoop configuration file should have the optimal number of
reducer tasks for a given configuration. (That is the default, not false.) If
you want to override it, you could provide a specific number.
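As a rough sketch of the semantics proposed above (not Pig's actual implementation), the value could be resolved as shown below. The function name resolve_parallelism and the configured_default parameter, which stands in for the reducer count taken from the Hadoop configuration, are hypothetical.

```python
def resolve_parallelism(value: str, configured_default: int) -> int:
    """Resolve the argument of a hypothetical `set parallelism number|DEFAULT`.

    `configured_default` is an assumed stand-in for the reducer count
    read from the Hadoop configuration file.
    """
    text = value.strip()
    if text.upper() == "DEFAULT":
        # No override requested: use the cluster-configured reducer count.
        return configured_default
    number = int(text)  # raises ValueError for non-numeric input
    if number < 1:
        raise ValueError("parallelism must be a positive integer")
    return number
```

With this shape, "set parallelism DEFAULT" defers to the configuration, and "set parallelism 4" overrides it for the session.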
I also don't think we should assume that using an AlgebraicFunction means the
parallelism should be set to 1. In general I would think that is not the case.
The only case I can think of for automatically setting it to 1 would be a
group all.
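To make the rule above concrete, here is a minimal sketch, again with hypothetical names (effective_parallelism, is_group_all, uses_algebraic_function) that do not come from the Pig code base. It assumes the planner already knows whether the operator is a group all.

```python
def effective_parallelism(
    requested: int,
    is_group_all: bool,
    uses_algebraic_function: bool = False,
) -> int:
    """Pick the reducer count for one grouping operator (illustrative only)."""
    if is_group_all:
        # A group all collects every record into a single group,
        # so a single reducer is enough.
        return 1
    # Deliberately ignore uses_algebraic_function: per the comment above,
    # an algebraic function alone is no reason to force parallelism to 1.
    return requested
```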
> setting parallel from grunt via set command
> -------------------------------------------
>
> Key: PIG-16
> URL: https://issues.apache.org/jira/browse/PIG-16
> Project: Pig
> Issue Type: Improvement
> Components: grunt
> Reporter: Olga Natkovich
> Priority: Minor
>
> I'd like to propose a different model which uses the grunt "set" option
> and/or a command line option which sets reduce
> parallelism to be true and automatic.
> set reduce_parallelism TRUE
> set reduce_parallelism FALSE [Default - BTW, why is this the default?]
> This way I won't have to update my script every single time I try playing
> with -D"hod=-m N", parallelism for reduce
> statements will default, appropriately, to 2*(N-1).
> Alternatively, could I just specify PARALLEL with no value or PARALLEL
> DEFAULT. And any time I needed to force reduce
> to be a single job, I could write PARALLEL 1.
> Basically, this whole thing tripped me up for a long time and I just haven't
> understood if there is a really good
> reason to not make parallelism the default.
> I guess it might be if you have aggregation functions that do not parallelize.
> If this is the case, then it seems to me that this should be detectable
> automagically based on whether the function is
> a vanilla EvalFunction or if it is an AlgebraicFunction.