This change is fine for our use at Mendeley too. We currently set PARALLEL manually most of the time, and it would be great to have it set automatically in the cases where we don't. It will be of most benefit to users who do not fully understand how MapReduce works but still use Pig for various tasks.
Thinking about how to set it, is the input size the best option? Might the number of reducers available in the cluster be worth looking at too, since that can affect how a job performs on a given dataset and cluster?

Thanks,

On 2 July 2010 00:57, Dmitriy Ryaboy <[email protected]> wrote:

> We (Twitter) are fine with this change.
>
> On Thu, Jul 1, 2010 at 4:45 PM, Aravind Srinivasan <[email protected]> wrote:
>
> > Dear Pig Users,
> >
> > My name is Aravind Srinivasan and I am the Product Manager for Pig at Yahoo.
> > The Pig team would love to get your feedback on the proposal below.
> > Basically we are trying to figure out if this enhancement would break
> > backwards compatibility for your system and, if so, what your thoughts are on
> > the trade-off between the cost and the benefit. Please drop me an e-mail
> > ([email protected]) if you have an opinion on this.
> >
> > Summary:
> > Currently, if PARALLEL is not specified, the default value is 1, which most
> > of the time is not what users want and has caused some problems in the
> > clusters in the past. The proposal is to use a very basic heuristic based
> > on the input size to set a better value. This can be an issue for users who
> > expect just a single part file in the output.
> >
> > Jira for your reference:
> > https://issues.apache.org/jira/browse/PIG-1249
> >
> > Thanks,
> > Aravind

--
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015
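
To make the question above concrete, here is a rough Python sketch of a heuristic that combines both signals: the input size and the reduce capacity of the cluster. It is only an illustration and not what PIG-1249 implements. The function name, the 1 GB-per-reducer figure, and the `cluster_reduce_slots` parameter are all assumptions made up for this example.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=1024 ** 3,
                      cluster_reduce_slots=None):
    """Guess a reducer count when PARALLEL is not specified.

    All names and defaults here are hypothetical, chosen for illustration:
    - bytes_per_reducer: assumed amount of input one reducer should handle.
    - cluster_reduce_slots: assumed number of reduce slots in the cluster,
      or None to ignore cluster capacity entirely.
    """
    if input_bytes < 0:
        raise ValueError("input_bytes must be non-negative")
    if bytes_per_reducer <= 0:
        raise ValueError("bytes_per_reducer must be positive")

    # Size-based estimate: integer ceiling division, never fewer than 1.
    by_size = max(1, (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer)

    if cluster_reduce_slots is None:
        return by_size
    if cluster_reduce_slots < 1:
        raise ValueError("cluster_reduce_slots must be at least 1")

    # Cap by what the cluster can run at once, so large inputs do not
    # request more reducers than there are slots.
    return min(by_size, cluster_reduce_slots)
```

With these assumed defaults, a 10 GB input would ask for 10 reducers, while a 500 GB input on a cluster with 40 reduce slots would be capped at 40. Users who rely on a single part file in the output would still need to set PARALLEL 1 explicitly.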
