As far as I understand, this option is not enforced. I suspect it actually means 'minimum degree of parallelism', so if you are using it to reduce the number of mappers, I don't think that will work. The settings that do enforce something are the min split size and max split size in file input, so I guess you can try those. I rely on them (and expose them as job-specific options) in stochastic SVD.
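
To make the hint-versus-limit distinction concrete, here is a minimal sketch of the split-size arithmetic as I understand it from the old-API `FileInputFormat`: the requested map count only sets a goal size, which is then capped by the block size and floored by the minimum split size. The class and method names are hypothetical, a single input file is assumed, and the small slack Hadoop allows on the last split is ignored.

```java
public class SplitSizeSketch {
    // Assumed rule: split size = max(minSize, min(goalSize, blockSize)).
    public static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of splits for one file; requestedMaps plays the role of mapred.map.tasks.
    public static long numSplits(long totalSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(requestedMaps, 1);
        long size = splitSize(goalSize, minSize, blockSize);
        return (totalSize + size - 1) / size; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 1024 * mb; // 1 GB of input
        long block = 64 * mb;   // 64 MB blocks

        // Asking for fewer maps than there are blocks has no effect: still 16 splits.
        System.out.println(numSplits(total, 2, 1, block));
        // Asking for more maps shrinks the goal size below the block size: 64 splits.
        System.out.println(numSplits(total, 64, 1, block));
        // Raising the minimum split size is what actually reduces the count: 4 splits.
        System.out.println(numSplits(total, 2, 256 * mb, block));
    }
}
```

Under these assumptions, a small map count is silently ignored because the block size caps the split, while a minimum split size above the block size forces fewer, larger splits.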
But forcing the split size up usually creates a 'supersplits' problem, where a lot of data is moved around just to feed the mappers. That is perhaps why this option is meant only to increase parallelism, and probably not to decrease it.

-d

On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <[email protected]> wrote:

> This is supposed to be a generic option. You should be able to specify
> Hadoop options such as this on the command line invocation of your favorite
> Mahout routine, but I'm having a similar problem setting
> -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with and
> without a space after the -D.
>
> Can someone point me to a Mahout command where this does work? Both drivers
> extend AbstractJob and do the usual option processing pushups. I don't have
> Hadoop source locally so I can't debug the generic options parsing.
>
> -----Original Message-----
> From: beneo_7 [mailto:[email protected]]
> Sent: Monday, December 27, 2010 10:45 PM
> To: [email protected]
> Subject: where i can set -Dmapred.map.tasks=X
>
> i read on Mahout in Action that I should set -Dmapred.map.tasks=X
> but it did not work for hadoop
>
