PPS: it doesn't tell you which property FileInputFormat actually uses for it, and i don't remember off the top of my head either, but i assume you could set those with -D as well.
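
For reference, here is a rough sketch of the split arithmetic behind the discussion quoted below, as i recall it from the old-API FileInputFormat (org.apache.hadoop.mapred) in the 0.20 line. It is plain Java and does not call Hadoop. The names SplitSizeSketch, splitSize and countSplits are made up for illustration, and the formula and the 1.1 slack are from memory, so please check them against the source of your release. It tries to show why mapred.map.tasks only acts as a hint: it can push the split size below the block size, but it cannot raise it, whereas a minimum split size can.

```java
// Sketch only: mirrors the split-size arithmetic as remembered from the
// old-API FileInputFormat. Nothing here is Hadoop code; all names are hypothetical.
public class SplitSizeSketch {

    // Slack factor (assumption from memory): a tail of up to 1.1 * splitSize
    // is kept as one final split instead of being cut again.
    static final double SPLIT_SLOP = 1.1;

    // numMapTasksHint stands in for mapred.map.tasks; minSplitSize stands in
    // for the minimum split size (mapred.min.split.size in 0.20, i believe).
    static long splitSize(long totalSize, long blockSize,
                          int numMapTasksHint, long minSplitSize) {
        long goalSize = totalSize / Math.max(1, numMapTasksHint);
        // The hint can only make splits smaller than a block, never larger.
        return Math.max(minSplitSize, Math.min(goalSize, blockSize));
    }

    // Counts the splits a single file of fileSize bytes would be cut into.
    static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // the tail becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 1024 * mb;
        long block = 64 * mb;

        // Asking for fewer mappers than blocks changes nothing: 16 splits.
        System.out.println(countSplits(total, splitSize(total, block, 2, 1)));
        // Asking for more mappers than blocks does work: 32 splits.
        System.out.println(countSplits(total, splitSize(total, block, 32, 1)));
        // Only a larger minimum split size reduces the count: 4 splits.
        System.out.println(countSplits(total, splitSize(total, block, 2, 256 * mb)));
    }
}
```

So with a 1 GB input and 64 MB blocks, a hint of 2 map tasks still gives 16 splits, a hint of 32 gives 32, and only a 256 MB minimum split size brings it down to 4. In this sketch the slack means the last split can end up as large as 1.1 times the split size (a 270 MB file with a 256 MB split size stays in one piece).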

On Tue, Dec 28, 2010 at 4:54 PM, Dmitriy Lyubimov <[email protected]> wrote:

> In particular, QJob is one of the drivers that uses that, in the following
> way:
>
> if (minSplitSize > 0)
>   SequenceFileInputFormat.setMinInputSplitSize(job, minSplitSize);
>
> An interesting peculiarity of that parameter is that in the current hadoop
> release, for anything derived from FileInputFormat, it ensures that all
> splits are at least that big and the last split is at least 1.1 times that
> big. I am not quite sure why the last split gets special treatment, but
> that's how it goes there.
>
> -Dmitriy
>
>
> On Tue, Dec 28, 2010 at 4:48 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Jeff,
>>
>> it's the mahout-376 patch; i don't think it is committed. the driver class
>> there is SSVDCli. for your convenience you can find it here:
>> https://github.com/dlyubimov/ssvd-lsi/tree/givens-ssvd/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd
>>
>> but like i said, i did not try to use it with the -D option, since i
>> wanted to give an explicit option to increase split size if needed (and
>> help text for it). Another reason is that the solver has a series of jobs,
>> and only those reading the source matrix have anything to do with the
>> split size.
>>
>>
>> -d
>>
>>
>> On Tue, Dec 28, 2010 at 4:39 PM, Jeff Eastman <[email protected]> wrote:
>>
>>> What's the driver class? If the -D parameters are working for you I want
>>> to compare to the clustering drivers.
>>>
>>> -----Original Message-----
>>> From: Dmitriy Lyubimov [mailto:[email protected]]
>>> Sent: Tuesday, December 28, 2010 4:37 PM
>>> To: [email protected]
>>> Subject: Re: where i can set -Dmapred.map.tasks=X
>>>
>>> as far as i understand, this option is not forced. I suspect it actually
>>> means 'minimum degree of parallelism'. so if you expect to use it to
>>> reduce the number of mappers, i don't think that is expected to work.
>>> The ones that do enforce anything are min split size and max split size
>>> in file input, so i guess you can try those. I rely on them (and open
>>> them up as a job-specific option) in stochastic svd.
>>>
>>> but usually forcing split size to increase creates a 'supersplits'
>>> problem, where a lot of data is moved around just to supply data to
>>> mappers. which is perhaps why this option is meant to increase
>>> parallelism only, but probably not to decrease it.
>>>
>>> -d
>>>
>>> On Tue, Dec 28, 2010 at 4:05 PM, Jeff Eastman <[email protected]> wrote:
>>>
>>> > This is supposed to be a generic option. You should be able to specify
>>> > Hadoop options such as this on the command line invocation of your
>>> > favorite Mahout routine, but I'm having a similar problem setting
>>> > -Dmapred.reduce.tasks=10 with Canopy and k-Means. This is both with
>>> > and without a space after the -D.
>>> >
>>> > Can someone point me to a Mahout command where this does work? Both
>>> > drivers extend AbstractJob and do the usual option processing pushups.
>>> > I don't have Hadoop source locally so I can't debug the generic
>>> > options parsing.
>>> >
>>> > -----Original Message-----
>>> > From: beneo_7 [mailto:[email protected]]
>>> > Sent: Monday, December 27, 2010 10:45 PM
>>> > To: [email protected]
>>> > Subject: where i can set -Dmapred.map.tasks=X
>>> >
>>> > i read in Mahout in Action that I should set -Dmapred.map.tasks=X
>>> > but it did not work for hadoop
