Try using a small value for the Hadoop parameter "mapred.max.split.size". For a 
file size of 8.8 MB (~9,000 KB), if you want 10 mappers you should use a max 
split size of 9000/10 = 900 KB. Note that the parameter is expressed in bytes, 
so that works out to about 900 * 1024 = 921,600 bytes. 
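
If you prefer to set it programmatically rather than on the command line, 
here is a minimal sketch of that arithmetic (the class name is hypothetical; 
it only assumes the standard Hadoop Configuration API):

import org.apache.hadoop.conf.Configuration;

// Minimal sketch: compute a max split size so a ~9,000 KB file is cut
// into roughly 10 splits, and set it on the job's Configuration.
// Remember that the property value is in bytes, not KB.
public class SplitSizeExample {
  public static void main(String[] args) {
    long fileSizeBytes = 9000L * 1024;                   // ~8.8 MB input
    int desiredMappers = 10;
    long maxSplitBytes = fileSizeBytes / desiredMappers; // ~921,600 bytes

    Configuration conf = new Configuration();
    conf.setLong("mapred.max.split.size", maxSplitBytes);
    System.out.println("mapred.max.split.size = " + maxSplitBytes);
  }
}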

I don't know if LDADriver implements the Hadoop Tool interface, but if it does 
you can pass the desired value on the command line as follows:

hadoop jar /root/mahout-core-0.2.job \
  org.apache.mahout.clustering.lda.LDADriver \
  -Dmapred.max.split.size=921600 \
  -i hdfs://master/lda/input/vectors -o hdfs://master/lda/output \
  -k 20 -v 10000 --maxIter 40
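
For reference, a driver that supports this looks roughly like the following 
skeleton (an illustrative sketch, not LDADriver's actual source): ToolRunner 
runs the generic options through GenericOptionsParser, so -D overrides land in 
the Configuration before run() is called.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Illustrative skeleton only: a driver implementing Tool, so generic
// options such as -Dmapred.max.split.size=... are absorbed into the
// Configuration before run() is invoked.
public class MyDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // already contains the -D overrides
    System.out.println("max split size = "
        + conf.get("mapred.max.split.size", "(not set)"));
    // ... build and submit the job with this conf ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}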

Please note that it won't work if LDADriver is using a fancy InputFormat other 
than FileInputFormat. The easiest way to know is just to try it!

--- On Tue, 12.1.10, Chad Hinton <[email protected]> wrote:

> From: Chad Hinton <[email protected]>
> Subject: Re: LDA only executes a single map task per iteration when
> running in actual distributed mode?
> To: "mahout-user" <[email protected]>
> Date: Tuesday, January 12, 2010, 5:13 PM
> Ted, David - thanks for your replies.
> I thought Hadoop would automatically split the file, but it isn't.
> The vectors file generated from build-reuters.sh (by using
> org.apache.mahout.utils.vectors.lucene.Driver over the Lucene index)
> comes out to around 8.8 MB. Perhaps that is too small and won't be
> split if it's below the HDFS block size. I'm using the default 64 MB
> for HDFS. Perhaps a custom InputSplit/RecordReader is needed to
> split the sequence file. I'll investigate further. If anyone has
> further pointers or more info, please chime in.
> 
> Thanks,
> Chad
> 
> > It should just happen if the file is large enough and the program is
> > configured for more than one mapper task and the file type is correct.
> 
> > If you are reading an uncompressed sequence file you should be set.
> 
> > On Mon, Jan 11, 2010 at 9:53 PM, David Hall <[email protected]> wrote:
> 
> >> I can brush up on my hadoop foo to figure out how to have hadoop
> >> split up a single file, if you want.
> >>
> 
> > --
> > Ted Dunning, CTO
> > DeepDyve
> 


