I tried it first with the smaller number and got a lot of map tasks! But that's fine... at least I'm past the single-map-task problem. I'm tweaking the value now, but this did the trick. Thanks again.
On Tue, Jan 12, 2010 at 12:33 PM, deneche abdelhakim <[email protected]> wrote:
> Oops, sorry, the size should be specified in bytes, not kB. So 8.8 MB ~
> 9227468 bytes; to get 10 mappers, use mapred.max.split.size=922747.
>
> On Tue, 12 Jan 2010, deneche abdelhakim <[email protected]> wrote:
>
>> From: deneche abdelhakim <[email protected]>
>> Subject: Re: LDA only executes a single map task per iteration when running in
>> actual distributed mode?
>> To: [email protected]
>> Date: Tuesday, 12 January 2010, 17:43
>>
>> Try using a small value for the Hadoop
>> parameter "mapred.max.split.size". For a file size of 8.8 MB
>> (~9000 KB), if you want 10 mappers you should use a max split
>> size of 9000/10 = 900.
>>
>> I don't know if LDADriver implements the Hadoop Tool interface,
>> but if it does you can pass the desired value on the command
>> line as follows:
>>
>> hadoop jar /root/mahout-core-0.2.job
>> org.apache.mahout.clustering.lda.LDADriver
>> -Dmapred.max.split.size=900 -i
>> hdfs://master/lda/input/vectors -o hdfs://master/lda/output
>> -k 20 -v 10000
>> --maxIter 40
>>
>> Please note that it won't work if LDADriver is using a
>> fancy InputFormat other than FileInputFormat. The easiest
>> way to know is just to try it!
>>
>> On Tue, 12 Jan 2010, Chad Hinton <[email protected]> wrote:
>>
>> > From: Chad Hinton <[email protected]>
>> > Subject: Re: LDA only executes a single map task per
>> > iteration when running in actual distributed mode?
>> > To: "mahout-user" <[email protected]>
>> > Date: Tuesday, 12 January 2010, 17:13
>> >
>> > Ted, David - thanks for your replies.
>> > I thought Hadoop would automatically split the file, but
>> > it isn't doing so. The vectors file generated from
>> > build-reuters.sh (using
>> > org.apache.mahout.utils.vectors.lucene.Driver over the
>> > Lucene index) comes out to around 8.8 MB. Perhaps that is
>> > too small and won't be split because it's below the HDFS
>> > block size; I'm using the default of 64 MB. Perhaps a
>> > custom InputSplit/RecordReader is needed to split the
>> > sequence file. I'll investigate further. If anyone has
>> > further pointers or more info, please chime in.
>> >
>> > Thanks,
>> > Chad
>> >
>> > > It should just happen if the file is large enough and
>> > > the program is configured for more than one mapper task
>> > > and the file type is correct.
>> > >
>> > > If you are reading an uncompressed sequence file you
>> > > should be set.
>> > >
>> > > On Mon, Jan 11, 2010 at 9:53 PM, David Hall <[email protected]> wrote:
>> > >
>> > >> I can brush up on my hadoop foo to figure out how to
>> > >> have hadoop split up a single file, if you want.
>> > >
>> > > --
>> > > Ted Dunning, CTO
>> > > DeepDyve
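
For anyone following along: the -D trick above only works if the driver lets
ToolRunner parse generic options. Below is a minimal sketch of that pattern,
assuming the driver follows the standard ToolRunner setup; the class name
MyDriver is hypothetical and this is not LDADriver's actual code, just an
illustration of what the thread relies on.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hypothetical driver class, for illustration only.
    public class MyDriver extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        // ToolRunner has already parsed the generic options, so any
        // -Dkey=value pairs from the command line are in getConf().
        Configuration conf = getConf();

        // e.g. -Dmapred.max.split.size=922747 (8.8 MB ~ 9227468 bytes,
        // divided by 10 desired mappers) shows up here:
        long maxSplit = conf.getLong("mapred.max.split.size", -1);
        System.out.println("mapred.max.split.size = " + maxSplit);

        // Alternatively, the value can be set programmatically before
        // submitting the job:
        // conf.setLong("mapred.max.split.size", 922747L);

        // ... configure and submit the actual job using conf ...
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
      }
    }

A driver wired up this way can be invoked exactly as in the thread, with the
corrected byte value:

    hadoop jar /root/mahout-core-0.2.job
    org.apache.mahout.clustering.lda.LDADriver
    -Dmapred.max.split.size=922747 -i
    hdfs://master/lda/input/vectors -o hdfs://master/lda/output
    -k 20 -v 10000 --maxIter 40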
