You might find http://www.lucidimagination.com/search/document/39b53fbf4b525f2f/lda_only_executes_a_single_map_task_per_iteration_when_running_in_actual_distributed_mode#311eb323a8208e28 informative.

(BTW, LDA is only meant to run w/ TF)

-Grant

On May 19, 2010, at 9:49 PM, Jeff Eastman wrote:

> I ran the Reuters dataset against LDA yesterday on a 2-node cluster, and it 
> took a really long time to converge (100 iterations * 10 min ea) extracting 
> 20 topics. I was able to reduce the iteration time by 50% by using just TF 
> and SeqAccSparseVectors, but it was still only using a single mapper, and that 
> was where most of the time was spent. Digging backwards, I found that only a 
> single vector file is produced by seqtosparse and also by seqdirectory, so 
> that made sense.
> 
> I tried adding a '-chunk 5' param to seqdirectory, but internally that got 
> boosted up to 64, so I removed the boost code and can now get 3 part 
> files in tokenized-documents.
> 
> I've tried a similar trick with seqtosparse, but its chunk argument only 
> affects the dictionary.file chunking. I also tried running it with 4 reducers, 
> but I still get only a single part file in vectors. (It does seem that 
> seqtosparse would produce multiple partial vector files if the dictionary 
> were chunked, but the code then recombines those vectors to produce a single 
> file.)
> 
> I cannot imagine how one could ever get LDA to scale if it is always limited 
> to a single input vector file. Is there a way to get multiple output vector 
> files from seqtosparse?
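
The parallelism issue above comes down to one mapper per input part file: with a single vector file, LDA gets a single map task regardless of cluster size. A minimal sketch of the size-based chunking idea (the chunk_documents helper below is purely illustrative Python, not Mahout's actual seqdirectory/seqtosparse code; the size budget stands in for the '-chunk' megabyte limit):

```python
# Illustrative sketch: split (id, text) documents into several part
# files capped at a byte budget, so a Hadoop job can run one mapper
# per part instead of a single mapper over one monolithic file.
# chunk_documents is a hypothetical helper, not a Mahout API.

def chunk_documents(docs, chunk_bytes):
    """Greedily pack (doc_id, text) pairs into parts of at most
    chunk_bytes each (a part always holds at least one document)."""
    parts, current, size = [], [], 0
    for doc_id, text in docs:
        doc_size = len(text.encode("utf-8"))
        if current and size + doc_size > chunk_bytes:
            parts.append(current)   # close the current part file
            current, size = [], 0
        current.append((doc_id, text))
        size += doc_size
    if current:
        parts.append(current)
    return parts

# 30 documents of ~500 bytes each, with a 2500-byte part budget,
# yield 6 part files -> up to 6 concurrent mappers instead of 1.
docs = [(i, "word " * 100) for i in range(30)]
parts = chunk_documents(docs, chunk_bytes=2500)
print(len(parts))  # → 6
```

This is only the packing step, of course; the open question in the thread remains whether seqtosparse can be made to emit its vectors as multiple part files rather than recombining them into one.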
