You might find http://www.lucidimagination.com/search/document/39b53fbf4b525f2f/lda_only_executes_a_single_map_task_per_iteration_when_running_in_actual_distributed_mode#311eb323a8208e28 informative.
(BTW, LDA is only meant to run w/ TF) -Grant

On May 19, 2010, at 9:49 PM, Jeff Eastman wrote:

> I ran the Reuters dataset against LDA yesterday on a 2-node cluster and it
> took a really long time to converge (100 iterations * 10 min ea) extracting
> 20 topics. I was able to reduce the iteration time by 50% by using just TF
> and SeqAccSparseVectors, but it was still only using a single mapper, and
> that was where most of the time was spent. Digging backwards, I found that
> there is only a single vector file produced by seqtosparse and also by
> seqdirectory, so that made sense.
>
> I tried adding a '-chunk 5' param to seqdirectory, but internally that got
> boosted up to 64, so I removed the boost code and am now able to get 3 part
> files in tokenized-documents.
>
> I've tried a similar trick with seqtosparse, but its chunk argument only
> affects the dictionary.file chunking. I also tried running it with 4
> reducers, but I still get only a single part file in vectors. (It does seem
> that seqtosparse would produce multiple partial vector files if the
> dictionary were chunked, but the code then recombines those vectors to
> produce a single file.)
>
> I cannot imagine how one could ever get LDA to scale if it is always
> limited to a single input vector file. Is there a way to get multiple
> output vector files from seqtosparse?
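The workaround Jeff is after amounts to rewriting one big file into several part files so Hadoop can schedule one map task per part. Here is a minimal, Hadoop-free sketch of that round-robin split; `PartFileSplitter` and its line-oriented I/O are illustrative only (a real version would read and write the vectors with `org.apache.hadoop.io.SequenceFile.Reader`/`Writer`), though the `part-NNNNN` naming does follow Hadoop's output convention:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch: split a single record file into numParts part files, round-robin,
// mimicking how a single vector SequenceFile could be rewritten into several
// part files so that Hadoop assigns one mapper per part.
public class PartFileSplitter {
  public static List<Path> split(Path input, Path outDir, int numParts) throws IOException {
    Files.createDirectories(outDir);
    List<Path> parts = new ArrayList<>();
    List<BufferedWriter> writers = new ArrayList<>();
    for (int i = 0; i < numParts; i++) {
      // Hadoop-style part file names: part-00000, part-00001, ...
      Path p = outDir.resolve(String.format("part-%05d", i));
      parts.add(p);
      writers.add(Files.newBufferedWriter(p));
    }
    int n = 0;
    try (BufferedReader reader = Files.newBufferedReader(input)) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Deal records out across the parts like a deck of cards.
        BufferedWriter w = writers.get(n++ % numParts);
        w.write(line);
        w.newLine();
      }
    }
    for (BufferedWriter w : writers) {
      w.close();
    }
    return parts;
  }
}
```

With 10 records split 3 ways, the parts end up with 4, 3, and 3 records, and every record lands in exactly one part, which is what lets each mapper see a disjoint slice of the input.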
