I ran LDA on the Reuters dataset yesterday on a 2-node cluster, and it took a very long time to converge when extracting 20 topics (100 iterations at ~10 min each). I was able to cut the per-iteration time by 50% by using plain TF weighting and SeqAccSparseVectors, but the job still used only a single mapper, and that was where most of the time was spent. Digging backwards, I found that seqtosparse, and also seqdirectory, produce only a single vector file, so that made sense.

I tried adding a '-chunk 5' param to seqdirectory, but internally that value got boosted up to 64. After removing the boost code, I now get 3 part files in tokenized-documents.
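For reference, the invocation I'm using is roughly the following (a sketch, not verbatim: flag names are from memory of the Mahout 0.x CLI, and the paths are placeholders, so check `bin/mahout seqdirectory --help` before copying):

```shell
# Convert a directory of plain-text documents into SequenceFiles,
# asking for ~5 MB chunks so that more than one part file comes out.
# With the stock code the chunk size gets boosted to 64 MB, which is
# why everything ended up in a single part file until I patched it.
bin/mahout seqdirectory \
  -i reuters-extracted \   # input directory of text files (placeholder path)
  -o reuters-seqfiles \    # output SequenceFile directory (placeholder path)
  -chunk 5 \               # chunk size in MB; boosted to 64 unless patched
  -c UTF-8                 # input character encoding
```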

I've tried a similar trick with seqtosparse, but its chunk argument only affects the dictionary.file chunking. I also tried running it with 4 reducers, but I still get only a single part file in vectors. (It does look like seqtosparse would produce multiple partial vector files if the dictionary were chunked, but the code then recombines those vectors into a single file.)
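The vectorization step looks roughly like this (again a sketch with placeholder paths; the CLI driver is named seq2sparse, and I'm going from memory on the short flag names, so verify against `bin/mahout seq2sparse --help`):

```shell
# Vectorize with plain TF weighting and sequential-access sparse
# vectors, which halved the per-iteration LDA time for me.
bin/mahout seq2sparse \
  -i reuters-seqfiles \   # tokenized SequenceFile input (placeholder path)
  -o reuters-vectors \    # vector output directory (placeholder path)
  -wt tf \                # plain TF weighting, no IDF pass
  -seq \                  # emit SequentialAccessSparseVectors
  -chunk 100 \            # only affects dictionary.file-* chunking
  -nr 4                   # 4 reducers, yet vectors/ is still one part file
```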

I cannot imagine how one could ever get LDA to scale if it is always limited to a single input vector file. Is there a way to get multiple output vector files from seqtosparse?
