I ran LDA on the Reuters dataset yesterday on a 2-node cluster, and it took a very long time to converge when extracting 20 topics (100 iterations at ~10 min each). I was able to cut the per-iteration time by 50% by using plain TF weighting and SeqAccSparseVectors, but the job still used only a single mapper, and that was where most of the time was spent. Digging backwards, I found that seqtosparse, and also seqdirectory, produce only a single vector file, so that made sense.

I tried adding a '-chunk 5' param to seqdirectory, but internally that value got boosted up to 64. After removing the boost code, I now get 3 part files in tokenized-documents.
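For reference, the invocation I'm using is roughly the following (a sketch, not verbatim: flag names are from memory of the Mahout 0.x CLI, and the paths are placeholders, so check `bin/mahout seqdirectory --help` before copying):

```shell
# Convert a directory of plain-text documents into SequenceFiles,
# asking for ~5 MB chunks so that more than one part file comes out.
# With the stock code the chunk size gets boosted to 64 MB, which is
# why everything ended up in a single part file until I patched it.
bin/mahout seqdirectory \
  -i reuters-extracted \   # input directory of text files (placeholder path)
  -o reuters-seqfiles \    # output SequenceFile directory (placeholder path)
  -chunk 5 \               # chunk size in MB; boosted to 64 unless patched
  -c UTF-8                 # input character encoding
```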

I've tried a similar trick with seqtosparse, but its chunk argument only affects the dictionary.file chunking. I also tried running it with 4 reducers, but I still get only a single part file in vectors. (It does look like seqtosparse would produce multiple partial vector files if the dictionary were chunked, but the code then recombines those vectors into a single file.)
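The vectorization step looks roughly like this (again a sketch with placeholder paths; the CLI driver is named seq2sparse, and I'm going from memory on the short flag names, so verify against `bin/mahout seq2sparse --help`):

```shell
# Vectorize with plain TF weighting and sequential-access sparse
# vectors, which halved the per-iteration LDA time for me.
bin/mahout seq2sparse \
  -i reuters-seqfiles \   # tokenized SequenceFile input (placeholder path)
  -o reuters-vectors \    # vector output directory (placeholder path)
  -wt tf \                # plain TF weighting, no IDF pass
  -seq \                  # emit SequentialAccessSparseVectors
  -chunk 100 \            # only affects dictionary.file-* chunking
  -nr 4                   # 4 reducers, yet vectors/ is still one part file
```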

I cannot imagine how one could ever get LDA to scale if it is always limited to a single input vector file. Is there a way to get multiple output vector files from seqtosparse?
