It is much more common to use 5-100 topics. These are really latent dimensions, and the number is not a fair measure of how many different topics documents can be about. In a moderate-dimensional space (say 40-100 dimensions) you can fit a boatload of concepts without even squeezing.
As an example, with recommendation systems working over movies or similar content, you typically only need 5 latent factors with a good relevance model like the log-linear model of Menon and Elkan. Even with a bad relevance model, such as the one used by LSI or the old HNC MatchPlus system, 100 dimensions works great. LDA has a pretty good relevance model.

On Fri, Oct 22, 2010 at 5:35 PM, Sid <[email protected]> wrote:

> LDA options chosen
>
> I chose about 1000 topics to fit the model with a smoothing of 0.05
> (50/numtopics) and decided to use Mahout and mapreduce.
>
> The space required by the big dense matrix that LDA uses is
> 440000 (vocab) * 1000 (topics) * 8 (sizeof int64) = 3.52 GB. Is this
> matrix kept in memory all at once? What is the implementation?
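The arithmetic in the quoted question checks out; a minimal sketch of the estimate, using only the figures from the message (440,000 vocabulary terms, 1000 topics, 8 bytes per entry -- variable names here are illustrative, not from Mahout):

```python
# Back-of-the-envelope memory estimate for a dense topic-term matrix.
vocab_size = 440_000   # number of distinct terms in the vocabulary
num_topics = 1_000     # topics chosen for the LDA model
bytes_per_entry = 8    # 64-bit values, as in the quoted message

total_bytes = vocab_size * num_topics * bytes_per_entry
print(total_bytes / 1e9)  # 3.52 (decimal gigabytes)
```

Dropping to the 5-100 topic range discussed above shrinks this same matrix by one to two orders of magnitude, which is part of why a smaller number of latent dimensions is attractive.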
