It is much more common to use 5-100 topics. These are really latent dimensions, and the number is not a fair measure of how many different topics documents can be about. In a moderate-dimensional space (say 40-100 dimensions) you can fit a boatload of concepts without even squeezing.
As an example, with recommendation systems working over movies or similar content, you typically only need 5 latent factors with a good relevance model like the log-linear model of Menon and Elkan. Even with a bad relevance model, such as the one used by LSI or the old HNC MatchPlus system, 100 dimensions works great. LDA has a pretty good relevance model.

On Fri, Oct 22, 2010 at 5:35 PM, Sid <[email protected]> wrote:

> LDA options chosen
>
> I chose about 1000 topics to fit the model with a smoothing of 0.05
> (50/numtopics) and decided to use Mahout and mapreduce.
>
> The space required by the big dense matrix that LDA uses is
> 440000 (vocab) * 1000 (topics) * 8 (sizeof int64) = 3.52 GB. Is this
> matrix kept in memory all at once? What is the implementation?
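The arithmetic in the quoted question checks out; a minimal sketch of the estimate, using only the figures from the message (440,000 vocabulary terms, 1000 topics, 8 bytes per entry -- variable names here are illustrative, not from Mahout):

```python
# Back-of-the-envelope memory estimate for a dense topic-term matrix.
vocab_size = 440_000   # number of distinct terms in the vocabulary
num_topics = 1_000     # topics chosen for the LDA model
bytes_per_entry = 8    # 64-bit values, as in the quoted message

total_bytes = vocab_size * num_topics * bytes_per_entry
print(total_bytes / 1e9)  # 3.52 (decimal gigabytes)
```

Dropping to the 5-100 topic range discussed above shrinks this same matrix by one to two orders of magnitude, which is part of why a smaller number of latent dimensions is attractive.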
