Hey Ted,

Thanks for the suggestion; I have a question though. What I am building here is a system that can find similar documents using a topic model, for automatic organization. I would train a topic model over Wikipedia, and then, given a set of query documents from a user, infer the most representative words for those query documents, according to Wikipedia that is... Since Wikipedia is so diverse, I am hoping to get enough relevant words. Query documents with similar topic distributions will end up being organized in the same folder (of course this system is not fully automated, and there is a UI, but that is a different discussion).
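Just to make the folder-assignment idea concrete, here is a minimal sketch (not my actual system): assuming each document has already been reduced to a topic-distribution vector by the Wikipedia-trained model, documents could be grouped greedily by cosine similarity of those distributions. The document names, distributions, and threshold below are made up for illustration.

```python
from math import sqrt

def cosine(p, q):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def group_by_topics(docs, threshold=0.9):
    """Greedily place each document into the first group whose first
    member has a similar topic distribution (a hypothetical stand-in
    for the folder-assignment step)."""
    groups = []  # each group is a list of (name, distribution) pairs
    for name, dist in docs:
        for group in groups:
            if cosine(group[0][1], dist) >= threshold:
                group.append((name, dist))
                break
        else:
            groups.append([(name, dist)])
    return groups

# Toy distributions over 3 topics (invented for illustration).
docs = [
    ("doc_a", [0.80, 0.10, 0.10]),
    ("doc_b", [0.75, 0.15, 0.10]),
    ("doc_c", [0.10, 0.10, 0.80]),
]
groups = group_by_topics(docs)
# doc_a and doc_b share a dominant topic, so they land in one folder;
# doc_c lands in its own.
```

In the real system the distributions would come from the trained model's inference step rather than being hard-coded, and the grouping would feed the folder UI.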
So my question is: if there are a lot of topics bunched in, as you suggested I do with 5-100, that may give me multiple concepts for the same documents; concepts that are actually conceptually different and not all representative of the query document. Is my interpretation here correct?

Regards,
Sid

On Fri, Oct 22, 2010 at 7:53 PM, Ted Dunning <[email protected]> wrote:
> Start much smaller on the number of latent factors (topics). Try 5, 10, and
> then 20.
>
> On Fri, Oct 22, 2010 at 5:35 PM, Sid <[email protected]> wrote:
>
> > At the moment I employ a simple Hadoop setup with 2.00 GB heap space per
> > computer: a 2-node cluster with 20 map processes and 4 reducers, but I can
> > change this. BTW, what defines the upper limit on the heap space? I can't
> > go beyond 2 GB; Hadoop says it's above the valid limit.

--
Sidharth Gupta
1249 E Spence Avenue
Tempe, AZ 85281
480-307-5994
