Hey Ted,

Thanks for the suggestion; I have a question though. What I am building here is a system that can find similar documents using a topic model, for automatic organization. I would train a topic model over Wikipedia, and then, given a set of query documents from a user, infer the most representative words for those query documents, according to Wikipedia that is... Since Wikipedia is so diverse, I am hoping to get enough relevant words. Query documents with similar topic distributions will end up being organized in the same folder (of course this system is not fully automated, and there is a UI, but that is a different discussion).
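Just to make the folder-assignment idea concrete, here is a minimal sketch (not my actual system): assuming each document has already been reduced to a topic-distribution vector by the Wikipedia-trained model, documents could be grouped greedily by cosine similarity of those distributions. The document names, distributions, and threshold below are made up for illustration.

```python
from math import sqrt

def cosine(p, q):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def group_by_topics(docs, threshold=0.9):
    """Greedily place each document into the first group whose first
    member has a similar topic distribution (a hypothetical stand-in
    for the folder-assignment step)."""
    groups = []  # each group is a list of (name, distribution) pairs
    for name, dist in docs:
        for group in groups:
            if cosine(group[0][1], dist) >= threshold:
                group.append((name, dist))
                break
        else:
            groups.append([(name, dist)])
    return groups

# Toy distributions over 3 topics (invented for illustration).
docs = [
    ("doc_a", [0.80, 0.10, 0.10]),
    ("doc_b", [0.75, 0.15, 0.10]),
    ("doc_c", [0.10, 0.10, 0.80]),
]
groups = group_by_topics(docs)
# doc_a and doc_b share a dominant topic, so they land in one folder;
# doc_c lands in its own.
```

In the real system the distributions would come from the trained model's inference step rather than being hard-coded, and the grouping would feed the folder UI.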
So my question is: if there are a lot of topics bunched in, as you suggested I do with 5-100, that may give me multiple concepts for the same documents; concepts that are actually conceptually different and not all representative of the query document. Is my interpretation here correct?

Regards,
Sid

On Fri, Oct 22, 2010 at 7:53 PM, Ted Dunning <[email protected]> wrote:
> Start much smaller on the number of latent factors (topics). Try 5, 10, and
> then 20.
>
> On Fri, Oct 22, 2010 at 5:35 PM, Sid <[email protected]> wrote:
>
> > At the moment I employ a simple Hadoop setup with 2.00 GB heap space per
> > computer: a 2-node cluster with 20 map processes and 4 reducers, but I can
> > change this. BTW, what defines the upper limit on the heap space? I can't
> > go beyond 2 GB; Hadoop says it's above the valid limit.

--
Sidharth Gupta
1249 E Spence Avenue
Tempe, AZ 85281
480-307-5994
