I'd also like to know the answer to Isabel's question about generating the input vectors manually (i.e., not from Lucene). Another LDA question: I've trained an LDA model, which gives me a set of topics, and I see that I can use LDAInference to infer a topic distribution for a new document. But how can I perform IR tasks with it, e.g., retrieve the training documents that are most similar to a new document?
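One approach (not something the Mahout LDA driver does for you, as far as I can tell) is to run inference over every training document once, keep each document's topic distribution, and then rank training documents by similarity between their distributions and the new document's. A minimal sketch in Python; all names here are hypothetical, and the topic vectors stand in for whatever LDAInference produces:

```python
import math

def cosine(a, b):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(query_topics, corpus_topics, k=5):
    """Rank training documents by topic-space similarity to a query.

    corpus_topics maps a document id to its topic distribution,
    i.e. the p(topic | doc) vector obtained from LDA inference.
    Returns the top-k (doc id, distribution) pairs.
    """
    ranked = sorted(corpus_topics.items(),
                    key=lambda item: cosine(query_topics, item[1]),
                    reverse=True)
    return ranked[:k]
```

Other divergence measures between distributions (e.g. symmetrised KL) would work here too; cosine is just the simplest to sketch.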
----------------------------------------
> Date: Thu, 3 Sep 2009 16:31:15 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: LDA tutorial?
>
> On Wed, 2 Sep 2009 14:38:54 -0700
> Grant Ingersoll wrote:
>
>> http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
>
> I have followed the tutorial and was able to run LDA on the Reuters
> dataset. Some questions that occurred to me:
>
> Looking at the resulting topics, it seems that no stemming or
> lemmatization has been done prior to generating the vectors. Is that
> right?
>
> Do we have documentation on the vector format? I found
> http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html, but that
> describes how to generate vectors from Lucene. I would like to run
> MAHOUT-123 on a set of vectors generated from German texts. We already
> have a document processing pipeline that is capable of tokenisation,
> stemming, term selection and the like that I would like to reuse. I
> guess I could reuse the org.apache.mahout.utils.vector.* classes?
>
> Isabel
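On the vector-generation question: the essence of what the Lucene-based path does - map each (stemmed) term to a stable integer index and emit a sparse term-frequency vector per document - can be reproduced from any existing pipeline, including one for German text. A language-agnostic sketch in Python; function names are hypothetical, and the exact on-disk format Mahout's LDA job expects would still need to be checked against the Mahout documentation or the org.apache.mahout.utils.vector.* sources:

```python
def build_dictionary(token_docs):
    """Assign a stable integer index to every distinct term
    across all documents (each document is a list of tokens,
    already tokenised and stemmed by the upstream pipeline)."""
    dictionary = {}
    for tokens in token_docs:
        for term in tokens:
            if term not in dictionary:
                dictionary[term] = len(dictionary)
    return dictionary

def to_sparse_vector(tokens, dictionary):
    """Sparse term-frequency vector as {term index: count},
    the general form most LDA implementations consume."""
    vec = {}
    for term in tokens:
        idx = dictionary[term]
        vec[idx] = vec.get(idx, 0) + 1
    return vec
```

The dictionary must also be kept alongside the vectors, since it is the only way to map topic/term indices in the LDA output back to readable words.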
