I'd also like to know the answer to Isabel's question on how to generate the 
input vectors manually (not from Lucene).
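For what it's worth, here is how I picture the manual route, independent of any Mahout class names (which I'm not sure of, hence no Java here): your own pipeline tokenises/stems the documents, you build a dictionary mapping each term to a stable integer id, and each document becomes a sparse term-frequency vector keyed by those ids. A toy sketch of just that step:

```python
# Minimal sketch of building sparse term-frequency vectors by hand,
# not via Lucene. Assumes documents are already tokenised and stemmed
# by an external pipeline; the Mahout-side serialization of these
# vectors is not shown.
from collections import Counter

def build_dictionary(tokenised_docs):
    """Assign a stable integer id to every distinct term."""
    terms = sorted({t for doc in tokenised_docs for t in doc})
    return {term: i for i, term in enumerate(terms)}

def vectorise(doc, dictionary):
    """Sparse term-frequency vector as {term_id: count}."""
    counts = Counter(doc)
    return {dictionary[t]: c for t, c in counts.items() if t in dictionary}

# Toy stemmed German tokens, standing in for real pipeline output.
docs = [["haus", "katz", "haus"], ["katz", "hund"]]
d = build_dictionary(docs)
vectors = [vectorise(doc, d) for doc in docs]
```

The remaining work would be writing these {id: count} maps into whatever on-disk vector format the LDA driver expects.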
Another LDA question: I've trained the LDA model, which gives me a set of 
topics. I see that I can now use LDAInference to try to classify a new document 
w.r.t. these topics. But how can I perform IR tasks, i.e., retrieve the 
training documents that are most similar to a new document?
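One approach I've been considering (purely my own sketch, not something I've found in Mahout): infer the topic distribution for every training document and for the query document, then rank training documents by a similarity measure over those distributions, e.g. cosine. Roughly:

```python
# Illustrative sketch, not Mahout API: rank training documents by the
# similarity of their inferred LDA topic distributions to a query
# document's distribution. The distributions themselves would come
# from something like LDAInference.
import math

def cosine(p, q):
    """Cosine similarity between two dense topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def rank_by_topic_similarity(query_dist, training_dists):
    """Return training-document indices, most similar first."""
    scores = [(cosine(query_dist, d), i) for i, d in enumerate(training_dists)]
    return [i for _, i in sorted(scores, reverse=True)]

# Toy p(topic | doc) vectors over 3 topics.
training = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3]]
query = [0.7, 0.2, 0.1]
order = rank_by_topic_similarity(query, training)
```

A divergence measure such as symmetric KL might be more principled for probability distributions, but cosine keeps the sketch simple.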

----------------------------------------
> Date: Thu, 3 Sep 2009 16:31:15 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: LDA tutorial?
>
> On Wed, 2 Sep 2009 14:38:54 -0700
> Grant Ingersoll wrote:
>
>> http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
>
> I have followed the tutorial and was able to run lda on the reuters
> dataset. Some questions that occurred to me:
>
> Looking at the resulting topics it seems like no stemming or
> lemmatization has been done prior to generating the vectors. Is that
> right?
>
> Do we have documentation on the vector format? I found
> http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html but that
> describes how to generate vectors from Lucene. I would like to run
> MAHOUT-123 on a set of vectors generated from German texts. We already
> have a document processing pipeline that is capable of tokenisation,
> stemming, term selection and the like that I would like to reuse. I
> guess I could reuse the org.apache.mahout.utils.vector.*
> classes?
>
> Isabel
