On Wed, 2 Sep 2009 14:38:54 -0700
Grant Ingersoll <[email protected]> wrote:

> http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html

I followed the tutorial and was able to run LDA on the Reuters
dataset. A few questions occurred to me:

Looking at the resulting topics, it seems like no stemming or
lemmatization has been done prior to generating the vectors. Is that
right?

Do we have documentation on the vector format? I found
http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html but that
describes how to generate vectors from Lucene. I would like to run
MAHOUT-123 on a set of vectors generated from German texts. We already
have a document processing pipeline that is capable of tokenisation,
stemming, term selection and the like that I would like to reuse. I
guess I could reuse the org.apache.mahout.utils.vector.*
classes?

Isabel
