Search the user list archive from a couple of days ago for "LDA Mahout"; there
is a similar thread there. The seq2sparse routine handles n-grams and has a
-maxDFPercent option that drops terms appearing in too many documents, which
works much like a stoplist. You can also specify your own analyzer, which can
use whatever stoplist you want. I'm not sure what you mean by "contextual
information", but the document term vectors produced by seq2sparse are each
wrapped in a NamedVector carrying the document name. That is about the extent
of the context we currently support.
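
To make the two ideas above concrete, here is a rough sketch in plain Java of
word n-gram generation and a document-frequency cut-off. It is not Mahout code
and does not call any Mahout API: the class NGramSketch and its methods
tokenize, ngrams and tooCommon are made up for illustration, the whitespace
tokenizer stands in for a real analyzer, and the exact threshold semantics of
seq2sparse may differ from what is assumed here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: plain Java, not Mahout's seq2sparse implementation.
final class NGramSketch {

    // Lower-case and split on whitespace; a real analyzer would do much more.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Emit every run of 1..maxN consecutive tokens, joined by a single space.
    static List<String> ngrams(List<String> tokens, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    // Return the terms that occur in more than maxDfPercent of the documents.
    // Assumption: the threshold is a percentage of the total document count.
    static Set<String> tooCommon(List<List<String>> docs, double maxDfPercent) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term at most once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Set<String> common = new HashSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (100.0 * e.getValue() / docs.size() > maxDfPercent) {
                common.add(e.getKey());
            }
        }
        return common;
    }
}
```

With n = 3, ngrams(tokenize(text), 3) yields unigrams, bigrams and trigrams,
and any term returned by tooCommon can be dropped before building vectors,
which is roughly the effect a stoplist would have.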

-----Original Message-----
From: Neel Sheyal [mailto:[email protected]] 
Sent: Thursday, March 03, 2011 5:30 AM
To: [email protected]
Subject: BagofWords and StopList

Hi,
       I need to do text clustering in a natural language processing
context, so word ordering is important. Initially, I will use an
n-gram model (with n = 3).

In Mahout, the Vector and SequenceFileFormat representations do not
take contextual information into account (as I understand it). I know
I might need to modify both of them, but is there a bag-of-words
representation and a stoplist that I can use?

Thanks,
Neel Sheyal
