Search the user list archive from a couple of days ago for "LDA Mahout"; there is a similar thread. The seq2sparse routine handles n-grams (--maxNGramSize, -ng) and has a --maxDFPercent (-x) option that drops terms appearing in more than the given percentage of documents, which removes common terms much as a stoplist would. You can also specify your own analyzer, which can use whatever stoplist you want. I'm not sure what you mean by "contextual information", but the document term vectors produced by seq2sparse are wrapped in a NamedVector that carries the document name. That's about the extent of the context we currently support.
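To make the n-gram and document-frequency ideas concrete, here is a minimal Python sketch. It is not Mahout code and does not reproduce seq2sparse exactly; the function names (ngrams, vectorize), the whitespace tokenizer, and the sample parameters are all assumptions made for illustration. It builds 1- to 3-gram term counts per document, drops any term whose document frequency exceeds a percentage threshold, and keys each result by document name, roughly the way a NamedVector carries the name.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Return every n-gram of size 1..max_n as a space-joined string."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def vectorize(docs, max_n=3, max_df_percent=99):
    """Map each document name to a Counter of its surviving n-gram counts.

    docs: dict of document name -> raw text (hypothetical input shape).
    Terms found in more than max_df_percent of the documents are dropped,
    which is the stoplist-like effect described above.
    """
    # Naive lowercase/whitespace tokenizer; a real analyzer would do more.
    term_lists = {
        name: ngrams(text.lower().split(), max_n)
        for name, text in docs.items()
    }

    # Document frequency: number of documents containing each term.
    df = Counter()
    for terms in term_lists.values():
        df.update(set(terms))

    limit = max_df_percent / 100.0 * len(docs)
    keep = {term for term, count in df.items() if count <= limit}

    return {
        name: Counter(t for t in terms if t in keep)
        for name, terms in term_lists.items()
    }
```

With a small corpus and a low threshold, a word such as "the" that occurs in most documents falls out of every vector, while trigrams like "the cat sat" survive because they are rare. Word order is preserved only inside each n-gram, which is the limit of the context this representation captures.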
-----Original Message-----
From: Neel Sheyal [mailto:[email protected]]
Sent: Thursday, March 03, 2011 5:30 AM
To: [email protected]
Subject: BagofWords and StopList

Hi

I need to do text clustering but in the context of natural language processing. Consequently, word ordering becomes important. Initially, I will be doing the nGram model (with n = 3). In Mahout, the Vector and SequenceFileFormat representation does not take into consideration contextual information (as I understand). I know I might need to modify both of them but is there a bagofwords and stoplist that I may use?

Thanks,
Neel Sheyal
