Hi Ken,

On Wed, Feb 10, 2010 at 10:29 PM, Ken Krugler <[email protected]> wrote:
>
> Is there any support currently in Mahout for generating tf-idf weighted
> vectors without creating a Lucene index? Just curious.
[..]
> I assume you'd use something like the Lucene ShingleAnalyzer to generate one
> and two word terms.

Support for these was (very!) recently checked into svn. Take a look at the
main entry points:

in mahout-examples:
  o.a.m.text.SequenceFilesFromDirectory
in mahout-utils:
  o.a.m.text.SparseVectorsFromSequenceFiles

This includes the ability to generate n-grams using Lucene's ShingleFilter.

The basic gist here is that SequenceFilesFromDirectory creates one or more
files of SequenceFile<DocId, DocumentText> from files found in a directory,
and SparseVectorsFromSequenceFiles does the tokenization (Lucene
StandardAnalyzer by default, but this is configurable), n-gram generation,
dictionary generation and document vectorization (including tf-idf
calculation). The output from each step is preserved as a SequenceFile, which
can be inspected using SequenceFileDumper.

The process of generating the SequenceFile<DocId, DocumentText> is
straightforward enough that it would be fairly trivial to pull content from
whatever format you have the data in.

A more in-depth description on the wiki is forthcoming, and the usual caveats
apply considering this is brand new code. It sounds like it would meet your
needs. It would be great to have someone else work with it a bit.

Drew
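
P.S. In case it helps to see the steps end to end, here is a rough,
dependency-free Python sketch of the same flow: tokenize, generate one- and
two-word terms, build a dictionary, then compute tf-idf weighted sparse
vectors. This is not the Mahout code and does not use Hadoop or Lucene. All
function names are hypothetical, the regex tokenizer is only a crude stand-in
for an analyzer, and the weighting is the textbook tf * ln(N / df), which may
well differ from what the Mahout job computes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens, max_size=2):
    """Return every n-gram of 1..max_size tokens, joined with a single space."""
    out = []
    for n in range(1, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def build_dictionary(term_lists):
    """Assign a stable integer id to each distinct term (sorted for determinism)."""
    distinct = sorted({term for terms in term_lists for term in terms})
    return {term: idx for idx, term in enumerate(distinct)}


def tfidf_vectors(docs, max_size=2):
    """Map {doc_id: text} to (dictionary, {doc_id: {term_id: weight}}).

    Assumed weighting: raw term count * ln(N / document frequency).
    """
    term_lists = {doc_id: shingles(tokenize(text), max_size)
                  for doc_id, text in docs.items()}
    dictionary = build_dictionary(term_lists.values())

    # Document frequency: number of documents containing each term.
    df = Counter()
    for terms in term_lists.values():
        df.update(set(terms))

    n_docs = len(docs)
    vectors = {}
    for doc_id, terms in term_lists.items():
        tf = Counter(terms)
        vectors[doc_id] = {dictionary[term]: count * math.log(n_docs / df[term])
                           for term, count in tf.items()}
    return dictionary, vectors


if __name__ == "__main__":
    docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog"}
    dictionary, vectors = tfidf_vectors(docs)
    for doc_id, vec in sorted(vectors.items()):
        print(doc_id, sorted(vec.items()))
```

Each intermediate result here (token lists, dictionary, vectors) loosely
corresponds to one of the per-step outputs mentioned above, just held in
memory rather than written out as a SequenceFile.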
