Re: Suggestions for best approach to classic document clustering

Drew Farris Thu, 11 Feb 2010 05:24:39 -0800

Hi Ken,

On Wed, Feb 10, 2010 at 10:29 PM, Ken Krugler
<[email protected]> wrote:
>
> Is there any support currently in Mahout for generating tf-idf weighted
> vectors without creating a Lucene index? Just curious.
[..]
> I assume you'd use something like the Lucene ShingleAnalyzer to generate one
> and two word terms.


Support for these was (very!) recently checked into svn. Take a look
at the main entry points:

in mahout-examples: o.a.m.text.SequenceFilesFromDirectory
in mahout-utils: o.a.m.text.SparseVectorsFromSequenceFiles

This includes the ability to generate n-grams using Lucene's ShingleFilter.

The basic gist here is that SequenceFilesFromDirectory creates one or
more files of SequenceFile<DocId, Document Text> from files found in a
directory, and SparseVectorsFromSequenceFiles does the tokenization
(Lucene's StandardAnalyzer by default, but this is configurable), n-gram
generation, dictionary generation, and document vectorization
(including tf-idf calculation). The output from each step is preserved
as a SequenceFile which can be inspected using SequenceFileDumper.
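
To make the sequence of steps concrete, here is a minimal sketch in plain
Java of the same stages (tokenization, bigram generation, dictionary
generation, tf-idf vectorization). It is not the Mahout code and uses no
Mahout, Lucene, or Hadoop classes: the class name TfIdfSketch, its method
names, the regex tokenizer, and the textbook weighting tf * ln(N / df) are
all assumptions made for illustration, and the weighting Mahout actually
applies may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from Mahout.
public class TfIdfSketch {

    // Lowercase and split on non-alphanumerics: a crude stand-in for an analyzer.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keep every single word and add every adjacent two-word term (bigram).
    public static List<String> withBigrams(List<String> tokens) {
        List<String> terms = new ArrayList<>(tokens);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            terms.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return terms;
    }

    // Assign each distinct term an integer index in first-seen order.
    public static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                dict.putIfAbsent(term, dict.size());
            }
        }
        return dict;
    }

    // Build one sparse vector per document: term index -> tf * ln(N / df).
    public static List<Map<Integer, Double>> vectorize(
            List<List<String>> docs, Map<String, Integer> dict) {
        int[] df = new int[dict.size()];
        List<Map<Integer, Integer>> counts = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<Integer, Integer> tf = new TreeMap<>();
            for (String term : doc) {
                tf.merge(dict.get(term), 1, Integer::sum);
            }
            for (int index : tf.keySet()) {
                df[index]++; // document frequency: count each term once per document
            }
            counts.add(tf);
        }
        List<Map<Integer, Double>> vectors = new ArrayList<>();
        for (Map<Integer, Integer> tf : counts) {
            Map<Integer, Double> vector = new TreeMap<>();
            for (Map.Entry<Integer, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / df[e.getKey()]);
                vector.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(withBigrams(tokenize("Mahout clusters documents")));
        docs.add(withBigrams(tokenize("Lucene indexes documents")));
        Map<String, Integer> dict = buildDictionary(docs);
        System.out.println(dict);
        System.out.println(vectorize(docs, dict));
    }
}
```

With this weighting, a term that occurs in every document (here
"documents") gets weight zero, while terms unique to one document keep a
positive weight.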

The process of generating the SequenceFile<DocId, DocumentText> is
straightforward enough that it would be fairly trivial to pull
content from whatever format your data is in.

A more in-depth description on the wiki is forthcoming, and the usual
caveats apply considering this is brand new code. It sounds like it
would meet your needs. It would be great to have someone else work
with it a bit.

Drew
