Suggestions for best approach to classic document clustering

Ken Krugler Wed, 10 Feb 2010 18:04:54 -0800

Hi all,

Given the code currently in Mahout (+ Lucene), is there a generally accepted best approach for clustering of documents?

Assumptions are small document sets (e.g. a few thousand), with documents being representative data from web pages, all in English.

I've been fooling around with a few different combinations, e.g. pre-processing the documents to extract keywords and using these for clustering w/ k-means, canopy, mean-shift canopy.
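
A minimal sketch of the keyword-extraction + k-means combination mentioned above, in plain Python rather than Mahout or Lucene, just to make the moving parts concrete. Everything here is an assumption for illustration: the function names (tokenize, tfidf_keywords, kmeans), the tiny stop-word list, the top_n keyword cutoff, the smoothed IDF formula, cosine similarity as the measure, and deterministic farthest-first seeding standing in for random or canopy-based seeding.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Lucene's).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def normalize(vec):
    """Scale a sparse {term: weight} vector to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def dot(a, b):
    """Dot product of two sparse vectors (cosine similarity if both are unit length)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def tfidf_keywords(docs, top_n=10):
    """Return one unit-length TF-IDF vector per document, keeping only the top_n terms."""
    token_lists = [tokenize(d) for d in docs]
    df = Counter(term for tokens in token_lists for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed IDF so that no term gets a zero weight.
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        vectors.append(normalize(dict(top)))
    return vectors


def kmeans(vectors, k, iterations=20):
    """Cosine-based k-means; returns a cluster index for each input vector."""
    # Farthest-first seeding: start from the first document, then repeatedly
    # add the document least similar to the seeds chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(dot(v, c) for c in centroids)))
    assignments = None
    for _ in range(iterations):
        new_assignments = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in vectors]
        if new_assignments == assignments:
            break  # converged: no document changed cluster
        assignments = new_assignments
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if not members:
                continue  # keep the old centroid for an empty cluster
            total = Counter()
            for v in members:
                for t, w in v.items():
                    total[t] += w
            centroids[c] = normalize(dict(total))
    return assignments


docs = [
    "Hadoop cluster nodes run map reduce jobs",
    "Map reduce jobs scale across Hadoop nodes",
    "Fresh pasta recipe with tomato sauce",
    "Tomato sauce and basil for pasta",
]
vectors = tfidf_keywords(docs, top_n=10)
print(kmeans(vectors, k=2))  # [0, 0, 1, 1]
```

The dials in this sketch are top_n (how aggressively to prune to keywords), k, and the seeding strategy; a real run on a few thousand web pages would also need HTML stripping and a proper analyzer, which are left out here.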

But before I have too much fun twiddling all the dials, it would be great to get input on good/bad options.


Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
