Hi all,
Given the code currently in Mahout (+ Lucene), is there a generally
accepted best approach for clustering documents?
The assumptions are small document sets (e.g. a few thousand documents),
where the documents are representative content from web pages, all in
English.
I've been fooling around with a few different combinations, e.g.
pre-processing the documents to extract keywords and then clustering
on those keywords with k-means, canopy, and mean-shift canopy.
But before I have too much fun twiddling all the dials, it would be
great to get input on good/bad options.
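
For reference, the keyword-then-k-means combination mentioned above can
be sketched roughly as follows. This is plain Java and does not use the
Mahout or Lucene APIs; the class and method names are hypothetical, the
keyword lists are assumed to come from some earlier extraction step, and
the variant shown is spherical k-means (cosine similarity on normalized
term-frequency vectors) seeded with the first k documents, which is an
assumption made to keep the example deterministic.

```java
import java.util.*;

// Hypothetical sketch: cluster keyword lists with spherical k-means.
class KeywordKMeansSketch {

    // Map each document's keywords to an L2-normalized term-frequency vector.
    static double[][] vectorize(List<List<String>> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : docs)
            for (String term : doc) vocab.putIfAbsent(term, vocab.size());
        double[][] vecs = new double[docs.size()][vocab.size()];
        for (int i = 0; i < docs.size(); i++) {
            for (String term : docs.get(i)) vecs[i][vocab.get(term)] += 1.0;
            normalize(vecs[i]);
        }
        return vecs;
    }

    static void normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        if (norm > 0) for (int d = 0; d < v.length; d++) v[d] /= norm;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += a[d] * b[d];
        return s;
    }

    // Returns the cluster index assigned to each document.
    // Assumes k <= vecs.length; centroids are seeded from the first k documents.
    static int[] kmeans(double[][] vecs, int k, int iterations) {
        int dim = vecs[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = vecs[c].clone();
        int[] assign = new int[vecs.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: pick the centroid with the highest cosine similarity.
            for (int i = 0; i < vecs.length; i++) {
                int best = 0;
                double bestSim = -1;
                for (int c = 0; c < k; c++) {
                    double sim = dot(vecs[i], centroids[c]);
                    if (sim > bestSim) { bestSim = sim; best = c; }
                }
                assign[i] = best;
            }
            // Update step: new centroid is the normalized sum of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < vecs.length; i++) {
                counts[assign[i]]++;
                for (int d = 0; d < dim; d++) sums[assign[i]][d] += vecs[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // keep the old centroid if a cluster is empty
                    normalize(sums[c]);
                    centroids[c] = sums[c];
                }
            }
        }
        return assign;
    }
}
```

The dials in question would then be things like how the keywords are
extracted, how terms are weighted, the distance measure, and the choice
of k (or using canopy to pick the initial centroids).
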
Thanks,
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g