Hi all,

Given the code currently in Mahout (+ Lucene), is there a generally accepted best approach for clustering documents?

My assumptions are small document sets (e.g. a few thousand documents), with the documents being representative text from web pages, all in English.

I've been fooling around with a few different combinations, e.g. preprocessing the documents to extract keywords and then clustering on those keywords with k-means, canopy, or mean-shift canopy.
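To make the kind of pipeline I mean concrete, here is a rough, library-independent sketch in plain Python. It does not use any Mahout or Lucene APIs. The function names, the tiny stopword list, and the sample documents are all made up for illustration, and whitespace tokenization plus stopword filtering is only a stand-in for real keyword extraction.

```python
import math
from collections import Counter

def tf_vector(text, stopwords):
    """Term-frequency vector over lowercase tokens, minus stopwords."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(vectors, k, iterations=10):
    """Basic k-means using cosine similarity; returns a cluster index per vector."""
    # Assumption: seed with the first k vectors so the run is deterministic.
    centroids = [dict(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid from its members (empty clusters keep the old one).
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment

# Hypothetical sample data, standing in for text pulled from web pages.
stopwords = {"with", "on", "a"}
docs = [
    "hadoop cluster job scheduling",
    "pasta recipe with tomato sauce",
    "hadoop job tuning on a cluster",
    "tomato soup recipe",
]
vectors = [tf_vector(d, stopwords) for d in docs]
labels = kmeans(vectors, k=2)
print(labels)
```

The dials I have in mind are things like the term weighting (raw counts here, versus something like TF-IDF), the similarity measure, how k is chosen (e.g. seeding from canopy output), and how the centroids are initialized.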

But before I have too much fun twiddling all the dials, it would be great to get input on good/bad options.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



