Hi Ted,

Thanks much for the useful response. A few comments/questions inline below...

On Feb 10, 2010, at 6:27pm, Ted Dunning wrote:

I don't think that there is a single best approach. Here in rough order are
the approaches that I would try:

1) Carrot2 may be able to handle a few thousand documents. If so, you
should be pretty well off, because their clustering is typically pretty good
and it shows even better than it is because it makes inspectability a
priority. Carrot2 is not, however, designed to scale to very large
collections, according to Dawid.

We originally tried Carrot2, and the results weren't bad. But one additional fact I should have mentioned is that I need to be able to use the resulting clusters to classify future documents. From what I could tell with Carrot2 and from perusing the mailing list, this isn't really feasible, because Carrot2 clusters don't expose a centroid.
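
(To be concrete about what I'm after: keep each cluster's centroid around, and assign a new document to whichever centroid it's most similar to. The untested sketch below shows the idea in plain Java with dense tf-idf double[] vectors; the class name is just for illustration, no particular library assumed.)

// Untested sketch: assign a new document vector to the nearest cluster
// centroid by cosine similarity. Assumes dense double[] vectors of equal
// length; not tied to any particular Mahout/Carrot2 API.
public class NearestCentroidClassifier {

    private final double[][] centroids;   // one tf-idf centroid per cluster

    public NearestCentroidClassifier(double[][] centroids) {
        this.centroids = centroids;
    }

    /** Returns the index of the centroid most similar to the document. */
    public int classify(double[] doc) {
        int best = -1;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double sim = cosine(doc, centroids[i]);
            if (sim > bestSim) {
                bestSim = sim;
                best = i;
            }
        }
        return best;
    }

    private static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }
}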

2) k-means on tf-idf weighted words and bigrams, especially with the
k-means++ starting point that isn't available yet. This should also be
pretty good, but may show a bit worse than it really is because it doesn't
use the tricks Carrot2 uses.

Is there any support currently in Mahout for generating tf-idf weighted vectors without creating a Lucene index? Just curious.
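
(For what it's worth, here's roughly what I'd otherwise hand-roll from the raw token streams: plain tf * log(N/df) weighting, no Lucene index involved. Untested sketch; the class name is just for illustration.)

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Untested sketch: build tf-idf weighted term vectors directly from
// already-tokenized documents, without going through a Lucene index.
public class TfIdf {

    /** docs: each document as a list of (already analyzed) tokens. */
    public static List<Map<String, Double>> vectorize(List<List<String>> docs) {
        // document frequency for each term
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (List<String> doc : docs) {
            for (String term : new java.util.HashSet<String>(doc)) {
                Integer n = df.get(term);
                df.put(term, n == null ? 1 : n + 1);
            }
        }
        int numDocs = docs.size();
        List<Map<String, Double>> vectors = new java.util.ArrayList<Map<String, Double>>();
        for (List<String> doc : docs) {
            // raw term frequencies for this document
            Map<String, Integer> tf = new HashMap<String, Integer>();
            for (String term : doc) {
                Integer n = tf.get(term);
                tf.put(term, n == null ? 1 : n + 1);
            }
            // tf-idf weight: tf * log(N / df)
            Map<String, Double> vec = new HashMap<String, Double>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) numDocs / df.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }
}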

3) k-means on SVD representations of documents. Same comments as (2), except that this option depends on *two* pieces of unreleased code (k-means++ and SVD) instead of one. At this size, you should be able to avoid using the mega-SVD that Jake just posted, but that will mean you need to write your own interlude between the document vectorizer and a normal in-memory SVD. You may have some cooler results here in that documents that share no words might be seen as similar, but I expect that overall results for this should be very similar to those for (2).
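
(Just to make sure I follow the "interlude" you're describing: take the tf-idf doc/term matrix, run a normal in-memory SVD over it, and hand the rank-k document projections to k-means? An untested sketch of that step is below; I'm assuming commons-math 2.x is a reasonable stand-in for the in-memory SVD, and the class name is just for illustration.)

import org.apache.commons.math.linear.Array2DRowRealMatrix;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.SingularValueDecomposition;
import org.apache.commons.math.linear.SingularValueDecompositionImpl;

// Untested sketch: reduce tf-idf document vectors to a k-dimensional
// LSA-style representation via an in-memory SVD (commons-math 2.x).
public class SvdReducer {

    /**
     * docTerm: rows are documents, columns are terms (tf-idf weights).
     * Requires k <= min(numDocs, numTerms).
     * Returns a numDocs x k matrix of reduced document vectors (U_k * S_k).
     */
    public static RealMatrix reduce(double[][] docTerm, int k) {
        RealMatrix a = new Array2DRowRealMatrix(docTerm);
        SingularValueDecomposition svd = new SingularValueDecompositionImpl(a);
        RealMatrix u = svd.getU();                  // numDocs x min(numDocs, numTerms)
        double[] sigma = svd.getSingularValues();   // in descending order
        RealMatrix reduced = new Array2DRowRealMatrix(docTerm.length, k);
        for (int i = 0; i < docTerm.length; i++) {
            for (int j = 0; j < k; j++) {
                reduced.setEntry(i, j, u.getEntry(i, j) * sigma[j]);
            }
        }
        return reduced;   // feed these rows to k-means instead of raw tf-idf
    }
}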

I assume you'd use something like the Lucene ShingleAnalyzer to generate one- and two-word terms.
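
(Roughly like this, I'm guessing; untested, and assuming the Lucene 3.0-era contrib ShingleAnalyzerWrapper with a max shingle size of 2, so both unigrams and bigrams come out of the stream. The class name is just for illustration.)

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

// Untested sketch: emit unigrams plus bigrams ("shingles") for a document,
// which could then be fed to the tf-idf vectorizer. Lucene 3.0-era API.
public class UnigramBigramTokenizer {

    public static void printTerms(String text) throws Exception {
        Analyzer analyzer =
            new ShingleAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30), 2);
        TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term());   // e.g. "web", "web mining", "mining"
        }
        ts.close();
    }
}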

4) if you aren't happy by now, invoke plan B and come up with new ideas. These might include raw LDA, k-means on LDA document vectors, and k-means
with norms other than L_2 and L_1.

All of this changes a bit if you have some labeled documents.  K-means
should be pretty easy to extend to deal with that and it can dramatically
improve results.
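
(By "extend" I take it you mean something along the lines of seeding the initial centroids from the labeled documents, e.g. one seed per label built from the mean of that label's tf-idf vectors? An untested sketch of just that seeding step is below; the class name is just for illustration.)

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Untested sketch: seed k-means centroids from labeled documents by
// averaging the tf-idf vectors that share a label. Unlabeled documents
// are then clustered as usual, starting from these centroids.
public class LabeledSeeder {

    /** labels.get(i) is the label of docs[i]; all vectors have equal length. */
    public static Map<String, double[]> seedCentroids(double[][] docs, List<String> labels) {
        Map<String, double[]> sums = new HashMap<String, double[]>();
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < docs.length; i++) {
            String label = labels.get(i);
            double[] sum = sums.get(label);
            if (sum == null) {
                sum = new double[docs[i].length];
                sums.put(label, sum);
                counts.put(label, 0);
            }
            for (int j = 0; j < docs[i].length; j++) {
                sum[j] += docs[i][j];
            }
            counts.put(label, counts.get(label) + 1);
        }
        for (Map.Entry<String, double[]> e : sums.entrySet()) {
            double n = counts.get(e.getKey());
            double[] centroid = e.getValue();
            for (int j = 0; j < centroid.length; j++) {
                centroid[j] /= n;   // mean vector per label
            }
        }
        return sums;   // one seed centroid per label
    }
}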

I've experimented with using just the results of the Zemanta API, which (I believe) generates keywords by calculating similarity against sets of Wikipedia pages that share a category.

But clustering these keyword-only vectors gave marginal results, mostly for cases where the Zemanta results were questionable.

What approach would you take to mix in keywords like these with the raw document data?

Thanks again!

-- Ken

On Wed, Feb 10, 2010 at 6:04 PM, Ken Krugler <[email protected]> wrote:

Given the code currently in Mahout (+ Lucene), is there a generally accepted
best approach for clustering documents?

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



