Re: String clustering and other newbie questions

Ted Dunning Tue, 01 Sep 2009 09:11:51 -0700

The k-means implementation has the idea of distance between vectors of real
numbers pretty deeply baked into it.  One example of this is that it assumes
that you can take the average (aka centroid) of a set of examples.  Taking
the average of a set of strings in the sense of Levenshtein distance would be
difficult.

There is an alternative algorithm called k-medoids which uses one of the
input samples as the centroid, but I would expect that this would give poor
results with Levenshtein distance.
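
To make the medoid idea concrete, here is a minimal sketch in Python (not Mahout code). The function names levenshtein and medoid are made up for illustration, and the medoid is assumed to be the member with the smallest total distance to the rest of its cluster. A full k-medoids run would repeat this selection for each cluster after every reassignment step.

```python
def levenshtein(a, b):
    """Edit distance where insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def medoid(strings):
    """Pick the member with the smallest summed distance to all members."""
    return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))


cluster = ["cat", "cart", "cast", "dog"]
print(medoid(cluster))
```

Note that the medoid is always one of the input strings, so no averaging of strings is ever required, only pairwise distances.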

It would, however, be very reasonable to use bigrams or trigrams as labels on
vector coordinates.  The vector value of a string would be derived by
weighting each bigram or trigram according to the negative log of the
prevalence of that bigram or trigram in your entire corpus.  This
representation would be highly amenable to k-means clustering.  Results
should be relatively good, although inspection of the centroids is likely to
be a bit confusing.
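
A rough sketch of that representation in Python follows. It is only an illustration under stated assumptions: "prevalence" is taken to mean the fraction of strings in the corpus that contain a given n-gram, each coordinate is the n-gram's count in the string times its weight, and the helper names (ngrams, ngram_weights, vectorize) are invented here. The resulting sparse vectors are ordinary real-valued vectors, so they can be averaged and fed to k-means.

```python
import math
from collections import Counter


def ngrams(s, n=3):
    """All overlapping character n-grams of s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_weights(corpus, n=3):
    """Weight each n-gram by -log(fraction of strings containing it)."""
    doc_freq = Counter()
    for s in corpus:
        doc_freq.update(set(ngrams(s, n)))  # count each string at most once
    total = len(corpus)
    return {g: -math.log(c / total) for g, c in doc_freq.items()}


def vectorize(s, weights, n=3):
    """Sparse vector: n-gram label -> count in s times corpus weight."""
    counts = Counter(ngrams(s, n))
    # n-grams never seen in the corpus have no weight and are skipped
    return {g: c * weights[g] for g, c in counts.items() if g in weights}


corpus = ["apple", "apply", "maple"]
weights = ngram_weights(corpus)
print(vectorize("apply", weights))
```

With this weighting, rare n-grams such as "ply" get larger coordinates than common ones such as "app", and an n-gram that occurs in every string gets weight zero.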

On Tue, Sep 1, 2009 at 5:06 AM, Juan Francisco Contreras Gaitan <
[email protected]> wrote:

> But if I understood you well, and as far as I know, Mahout has its own
> k-means implementation. Then, could I use it for my purposes instead of DP
> like setup?

-- 
Ted Dunning, CTO
DeepDyve
