The k-means implementation has the idea of distance between vectors of real numbers baked deeply into it. One example is that it assumes you can take the average (aka centroid) of a set of examples. Taking the average of a set of strings in the sense of Levenshtein distance would be difficult.
There is an alternative algorithm called k-medoids, which uses one of the input samples as the centroid, but I would expect it to give poor results with Levenshtein distance. It would, however, be very reasonable to use bigrams or trigrams as labels on vector coordinates. The vector value of a string would be derived by weighting each bigram or trigram according to the negative log of the prevalence of that bigram or trigram in your entire corpus. This representation would be highly amenable to k-means clustering. Results should be relatively good, although inspection of the centroids is likely to be a bit confusing.

On Tue, Sep 1, 2009 at 5:06 AM, Juan Francisco Contreras Gaitan <[email protected]> wrote:

> But if I understood you well, and as far as I know, Mahout has its own
> k-means implementation. Then, could I use it for my purposes instead of DP
> like setup?

--
Ted Dunning, CTO
DeepDyve
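For illustration, the trigram weighting described above might be sketched roughly as follows. This is plain Python, not Mahout code. It assumes "prevalence" means the fraction of corpus strings that contain a given trigram, which is only one reasonable reading. The function names (ngrams, ngram_weights, vectorize, centroid) and the three-string corpus are made up for the example.

```python
import math
from collections import Counter, defaultdict

def ngrams(text, n=3):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_weights(corpus, n=3):
    """Weight each n-gram by -log(prevalence).

    Assumption: prevalence = fraction of corpus strings containing the n-gram.
    """
    doc_counts = Counter()
    for text in corpus:
        doc_counts.update(set(ngrams(text, n)))
    total = len(corpus)
    return {g: -math.log(c / total) for g, c in doc_counts.items()}

def vectorize(text, weights, n=3):
    """Map a string to a sparse vector {ngram: weight * count in text}."""
    counts = Counter(ngrams(text, n))
    return {g: weights[g] * c for g, c in counts.items() if g in weights}

def centroid(vectors):
    """Coordinate-wise mean of sparse vectors, the step k-means depends on."""
    sums = defaultdict(float)
    for vec in vectors:
        for g, x in vec.items():
            sums[g] += x
    return {g: x / len(vectors) for g, x in sums.items()}

# Hypothetical toy corpus, only for demonstration.
corpus = ["clustering", "clusters", "mahout"]
weights = ngram_weights(corpus)
vectors = [vectorize(s, weights) for s in corpus]
```

Rare trigrams get large weights and trigrams that occur in every string get weight zero. Unlike raw strings, these vectors can be averaged, which is what k-means requires. The centroid of a cluster is then a weighted bag of trigrams rather than a readable string, which is why inspecting centroids can be confusing.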
