On Tue, Sep 1, 2009 at 9:44 AM, Sean Owen <[email protected]> wrote:

> Centroids are just strings that are a similar number of edits away from
> another set of strings.

Easy to say that. Very hard to compute. And the dimensionality is unbounded, so the properties of the centroid are not nice. You wind up with centroids that are a large number of edits away from everything and nearly the same distance from everything.

> ...
>
> Anything else that doesn't map? Haven't thought about it a lot but don't yet
> see why k-means couldn't let you cluster strings. In the CF code I do
> something similar for arbitrary 'items' so that hints to me that a well
> behaved distance metric is all you need?

Depends on what you mean by well-behaved. Mathematically speaking, string edit measures are moderately well behaved. Computationally and practically, however, edit distances are not so nice. Counts of common n-grams are much nicer since they can be interpreted as vectors.

--
Ted Dunning, CTO
DeepDyve
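
To make the n-gram point concrete, here is a minimal Python sketch. It is not code from this thread or from the CF code mentioned above. The function names (ngram_counts, centroid, cosine) and the default n=3 are assumptions chosen for illustration. The idea it shows is that once strings become sparse count vectors, a centroid is simply a component-wise mean, which is cheap to compute, whereas an edit-distance centroid has no comparably simple form.

```python
from collections import Counter
import math


def ngram_counts(s, n=3):
    """Count overlapping character n-grams of s, as a sparse vector."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse count vectors."""
    if not vectors:
        raise ValueError("centroid of an empty set is undefined")
    total = Counter()
    for v in vectors:
        total.update(v)  # adds counts key by key
    k = len(vectors)
    return {gram: count / k for gram, count in total.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(gram, 0.0) for gram, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A k-means style assignment step would then compare each string's vector against each centroid, for example cosine(ngram_counts("kitten"), centroid([ngram_counts("mitten"), ngram_counts("sitting")])). Note that the centroid is a vector of fractional counts, not a string, which is exactly why it is easy to work with.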
