Yeah, that probably kills the idea, doesn't it... The 'best' centroid is well defined this way, but searching for it may be completely unreasonable. I see why counts don't have this problem.
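To make the contrast concrete, here is a minimal sketch of the counts approach. It assumes character trigrams and squared Euclidean distance, neither of which is specified in this thread, and the function names are hypothetical. With count vectors the centroid is just the component-wise mean, so no search over candidate strings is needed:

```python
from collections import Counter

def ngram_counts(text, n=3):
    """Count overlapping character n-grams (trigrams by default; an assumption)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def centroid(count_vectors):
    """Component-wise mean of sparse count vectors, computed in one pass."""
    total = Counter()
    for vec in count_vectors:
        total.update(vec)  # adds counts key by key
    size = len(count_vectors)
    return {gram: count / size for gram, count in total.items()}

def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys)

strings = ["kitten", "sitten", "mitten"]
vectors = [ngram_counts(s) for s in strings]
center = centroid(vectors)
```

The centroid here is a vector of fractional counts, not a string. An edit-distance centroid would instead have to be found by searching the space of strings, which is the expensive part.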
On Sep 1, 2009 7:17 PM, "Ted Dunning" <[email protected]> wrote:

On Tue, Sep 1, 2009 at 9:44 AM, Sean Owen <[email protected]> wrote:

> Centroids are just strings th...

Easy to say that. Very hard to compute. And the dimensionality is unbounded so the properties of the centroid are not nice. You wind up with centroids that are a large number of edits away from everything and nearly the same distance from everything.

> ...
>
> Anything else that doesn't map? Haven't thought about it a lot but don't yet
> see why k-means...

Depends on what you mean by well-behaved. Mathematically speaking, string edit measures are moderately well behaved. Computationally and practically, however, edit distances are not so nice. Counts of common n-grams are much nicer since they can be interpreted as vectors.

--
Ted Dunning, CTO
DeepDyve
