On Tue, Sep 1, 2009 at 9:44 AM, Sean Owen <[email protected]> wrote:

> Centroids are just strings that are a similar number of edits away from
> another set of strings.
>

Easy to say that.

Very hard to compute.  And the dimensionality is unbounded, so the properties
of the centroid are not nice.  You wind up with centroids that are a large
number of edits away from everything and nearly the same distance from
everything.
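
To make the cost concrete, here is a minimal sketch, not anything from the
CF code or from Mahout.  It computes plain Levenshtein distance with the
usual dynamic program and then, instead of a true string centroid (which
need not be one of the inputs and is the hard part), picks a medoid: the
input string with the smallest total distance to the rest.  The names
edit_distance and medoid are hypothetical, and the medoid is a stand-in I
am assuming for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def medoid(strings):
    """Hypothetical stand-in for a centroid: the member of `strings` with
    the smallest total edit distance to all members (ties go to the first)."""
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))
```

Each distance costs time proportional to the product of the two string
lengths, and the medoid needs all pairs, so even this restricted version is
quadratic in the number of strings per cluster update.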


> ...
>
> Anything else that doesn't map? Haven't thought about it a lot but don't yet
> see why k-means couldn't let you cluster strings. In the CF code I do
> something similar for arbitrary 'items' so that hints to me that a well
> behaved distance metric is all you need?
>

Depends on what you mean by well-behaved.  Mathematically speaking, string
edit measures are moderately well behaved.  Computationally and practically,
however, edit distances are not so nice.

Counts of common n-grams are much nicer since they can be interpreted as
vectors.
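
As a rough illustration of why the vector view is nicer, here is a small
sketch under my own assumptions: character trigrams by default, cosine as
the similarity, and hypothetical helper names (ngram_counts, cosine,
centroid).  None of this is taken from an existing library.  The point is
that the centroid becomes an ordinary component-wise mean.

```python
from collections import Counter
from math import sqrt


def ngram_counts(s, n=3):
    """Count overlapping character n-grams of s as a sparse vector."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict-like, n-gram -> weight)."""
    dot = sum(w * v.get(g, 0) for g, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of sparse count vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # adds counts key by key
    k = len(vectors)
    return {g: c / k for g, c in total.items()}
```

With these, a k-means step is just: assign each string's vector to the
centroid with the highest cosine, then recompute each centroid as a mean.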



-- 
Ted Dunning, CTO
DeepDyve
