If the distance being used is a real metric, it satisfies the triangle
inequality (that is basically the definition of a metric).  For a given point
x relative to two centroids c1 and c2, this means that

         r(c1, c2) <= r(x, c1) + r(x, c2)

or

         r(c1, c2) - r(x, c1) <= r(x, c2)

This means that if r(c1, c2) > 2 r(x, c1), you don't have to compute
r(x, c2) at all, since the inequality above guarantees r(x, c2) > r(x, c1).
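
To make that concrete, here is a minimal sketch, in plain Java, of how the
test could sit inside the assignment step.  This is not the Mahout code;
the class and method names and the precomputed centroidDist table are my
own assumptions, and dist() is just Euclidean distance to keep the sketch
self-contained.

final class TrianglePruning {
  // Find the nearest centroid to x, skipping every centroid c_j for which
  // r(c_best, c_j) > 2 * r(x, c_best); the triangle inequality then
  // guarantees r(x, c_j) > r(x, c_best), so c_j cannot win.
  // centroidDist[i][j] = r(c_i, c_j), recomputed once per iteration.
  static int assign(double[] x, double[][] centroids, double[][] centroidDist) {
    int best = 0;
    double bestDist = dist(x, centroids[0]);
    for (int j = 1; j < centroids.length; j++) {
      if (centroidDist[best][j] > 2 * bestDist) {
        continue;   // pruned: c_j cannot beat the current best
      }
      double d = dist(x, centroids[j]);
      if (d < bestDist) {
        bestDist = d;
        best = j;
      }
    }
    return best;
  }

  // Plain Euclidean distance, which is a real metric and so satisfies the
  // triangle inequality that the pruning relies on.
  static double dist(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}

Filling in centroidDist costs only k^2 distance evaluations per iteration,
which is cheap next to the n*k evaluations the pruning can save.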

Moreover, since many points stay with the same centroid from one iteration
to the next, especially in later iterations, you can compute the distance to
the previous centroid first; that gives a tight bound for the pruning test
above and lets you skip most of the other centroids.

A related optimization is to avoid recomputing distances to clusters that
haven't moved.  Typically, you would only cache the distance to the closest
cluster in order to save memory, but that is a late-stage decision.

In high-dimensional spaces with completely symmetric distributions of
points, it is pretty easy to see that the triangle optimization will not
help that much.  BUT, one essentially never has completely symmetric
distributions, and with text you have radically non-symmetric distributions
due to the sparsity of the document vectors.  So I am very curious how these
optimizations will work for the text clustering you are doing.

On Thu, Jun 25, 2009 at 1:04 PM, Grant Ingersoll <[email protected]> wrote:

>> 3) use triangle inequality to limit number of times we have to compute
>> distance (probably 2x win, but I am curious)
>>
>
> Reference?  I don't believe this is implemented, but I might not be
> understanding.




-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
