On Jun 24, 2009, at 10:11 PM, Ted Dunning wrote:

To clarify, the optimizations I know about that are pending are:

1) better hash table that doesn't box/unbox (clear win)

This is committed already.

Seems like we should also remove the square root calculation for Euclidean, right? We don't actually care about exact distances, since the distances are only compared against each other.
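
To illustrate why dropping the square root is safe for nearest-centroid comparisons: sqrt is monotonically increasing on non-negative values, so ordering by squared distance gives the same winner as ordering by true distance. A minimal sketch follows, using plain arrays; the class and method names are hypothetical and are not the EuclideanDistance code under discussion.

```java
// Hypothetical illustration, not the project's actual distance classes.
final class SquaredDistanceSketch {

    // Squared Euclidean distance: the sum of squared differences, with no sqrt.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the nearest centroid, comparing squared distances only.
    // Because sqrt preserves order, this picks the same centroid that
    // a comparison of true Euclidean distances would pick.
    static int nearest(double[] point, double[][] centroids) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = squaredDistance(point, centroids[c]);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {5, 5}, {1, 2}};
        System.out.println(nearest(new double[] {1, 1}, centroids));
    }
}
```

One caveat: any caller that needs an actual distance value (for example a convergence threshold expressed as a distance) would have to square its threshold or take the sqrt itself.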



2) better centroid distance computation that uses sparsity of document
vector to minimize the L2 norm computation time (this is what I am curious
about)

Yes, this is in the patch and is notably faster in my subjective testing, which is backed up by Shashi. I'm going to commit soon. I just wanted a final discussion on whether we should remove the sqrt calculation from EuclideanDistance. Seems like we should, since it's only used for comparison purposes.
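
For readers wondering how document sparsity can help here, one common approach (an assumption for illustration, not necessarily what the patch does) is to expand ||x - c||^2 = ||c||^2 - 2 x.c + ||x||^2. The centroid norm ||c||^2 can be computed once per centroid per iteration, and the remaining terms only touch the nonzero entries of the document vector x. The sparse representation below (parallel index and value arrays) and all names are hypothetical.

```java
// Hypothetical sketch; the sparse layout and method names are assumptions.
final class SparseDistanceSketch {

    // Squared L2 norm of a dense vector; computed once per centroid and reused.
    static double normSquared(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return sum;
    }

    // Squared distance between a sparse document vector (idx/val pairs for
    // its nonzero entries) and a dense centroid whose squared norm is cached.
    // Cost is proportional to the number of nonzeros, not the dimensionality.
    static double squaredDistance(int[] idx, double[] val,
                                  double[] centroid, double centroidNormSq) {
        double dot = 0.0;
        double docNormSq = 0.0;
        for (int k = 0; k < idx.length; k++) {
            dot += val[k] * centroid[idx[k]];
            docNormSq += val[k] * val[k];
        }
        // Rounding can push the result slightly below zero, so clamp it.
        return Math.max(0.0, centroidNormSq - 2.0 * dot + docNormSq);
    }

    public static void main(String[] args) {
        double[] centroid = {1, 2, 0, 3};
        double cached = normSquared(centroid);
        // Document vector (0, 2, 0, 1) stored as its two nonzero entries.
        System.out.println(squaredDistance(new int[] {1, 3}, new double[] {2.0, 1.0}, centroid, cached));
    }
}
```

The document norm could also be cached per document, since it does not change between iterations.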



3) use triangle inequality to limit number of times we have to compute
distance (probably 2x win, but I am curious)

Reference? I don't believe this is implemented, but I might not be understanding.
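
For context, the usual form of this idea (possibly what is meant here, along the lines of Elkan's 2003 ICML paper "Using the Triangle Inequality to Accelerate k-Means") rests on one lemma: if d(c_best, c_j) >= 2 * d(x, c_best), then d(x, c_j) >= d(x, c_best), so the distance from x to c_j never needs to be computed. The sketch below is a simplified illustration with hypothetical names, not code from the patch. Note that it needs true distances: squared Euclidean distance does not satisfy the triangle inequality, so the sqrt has to stay for this particular test.

```java
// Hypothetical sketch of triangle-inequality pruning; names are assumptions.
final class TriangleInequalitySketch {

    // True Euclidean distance (with sqrt), since the lemma requires a metric.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Pairwise centroid-to-centroid distances, computed once per iteration.
    static double[][] centroidDistances(double[][] centroids) {
        int k = centroids.length;
        double[][] cc = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                cc[i][j] = cc[j][i] = distance(centroids[i], centroids[j]);
            }
        }
        return cc;
    }

    // Nearest centroid to x, skipping any centroid that the lemma
    // proves cannot be closer than the current best.
    static int nearest(double[] x, double[][] centroids, double[][] cc) {
        int best = 0;
        double bestDist = distance(x, centroids[0]);
        for (int j = 1; j < centroids.length; j++) {
            if (cc[best][j] >= 2.0 * bestDist) {
                continue; // d(x, c_j) >= d(x, c_best), so no computation needed
            }
            double d = distance(x, centroids[j]);
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 0}, {1, 1}};
        double[][] cc = centroidDistances(centroids);
        System.out.println(nearest(new double[] {0.5, 0}, centroids, cc));
    }
}
```

How much this saves depends on how well separated the centroids are and on the cost of the k-by-k centroid distance table, so the 2x figure would need to be measured.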
