xwu99 commented on issue #27758:
URL: https://github.com/apache/spark/pull/27758#issuecomment-616958956


   > In the current impl, the following Lemma is used in KMeans:
   > 
   > 0, Let x be a point, let c be a center and o be the origin, then d(x,c) >= |d(x,o) - d(c,o)| = |norm(x) - norm(c)|
   > 
   > ```scala
   >       val center = centers(i)
   >       // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
   >       // distance computation.
   >       var lowerBoundOfSqDist = center.norm - point.norm
   >       lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
   >       if (lowerBoundOfSqDist < bestDistance) {
   >         ...
   >       }
   > ```
   > 
   > this can only be applied in `EuclideanDistance`, but not in `CosineDistance`
   > 
   > According to [Using the Triangle Inequality to Accelerate K-Means](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf), we can go further, and there are two more Lemmas that can be used:
   > 
   > > 1, Let x be a point, and let b and c be centers. If d(b,c) >= 2d(x,b) then d(x,c) >= d(x,b);
   > 
   > this can be applied in `EuclideanDistance`, but not in `CosineDistance`.
   > However, for `CosineDistance` we can fortunately derive a variant in the space of radians/angles.
   > 
   > This PR is mainly for Lemma 1. Its idea is quite simple: if point x is close enough to center b (within a pre-computed radius), then we can say point x belongs to center b without computing the distances between x and the other centers. It can be used in both training and prediction.
   
   When this condition holds, the distances to the other centers do not need to be computed. If the condition does not hold, does it fall back to the normal K-means path?
   
   Does the pre-calculation introduce too much overhead when K is large, since it has to be redone every time the centers are updated in each iteration, and the benefit only applies when the condition holds?
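   Just to make my question concrete, here is a rough sketch of how I read the Lemma 1 idea; the names (`computeRadii`, `findClosest`) are placeholders for illustration, not the actual code in this PR:
   
   ```scala
   // Hypothetical illustration only -- not the code in this PR.
   object Lemma1Sketch {
     type Point = Array[Double]
   
     private def euclidean(a: Point, b: Point): Double =
       math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
   
     // For each center b: radius(b) = 0.5 * min distance from b to any other center.
     // By Lemma 1, if d(x, b) <= radius(b) then d(x, c) >= d(x, b) for every other
     // center c, so b is already the closest center for x.
     // This step visits every pair of centers, i.e. O(k^2) per iteration.
     def computeRadii(centers: Array[Point]): Array[Double] =
       centers.indices.map { i =>
         val toOthers = centers.indices.filter(_ != i)
           .map(j => euclidean(centers(i), centers(j)))
         if (toOthers.isEmpty) Double.PositiveInfinity else 0.5 * toOthers.min
       }.toArray
   
     // Try the candidate center first; only fall back to the full scan
     // (the normal K-means path) when the Lemma 1 condition does not hold.
     def findClosest(point: Point, centers: Array[Point],
                     radii: Array[Double], candidate: Int): Int = {
       if (euclidean(point, centers(candidate)) <= radii(candidate)) candidate
       else centers.indices.minBy(i => euclidean(point, centers(i)))
     }
   }
   ```
   
   In this reading, `computeRadii` has to be re-run after every center update, which is where my concern about large K comes from.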
   
   > 
   > > 2, Let x be a point, and let b and c be centers. Then d(x,c) >= max{0, d(x,b) - d(b,c)};
   > 
   > this can be applied in `EuclideanDistance`, but not in `CosineDistance`
   > 
   > The application of Lemma 2 is a little complex:
   > 
   > * It needs to cache/update the distance/lower bounds to the previous centers,
   > * and thus can only be applied in training, not in prediction.
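   
   If I read Lemma 2 correctly, the bookkeeping would look roughly like the sketch below (again only an illustration with placeholder names, not code from this PR): after the centers move, each cached lower bound stays valid once it is shrunk by the corresponding center's drift.
   
   ```scala
   // Hypothetical illustration of the Lemma 2 bookkeeping -- not code from this PR.
   object Lemma2Sketch {
     type Point = Array[Double]
   
     private def euclidean(a: Point, b: Point): Double =
       math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
   
     // drift(j) = distance each center moved between two iterations.
     def computeDrift(oldCenters: Array[Point], newCenters: Array[Point]): Array[Double] =
       oldCenters.zip(newCenters).map { case (o, n) => euclidean(o, n) }
   
     // lower(i)(j) caches a lower bound on d(point i, center j) from the previous
     // iteration. Lemma 2 gives d(x, newCenter(j)) >= max(0, lower(i)(j) - drift(j)),
     // so the cached bounds stay valid without touching the points again.
     def updateLowerBounds(lower: Array[Array[Double]], drift: Array[Double]): Unit =
       for (i <- lower.indices; j <- drift.indices)
         lower(i)(j) = math.max(0.0, lower(i)(j) - drift(j))
   
     // Center j can be skipped for point i while its lower bound already exceeds
     // the best distance found so far; otherwise the exact distance is needed.
     def needsExactDistance(lower: Array[Array[Double]], i: Int, j: Int,
                            bestDistance: Double): Boolean =
       lower(i)(j) < bestDistance
   }
   ```
   
   Since these per-point bounds have to persist across iterations, I agree this only makes sense during training, not in prediction.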
   
   

