[GitHub] [spark] zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

GitBox Mon, 02 Mar 2020 02:08:20 -0800

zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization 
based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593323571
 
 
   In current impl, following Lemma is used in KMeans:
   
   0, Let x be a point, let b be a center and o be the origin, then d(x,c) >= 
|(d(x,o) - d(c,o))| = |norm-norm(c)|
   this can be applied in `EuclideanDistance`, but not in `CosineDistance`
   
   According to [Using the Triangle Inequality to Accelerate 
K-Meanswe](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) can go futher, 
and there are another two Lemmas can be used:
   
   1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then 
d(x,c) >= d(x,b);
   
   this can be applied in `EuclideanDistance`, but not in `CosineDistance`.
   However, luckily for `CosineDistance` we can get a variant in the space of 
radian/angle.
   
   2, Let x be a point, and let b and c be centers. Then d(x,c) >= max{0, 
d(x,b)-d(b,c)};
   
   this can be applied in EuclideanDistance, but not in CosineDistance
   
   The application of Lemma 2 is a little complex: It need to cache/update the 
distance/lower bounds to previous centers, and thus can be only applied in 
training, not usable in predction.
   
   So this PR is mainly for Lemma 1. Its idea is quite simple, if point x is 
close to center b enough (less than a pre-computed radius), then we can say 
point x belong to center c without computing the distances between x and other 
centers. It can be used in both training and predction.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality

Reply via email to