zhengruifeng commented on issue #27758:
URL: https://github.com/apache/spark/pull/27758#issuecomment-616970345


   @xwu99 Thanks for reviewing!
   
   > When this condition holds, it can remove the computation of distances for 
other centers. If not the case, it will go to the normal K-means path?
   
   Yes, this PR will use two short-circuit:
   1, if point x is close to center b enough (less than a pre-computed radius), 
then we can say point x belong to center c without computing the distances 
between x and other centers.
   2, in normal K-means path, suppose current closest center is `c` and 
distance is `p`, for next cluster `d`, if `distance(c,d)` is larger than a 
bound `f(p)`, then we can skip center `d`;
   
   > Does it hurt by introducing more overhead by pre-calculation when K is 
large since you need to pre-calculate when centers are updated for each 
iteration and the condition needs to always hold to get the benefit?
   
   I had optimize it by make the pre-calculation distributedly when `k*k*dim` 
is too large. On the test dataset with dim=8,289,919 and k=32, it took less 
than 2 seconds at one iteration.
   In most cases, it is computed in the driver, and only took ~100 millsecond.
   
   I think triangle-inequality based optimization is complementary with BLAS 
optimization. When we do not enable Level-3 BLAS, then the triangle-inequality 
approach will work.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to