zhengruifeng commented on issue #27758: [SPARK-31007][ML][WIP] KMeans 
optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-601683609
 
 
   @srowen Sorry for the late reply; I missed the emails from GitHub.
   
   > Is the purpose more about prediction speed?
   
   It also helps save training time if the dataset is large enough. The cost 
of computing the stats is about O(k^2 * m), while the cost of computing 
distances in one iteration is O(k * n * m), where m is the number of features 
and n is the number of instances. I guess I can compute the stats in a 
distributed way in some cases (when k is large).
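
   To illustrate the cost argument, here is a minimal single-machine sketch in plain Python, not the code in this PR. The function names (`center_stats`, `assign_pruned`) are hypothetical, and it assumes Euclidean distance. The stats are the k x k center-to-center distances, computed once per iteration; a center `c_j` is skipped whenever `d(c_best, c_j) >= 2 * d(x, c_best)`, because the triangle inequality then gives `d(x, c_j) >= d(x, c_best)`.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def center_stats(centers):
    """Pairwise center-to-center distances: O(k^2 * m), once per iteration."""
    k = len(centers)
    return [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]


def assign_pruned(point, centers, cc):
    """Return (index of closest center, number of distances actually computed)."""
    best, best_d = 0, dist(point, centers[0])
    computed = 1
    for j in range(1, len(centers)):
        # Triangle inequality: d(x, c_j) >= d(c_best, c_j) - d(x, c_best).
        # If d(c_best, c_j) >= 2 * d(x, c_best), c_j cannot beat c_best.
        if cc[best][j] >= 2.0 * best_d:
            continue
        d = dist(point, centers[j])
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed


def assign_brute(point, centers):
    """Baseline: compute all k distances for the point."""
    return min(range(len(centers)), key=lambda j: dist(point, centers[j]))
```

   The saving depends on how well separated the centers are: when most points lie much closer to their own center than that center is to the others, many of the k distance computations per point are skipped, while the O(k^2 * m) stats are paid only once per iteration.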
   
   I marked this PR as WIP for two reasons:
   1, I will test this impl on a big dataset in a distributed setting to check 
whether the above hypothesis holds;
   2, for cosine distance, I want to look further for a theoretical basis for 
the bound. The statement above, `Each side of a spherical triangle is less than 
the sum of the other two`, seems to be for the 3-dimensional case. I think it 
is also correct when dim > 3, but it is not used in other impls, so I will look 
for a theoretical proof.
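
   As a sanity check rather than a proof, the following plain-Python sketch tests the inequality numerically for dim > 3. It assumes the bound is stated on the angle `arccos(cosine similarity)`; the helper names are hypothetical and are not taken from this PR. It also lets one compare against `1 - cosine similarity`, which does not satisfy the triangle inequality by itself.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angular_distance(a, b):
    """Angle in radians; the similarity is clamped to guard against rounding."""
    return math.acos(max(-1.0, min(1.0, cosine_similarity(a, b))))


def cosine_distance(a, b):
    """1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def count_violations(dist_fn, dim, trials, seed=0, tol=1e-9):
    """Count random triples (a, b, c) with d(a, c) > d(a, b) + d(b, c) + tol."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        a, b, c = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
        if dist_fn(a, c) > dist_fn(a, b) + dist_fn(b, c) + tol:
            violations += 1
    return violations
```

   For example, `count_violations(angular_distance, 10, 2000)` can be used to look for counterexamples in 10 dimensions. A small hand-made case shows why the angle is the safer quantity to bound: with `a = [1, 0]`, `b = [1, 1]`, `c = [0, 1]`, the value `cosine_distance(a, c)` is 1, which exceeds `cosine_distance(a, b) + cosine_distance(b, c)`.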

