zhengruifeng commented on issue #27758: [SPARK-31007][ML][WIP] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-601683609

@srowen Sorry for the late reply; I missed the emails from GitHub.

> Is the purpose more about prediction speed?

It also helps reduce training time if the dataset is large enough: the cost of computing the stats is about O(k^2 * m), while the cost of computing distances in one iteration is O(k * n * m), where k is the number of clusters, m is the number of features, and n is the number of instances. I guess I can compute the stats in a distributed way in some cases (when k is large).

I marked this PR as WIP for two reasons:
1. I will test this implementation on a big dataset in a distributed setting to check whether the above hypothesis holds;
2. For cosine distance, I want to find a theoretical basis for the bound. The statement above, `Each side of a spherical triangle is less than the sum of the other two`, seems to be for 3 dimensions. I think it also holds when dim > 3, but it is not used in other implementations. I will look for a theoretical proof.
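
To make the cost argument concrete, here is a minimal Python sketch of triangle-inequality pruning for Euclidean distance. It is not the code in this PR: the function names (`center_stats`, `assign`, `brute_force`) are hypothetical, and the pruning rule is the standard lemma that a center c_j cannot be closer to x than the current best center when d(c_best, c_j) >= 2 * d(x, c_best). The pairwise center distances play the role of the "stats" (O(k^2 * m) work), and each skipped center saves one O(m) distance computation per instance.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors: O(m)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def center_stats(centers):
    """Pairwise center-to-center distances: O(k^2 * m), computed once per iteration."""
    k = len(centers)
    return [[euclidean(centers[i], centers[j]) for j in range(k)] for i in range(k)]


def assign(point, centers, stats):
    """Find the closest center, skipping centers ruled out by the triangle inequality.

    Returns (best_index, best_distance, number_of_distance_computations).
    """
    best = 0
    best_dist = euclidean(point, centers[0])
    computed = 1
    for j in range(1, len(centers)):
        # If d(c_best, c_j) >= 2 * d(x, c_best), then
        # d(x, c_j) >= d(c_best, c_j) - d(x, c_best) >= d(x, c_best),
        # so c_j cannot be strictly closer and its distance is never computed.
        if stats[best][j] >= 2.0 * best_dist:
            continue
        d = euclidean(point, centers[j])
        computed += 1
        if d < best_dist:
            best, best_dist = j, d
    return best, best_dist, computed


def brute_force(point, centers):
    """Reference assignment that computes all k distances."""
    dists = [euclidean(point, c) for c in centers]
    i = min(range(len(centers)), key=dists.__getitem__)
    return i, dists[i]
```

When the centers are well separated, most instances are assigned after a single distance computation, which is where the training-time saving would come from. Whether that outweighs the cost of the stats on a real distributed dataset is the hypothesis still to be tested.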
