xwu99 commented on pull request #28229: URL: https://github.com/apache/spark/pull/28229#issuecomment-616928515
@srowen Thank you for linking us!

> > @xwu99 Could you please provide some performance results of your PR?

Our preliminary benchmark shows this approach can boost the training performance by 3.5x with Intel MKL; I can provide further benchmarks later.

> I had [similar attempts](https://github.com/apache/spark/compare/master...zhengruifeng:blockify_km?expand=1) to optimize KMeans based on high level BLAS.
> I also blockified vectors into blocks, and used BLAS.gemm to find best costs. But I found that:
> 1, it will cause performance regression when input dataset is sparse, (I notice that you add `spark.ml.kmeans.matrixImplementation.rowsPerMatrix`, I am not sure whether we should have two implementations);

This config is there so as not to impact the original implementation. If the general idea is OK, we can switch to the best-performing implementation under different conditions; that is not unusual in other parts of the MLlib code.

> 2, when input dataset is dense, I found no performance gain when `distanceMeasure = EUCLIDEAN`; while `distanceMeasure = COSINE`, about 10% ~ 20% speedup can be obtained;
> 3, Native BLAS (OpenBLAS) did not help too much, if single-thread is used (which is suggested [in SPARK](https://spark.apache.org/docs/latest/ml-guide.html#dependencies));

Did you benchmark native BLAS on a machine with AVX2 or AVX-512? The native optimization takes advantage not only of multi-threading but also of SIMD, cache, etc.

> Then I switched to another optimization approach based on [triangle-inequality](https://github.com/apache/spark/pull/27758), it works on both dense and sparse dataset, and will gain about 10%~30% when `numFeatures` and/or `k` are large.

I do think it's a good idea! But it's still not a general speedup for all cases; the gain assumes some specific conditions, so we still need to speed up the general K-Means path.
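For readers following the blockification discussion, here is a minimal, hypothetical Breeze sketch (not the actual code of this PR or of the linked branch) of the idea being compared: expand `||x - c||^2 = ||x||^2 + ||c||^2 - 2 * (x . c)` so that the dominant cross term for a whole block of points becomes a single matrix multiply, which is the part a native BLAS gemm (MKL/OpenBLAS) can accelerate with SIMD and, if allowed, multiple threads. All names below are illustrative.

```scala
import breeze.linalg.{Axis, DenseMatrix, DenseVector, sum}

// Hypothetical sketch of the blockified cost computation discussed above:
//   points  is an n x d block of input vectors,
//   centers is a  k x d matrix of current cluster centers.
// Returns an n x k matrix of squared Euclidean distances.
def blockCosts(points: DenseMatrix[Double], centers: DenseMatrix[Double]): DenseMatrix[Double] = {
  // Row-wise squared norms of the points and of the centers.
  val pNorms: DenseVector[Double] = sum(points *:* points, Axis._1)
  val cNorms: DenseVector[Double] = sum(centers *:* centers, Axis._1)
  // The dominant cross term x . c for the whole block, computed as one
  // (n x d) * (d x k) multiply that dispatches to a BLAS gemm.
  val cross: DenseMatrix[Double] = points * centers.t
  // ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * (x . c)
  DenseMatrix.tabulate(points.rows, centers.rows) { (i, j) =>
    pNorms(i) + cNorms(j) - 2.0 * cross(i, j)
  }
}
```

The best center for each point is then just an argmin over the corresponding row of the returned matrix; whether this block-wise path pays off versus the existing per-vector path is exactly the dense/sparse trade-off raised above.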
