xwu99 commented on pull request #28229: URL: https://github.com/apache/spark/pull/28229#issuecomment-616928515
@srowen Thank you for linking us!

> > @xwu99 Could you please provide some performance results of your PR?

Our preliminary benchmark shows this approach can boost the training performance by 3.5x with Intel MKL; I can provide further benchmarks later.

> I had [similar attempts](https://github.com/apache/spark/compare/master...zhengruifeng:blockify_km?expand=1) to optimize KMeans based on high level BLAS.
> I also blockified vectors into blocks, and used BLAS.gemm to find best costs. But I found that:
> 1, it will cause performance regression when input dataset is sparse, (I notice that you add `spark.ml.kmeans.matrixImplementation.rowsPerMatrix`, I am not sure whether we should have two implementations);

This config is there so as not to impact the original implementation. If the general idea is OK, we can switch to the best-performing implementation under different conditions; that is not unusual in other parts of the MLlib code.

> 2, when input dataset is dense, I found no performance gain when `distanceMeasure = EUCLIDEAN`; while `distanceMeasure = COSINE`, about 10% ~ 20% speedup can be obtained;
> 3, Native BLAS (OpenBLAS) did not help too much, if single-thread is used (which is suggested [in SPARK](https://spark.apache.org/docs/latest/ml-guide.html#dependencies));

Did you benchmark native BLAS on a machine with AVX2 or AVX-512? The native optimization takes advantage not only of multi-threading but also of SIMD, cache, etc.

> Then I switched to another optimization approach based on [triangle-inequality](https://github.com/apache/spark/pull/27758), it works on both dense and sparse dataset, and will gain about 10%~30% when `numFeatures` and/or `k` are large.

I do think it's a good idea! But it's still not a general speedup for all cases; the gain assumes some specific conditions, so we still need to speed up the general K-Means path.
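For readers following the blockification discussion, here is a minimal, hypothetical Breeze sketch (not the actual code of this PR or of the linked branch) of the idea being compared: expand `||x - c||^2 = ||x||^2 + ||c||^2 - 2 * (x . c)` so that the dominant cross term for a whole block of points becomes a single matrix multiply, which is the part a native BLAS gemm (MKL/OpenBLAS) can accelerate with SIMD and, if allowed, multiple threads. All names below are illustrative.

```scala
import breeze.linalg.{Axis, DenseMatrix, DenseVector, sum}

// Hypothetical sketch of the blockified cost computation discussed above:
//   points  is an n x d block of input vectors,
//   centers is a  k x d matrix of current cluster centers.
// Returns an n x k matrix of squared Euclidean distances.
def blockCosts(points: DenseMatrix[Double], centers: DenseMatrix[Double]): DenseMatrix[Double] = {
  // Row-wise squared norms of the points and of the centers.
  val pNorms: DenseVector[Double] = sum(points *:* points, Axis._1)
  val cNorms: DenseVector[Double] = sum(centers *:* centers, Axis._1)
  // The dominant cross term x . c for the whole block, computed as one
  // (n x d) * (d x k) multiply that dispatches to a BLAS gemm.
  val cross: DenseMatrix[Double] = points * centers.t
  // ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * (x . c)
  DenseMatrix.tabulate(points.rows, centers.rows) { (i, j) =>
    pNorms(i) + cNorms(j) - 2.0 * cross(i, j)
  }
}
```

The best center for each point is then just an argmin over the corresponding row of the returned matrix; whether this block-wise path pays off versus the existing per-vector path is exactly the dense/sparse trade-off raised above.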
