Xiaochang Wu created SPARK-31454:
------------------------------------
Summary: An optimized K-Means based on DenseMatrix and GEMM
Key: SPARK-31454
URL: https://issues.apache.org/jira/browse/SPARK-31454
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 3.0.0
Reporter: Xiaochang Wu
The main computations in K-Means are calculating distances between individual
points and center points. Currently K-Means implementation is vector-based
which can't take advantage of optimized native BLAS libraries.
When the original points are represented as dense vectors, our approach is to
modify the original input data structures to a DenseMatrix-based one by
grouping several points together. The original distance calculations can be
translated into a Matrix multiplication then optimized native GEMM routines
(Intel MKL, OpenBLAS etc.) can be used. This approach can also work with sparse
vectors despite having larger memory consumption when translating sparse
vectors to dense matrix.
Our preliminary benchmark shows this DenseMatrix+GEMM approach can boost the
training performance by *3.5x* with Intel MKL, looks very promising!
To minimize end user impact, proposed changes are to use config parameters to
control if turn on this implementation without modifying public interfaces.
Parameter rowsPerMatrix is used to control how many points are grouped together
to build a DenseMatrix. An example:
$ spark-submit --master $SPARK_MASTER \
--conf "spark.ml.kmeans.matrixImplementation.enabled=true" \
--conf "spark.ml.kmeans.matrixImplementation.rowsPerMatrix=5000" \
--class org.apache.spark.examples.ml.KMeansExample
Several code changes are made in "spark.ml" namespace as we think "spark.mllib"
is in maintenance mode, some are duplications from spark.mllib for using
private definitions in the same package:
- Modified: KMeans.scala
- Added: KMeansMatrixImpl.scala
- Duplications: DistanceMeasure.scala, LocalKMeans.scala
If this general idea is accepted by community, we are willing to contribute our
code to upstream and polish the implementation according to feedbacks and
produce benchmarks.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]