DB Tsai created SPARK-4708:
------------------------------
Summary: k-means runs two/three times faster with dense/sparse samples
Key: SPARK-4708
URL: https://issues.apache.org/jira/browse/SPARK-4708
Project: Spark
Issue Type: Improvement
Reporter: DB Tsai
Note that `breezeSquaredDistance`, which is called from
`org.apache.spark.mllib.util.MLUtils.fastSquaredDistance`, is on the critical
path of k-means, and it is slow. We should replace it with our own
implementation.
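As a rough illustration of what a hand-written replacement could look like, here is a minimal sketch of squared Euclidean distance over raw arrays. It is not the code from the PR: the object name `SquaredDistanceSketch`, the method names, and the array-based signatures are all hypothetical, and the sparse version assumes each index array is sorted in ascending order with no duplicates.

```scala
// Hypothetical sketch, not the actual MLlib implementation.
object SquaredDistanceSketch {

  // Dense vs. dense: one pass over two arrays of equal length.
  def sqdistDense(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vector sizes must match")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // Sparse vs. sparse: merge-walk over two sorted index arrays
  // (assumption: indices are ascending and unique). An index that is
  // present on only one side contributes that value squared.
  def sqdistSparse(
      idx1: Array[Int], v1: Array[Double],
      idx2: Array[Int], v2: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    var j = 0
    while (i < idx1.length || j < idx2.length) {
      val d =
        if (j >= idx2.length || (i < idx1.length && idx1(i) < idx2(j))) {
          // Index present only in the first vector.
          val x = v1(i); i += 1; x
        } else if (i >= idx1.length || idx2(j) < idx1(i)) {
          // Index present only in the second vector.
          val x = v2(j); j += 1; x
        } else {
          // Index present in both vectors.
          val x = v1(i) - v2(j); i += 1; j += 1; x
        }
      sum += d * d
    }
    sum
  }
}
```

The sketch uses plain `while` loops over primitive arrays so that the inner loop allocates nothing and boxes nothing. Whether that accounts for the speedup reported in this issue is not something the message states.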
Here is the benchmark on the mnist8m dataset.
Before:
DenseVector: 70.04 secs
SparseVector: 59.05 secs
With this PR:
DenseVector: 30.58 secs
SparseVector: 21.14 secs
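For reference, the speedup implied by these timings can be computed directly; it comes out to roughly 2.3x for dense vectors and 2.8x for sparse vectors. The snippet below is only a convenience calculation, and the object and method names are hypothetical.

```scala
// Hypothetical helper: speedup = time before / time after.
object BenchmarkSpeedup {
  def speedup(beforeSecs: Double, afterSecs: Double): Double =
    beforeSecs / afterSecs

  def main(args: Array[String]): Unit = {
    // Timings copied from the benchmark above.
    println(f"DenseVector speedup:  ${speedup(70.04, 30.58)}%.2f")
    println(f"SparseVector speedup: ${speedup(59.05, 21.14)}%.2f")
  }
}
```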