Github user yanboliang commented on the issue:
https://github.com/apache/spark/pull/14937
@sethah You can try the following piece of code, even on a single node:
```Scala
import org.apache.spark.ml.clustering.KMeans

// Load the LIBSVM-format dataset with dense feature vectors.
val dataset = spark.read.format("libsvm").options(Map("vectorType" -> "dense")).load("/Users/yliang/Downloads/libsvm/combined")

// k = 3, random initialization, very small tolerance, at most 100 iterations.
val kmeans = new KMeans().setK(3).setSeed(1L).setTol(1E-16).setMaxIter(100).setInitMode("random")
val model = kmeans.fit(dataset)
```
You can find the dataset at
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html .
I ran it against master and against this PR; the time spent per iteration
differs.
Before this PR (master code):
```
Time: 32.076 seconds.
Iteration number: 35.
```
After this PR:
```
Time: 16.322 seconds.
Iteration number: 85.
```
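To make the per-iteration difference explicit, here is a rough sketch that divides the reported totals by the iteration counts. It assumes the reported time is dominated by the iterations themselves (initialization and data loading are not separated out), and `Run` is a hypothetical helper type used only for this arithmetic. It works out to roughly 0.92 seconds per iteration on master versus roughly 0.19 seconds with this PR.

```Scala
// Hypothetical helper type for this arithmetic only; not part of Spark.
final case class Run(label: String, seconds: Double, iterations: Int) {
  // Average wall-clock time per iteration, assuming iterations dominate the total.
  def secondsPerIteration: Double = seconds / iterations
}

// Totals and iteration counts as reported above.
val master = Run("master", 32.076, 35)
val pr = Run("this PR", 16.322, 85)

Seq(master, pr).foreach { r =>
  println(f"${r.label}%-8s ${r.secondsPerIteration}%.3f s/iteration")
}

// Ratio of average per-iteration times (master over this PR).
val ratio = master.secondsPerIteration / pr.secondsPerIteration
println(f"per-iteration ratio: $ratio%.1fx")
```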
I think the value of `tol` is not set properly, which causes the two
implementations to converge after different numbers of iterations. We could use
a more robust dataset, or force each implementation to run for a fixed number of
iterations, to compare the time spent, but we can still get some sense from this
result. Please feel free to try this test in your environment and let me know
whether you can reproduce it. Thanks.
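
For the fixed-iteration comparison, a minimal timing sketch could look like the following. The `timed` helper is hypothetical (it is not part of Spark, and not necessarily how the numbers above were measured); it just wraps a block with `System.nanoTime`. The commented usage assumes that setting `tol` to `0.0` together with a fixed `maxIter` keeps both implementations iterating up to the cap unless the centers stop moving entirely.

```Scala
// Hypothetical helper, not part of Spark: evaluates `body` once and returns
// its result together with the elapsed wall-clock time in seconds.
def timed[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  val seconds = (System.nanoTime() - start) / 1e9
  (result, seconds)
}

// Usage in spark-shell, reusing `kmeans` and `dataset` from the snippet above
// (the tol and maxIter values here are assumptions chosen for illustration):
//   val (model, seconds) = timed(kmeans.setTol(0.0).setMaxIter(20).fit(dataset))
//   println(f"Time: $seconds%.3f seconds.")
```

Running the same call on both builds with the same seed and iteration cap would make the two timings directly comparable.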