[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

yanboliang Wed, 02 Nov 2016 07:30:07 -0700

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/14937
  
    @sethah You can try the following piece of code, even on a single node:
    ```Scala
    import org.apache.spark.ml.clustering.KMeans
    // Load the LIBSVM file, reading the features as dense vectors.
    val dataset = spark.read.format("libsvm")
      .options(Map("vectorType" -> "dense"))
      .load("/Users/yliang/Downloads/libsvm/combined")
    val kmeans = new KMeans()
      .setK(3).setSeed(1L).setTol(1E-16).setMaxIter(100).setInitMode("random")
    
    val model = kmeans.fit(dataset)
    ```
    You can find the dataset at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html.
    I ran it against both master and this PR; the time spent per iteration differs.
    Before this PR (master code):
    ```
    Time: 32.076 seconds.
    Iteration number: 35.
    ```
    After this PR:
    ```
    Time: 16.322 seconds.
    Iteration number: 85.
    ```
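    Because the two runs stop after different numbers of iterations, the totals are easier to compare per iteration. The following is only a small arithmetic sketch using the figures reported above; the `Run` case class is a hypothetical helper introduced for illustration and is not part of Spark. It works out to roughly 0.92 s per iteration on master versus roughly 0.19 s with this PR.
    ```Scala
    // Figures copied from the two runs reported above.
    // `Run` is a hypothetical helper for this illustration only.
    final case class Run(label: String, seconds: Double, iterations: Int) {
      def secondsPerIteration: Double = seconds / iterations
    }

    val master = Run("master", 32.076, 35)
    val withPr = Run("this PR", 16.322, 85)

    // Print the average time per iteration for each build.
    Seq(master, withPr).foreach { r =>
      println(f"${r.label}%-8s ${r.secondsPerIteration}%.3f s/iteration")
    }

    // Ratio of the per-iteration averages (master over this PR).
    val speedup = master.secondsPerIteration / withPr.secondsPerIteration
    println(f"per-iteration speedup: $speedup%.1fx")
    ```
    This is an average over a single run of each build, so it should be read as a rough indication rather than a measurement.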
    I think the value of ```tol``` is not set properly, which causes the two
    implementations to converge after different numbers of iterations. We could use
    a more robust dataset, or force both to run for a fixed number of iterations,
    to compare the time spent, but this result still gives some sense of the
    difference. Please feel free to try this test in your environment and let me
    know whether you can reproduce it. Thanks.
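    One way to try the fixed-iteration comparison mentioned above is sketched below. It rests on several assumptions: `timed` is a hypothetical helper (not a Spark API), `fixedIters = 35` is an arbitrary choice, and `setTol(0.0)` only suppresses the tolerance-based stop, so a run could still end early if the centers stop moving entirely. The dataset path is the one from the snippet above.
    ```Scala
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.sql.SparkSession

    // Reuses the existing session in spark-shell; otherwise starts a local one.
    val spark = SparkSession.builder()
      .master("local[*]").appName("kmeans-timing").getOrCreate()

    // Hypothetical helper: evaluates `body` and returns (result, elapsed seconds).
    def timed[A](body: => A): (A, Double) = {
      val start = System.nanoTime()
      val result = body
      (result, (System.nanoTime() - start) / 1e9)
    }

    val dataset = spark.read.format("libsvm")
      .option("vectorType", "dense")
      .load("/Users/yliang/Downloads/libsvm/combined")

    // Assumption: use the same cap on both builds so the totals are comparable.
    val fixedIters = 35
    val kmeans = new KMeans()
      .setK(3).setSeed(1L).setInitMode("random")
      .setTol(0.0)            // disable the tolerance-based stop
      .setMaxIter(fixedIters) // cap both builds at the same iteration count

    val (model, seconds) = timed(kmeans.fit(dataset))
    println(f"Time: $seconds%.3f seconds for at most $fixedIters iterations.")
    ```
    Running the same script against master and against this PR would then compare total time for the same amount of work, provided neither run stops before the cap.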


