GitHub user wangmiao1981 opened a pull request:

    https://github.com/apache/spark/pull/16666

    [SPARK-19319][SparkR]:SparkR Kmeans summary returns error when the cluster size doesn't equal k

    ## What changes were proposed in this pull request?
    
    When Kmeans is run with initMode = "random" and certain random seeds, the actual number of clusters may not equal the configured `k`.
    
    In this case, summary(model) returns an error because the coefficient matrix is built with ncol = k, which does not match the number of cluster centers actually returned.
    
    Example:
    > col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
    > col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
    > col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
    > cols <- as.data.frame(cbind(col1, col2, col3))
    > df <- createDataFrame(cols)
    > 
    > model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10,
    +                        initMode = "random", seed = 22222, tol = 1E-5)
    > 
    > summary(model2)
    Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
      length of 'dimnames' [2] not equal to array extent
    In addition: Warning message:
    In matrix(coefficients, ncol = k) :
      data length [9] is not a sub-multiple or multiple of the number of rows [2]
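
    The failure can be reproduced in base R without a Spark session. The sketch below is only an illustration: the 9 coefficient values are made up (the count of 9 is taken from the warning above, i.e. 3 actual clusters x 3 features), and the variable names `reshaped` and `result` are hypothetical, not names from the SparkR source.

```r
# Hypothetical stand-in for the flattened cluster centers: the warning above
# reports a data length of 9, i.e. 3 actual clusters x 3 features.
# The values themselves are made up for illustration.
coefficients <- as.numeric(1:9)
features <- c("col1", "col2", "col3")
k <- 5  # configured k, larger than the number of clusters actually found

# Reshape as the warning indicates: matrix(coefficients, ncol = k).
# R recycles the 9 values into a 2 x 5 matrix and emits the warning,
# which is suppressed here so the script runs quietly.
reshaped <- suppressWarnings(matrix(coefficients, ncol = k))
print(dim(reshaped))

# Neither dimension has length 3, so assigning the 3 feature names as
# column names fails (transposing first would not help either).
result <- tryCatch({
  colnames(reshaped) <- features
  "ok"
}, error = function(e) conditionMessage(e))
print(result)
```

    The captured message is the same `dimnames` error reported by summary(model2) above.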
    
    Fix: Get the actual number of clusters in the summary and use it, rather than the configured `k`, to build the coefficient matrix.
    
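
    A minimal base-R sketch of the idea behind the fix, not the patch itself. It assumes the centers are stored cluster by cluster in one flat vector, and it derives the cluster count from the vector length; the actual patch may obtain the count differently (for example from the fitted model). The names `clusterSize` and `centers` and the data values are hypothetical.

```r
# Made-up flattened centers: 3 actual clusters x 3 features.
coefficients <- as.numeric(1:9)
features <- c("col1", "col2", "col3")
k <- 5  # configured k; deliberately NOT used for the reshape below

# Assumption: the vector holds one cluster after another, so the actual
# number of clusters is the vector length divided by the feature count.
clusterSize <- length(coefficients) %/% length(features)

# Build the matrix with the actual cluster count, then transpose so that
# each row is one cluster center and each column is one feature.
centers <- t(matrix(coefficients, ncol = clusterSize))
colnames(centers) <- features
rownames(centers) <- seq_len(clusterSize)
print(centers)
```

    With the actual cluster count the reshape is exact, so no recycling warning is raised and the feature names fit the columns.
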
    ## How was this patch tested?
    
    Added unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangmiao1981/spark kmeans

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16666.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16666
    
----
commit 2c1d02d054fe1a8627b8610e8dd6de226b46af55
Author: [email protected] <[email protected]>
Date:   2017-01-21T01:04:21Z

    fix kmeans bug

----

