GitHub user wangmiao1981 opened a pull request:
https://github.com/apache/spark/pull/16666
[SPARK-19319][SparkR]:SparkR Kmeans summary returns error when the cluster
size doesn't equal to k
## What changes were proposed in this pull request?
When k-means is run with `initMode = "random"` and certain random seeds,
the actual number of clusters can differ from the configured `k`.
In that case, `summary(model)` fails because the coefficient matrix is
built with `k` columns, which does not match the number of cluster
centers the model actually returned.
Example:
> col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
> cols <- as.data.frame(cbind(col1, col2, col3))
> df <- createDataFrame(cols)
>
> model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10,
+ initMode = "random", seed = 22222, tol = 1E-5)
>
> summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
  data length [9] is not a sub-multiple or multiple of the number of rows [2]
Fix: Get the actual cluster size in the summary and use it to build the
coefficient matrix.
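The reshaping problem and the idea behind the fix can be sketched in base R
without a Spark session. This is only an illustration, not the patch itself:
the names `features`, `coefficients`, `clusterSize` and `centers` are
hypothetical stand-ins, the nine values are placeholders, and inferring the
cluster count from the length of the coefficient vector is an assumption made
for this sketch (the patch may obtain the count differently).

```r
# Hypothetical stand-ins for what a fitted model would report.
features <- c("col1", "col2", "col3")
k <- 5                           # configured number of clusters
coefficients <- as.numeric(1:9)  # placeholder: 3 actual clusters x 3 features

# Building the matrix with ncol = k is what triggers the reported error:
# 9 values do not fit 5 columns, so the shape no longer matches `features`.

# Assumption for this sketch: derive the actual cluster count from the data.
clusterSize <- length(coefficients) / length(features)

# Each column holds one cluster center; transpose so rows are clusters.
centers <- t(matrix(coefficients, ncol = clusterSize))
colnames(centers) <- features
rownames(centers) <- seq_len(clusterSize)
print(dim(centers))
```

With the actual cluster count used as the column count, the transposed matrix
has one column per feature, so assigning the feature names succeeds.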
## How was this patch tested?
Added unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangmiao1981/spark kmeans
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16666.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16666
----
commit 2c1d02d054fe1a8627b8610e8dd6de226b46af55
Author: [email protected] <[email protected]>
Date: 2017-01-21T01:04:21Z
fix kmeans bug
----