Hello,
I found a severe bug in the function kmeanspp (the k-means++ implementation) in the
LICORS package
(https://www.rdocumentation.org/packages/LICORS/versions/0.2.0/topics/kmeanspp)
and would like some advice on how to handle it (hence this post).
Since LICORS has not been updated since 2013, I am not sure whether it still has
an active maintainer. The named maintainer has worked at Google since 2014 and
may have abandoned the package.
On the other hand, this is one of the few implementations of k-means++ in R,
and it comes up first when searching for one.
The bug leads to inferior results. In its current form, kmeanspp produced much
worse results than Hartigan-Wong (the default for kmeans in the stats package)
on all test problems I tried. After fixing the bug, however, kmeanspp found
better results than Hartigan-Wong on all of those problems. So anyone comparing
the two algorithms based on the current implementations in R may have come to
the wrong conclusions. As an aside: the Hartigan-Wong implementation (Fortran)
is *much* faster than LICORS::kmeanspp, which is written in pure R (apart from
one call to stats::kmeans at the end), but that is not the point here.
The bug is in a distance computation that should produce a matrix of squared
distances between all data vectors and all current codebook vectors, but does
not. The code and an example illustrating the problem are shown below. The root
cause is R's recycling rule: subtracting a vector from a matrix recycles the
vector down the columns (column-major order), not across the rows. To subtract
a center vector from every row of the data matrix, one therefore has to expand
the vector into a matrix whose rows are all copies of it. Details are shown
below. The fix is trivial.
I stumbled upon this because kmeanspp produced counterintuitive and very poor
results.
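For anyone unfamiliar with the recycling behavior, here is a minimal standalone
snippet (independent of LICORS) showing why `data - center` pairs the wrong
elements, why expanding the vector with `rep(..., each = nrow(data))` fixes it,
and that `sweep()` is an equivalent idiomatic alternative:

```r
# Minimal illustration of R's column-major recycling (independent of LICORS).
data <- cbind(1:3, rep(0, 3))  # three 2-d points: (1,0), (2,0), (3,0)
center <- c(2, 0)              # one codebook vector

# Buggy: the vector is recycled down the columns, so center[1] and
# center[2] alternate along each column -- the wrong pairing.
buggy <- data - center

# Fixed: repeat each coordinate nrow(data) times, so column j of data
# has center[j] subtracted from every row.
fixed <- data - rep(center, each = nrow(data))
print(fixed)  # rows are now (data[i, ] - center), as intended

# sweep() expresses the same row-wise subtraction idiomatically.
stopifnot(all(fixed == sweep(data, 2, center)))
```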
Is this of any interest? Should kmeanspp in LICORS be fixed? I have neither the
experience nor the time to take over an R package. I could write a more formal
bug report if required, but I would first like to know whether this is
considered relevant.
Suggestions and comments are welcome.
Kind Regards
Bernd Fritzke
The code
if (ndim == 1) {
  dists <- apply(cbind(data[center_ids, ]), 1, function(center) {
    rowSums((data - center)^2)
  })
} else {
  dists <- apply(data[center_ids, ], 1, function(center) {
    rowSums((data - center)^2)
  })
}
should rather be (the two changed lines are marked as "fixed"):
if (ndim == 1) {
  dists <- apply(cbind(data[center_ids, ]), 1, function(center) {
    rowSums((data - rep(center, each = nrow(data)))^2)  # fixed
  })
} else {
  dists <- apply(data[center_ids, ], 1, function(center) {
    rowSums((data - rep(center, each = nrow(data)))^2)  # fixed
  })
}
Here is some example code illustrating the problem. It should compute the
squared distances between the six two-dimensional vectors in "data" and three
codebook vectors, which happen to be elements 1, 2, and 4 of "data".
RSTUDIO
```{r}
data <- cbind(1:6, rep(0, 6))
print("data")
print(data)
center_ids <- c(1, 2, 4)
print("codebook")
print(data[center_ids, ])
# fixed
dists <- apply(data[center_ids, ], 1, function(center) {
  rowSums((data - rep(center, each = nrow(data)))^2)
})
print("dists (correct)")
print(dists)
# buggy
dists <- apply(data[center_ids, ], 1, function(center) {
  rowSums((data - center)^2)
})
print("buggy dists from LICORS")
print(dists)
```
Output:
[1] "data"
[,1] [,2]
[1,] 1 0
[2,] 2 0
[3,] 3 0
[4,] 4 0
[5,] 5 0
[6,] 6 0
[1] "codebook"
[,1] [,2]
[1,] 1 0
[2,] 2 0
[3,] 4 0
[1] "dists (correct)"
[,1] [,2] [,3]
[1,] 0 1 9
[2,] 1 0 4
[3,] 4 1 1
[4,] 9 4 0
[5,] 16 9 1
[6,] 25 16 4
[1] "buggy dists from LICORS"
[,1] [,2] [,3]
[1,] 1 5 25
[2,] 4 4 4
[3,] 5 5 17
[4,] 16 16 16
[5,] 17 13 17
[6,] 36 36 36
--
Dr.-Ing. Bernd Fritzke