Dear Alex,

actually, fixing the number of clusters in kmeans and then ending up with a smaller number because of empty clusters is not a standard method of estimating the number of clusters. It may happen (as apparently in some of your examples), but it is generally rather unusual. In most cases, kmeans, as well as clara, pam and other clustering methods, only give you the number of clusters you ask for. Even with some reasonable separation between clusters, kmeans cannot generally be expected to come up with empty clusters if the number is initially chosen too high or too many initial centers are specified.

The help page for pam.object in library cluster shows you a method to estimate the optimal number of clusters based on pam. However, this problem strongly depends on what cluster concept you have in mind and what you want to use your clusters for. There are alternative indexes that could be optimised to find the best number of clusters. Some of them are implemented in the function cluster.stats in package fpc. I strongly advise reading some literature about this to understand the problem better; the help page of cluster.stats gives a few references.
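
A minimal sketch of the idea on the pam.object help page: compute the average silhouette width for a range of candidate numbers of clusters and take the maximiser. The simulated data x and the candidate range 2:6 are assumptions for illustration only, not your spectral data.

```r
library(cluster)

set.seed(1)
# Simulated stand-in data (assumption): two well-separated groups in 2 dimensions
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2))

ks <- 2:6
# Average silhouette width of the pam solution for each candidate k
asw <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)

# Candidate k with the largest average silhouette width
k.best <- ks[which.max(asw)]
k.best
```

With 10000 observations, clara could be substituted for pam in the same loop; its silinfo component refers to the best sample only, so the resulting widths are approximations.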

The BIC gives you an estimate of the number of clusters when used together with Gaussian mixtures; see package mclust.
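
As a rough sketch, assuming simulated two-group data x and a candidate range of 1 to 6 mixture components (both chosen only for illustration), Mclust fits Gaussian mixtures over that range and keeps the model with the best BIC:

```r
library(mclust)

set.seed(1)
# Simulated stand-in data (assumption): two well-separated groups in 2 dimensions
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2))

# Fit Gaussian mixtures with 1 to 6 components; the model is selected by BIC
fit <- Mclust(x, G = 1:6)

fit$G                       # number of mixture components chosen by BIC
table(fit$classification)   # cluster sizes under the chosen model
```

Note that the BIC also selects the covariance structure, so the estimated number of components depends on which covariance models are allowed.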

If you can specify things like maximum within-cluster distances, you may get something from using cutree together with a hierarchical clustering method in hclust, for example complete linkage.
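
A minimal sketch of this, assuming simulated data x and a hypothetical maximum within-cluster distance h = 1.5 (both are placeholders): with complete linkage, cutting the tree at height h yields clusters in which no two points are further apart than h, and the number of clusters follows from that.

```r
set.seed(1)
# Simulated stand-in data (assumption): two well-separated groups in 2 dimensions
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2))

h <- 1.5  # hypothetical maximum within-cluster distance

# Complete linkage: the merge height is the largest distance between members
hc <- hclust(dist(x), method = "complete")

# Cut the tree at height h instead of fixing the number of clusters
cl <- cutree(hc, h = h)

n.clusters <- length(unique(cl))
n.clusters
```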

dbscan and fixmahal in package fpc are further alternatives, requiring
one or two tuning constants to come up with an automatically determined
number of clusters.
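
A short sketch for dbscan, assuming simulated data x and the tuning constants eps = 0.3 and MinPts = 5; these values are illustrative guesses and would need to be chosen for your data. Points labelled 0 are treated as noise and do not count as a cluster.

```r
library(fpc)

set.seed(1)
# Simulated stand-in data (assumption): two well-separated groups in 2 dimensions
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2))

# eps: neighbourhood radius, MinPts: minimum points in a neighbourhood
# (both are illustrative choices, not recommendations)
db <- fpc::dbscan(x, eps = 0.3, MinPts = 5)

# Cluster label 0 marks noise points, so exclude it when counting clusters
n.clusters <- length(setdiff(unique(db$cluster), 0))
n.clusters
```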

Best regards,
Christian

On Thu, 11 Jun 2009, am...@xs4all.nl wrote:

I use kmeans to classify spectral events in high and low 1/3 octave bands:

# Do cluster analysis
CyclA <- data.frame(LlowA, LhghA)
CntrA <- matrix(c(0.9, 0.8, 0.8, 0.75, 0.65, 0.65), nrow = 3, ncol = 2, byrow = TRUE)
ClstA <- kmeans(CyclA, centers = CntrA, nstart = 50, algorithm = "MacQueen")

This works well when the actual data shows 1, 2 or 3 groups that are not
"too close" in a cross plot. The MacQueen algorithm will give one or more
empty groups which is what I want.

However, there are cases when the groups are closer together, less compact
or diffuse which leads to the situation where visually only 2 groups are
apparent but the algorithm returns 3 splitting one group in two.

I looked at the package 'cluster' specifically at clara (cannot use pam as
I have 10000 observations). But clara always returns as many groups as you
ask for.

Is there a way to help find a seed for the initial cluster centers?
Equivalently, is there a way to find a priori the number of groups?

I know this is not an easy problem. I have looked at principal components
(princomp, prcomp) because there is a connection with cluster analysis. It
is not obvious to me how to program that connection though.

http://en.wikipedia.org/wiki/Principal_Component_Analysis
http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf
http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf

Thanks in advance,
Alex van der Spek

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
