Re: [R] Cluster analysis, defining center seeds or number of clusters

Christian Hennig Thu, 11 Jun 2009 09:43:27 -0700

Dear Alex,

actually, fixing the number of clusters in kmeans and then ending up with
a smaller number because of empty clusters is not a standard method of
estimating the number of clusters. It may happen (as apparently in some of
your examples), but it is generally rather unusual. In most cases, kmeans,
as well as clara, pam and other clustering methods, only give you the
number of clusters you ask for. Even with some reasonable separation
between clusters, kmeans cannot generally be expected to come up with
empty clusters if the number is initially chosen too high or too many
initial centers are specified.

The help page for pam.object in package cluster shows you a method to
estimate the optimal number of clusters based on pam. However, this
problem strongly depends on what cluster concept you have in mind and what
you want to use your clusters for. There are alternative indexes that
could be optimised to find the best number of clusters. Some of them are
implemented in the function cluster.stats in package fpc. I strongly
advise reading some literature about this to understand the problem
better; the help page of cluster.stats gives a few references.
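
As a rough sketch of the silhouette-based approach from the pam.object
help page: run pam for several candidate numbers of clusters and keep the
one with the largest average silhouette width. The data x and the names
k_range, asw and k_best are illustrative assumptions, not part of the
original question. The silhouette width is not defined for a single
cluster, so this cannot return "1 group".

```r
library(cluster)  # recommended package, ships with standard R installations

set.seed(1)
# Hypothetical stand-in data: three well-separated two-dimensional groups
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.3), rnorm(50, mean = 3, sd = 0.3))
)

k_range <- 2:6
# Average silhouette width for each candidate number of clusters
asw <- sapply(k_range, function(k) pam(x, k)$silinfo$avg.width)
k_best <- k_range[which.max(asw)]

print(data.frame(k = k_range, avg_sil_width = round(asw, 3)))
cat("Estimated number of clusters:", k_best, "\n")
```

For data sets too large for pam, clara objects also carry a silinfo
component (computed on the best sample), so the same loop should carry
over with clara(x, k) in place of pam(x, k).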

If the clusters are modelled by a Gaussian mixture, the BIC gives you an
estimate of the number of clusters; see package mclust.
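
A minimal sketch of the BIC route, assuming package mclust is installed
and using simulated stand-in data (x and the range G = 1:6 are
illustrative choices). Because G = 1 is among the candidates, the result
can also be a single group.

```r
library(mclust)  # assumes the mclust package is installed

set.seed(1)
# Hypothetical stand-in data: two Gaussian groups in two dimensions
x <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.5), ncol = 2)
)

# Fit Gaussian mixtures with 1 to 6 components; Mclust keeps the model
# (covariance structure and number of components) with the best BIC
fit <- Mclust(x, G = 1:6)

cat("Number of components chosen by BIC:", fit$G, "\n")
print(table(fit$classification))
```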

If you can specify things like maximum within-cluster distances, you may
get something from using cutree together with a hierarchical clustering
method in hclust, for example complete linkage.
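
A sketch of that idea with simulated stand-in data. With complete
linkage, the merge height is the largest distance between any two points
of the merged clusters, so cutting the tree at height h gives clusters
whose maximum within-cluster distance is at most h. The threshold
max_diam below is a hypothetical value; it would have to come from
knowledge of the data.

```r
set.seed(1)
# Hypothetical stand-in data: two compact groups in two dimensions
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.2), ncol = 2)
)

d  <- dist(x)                          # Euclidean distances
hc <- hclust(d, method = "complete")   # complete linkage

max_diam <- 1.5                        # hypothetical maximum within-cluster distance
groups <- cutree(hc, h = max_diam)     # cut by height, not by number of clusters
n_clusters <- length(unique(groups))
cat("Number of clusters:", n_clusters, "\n")

# Check: largest pairwise distance inside each cluster
dm <- as.matrix(d)
diam <- sapply(split(seq_len(nrow(x)), groups),
               function(idx) max(dm[idx, idx]))
print(round(diam, 3))
```

Note that dist stores all pairwise distances, which for 10000
observations is roughly 5e7 values and may be demanding on memory.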


dbscan and fixmahal in package fpc are further alternatives, requiring
one or two tuning constants to come up with an automatically determined
number of clusters.
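
A sketch of the dbscan variant, assuming package fpc is installed and
using simulated stand-in data. The two tuning constants are eps (the
neighbourhood radius) and MinPts (the minimum number of points in a
neighbourhood); the values below are illustrative guesses, not
recommendations.

```r
library(fpc)  # assumes the fpc package is installed

set.seed(1)
# Hypothetical stand-in data: two dense groups in two dimensions
x <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2)
)

# The number of clusters is not specified; it follows from eps and MinPts
ds <- fpc::dbscan(x, eps = 0.4, MinPts = 5)

# Cluster label 0 marks noise points, so it is excluded from the count
n_clusters <- length(setdiff(unique(ds$cluster), 0))
cat("Clusters found:", n_clusters, "\n")
print(table(ds$cluster))
```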

Best regards,
Christian

On Thu, 11 Jun 2009, am...@xs4all.nl wrote:

I use kmeans to classify spectral events in high and low 1/3 octave bands:

#Do cluster analysis
CyclA<-data.frame(LlowA,LhghA)
CntrA<-matrix(c(0.9,0.8,0.8,0.75,0.65,0.65), nrow = 3, ncol=2, byrow=TRUE)
ClstA<-kmeans(CyclA,centers=CntrA,nstart=50,algorithm="MacQueen")

This works well when the actual data shows 1,2 or 3 groups that are not
"too close" in a cross plot. The MacQueen algorithm will give one or more
empty groups which is what I want.

However, there are cases when the groups are closer together, less compact
or diffuse which leads to the situation where visually only 2 groups are
apparent but the algorithm returns 3 splitting one group in two.

I looked at the package 'cluster' specifically at clara (cannot use pam as
I have 10000 observations). But clara always returns as many groups as you
ask for.

Is there a way to help find a seed for the initial cluster centers?
Equivalently, is there a way to find a priori the number of groups?

I know this is not an easy problem. I have looked at principal components
(princomp, prcomp) because there is a connection with cluster analysis. It
is not obvious to me how to program that connection though.

http://en.wikipedia.org/wiki/Principal_Component_Analysis
http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf
http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf

Thanks in advance,
Alex van der Spek

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
