[R] Axis scaling for PCA biplot
Hi there, I'm puzzled about the axis scaling in the PCA biplot. Here's an example. library(pdfCluster) # package cepp seems to have the same data set. data(oliveoil) # 572 observations 10 variables olive <- oliveoil[,3:10] # numerical variables prolive <- princomp(olive) summary(prolive) # Importance of components: # Comp.1 Comp.2 Comp.3 Comp.4 # Standard deviation 479.7299024 150.82827868 45.394449751 27.522646558 # Proportion of Variance 0.8970072 0.08866821 0.008031707 0.002952451 # Cumulative Proportion 0.8970072 0.98567544 0.993707152 0.996659603 # Comp.5 Comp.6 Comp.7 Comp.8 # Standard deviation 24.78169442 1.196956e+01 7.1390744088 6.9756965249 # Proportion of Variance 0.00239367 5.584168e-04 0.0001986489 0.0001896608 # Cumulative Proportion 0.99905327 9.996117e-01 0.9998103392 1.00 plot(prolive$scores) # Scaling of this plot reproduces the variances of the components given in the summary, # as does cov(prolive$scores). This seems all fine, however... biplot(prolive) I have no idea what the numbers on the axes of the biplot are, at least not the larger ones. Chances are the smaller ones indicate the loadings. The larger ones are neither the same as in the first plot, nor are they standardised to one, but they seem to be standardised somehow, as the range on x- and y-axis looks the same, which it shouldn't be if variances represented the PCA eigenvalues. Can anyone explain this to me? Actually the help page of biplot.princomp says something on this, but I don't get my head around it: "scale The variables are scaled by |lambda ^ scale| and the observations are scaled by |lambda ^ (1-scale)| where |lambda| are the singular values as computed by |princomp <https://www.rdocumentation.org/link/princomp?package=stats=3.6.2>|. Normally |0 <= scale <= 1|, and a warning will be issued if the specified |scale| is outside this range." The default value of scale seems to be 1, but then (1-scale) is zero so I'd assume data to be unscaled, but that should have reproduced the "plot" scale, shouldn't it? Thanks, Christian -- Christian Hennig Dipartimento di Scienze Statistiche "Paolo Fortunati", Universita di Bologna, phone +39 05120 98163 christian.hen...@unibo.it [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Cluster analysis using term frequencies
Dear Sun Shine, dtes - dist(tes.df, method = 'euclidean') dtesFreq - hclust(dtes, method = 'ward.D') plot(dtesFreq, labels = names(tes.df)) However, I get an error message when trying to plot this: Error in graphics:::plotHclust(n1, merge, height, order(x$order), hang, : invalid dendrogram input. I don't see anything wrong with the code, so what I'd do is run str(dtes) and str(dtesFreq) to see whether these are what they should be (or if not, what they are instead). I'm clearly screwing something up, either in my source data.frame or in my setting hclust up, but don't know which, nor how. Can't comment on your source data but generally, whatever you do, use str() or even print() to see whether the R-objects are allright or what went wrong. More than just identifying the error however, I am interested in finding a smart (efficient/ elegant) way of checking the occurrence and frequency value of the terms that may be associated with 'sports', 'learning', and 'extra-mural' and extracting these into a matrix or data frame so that I can analyse and plot their clustering to see if how I associated these terms is actually supported statistically. The first thing that comes to my mind (not necessarily the best/most elegant) is to run... dtes3 - cutree(dtesFreq,3) ...and to table dtes3 against your manual classification. Note that 3 is the most natural number of clusters to cut the tree here but may not be the best to match your classification (for example, you may have a one-point cluster in the 3-cluster solution, so it may effectively be a two-cluster solution with an outlier). Your dendrogram, if you succeed plotting it, may give you a hint about that. Hope this helps, Christian I'm sure that there must be a way of doing this in R, but I'm obviously not going about it correctly. Can anyone shine a light please? Thanks for any help/ guidance. Regards, Sun __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] clustering with hclust
Dear Marianna, the function agnes in library cluster can compute Ward's method from a raw data matrix (at least this is what the help page suggests). Also, you may not be using the most recent version of hclust. The most recent version has a note in its help page that states: Two different algorithms are found in the literature for Ward clustering. The one used by option ward.D (equivalent to the only Ward option ward in R versions = 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option ward.D2 implements that criterion (Murtagh and Legendre 2013). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(*, method=ward) corresponds to hclust(*, ward.D2). The Murtagh and Legendre paper has more details on this and is here: http://arxiv.org/abs/.6285 F. Murtagh and P. Legendre, Ward's hierarchical clustering method: clustering criterion and agglomerative algorithm It's not clear to me why one would want to use Ward's method for this kind of data, but that's your decision of course. Best wishes, Christian On Fri, 25 Jul 2014, Marianna Bolognesi wrote: Hi everybody, I have a problem with a cluster analysis. I am trying to use hclust, method=ward. The Ward method works with SQUARED Euclidean distances. Hclust demands a dissimilarity structure as produced by dist. Yet, dist does not seem to produce a table of squared euclidean distances, starting from cosines. In fact, computing manually the squared euclidean distances from cosines (d=2(1-cos)) produces a different outcome. As a consequence, using hclust with ward method on a table of cosines tranformed into distances with dist, produces a different dendrogram than other programs for hierarchical clustering with ward method (i.e. multidendrograms). Weird right?? Computing manually the distances and then feeding them to hclust produces an error message. So, I am wondering, what the hell is this dist function doing?! thanks! marianna [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] R CMD check Error: package MASS was built before R 3.0.0 - not true!?
Hi there, I just updated my R to 3.0.2 and ran R CMD check --as-cran on the just produced new version of fpc. I got an error Error: package MASS was built before R 3.0.0: please re-install it - but I actually *did* re-install MASS without error just before that and within R library(MASS) works just fine. What can I do about this? Best wishes, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] R CMD build error with data files
Dear list, I want to update my prabclus package which I haven't done for quite a while. In the previous version, I had .dat files in my data subdirectory, which I read using .R files. Now R CMD check gives me a warning that .dat files are no longer accepted there. So I changed my filenames to .txt, but actually some of these files are only there in order to be read by .R, not in order to be read independetly by read.table, which produces an error (something like scan error: line 5 does not have 2 elements, which I'd guess comes from the fact that R tries to automatically read the file with read.table that should be read in a different way as specified in my .R-file). So I decided, following the advice in Writing R extensions, to put these files into a directory called inst/extdata. The data directory now only has .R-files that read stuff from files in inst/extdata. This, unfortunately, doesn't seem to work. For example, kykladspecreg.R in data has one line reading kykladspecreg - read.table(system.file(extdata,kykladspecreg.txt,package=prabclus)). R CMD build gives Warning in file(file, rt) : file() only supports open = w+ and open = w+b: using the former Error in read.table(system.file(extdata, kykladspecreg.txt, package = prabclus)) : no lines available in input Same if I use read.table(system.file(inst,extdata,kykladspecreg.txt,package=prabclus)) What's wrong here? Now what to do with the files to be read by the R-files? In the old days all was fine having them as dat files in the data directory. :-( By the way, R CMD check also produces a warning * checking PDF version of manual ... WARNING LaTeX errors when creating PDF version. This typically indicates Rd problems. ...with no useful hint what the problemn may be whatsoever. If anyone has an idea how to find the problem, please tell me! Thanks, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] lda, collinear variables and CV
Dear R-help list, apparently lda from the MASS package can be used in situations with collinear variables. It only produces a warning then but at least it defines a classification rule and produces results. However, I can't find on the help page how exactly it does this. I have a suspicion (it may look at the hyperplane containing the class means, using some kind of default/trivial within-group covariance matrix) but I'd like to know in detail if possible. I find particularly puzzling that it produces different results whether I choose CV=TRUE or I run a manual LOO cross-validation. Constructing an example, I realised that I'm puzzled about CV=TRUE not only in the collinear case. The example is below. Actually it also produces different (though rather similar) results for p=10 (no longer collinear). See here: library(MASS) set.seed(12345) n - 50 p - 200 # or p- 10 testdata - matrix(ncol=p,nrow=n) for (i in 1:p) testdata[,i] - rnorm(n) class - as.factor(c(rep(1,25),rep(2,25))) lda1 - lda(x=testdata,grouping=class,CV=TRUE) table1 - table(lda1$class,class) y.lda - rep(NA, n) for(i in 1:n){ testset - testdata[i,,drop=FALSE] trainset - testdata[-i,] model.lda - lda(x=trainset,grouping=class[-i]) y.lda[i] - predict(model.lda, testset)$class } table2 -table(y.lda, class) table1 class 1 2 1 14 16 2 11 9 table2 class y.lda 1 2 1 15 10 2 10 15 With p=10: table1 class 1 2 1 10 11 2 15 14 table2 class y.lda 1 2 1 10 12 2 15 13 Any explanation? Best regards, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Latent class multinomial (or conditional) logit using R?
You may have a look at lcmixed/flexmixedruns in package fpc. This provides flexmix methods that can be used to fit latent class models with locally independent multinomials (no logits, though, and I'm not sure that for categorical data only this is much different from what poLCA offers). Best regards, Christian On Fri, 23 Dec 2011, adanmartinez wrote: Hi everyone? Does anybody know how can I estimate a Latent class multinomial (or conditional) logit using R? I have tried flexmix, poLCA, and they do not seem to support this model. thanks in advance adan -- View this message in context: http://r.789695.n4.nabble.com/Latent-class-multinomial-or-conditional-logit-using-R-tp4230083p4230083.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] removing outliers in non-normal distributions
Dear Ben, not specifically an R-related response, but the best philosophy of defining outliers, as far as I'm concerned, you'll find in Davies and Gather, The identification of multiple outliers, JASA 1993. The idea is that you can only properly define what an outlier is relative to a reference distributional shape. Note that the reference distributional shape is *not* what you believe the underlying distribution is, but rather a device to define outliers as points that clearly exceed the extremes that are normally to be expected under the reference shape. If your reference distributional shape is the normal, you need to set up a robust outlier identification rule that has a low probability to find *any* outlier if the data rally come from a normal. Basically declaring everything outside median +/- c*MAD will work (Hampel identifier), but c needs to depend on the size of the dataset in order to calibrate it so that for example under the normal the probability not to find any outlier is 0.05 (or whatever you want; note that it is always left to the user and to some extent arbitrary to specify a borderline). Some values for c are given in Davies and Gather. There are some slightly more efficient alternatives but the Hampel identifier is simple and still good. The same principle can be applied (although this is not done in the cited paper) if your reference distribution is different. This may for example make sense if you have skew data and you want a skew outlier identification rule. It should be more or less straightforward to adapt the ideas to a lognormal reference distribution. For others, I'm not sure if literature exists but maybe. A lot has happened since 1993. Hope this helps, Christian On Wed, 28 Sep 2011, Ben qant wrote: Hello, I'm seeking ideas on how to remove outliers from a non-normal distribution predictor variable. We wish to reset points deemed outliers to a truncated value that is less extreme. (I've seen many posts requesting outlier removal systems. It seems like most of the replies center around why do you want to remove them, you shouldn't remove them, it depends, etc. so I've tried to add a lot of notes below in an attempt to answer these questions in advance.) Currently we Winsorize using the quantile function to get the new high and low values to set the outliers to on the high end and low end (this is summarized legacy code that I am revisiting): #Get the truncated values for resetting: lowq = quantile(dat,probs=perc_low,na.rm=TRUE) hiq = quantile(dat,probs=perc_hi,na.rm=TRUE) #resetting the highest and lowest values with the truncated values: dat[lowqdat] = lowq dat[hiqdat] = hiq What I don't like about this is that it always truncates values (whether they truly are outliers or not) and the perc_low and perc_hi settings are arbitrary. I'd like to be more intelligent about it. Notes: 1) Ranking has already been explored and is not an option at this time. 2) Reminder: these factors are almost always distributed non-normally. 3) For reason I won't get into here, I have to do this pragmatically. I can't manually inspect the data each time I remove outliers. 4) I will be removing outliers from candidate predictor variables. Predictors variable distributions all look very different from each other, so I can't make any generalizations about them. 5) As #4 above indicates, I am building and testing predictor variables for use in a regression model. 6) The predictor variable outliers are usually somewhat informative, but their extremeness is a result of the predictor variable calculation. I think extremeness takes away from the information that would otherwise be available (outlier effect). So I want to remove some, but not all, of their extremeness. For example, percent change of a small number: from say 0.001 to 500. Yes, we want to know that it changed a lot, but 49,999,900% is not helpful and masks otherwise useful information. I'd like to hear your ideas. Thanks in advance! Regards, Ben [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] mclust: modelName=E vs modelName=V
I probably don't understand problem. I'd assume that variance$sigmasq are the three estimated component variances (probably estimated by maximum a posteriori, but consult the mclust documentation). What's wrong with that? (The values you submit as scale in prior are not fixed variances, but parameters of the prior distribtion - your problem may be that you believe that they are meant to be variances fixed by you!?) Christian On Tue, 6 Sep 2011, Nico902 wrote: Hi, Thanks a lot for your answer. I effectively was able to get rid of this message by doing: resClust - Mclust(data,G=3,modelName=V,prior=priorControl(scale=c(1.44,0.81,0.49))); However, I would like to be able to retrieve the variances I defined in the result. I found: resClust$parameters $Vinv NULL $pro [1] 0.5502496 0.1986852 0.2510652 $mean 1 2 3 -2.8390006980 -0.0003267873 3.1072574619 $variance $variance$modelName [1] V $variance$d [1] 1 $variance$G [1] 3 $variance$sigmasq [1] 0.840267666 0.009466821 1.510263146 $variance$scale [1] 0.840267666 0.009466821 1.510263146 I do not manage to get where the sigmasq is coming from. I tried to sqrt or square the sigmasq but it does not correspond to what I defined. I found nothing in the manual. If I am missing something obvious or if somebody has the solution it will help me a lot. I want to retrieve those values automatically to plot the different curves of the fitting and to be sure this is doing what I want. Thank you very much again. -- View this message in context: http://r.789695.n4.nabble.com/mclust-modelName-E-vs-modelName-V-tp3789167p3793697.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] mclust: modelName=E vs modelName=V
This normally happens if the algorithm gets caught in a solution where one of the components has variance converging to zero. One way of dealing with this is the use of a prior that penalises too small variances. This works through the prior argument of Mclust (the defaultPrior should do the trick but I currently don't have the time to figure out again how to do this precisely; I have done it before with success). Another option is to have a look at the flexmix package. Best regards, Christian On Sun, 4 Sep 2011, Nico902 wrote: Hi, I'm trying to use the library mclust for gaussian mixture on a numeric vector. The function Mclust(data,G=3) is working fine but the fitting is not optimal and is using modelNames=E. When I'm trying Mclust(data,G=3,modelName=V) I have the following message: Error in if (Sumry$G 1) ans[c(orderedNames, z)] else ans[orderedNames] : argument is of length zero In addition: Warning message: In pickBIC(object[as.character(G), modelNames, drop = FALSE], k = 3) : none of the selected models could be fitted Using variable variance would fit my data better, any idea how to do it? Thanks a lot. -- View this message in context: http://r.789695.n4.nabble.com/mclust-modelName-E-vs-modelName-V-tp3789167p3789167.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Mixture of Regressions
I think that it should be possible to do this with the flexmix package. Christian On Fri, 26 Aug 2011, Andra Isan wrote: Dear All, I have two hidden classes for my data set that I would like to learn them based on the Mixture of Binomial regression model. I was wondering if there is any package that I can use for that purpose. Has any one tried any package for the Mixture models? Thanks a lot, Andra __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Clustering Large Applications..sort of
There is a number of methods in the literature to decide the number of clusters for k-means. Probably the most popular one is the Calinski and Harabasz index, implemented as calinhara in package fpc. A distance based version (and several other indexes to do this) is in function cluster.stats in the same package. Christian On Wed, 10 Aug 2011, Ken Hutchison wrote: Hello all, I am using the clustering functions in R in order to work with large masses of binary time series data, however the clustering functions do not seem able to fit this size of practical problem. Library 'hclust' is good (though it may be sub par for this size of problem, thus doubly poor for this application) in that I do not want to make assumptions about the number of clusters present, also due to computational resources and time hclust is not functionally good enough; furthermore k-means works fine assuming the number of clusters within the data, which is not realistic. The silhouette functions in 'Pam' and 'Clara' and (if I remember correctly) 'cluster' seem to be really bad through very thorough experimentation of data generation with known clusters. I am left then with either theoretical abstractions such as pruning hclust trees with minimal spanning trees or perhaps hand-rolling a hierarchical k-medoids which works extremely efficiently and without cluster number assumptions. Anybody have any suggestions as to possible libraries which I have missed or suggestions in general? Note: this is not a question for 'Bigkmeans' unless there exists a 'findbigkmeansnumberofclusters' function also. Thank you in advance for your assistance, Ken [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Clustering Large Applications..sort of
PS to my previous posting: Also have a look at kmeansruns in fpc. This runs kmeans for several numbers of clusters and decides the number of clusters by either CalinskiHarabasz or Average Silhouette Width. Christian On Wed, 10 Aug 2011, Ken Hutchison wrote: Hello all, I am using the clustering functions in R in order to work with large masses of binary time series data, however the clustering functions do not seem able to fit this size of practical problem. Library 'hclust' is good (though it may be sub par for this size of problem, thus doubly poor for this application) in that I do not want to make assumptions about the number of clusters present, also due to computational resources and time hclust is not functionally good enough; furthermore k-means works fine assuming the number of clusters within the data, which is not realistic. The silhouette functions in 'Pam' and 'Clara' and (if I remember correctly) 'cluster' seem to be really bad through very thorough experimentation of data generation with known clusters. I am left then with either theoretical abstractions such as pruning hclust trees with minimal spanning trees or perhaps hand-rolling a hierarchical k-medoids which works extremely efficiently and without cluster number assumptions. Anybody have any suggestions as to possible libraries which I have missed or suggestions in general? Note: this is not a question for 'Bigkmeans' unless there exists a 'findbigkmeansnumberofclusters' function also. Thank you in advance for your assistance, Ken [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] k-nn hierarchical clustering
Hi there, is there any R-function for k-nearest neighbour agglomerative hierarchical clustering? By this I mean standard agglomerative hierarchical clustering as in hclust or agnes, but with the k-nearest neighbour distance between clusters used on the higher levels where there are at least k1 distances between two clusters (single linkage is 1-nearest neighbour clustering)? Best regards, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R and DBSCAN
Dear Paco, I tried dbscan on my computer with method=hybrid and a 155000*3 data matrix and it works. Needs some time though. (You can track the progress using something like countmode=c(1:10,100,1000,1,10).) Note that for some reason I don't exactly understand, it takes *much* longer for 1-dimensional data (I need to look into this), so if you tried only 1-d data yet, it may be worth a try to do the whole thing with the full 3-d dataset. So I'm not sure what goes wrong on your side. Perhaps look at str(sst2) in order to make sure that it is what you think it is. I can't advise you on how precisely to take longitude and latitude into account because this depends on your application and would probably require professional statistical advisory that is much more than just R-help. Note however that dbscan treats all variables equally. Best wishes, Christian On Tue, 7 Jun 2011, Paco Pastor wrote: Hello Christian Thanks for answering. Yes, I have tried dbscan from fpc but I'm still stuck on the memory problem. Regarding your answer, I'm not sure which memory parameter should I look at. Following is the code I tried with dbscan parameters, maybe you can see if there is any mistake. sstdat=read.csv(sst.dat,sep=;,header=F,col.names=c(lon,lat,sst)) library(fpc) sst1=subset(sstdat, sst50) sst2=subset(sst1, lon-6) sst2=subset(sst2, lon40) sst2=subset(sst2, lat46) dbscan(sst2$sst, 0.1, MinPts = 5, scale = FALSE, method = c(hybrid), seeds = FALSE, showplot = FALSE, countmode = NULL) Error: no se puede ubicar un vector de tamaño 858.2 Mb head(sst2) lon lat sst 1257 35.18 24.98 26.78 1258 35.22 24.98 26.78 1259 35.27 24.98 26.78 1260 35.31 24.98 26.78 1261 35.35 24.98 26.78 1262 35.40 24.98 26.85 In this example I only apply dbscan to temperature values, not lon/lat, so eps parameter is 0.1. As it is a gridded data set any point is surrounded by eight data points, then I thought that at least 5 of the surrounding points should be within the reachability distance. But I'm not sure I'm getting the right approach by only considering temperature value, maybe then I'm missing spatial information. How should I deal with longitude and latitude data? dimensions of sst2 are: 152243 rows x 3 columns Thanks again El 03/06/2011 18:24, Christian Hennig escribió: Have you considered the dbscan function in library fpc, or was it another one? dbscan in fpc doesn't have a distance parameter but several options, one of which may resolve your memory problem (look up the documentation of the memory parameter). Using a distance matrix for hundreds of thousands of points is a recipe for disaster (memory-wise). I'm not sure whether the function that you used did that, but dbscan in fpc can avoid it. It is true that dbscan requires tuning constants that the user has to provide. There is unfortunately no general rule how to do this; it would be necessary to understand the method and the meaning of the constants, and how this translates into the requirements of your application. You may try several different choices and do some cluster validation to see what works, but I can't explain this in general terms easily via email. Hope this helps at least a bit. Best regards, Christian On Fri, 3 Jun 2011, Paco Pastor wrote: Hello everyone, When looking for information about clustering of spatial data in R I was directed towards DBSCAN. I've read some docs about it and theb new questions have arisen. DBSCAN requires some parameters, one of them is distance. As my data are three dimensional, longitude, latitude and temperature, which distance should I use? which dimension is related to that distance? I suposse it should be temperature. How do I find such minimum distance with R? Another parameter is the minimum number of points neded to form a cluster. Is there any method to find that number? Unfortunately I haven't found. Searching thorugh Google I could not find an R example for using dbscan in a dataset similar to mine, do you know any website with such kind of examples? So I can read and try to adapt to my case. The last question is that my first R attempt with DBSCAN (without a proper answer to the prior questions) resulted in a memory problem. R says it can not allocate vector. I start with a 4 km spaced grid with 779191 points that ends in approximately 30 rows x 3 columns (latitude, longitude and temperature) when removing not valid SST points. Any hint to address this memory problem. Does it depend on my computer or in DBSCAN itself? Thanks for the patience to read a long and probably boring message and for your help. -- --- Francisco Pastor Meteorology department, Instituto Universitario CEAM-UMH http://www.ceam.es --- mail: p...@ceam.es skype: paco.pastor.guzman Researcher ID: http://www.researcherid.com/rid/B-8331-2008 Cosis profile: http://www.cosis.net/profile/francisco.pastor --- Parque Tecnologico
Re: [R] R and DBSCAN
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Latent class analysis, selection of the number of classes
Dear Daniel, the BIC can be used to estimate the number of classes. This is actually given out by lca, so you could run lca with several different k and pick the solution that gives you the best BIC. Unfortunately I can't tell you whether large is good or small is good for the BIC implementation of lca, because there are both versions found in the literature, BIC with positive and negative sign. (I think that if there is any standard, then it should rather be large is good; you certainly can check it looking up the values of the loglikelihood and a definition of the BIC in a book. Large is good if the likelihood is used in the definition with a positive sign.) With a bit of experimentation it should be able to find out which way round it is, or you may ask the e1071-maintainer. Hope this helps (actually I may have missed if somebody responded before), Christian On Mon, 23 May 2011, Daniel Malter wrote: Hi, I perform latent class analysis on a matrix of dichotomous variables to create an indicator of class/category membership for each observation. I would like to know whether there is a function that selects the best fit in terms of number of classes/categories. Currently, I am doing this with the lca() function of the e1071 package. This function requires me to specify the number of classes and to compare fit statistics for each run of lca. This becomes somewhat cumbersome the more variables the data matrix contains and, thus, the greater the number of possible classes is. I was wondering whether there is an alternative implemented in a different package that does exactly that. Thanks, Daniel -- View this message in context: http://r.789695.n4.nabble.com/Latent-class-analysis-selection-of-the-number-of-classes-tp3545538p3545538.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Latent class analysis, selection of the number of classes
Just a quick, slightly critical remark about this: the BIC and the CAIC are *by definition* pretty much the same, so it shouldn't be interpreted as some kind of additional confirmation if they point to the same number of classes, it's rather (more or less) the same information twice. Christian On Tue, 24 May 2011, David Joubert wrote: I have used PoLCA for this purpose, and not the e1071 package. You should use a variety of fit indices to choose the number of classes. The BIC may not always be the best choice, depending on your sample size and frequency table. In the best case, AIC, CAIC and BIC values agree as to the optimal number of classes. The Cressie-Read statistic is useful with sparse tables, but I havent found a way to obtain it in R. If you're a coder there might be a way to write a function to obtain it. With poLCA, you can quickly call output for a range a latent classes, and evaluate solutions. The commands are very simple. David Joubert Date: Tue, 24 May 2011 12:30:01 +0100 From: chr...@stats.ucl.ac.uk To: dan...@umd.edu CC: r-help@r-project.org Subject: Re: [R] Latent class analysis, selection of the number of classes Dear Daniel, the BIC can be used to estimate the number of classes. This is actually given out by lca, so you could run lca with several different k and pick the solution that gives you the best BIC. Unfortunately I can't tell you whether large is good or small is good for the BIC implementation of lca, because there are both versions found in the literature, BIC with positive and negative sign. (I think that if there is any standard, then it should rather be large is good; you certainly can check it looking up the values of the loglikelihood and a definition of the BIC in a book. Large is good if the likelihood is used in the definition with a positive sign.) With a bit of experimentation it should be able to find out which way round it is, or you may ask the e1071-maintainer. Hope this helps (actually I may have missed if somebody responded before), Christian On Mon, 23 May 2011, Daniel Malter wrote: Hi, I perform latent class analysis on a matrix of dichotomous variables to create an indicator of class/category membership for each observation. I would like to know whether there is a function that selects the best fit in terms of number of classes/categories. Currently, I am doing this with the lca() function of the e1071 package. This function requires me to specify the number of classes and to compare fit statistics for each run of lca. This becomes somewhat cumbersome the more variables the data matrix contains and, thus, the greater the number of possible classes is. I was wondering whether there is an alternative implemented in a different package that does exactly that. Thanks, Daniel -- View this message in context: http://r.789695.n4.nabble.com/Latent-class-analysis-selection-of-the-number-of-classes-tp3545538p3545538.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] pam() clustering for large data sets
Dear Lilia, I'm not sure whether this is particularly helpful in your situation, but sometimes it is possible to emulate the same (or approximately the same) distance measure as Euclidean distance between points that are somehow rescaled and retransformed. In this case, you can rescale and retransform your original data from which you computed the distances, and use clara, which then implicitly computes Euclidean distances. Of course whether this works depends on the nature of your data and the distance measure that you want to use. Another possibility is to draw a random subset of, say, 3,000 observations, run pam on it, and assign the remaining ones to their closest medoid manually. Actually this is about what clara does anyway. Best regards, Christian On Mon, 16 May 2011, Lilia Nedialkova wrote: Hello everyone, I need to do k-medoids clustering for data which consists of 50,000 observations. I have computed distances between the observations separately and tried to use those with pam(). I got the cannot allocate vector of length error and I realize this job is too memory intensive. I am at a bit of a loss on what to do at this point. I can't use clara(), because I want to use the already computed distances. What is it that people do to perform clustering for such large data sets? I would greatly appreciate any form of suggestions that people may have. Thank you very much in advance. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] GLM and normality of predictors
Normality of the predictors doesn't belong to the assumptions of the GLM, so you don't have to check this. Note, however, that there are all kinds of potential problems which to detect is fairly hopeless with n=11 and three predictors, so you shouldn't be too confident about your results anyway. Christian On Fri, 15 Apr 2011, Simone Santoro wrote: Hi, I have found quite a few posts on normality checking of response variables, but I am still in doubt about that. As it is easy to understand I'm not a statistician so be patient please. I want to estimate the possible effects of some predictors on my response variable that is n? of males and n? of females (cbind(males,females)), so, it would be: fullmodel-glm(cbind(males,females)~pred1+pred2+pred3, binomial) I have n= 11 (ecological data, small sample size is a a frequent problem!). Someone told me that I have to check for normality of the predictors (and in case transform to reach normality) but I am in doubt about the fact that a normality test can be very informative with such a small sample size. Also, I have read that a normality test (Shapiro, Kolmogornov, Durbin, etc.) can't tell you anything about the fact that the distribution is normal but just that there is no evidence for non-normality. Anyway, I am still looking for some sort of thumb of rule to be used in these cases. The question: is there some simple advice on the way one should proceed in this cases to be reasonably confident of the results? Thanks for any help you might provide [[alternative HTML version deleted]] *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Bayesian PCA
Dear Lucy, not an R-related response at all, but if it's questionnaire data, I'd probably try to do dimension reduction in a non-automated way by defining a number of 10 or so meaningful scores that summarise your questions. Dimension reduction is essentially about how to aggregate the given information into low-dimensional measurements, which according to my opinion should be rather driven by the research aim and meaning of the variables than by the distribution of the data, if at all possible. You can then use PCA in order to examine the remaining dimensions Christian On Tue, 12 Apr 2011, Lucy Asher wrote: First of all I should say this email is more of a general statistics questions rather than being specific to using R but I'm hoping that this may be of general interest. I have a dataset that I would really like to use PCA on and have been using the package pcaMethods to examine my data. The results using traditional PCA come out really nicely. The dataset is comprised of a set of questions on dog behaviour answered by their handlers. The questions fall into distinct components which may biological sense and the residuals are reasonable small. Now the problem. I don't have a big enough sample to run traditional PCA. I have about 40 dogs and 60 questions so which ever way you look at it not enough. There is past data available on some of the questions and the realtionships between them so I was wondering whether Bayesian PCA would be a useful alternative using past research to inform my priors. I wondered if anyone knew whether Bayesian PCA was better suited to smaller datasets than traditional (ML) PCA? If not I wondered if anyone knew of packages in R that could do dimension reduction on datasets with small sample sizes? Many Thanks, Lucy This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it. Please do not use, copy or disclose the information contained in this message or in any attachment. Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Help in kmeans
I'm not going to comment on column names, but this is just to make you aware that the results of k-means depend on random initialisation. This means that it is possible that you get different results if you run it several times. It basically gives you a local optimum and there may be more than one of these. Use set.seed to see whether this explains your problem. Best regards, Christian On Wed, 6 Apr 2011, Raji wrote: Hi All, I was using the following command for performing kmeans for Iris dataset. Kmeans_model-kmeans(dataFrame[,c(1,2,3,4)],centers=3) This was giving proper results for me. But, in my application we generate the R commands dynamically and there was a requirement that the column names will be sent instead of column indices to the R commands.Hence, to incorporate this, i tried using the R commands in the following way. kmeans_model-kmeans((SepalLength+SepalWidth+PetalLength+PetalWidth),centers=3) or kmeans_model-kmeans(as.matrix(SepalLength,SepalWidth,PetalLength,PetalWidth),centers=3) In both the ways, we found that the results are different from what we saw with the first command (with column indices). can you please let us know what is going wrong here.If so, can you please let us know how the column names can be used in kmeans to obtain the correct results? Many thanks, Raji -- View this message in context: http://r.789695.n4.nabble.com/Help-in-kmeans-tp3430433p3430433.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Cluster analysis, factor variables, large data set
Dear Hans, clara doesn't require a distance matrix as input (and therefore doesn't require you to run daisy), it will work with the raw data matrix using Euclidean distances implicitly. I can't tell you whether Euclidean distances are appropriate in this situation (this depends on the interpretation and variables and particularly on how they are scaled), but they may be fine at least after some transformation and standardisation of your variables. Hope this helps, Christian On Thu, 31 Mar 2011, Hans Ekbrand wrote: Dear R helpers, I have a large data set with 36 variables and about 50.000 cases. The variabels represent labour market status during 36 months, there are 8 different variable values (e.g. Full-time Employment, Student,...) Only cases with at least one change in labour market status is included in the data set. To analyse sub sets of the data, I have used daisy in the cluster-package to create a distance matrix and then used pam (or pamk in the fpc-package), to get a k-medoids cluster-solution. Now I want to analyse the whole set. clara is said to cope with large data sets, but the first step in the cluster analysis, the creation of the distance matrix must be done by another function since clara only works with numeric data. Is there an alternative to the daisy - clara route that does not require as much RAM? What functions would you recommend for a cluster analysis of this kind of data on large data set? regards, Hans Ekbrand __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] mixture models/latent class regression comparison
Dear Carson, I have never used mmlcr for this, but quite generally when fitting such models, the likelihood has often very many local optima. This means that the result of the EM (or a similar) algorithm depends on the initialisation, which in flexmix (and perhaps also in mmlcr) is done in a random fashion. This means that results may differ even if the same method is applied twice, and unfortunately, depending on the dataset, the result may be quite unstable. This may explain that the two functions give you strongly different results, not of course implying that one of them is generally better. Best regards, Christian On Mon, 28 Feb 2011, Carson Farmer wrote: Dear list, I have been comparing the outputs of two packages for latent class regression, namely 'flexmix', and 'mmlcr'. What I have noticed is that the flexmix package appears to come up with a much better fit than the mmlcr package (based on logLik, AIC, BIC, and visual inspection). Has anyone else observed such behaviour? Has anyone else been successful in using the mmlcr package? I ask because I am interested in latent class negative binomial regression, which the mmlcr package appears to support, however, the results for basic Poisson latent class regression appear to be inferior to the results from flexmix. Below is a simple reproducible example to illustrate the comparison: library(flexmix) library(mmlcr) data(NPreg) # from package flexmix m1 - flexmix(yp ~ x, k=2, data=NPreg, model=FLXMRglm(family='poisson')) NPreg$id - 1:200 # mmlcr requires an id column m2 - mmlcr(outer=~1|id, components=list(list(formula=yp~x, class=poisonce)), data=NPreg, n.groups=2) # summary and coefficients for flexmix model summary(m1) summary(refit(m1)) # summary and coefficients for mmlcr model summary(m2) m2 Regards, Carson P.S. I have attached a copy of the mmlcr package with a modified mmlcr.poisonce function due to errors in the version available here: http://cran.r-project.org/src/contrib/Archive/mmlcr/. See also http://jeldi.com/Members/jthacher/tips-and-tricks/programs/r/mmlcr section Bugs? subsection Poisson. -- Carson J. Q. Farmer ISSP Doctoral Fellow National Centre for Geocomputation National University of Ireland, Maynooth, http://www.carsonfarmer.com/ *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Jaccard dissimilarity matrix for PCA
jaccard in package prabclus computes a Jaccard matrix for you. By the way, if you want to do hierarchical clustering, it doesn't seem to be a good idea to me to run PCA first. Why not cluster the dissimilarity matrix directly without information loss by PCA? (I should not make too general statements on this because generally how to cluster data always depends on the aim of clustering, the cluster concept you are interested in etc.) prabclus also contains clustering methods for such data; have a look at the functions prabclust and hprabclust (however, they are documented as functions for clustering species distribution ranges, so if your application is different, you may have to think about whether and how to adapt them). Hope this helps, Christian On Tue, 28 Dec 2010, Flabbergaster wrote: Hi I have a large dataset, containing a wide range of binary variables. I would like first of all to compute a jaccard matrix, then do a PCA on this matrix, so that I finally can do a hierarchical clustering on the principal components. My problem is, that I don't know how to compute the jaccard dissimilarity matrix in R? Which package to use, and so on... Can anybody help me? Alternatively I'm search for another way to explore the clusters present in my data. Another problem is, that I have cases with missing values on different variables. Jacob -- View this message in context: http://r.789695.n4.nabble.com/Jaccard-dissimilarity-matrix-for-PCA-tp3165982p3165982.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] hclust, does order of data matter?
I don't know how the hclust function is implemented, but generally in hierarchical clustering the result can be ambiguous if there are several distances of identical value in the dataset (or identical between-cluster distances occur when aggregating clusters). The role of the order of the data depends on how these ambiguities are resolved. It may well be that in such cases if at some point when building the hierarchy there are two different possibilities to merge clusters at the same distance value what is done by hclust is determined by the order. Hope this helps, Christian On Mon, 15 Nov 2010, rchowdhury wrote: Hello, I am using the hclust function to cluster some data. I have two separate files with the same data. The only difference is the order of the data in the file. For some reason, when I run the two files through the hclust function, I get two completely different results. Does anyone know why this is happening? Does the order of the data matter? Thanks, RC -- View this message in context: http://r.789695.n4.nabble.com/hclust-does-order-of-data-matter-tp3043896p3043896.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] maxitems in cluster validity
You'd need to show us the code you attempted to use in order to make it possible to help you. A good idea may also be to contact the package maintainer directly. Best regards, Christian On Fri, 22 Oct 2010, Penny Adversario wrote: I did cluster validity using internal and stability measures on 8768 items but I get an error message: “ the number of items to be clustered is larger than maxitems.” I increased my maxitems to 9000 and still got the same error message. I partitioned the data into subsections of 600 and was able to get results but it doesn’t make sense to interpret the 15 validity results based on subsections of the data specially so if one is after the validity of clusters resulting from an entire data set. How can I increase items for cluster validation? Penny [[alternative HTML version deleted]] *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Biweight (Tukey) M-estimator of location with known scale parameter
Have a look at the package smoothmest. Christian On Fri, 15 Oct 2010, Ondrej Vozar wrote: Dear colleagues, I would like to ask you how to estimate biweight M-estimator of Tukey with known scale example. I know how to estimate biweight M-estimator if estimated scale is used using function rml in package MASS. library(MASS) x-rnorm(1000) rlm(x~1,psi=psi.bisquare) But I would like to estimated this with true scale s=1. I also checked package robustbase, but I have not found any solution. Thanks for any advice. Ondrej Vozar __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Package prabclus not available?
Hi there, I just tried to install the package prabclus on a computer running Ubuntu Linux 9.04 using install.packages from within R. This gave me a message: Warning message: In install.packages(prabclus) : package ‘prabclus’ is not available I tried to do this selecting two different CRAN mirrors (same result) and with other packages (installing them works fine). Looking up the CRAN mirror website I used (UK, London), there doesn't seem to be anything wrong with prabclus. (iMac checking apparently gives an error which is due to an error with package spdep on that platform in tests, but that shouldn't affect using it on Linux, or should it?) Any explanation? Thanks and best wishes, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Package prabclus not available?
Works for me for both the London mirror as well as CRAN master. May I guess that you are under R 2.10.x on that machine? Oh yes, need to update this first. Thanks. Solved. The error message could be more informative, though... Christian Uwe On 10.10.2010 14:48, Christian Hennig wrote: Hi there, I just tried to install the package prabclus on a computer running Ubuntu Linux 9.04 using install.packages from within R. This gave me a message: Warning message: In install.packages(prabclus) : package ‘prabclus’ is not available I tried to do this selecting two different CRAN mirrors (same result) and with other packages (installing them works fine). Looking up the CRAN mirror website I used (UK, London), there doesn't seem to be anything wrong with prabclus. (iMac checking apparently gives an error which is due to an error with package spdep on that platform in tests, but that shouldn't affect using it on Linux, or should it?) Any explanation? Thanks and best wishes, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] hclust with method = “ward”
On Wed, 6 Oct 2010, PeterB wrote: Thanks, Christian. This is really helpful. I was not aware of that equality, but now I can see it. I think you mean the inner sum over all distances in the distance matrix (for that cluster), which means that each distance is counted twice (which is why we divide by 2). That's probably how to explain it... you can obviuously check it by writing the whole thing down, which is how I did it. (The formula is in Bock's old book on Automatische Klassifikation, but that's in German.) Christian Peter Christian Hennig wrote: The k-means/Ward criterion can be written down in terms of squared Euclidean distances in a way that doesn't involve means. It is half the sum (over all clusters) of the sum (over all observations in a cluster) of all within-cluster squared dissimilarities, the inner sum divided by the cluster size. This can also be computed for a general dissimilarity matrix (this is for example done by cluster.stats in package fpc). I'd guess that hclust with method=ward uses this when run with a general dissimilarity matrix. At least it would make sense, although I'm not sure whether it really is what hclust does, because I didn't check the underlying Fortran code. Note that I may have missed postings in this thread, so sorry if this doesn't add to what you already have worked out. Christian On Wed, 6 Oct 2010, PeterB wrote: Apparently, the same issue exists in SAS, where there is an option to run the Ward algorithm based only on the distance matrix. Perhaps, a SAS user could confirm that or even check with SAS. Peter -- View this message in context: http://r.789695.n4.nabble.com/hclust-with-method-ward-tp2952140p2965310.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- View this message in context: http://r.789695.n4.nabble.com/hclust-with-method-ward-tp2952140p2966045.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] hclust with method = “ward”
The k-means/Ward criterion can be written down in terms of squared Euclidean distances in a way that doesn't involve means. It is half the sum (over all clusters) of the sum (over all observations in a cluster) of all within-cluster squared dissimilarities, the inner sum divided by the cluster size. This can also be computed for a general dissimilarity matrix (this is for example done by cluster.stats in package fpc). I'd guess that hclust with method=ward uses this when run with a general dissimilarity matrix. At least it would make sense, although I'm not sure whether it really is what hclust does, because I didn't check the underlying Fortran code. Note that I may have missed postings in this thread, so sorry if this doesn't add to what you already have worked out. Christian On Wed, 6 Oct 2010, PeterB wrote: Apparently, the same issue exists in SAS, where there is an option to run the Ward algorithm based only on the distance matrix. Perhaps, a SAS user could confirm that or even check with SAS. Peter -- View this message in context: http://r.789695.n4.nabble.com/hclust-with-method-ward-tp2952140p2965310.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] KMedians clustering in R
pam/clara in package cluster are probably as close to it as you can be. There is no unique definition of a multivariate median and therefore there is no unique definition of k-medians, but pam/clara is one possible version of it. (Of course if you think of k-medians as defined in a specific paper, it may still be something slightly different.) Christian On Thu, 16 Sep 2010, Shubha Vishwanath Karanth wrote: Hi R, Is there a package in R to perform KMedian clustering? Thanks. Shubha This e-mail may contain confidential and/or privileged i...{{dropped:10}} __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] own distance
The kmeans function doesn't accept a distance object or distance matrix as input (which you can of course generate from your own definition), but pam in package cluster does. (Note that the term k-means refers to cluster mean vectors, which can only computed from a cases*variables data matrix, not from distances alone.) There are also several other clustering methods that work with distance matrices as input, see for example hclust, agnes from package cluster or dbscan from package fpc, but if you want something that is conceptually close to k-means, you should probably go for pam. Best regards, Christian On Tue, 7 Sep 2010, Karen Sargsyan wrote: Is it possible to implement my own distance and mean for k-means clustering for any clustering package in R? Just looking for simple way, without creating a new package. karsar __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Rd-file error: non-ASCII input and no declared encoding
Dear list, I came across the following error for three of my newly written Rd-files: non-ASCII input and no declared encoding I can't make sense of this. Below I copied in one of the three files. Can anybody please tell me what's wrong with it? Thank you, Christian \name{tetragonula} \alias{tetragonula} \alias{tetragonula.coord} \docType{data} % \non_function{} \title{Microsatellite genetic data of Tetragonula bees} \description{ Genetic data for 236 Tetragonula (Apidae) bees from Australia and Southeast Asia, see Franck et al. (2004). The data give pairs of alleles (codominant markers) for 13 microsatellite loci. } \usage{data(tetragonula)} \format{ Two objects are generated: \describe{ \item{tetragonula}{A data frame with 236 observations and 13 string variables. Strings consist of six digits each. The format is derived from the data format used by the software GENEPOP (Rousset 2010). Alleles have a three digit code, so a value of \code{258260} on variable V10 means that on locus 10 the two alleles have codes 258 and 260. \code{000} refers to missing values.} \item{tetragonula.coord}{a 236*2 matrix. Coordinates of locations of individuals in decimal format, i.e. the first number is latitude (negative values are South), with minutes and seconds converted to fractions. The second number is longitude (negative values are West).} } } \source{ Franck, P., E. Cameron, G. Good, J.-Y. Rasplus, and B. P. Oldroyd (2004) Nest architecture and genetic differentiation in a species complex of Australian stingless bees. \emph{Mol. Ecol.} 13, 2317–2331. Rousset, F. (2010) Genepop 4.0 for Windows and Linux. \url{http://kimura.univ-montp2.fr/~rousset/Genepop.pdf} } \details{ Reads from example data file \code{Heterotrigona_indoFO.txt}. } \examples{ data(tetragonula) } \keyword{datasets} *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Rd-file error: non-ASCII input and no declared encoding
Dear Gavin, very well spotted = solved! Thanks also Duncan! Christian On Wed, 1 Sep 2010, Gavin Simpson wrote: On Wed, 2010-09-01 at 15:09 +0100, Christian Hennig wrote: Dear list, I came across the following error for three of my newly written Rd-files: non-ASCII input and no declared encoding I'm not sure, but is it the m- or n-dash character for the page range in the \source{} section? 2317–2331 ^^^ replace it with 2317-2331 and see if it helps. HTH G I can't make sense of this. Below I copied in one of the three files. Can anybody please tell me what's wrong with it? Thank you, Christian \name{tetragonula} \alias{tetragonula} \alias{tetragonula.coord} \docType{data} % \non_function{} \title{Microsatellite genetic data of Tetragonula bees} \description{ Genetic data for 236 Tetragonula (Apidae) bees from Australia and Southeast Asia, see Franck et al. (2004). The data give pairs of alleles (codominant markers) for 13 microsatellite loci. } \usage{data(tetragonula)} \format{ Two objects are generated: \describe{ \item{tetragonula}{A data frame with 236 observations and 13 string variables. Strings consist of six digits each. The format is derived from the data format used by the software GENEPOP (Rousset 2010). Alleles have a three digit code, so a value of \code{258260} on variable V10 means that on locus 10 the two alleles have codes 258 and 260. \code{000} refers to missing values.} \item{tetragonula.coord}{a 236*2 matrix. Coordinates of locations of individuals in decimal format, i.e. the first number is latitude (negative values are South), with minutes and seconds converted to fractions. The second number is longitude (negative values are West).} } } \source{ Franck, P., E. Cameron, G. Good, J.-Y. Rasplus, and B. P. Oldroyd (2004) Nest architecture and genetic differentiation in a species complex of Australian stingless bees. \emph{Mol. Ecol.} 13, 2317–2331. Rousset, F. (2010) Genepop 4.0 for Windows and Linux. \url{http://kimura.univ-montp2.fr/~rousset/Genepop.pdf} } \details{ Reads from example data file \code{Heterotrigona_indoFO.txt}. } \examples{ data(tetragonula) } \keyword{datasets} *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~% Dr. Gavin Simpson [t] +44 (0)20 7679 0522 ECRC, UCL Geography, [f] +44 (0)20 7679 0565 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk Gower Street, London [w] http://www.ucl.ac.uk/~ucfagls/ UK. WC1E 6BT. [w] http://www.freshwaters.org.uk %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~% __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Latex errors when checking package
Dear Ben, thanks a lot! This did the trick (after adding install behind apt-get). Regards, Christian On Fri, 6 Aug 2010, Ben Bolker wrote: Christian Hennig chrish at stats.ucl.ac.uk writes: LaTeX errors when creating PDF version. This typically indicates Rd problems. LaTeX errors found: ! Font T1/ptm/m/n/10=ptmr8t at 10.0pt not loadable: Metric (TFM) file not found I'm guessing you're running Ubuntu. If so, http://www.vidarholen.net/contents/linuxtips/ suggests that you sudo apt-get texlive-fonts-recommended __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Latex errors when checking package
Dear listers, I just run R CMD check on an update of one of my packages. All seems fine but after having gone through all the Rd-file and example checking and so on, I get the following kind of errors: LaTeX errors when creating PDF version. This typically indicates Rd problems. LaTeX errors found: ! Font T1/ptm/m/n/10=ptmr8t at 10.0pt not loadable: Metric (TFM) file not found . to be read again relax l.7 \begin{document} ! Font T1/ptm/m/n/24.88=ptmr8t at 24.88pt not loadable: Metric (TFM) file not f ound. to be read again (and similar ones) l.18 ...ian Hennig \email{chr...@stats.ucl.ac.uk}} ! \textfont 0 is undefined (character h). \...@formatstring ...\...@string \UrlRight \...@th $ l.41 ...http://www.homepages.ucl.ac.uk/~ucakche/}} ! \textfont 0 is undefined (character t). \...@formatstring ...\...@string \UrlRight \...@th $ l.41 ...http://www.homepages.ucl.ac.uk/~ucakche/}} ! \textfont 0 is undefined (character t). \...@formatstring ...\...@string \UrlRight \...@th $ (...and lots of further ones that look preety much the same.) I have no idea what went wrong and don't see anything suspicious in the Rd-files. Any hint what this could be about? Is something missing in my LATEX-installation? Thanks and best regards, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] K-means result - variance between cluster
Dear Ralph, between and within clusters sum of squares (if you want variances, you need to divide them by the appropriate constant!) add up to the overall sum of squares, so you can get the beween clusters ss by computing the overall ss (one possibility to get this is to run kmeans with k=1) and subtracting the within cluster ss from it. Note, however, that the F-value cannot be interpreted in the usual way and is particulary not F-distributed when computed on clusters from k-means, because for F-distribution you'd need to assume that groups are determined independently of the data. Hope this helps, Christian On Fri, 2 Jul 2010, Ralph Modjesch wrote: Hi, I like to present the results from the clustering method k-means in terms of variances: within and between Cluster. The k-means object gives only the within cluster sum of squares by cluster, so the between variance part is missing,for calculation the following table, which I try to get. Number of | Variance within | Var between | Var total | F-value Cluster k | cluster | cluster | | === 2 ...| 25,00 ..| 75,00 ..| 100 ..| 1,5 3 ...| 45,00 ..| 55,00 ..| 100 ..| 1,7 Is there any package/ function which will do that? -- Mit freundlichen Grüßen Ralph Modjesch __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
Dear abanero, In principle, k nearest neighbours classification can be computed on any dissimilarity matrix. Unfortunately, knn and knn1 seem to assume Euclidean vectors as input, which restricts their use. I'd probably compute an appropriate dissimilarity between points (have a look at Gower's distance in daisy, package cluster), and the implement nearest neighbours classification myself if I needed it. It should be pretty straightforward to implement. If you want unsupervised classification (clustering) instead, you have the choice between all kinds of dissimilarity based algorithms then (hclust, pam, agnes etc.) Christian On Thu, 27 May 2010, Ulrich Bodenhofer wrote: abanero wrote: Do you know something like “knn1” that works with categorical variables too? Do you have any suggestion? There are surely plenty of clustering algorithms around that do not require a vector space structure on the inputs (like KNN does). I think agglomerative clustering would solve the problem as well as a kernel-based clustering (assuming that you have a way to positive semi-definite measure of the similarity of two samples). Probably the simplest way is Affinity Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation; see CRAN package apcluster I have co-developed). All you need is a way of measuring the similarity of samples which is straightforward both for numerical and categorical variables - as well as for mixtures of both (the choice of the similarity measures and how to aggregate the different variables is left to you, of course). Your final classification task can be accomplished simply by assigning the new sample to the cluster whose exemplar is most similar. Joris Meys wrote: Not a direct answer, but from your description it looks like you are better of with supervised classification algorithms instead of unsupervised clustering. If you say that this is a purely supervised task that can be solved without clustering, I disagree. abanero does not mention any class labels. So it seems to me that it is indeed necessary to do unsupervised clustering first. However, I agree that the second task of assigning new samples to clusters/classes/whatever can also be solved by almost any supervised technique if samples are labeled according to their cluster membership first. Cheers, Ulrich -- View this message in context: http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2232902.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] cluster analysis and supervised classification: an alternative to knn1?
Christian wrote: and the implement nearest neighbours classification myself if I needed it. It should be pretty straightforward to implement. Do you intend modify the code of the knn1() function by yourself? No; if you understand what the nearest neighbours method does, it's not very complicated to implement it from scratch (assuming that your dataset is small enough that you don't have to worry too much about optimising computing times). A bit of programming experience is required, though. (It's not that I intend to do it right now, I suggest that you do it if you can...) Christian thanks to everyone! -- View this message in context: http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2233210.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Can I compare two clusters without using their distance-matrix (dist()) ?
Dear Tal, I took the definition of the Hubert gamma- and Dunn-index from the Gordon book. They are actually not about comparing two clusters, at least not in that reference, and they require dissimilarities. The adjusted Rand index and Meila's VI, as implemented in cluster.stats, compare two clusterings. If you set compareonly=TRUE in cluster.stats, it only computes these two indexes, so it doesn't need the dissimilarity matrix in principle. I will probably in the next update change it so that in this case you don't need to provide a dissimilarity matrix. Until then, you can supply a noninformative matrix. Example: c1 - sample(4,100,replace=TRUE) c2 - sample(5,100,replace=TRUE) cs - cluster.stats(d=matrix(0,ncol=100,nrow=100),c1,c2,compareonly=TRUE) cs$corrected.rand cs$vi Hope this helps, Christian On Wed, 21 Apr 2010, Tal Galili wrote: Thanks for the fast reply Uwe. My hope in posting this was to find if anyone had already done work (in R) in this direction. So far I wasn't able to find any such relevant code, so I turned to the mailing list. Regarding new implementations - thanks for offering! - I have already came around one such algorithm - I implemented it, and will probably publish it on my blog http://www.r-statistics.com/ in the near future. If any one else has any reference to R implementation, it would be most helpful, Tal Contact Details:--- Contact me: tal.gal...@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English) -- 2010/4/21 Uwe Ligges lig...@statistik.tu-dortmund.de On 21.04.2010 18:15, Tal Galili wrote: Hello all, I would like to compare the similarity of two cluster solutions using a validation criteria (such as Hubert's gamma coefficient, the Dunn index the corrected rand index and so on) I see (from here:http://www.statmethods.net/advstats/cluster.html) that the function cluster.stats() in the fpc package provides a mechanism for comparing 2 cluster solutions - *BUT* - it requires me to give the the distance matrix among objects. *My question *is: What ways can you suggest for comparing two cluster solutions, while using the cluster indicators only (i.e: a vector saying to which cluster each object belongs to), and WITHOUT asking to submit the distance matrix between the objects. Don't know. If you have a theoretical solution and can provide the description of a method, there will be many people around happy to make an algorithm and implement it. Uwe Ligges Thanks, Tal Contact Details:--- Contact me: tal.gal...@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English) -- [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] cluster/distance large matrix
Dear Bart, a strange thing in your question is that the term Ward's method usually refers to a method based on the k-means criterion, which, in its standard form, is not based on dissimilarities, but on objects*variables-data. So I wonder how and why you want to use Ward's method on a dissimilarity matrix in the first place (I know that the k-means criterion can in principle be translated to dissimilarity data - this is probably what hclust's method=ward does if fed with a dissimilarity matrix, but I'm not sure -, but then it loses its justification). One thing you could think about is using the function pam in library cluster. Chances are that this won't work on 38,000 cases either, but you may cluster a subsample of, say, 2,000 cases and assign all further objects to the most similar cluster medoid. It is well know that hierarchical methods are problematic with too large dissimilarity matrices; even if you resolve the memory problem, the number of operations required is enormous. Hope this helps, Christian On Thu, 11 Feb 2010, Bart Thijs wrote: Hi all, I've stumbled upon some memory limitations for the analysis that I want to run. I've a matrix of distances between 38000 objects. These distances were calculated outside of R. I want to cluster these objects. For smaller sets (egn=100) this is how I proceed: A-matrix(scan(file, n=100*100),100,100, byrow=TRUE) ad-as.dist(A) ahc-hclust(ad,method=ward,members=NULL) However if I try this with the real dataset I end up with memory problems. I've the 64bit version of R installed on a machine with 40Gb RAM (Windows 2003 64bit version). I'm thinking about using only the lower triangle of the matrix but I can't create a distance object for the clustering from the lower.tri Can someone help me with a suggestion for which way to go? Best Regards Bart Thijs -- View this message in context: http://n4.nabble.com/cluster-distance-large-matrix-tp1477237p1477237.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Sampling theory
Hi there, are there any R-packages for computations required in sampling theury (such as confidence intervals under random, stratified, cluster sampling; I'd be partoculary interested in confidence intervals for the population variance, which is difficult enough to find even in books)? Thanks, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Clustering with clara
Dear Paco, as far as I know, there is no such problem with clara, but I may be wrong. However, in order to help you (though I'm not sure whether I'll be able to do that), we'd need to understand precisely what you were doing in R and how your data looks like (code and data; you can show us a relevant bit of the data using the str command). Chances are that the problem is not in clara but some other thing that you do doesn't do what you expect it to do. Christian On Thu, 14 Jan 2010, pacomet wrote: Hello everyone I am trying to use CLARA method for finding clusters in my spatial surface temperature data and noticed one problem. My data are in the form lat,lon,temperature. I extract lat,lon and cluster number for each point in the dataset. When I plotted a map of cluster numbers I found empty areas in the map. The point is that the number of points that were assigned a cluster number are less than the original temperature analyzed points. Why are there less points in the clustering results? is there any option in the CLARA method to retain every single point? is there another clustering method that preserves all the points? Thanks in advance Paco -- _ El ponent la mou, el llevant la plou Usuari Linux registrat: 363952 --- Fotos: http://picasaweb.google.es/pacomet [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] K-means recluster data with given cluster centers
That kmeans returns an error if there is an empty cluster is a bit of a nuisance. It should not be too difficult to get rid off the kmeans function for what you call reclustering. You could write your own function that assigns every point of the new data to the closest initial center. That should be relatively easy and does the same thing, if I understand correctly what you want. I don't comment on whether it makes sense what you attempt to do, which entirely depends on the aim of your analysis (and on what you mean by cluster in the same way), but an alternative could be to cluster the initial data by mclustBIC in library mclust and to use the resulting clusters as training data in mclustDA. Cheers, Christian On Mon, 11 Jan 2010, t.peter.muel...@gmx.net wrote: K-means recluster data with given cluster centers Dear R user, I have several large data sets. Over time additional new data sets will be created. I want to cluster all the data in a similar/ identical way with the k-means algorithm. With the first data set I will find my cluster centers and save the cluster centers to a file [1]. This first data set is huge, it is guarantied that cluster centers will converge. Afterwards I load my cluster centers and cluster via k-means all other datasets with the same cluster centers [2]. I tried this but now I'm getting in the reclustering step following error message: Error: empty cluster: try a better set of initial centers That one of the clusters is empty (has no datapoint) should not be a problem. This can happen because the new data sets can be smaller. What am I doing wrong? Is there a other way to cluster new data in the same way like the old datasets? Thanks Peter 1: R code to find cluster center and save them to file #---INITIAL CLUSTERING TO FIND CLUSTER CENTERS # LOAD LIB library(cluster) # LOAD DATA data_unclean - read.table(dataset1.dat) data.matrix-as.matrix(data_unclean,any) # CLUSTER Nclust - 100 # amount cluster centers Imax - 200 # amount of iteration for convergence of clustering set.seed(100) # set seed of random nr generator init - sample(dim(data.matrix)[1], Nclust) # this is the initial Nclust prototypes km - kmeans(data.matrix, centers=data.matrix[init,], iter.max=Imax) # WRITE OUT CLUSTER CENTERS km$centers # print cluster center (columns: dim component; rows: clusters) km$size # print amount of data in each cluster clusterCenters=km$centers save(file=clusterCenters.RData, list='clusterCenters') # Beispiel write.table(km$centers, file = clusterCenters.dat, sep = ,, col.names= FALSE, row.names= FALSE) 2: R code to recluster new data #---RECLUSTER NEW DATA WITH GIVEN CLUSTER CENTERS # LOAD LIB, SET PARAMETER library(cluster) loopStart=0 loopEnd=10 # LOAD CLUSTER CENTER load(clusterCenters.RData) # load cluster centers # LOOP OVER TRAJ AND RECLUSTER THEM for(ii in loopStart:loopEnd){ # DEFINE FILENAME #print(paste(test,ii,sep=)) filenameInput=paste(dataset,ii,dat,sep=) filenameOutput=paste(dataset,ii,datClusters,sep=) print(filenameInput) print(filenameOutput) # LOAD DATA data_unclean - read.table(filenameInput) data.matrix-as.matrix(data_unclean,any) # RECLUSTER DATA kmRecluster - kmeans(data.matrix, centers=clusterCenters, iter.max=1) kmRecluster$size # WRITE OUT CLUSTERS FOR EACH DATA write.table(kmRecluster$cluster, file = filenameOutput, sep = ,, col.names= FALSE, row.names= FALSE) } -- Jetzt kostenlos herunterladen: Internet Explorer 8 und Mozilla Firefox 3.5 - sicherer, schneller und einfacher! http://portal.gmx.net/de/go/chbrowser __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Fishers exact test at 2.2e-16
I know that you didn't ask for this but to me this seems to be a very dodgy method to select a best number of clusters with no proper basis at all. All of these tests are data dependent, so the p-values cannot be interpreted in the usual way. It is actually not clear how they can be interpreted, and the freedom in the data to find a clustering depends on the number of clusters, so there is no reason to expect that comparing p-values for different numbers tells you anything meaningful. Do you really think that it is an informative difference if one clustering gives you p=10^{-58} and another one 10^{-30}? Christian On Thu, 17 Dec 2009, Søren Faurby wrote: In an effort to select the most appropriate number of clusters in a mixture analysis I am comparing the expected and actual membership of individuals in various clusters using the Fisher?s exact test. I aim for the model with the lowest possible p-value, but I frequently get p-values below 2.2e-16 and therefore does not get exact p-values with standard Fisher?s exact tests in R. Does anybody know if there is a version of Fisher?s exact test in any package which can handle lower probabilities, or have other suggestions as to how I can compare the probabilities? I am for instance comparing the following two: dat2-matrix(c(29,0,29,0,12,0,18,0,0,29,0,16,0,19), nrow=2) fisher.test(dat2, workspace=3000) dat3-matrix(c(29,0,0,29,0,0,12,0,0,17,0,1,0,29,0,0,15,1,0,0,19), nrow=3) fisher.test(dat3, workspace=3000) Which both result in p-value 2.2e-16 Kind regards, Søren __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] cluster size
Dear Ms Karunambigai, the kmeans algorithm depends on random initialisation. There are two basic strategies that can be applied in order to make your results reproducible: 1) Fix the random number generator by means of set.seed (see ?set.seed) before you run kmeans. The problem with this is that your solution can only be reproduced using the same random seed; it technically still is random. 2) Specify fixed initial centers, using the centers argument in kmeans. (Sensible initial centers may be obtained by running hclust using Ward's method, obtain the desired number of clusters using cutree and compute the centers of the resulting clusters; sorry that I don't have the time right now to explain how to do that precisely; the help pages and hopefully some understanding of what is going on may help you further.) An alternative strategy that will not absolutely guarantee reproducibility but make your results more stable is to use kmeansruns in library fpc, which is a wrapper that runs kmeans several times and gives you the optimal solution. That should reproduce its outcome with higher probability (though not precisely 1). I don't know whether the default value runs=100 is sufficient to give a stable solution for your data, but increasing the runs parameter may help. Cheers, Christian On Fri, 11 Dec 2009, karuna m wrote: hi r-help, i am doing kmeans clustering in stats. i tried for five clusters clustering using: kcl1 - kmeans(as1[,c(contlife,somlife,agglife,sexlife, rellife,hordlife,doutlife,symtlife,washlife, chcklife,rptlife,countlife,coltlife,ordlife)], 5, iter.max = 10, nstart = 1, algorithm = Hartigan-Wong) table(kcl1$cluster) every time i am getting five clusters of different sizes like first time with cluster sizes table(kcl1$cluster) 1 2 3 4 5 140 72 105 98 112 second time with cluster sizes table(kcl1$cluster) 1 2 3 4 5 91 149 106 76 105 and so on. I wish to know that whether there is any function to get same sizes of clusters everytime when we do kmeans clustering. Thanks in advance. regards, Ms.Karunambigai M PhD Scholar Dept. of Biostatistics NIMHANS Bangalore India The INTERNET now has a personality. YOURS! See your Yahoo! Homepage. [[alternative HTML version deleted]] *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] fitting mixture of normals distribution to asset return data
Dear John, I don't know what the exp stuff in your line below is about, but mclustBIC in package mclust does fit normal mixtures. Try for a start library(mclust) mmm - mclustBIC(data,G=2) mmms - summary(mmm) mmms If you want to learn more, read the documentation. Christian On Wed, 25 Nov 2009, John Seppänen wrote: Hi, I have a 15 years of monthly return data (180 observations) from instruments that have non-normal return distributions. Thus, I would like to fit a mixture of normal distribution to each of the series. So, that I would be able to simulate from the marginal distributions like this: asset.1-exp(c(rnorm(500,-0.07,0.02),rnorm(9500,0.05,0.05)))-1 My problem is that I have tried to use Google and go through some packages (eg mixtools mclust) but haven't been able to find a function to fit the mixture of normals. I would like to have two different states of world and then get the probabilities and the mean and sigma in those states (as in the example above). I am newbie in this subject so if someone could point me a R function for this, I'd really appreciate it... br, John [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] test for bimodality
Dear John, you may check dip in package diptest, which tests unimodality vs. multimuodality (you'll need qDiptab for that, too). Apologies if this misses the point; I only saw the last responses but not the original question so this may not be what you are after, or already mentioned, but I decided that I risk making a fool of myself for a small chance that it helps you. Cheers, Christian On Mon, 31 Aug 2009, Rolf Turner wrote: On 31/08/2009, at 9:40 AM, John Sansom wrote: Has a test for bimodality been implemented in R? Doing RSiteSearch(test for bimodality) yields one hit, which points to http://finzi.psych.upenn.edu/Rhelp08/2008-September/173308.html It looks like it might be *some* help to you. cheers, Rolf Turner ## Attention:\ This e-mail message is privileged and confid...{{dropped:9}} __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] normal mixture model
Hi Cindy, you need the summary function mclustsummary - summary(mclustBICoutputobject,data) to get all the information. Some (like best model) is given if you just print out the summary object. Some other information (like estimated parameter values) are accessible as components of the summary object, like mclustsummary$parameters$... Try str(mclustsummary) to see what's there (unfortunately this is not fully documented). For more detail see the help pages. Hope this helps, Christian On Sun, 26 Jul 2009, cindy Guo wrote: Hi, Christian, Thank you for the reply. I just tried. Does the function mclustBIC only give the best model, or does it also do EM to get the cluster means and variances according to the best model it picks? I didn't find it. Is there a way to automatically select the best number of components and do EM? Because I need to do the normal mixture model in a loop (one EM at an iteration), so I want it to do everything automatically. Thanks, Cindy On Sun, Jul 26, 2009 at 3:46 PM, Christian Hennig chr...@stats.ucl.ac.ukwrote: You can use mclustBIC in package mclust (uses the BIC for deciding about the number of components and hierarchical clustering for initialisation). Christian On Sun, 26 Jul 2009, cindy Guo wrote: Hi, All, I want to fit a normal mixture model. Which package in R is best for this? I was using the package 'mixdist', but I need to group the data into groups before fitting model, and different groupings seem to lead to different results. What other package can I use which is stable? And are there packages that can automatically determine the number of components? Thank you, Cindy [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.htmlhttp://www.r-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] normal mixture model
You can use mclustBIC in package mclust (uses the BIC for deciding about the number of components and hierarchical clustering for initialisation). Christian On Sun, 26 Jul 2009, cindy Guo wrote: Hi, All, I want to fit a normal mixture model. Which package in R is best for this? I was using the package 'mixdist', but I need to group the data into groups before fitting model, and different groupings seem to lead to different results. What other package can I use which is stable? And are there packages that can automatically determine the number of components? Thank you, Cindy [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Hartigan's Dip test
James, I assume that you use package diptest. Look up ?qDiptab for a table of quantiles from which you can obtain your p-value. data(qDiptab) qDiptab You then look up your statistic value (0.074) for a suitable n (n=30, say), and you take one minus the Pr value you find on top of the table. Using n=30 gives you Pr between 0.8 and 0.9, so 0.2p0.1. However, the next largest value of n (50) would lead to 0.05p0.02 leading to possibly different conclusions. Unfortunately this means that you are in a kind of ambiguous borderline situation for the table, though 33 is so much closer to 30 than to 50 that it seems that your result is at least not significant at 0.05 level. By the way, you can simulate a p-value yourself by repeating dip(runif(33)) lots of times. Hope this helps, Christian On Mon, 6 Jul 2009, James Allsopp wrote: Hi, I just got a value for the dip test out of my data of 0.074 for a sample size of 33. I'm trying to work out what this actually means though? Could someone help me relate this to a p-value? Thanks James __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Cluster analysis, defining center seeds or number of clusters
Dear Alex, actually fixing the number of clusters in kmeans end then ending up with a smaller number because of empty clusters is not a standard method of estimating the number of clusters. I may happen (as apparently in some of your examples), but it is generally rather unusual. In most cases, kmeans, as well as clara, pam and other clustering methods, only give you the number of clusters you ask for. Even with some reasonable separation between clusters kmeans cannot generally be expected to come up with empty clusters if the number is initially chosen too high or too many initially centers are specified. The help page for pam.object in library cluster shows you a method to estimate the optimal number of clusters based on pam. However, this problem strongly depends on what cluster concept you have in mind and what you want to use your clusters for. There are alternative indexes that could be optimised to find the best number of clusters. Some of them are implemented in the function cluster.stats in package fpc. I strongly advise reading some literature about this to understand the problem better; the help page of cluster.stats gives a few references. The BIC gives you an estimate of the number of cluster together with Gaussian mixtures, see package mclust. If you can specify things like maximum within-cluster distances, you may get something from using cutree together with a hierarchical clustering method in hclust, for example complete linkage. dbscan and fixmahal in package fpc are further alternatives, requiring one or two tuning constants to come up with an automatical number of clusters. Best regards, Christian On Thu, 11 Jun 2009, am...@xs4all.nl wrote: I use kmeans to classify spectral events in high and low 1/3 octave bands: #Do cluster analysis CyclA-data.frame(LlowA,LhghA) CntrA-matrix(c(0.9,0.8,0.8,0.75,0.65,0.65), nrow = 3, ncol=2, byrow=TRUE) ClstA-kmeans(CyclA,centers=CntrA,nstart=50,algorithm=MacQueen) This works well when the actual data shows 1,2 or 3 groups that are not too close in a cross plot. The MacQueen algorithm will give one or more empty groups which is what I want. However, there are cases when the groups are closer together, less compact or diffuse which leads to the situation where visually only 2 groups are apparent but the algorithm returns 3 splitting one group in two. I looked at the package 'cluster' specifically at clara (cannot use pam as I have 1 observations). But clara always returns as many groups as you aks for. Is there a way to help find a seed for the intial cluster centers? Equivalently, is there a way to find a priori the number of groups? I know this is not an easy problem. I have looked at principal components (princomp, prcomp) because there is a connection with cluster analysis. It is not obvious to me how to program that connection though. http://en.wikipedia.org/wiki/Principal_Component_Analysis http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf Thanks in advance, Alex van der Spek __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] em algorithm mixture of multivariate normals
Look up packages flexmix and mclust! Christian On Thu, 21 May 2009, daniele riggi wrote: Hi, I would like to know if it is possible to have a R code to estimate the parameters of a mixture of bivariate (or multivariate) normals via EM Algorithm. I tried to write it, but in the estimation of the matrix of variance and covariance, i have some problems. I generate two bidimensional vectors both from different distribution with their own vector means and variance and covariance structure. When I create a unique vector, the structure of covariance changes, and so the implementation of the EM algorithm doesn't work. Maybe someone knows the reason. If I fix the starting initial value of the covariance matrix and I don't update the estimate of this matrix, the algorithm works and finds the estimate of the vector means, so I wrote it in the correct way. However if someone could help me I will be very grateful to him for kindness. Best RegardsDaniele -- Dr. Daniele Riggi, PhD student University of Milano-Bicocca Department of Statistics Building U7, Via Bicocca degli Arcimboldi, 8 20126 Milano, Italy cell. +39 328 3380690 mailto: daniele.ri...@gmail.com [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Save Cluster results to data frame
Hi Chris, this isn't particularly a clustering question, is it? Why don't you just take your clustering vector (pam.output.object$clustering) and assign it to a$clustering (given that a is the name of your data frame)? And why don't you just define a new character/string vector and assign the cluster names that you want to it using if or case? Regards, Christian On Mon, 18 May 2009, Chris Arthur wrote: If I cluster my data into 3 sets, using pam for instance, is there a way to save the resultant cluster results, to the originating data frame. and related to that how do i say change the cluster names to something a bit more meaningful that 1..2...3 So it goes like this. Data --- Cluster into 3 groups given them meaningful names ---output back to data frame Thanks for the help Chris __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Bayesian 2*2 table independence test
Dear list, I'd like to test the null hypothesis of independence in a 2*2 contingency table in a Bayesian way. As far as I can see, this can be done by the function ctable in library LearnBayes. (More precisely, a Bayes factor can be computed.) Two questions: 1) Is there any other package/function than can be used for this in a more or less straightforward manner? 2) The ctable help page explains the parameter a as a: matrix of prior hyperparameters, but it is not explained what the prior is and how it precisely depends on these hyperparameters. (I read the paper test_of_independence.pdf by the package author J. Albert on the web, but it only explains the choice a=1+0*data, but not general a.) Thank you, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] finite mixture model (2-component Weibull): plotting Weibull components?
mclust doesn't do Weibull mixtures, only Gaussian ones (though you may approximate a Weibull by several Gaussians). You may look up the flexmix package, which either does it already or a method function can be provided to do it. There is also an example fitting a mixture distribution in Venables and Ripley's MASS book with normals (including plotting the density), which you could adapt for Weibull distributions by plugging in the corresponding functions for the Weibull. (1) restrict the number of components to 2 . Specify G=2 in mclust (if want to fit 2 Gaussians). (2) obtain and plot a component Weibull density Generally, adding a Weibull mixture density to a plot works by x - seq(0,100,by=0.1) # or whatever reasonable choice of x-values lines(x,p*dweibull(x,s11,s12)+(1-p)*dweibull(x,s21,s22)) where p, s11, s12, s21, s22 are the mixture parameters. Regards, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] clustering, don't understand this error
Hi there, I'm travelling right now so I can't really check this but it seems that the problem is that cluster.stats needs a partition as input. hclust doesn't give you a partition but you can generate one from it using cutree. BTW, rather use - than =. Best wishes, Christian On Wed, 15 Apr 2009, Ana M Aparicio Carrasco wrote: Hello, I am using the dunn metric, but something is wrong and I dont understand what or what that this error mean. Please can you help me with this? The instructions are: #Indice de Dunn disbupa=dist(bupa[,1:6]) a=hclust(disbupa) cluster.stats(disbupa,a,bupa[,7])$dunn And the error is: Error in max(clustering) : invalid 'type' (list) of argument thank you so much. Ana Maria Aparicio. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Help with clustering
Generally, how to scale different variables when aggregating them in a dissimilarity measure is strongly dependent on the subject matter, what the aim of clustering and your cluster comncept is. This cannot be answered properly on such a mailing list. A standard transformation before computing dissimilarities would be to scale all variables to variance 1 by dividing by their standard deviations. This gives in some well defined sense all variables the same weight (which may be somewhat affected by outliers, heavy tails, skewness; note, however, that normalising to the same range shares the same problems more severly). Regards, Christian On Mon, 26 Jan 2009, mau...@alice.it wrote: I am going to try out a tentative clustering of some feature vectors. The range of values spanned by the three items making up the features vector is quite different: Item-1 goes roughly from 70 to 525 (integer numbers only) Item-2 is in-between 0 and 1 (all real numbers between 0 and 1) Item-3 goes from 1 to 10 (integer numbers only) In order to spread out Item-2 even further I might try to replace Item-2 with Log10(Item-2). My concern is that, regardless the distance measure used, the item whose order of magnitude is the highest may carry the highest weight in the process of calculating the similarity matrix therefore fading out the influence of the items with smaller variation in the resulting clusters. Should I normalize all feature vector elements to 1 in advance of generating the similarity matrix ? Thank you so much. Maura tutti i telefonini TIM! [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] bug (?!) in pam() clustering from fpc package ?
Dear Tal, pam is not in the fpc package but in the cluster package. Look at ?pam and ?pam.object to find out what it does. As far as I see, the medoids in the output object are the final cluster medoids, not the initial ones, which presumably explains the observed behaviour. Best regards, Christian On Wed, 17 Dec 2008, Tal Galili wrote: Hello all. I wish to run k-means with manhattan distance. Since this is not supported by the function kmeans, I turned to the pam function in the fpc package. Yet, when I tried to have the algorithm run with different starting points, I found that pam ignores and keep on starting the algorithm from the same starting-points (medoids). For my questions: 1) is there a bug in the code or in the way I am using it ? 2) is there a way to either fix the code or to another function in some package that can run kmeans with manhattan distance (manhattan distances are the sum of absolute differences) ? here is a sample code: require(fpc) x - rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)), cbind(rnorm(15,5,0.5), rnorm(15,5,0.5))) pam(x, 2, medoids = c(1,16)) output: Medoids: ID [1,] 3 -0.1406026 0.1131493 [2,] 17 4.9564839 4.6480520 ... So the initial medeoids where 3 and 17, not 1 and 16 as I asked. Thanks, Tal -- -- Tal Galili Phone number: 972-50-3373767 FaceBook: Tal Galili My Blogs: www.talgalili.com www.biostatistics.co.il *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] CLARA and determining the right number of clusters
Hi there, generally finding the right number of clusters is a difficult problem and depends heavily on the cluster concept needed for the particular application. No outcome of any automatic mathod should be taken for granted. Having said that, I guess that something like the example given in ?pam.object (replacing pam by clara) should work with clara, too. Regards, Christian On Tue, 30 Sep 2008, pacomet wrote: Hi everyone I have a question about clustering. I've managed using CLARA to get a clustering analysis of a large data set. But now I want to find which is the right number of clusters. The clara.object gives some information like the ratio between maximal and minimal dissimilarity that says (maybe if lower than 1??) if a cluster is well-separated from the other. I've also read something about silhouette and abut cluster.stats but can't manage to get how to find the right number of clusters. I've tried a suggestion from the mailing list but when using dist d1-dist(mydata$sst) it says that specified vector size is too big Is there any method to find the right number of clusters when using clara? Maybe something I've tried but with a small and simple trick I can't find Thanks in advance -- _ El ponent la mou, el llevant la plou Usuari Linux registrat: 363952 --- Fotos: http://picasaweb.google.es/pacomet [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] CLARA and determining the right number of clusters
Hi there, I've tried your suggestion and it seems promising. But I have a couple of questions. I am reading a three column ASCII file (lon, lat, sst) mydata - read.table(INFILE, header=FALSE,sep=, na.strings=99.00,dec=.,strip.white=TRUE,col.names=c(lon,lat,sst)) then I extract a subset of the data and try to get the right number of clusters just for third var, sst I'm not sure whether you feel that you have to look at a single variable at a time, but the whole thing should work for more than one as well. x-mydata$sst asw - numeric(10) for (k in 4:10) + asw[k] - clara(x, k) $ silinfo $ avg.width k.best - which.max(asw) cat(silhouette-optimal number of clusters:, k.best, \n) silhouette-optimal number of clusters: 5 I've changed the maximum number of clusters in your example from 20 just to 10 as I am expecting a number between 5 and 8 clusters would be right. Is there any problem with this change? Maybe this restriction is too strict if I just consider the data are just numbers but as it is sea surface temperature under certain environmental-meteorological conditions in this particular case I think there should not be more than 8-9 clusters (If 20 is retained I get 11 clusters). This kind of problem concerns your data and your application and cannot be solved on such a mailing list. Perhaps you should go for professional advice about your particular data. Quite obviously, if you restrict the number of clusters to be at most 10, you cannot find eleven, and how strong you think that there should not be more than 8-9 clusters and how good your arguments against 11 are, nobody on this list can decide. The general problem is that there is no unique statistical definition of what a true cluster is and whether your dataset rather contains 5 or 11 clusters (or any other number) depends on what you want to call a cluster. The second question is how should one understand the plot? Is the right number the one with greater average silhouette width? I don't know which plot you refer to but you may have a look at the Kaufman and Rousseeuw book quoted on the help page. Best wishes, Christian Thanks again 2008/9/30 Christian Hennig [EMAIL PROTECTED] Hi there, generally finding the right number of clusters is a difficult problem and depends heavily on the cluster concept needed for the particular application. No outcome of any automatic mathod should be taken for granted. Having said that, I guess that something like the example given in ?pam.object (replacing pam by clara) should work with clara, too. Regards, Christian On Tue, 30 Sep 2008, pacomet wrote: Hi everyone I have a question about clustering. I've managed using CLARA to get a clustering analysis of a large data set. But now I want to find which is the right number of clusters. The clara.object gives some information like the ratio between maximal and minimal dissimilarity that says (maybe if lower than 1??) if a cluster is well-separated from the other. I've also read something about silhouette and abut cluster.stats but can't manage to get how to find the right number of clusters. I've tried a suggestion from the mailing list but when using dist d1-dist(mydata$sst) it says that specified vector size is too big Is there any method to find the right number of clusters when using clara? Maybe something I've tried but with a small and simple trick I can't find Thanks in advance -- _ El ponent la mou, el llevant la plou Usuari Linux registrat: 363952 --- Fotos: http://picasaweb.google.es/pacomet [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakchehttp://www.homepages.ucl.ac.uk/%7Eucakche -- _ El ponent la mou, el llevant la plou Usuari Linux registrat: 363952 --- Fotos: http://picasaweb.google.es/pacomet *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] intToUtf8
Hi there, any explanation for this? intToUtf8(66) Error in intToUtf8(66) : argument 'x' must be an integer vector intToUtf8(c(66,55)) Error in intToUtf8(c(66, 55)) : argument 'x' must be an integer vector intToUtf8(c(66,55),multiple=TRUE) Error in intToUtf8(c(66, 55)) : argument 'x' must be an integer vector Errr... 66 and c(66,55) are as integer vectorish as anything can be, aren't they? version _ platform i686-pc-linux-gnu arch i686 os linux-gnu system i686, linux-gnu status major 2 minor 6.0 year 2007 month 10 day03 svn rev43063 language R version.string R version 2.6.0 (2007-10-03) Thanks, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] intToUtf8
Dear Duncan, No, they're numeric. integer here is a type, not the mathematical thing. This works: intToUtf8(as.integer(c(66,55))) The docs don't mention this requirement, and it does seem somewhat unnecessary; I'll look into it. Thanks! You're lucky I didn't read to the end of your message: that's a pretty old version. Please test on current versions before reporting problems. Sorry but people won't always install the most recent version when running into smallish problems like this. Particularly not when they're working on machines where they are not superuser. We have other things to do than updating every two months. I may be egoist here but I think the list has to live with requests concerning old versions. (I use version so that people can tell me has been fixed in the meantime if this is the case.) If you personally don't respond to them, that's your choice of course. But it doesn't work as a policy. Regards, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] About clustering techniques
Dear Paco, in order to use the methods in the cluster package (including pam), look up the help page of daisy, which is able to compute dissimilarity matrices handling missing values appropriately (in most situations). A good reference is the Kaufman and Rousseeuw book cited on that help page. Christian On Tue, 29 Jul 2008, pacomet wrote: Hello R users It's some time I am playing with a dataset to do some cluster analysis. The data set consists of 14 columns being geographical coordinates and monthly temperatures in annual files latitutde - longitude - temperature 1 -. - temperature 12 I have some missing values in some cases, maybe there are 8 monthly valid values at some points with four non valid. I don't want to supress the whole row with 8 good/4 bad values as I wanna try annual and monthy analysis. I first tried kmeans but found a problem with missing values. When trying without omitting missing values kmeans gives an error and when excluding invalid data too many values are excluded in some years of the data series. Now I have been reading about pam, pamk and clara, I think they can handle missing values. But can't find out the way to perform the analysis with these functions. As I'm not an statistics nor an R expert the fpc or cluster package documentation is not enough for me. If you know about a website or a tutorial explaining the way to use that functions, with examples to check if possible, please post them. Any other help or suggestion is greatly appreciated. Thanks in advance Paco -- _ El ponent la mou, el llevant la plou Usuari Linux registrat: 363952 --- Fotos: http://picasaweb.google.es/pacomet [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] About clustering techniques
A quick comment on this: imputation is an option to make things technically work, but it is not necessarily good. Imputation always introduces some noise, ie, it fakes information that is not really there. Whether it is good depends strongly on the data, the situation and the imputation method (random often not being a very sensible choice). Christian On Tue, 29 Jul 2008, [EMAIL PROTECTED] wrote: Hi Paco, I got the same problem with you before. Thus, I just impute the missing values For example: newdata-as.matrix(impute(olddata, fun=random)) then I believe that you could analyze your data. Hopefully it helps. Chunhao Quoting pacomet [EMAIL PROTECTED]: Hello R users It's some time I am playing with a dataset to do some cluster analysis. The data set consists of 14 columns being geographical coordinates and monthly temperatures in annual files latitutde - longitude - temperature 1 -. - temperature 12 I have some missing values in some cases, maybe there are 8 monthly valid values at some points with four non valid. I don't want to supress the whole row with 8 good/4 bad values as I wanna try annual and monthy analysis. I first tried kmeans but found a problem with missing values. When trying without omitting missing values kmeans gives an error and when excluding invalid data too many values are excluded in some years of the data series. Now I have been reading about pam, pamk and clara, I think they can handle missing values. But can't find out the way to perform the analysis with these functions. As I'm not an statistics nor an R expert the fpc or cluster package documentation is not enough for me. If you know about a website or a tutorial explaining the way to use that functions, with examples to check if possible, please post them. Any other help or suggestion is greatly appreciated. Thanks in advance Paco -- _ El ponent la mou, el llevant la plou Usuari Linux registrat: 363952 --- Fotos: http://picasaweb.google.es/pacomet [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] cluster.stats
The given information is not enough to tell you what's going on. as.dist doesn't appear in the given code and it's not clear to me what kind of object img is (a small image doesn't tell me what R makes of it). Also, try to read the help pages first and find out whether img is of the format that is required by the functions. And check (using str for example) whether data is what you expect it to be. Christian On Sat, 14 Jun 2008, Laura Poggio wrote: Thank you very much for your answer. I tried to run the function on my data and now I am getting this message of error Error in as.dist(dmat[clustering == i, clustering == i]) : (subscript) logical subscript too long Below the code I am using (version2.7.0 of R with all packages updated): data - - as(img, data.frame)[1:1]#(where img is a small image 256 px x 256 px) kl - kmeans(data, 5) library(fpc) cluster.stats(data, kl$cluster) Thank you for any hints on the reasons and meaning of the error! Laura 2008/6/13 Christian Hennig [EMAIL PROTECTED]: Dear Laura, Dear list, I just tried to use the function cluster.stat in the package fpc. I just have a couple of questions about the syntax: cluster.stats(d,clustering,alt.clustering=NULL, silhouette=TRUE,G2=FALSE,G3=FALSE) 1) the distance object (d) is an object obtained by the function dist() on my own original matrix? d is allowed to be an object of class dist or a dissimilarity matrix. The answer to your question depends on what your original matrix is. If it is something on which you can compute a distance by dist(), you're right, at least if dist() delivers the distance you are interested in. 2) clustering is the clusters vector as result of one of the many clustering methods? The help page tells you what clustering can be. So it could be the clustering/partition vector of a clustering method or it could be something else. Note that cluster.stats doesn't depend on any particular clustering method. It computes the statistics regardless of where the clustering vector comes from. Best regards, Christian Thank you very much in advance and sorry for such basic question, but I did not manage to clarify my mind. Laura [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakchehttp://www.homepages.ucl.ac.uk/%7Eucakche *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] cluster.stats
Dear Laura, I have R 2.6.0. I tried dist on a vector of length 200,000 and it told me that it is too long. Theoretically, if you have 260,000 observations, the length of the dist object should be 260,000*259,999/2, which is too large for our computers, I guess. Which means that unfortunately cluster.stats won't work for such a large data set, because it needs the full casewise dissimilarity information. I don't understand how you managed to produce a dist object of length of only 130,000 out of your data, but it certainly doesn't give all pairwise distance information for 260,000 points and therefore cannot be used in cluster.stats with a clustering vector of length 260,000 or so. Sorry, Christian On Sat, 14 Jun 2008, Laura Poggio wrote: Thank. See below. Laura 2008/6/14 Christian Hennig [EMAIL PROTECTED]: What does str(ddata) give? Class 'dist' atomic [1:130816] 69.2 117.1 145.6 179.9 195.6 ... dcent doesn't make sense as input for cluster.stats, because you need a dissimilarity matrix between all objects. Yes I know ... I simply try to see if something was changing with different structure of data Christian On Sat, 14 Jun 2008, Laura Poggio wrote: I am sorry I did not provide enough information. I am not using img later, but data that is data.frame. I wrote that img is a image just to explain what kind of data is coming from, but the object I am using is data and it is a data.frame (checked many times). I am not using as.dist, but dist in order to calculate the distance matrix among the data I have. Then the whole code I am using is: data - - as(img, data.frame)[1:1]#(where img is an image 256x256 px) kl - kmeans(data, 5) library(fpc) ddata - dist(data) dcent - dist(kl$centers) cluster.stats(ddata, kl$cluster) cluster.stats(dcent, kl$cluster) In both cases I got the same error: Error in as.dist(dmat[clustering == i, clustering == i]) : (subscript) logical subscript too long Below the structure of the different objects is detailed below: data is 'data.frame': 262144 obs. of 1 variable kl$centers is num [1:5, 1] kl$cluster is Named int [1:262144] I hope it is more informative. I am sorry but I did not find any explanation for the error message I am getting. Thank you very much in advance Laura 2008/6/14 Christian Hennig [EMAIL PROTECTED]: The given information is not enough to tell you what's going on. as.dist doesn't appear in the given code and it's not clear to me what kind of object img is (a small image doesn't tell me what R makes of it). Also, try to read the help pages first and find out whether img is of the format that is required by the functions. And check (using str for example) whether data is what you expect it to be. Christian On Sat, 14 Jun 2008, Laura Poggio wrote: Thank you very much for your answer. I tried to run the function on my data and now I am getting this message of error Error in as.dist(dmat[clustering == i, clustering == i]) : (subscript) logical subscript too long Below the code I am using (version2.7.0 of R with all packages updated): data - - as(img, data.frame)[1:1]#(where img is a small image 256 px x 256 px) kl - kmeans(data, 5) library(fpc) cluster.stats(data, kl$cluster) Thank you for any hints on the reasons and meaning of the error! Laura 2008/6/13 Christian Hennig [EMAIL PROTECTED]: Dear Laura, Dear list, I just tried to use the function cluster.stat in the package fpc. I just have a couple of questions about the syntax: cluster.stats(d,clustering,alt.clustering=NULL, silhouette=TRUE,G2=FALSE,G3=FALSE) 1) the distance object (d) is an object obtained by the function dist() on my own original matrix? d is allowed to be an object of class dist or a dissimilarity matrix. The answer to your question depends on what your original matrix is. If it is something on which you can compute a distance by dist(), you're right, at least if dist() delivers the distance you are interested in. 2) clustering is the clusters vector as result of one of the many clustering methods? The help page tells you what clustering can be. So it could be the clustering/partition vector of a clustering method or it could be something else. Note that cluster.stats doesn't depend on any particular clustering method. It computes the statistics regardless of where the clustering vector comes from. Best regards, Christian Thank you very much in advance and sorry for such basic question, but I did not manage to clarify my mind. Laura [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698
Re: [R] cluster.stats
Dear Laura, Dear list, I just tried to use the function cluster.stat in the package fpc. I just have a couple of questions about the syntax: cluster.stats(d,clustering,alt.clustering=NULL, silhouette=TRUE,G2=FALSE,G3=FALSE) 1) the distance object (d) is an object obtained by the function dist() on my own original matrix? d is allowed to be an object of class dist or a dissimilarity matrix. The answer to your question depends on what your original matrix is. If it is something on which you can compute a distance by dist(), you're right, at least if dist() delivers the distance you are interested in. 2) clustering is the clusters vector as result of one of the many clustering methods? The help page tells you what clustering can be. So it could be the clustering/partition vector of a clustering method or it could be something else. Note that cluster.stats doesn't depend on any particular clustering method. It computes the statistics regardless of where the clustering vector comes from. Best regards, Christian Thank you very much in advance and sorry for such basic question, but I did not manage to clarify my mind. Laura [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Cluster analysis with numeric and categorical variables
Dear Miha, a general way to do this is as follows: Define a distance measure by aggregating the Euclidean distance on the (X,Y)-space and the trivial 0-1 distance (0 if category is the same) on the categorial variable. Perform cluster analysis (whichever you want) on the resulting distance matrix. Note that there is more than one way to do this. The 0-1-distance could be incorporated in the definition of the Euclidean distance (instead of (x_i-y_i)^2), or a weighted average of the distances in X-, Y- and categorial space could be computed. Weights of variables (including possibly rescaling) have to be decided. How to do this precisely should depend on the subject matter and prior information about variable importance etc. In absence of such information, you may standardise the variablewise sums of squared pairwise distances to be equal. Hope this helps (and you can figure out the relevant R code yourself). Christian On Tue, 3 Jun 2008, Miha Staut wrote: Dear all, I would like to perform a clustering analysis on a data frame with two coordinate variables (X and Y) and a categorical variable where only a != b can be established. As far as I understood classification analyses, they are not an option as they partition the training set only in k classes of the test set. By searching through the book Modern Applied Statistics with S I did not find a satisfactory solution. I will be grateful for any suggestions. Best regards Miha __ can.html __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Clustering large data matrix
Hi there, whether clara is a proper way of clustering depends strongly on what your data are and particularly what interpretation or use you want for your clustering. You may do better with a hierarchical method after having defined a proper distance (however this would rather go into statistical consultation and not just R help). Assuming that you use some reasonable dimension reduction and clustering method, you may get a good visualization of you clustering using the methods available via functions plotcluster/discrproj in package fpc. Best, Christian On Thu, 6 Mar 2008, Dani Valverde wrote: Hello, I have a large data matrix (68x13112), each row corresponding to one observation (patients) and each column corresponding to the variables (points within an NMR spectrum). I would like to carry out some kind of clustering on these data to see how many clusters are there. I have tried the function clara() from the package cluster. If I use the matrix as is, I can perform the clara analysis but when I call clusplot() I get this error: Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) : 'princomp' can only be used with more units than variables Then, I reduce the dimensionality by using the function prcomp(). Then I take the 13 first principal components (80% variability) and I carry out the clara() analysis again. Then, I call the clusplot() function again and voilà!, it works. The problem is that clusplot() only represents the two first components of my prcomp() analysis, which represents only 15% of the variability. So, my questions are 1) is clara() a proper way to analyze such a large data set? and 2) Is there an appropiate method for graphic plotting of my data, that takes into account the whole variability if my data, not just two principal components? Many thanks. Best, Dani -- Daniel Valverde Saubí Grup de Biologia Molecular de Llevats Facultat de Veterinària de la Universitat Autònoma de Barcelona Edifici V, Campus UAB 08193 Cerdanyola del Vallès- SPAIN Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN) Grup d'Aplicacions Biomèdiques de la RMN Facultat de Biociències Universitat Autònoma de Barcelona Edifici Cs, Campus UAB 08193 Cerdanyola del Vallès- SPAIN +34 93 5814126 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche__ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R and Clusters
Dear Lorenzo, if I understand your posting correctly, this is exactly what Single Linkage clustering does if you cut the dendrogram tree at your threshold distance. Therefore you can use hclust with method=single (which produces the full dendrogram; you have to generate the Euclidean distances first) and then you feed the output into cutree with h=0.5 (your threshold), which produces a partition by cutting the dendrogram at the height 0.5. Hope this helps, Christian On Mon, 7 Jan 2008, Lorenzo Isella wrote: Dear All, I hope I am not asking a FAQ. I am dealing with a problem of graph theory [connected components in a non-directed graph] and I do not want to rediscover the wheel. I saw a large number of R packages dealing for instance with the k-means method or hierarchical clustering for spatially distributed data and I am basically facing a similar problem. I am given a set of data which are the positions of particles in 3 dimensions; I define two particles A and B to be directly connected if their Euclidean distance is below a certain threshold d. If A and B are directly connected and B and C are directly connected, then A,B and C are connected components (physically it means that they are members of the same cluster). All my N particles then split into k disjointed clusters, each with a certain number of connected components, and this is what I want to investigate. I do not know a priori how many clusters I have (this is my problem with e.g. k-means since k is an output for me); the only input is the set of 3-dimensional particle positions and a threshold distance. The algorithm/package I am looking should return the number of clusters and the composition of each cluster, e.g. the fact that the second cluster is made up of particles {R,T,L}. Consider for instance: # a 2-dimensional example x - rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) colnames(x) - c(x, y) How can I then find out how many connected components I have when my threshold distance is d=0.5? Many thanks Lorenzo __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] order a matrix
Dear list, order(x,y,z) returns a permutation to order x, ties broken by y, remaining ties broken by z. (And so on.) What I'd like to do is order(X), where X is a matrix (or a list or data frame if necessary) of unspecified size, which orders X[,1], ties broken by X[,2], remaining ties broken by X[,3] and so on - without having to know and to write down how many columns X has. Any ideas how to achieve that? Thanks, Christian *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] order a matrix
That solved it! Thank you very much! On Mon, 5 Nov 2007, Prof Brian Ripley wrote: On Mon, 5 Nov 2007, Christian Hennig wrote: Dear list, order(x,y,z) returns a permutation to order x, ties broken by y, remaining ties broken by z. (And so on.) What I'd like to do is order(X), where X is a matrix (or a list or data frame if necessary) of unspecified size, which orders X[,1], ties broken by X[,2], remaining ties broken by X[,3] and so on - without having to know and to write down how many columns X has. Any ideas how to achieve that? For a list or data frame, do.call(order, X). For a matrix, do.call(order, split(A, col(A))). -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595 *** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 [EMAIL PROTECTED], www.homepages.ucl.ac.uk/~ucakche __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.