[R] Axis scaling for PCA biplot

2022-11-15 Thread Christian Hennig
Hi there,

I'm puzzled about the axis scaling in the PCA biplot.

Here's an example.

library(pdfCluster) # package cepp seems to have the same data set.

data(oliveoil) # 572 observations 10 variables

olive <- oliveoil[,3:10] # numerical variables

prolive <- princomp(olive)
summary(prolive)
# Importance of components:
# Comp.1   Comp.2 Comp.3   Comp.4
# Standard deviation 479.7299024 150.82827868 45.394449751 27.522646558
# Proportion of Variance   0.8970072   0.08866821  0.008031707 0.002952451
# Cumulative Proportion    0.8970072   0.98567544  0.993707152 0.996659603
# Comp.5   Comp.6 Comp.7   Comp.8
# Standard deviation 24.78169442 1.196956e+01 7.1390744088 6.9756965249
# Proportion of Variance  0.00239367 5.584168e-04 0.0001986489 0.0001896608
# Cumulative Proportion   0.99905327 9.996117e-01 0.9998103392 1.00

plot(prolive$scores)

# The scaling of this plot reproduces the variances of the components
# given in the summary, as does cov(prolive$scores). This all seems
# fine, however...

biplot(prolive)

I have no idea what the numbers on the axes of the biplot are, at least 
not the larger ones. Chances are the smaller ones indicate the loadings. 
The larger ones are neither the same as in the first plot, nor are they 
standardised to one, but they seem to be standardised somehow, as the 
ranges on the x- and y-axes look the same, which they shouldn't if the 
variances represented the PCA eigenvalues.

Can anyone explain this to me?

Actually, the help page of biplot.princomp says something about this, but 
I can't get my head around it:

"scale

The variables are scaled by |lambda ^ scale| and the observations are 
scaled by |lambda ^ (1-scale)| where |lambda| are the singular values as 
computed by |princomp 
<https://www.rdocumentation.org/link/princomp?package=stats=3.6.2>|. 
Normally |0 <= scale <= 1|, and a warning will be issued if the 
specified |scale| is outside this range."

The default value of scale seems to be 1, but then (1 - scale) is zero, 
so I'd assume the observations to be unscaled; shouldn't that have 
reproduced the scale of the plain plot(prolive$scores) plot?

Thanks,

Christian

-- 
Christian Hennig
Dipartimento di Scienze Statistiche "Paolo Fortunati",
Universita di Bologna, phone +39 05120 98163
christian.hen...@unibo.it


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Cluster analysis using term frequencies

2015-03-24 Thread Christian Hennig

Dear Sun Shine,


dtes <- dist(tes.df, method = 'euclidean')
dtesFreq <- hclust(dtes, method = 'ward.D')
plot(dtesFreq, labels = names(tes.df))


However, I get an error message when trying to plot this: Error in 
graphics:::plotHclust(n1, merge, height, order(x$order), hang,  : invalid 
dendrogram input.


I don't see anything wrong with the code, so what I'd do is run
str(dtes) and str(dtesFreq) to see whether these are what they should be 
(or if not, what they are instead).


I'm clearly screwing something up, either in my source data.frame or in my 
setting hclust up, but don't know which, nor how.


I can't comment on your source data, but generally, whatever you do, use 
str() or even print() to check whether the R objects are all right or to 
see what went wrong.


More than just identifying the error, however, I am interested in finding a 
smart (efficient/elegant) way of checking the occurrence and frequency values 
of the terms that may be associated with 'sports', 'learning', and 
'extra-mural', and extracting these into a matrix or data frame so that I can 
analyse and plot their clustering to see whether the way I associated these 
terms is actually supported statistically.


The first thing that comes to my mind (not necessarily the best/most 
elegant) is to run...

dtes3 <- cutree(dtesFreq, 3)
...and to table dtes3 against your manual classification.
Note that 3 is the most natural number of clusters at which to cut the 
tree here, but it may not be the best match for your classification (for 
example, you may have a one-point cluster in the 3-cluster solution, so it 
may effectively be a two-cluster solution with an outlier). Your 
dendrogram, if you succeed in plotting it, may give you a hint about that.
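A minimal sketch of this suggestion, with simulated data and hypothetical hand labels standing in for your tes.df and manual classification:

```r
# Sketch with simulated stand-ins: cut the Ward tree into 3 clusters and
# cross-tabulate against a (hypothetical) manual classification.
set.seed(7)
tes.df <- data.frame(a = rnorm(30), b = rnorm(30))
manual <- rep(1:3, each = 10)  # hypothetical hand labels
dtesFreq <- hclust(dist(tes.df), method = "ward.D")
dtes3 <- cutree(dtesFreq, 3)
table(dtes3, manual)
```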


Hope this helps,
Christian




I'm sure that there must be a way of doing this in R, but I'm obviously not 
going about it correctly. Can anyone shine a light please?


Thanks for any help/ guidance.

Regards,
Sun




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
c.hen...@ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] clustering with hclust

2014-07-25 Thread Christian Hennig

Dear Marianna,

the function agnes in library cluster can compute Ward's method from a raw 
data matrix (at least this is what the help page suggests).


Also, you may not be using the most recent version of hclust. The most 
recent version has a note in its help page that states:


Two different algorithms are found in the literature for Ward clustering. 
The one used by option "ward.D" (equivalent to the only Ward option "ward" 
in R versions <= 3.0.3) does not implement Ward's (1963) clustering 
criterion, whereas option "ward.D2" implements that criterion (Murtagh and 
Legendre 2013). With the latter, the dissimilarities are squared before 
cluster updating. Note that agnes(*, method="ward") corresponds to 
hclust(*, "ward.D2").


The Murtagh and Legendre paper has more details on this and is here:
http://arxiv.org/abs/.6285
F. Murtagh and P. Legendre, Ward's hierarchical clustering method: 
clustering criterion and agglomerative algorithm
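The equivalence described in that note can be checked numerically; a quick sketch with simulated (tie-free) data:

```r
# ward.D2 squares the dissimilarities internally, so ward.D on squared
# Euclidean distances should give the same merge tree as ward.D2 on the
# plain distances, with heights related by squaring.
set.seed(42)
X <- matrix(rnorm(60), ncol = 2)
d <- dist(X)
h1 <- hclust(d^2, method = "ward.D")
h2 <- hclust(d,   method = "ward.D2")
identical(h1$merge, h2$merge)      # same tree topology
all.equal(h1$height, h2$height^2)  # heights agree after squaring
```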


It's not clear to me why one would want to use Ward's method for this kind 
of data, but that's your decision of course.


Best wishes,
Christian


On Fri, 25 Jul 2014, Marianna Bolognesi wrote:


Hi everybody, I have a problem with a cluster analysis.

I am trying to use hclust with method = "ward".

The Ward method works with SQUARED Euclidean distances.

Hclust demands a dissimilarity structure as produced by dist.

Yet, dist does not seem to produce a table of squared Euclidean distances 
when starting from cosines. In fact, computing the squared Euclidean 
distances from the cosines manually (d^2 = 2(1 - cos)) produces a 
different outcome.
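The relation used here holds exactly for unit-length vectors; a quick sketch:

```r
# For unit vectors, |x - y|^2 = 2 * (1 - cos(x, y)): squared Euclidean
# distance and cosine similarity carry the same information.
set.seed(2)
x <- rnorm(5); y <- rnorm(5)
x <- x / sqrt(sum(x^2))
y <- y / sqrt(sum(y^2))
cosxy <- sum(x * y)
all.equal(sum((x - y)^2), 2 * (1 - cosxy))  # TRUE
```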

As a consequence, using hclust with the Ward method on a table of cosines 
transformed into distances with dist produces a different dendrogram than 
other programs for hierarchical clustering with the Ward method (e.g. 
multidendrograms). Weird, right??

Computing manually the distances and then feeding them to hclust produces
an error message. So, I am wondering, what the hell is this dist function
doing?!

thanks!

marianna









[R] R CMD check Error: package MASS was built before R 3.0.0 - not true!?

2013-10-29 Thread Christian Hennig

Hi there,

I just updated my R to 3.0.2 and ran
R CMD check --as-cran on the just produced new version of fpc.

I got an error
Error: package MASS was built before R 3.0.0: please re-install it

- but I actually *did* re-install MASS without error just before that and 
within R library(MASS) works just fine.


What can I do about this?

Best wishes,
Christian




[R] R CMD build error with data files

2012-08-21 Thread Christian Hennig

Dear list,

I want to update my prabclus package which I haven't done for quite a 
while.


In the previous version, I had .dat files in my data subdirectory, which I 
read using .R files. Now R CMD check gives me a warning that .dat files 
are no longer accepted there.


So I changed my filenames to .txt, but some of these files are only there 
to be read by the .R files, not to be read independently by read.table, 
which produces an error (something like "scan error: line 5 does not have 
2 elements", which I'd guess comes from the fact that R tries to read the 
file automatically with read.table when it should be read in a different 
way, as specified in my .R file).


So I decided, following the advice in Writing R extensions, to put these 
files into a directory called inst/extdata.
The data directory now only has .R-files that read stuff from files in 
inst/extdata.

This, unfortunately, doesn't seem to work.
For example,
kykladspecreg.R in data has one line reading
kykladspecreg <- 
read.table(system.file("extdata", "kykladspecreg.txt", package = "prabclus"))


R CMD build gives
Warning in file(file, "rt") :
  file() only supports open = "w+" and open = "w+b": using the former
Error in read.table(system.file("extdata", "kykladspecreg.txt", package = 
"prabclus")) :
  no lines available in input

Same if I use 
read.table(system.file("inst", "extdata", "kykladspecreg.txt", package = "prabclus"))


What's wrong here?
Now what to do with the files to be read by the R-files?
In the old days all was fine having them as dat files in the data 
directory. :-(


By the way, R CMD check also produces a warning
* checking PDF version of manual ... WARNING
LaTeX errors when creating PDF version.
This typically indicates Rd problems.
...with no useful hint whatsoever as to what the problem may be. If anyone 
has an idea how to find the problem, please tell me!


Thanks,
Christian



*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



[R] lda, collinear variables and CV

2012-07-26 Thread Christian Hennig

Dear R-help list,

apparently lda from the MASS package can be used in situations with 
collinear variables. It only produces a warning then but at least it 
defines a classification rule and produces results.


However, I can't find on the help page how exactly it does this. I have a 
suspicion (it may look at the hyperplane containing the class means, 
using some kind of default/trivial within-group covariance matrix) but I'd 
like to know in detail if possible.


I find it particularly puzzling that it produces different results 
depending on whether I choose CV=TRUE or run a manual LOO cross-validation.


Constructing an example, I realised that I'm puzzled about CV=TRUE not 
only in the collinear case. The example is below. It also produces 
different (though rather similar) results for p=10 (no longer collinear).


See here:

library(MASS)
set.seed(12345)
n <- 50
p <- 200 # or p <- 10
testdata <- matrix(ncol = p, nrow = n)
for (i in 1:p)
  testdata[, i] <- rnorm(n)
class <- as.factor(c(rep(1, 25), rep(2, 25)))

lda1 <- lda(x = testdata, grouping = class, CV = TRUE)
table1 <- table(lda1$class, class)


y.lda <- rep(NA, n)
for (i in 1:n) {
  testset <- testdata[i, , drop = FALSE]
  trainset <- testdata[-i, ]
  model.lda <- lda(x = trainset, grouping = class[-i])
  y.lda[i] <- predict(model.lda, testset)$class
}
table2 <- table(y.lda, class)


table1

   class
 1  2
  1 14 16
  2 11  9


table2

 class
y.lda  1  2
1 15 10
2 10 15

With p=10:

table1

   class
 1  2
  1 10 11
  2 15 14

table2

 class
y.lda  1  2
1 10 12
2 15 13


Any explanation?

Best regards,
Christian





Re: [R] Latent class multinomial (or conditional) logit using R?

2011-12-23 Thread Christian Hennig
You may have a look at lcmixed/flexmixedruns in package fpc. This provides 
flexmix methods that can be used to fit latent class models with locally 
independent multinomials (no logits, though, and I'm not sure that for 
categorical data only this is much different from what poLCA offers).


Best regards,
Christian

On Fri, 23 Dec 2011, adanmartinez wrote:


Hi everyone,

Does anybody know how can I estimate a
Latent class multinomial (or conditional) logit using R?

I have tried flexmix, poLCA, and
they do not seem to support this model.
thanks in advance

adan








Re: [R] removing outliers in non-normal distributions

2011-09-28 Thread Christian Hennig

Dear Ben,

not specifically an R-related response, but the best philosophy of 
defining outliers that I know of is in Davies and Gather, "The 
identification of multiple outliers", JASA 1993.


The idea is that you can only properly define what an outlier is relative 
to a reference distributional shape. Note that the reference 
distributional shape is *not* what you believe the underlying distribution 
is, but rather a device to define outliers as points that clearly exceed 
the extremes that are normally to be expected under the reference shape.


If your reference distributional shape is the normal, you need to set up a
robust outlier identification rule that has a low probability of finding 
*any* outlier if the data really come from a normal. Basically, declaring 
everything outside median +/- c*MAD an outlier will work (the Hampel 
identifier), but c needs to depend on the size of the dataset, calibrated 
so that, for example, under the normal the probability of finding any 
outlier is 0.05 (or whatever you want; note that specifying the borderline 
is always left to the user and is to some extent arbitrary). Some values 
for c are given in Davies and Gather.
There are some slightly more efficient alternatives, but the Hampel 
identifier is simple and still good.
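A minimal sketch of such a rule; the constant c = 3 here is purely illustrative, not one of the calibrated values from Davies and Gather:

```r
# Hampel-type identifier: flag points outside median +/- c * MAD.
# c = 3 is an illustrative choice; in practice c should be calibrated
# to the sample size as described above.
hampel_outliers <- function(x, c = 3) {
  m <- median(x)
  s <- mad(x)  # MAD, scaled to be consistent at the normal
  which(abs(x - m) > c * s)
}
set.seed(1)
x <- c(rnorm(100), 10)  # one gross outlier at index 101
hampel_outliers(x)      # index 101 is among the flagged points
```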


The same principle can be applied (although this is not done in the cited 
paper) if your reference distribution is different. This may, for example, 
make sense if you have skewed data and want a skewed outlier 
identification rule. It should be more or less straightforward to adapt 
the ideas to a lognormal reference distribution. For other distributions 
I'm not sure whether literature exists, but it may; a lot has happened 
since 1993.


Hope this helps,
Christian

On Wed, 28 Sep 2011, Ben qant wrote:


Hello,

I'm seeking ideas on how to remove outliers from a non-normal distribution
predictor variable. We wish to reset points deemed outliers to a truncated
value that is less extreme. (I've seen many posts requesting outlier removal
systems. It seems like most of the replies center around why do you want to
remove them, you shouldn't remove them, it depends, etc. so I've tried
to add a lot of notes below in an attempt to answer these questions in
advance.)

Currently we Winsorize using the quantile function to get the new high and
low values to set the outliers to on the high end and low end (this is
summarized legacy code that I am revisiting):

# Get the truncated values for resetting:
lowq <- quantile(dat, probs = perc_low, na.rm = TRUE)
hiq <- quantile(dat, probs = perc_hi, na.rm = TRUE)

# Reset the highest and lowest values to the truncated values:
dat[dat < lowq] <- lowq
dat[dat > hiq] <- hiq

What I don't like about this is that it always truncates values (whether
they truly are outliers or not) and the perc_low and perc_hi settings are
arbitrary. I'd like to be more intelligent about it.

Notes:
1) Ranking has already been explored and is not an option at this time.
2) Reminder: these factors are almost always distributed non-normally.
3) For reason I won't get into here, I have to do this pragmatically. I
can't manually inspect the data each time I remove outliers.
4) I will be removing outliers from candidate predictor variables.
Predictors variable distributions all look very different from each other,
so I can't make any generalizations about them.
5) As #4 above indicates, I am building and testing predictor variables for
use in a regression model.
6) The predictor variable outliers are usually somewhat informative, but
their extremeness is a result of the predictor variable calculation. I
think extremeness takes away from the information that would otherwise be
available (outlier effect). So I want to remove some, but not all, of their
extremeness. For example, percent change of a small number: from say 0.001
to 500. Yes, we want to know that it changed a lot, but 49,999,900% is not
helpful and masks otherwise useful information.

I'd like to hear your ideas. Thanks in advance!

Regards,

Ben








Re: [R] mclust: modelName=E vs modelName=V

2011-09-06 Thread Christian Hennig
I probably don't understand the problem. I'd assume that variance$sigmasq 
contains the three estimated component variances (probably estimated by 
maximum a posteriori; consult the mclust documentation).

What's wrong with that?

(The values you submit as scale in prior are not fixed variances but 
parameters of the prior distribution; your problem may be that you 
believe they are meant to be variances fixed by you!?)


Christian

On Tue, 6 Sep 2011, Nico902 wrote:


Hi,

Thanks a lot for your answer. I was indeed able to get rid of this
message by doing:


resClust <- Mclust(data, G = 3, modelName = "V",
                   prior = priorControl(scale = c(1.44, 0.81, 0.49)))



However, I would like to be able to retrieve the variances I defined in the
result. I found:


resClust$parameters

$Vinv
NULL

$pro
[1] 0.5502496 0.1986852 0.2510652

$mean
   1 2 3
-2.8390006980 -0.0003267873  3.1072574619

$variance
$variance$modelName
[1] "V"

$variance$d
[1] 1

$variance$G
[1] 3

$variance$sigmasq
[1] 0.840267666 0.009466821 1.510263146

$variance$scale
[1] 0.840267666 0.009466821 1.510263146


I cannot work out where the sigmasq values come from. I tried taking the 
square root and the square of sigmasq, but neither corresponds to what I 
defined, and I found nothing in the manual. If I am missing something 
obvious, or if somebody has the solution, it would help me a lot. I want 
to retrieve those values automatically in order to plot the fitted curves 
and to be sure this is doing what I want.

Thank you very much again.








Re: [R] mclust: modelName=E vs modelName=V

2011-09-04 Thread Christian Hennig
This normally happens if the algorithm gets caught in a solution where one 
of the components has variance converging to zero.


One way of dealing with this is to use a prior that penalises too-small 
variances. This works through the prior argument of Mclust (the 
defaultPrior should do the trick, but I currently don't have the time to 
figure out again how to do this precisely; I have done it before with 
success).


Another option is to have a look at the flexmix package.

Best regards,
Christian

On Sun, 4 Sep 2011, Nico902 wrote:


Hi,

I'm trying to use the mclust library for a Gaussian mixture on a numeric
vector. The function Mclust(data, G=3) works fine, but the fit is not
optimal and uses modelNames="E". When I try
Mclust(data, G=3, modelName="V") I get the following message:

Error in if (Sumry$G > 1) ans[c(orderedNames, "z")] else ans[orderedNames] :
  argument is of length zero
In addition: Warning message:
In pickBIC(object[as.character(G), modelNames, drop = FALSE], k = 3) :
  none of the selected models could be fitted


Using variable variance would fit my data better, any idea how to do it?

Thanks a lot.








Re: [R] Mixture of Regressions

2011-08-28 Thread Christian Hennig

I think that it should be possible to do this with the flexmix package.

Christian

On Fri, 26 Aug 2011, Andra Isan wrote:


Dear All,

I have two hidden classes in my data set that I would like to learn based 
on a mixture of binomial regressions model. I was wondering whether there 
is a package that I can use for this purpose. Has anyone tried a package 
for mixture models?

Thanks a lot,
Andra







Re: [R] Clustering Large Applications..sort of

2011-08-10 Thread Christian Hennig
There are a number of methods in the literature for deciding the number of 
clusters for k-means. Probably the most popular one is the Calinski and 
Harabasz index, implemented as calinhara in package fpc. A distance-based 
version (and several other indexes for this purpose) is in function 
cluster.stats in the same package.


Christian

On Wed, 10 Aug 2011, Ken Hutchison wrote:


Hello all,
  I am using the clustering functions in R to work with large masses of
binary time series data, but the clustering functions do not seem able to
handle a practical problem of this size. hclust is good in that I do not
want to make assumptions about the number of clusters present, but due to
computational resources and time it is not good enough (and it may be
sub-par for this size of problem, thus doubly poor for this application).
k-means works fine but assumes the number of clusters within the data,
which is not realistic. The silhouette functions in 'pam', 'clara' and (if
I remember correctly) 'cluster' seem to perform really badly in thorough
experiments with generated data containing known clusters. I am left then
with either theoretical abstractions such as pruning hclust trees with
minimal spanning trees, or perhaps hand-rolling a hierarchical k-medoids,
which works extremely efficiently and without cluster number assumptions.
Does anybody have suggestions for libraries I may have missed, or
suggestions in general? Note: this is not a question for 'bigkmeans'
unless there also exists a 'findbigkmeansnumberofclusters' function.
   Thank you in advance for your assistance,
Ken








Re: [R] Clustering Large Applications..sort of

2011-08-10 Thread Christian Hennig
PS to my previous posting: also have a look at kmeansruns in fpc. This 
runs kmeans for several numbers of clusters and decides the number of 
clusters by either Calinski-Harabasz or average silhouette width.


Christian







[R] k-nn hierarchical clustering

2011-06-09 Thread Christian Hennig

Hi there,

is there any R function for k-nearest-neighbour agglomerative hierarchical
clustering?
By this I mean standard agglomerative hierarchical clustering as in hclust 
or agnes, but with the k-nearest-neighbour distance used between clusters 
on the higher levels, where there are at least k distances between two 
clusters (single linkage is 1-nearest-neighbour clustering)?


Best regards,
Christian




Re: [R] R and DBSCAN

2011-06-08 Thread Christian Hennig

Dear Paco,

I tried dbscan on my computer with method = "hybrid" and a 155000*3 
data matrix, and it works. It needs some time, though.

(You can track the progress using something like
countmode=c(1:10,100,1000,1,10).)
Note that, for some reason I don't exactly understand, it takes *much* 
longer for 1-dimensional data (I need to look into this), so if you have 
tried only 1-d data so far, it may be worth running the whole thing on 
the full 3-d dataset.

So I'm not sure what goes wrong on your side. Perhaps look at str(sst2) in 
order to make sure that it is what you think it is.


I can't advise you on how precisely to take longitude and latitude into 
account because this depends on your application and would probably 
require professional statistical advisory that is much more than just 
R-help. Note however that dbscan treats all variables equally.


Best wishes,
Christian

On Tue, 7 Jun 2011, Paco Pastor wrote:


Hello Christian

Thanks for answering. Yes, I have tried dbscan from fpc but I'm still stuck 
on the memory problem. Regarding your answer, I'm not sure which memory 
parameter I should look at. Following is the code I tried with dbscan 
parameters; maybe you can see if there is any mistake.


sstdat=read.csv("sst.dat", sep=";", header=FALSE, col.names=c("lon","lat","sst"))

library(fpc)
sst1=subset(sstdat, sst<50)
sst2=subset(sst1, lon>-6)
sst2=subset(sst2, lon<40)
sst2=subset(sst2, lat<46)

dbscan(sst2$sst, 0.1, MinPts = 5, scale = FALSE, method = "hybrid",
seeds = FALSE, showplot = FALSE, countmode = NULL)
Error: cannot allocate vector of size 858.2 Mb

head(sst2)

lon   lat   sst
1257 35.18 24.98 26.78
1258 35.22 24.98 26.78
1259 35.27 24.98 26.78
1260 35.31 24.98 26.78
1261 35.35 24.98 26.78
1262 35.40 24.98 26.85


In this example I only apply dbscan to temperature values, not lon/lat, so 
the eps parameter is 0.1. As it is a gridded data set, any point is surrounded 
by eight data points, so I thought that at least 5 of the surrounding points 
should be within the reachability distance. But I'm not sure I'm taking the 
right approach by only considering the temperature value; maybe I'm then 
missing spatial information. How should I deal with longitude and latitude data?


dimensions of sst2 are: 152243 rows x 3 columns

Thanks again

El 03/06/2011 18:24, Christian Hennig escribió:
Have you considered the dbscan function in library fpc, or was it another 
one?

dbscan in fpc doesn't have a distance parameter but several options, one
of which may resolve your memory problem (look up the documentation of the 
memory parameter).


Using a distance matrix for hundreds of thousands of points is a recipe for 
disaster (memory-wise). I'm not sure whether the function that you used did 
that, but dbscan in fpc can avoid it.


It is true that dbscan requires tuning constants that the user has to 
provide. There is unfortunately no general rule how to do this; it would be 
necessary to understand the method and the meaning of the constants, and 
how this translates into the requirements of your application.


You may try several different choices and do some cluster validation to see 
what works, but I can't explain this in general terms easily via email.


Hope this helps at least a bit.

Best regards,
Christian


On Fri, 3 Jun 2011, Paco Pastor wrote:


Hello everyone,

When looking for information about clustering of spatial data in R I was 
directed towards DBSCAN. I've read some docs about it and new 
questions have arisen.


DBSCAN requires some parameters, one of them being a distance. As my data 
are three-dimensional (longitude, latitude and temperature), which distance 
should I use? Which dimension is related to that distance? I suppose it 
should be temperature. How do I find such a minimum distance with R?


Another parameter is the minimum number of points needed to form a cluster. 
Is there any method to find that number? Unfortunately I haven't found one.


Searching through Google I could not find an R example of using dbscan on 
a dataset similar to mine. Do you know any website with such examples, 
so I can read them and try to adapt them to my case?


The last question is that my first R attempt with DBSCAN (without a proper 
answer to the prior questions) resulted in a memory problem: R says it 
cannot allocate a vector. I start with a 4 km spaced grid with 779191 points 
that ends in approximately 30 rows x 3 columns (latitude, longitude 
and temperature) when removing invalid SST points. Any hint to address 
this memory problem? Does it depend on my computer or on DBSCAN itself?


Thanks for the patience to read a long and probably boring message and for 
your help.


--
---
Francisco Pastor
Meteorology department, Instituto Universitario CEAM-UMH
http://www.ceam.es
---
mail: p...@ceam.es
skype: paco.pastor.guzman
Researcher ID: http://www.researcherid.com/rid/B-8331-2008
Cosis profile: http://www.cosis.net/profile/francisco.pastor
---
Parque Tecnologico

Re: [R] R and DBSCAN

2011-06-03 Thread Christian Hennig



*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


Re: [R] Latent class analysis, selection of the number of classes

2011-05-24 Thread Christian Hennig

Dear Daniel,

the BIC can be used to estimate the number of classes. It is actually 
output by lca, so you could run lca with several different k and pick 
the solution that gives you the best BIC.
Unfortunately I can't tell you whether "large is good" or "small is good" 
for the BIC implementation of lca, because both versions are found 
in the literature, BIC defined with a positive or a negative sign. (I think 
that if there is any standard, it should rather be "large is good"; you can 
certainly check this by looking up the value of the log-likelihood and a 
definition of the BIC in a book. "Large is good" if the likelihood enters 
the definition with a positive sign.)

With a bit of experimentation you should be able to find out which way 
round it is, or you may ask the e1071 maintainer.
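
A sketch of that loop over k, with a synthetic binary matrix; whether the
fitted object stores the criterion as $bic (and its sign convention) is an
assumption to check via str():

```r
library(e1071)

set.seed(1)
x <- matrix(rbinom(200 * 4, 1, 0.5), ncol = 4)  # synthetic binary data

# fit lca for several numbers of classes and compare the reported BIC;
# the field name $bic is an assumption -- inspect str(fits[[1]]) to confirm
fits <- lapply(1:3, function(k) lca(x, k))
sapply(fits, function(f) f$bic)
```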


Hope this helps (actually I may have missed if somebody responded before),
Christian

On Mon, 23 May 2011, Daniel Malter wrote:


Hi,

I perform latent class analysis on a matrix of dichotomous variables to
create an indicator of class/category membership for each observation. I
would like to know whether there is a function that selects the best fit in
terms of number of classes/categories.

Currently, I am doing this with the lca() function of the e1071 package.
This function requires me to specify the number of classes and to compare
fit statistics for each run of lca. This becomes somewhat cumbersome the
more variables the data matrix contains and, thus, the greater the number of
possible classes is. I was wondering whether there is an alternative
implemented in a different package that does exactly that.

Thanks,
Daniel

--
View this message in context: 
http://r.789695.n4.nabble.com/Latent-class-analysis-selection-of-the-number-of-classes-tp3545538p3545538.html
Sent from the R help mailing list archive at Nabble.com.




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] Latent class analysis, selection of the number of classes

2011-05-24 Thread Christian Hennig
Just a quick, slightly critical remark about this: the BIC and the CAIC 
are *by definition* pretty much the same, so it shouldn't be interpreted 
as some kind of additional confirmation if they point to the same number 
of classes; it's rather (more or less) the same information twice.
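
For concreteness: with log-likelihood ll, p free parameters and n
observations, the two criteria differ only by an additive p, so they nearly
always rank models the same way:

```r
# BIC and CAIC in the "small is good" convention
bic  <- function(ll, p, n) -2 * ll + p * log(n)
caic <- function(ll, p, n) -2 * ll + p * (log(n) + 1)

caic(-100, 5, 200) - bic(-100, 5, 200)  # the difference is just p = 5
```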


Christian

On Tue, 24 May 2011, David Joubert wrote:



I have used PoLCA for this purpose, and not the e1071 package.
You should use a variety of fit indices to choose the number of classes. The 
BIC may not always be the best choice, depending on your sample size and 
frequency table. In the best case, AIC, CAIC and BIC values agree as to the 
optimal number of classes. The Cressie-Read statistic is useful with sparse 
tables, but I haven't found a way to obtain it in R. If you're a coder there 
might be a way to write a function to obtain it.

With poLCA, you can quickly call output for a range of latent classes and 
evaluate solutions. The commands are very simple.

David Joubert



Date: Tue, 24 May 2011 12:30:01 +0100
From: chr...@stats.ucl.ac.uk
To: dan...@umd.edu
CC: r-help@r-project.org
Subject: Re: [R] Latent class analysis, selection of the number of classes

Dear Daniel,

the BIC can be used to estimate the number of classes. This is actually
given out by lca, so you could run lca with several different k and pick
the solution that gives you the best BIC.
Unfortunately I can't tell you whether large is good or small is good
for the BIC implementation of lca, because there are both versions found
in the literature, BIC with positive and negative sign. (I think that if
there is any standard, then it should rather be large is good; you
certainly can check it looking up the values of the loglikelihood and a
definition of the BIC in a book. Large is good if the likelihood is
used in the definition with a positive sign.)

With a bit of experimentation it should be able to find out which way
round it is, or you may ask the e1071-maintainer.

Hope this helps (actually I may have missed if somebody responded before),
Christian

On Mon, 23 May 2011, Daniel Malter wrote:


Hi,

I perform latent class analysis on a matrix of dichotomous variables to
create an indicator of class/category membership for each observation. I
would like to know whether there is a function that selects the best fit in
terms of number of classes/categories.

Currently, I am doing this with the lca() function of the e1071 package.
This function requires me to specify the number of classes and to compare
fit statistics for each run of lca. This becomes somewhat cumbersome the
more variables the data matrix contains and, thus, the greater the number of
possible classes is. I was wondering whether there is an alternative
implemented in a different package that does exactly that.

Thanks,
Daniel

--
View this message in context: 
http://r.789695.n4.nabble.com/Latent-class-analysis-selection-of-the-number-of-classes-tp3545538p3545538.html
Sent from the R help mailing list archive at Nabble.com.




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche






*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] pam() clustering for large data sets

2011-05-17 Thread Christian Hennig

Dear Lilia,

I'm not sure whether this is particularly helpful in your situation, but 
sometimes it is possible to emulate the same (or approximately the same) 
distance measure as Euclidean distance between points that are 
somehow rescaled and retransformed. In this case, you can rescale and 
retransform your original data from which you computed the distances, and 
use clara, which then implicitly computes Euclidean distances.


Of course whether this works depends on the nature of your data and the 
distance measure that you want to use.


Another possibility is to draw a random subset of, say, 3,000 
observations, run pam on it, and assign the remaining ones to their 
closest medoid manually. Actually this is about what clara does anyway.
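
A sketch of that subsample-then-assign route, using a small synthetic
distance matrix in place of the real one (the subsample of 100 stands in for
the 3,000 suggested above, and k = 4 is purely illustrative):

```r
library(cluster)

set.seed(1)
dat <- matrix(rnorm(500 * 3), ncol = 3)      # stand-in for the real data
d   <- as.matrix(dist(dat))                  # the precomputed distances

n   <- nrow(d)
sub <- sample(n, 100)                        # random subsample
fit <- pam(as.dist(d[sub, sub]), k = 4)      # k-medoids on the subsample
medoids <- sub[fit$id.med]                   # medoid indices in the full data

# assign every observation to its closest medoid via the precomputed distances
clustering <- apply(d[, medoids, drop = FALSE], 1, which.min)
```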


Best regards,
Christian

On Mon, 16 May 2011, Lilia Nedialkova wrote:


Hello everyone,

I need to do k-medoids clustering for data which consists of 50,000
observations.  I have computed distances between the observations
separately and tried to use those with pam().

I got the "cannot allocate vector of length" error and I realize this
job is too memory intensive.  I am at a bit of a loss on what to do at
this point.

I can't use clara(), because I want to use the already computed distances.

What is it that people do to perform clustering for such large data sets?

I would greatly appreciate any form of suggestions that people may have.

Thank you very much in advance.




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] GLM and normality of predictors

2011-04-15 Thread Christian Hennig
Normality of the predictors doesn't belong to the assumptions of the GLM, 
so you don't have to check this.


Note, however, that there are all kinds of potential problems that are 
fairly hopeless to detect with n=11 and three predictors, so you shouldn't 
be too confident about your results anyway.


Christian

On Fri, 15 Apr 2011, Simone Santoro wrote:



Hi,

I have found quite a few posts on normality checking of response variables, but 
I am still in doubt about this. As you will easily see, I'm not a 
statistician, so please be patient.
I want to estimate the possible effects of some predictors on my response 
variable, which is the number of males and females (cbind(males,females)), so it 
would be:

fullmodel <- glm(cbind(males,females) ~ pred1 + pred2 + pred3, family = binomial)

I have n = 11 (ecological data; small sample size is a frequent problem!).

Someone told me that I have to check the predictors for normality (and 
transform them to reach normality if needed), but I doubt that a 
normality test can be very informative with such a small sample size.
Also, I have read that a normality test (Shapiro, Kolmogorov, Durbin, etc.) 
can't tell you that the distribution is normal, only that there is no 
evidence of non-normality.
Anyway, I am still looking for some sort of rule of thumb to be used in such 
cases.

The question: is there some simple advice on how one should proceed in these 
cases to be reasonably confident of the results?

Thanks for any help you might provide





*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] Bayesian PCA

2011-04-12 Thread Christian Hennig

Dear Lucy,

not an R-related response at all, but if it's questionnaire data, I'd 
probably try to do dimension reduction in a non-automated way by defining 
a number of 10 or so meaningful scores that summarise your questions.
Dimension reduction is essentially about how to aggregate 
the given information into low-dimensional measurements, which 
in my opinion should be driven by the research aim and the 
meaning of the variables rather than by the distribution of the data, if at 
all possible.

You can then use PCA to examine the remaining dimensions.
Christian

On Tue, 12 Apr 2011, Lucy Asher wrote:

First of all I should say this email is more of a general statistics 
questions rather than being specific to using R but I'm hoping that this 
may be of general interest.


I have a dataset that I would really like to use PCA on and have been 
using the package pcaMethods to examine my data. The results using 
traditional PCA come out really nicely. The dataset is comprised of a 
set of questions on dog behaviour answered by their handlers. The 
questions fall into distinct components which make biological sense, and 
the residuals are reasonably small. Now the problem: I don't have a big 
enough sample to run traditional PCA. I have about 40 dogs and 60 
questions, so whichever way you look at it, not enough. There is past 
data available on some of the questions and the relationships between 
them, so I was wondering whether Bayesian PCA would be a useful 
alternative, using past research to inform my priors. I wondered if 
anyone knew whether Bayesian PCA was better suited to smaller datasets 
than traditional (ML) PCA? If not, does anyone know of packages 
in R that can do dimension reduction on datasets with small sample 
sizes?


Many Thanks,

Lucy




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Help in kmeans

2011-04-06 Thread Christian Hennig
I'm not going to comment on column names, but this is just to make you 
aware that the results of k-means depend on random initialisation.


This means that it is possible that you get different results if you run 
it several times. It basically gives you a local optimum and there may be 
more than one of these.

Use set.seed to see whether this explains your problem.
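
To see this with the built-in iris data (whose column names differ slightly
from those in the question): with the seed fixed, index-based and name-based
selection give the same partition, because they select the same columns:

```r
set.seed(42)
m1 <- kmeans(iris[, c(1, 2, 3, 4)], centers = 3)

set.seed(42)
m2 <- kmeans(iris[, c("Sepal.Length", "Sepal.Width",
                      "Petal.Length", "Petal.Width")], centers = 3)

all.equal(m1$cluster, m2$cluster)  # TRUE: same columns, same seed
```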

Best regards,
Christian

On Wed, 6 Apr 2011, Raji wrote:


Hi All,

 I was using the following command for performing kmeans for Iris dataset.

Kmeans_model <- kmeans(dataFrame[,c(1,2,3,4)], centers=3)

This was giving proper results for me. But, in my application we generate
the R commands dynamically and there was a requirement that the column names
will be sent instead of column indices to the R commands.Hence, to
incorporate this, i tried using the R commands in the following way.

kmeans_model <- kmeans((SepalLength+SepalWidth+PetalLength+PetalWidth), centers=3)

or

kmeans_model <- kmeans(as.matrix(SepalLength,SepalWidth,PetalLength,PetalWidth), centers=3)

In both the ways, we found that the results are different from what we saw
with the first command (with column indices).

Can you please let us know what is going wrong here? If so, can you please 
let us know how the column names can be used in kmeans to obtain the correct 
results?

Many thanks,
Raji

--
View this message in context: 
http://r.789695.n4.nabble.com/Help-in-kmeans-tp3430433p3430433.html
Sent from the R help mailing list archive at Nabble.com.




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] Cluster analysis, factor variables, large data set

2011-03-31 Thread Christian Hennig

Dear Hans,

clara doesn't require a distance matrix as input (and therefore doesn't 
require you to run daisy), it will work with the raw data matrix using

Euclidean distances implicitly.
I can't tell you whether Euclidean distances are appropriate in this 
situation (this depends on the interpretation of the variables and 
particularly on how they are scaled), but they may be fine at least after 
some transformation and standardisation of your variables.
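
A minimal sketch under those assumptions (x stands in for a suitably coded
numeric data matrix; stand = TRUE lets clara do the standardisation, and
k = 6 is purely illustrative):

```r
library(cluster)

set.seed(1)
x <- matrix(rnorm(1000 * 5), ncol = 5)  # stand-in for the coded data

# clara works on the raw data matrix with implicit Euclidean distances;
# stand = TRUE standardises the columns, samples controls the subsampling
cl <- clara(x, k = 6, stand = TRUE, samples = 50)
table(cl$clustering)
```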


Hope this helps,
Christian

On Thu, 31 Mar 2011, Hans Ekbrand wrote:


Dear R helpers,

I have a large data set with 36 variables and about 50,000 cases. The
variables represent labour market status during 36 months; there are 8
different variable values (e.g. Full-time Employment, Student, ...).

Only cases with at least one change in labour market status are
included in the data set.

To analyse sub sets of the data, I have used daisy in the
cluster-package to create a distance matrix and then used pam (or pamk
in the fpc-package), to get a k-medoids cluster-solution. Now I want
to analyse the whole set.

clara is said to cope with large data sets, but the first step in the
cluster analysis, the creation of the distance matrix must be done by
another function since clara only works with numeric data.

Is there an alternative to the daisy - clara route that does not
require as much RAM?

What functions would you recommend for a cluster analysis of this kind
of data on large data set?


regards,

Hans Ekbrand




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] mixture models/latent class regression comparison

2011-02-28 Thread Christian Hennig

Dear Carson,

I have never used mmlcr for this, but quite generally when fitting such 
models the likelihood often has very many local optima. This means that 
the result of the EM (or a similar) algorithm depends on the 
initialisation, which in flexmix (and perhaps also in mmlcr) is done in a 
random fashion. Results may therefore differ even if the same method 
is applied twice, and unfortunately, depending on the dataset, the result 
may be quite unstable. This may explain why the two functions give you 
strongly different results, without of course implying that one of them is 
generally better.


Best regards,
Christian

On Mon, 28 Feb 2011, Carson Farmer wrote:


Dear list,

I have been comparing the outputs of two packages for latent class
regression, namely 'flexmix', and 'mmlcr'. What I have noticed is that
the flexmix package appears to come up with a much better fit than the
mmlcr package (based on logLik, AIC, BIC, and visual inspection). Has
anyone else observed such behaviour? Has anyone else been successful
in using the mmlcr package? I ask because I am interested in latent
class negative binomial regression, which the mmlcr package appears to
support, however, the results for basic Poisson latent class
regression appear to be inferior to the results from flexmix. Below is
a simple reproducible example to illustrate the comparison:

library(flexmix)
library(mmlcr)
data(NPreg) # from package flexmix
m1 <- flexmix(yp ~ x, k=2, data=NPreg, model=FLXMRglm(family='poisson'))
NPreg$id <- 1:200 # mmlcr requires an id column
m2 <- mmlcr(outer=~1|id, components=list(list(formula=yp~x,
class="poisonce")), data=NPreg, n.groups=2)

# summary and coefficients for flexmix model
summary(m1)
summary(refit(m1))

# summary and coefficients for mmlcr model
summary(m2)
m2

Regards,

Carson

P.S. I have attached a copy of the mmlcr package with a modified
mmlcr.poisonce function due to errors in the version available here:
http://cran.r-project.org/src/contrib/Archive/mmlcr/. See also
http://jeldi.com/Members/jthacher/tips-and-tricks/programs/r/mmlcr,
section "Bugs?", subsection "Poisson".

--
Carson J. Q. Farmer
ISSP Doctoral Fellow
National Centre for Geocomputation
National University of Ireland, Maynooth,
http://www.carsonfarmer.com/



*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] Jaccard dissimilarity matrix for PCA

2010-12-28 Thread Christian Hennig

jaccard in package prabclus computes a Jaccard matrix for you.

By the way, if you want to do hierarchical clustering, it doesn't seem to 
me to be a good idea to run PCA first. Why 
not cluster the dissimilarity matrix directly, without the information loss 
caused by PCA? (I shouldn't make too general a statement on this, because 
how to cluster data always depends on the aim of clustering, the cluster 
concept you are interested in, etc.)
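
A sketch of that direct route (x stands in for a binary cases-by-variables
matrix; note that prabclus::jaccard computes distances between the columns
of its argument, hence the transpose, which is worth double-checking
against the help page):

```r
library(prabclus)

set.seed(1)
x <- matrix(rbinom(30 * 10, 1, 0.6), nrow = 30)  # 30 cases, 10 binary variables

jd <- jaccard(t(x))                  # Jaccard distances between the 30 cases
h  <- hclust(as.dist(jd), method = "average")
plot(h)
```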


prabclus also contains clustering methods for such data; have a 
look at the functions prabclust and hprabclust (however, they are 
documented as functions for clustering species distribution ranges, so if 
your application is different, you may have to think about whether and how 
to adapt them).


Hope this helps,
Christian




On Tue, 28 Dec 2010, Flabbergaster wrote:



Hi
I have a large dataset containing a wide range of binary variables.
I would like first of all to compute a Jaccard matrix, then do a PCA on this
matrix, so that I can finally do a hierarchical clustering on the principal
components.
My problem is that I don't know how to compute the Jaccard dissimilarity
matrix in R. Which package should I use, and so on?
Can anybody help me?
Alternatively, I'm searching for another way to explore the clusters present
in my data.
Another problem is that I have cases with missing values on different
variables.

Jacob
--
View this message in context: 
http://r.789695.n4.nabble.com/Jaccard-dissimilarity-matrix-for-PCA-tp3165982p3165982.html
Sent from the R help mailing list archive at Nabble.com.




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] hclust, does order of data matter?

2010-11-15 Thread Christian Hennig
I don't know how the hclust function is implemented, but generally in 
hierarchical clustering the result can be ambiguous if there are several 
distances of identical value in the dataset (or identical between-cluster 
distances occur when aggregating clusters). The role of the order of the 
data depends on how these ambiguities are resolved. It may well be that, 
if at some point when building the hierarchy there are two 
different possibilities to merge clusters at the same distance value, what 
hclust does is determined by the order.
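
A small illustration of such ties: the four corners of a square produce
several identical distances, so the merge sequence may depend on the row
order even though the merge heights agree:

```r
# four corners of the unit square: many tied inter-point distances
x  <- matrix(c(0, 0, 1, 0, 0, 1, 1, 1), ncol = 2, byrow = TRUE)
h1 <- hclust(dist(x), method = "single")
h2 <- hclust(dist(x[c(3, 1, 4, 2), ]), method = "single")

# merge heights agree, but h1$merge and h2$merge may list the tied
# merges in a different order
cbind(h1$height, h2$height)
```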


Hope this helps,
Christian

On Mon, 15 Nov 2010, rchowdhury wrote:



Hello,

I am using the hclust function to cluster some data.  I have two separate
files with the same data.  The only difference is the order of the data in
the file.  For some reason, when I run the two files through the hclust
function, I get two completely different results.

Does anyone know why this is happening?  Does the order of the data matter?

Thanks,
RC
--
View this message in context: 
http://r.789695.n4.nabble.com/hclust-does-order-of-data-matter-tp3043896p3043896.html
Sent from the R help mailing list archive at Nabble.com.




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] maxitems in cluster validity

2010-10-22 Thread Christian Hennig
You'd need to show us the code you attempted to use in order to make it 
possible to help you.


A good idea may also be to contact the package maintainer directly.

Best regards,
Christian

On Fri, 22 Oct 2010, Penny Adversario wrote:


I did cluster validation using internal and stability measures on 8768 items, but 
I get an error message: "the number of items to be clustered is larger than 
maxitems." I increased maxitems to 9000 and still got the same error 
message. I partitioned the data into subsections of 600 and was able to get 
results, but it doesn't make sense to interpret the 15 validity results based on 
subsections of the data, especially if one is after the validity of clusters 
resulting from an entire data set. How can I increase the number of items for 
cluster validation?
Penny







*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


Re: [R] Biweight (Tukey) M-estimator of location with known scale parameter

2010-10-15 Thread Christian Hennig

Have a look at the package smoothmest.

Christian

On Fri, 15 Oct 2010, Ondrej Vozar wrote:


Dear colleagues,

I would like to ask how to estimate the
biweight M-estimator of Tukey with a known
scale parameter.

I know how to estimate the biweight M-estimator
when the scale is estimated too, using function
rlm in package MASS.

library(MASS)
x <- rnorm(1000)
rlm(x ~ 1, psi = psi.bisquare)

But I would like to estimate this
with the true scale s = 1.

I also checked package robustbase, but I have not
found any solution.

Thanks for any advice.
Ondrej Vozar






[R] Package prabclus not available?

2010-10-10 Thread Christian Hennig

Hi there,

I just tried to install the package prabclus on a computer running Ubuntu 
Linux 9.04 using install.packages from within R.

This gave me a message:
Warning message:
In install.packages("prabclus") : package ‘prabclus’ is not available

I tried to do this selecting two different CRAN mirrors (same result) and 
with other packages (installing them works fine).


Looking up the CRAN mirror website I used (UK, London), there doesn't seem 
to be anything wrong with prabclus. (iMac checking apparently gives an 
error which is due to an error with package spdep on that platform in 
tests, but that shouldn't affect using it on Linux, or should it?)


Any explanation?

Thanks and best wishes,
Christian



Re: [R] Package prabclus not available?

2010-10-10 Thread Christian Hennig

Works for me for both the London mirror as well as CRAN master.
May I guess that you are under R < 2.10.x on that machine?


Oh yes, need to update this first. Thanks. Solved.

The error message could be more informative, though...

Christian



Uwe



On 10.10.2010 14:48, Christian Hennig wrote:

Hi there,

I just tried to install the package prabclus on a computer running
Ubuntu Linux 9.04 using install.packages from within R.
This gave me a message:
Warning message:
In install.packages("prabclus") : package ‘prabclus’ is not available

I tried to do this selecting two different CRAN mirrors (same result)
and with other packages (installing them works fine).

Looking up the CRAN mirror website I used (UK, London), there doesn't
seem to be anything wrong with prabclus. (iMac checking apparently gives
an error which is due to an error with package spdep on that platform in
tests, but that shouldn't affect using it on Linux, or should it?)

Any explanation?

Thanks and best wishes,
Christian










Re: [R] hclust with method = “ward”

2010-10-07 Thread Christian Hennig

On Wed, 6 Oct 2010, PeterB wrote:


Thanks, Christian. This is really helpful.

I was not aware of that equality, but now I can see it. I think you mean the
inner sum over all distances in the distance matrix (for that cluster),
which means that each distance is counted twice (which is why we divide by
2).


That's probably how to explain it... you can obviously check it by 
writing the whole thing down, which is how I did it. (The formula is in 
Bock's old book on Automatische Klassifikation, but that's in German.)


Christian



Peter


Christian Hennig wrote:


The k-means/Ward criterion can be written down in terms of squared
Euclidean distances in a way that doesn't involve means. It is half the
sum (over all clusters) of the sum (over all observations in a
cluster) of all within-cluster squared dissimilarities, the inner sum
divided by the cluster size. This can also be computed for a general
dissimilarity matrix (this is for example done by cluster.stats in
package fpc).

I'd guess that hclust with method="ward" uses this when run with a general
dissimilarity matrix. At least it would make sense, although I'm not sure
whether it really is what hclust does, because I didn't check the
underlying Fortran code.

Note that I may have missed postings in this thread, so sorry if this
doesn't add to what you already have worked out.

Christian

On Wed, 6 Oct 2010, PeterB wrote:



Apparently, the same issue exists in SAS, where there is an option to run
the
Ward algorithm based only on the distance matrix. Perhaps, a SAS user
could
confirm that or even check with SAS.

Peter











Re: [R] hclust with method = “ward”

2010-10-06 Thread Christian Hennig
The k-means/Ward criterion can be written down in terms of squared 
Euclidean distances in a way that doesn't involve means. It is half the 
sum (over all clusters) of the sum (over all observations in a 
cluster) of all within-cluster squared dissimilarities, the inner sum 
divided by the cluster size. This can also be computed for a general 
dissimilarity matrix (this is for example done by cluster.stats in

package fpc).
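The identity described above can be checked numerically against kmeans(); the following is a minimal sketch with simulated data (all names and data here are illustrative, not part of the original posting):

```r
# Minimal sketch (simulated data): check that the k-means/Ward criterion
# equals the sum, over clusters, of the within-cluster squared Euclidean
# distances divided by twice the cluster size.
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)          # 20 observations, 2 variables
cl <- kmeans(x, centers = 3, nstart = 10)

d2 <- as.matrix(dist(x))^2                # squared Euclidean distances
crit <- sum(sapply(1:3, function(k) {
  idx <- cl$cluster == k
  sum(d2[idx, idx]) / (2 * sum(idx))      # each distance counted twice
}))

all.equal(crit, cl$tot.withinss)          # TRUE (up to numerical tolerance)
```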

I'd guess that hclust with method="ward" uses this when run with a general 
dissimilarity matrix. At least it would make sense, although I'm not sure 
whether it really is what hclust does, because I didn't check the 
underlying Fortran code.


Note that I may have missed postings in this thread, so sorry if this 
doesn't add to what you already have worked out.


Christian

On Wed, 6 Oct 2010, PeterB wrote:



Apparently, the same issue exists in SAS, where there is an option to run the
Ward algorithm based only on the distance matrix. Perhaps, a SAS user could
confirm that or even check with SAS.

Peter






Re: [R] KMedians clustering in R

2010-09-16 Thread Christian Hennig

pam/clara in package cluster are probably as close to it as you can be.

There is no unique definition of a multivariate median and therefore there 
is no unique definition of k-medians, but pam/clara is one possible 
version of it. (Of course if you think of k-medians as defined in a 
specific paper, it may still be something slightly different.)


Christian

On Thu, 16 Sep 2010, Shubha Vishwanath Karanth wrote:


Hi R,

Is there a package in R to perform KMedian clustering?

Thanks.
Shubha
This e-mail may contain confidential and/or privileged i...{{dropped:10}}






Re: [R] own distance

2010-09-07 Thread Christian Hennig
The kmeans function doesn't accept a distance object or distance matrix as 
input (which you can of course generate from your own definition), but pam 
in package cluster does. (Note that the term k-means refers to cluster 
mean vectors, which can only computed from a cases*variables data matrix, 
not from distances alone.)


There are also several other clustering methods that work with distance 
matrices as input, see for example hclust, agnes from package cluster or 
dbscan from package fpc, but if you want something that is 
conceptually close to k-means, you should probably go for pam.
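For illustration, here is a minimal sketch of feeding a dissimilarity to pam; the Manhattan distance merely stands in for a self-defined dissimilarity, and the data are simulated:

```r
# Minimal sketch: pam() accepts a dist object / dissimilarity matrix,
# so any self-defined dissimilarity can be plugged in.
library(cluster)
set.seed(5)
x <- matrix(rnorm(60), ncol = 2)       # 30 cases, 2 variables

# stand-in for a self-defined dissimilarity: Manhattan distance
d <- dist(x, method = "manhattan")
fit <- pam(d, k = 3)                   # pam takes the dist object directly
table(fit$clustering)                  # cluster sizes
fit$id.med                             # indices of the medoids
```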


Best regards,
Christian

On Tue, 7 Sep 2010, Karen Sargsyan wrote:

Is it possible to implement my own distance and mean for k-means clustering 
for any clustering package in R?

Just looking for simple way, without creating a new package.

karsar






[R] Rd-file error: non-ASCII input and no declared encoding

2010-09-01 Thread Christian Hennig

Dear list,

I came across the following error for three of my newly written Rd-files:
non-ASCII input and no declared encoding

I can't make sense of this.

Below I copied in one of the three files.
Can anybody please tell me what's wrong with it?

Thank you,
Christian

\name{tetragonula}
\alias{tetragonula}
\alias{tetragonula.coord}
\docType{data}
% \non_function{}
\title{Microsatellite genetic data of Tetragonula bees}
\description{
  Genetic data for 236 Tetragonula (Apidae) bees from Australia
  and Southeast Asia, see  Franck et al. (2004). The data give pairs of
  alleles (codominant markers) for 13 microsatellite loci.
}
\usage{data(tetragonula)}
\format{
  Two objects are generated:
  \describe{
\item{tetragonula}{A data frame with 236 observations and 13 string
variables. Strings consist of six digits each. The
format is derived from the data format used by the software GENEPOP
(Rousset 2010). Alleles have a three digit code, so a value of
\code{258260} on variable V10 means that on locus 10 the two alleles
have codes 258 and 260. \code{000} refers to missing values.}
\item{tetragonula.coord}{a 236*2 matrix. Coordinates of locations of
individuals in decimal format, i.e. the first number is latitude
(negative values are South), with minutes and seconds
converted to fractions. The second number is longitude (negative
values are West).}
  }
}
\source{
  Franck, P., E. Cameron, G. Good, J.-Y. Rasplus, and B. P. Oldroyd
  (2004) Nest architecture and genetic differentiation in a species
  complex of Australian stingless bees. \emph{Mol. Ecol.} 13, 2317–2331.

  Rousset, F. (2010) Genepop 4.0 for Windows and Linux.
  \url{http://kimura.univ-montp2.fr/~rousset/Genepop.pdf}
}
\details{
   Reads from example data file \code{Heterotrigona_indoFO.txt}.
}

\examples{
data(tetragonula)
}
\keyword{datasets}




Re: [R] Rd-file error: non-ASCII input and no declared encoding

2010-09-01 Thread Christian Hennig

Dear Gavin,

very well spotted => solved!
Thanks also Duncan!

Christian

On Wed, 1 Sep 2010, Gavin Simpson wrote:


On Wed, 2010-09-01 at 15:09 +0100, Christian Hennig wrote:

Dear list,

I came across the following error for three of my newly written Rd-files:
non-ASCII input and no declared encoding


I'm not sure, but is it the m- or n-dash character for the page range in
the \source{} section?

2317–2331
  ^^^

replace it with 2317-2331 and see if it helps.

HTH

G


I can't make sense of this.

Below I copied in one of the three files.
Can anybody please tell me what's wrong with it?

Thank you,
Christian

\name{tetragonula}
\alias{tetragonula}
\alias{tetragonula.coord}
\docType{data}
% \non_function{}
\title{Microsatellite genetic data of Tetragonula bees}
\description{
   Genetic data for 236 Tetragonula (Apidae) bees from Australia
   and Southeast Asia, see  Franck et al. (2004). The data give pairs of
   alleles (codominant markers) for 13 microsatellite loci.
}
\usage{data(tetragonula)}
\format{
   Two objects are generated:
   \describe{
 \item{tetragonula}{A data frame with 236 observations and 13 string
 variables. Strings consist of six digits each. The
 format is derived from the data format used by the software GENEPOP
 (Rousset 2010). Alleles have a three digit code, so a value of
 \code{258260} on variable V10 means that on locus 10 the two alleles
 have codes 258 and 260. \code{000} refers to missing values.}
 \item{tetragonula.coord}{a 236*2 matrix. Coordinates of locations of
 individuals in decimal format, i.e. the first number is latitude
 (negative values are South), with minutes and seconds
 converted to fractions. The second number is longitude (negative
 values are West).}
   }
}
\source{
   Franck, P., E. Cameron, G. Good, J.-Y. Rasplus, and B. P. Oldroyd
   (2004) Nest architecture and genetic differentiation in a species
   complex of Australian stingless bees. \emph{Mol. Ecol.} 13, 2317–2331.

   Rousset, F. (2010) Genepop 4.0 for Windows and Linux.
   \url{http://kimura.univ-montp2.fr/~rousset/Genepop.pdf}
}
\details{
Reads from example data file \code{Heterotrigona_indoFO.txt}.
}

\examples{
data(tetragonula)
}
\keyword{datasets}




--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Dr. Gavin Simpson [t] +44 (0)20 7679 0522
ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%





Re: [R] Latex errors when checking package

2010-08-08 Thread Christian Hennig

Dear Ben,

thanks a lot! This did the trick (after adding "install" after "apt-get").

Regards,
Christian

On Fri, 6 Aug 2010, Ben Bolker wrote:


Christian Hennig chrish at stats.ucl.ac.uk writes:


LaTeX errors when creating PDF version.
This typically indicates Rd problems.
LaTeX errors found:
! Font T1/ptm/m/n/10=ptmr8t at 10.0pt not loadable: Metric (TFM) file not
found


 I'm guessing you're running Ubuntu.  If so,

http://www.vidarholen.net/contents/linuxtips/

suggests that you

sudo apt-get texlive-fonts-recommended






[R] Latex errors when checking package

2010-08-06 Thread Christian Hennig

Dear listers,

I just run R CMD check on an update of one of my packages.
All seems fine but after having gone through all the Rd-file and example 
checking and so on, I get the following kind of errors:


LaTeX errors when creating PDF version.
This typically indicates Rd problems.
LaTeX errors found:
! Font T1/ptm/m/n/10=ptmr8t at 10.0pt not loadable: Metric (TFM) file not 
found

.
to be read again
   relax
l.7 \begin{document}
! Font T1/ptm/m/n/24.88=ptmr8t at 24.88pt not loadable: Metric (TFM) file 
not f

ound.
to be read again

(and similar ones)

l.18 ...ian Hennig \email{chr...@stats.ucl.ac.uk}}
! \textfont 0 is undefined (character h).
\...@formatstring ...\...@string \UrlRight \...@th $

l.41 ...http://www.homepages.ucl.ac.uk/~ucakche/}}

! \textfont 0 is undefined (character t).
\...@formatstring ...\...@string \UrlRight \...@th $

l.41 ...http://www.homepages.ucl.ac.uk/~ucakche/}}

! \textfont 0 is undefined (character t).
\...@formatstring ...\...@string \UrlRight \...@th $

(...and lots of further ones that look pretty much the same.)

I have no idea what went wrong and don't see anything suspicious in the 
Rd-files.


Any hint what this could be about?
Is something missing in my LATEX-installation?

Thanks and best regards,
Christian




Re: [R] K-means result - variance between cluster

2010-07-02 Thread Christian Hennig

Dear Ralph,

between and within clusters sum of squares (if you want variances, you 
need to divide them by the appropriate constant!) add up to the 
overall sum of squares, so you can get the beween clusters ss by 
computing the overall ss (one possibility to get this is to run kmeans 
with k=1) and subtracting the within cluster ss from it.
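A minimal sketch of that computation with simulated data (recent versions of kmeans() also report the relevant fields directly, which provides a check):

```r
# Minimal sketch: between-cluster SS = overall SS minus within-cluster SS;
# the overall SS is obtained by running kmeans with k = 1.
set.seed(2)
x <- matrix(rnorm(200), ncol = 2)      # 100 cases, 2 variables
cl <- kmeans(x, centers = 3, nstart = 10)

totss     <- kmeans(x, centers = 1)$tot.withinss  # overall sum of squares
betweenss <- totss - cl$tot.withinss

# newer kmeans() objects carry these fields themselves
all.equal(totss, cl$totss)             # TRUE (up to numerical tolerance)
all.equal(betweenss, cl$betweenss)     # TRUE (up to numerical tolerance)
```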


Note, however, that the F-value cannot be interpreted in the usual way and 
is particulary not F-distributed when computed on clusters from k-means, 
because for F-distribution you'd need to assume that groups are determined 
independently of the data.


Hope this helps,
Christian

On Fri, 2 Jul 2010, Ralph Modjesch wrote:


Hi,

I like to present the results from the clustering method k-means in
terms of variances: within and between Cluster. The k-means object
gives only the within cluster sum of squares by cluster, so the between
variance part is missing,for calculation the following table, which I
try to get.

Number of | Variance within | Variance between | Variance | F-value
cluster k | cluster         | cluster          | total    |
==========|=================|==================|==========|========
2         | 25.00           | 75.00            | 100      | 1.5
3         | 45.00           | 55.00            | 100      | 1.7

Is there any package/ function which will do that?


--
Mit freundlichen Grüßen

Ralph Modjesch






Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Christian Hennig

Dear abanero,

In principle, k nearest neighbours classification can be computed on 
any dissimilarity matrix. Unfortunately, knn and knn1 seem to assume 
Euclidean vectors as input, which restricts their use.


I'd probably compute an appropriate dissimilarity between points (have a 
look at Gower's distance in daisy, package cluster), and then implement 
nearest neighbours classification myself if I needed it. It should be 
pretty straightforward to implement.
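A minimal sketch of such a dissimilarity-based 1-nearest-neighbour classifier, using Gower's distance from daisy on the built-in iris data (the split and variable names below are my own choices for illustration):

```r
# Minimal sketch: 1-nearest-neighbour classification from an arbitrary
# dissimilarity matrix, here Gower's distance via cluster::daisy.
library(cluster)

train <- iris[c(1:40, 51:90, 101:140), ]
test  <- iris[c(41:50, 91:100, 141:150), ]

# dissimilarities between all cases (test cases in the first rows)
d <- as.matrix(daisy(rbind(test[, 1:4], train[, 1:4]), metric = "gower"))
n_test <- nrow(test)
d_tt <- d[1:n_test, -(1:n_test), drop = FALSE]  # test rows vs. train columns

# predicted label = label of the nearest training case
pred <- train$Species[apply(d_tt, 1, which.min)]
mean(pred == test$Species)             # high accuracy on this easy data set
```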


If you want unsupervised classification (clustering) instead, you have the 
choice between all kinds of dissimilarity based algorithms then (hclust, pam, 
agnes etc.)


Christian

On Thu, 27 May 2010, Ulrich Bodenhofer wrote:



abanero wrote:


Do you know  something like “knn1” that works with categorical variables
too?
Do you have any suggestion? 


There are surely plenty of clustering algorithms around that do not require
a vector space structure on the inputs (like KNN does). I think
agglomerative clustering would solve the problem, as would a kernel-based
clustering method (assuming that you have a positive semi-definite measure
of the similarity of two samples). Probably the simplest way is Affinity
Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation;
see CRAN package apcluster I have co-developed). All you need is a way of
measuring the similarity of samples which is straightforward both for
numerical and categorical variables - as well as for mixtures of both (the
choice of the similarity measures and how to aggregate the different
variables is left to you, of course). Your final classification task can
be accomplished simply by assigning the new sample to the cluster whose
exemplar is most similar.

Joris Meys wrote:


Not a direct answer, but from your description it looks like you are
better
of with supervised classification algorithms instead of unsupervised
clustering. 


If you say that this is a purely supervised task that can be solved without
clustering, I disagree. abanero does not mention any class labels. So it
seems to me that it is indeed necessary to do unsupervised clustering first.
However, I agree that the second task of assigning new samples to
clusters/classes/whatever can also be solved by almost any supervised
technique if samples are labeled according to their cluster membership
first.

Cheers, Ulrich




Re: [R] cluster analysis and supervised classification: an alternative to knn1?

2010-05-27 Thread Christian Hennig



Christian wrote:

and then implement
nearest neighbours classification myself if I needed it.
It should be pretty straightforward to implement.


Do you intend modify the code of the knn1() function by yourself?


No; if you understand what the nearest neighbours method does, it's not 
very complicated to implement it from scratch (assuming that your dataset 
is small enough that you don't have to worry too much about optimising 
computing times). A bit of programming experience is required, though. 
(It's not that I intend to do it right now, I suggest that you do it if 
you can...)


Christian




thanks to everyone!






Re: [R] Can I compare two clusters without using their distance-matrix (dist()) ?

2010-04-21 Thread Christian Hennig

Dear Tal,

I took the definition of the Hubert gamma- and Dunn-index from the Gordon 
book. They are actually not about comparing two clusters, at least not in 
that reference, and they require dissimilarities.


The adjusted Rand index and Meila's VI, as implemented in 
cluster.stats, compare two clusterings. If you set compareonly=TRUE in 
cluster.stats, it only computes these two indexes, so it doesn't need the 
dissimilarity matrix in principle. I will probably in the next update

change it so that in this case you don't need to provide a
dissimilarity matrix.

Until then, you can supply a noninformative matrix.
Example:
library(fpc)
c1 <- sample(4, 100, replace = TRUE)
c2 <- sample(5, 100, replace = TRUE)
cs <- cluster.stats(d = matrix(0, ncol = 100, nrow = 100), c1, c2,
                    compareonly = TRUE)

cs$corrected.rand
cs$vi

Hope this helps,
Christian



On Wed, 21 Apr 2010, Tal Galili wrote:


Thanks for the fast reply Uwe.

My hope in posting this was to find if anyone had already done work (in R)
in this direction.  So far I wasn't able to find any such relevant code, so
I turned to the mailing list.

Regarding new implementations - thanks for offering! - I have already came
around one such algorithm - I implemented it, and will probably publish it
on my blog http://www.r-statistics.com/ in the near future.

If any one else has any reference to R implementation, it would be most
helpful,
Tal


Contact
Details:---
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
--




2010/4/21 Uwe Ligges lig...@statistik.tu-dortmund.de


On 21.04.2010 18:15, Tal Galili wrote:


Hello all,

I would like to compare the similarity of two cluster solutions using a
validation criteria (such as Hubert's gamma coefficient, the Dunn index
the
corrected rand index and so on)

I see (from here:http://www.statmethods.net/advstats/cluster.html) that
the function cluster.stats() in the fpc package provides a mechanism
for comparing 2 cluster solutions - *BUT* - it requires me to give the
the distance matrix among objects.

*My question *is: What ways can you suggest for comparing two cluster
solutions, while using the cluster indicators only (i.e: a vector saying
to
which cluster each object belongs to), and WITHOUT asking to submit the
distance matrix between the objects.



Don't know. If you have a theoretical solution and can provide the
description of a method, there will be many people around happy to make an
algorithm and implement it.

Uwe Ligges



 Thanks,

Tal












*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



Re: [R] cluster/distance large matrix

2010-02-11 Thread Christian Hennig

Dear Bart,

a strange thing in your question is that the term "Ward's method"
usually refers to a method based on the k-means criterion, which, in its
standard form, is not based on dissimilarities but on
objects*variables data.
So I wonder how and why you want to use Ward's method on a dissimilarity
matrix in the first place. (I know that the k-means criterion
can in principle be translated to dissimilarity data - this is probably
what hclust's method="ward" does if fed with a dissimilarity matrix, but
I'm not sure - but then it loses its justification.)


One thing you could think about is using the function pam in library 
cluster. Chances are that this won't work on 38,000 cases either, but you 
may cluster a subsample of, say, 2,000 cases and assign all further 
objects to the most similar cluster medoid.
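For illustration, a minimal sketch of this subsample-then-assign idea. All object names and the toy data here are made up; the real 38,000-row data would take the place of `x`, and Euclidean distance is assumed for the medoid assignment.

```r
library(cluster)  # for pam

set.seed(1)
## toy stand-in for the real 38,000-row data
x <- matrix(rnorm(3000 * 2), ncol = 2)
k <- 3

sub <- sample(nrow(x), 500)      # cluster only a subsample with pam
px  <- pam(x[sub, ], k = k)

## squared Euclidean distance of every observation to each medoid
d2m <- sapply(seq_len(k),
              function(j) rowSums(sweep(x, 2, px$medoids[j, ])^2))
cl  <- max.col(-d2m)             # index of the closest medoid
table(cl)
```

The full 38,000 x 38,000 dissimilarity matrix is never formed; only distances to the k medoids are needed.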


It is well known that hierarchical methods are problematic with very large
dissimilarity matrices; even if you resolve the memory problem, the number
of operations required is enormous.


Hope this helps,
Christian


On Thu, 11 Feb 2010, Bart Thijs wrote:



Hi all,

I've stumbled upon some memory limitations for the analysis that I want to
run.

I've a matrix of distances between 38000 objects. These distances were
calculated outside of R.
I want to cluster these objects.

For smaller sets (e.g. n=100) this is how I proceed:
A <- matrix(scan(file, n=100*100), 100, 100, byrow=TRUE)
ad <- as.dist(A)
ahc <- hclust(ad, method="ward", members=NULL)


However if I try this with the real dataset I end up with memory problems.
I've the 64bit version of R installed on a machine with 40Gb RAM (Windows
2003 64bit version).

I'm thinking about using only the lower triangle of the matrix but I can't
create a distance object for the clustering from the lower.tri

Can someone help me with a suggestion for which way to go?

Best Regards
Bart Thijs



[R] Sampling theory

2010-01-19 Thread Christian Hennig

Hi there,

are there any R packages for computations required in sampling theory
(such as confidence intervals under random, stratified and cluster
sampling; I'd be particularly interested in confidence intervals for the
population variance, which is difficult enough to find even in books)?


Thanks,
Christian



Re: [R] Clustering with clara

2010-01-14 Thread Christian Hennig

Dear Paco,

as far as I know, there is no such problem with clara, but I may be wrong.
However, in order to help you (though I'm not sure whether I'll be able to
do that), we'd need to understand precisely what you were doing in R and
what your data look like (code and data; you can show us a relevant bit
of the data using the str command). Chances are that the problem is
not in clara, but that some other thing you do doesn't do what you expect
it to do.

Christian

On Thu, 14 Jan 2010, pacomet wrote:


Hello everyone

I am trying to use CLARA method for finding clusters in my spatial surface
temperature data and noticed one problem. My data are in the form
lat,lon,temperature. I extract lat,lon and cluster number for each point in
the dataset. When I plotted a map of cluster numbers I found empty areas in
the map. The point is that the number of points that were assigned a cluster
number is less than the number of originally analyzed temperature points.

Why are there fewer points in the clustering results? Is there any option in
the CLARA method to retain every single point? Is there another clustering
method that preserves all the points?

Thanks in advance

Paco

--
_
El ponent la mou, el llevant la plou
Usuari Linux registrat: 363952
---
Fotos: http://picasaweb.google.es/pacomet



Re: [R] K-means recluster data with given cluster centers

2010-01-11 Thread Christian Hennig
That kmeans returns an error if there is an empty cluster is a bit of a
nuisance.

It should not be too difficult to do without the kmeans function for what
you call reclustering. You could write your own function that assigns
every point of the new data to the closest initial center. That should be
relatively easy and does the same thing, if I understand correctly what you
want.
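A minimal sketch of such a function, assuming Euclidean distance (as kmeans uses); all names and the toy data are illustrative:

```r
## assign each row of newdata to the closest row of centers;
## empty clusters are harmless here, unlike in kmeans()
assign_to_centers <- function(newdata, centers) {
  d2 <- sapply(seq_len(nrow(centers)),
               function(j) rowSums(sweep(newdata, 2, centers[j, ])^2))
  max.col(-d2)  # column index of the smallest squared distance
}

set.seed(42)
centers <- matrix(c(0, 0, 5, 5), ncol = 2, byrow = TRUE)
newdata <- rbind(matrix(rnorm(20), ncol = 2),
                 matrix(rnorm(20, mean = 5), ncol = 2))
table(assign_to_centers(newdata, centers))
```

Unlike a one-iteration kmeans call, this never updates the centers, so the saved centers cluster every new dataset identically.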


I don't comment on whether what you attempt to do makes sense, which
entirely depends on the aim of your analysis (and on what you mean by
"cluster in the same way"), but an alternative could be
to cluster the initial data by mclustBIC in library mclust and to use the
resulting clusters as training data in mclustDA.


Cheers,
Christian


On Mon, 11 Jan 2010, t.peter.muel...@gmx.net wrote:


K-means recluster data with given cluster centers

Dear R user,

I have several large data sets. Over time additional new data sets will be 
created.
I want to cluster all the data in a similar/ identical way with the k-means 
algorithm.

With the first data set I will find my cluster centers and save the cluster 
centers to a file [1].
This first data set is huge, so it is guaranteed that the cluster centers
will converge.

Afterwards I load my cluster centers and cluster via k-means all other datasets 
with the same cluster centers [2].

I tried this, but in the reclustering step I now get the following error
message:
Error: empty cluster: try a better set of initial centers

That one of the clusters is empty (has no data point) should not be a
problem. This can happen because the new data sets can be smaller. What
am I doing wrong? Is there another way to cluster new data in the same
way as the old datasets?


Thanks
Peter


1: R code to find cluster center and save them to file
  #---INITIAL CLUSTERING TO FIND CLUSTER CENTERS
  # LOAD LIB
  library(cluster)

  # LOAD DATA
  data_unclean <- read.table("dataset1.dat")
  data.matrix <- as.matrix(data_unclean)

  # CLUSTER
  Nclust <- 100 # number of cluster centers
  Imax <- 200 # maximum number of iterations for convergence of clustering
  set.seed(100) # set seed of random number generator
  init <- sample(dim(data.matrix)[1], Nclust) # rows used as the initial Nclust prototypes
  km <- kmeans(data.matrix, centers=data.matrix[init,], iter.max=Imax)

  # WRITE OUT CLUSTER CENTERS
  km$centers # print cluster centers (columns: dimensions; rows: clusters)
  km$size # print number of points in each cluster
  clusterCenters <- km$centers
  save(file="clusterCenters.RData", list="clusterCenters") # example
  write.table(km$centers, file="clusterCenters.dat", sep=",",
  col.names=FALSE, row.names=FALSE)


2: R code to recluster new data
  #---RECLUSTER NEW DATA WITH GIVEN CLUSTER CENTERS
  # LOAD LIB, SET PARAMETERS
  library(cluster)
  loopStart <- 0
  loopEnd <- 10

  # LOAD CLUSTER CENTERS
  load("clusterCenters.RData") # load cluster centers

  # LOOP OVER DATASETS AND RECLUSTER THEM
  for(ii in loopStart:loopEnd){
   # DEFINE FILENAMES
   #print(paste("test", ii, sep=""))
   filenameInput <- paste("dataset", ii, ".dat", sep="")
   filenameOutput <- paste("dataset", ii, ".datClusters", sep="")
   print(filenameInput)
   print(filenameOutput)

   # LOAD DATA
   data_unclean <- read.table(filenameInput)
   data.matrix <- as.matrix(data_unclean)

   # RECLUSTER DATA
   kmRecluster <- kmeans(data.matrix, centers=clusterCenters, iter.max=1)
   kmRecluster$size

   # WRITE OUT CLUSTER ASSIGNMENTS FOR EACH DATASET
   write.table(kmRecluster$cluster, file=filenameOutput, sep=",",
   col.names=FALSE, row.names=FALSE)
  }






Re: [R] Fishers exact test at 2.2e-16

2009-12-17 Thread Christian Hennig
I know that you didn't ask for this, but to me this seems to be a very
dodgy method to select a best number of clusters, with no proper basis at
all. All of these tests are data dependent, so the p-values cannot be
interpreted in the usual way. It is actually not clear how they can be
interpreted, and the freedom in the data to find a clustering depends on
the number of clusters, so there is no reason to expect that comparing
p-values for different numbers tells you anything meaningful. Do you
really think that it is an informative difference if one clustering gives
you p=10^{-58} and another one p=10^{-30}?

Christian
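As a side note on the printed "p-value < 2.2e-16" (this addresses the mechanics only, not whether the comparison is meaningful): 2.2e-16 is just R's display cut-off (.Machine$double.eps); the numeric value is still stored in the test result and can be extracted, which at least makes comparing two tiny p-values mechanically possible. A small sketch using the poster's first table:

```r
## the printed "p-value < 2.2e-16" is only a display cut-off;
## the stored numeric value can be inspected directly
dat2 <- matrix(c(29, 0, 29, 0, 12, 0, 18, 0, 0, 29, 0, 16, 0, 19), nrow = 2)
pv <- fisher.test(dat2, workspace = 2e7)$p.value
format(pv, digits = 3)  # far below the printed threshold
```

For extremely small values the stored double may itself eventually underflow, in which case comparing log-likelihoods or another criterion directly would be needed.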

On Thu, 17 Dec 2009, Søren Faurby wrote:


In an effort to select the most appropriate number of clusters in a
mixture analysis I am comparing the expected and actual membership of
individuals in various clusters using Fisher's exact test. I aim
for the model with the lowest possible p-value, but I frequently get
p-values below 2.2e-16 and therefore do not get exact p-values from
standard Fisher's exact tests in R.

Does anybody know if there is a version of Fisher's exact test in
any package which can handle lower probabilities, or do you have other
suggestions as to how I can compare the probabilities?


I am for instance comparing the following two:

dat2 <- matrix(c(29,0,29,0,12,0,18,0,0,29,0,16,0,19), nrow=2)
fisher.test(dat2, workspace=3000)

dat3 <- matrix(c(29,0,0,29,0,0,12,0,0,17,0,1,0,29,0,0,15,1,0,0,19),
nrow=3)
fisher.test(dat3, workspace=3000)

Both of these result in p-value < 2.2e-16.

Kind regards, Søren



Re: [R] cluster size

2009-12-11 Thread Christian Hennig

Dear Ms Karunambigai,

the kmeans algorithm depends on random initialisation.
There are two basic strategies that can be applied in order to make your 
results reproducible:
1) Fix the random number generator by means of set.seed (see ?set.seed) 
before you run kmeans. The problem with this is that your solution can 
only be reproduced using the same random seed; it technically still is 
random.

2) Specify fixed initial centers, using the centers argument of kmeans.
(Sensible initial centers may be obtained by running hclust using Ward's
method, obtaining the desired number of clusters using cutree, and
computing the centers of the resulting clusters; sorry that I
don't have the time right now to explain precisely how to do that; the
help pages and hopefully some understanding of what is going on may help
you further.)
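A rough sketch of strategy 2) on toy data (all names are illustrative; `"ward.D2"` is the method name in newer R versions, older versions used `"ward"`):

```r
set.seed(1)                     # only for generating the toy data
x <- matrix(rnorm(200 * 4), ncol = 4)
k <- 5

## deterministic starting centers from Ward clustering + cutree
hc  <- hclust(dist(x), method = "ward.D2")
grp <- cutree(hc, k = k)
ctr <- apply(x, 2, function(col) tapply(col, grp, mean))  # k x ncol(x)

km <- kmeans(x, centers = ctr)  # reproducible: no random initialisation
table(km$cluster)
```

Because the centers are computed deterministically from the data, rerunning this gives the same kmeans solution without fixing a random seed.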


An alternative strategy that will not absolutely guarantee reproducibility
but makes your results more stable is to use kmeansruns in library fpc,
which is a wrapper that runs kmeans several times and gives you the optimal
solution. That should reproduce its outcome with higher probability
(though not precisely 1).
I don't know whether the default value runs=100 is sufficient to give a
stable solution for your data, but increasing the runs parameter may help.
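For concreteness, a small sketch of such a kmeansruns call on toy data (assuming package fpc is installed; `krange` and `runs` as in its documentation):

```r
library(fpc)  # for kmeansruns

set.seed(1)
x <- matrix(rnorm(150 * 2), ncol = 2)
## 100 random restarts of kmeans for k = 3, keeping the best solution
kr <- kmeansruns(x, krange = 3, runs = 100)
table(kr$cluster)
</imports>
```

With many restarts the best-found solution usually stabilises, though strictly identical output across sessions still requires fixing the seed.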


Cheers,
Christian

On Fri, 11 Dec 2009, karuna m wrote:


hi r-help,
I am doing kmeans clustering in stats. I tried clustering into five
clusters using:
kcl1 <- kmeans(as1[, c("contlife", "somlife", "agglife", "sexlife",
    "rellife", "hordlife", "doutlife", "symtlife", "washlife",
    "chcklife", "rptlife", "countlife", "coltlife", "ordlife")],
  5, iter.max = 10, nstart = 1,
  algorithm = "Hartigan-Wong")
table(kcl1$cluster)
Every time I get five clusters of different sizes, for example the first
time with cluster sizes
table(kcl1$cluster)
  1   2   3   4   5
140  72 105  98 112
and the second time with cluster sizes
table(kcl1$cluster)
  1   2   3   4   5
 91 149 106  76 105
and so on.
I wish to know whether there is any function to get the same cluster sizes
every time we do kmeans clustering.
Thanks in advance.
regards,
Ms.Karunambigai M
PhD Scholar
Dept. of Biostatistics
NIMHANS
Bangalore
India








Re: [R] fitting mixture of normals distribution to asset return data

2009-11-25 Thread Christian Hennig

Dear John,

I don't know what the exp stuff in your line below is about, but 
mclustBIC in package mclust does fit normal mixtures.

Try for a start
library(mclust)
mmm <- mclustBIC(data, G=2)
mmms <- summary(mmm, data)
mmms

If you want to learn more, read the documentation.

Christian

On Wed, 25 Nov 2009, John Seppänen wrote:


Hi,

I have a 15 years of monthly return data (180 observations) from instruments
that have non-normal return distributions. Thus, I would like to fit a
mixture of normal distribution to each of the series. So, that I would be
able to simulate from the marginal distributions like this:

asset.1 <- exp(c(rnorm(500, -0.07, 0.02), rnorm(9500, 0.05, 0.05))) - 1

My problem is that I have tried Google and gone through some packages
(e.g. mixtools and mclust) but haven't been able to find a function to fit
a mixture of normals. I would like to have two different states of the
world and then get the probabilities and the mean and sigma in those
states (as in the example above).

I am newbie in this subject so if someone could point me a R function for
this, I'd really appreciate it...

br,
John






Re: [R] test for bimodality

2009-08-30 Thread Christian Hennig

Dear John,

you may check dip in package diptest, which tests unimodality vs.
multimodality (you'll need qDiptab for that, too).

Apologies if this misses the point; I only saw the last responses but not
the original question, so this may not be what you are after, or already
mentioned, but I decided to risk making a fool of myself for a small
chance that it helps you.


Cheers,
Christian

On Mon, 31 Aug 2009, Rolf Turner wrote:



On 31/08/2009, at 9:40 AM, John Sansom wrote:


Has a test for bimodality been implemented in R?


Doing RSiteSearch("test for bimodality") yields one hit,
which points to

http://finzi.psych.upenn.edu/Rhelp08/2008-September/173308.html

It looks like it might be *some* help to you.

cheers,

Rolf Turner






Re: [R] normal mixture model

2009-07-27 Thread Christian Hennig

Hi Cindy,

you need the summary function

mclustsummary <- summary(mclustBICoutputobject, data)

to get all the information. Some of it (like the best model) is given if
you just print out the summary object. Other information (like the
estimated parameter values) is accessible as components of the summary
object, like

mclustsummary$parameters$...

Try

str(mclustsummary)

to see what's there (unfortunately this is not fully documented).

For more detail see the help pages.

Hope this helps,
Christian

On Sun, 26 Jul 2009, cindy Guo wrote:


Hi, Christian,

Thank you for the reply. I just tried. Does the function mclustBIC only give
the best model, or does it also do EM to get the cluster means and variances
according to the best model it picks? I didn't find it. Is there a way to
automatically select the best number of components and do EM? I need to
run the normal mixture model in a loop (one EM per iteration), so I want
it to do everything automatically.
Thanks,

Cindy

On Sun, Jul 26, 2009 at 3:46 PM, Christian Hennig chr...@stats.ucl.ac.uk wrote:


You can use mclustBIC in package mclust (uses the BIC for deciding about
the number of components and hierarchical clustering for initialisation).

Christian


On Sun, 26 Jul 2009, cindy Guo wrote:

  Hi, All,


I want to fit a normal mixture model. Which package in R is best for this?
I
was using the package 'mixdist', but I need to group the data into groups
before fitting model, and different groupings seem to lead to different
results. What other package can I use which is stable? And are there
packages that can automatically determine the number of components?

Thank you,

Cindy






Re: [R] normal mixture model

2009-07-26 Thread Christian Hennig
You can use mclustBIC in package mclust (it uses the BIC for deciding
about the number of components and hierarchical clustering for
initialisation).


Christian

On Sun, 26 Jul 2009, cindy Guo wrote:


Hi, All,

I want to fit a normal mixture model. Which package in R is best for this? I
was using the package 'mixdist', but I need to group the data into groups
before fitting model, and different groupings seem to lead to different
results. What other package can I use which is stable? And are there
packages that can automatically determine the number of components?

Thank you,

Cindy






Re: [R] Hartigan's Dip test

2009-07-06 Thread Christian Hennig

James,

I assume that you use package diptest.
Look up
?qDiptab
for a table of quantiles from which you can obtain your p-value.


data(qDiptab)
qDiptab


You then look up your statistic value (0.074) for a suitable n
(n=30, say), and you take one minus the Pr value you find at the top of
the table.

Using n=30 gives you Pr between 0.8 and 0.9, so 0.1 < p < 0.2.
However, the next largest tabulated value of n (50) would lead to
0.02 < p < 0.05, leading to possibly different conclusions. Unfortunately
this means that you are in a kind of ambiguous borderline situation for
the table, though 33 is so much closer to 30 than to 50 that it seems
that your result is at least not significant at the 0.05 level.


By the way, you can simulate a p-value yourself by repeating 
dip(runif(33)) lots of times.
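A sketch of that simulation idea (assuming package diptest; the Monte Carlo sample size of 2000 is arbitrary):

```r
library(diptest)  # for dip()

set.seed(1)
## null distribution of the dip statistic for n = 33 (unimodal uniform)
nullvals <- replicate(2000, dip(runif(33)))
pval <- mean(nullvals >= 0.074)  # Monte Carlo p-value for the observed dip
pval
```

This sidesteps the coarse n grid of qDiptab entirely, since the simulation is done at exactly n = 33.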


Hope this helps,
Christian

On Mon, 6 Jul 2009, James Allsopp wrote:


Hi,
I just got a value for the dip test out of my data of 0.074 for a sample
size of 33. I'm trying to work out what this actually means though?
Could someone help me relate this to a p-value?

Thanks
James






Re: [R] Cluster analysis, defining center seeds or number of clusters

2009-06-11 Thread Christian Hennig

Dear Alex,

actually fixing the number of clusters in kmeans and then ending up with a
smaller number because of empty clusters is not a standard method of
estimating the number of clusters. It may happen (as apparently in some of
your examples), but it is generally rather unusual. In most cases kmeans,
as well as clara, pam and other clustering methods, will only give you the
number of clusters you ask for. Even with some reasonable separation
between clusters, kmeans cannot generally be expected to come up with
empty clusters if the number is initially chosen too high or too many
initial centers are specified.


The help page for pam.object in library cluster shows you a method to 
estimate the optimal number of clusters based on pam.
However, this problem strongly depends on what cluster concept you have in 
mind and what you want to use your clusters for. There are alternative 
indexes that could be optimised to find the best number of clusters. Some 
of them are implemented in the function cluster.stats in package fpc.
I strongly advise reading some literature about this to understand the 
problem better; the help page of cluster.stats gives a few references.


The BIC gives you an estimate of the number of cluster together with 
Gaussian mixtures, see package mclust.


If you can specify things like a maximum within-cluster distance, you may
get something from using cutree together with a hierarchical clustering
method from hclust, for example complete linkage.
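A small sketch of that idea on toy data (with complete linkage, cutting the tree at height h guarantees that all within-cluster distances are at most h; the value h = 2.5 is arbitrary):

```r
set.seed(1)  # only for the toy data
x <- matrix(rnorm(100 * 2), ncol = 2)

hc <- hclust(dist(x), method = "complete")
cl <- cutree(hc, h = 2.5)   # h = maximum tolerated within-cluster distance
table(cl)
```

The number of clusters then falls out of the height threshold rather than being fixed in advance.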


dbscan and fixmahal in package fpc are further alternatives, requiring
one or two tuning constants to come up with an automatic number of
clusters.

Best regards,
Christian

On Thu, 11 Jun 2009, am...@xs4all.nl wrote:


I use kmeans to classify spectral events in high and low 1/3 octave bands:

#Do cluster analysis
CyclA <- data.frame(LlowA, LhghA)
CntrA <- matrix(c(0.9, 0.8, 0.8, 0.75, 0.65, 0.65), nrow=3, ncol=2, byrow=TRUE)
ClstA <- kmeans(CyclA, centers=CntrA, nstart=50, algorithm="MacQueen")

This works well when the actual data shows 1,2 or 3 groups that are not
too close in a cross plot. The MacQueen algorithm will give one or more
empty groups which is what I want.

However, there are cases when the groups are closer together, less compact
or diffuse which leads to the situation where visually only 2 groups are
apparent but the algorithm returns 3 splitting one group in two.

I looked at the package 'cluster', specifically at clara (I cannot use pam
as I have 1 observations). But clara always returns as many groups as
you ask for.

Is there a way to help find a seed for the initial cluster centers?
Equivalently, is there a way to find the number of groups a priori?

I know this is not an easy problem. I have looked at principal components
(princomp, prcomp) because there is a connection with cluster analysis. It
is not obvious to me how to program that connection though.

http://en.wikipedia.org/wiki/Principal_Component_Analysis
http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf
http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf

Thanks in advance,
Alex van der Spek






Re: [R] em algorithm mixture of multivariate normals

2009-05-21 Thread Christian Hennig

Look up packages flexmix and mclust!

Christian

On Thu, 21 May 2009, daniele riggi wrote:


Hi,
I would like to know if it is possible to have a R code to estimate the
parameters of a mixture of bivariate (or multivariate) normals via EM
Algorithm. I tried to write it, but in the estimation of the matrix of
variance and covariance, i have some problems. I generate two bidimensional
vectors both from different distribution with their own vector means and
variance and covariance structure. When I create a unique vector, the
structure of covariance changes, and so the implementation of the EM
algorithm doesn't work.

Maybe someone knows the reason. If I fix the initial value of the
covariance matrix and don't update its estimate, the algorithm works and
finds the estimates of the mean vectors, so I wrote that part correctly.
If someone could help me I would be very grateful.

Best regards,
Daniele

--
Dr. Daniele Riggi, PhD student
University of Milano-Bicocca
Department of Statistics
Building U7, Via Bicocca degli Arcimboldi, 8
20126 Milano, Italy
cell. +39 328 3380690
mailto: daniele.ri...@gmail.com






Re: [R] Save Cluster results to data frame

2009-05-18 Thread Christian Hennig

Hi Chris,

this isn't particularly a clustering question, is it?

Why don't you just take your clustering vector 
(pam.output.object$clustering) and assign it to 
a$clustering (given that a is the name of your data frame)?


And why don't you just define a new character/string vector and assign
the cluster names that you want to it using if or case?
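Putting both suggestions together in a minimal sketch (the data frame `a`, k = 3 and the labels are all illustrative; a factor is used for the naming, which is a slightly different route than if/case but has the same effect):

```r
library(cluster)  # for pam

set.seed(1)  # only for the toy data
a <- data.frame(v1 = rnorm(30), v2 = rnorm(30))
p <- pam(a, k = 3)

## attach the clustering to the data frame and relabel the clusters
a$clustering <- factor(p$clustering,
                       levels = 1:3,
                       labels = c("low", "middle", "high"))
table(a$clustering)
```

The labels stay attached to the rows, so subsetting or sorting the data frame keeps each observation with its cluster name.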

Regards,
Christian

On Mon, 18 May 2009, Chris Arthur wrote:

If I cluster my data into 3 sets, using pam for instance, is there a way
to save the resultant cluster results to the originating data frame? And,
related to that, how do I change the cluster names to something a bit
more meaningful than 1..2...3?

So it goes like this:

Data --> cluster into 3 groups --> give them meaningful names
--> output back to data frame


Thanks for the help

Chris




*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chr...@stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche



[R] Bayesian 2*2 table independence test

2009-05-14 Thread Christian Hennig

Dear list,

I'd like to test the null hypothesis of independence in a 2*2 
contingency table in a Bayesian way.
As far as I can see, this can be done by the function ctable in library 
LearnBayes. (More precisely, a Bayes factor can be computed.)


Two questions:
1) Is there any other package/function that can be used for this in a more 
or less straightforward manner?


2) The ctable help page explains the parameter a as
"a: matrix of prior hyperparameters", but it is not explained what the 
prior is and how it precisely depends on these hyperparameters. (I read
the paper test_of_independence.pdf by the package author J. Albert on the 
web, but it only explains the choice a=1+0*data, but not general a.)


Thank you,
Christian




Re: [R] finite mixture model (2-component Weibull): plotting Weibull components?

2009-04-30 Thread Christian Hennig
mclust doesn't do Weibull mixtures, only Gaussian ones (though you may 
approximate a Weibull by several Gaussians). You may look up the flexmix 
package, which either does it already or allows a method function to be 
provided to do it.
There is also an example fitting a mixture distribution in Venables and 
Ripley's MASS book with normals (including plotting the density), which 
you could adapt for Weibull distributions by plugging in the 
corresponding functions for the Weibull.



(1) restrict the number of components to 2.


Specify G=2 in mclust (if you want to fit 2 Gaussians).


(2) obtain and plot a component Weibull density


Generally, adding a Weibull mixture density to a plot works by

x <- seq(0, 100, by = 0.1)
# or whatever reasonable choice of x-values
lines(x, p * dweibull(x, s11, s12) + (1 - p) * dweibull(x, s21, s22))

where p, s11, s12, s21, s22 are the mixture parameters.
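A self-contained sketch, with illustrative (not fitted) parameter values standing in for whatever the fitting step returns:

```r
# Overlay a 2-component Weibull mixture density on a histogram.
# All parameter values here are made up for illustration, not fitted.
set.seed(123)
y <- c(rweibull(300, shape = 1.5, scale = 10),
       rweibull(200, shape = 4,   scale = 40))
hist(y, freq = FALSE, breaks = 30)
p <- 0.6; s11 <- 1.5; s12 <- 10; s21 <- 4; s22 <- 40
x <- seq(0, 100, by = 0.1)
dens <- p * dweibull(x, s11, s12) + (1 - p) * dweibull(x, s21, s22)
lines(x, dens)   # mixture density on top of the histogram
```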

Regards,
Christian




Re: [R] clustering, don't understand this error

2009-04-16 Thread Christian Hennig

Hi there,

I'm travelling right now so I can't really check this but it seems that 
the problem is that cluster.stats needs a partition as input. hclust 
doesn't give you a partition but you can generate one from it using 
cutree.

BTW, rather use <- than =.
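A minimal sketch of the fix, with synthetic data standing in for bupa:

```r
# Cut the hclust tree into a partition before calling cluster.stats.
library(fpc)                 # for cluster.stats
set.seed(1)
dat <- matrix(rnorm(120), ncol = 6)
d <- dist(dat)
a <- hclust(d)
part <- cutree(a, k = 3)     # partition vector: one cluster label per case
cluster.stats(d, part)$dunn  # now a valid 'clustering' argument
```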

Best wishes,
Christian


On Wed, 15 Apr 2009, Ana M Aparicio Carrasco wrote:


Hello,
I am using the Dunn index, but something is wrong and I don't understand
what this error means. Please can you help me with this?

The instructions are:

#Indice de Dunn
disbupa=dist(bupa[,1:6])
a=hclust(disbupa)
cluster.stats(disbupa,a,bupa[,7])$dunn

And the error is:

Error in max(clustering) : invalid 'type' (list) of argument

thank you so much.

Ana Maria Aparicio.








Re: [R] Help with clustering

2009-01-26 Thread Christian Hennig
Generally, how to scale different variables when aggregating them in a 
dissimilarity measure is strongly dependent on the subject matter, the 
aim of clustering, and your cluster concept. This cannot be answered 
properly on such a mailing list.


A standard transformation before computing dissimilarities would be to 
scale all variables to variance 1 by dividing by their standard deviations. 
This gives, in some well-defined sense, all variables the same weight 
(which may be somewhat affected by outliers, heavy tails, and skewness; 
note, however, that normalising to the same range shares the same problems 
more severely).
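As a sketch (the variable names and ranges are made up to mirror the question below):

```r
# Standardise each variable to variance 1 before computing distances.
set.seed(2)
X <- data.frame(item1 = runif(50, 70, 525),            # wide range
                item2 = runif(50),                     # in [0, 1]
                item3 = sample(1:10, 50, replace = TRUE))
Xs <- scale(X)          # centre, then divide by standard deviations
d  <- dist(Xs)          # Euclidean distances on equally weighted variables
apply(Xs, 2, sd)        # each is 1 after scaling
```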


Regards,
Christian

On Mon, 26 Jan 2009, mau...@alice.it wrote:


I am going to try out a tentative clustering of some feature vectors.
The ranges of values spanned by the three items making up the feature vector are 
quite different:

Item-1 goes roughly from 70 to 525 (integer numbers only)
Item-2 is between 0 and 1 (all real numbers between 0 and 1)
Item-3 goes from 1 to 10 (integer numbers only)

In order to spread out Item-2 even further I might try to replace Item-2 with 
Log10(Item-2).

My concern is that, regardless of the distance measure used, the item whose 
order of magnitude is the highest may carry the highest weight in the process 
of calculating the similarity matrix, thereby fading out the influence of the 
items with smaller variation in the resulting clusters.
Should I normalize all feature vector elements to 1 in advance of generating 
the similarity matrix?

Thank you so much.
Maura
















Re: [R] bug (?!) in pam() clustering from fpc package ?

2008-12-17 Thread Christian Hennig

Dear Tal,

pam is not in the fpc package but in the cluster package. Look at ?pam and 
?pam.object to find out what it does.
As far as I see, the medoids in the output object are the final cluster 
medoids, not the initial ones, which presumably explains the observed 
behaviour.
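Both points can be sketched at once (cluster package; medoids = c(1, 16) are only the starting medoids):

```r
# k-medoids with Manhattan distance; the returned medoids are the FINAL ones.
library(cluster)
set.seed(42)
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))
fit <- pam(x, 2, metric = "manhattan", medoids = c(1, 16))
fit$id.med   # indices of the final medoids; need not be 1 and 16
```

pam with metric = "manhattan" also answers the k-means-with-Manhattan-distance part of the question, since k-medoids minimises sums of dissimilarities rather than squared Euclidean distances.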


Best regards,
Christian

On Wed, 17 Dec 2008, Tal Galili wrote:


Hello all.
I wish to run k-means with manhattan distance.
Since this is not supported by the function kmeans, I turned to the pam
function in the fpc package.
Yet, when I tried to have the algorithm run with different starting points,
I found that pam ignores them and keeps on starting the algorithm from the same
starting points (medoids).

For my questions:
1) is there a bug in the code or in the way I am using it ?
2) is there a way to either fix the code or to another function in some
package that can run kmeans with manhattan distance (manhattan distances are
the sum of absolute differences) ?

here is a sample code:
require(fpc)
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))
pam(x, 2, medoids = c(1, 16))


output:
Medoids:
ID
[1,]  3 -0.1406026 0.1131493
[2,] 17  4.9564839 4.6480520
...

So the initial medoids were 3 and 17, not 1 and 16 as I asked.



Thanks,
Tal



--
--
Tal Galili
Phone number: 972-50-3373767
FaceBook: Tal Galili
My Blogs:
www.talgalili.com
www.biostatistics.co.il






Re: [R] CLARA and determining the right number of clusters

2008-09-30 Thread Christian Hennig

Hi there,

generally finding the right number of clusters is a difficult problem and 
depends heavily on the cluster concept needed for the particular 
application.

No outcome of any automatic method should be taken for granted.

Having said that, I guess that something like the example given in

?pam.object

(replacing pam by clara) should work with clara, too.

Regards,
Christian


On Tue, 30 Sep 2008, pacomet wrote:


Hi everyone

I have a question about clustering. I've managed using CLARA to get a
clustering analysis of a large data set. But now I want to find which is the
right number of clusters.

The clara.object gives some information like the ratio between maximal and
minimal dissimilarity that says (maybe if lower than 1??) whether a cluster is
well-separated from the others. I've also read something about silhouette and
about cluster.stats but can't manage to work out how to find the right number of
clusters.

I've tried a suggestion from the mailing list but when using dist

d1 <- dist(mydata$sst)

it says that specified vector size is too big

Is there any method to find the right number of clusters when using clara?
Maybe it is something I've tried already, but there is a small and simple trick I can't find.

Thanks in advance

--
_
El ponent la mou, el llevant la plou
Usuari Linux registrat: 363952
---
Fotos: http://picasaweb.google.es/pacomet








Re: [R] CLARA and determining the right number of clusters

2008-09-30 Thread Christian Hennig

Hi there,


I've tried your suggestion and it seems promising. But I have a couple of
questions. I am reading a three column ASCII file (lon, lat, sst)


mydata <- read.table(INFILE, header=FALSE, sep=",",
  na.strings="99.00", dec=".", strip.white=TRUE,
  col.names=c("lon","lat","sst"))

then I extract a subset of the data and try to get the right number of
clusters just for third var, sst


I'm not sure whether you feel that you have to look at a single variable at 
a time, but the whole thing should work for more than one as well.



x <- mydata$sst
asw <- numeric(10)
for (k in 4:10)
  asw[k] <- clara(x, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")

silhouette-optimal number of clusters: 5

I've changed the maximum number of clusters in your example from 20 just to
10 as I am expecting a number between 5 and 8 clusters would be right. Is
there any problem with this change? Maybe this restriction is too strict if
I just consider the data are just numbers but as it is sea surface
temperature under certain environmental-meteorological conditions in this
particular case I think there should not be more than 8-9 clusters (If 20 is
retained I get 11 clusters).


This kind of problem concerns your data and your application and cannot be 
solved on such a mailing list. Perhaps you should seek professional 
advice about your particular data. Quite obviously, if you restrict the 
number of clusters to be at most 10, you cannot find eleven; and how strongly 
you think that there should not be more than 8-9 clusters, and how good 
your arguments against 11 are, nobody on this list can decide.
The general problem is that there is no unique statistical definition of 
what a "true" cluster is, and whether your dataset contains 5 or 11 
clusters (or any other number) depends on what you want to call a 
"cluster".



The second question is how should one understand the plot? Is the right
number the one with greater average silhouette width?


I don't know which plot you refer to but you may have a look at the Kaufman 
and Rousseeuw book quoted on the help page.


Best wishes,
Christian



Thanks again


2008/9/30 Christian Hennig [EMAIL PROTECTED]


Hi there,

generally finding the right number of clusters is a difficult problem and
depends heavily on the cluster concept needed for the particular
application.
No outcome of any automatic method should be taken for granted.

Having said that, I guess that something like the example given in


?pam.object


(replacing pam by clara) should work with clara, too.

Regards,
Christian



On Tue, 30 Sep 2008, pacomet wrote:

 Hi everyone


I have a question about clustering. I've managed using CLARA to get a
clustering analysis of a large data set. But now I want to find which is
the
right number of clusters.

The clara.object gives some information like the ratio between maximal and
minimal dissimilarity that says (maybe if lower than 1??) if a cluster is
well-separated from the other. I've also read something about silhouette
and
abut cluster.stats but can't manage to get how to find the right number of
clusters.

I've tried a suggestion from the mailing list but when using dist

d1 <- dist(mydata$sst)

it says that specified vector size is too big

Is there any method to find the right number of clusters when using clara?
Maybe something I've tried but with a small and simple trick I can't find

Thanks in advance

--
_
El ponent la mou, el llevant la plou
Usuari Linux registrat: 363952
---
Fotos: http://picasaweb.google.es/pacomet










--
_
El ponent la mou, el llevant la plou
Usuari Linux registrat: 363952
---
Fotos: http://picasaweb.google.es/pacomet






[R] intToUtf8

2008-09-19 Thread Christian Hennig

Hi there,

any explanation for this?


intToUtf8(66)

Error in intToUtf8(66) : argument 'x' must be an integer vector

intToUtf8(c(66,55))

Error in intToUtf8(c(66, 55)) : argument 'x' must be an integer vector

intToUtf8(c(66,55),multiple=TRUE)

Error in intToUtf8(c(66, 55)) : argument 'x' must be an integer vector

Errr... 66 and c(66,55) are as integer vectorish as anything can be, aren't 
they?



version

   _
platform   i686-pc-linux-gnu
arch   i686
os linux-gnu
system i686, linux-gnu
status
major  2
minor  6.0
year   2007
month  10
day03
svn rev43063
language   R
version.string R version 2.6.0 (2007-10-03)

Thanks,
Christian




Re: [R] intToUtf8

2008-09-19 Thread Christian Hennig

Dear Duncan,

No, they're numeric.  integer here is a type, not the mathematical thing. 
This works:


intToUtf8(as.integer(c(66,55)))

The docs don't mention this requirement, and it does seem somewhat 
unnecessary; I'll look into it.


Thanks!

You're lucky I didn't read to the end of your message:  that's a pretty old 
version.  Please test on current versions before reporting problems.


Sorry, but people won't always install the most recent version when running 
into smallish problems like this. Particularly not when they're working on 
machines where they are not superuser. We have other things to do than 
updating every two months. I may be an egoist here, but I think 
the list has to live with requests concerning old versions. (I give the 
version so that people can tell me whether it has been fixed in the 
meantime.)
If you personally don't respond to them, that's your choice of course. But 
it doesn't work as a policy.


Regards,
Christian




Re: [R] About clustering techniques

2008-07-29 Thread Christian Hennig

Dear Paco,

in order to use the methods in the cluster package (including pam), look up 
the help page of daisy, which is able to compute dissimilarity matrices

handling missing values appropriately (in most situations).

A good reference is the Kaufman and Rousseeuw book cited on that help page.
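A rough sketch of the daisy/pam route (the column names, value ranges, and missingness pattern below are invented):

```r
# daisy computes dissimilarities with NAs excluded pairwise per variable;
# pam then clusters the resulting dissimilarity object directly.
library(cluster)
set.seed(7)
dat <- data.frame(lat = runif(30, 38, 41), lon = runif(30, -1, 1),
                  t1 = rnorm(30, 15), t2 = rnorm(30, 16))
dat$t1[sample(30, 4)] <- NA          # some missing monthly values
d   <- daisy(dat, stand = TRUE)      # standardised, NA-tolerant dissimilarities
fit <- pam(d, k = 3)
table(fit$clustering)                # cluster sizes
```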

Christian

On Tue, 29 Jul 2008, pacomet wrote:


Hello R users

For some time I have been playing with a dataset to do some cluster analysis. The
data set consists of 14 columns being geographical coordinates and monthly
temperatures in annual files:

latitude - longitude - temperature 1 - ... - temperature 12

I have some missing values in some cases; maybe there are 8 monthly valid
values at some points with four non-valid. I don't want to suppress the whole
row with 8 good/4 bad values as I want to try annual and monthly analysis.

I first tried kmeans but found a problem with missing values. When trying
without omitting missing values kmeans gives an error and when excluding
invalid data too many values are excluded in some years of the data series.

Now I have been reading about pam, pamk and clara, I think they can handle
missing values. But can't find out the way to perform the analysis with
these functions. As I'm not an statistics nor an R expert the fpc or cluster
package documentation is not enough for me. If you know about a website or a
tutorial explaining the way to use that functions, with examples to check if
possible, please post them.

Any other help or suggestion is greatly appreciated.

Thanks in advance

Paco

--
_
El ponent la mou, el llevant la plou
Usuari Linux registrat: 363952
---
Fotos: http://picasaweb.google.es/pacomet








Re: [R] About clustering techniques

2008-07-29 Thread Christian Hennig
A quick comment on this: imputation is an option to make things 
technically work, but it is not necessarily good. Imputation always 
introduces some noise, i.e., it fakes information that is not really there.


Whether it is good depends strongly on the data, the situation, and the 
imputation method ("random" often not being a very sensible choice).


Christian

On Tue, 29 Jul 2008, [EMAIL PROTECTED] wrote:


Hi Paco,
I got the same problem with you before. Thus, I just impute the missing 
values

For example:

newdata <- as.matrix(impute(olddata, fun="random"))
then I believe that you could analyze your data.

Hopefully it helps.
Chunhao


Quoting pacomet [EMAIL PROTECTED]:


Hello R users

It's some time I am playing with a dataset to do some cluster 
analysis. The
data set consists of 14 columns being geographical coordinates and 
monthly

temperatures in annual files

latitude - longitude - temperature 1 - ... - temperature 12

I have some missing values in some cases, maybe there are 8 monthly 
valid
values at some points with four non valid. I don't want to supress the 
whole
row with 8 good/4 bad values as I wanna try annual and monthy 
analysis.


I first tried kmeans but found a problem with missing values. When 
trying
without omitting missing values kmeans gives an error and when 
excluding
invalid data too many values are excluded in some years of the data 
series.


Now I have been reading about pam, pamk and clara, I think they can 
handle
missing values. But can't find out the way to perform the analysis 
with
these functions. As I'm not an statistics nor an R expert the fpc or 
cluster
package documentation is not enough for me. If you know about a 
website or a
tutorial explaining the way to use that functions, with examples to 
check if

possible, please post them.

Any other help or suggestion is greatly appreciated.

Thanks in advance

Paco

--
_
El ponent la mou, el llevant la plou
Usuari Linux registrat: 363952
---
Fotos: http://picasaweb.google.es/pacomet










Re: [R] cluster.stats

2008-06-14 Thread Christian Hennig
The given information is not enough to tell what's going on. as.dist 
doesn't appear in the given code, and it's not clear to me what kind of 
object img is ("a small image" doesn't tell me what R makes of it).
Also, try to read the help pages first and find out whether img is of the 
format that is required by the functions. And check (using str for example) 
whether data is what you expect it to be.


Christian

On Sat, 14 Jun 2008, Laura Poggio wrote:


Thank you very much for your answer.
I tried to run the function on my data and now I am getting this message of
error
Error in as.dist(dmat[clustering == i, clustering == i]) :  (subscript)
logical subscript too long

Below the code I am using (version2.7.0 of R with all packages updated):

data <- as(img, "data.frame")[1:1]   # (where img is a small image 256 px x 256 px)
kl <- kmeans(data, 5)
library(fpc)
cluster.stats(data, kl$cluster)

Thank you for any hints on the reasons and meaning of the error!

Laura





2008/6/13 Christian Hennig [EMAIL PROTECTED]:


Dear Laura,

 Dear list,

I just tried to use the function cluster.stat in the package fpc.
I just have a couple of questions about the syntax:

cluster.stats(d,clustering,alt.clustering=NULL,
silhouette=TRUE,G2=FALSE,G3=FALSE)

1) the distance object (d) is an object obtained by the function dist() on
my own original matrix?



d is allowed to be an object of class dist or a dissimilarity matrix.
The answer to your question depends on what your original matrix is. If
it is something on which you can compute a distance by dist(), you're right,
at least if dist() delivers the distance you are interested in.

 2) clustering is the clusters vector as result of one of the many

clustering
methods?



The help page tells you what clustering can be. So it could be the
clustering/partition vector of a clustering method or it could be something
else. Note that cluster.stats doesn't depend on any particular clustering
method. It computes the statistics regardless of where the clustering vector
comes from.

Best regards,
Christian



Thank you very much in advance and sorry for such basic question, but I
did
not manage to clarify my mind.

Laura













Re: [R] cluster.stats

2008-06-14 Thread Christian Hennig

Dear Laura,

I have R 2.6.0. I tried dist on a vector of length 200,000 and it told me 
that it is too long. Theoretically, if you have 260,000 observations, the 
length of the dist object should be 260,000*259,999/2, which is too large 
for our computers, I guess. Which means that unfortunately cluster.stats 
won't work for such a large data set, because it needs the full casewise 
dissimilarity information.
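The arithmetic behind that limit can be written down directly:

```r
# A dist object for n cases stores n*(n-1)/2 dissimilarities.
n <- 260000
pairs <- n * (n - 1) / 2
pairs                      # about 3.4e10 distances
pairs * 8 / 2^30           # roughly 252 GiB at 8 bytes per double
```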


I don't understand how you managed to produce a dist object of length 
only 130,000 out of your data, but it certainly doesn't give all 
pairwise distance information for 260,000 points and therefore cannot be 
used in cluster.stats with a clustering vector of length 260,000 or so.


Sorry,
Christian

On Sat, 14 Jun 2008, Laura Poggio wrote:


Thank. See below.

Laura

2008/6/14 Christian Hennig [EMAIL PROTECTED]:


What does str(ddata) give?



Class 'dist'  atomic [1:130816]   69.2 117.1 145.6 179.9 195.6 ...




dcent doesn't make sense as input for cluster.stats, because you need a
dissimilarity matrix between all objects.



Yes I know ... I simply try to see if something was changing with different
structure of data





Christian

On Sat, 14 Jun 2008, Laura Poggio wrote:

 I am sorry I did not provide enough information.

I am not using img later, but data that is data.frame.
I wrote that img is a image just to explain what kind of data is coming
from, but the object I am using is data and it is a data.frame (checked
many
times).

I am not using as.dist, but dist in order to calculate the distance matrix
among the data I have. Then the whole code I am using is:

data <- as(img, "data.frame")[1:1]   # (where img is an image 256x256 px)
kl <- kmeans(data, 5)
library(fpc)
ddata <- dist(data)
dcent <- dist(kl$centers)

cluster.stats(ddata, kl$cluster)
cluster.stats(dcent, kl$cluster)

In both cases I got the same error:
Error in as.dist(dmat[clustering == i, clustering == i]) :  (subscript)
logical subscript too long

Below the structure of the different objects is detailed below:
data is 'data.frame':   262144 obs. of  1 variable
kl$centers is num [1:5, 1]
kl$cluster is Named int [1:262144]

I hope it is more informative. I am sorry but I did not find any
explanation
for the error message I am getting.

Thank you very much in advance

Laura



2008/6/14 Christian Hennig [EMAIL PROTECTED]:

 The given information is not enough to tell you what's going on. as.dist

doesn't appear in the given code and it's not clear to me what kind of
object img is (a small image doesn't tell me what R makes of it).
Also, try to read the help pages first and find out whether img is of the
format that is required by the functions. And check (using str for
example)
whether data is what you expect it to be.

Christian


On Sat, 14 Jun 2008, Laura Poggio wrote:

 Thank you very much for your answer.


I tried to run the function on my data and now I am getting this message
of
error
Error in as.dist(dmat[clustering == i, clustering == i]) :  (subscript)
logical subscript too long

Below the code I am using (version2.7.0 of R with all packages updated):

data <- as(img, "data.frame")[1:1]   # (where img is a small image 256 px x 256 px)
kl <- kmeans(data, 5)
library(fpc)
cluster.stats(data, kl$cluster)

Thank you for any hints on the reasons and meaning of the error!

Laura





2008/6/13 Christian Hennig [EMAIL PROTECTED]:

 Dear Laura,



 Dear list,

 I just tried to use the function cluster.stat in the package fpc.

I just have a couple of questions about the syntax:

cluster.stats(d,clustering,alt.clustering=NULL,
silhouette=TRUE,G2=FALSE,G3=FALSE)

1) the distance object (d) is an object obtained by the function
dist()
on
my own original matrix?


 d is allowed to be an object of class dist or a dissimilarity matrix.

The answer to your question depends on what your original matrix is.
If
it is something on which you can compute a distance by dist(), you're
right,
at least if dist() delivers the distance you are interested in.

 2) clustering is the clusters vector as result of one of the many

 clustering

methods?


 The help page tells you what clustering can be. So it could be the

clustering/partition vector of a clustering method or it could be
something
else. Note that cluster.stats doesn't depend on any particular
clustering
method. It computes the statistics regardless of where the clustering
vector
comes from.

Best regards,
Christian


 Thank you very much in advance and sorry for such basic question, but
I


did
not manage to clarify my mind.

Laura





Re: [R] cluster.stats

2008-06-13 Thread Christian Hennig

Dear Laura,


Dear list,
I just tried to use the function cluster.stats in the package fpc.
I just have a couple of questions about the syntax:

cluster.stats(d,clustering,alt.clustering=NULL,
silhouette=TRUE,G2=FALSE,G3=FALSE)

1) the distance object (d) is an object obtained by the function dist() on
my own original matrix?


d is allowed to be an object of class dist or a dissimilarity matrix.
The answer to your question depends on what your original matrix is. If 
it is something on which you can compute a distance by dist(), you're 
right, at least if dist() delivers the distance you are interested in.



2) clustering is the clusters vector as a result of one of the many clustering
methods?


The help page tells you what clustering can be. So it could be the 
clustering/partition vector of a clustering method or it could be something 
else. Note that cluster.stats doesn't depend on any particular clustering 
method. It computes the statistics regardless of where the clustering 
vector comes from.
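A minimal sketch of a typical call, on toy data, with pam() from package cluster used only as one example clustering method (any partitioning method would do):

```r
library(fpc)      # cluster.stats()
library(cluster)  # pam(), used here only as an example clustering method

set.seed(1)
## toy data: two well-separated Gaussian groups
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))

d  <- dist(x)         # Euclidean distances from the original data matrix
cl <- pam(d, k = 2)   # the clustering/partition vector can come from anywhere

cs <- cluster.stats(d, cl$clustering)
cs$cluster.number     # 2
cs$avg.silwidth       # average silhouette width
```

cluster.stats() only needs the dissimilarities and the partition vector; it does not care which method produced the latter.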


Best regards,
Christian



Thank you very much in advance, and sorry for such a basic question, but I did
not manage to clarify my mind.

Laura



Re: [R] Cluster analysis with numeric and categorical variables

2008-06-03 Thread Christian Hennig

Dear Miha,

a general way to do this is as follows:
Define a distance measure by aggregating the 
Euclidean distance on the (X,Y)-space and the trivial 0-1 distance (0 if 
category is the same) on the categorial variable. Perform cluster analysis 
(whichever you want) on the resulting distance matrix.


Note that there is more than one way to do this. The 0-1 distance could be
incorporated in the definition of the Euclidean distance (instead of
(x_i-y_i)^2), or a weighted average of the distances in X-, Y- and
categorical space could be computed. Weights of variables (including
possible rescaling) have to be decided. How to do this precisely should
depend on the subject matter and prior information about variable
importance etc. In the absence of such information, you may standardise the
variablewise sums of squared pairwise distances to be equal.
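A minimal sketch of the aggregated-distance idea on toy data (the weight w and the choice of pam() are illustrative assumptions, not recommendations):

```r
library(cluster)  # pam(); daisy() would be a ready-made alternative

set.seed(1)
## toy data: two coordinates plus one categorical variable
df <- data.frame(X = rnorm(20), Y = rnorm(20),
                 cat = sample(c("a", "b", "c"), 20, replace = TRUE))

d_xy  <- dist(df[, c("X", "Y")])   # Euclidean distance on the (X,Y)-space
## trivial 0-1 distance on the categorical variable
d_cat <- as.dist(outer(df$cat, df$cat, FUN = "!=") * 1)

w     <- 0.5               # weight: a subject-matter decision
d_all <- d_xy + w * d_cat  # aggregated dissimilarity

cl <- pam(d_all, k = 3)    # any method accepting a distance matrix works
table(cl$clustering)
```

For mixed-type data, daisy() in package cluster computes the related Gower dissimilarity out of the box.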


Hope this helps (and you can figure out the relevant R code yourself).

Christian

On Tue, 3 Jun 2008, Miha Staut wrote:


Dear all,

I would like to perform a clustering analysis on a data frame with two coordinate
variables (X and Y) and a categorical variable where only a != b can be
established. As far as I understood classification analyses, they are not an
option, as they only partition the test set into the k classes given by the
training set. By searching through the book "Modern Applied Statistics with S"
I did not find a satisfactory solution.

I will be grateful for any suggestions.

Best regards
Miha





Re: [R] Clustering large data matrix

2008-03-06 Thread Christian Hennig

Hi there,

whether clara is a proper way of clustering depends strongly on what your data
are and particularly on what interpretation or use you want for your
clustering. You may do better with a hierarchical method after having defined a
proper distance (however, this would rather be a matter of statistical
consultation than of R help).


Assuming that you use some reasonable dimension reduction and clustering
method, you may get a good visualization of your clustering using the methods
available via the functions plotcluster/discrproj in package fpc.
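A minimal sketch on toy data (clara() stands in for whatever clustering method is actually used):

```r
library(fpc)      # plotcluster() / discrproj()
library(cluster)  # clara()

set.seed(1)
## toy data: 100 observations in 4 dimensions, two groups
x <- rbind(matrix(rnorm(200, sd = 0.3), ncol = 4),
           matrix(rnorm(200, mean = 1, sd = 0.3), ncol = 4))

cl <- clara(x, k = 2)

## projection chosen to display the separation between the given clusters,
## instead of just the first two principal components
plotcluster(x, cl$clustering)
```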


Best,
Christian

On Thu, 6 Mar 2008, Dani Valverde wrote:


Hello,
I have a large data matrix (68x13112), each row corresponding to one
observation (patients) and each column corresponding to the variables
(points within an NMR spectrum). I would like to carry out some kind of
clustering on these data to see how many clusters are there. I have
tried the function clara() from the package cluster. If I use the matrix
as is, I can perform the clara analysis but when I call clusplot() I get
this error:

Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
'princomp' can only be used with more units than variables

Then, I reduce the dimensionality by using the function prcomp(). I
take the first 13 principal components (80% of the variability) and carry
out the clara() analysis again. Then, I call the clusplot() function
again and voilà!, it works. The problem is that clusplot() only
represents the first two components of my prcomp() analysis, which
represent only 15% of the variability.
So, my questions are: 1) is clara() a proper way to analyze such a large
data set? and 2) is there an appropriate method for graphic plotting of
my data that takes into account the whole variability of my data, not
just two principal components?
Many thanks.
Best,

Dani

--
Daniel Valverde Saubí

Grup de Biologia Molecular de Llevats
Facultat de Veterinària de la Universitat Autònoma de Barcelona
Edifici V, Campus UAB
08193 Cerdanyola del Vallès- SPAIN

Centro de Investigación Biomédica en Red
en Bioingeniería, Biomateriales y
Nanomedicina (CIBER-BBN)

Grup d'Aplicacions Biomèdiques de la RMN
Facultat de Biociències
Universitat Autònoma de Barcelona
Edifici Cs, Campus UAB
08193 Cerdanyola del Vallès- SPAIN
+34 93 5814126



Re: [R] R and Clusters

2008-01-07 Thread Christian Hennig
Dear Lorenzo,

if I understand your posting correctly, this is exactly what Single 
Linkage clustering does if you cut the dendrogram tree at your threshold 
distance.

Therefore you can use hclust with method = "single" (which produces the full
dendrogram; you have to generate the Euclidean distances first) and then
feed the output into cutree with h = 0.5 (your threshold), which produces
a partition by cutting the dendrogram at height 0.5.
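In code, with hypothetical particle positions standing in for your data:

```r
set.seed(1)
## hypothetical particle positions: 10 particles in 3 dimensions
pos <- matrix(runif(30), ncol = 3)

d    <- dist(pos)                     # Euclidean distances
hc   <- hclust(d, method = "single")  # single linkage dendrogram
comp <- cutree(hc, h = 0.5)           # cut at the threshold distance 0.5

max(comp)                        # number of connected components
split(seq_len(nrow(pos)), comp)  # composition of each component
```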

Hope this helps,
Christian

On Mon, 7 Jan 2008, Lorenzo Isella wrote:

 Dear All,
 I hope I am not asking a FAQ. I am dealing with a problem of graph
 theory [connected components in a non-directed graph] and I do not
 want to rediscover the wheel.
 I saw a large number of R packages dealing for instance with the
 k-means method or hierarchical clustering for spatially distributed
 data and I am basically facing a similar problem.
 I am given a set of data which are the positions of particles in 3
 dimensions; I define two particles A and B to be directly connected if
 their Euclidean distance is below a certain threshold d. If A and B
 are directly connected and B and C are directly connected, then A,B
 and C are connected components (physically it means that they are
 members of the same cluster).
 All my N particles then split into k disjoint clusters, each with a
 certain number of connected components, and this is what I want to
 investigate.
 I do not know a priori how many clusters I have (this is my problem
 with e.g. k-means since k is an output for me); the only input is the
 set of 3-dimensional particle positions and a threshold distance.
 The algorithm/package I am looking should return the number of
 clusters and the composition of each cluster, e.g. the fact that the
 second cluster is made up of particles {R,T,L}.
 Consider for instance:

 # a 2-dimensional example
 x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
            matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
 colnames(x) <- c("x", "y")

 How can I then find out how many connected components I have when my
 threshold distance is d=0.5?

 Many thanks

 Lorenzo



[R] order a matrix

2007-11-05 Thread Christian Hennig
Dear list,

order(x,y,z) returns a permutation to order x, ties broken by y, remaining 
ties broken by z. (And so on.)

What I'd like to do is
order(X), where X is a matrix (or a list or data frame if necessary) of
unspecified size, which orders by X[,1], ties broken by X[,2], remaining ties
broken by X[,3], and so on, without having to know in advance how many
columns X has.

Any ideas how to achieve that?

Thanks,
Christian



Re: [R] order a matrix

2007-11-05 Thread Christian Hennig
That solved it!

Thank you very much!

On Mon, 5 Nov 2007, Prof Brian Ripley wrote:

 On Mon, 5 Nov 2007, Christian Hennig wrote:

 Dear list,
 
 order(x,y,z) returns a permutation to order x, ties broken by y, remaining
 ties broken by z. (And so on.)
 
 What I'd like to do is
 order(X), where X is a matrix (or a list or data frame if necessary) of
 unspecified size, which orders X[,1], ties broken by X[,2], remaining ties
 broken by X[,3] and so on - without having to know and to write down how
 many columns  X has.
 
 Any ideas how to achieve that?

 For a list or data frame, do.call(order, X).
 For a matrix, do.call(order, split(A, col(A))).
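 A quick check of both idioms on a small hypothetical data frame:

```r
X <- data.frame(a = c(2, 1, 2, 1), b = c(1, 2, 2, 1))
do.call(order, X)                 # a data frame is already a list of columns: 4 2 1 3

A <- as.matrix(X)
do.call(order, split(A, col(A)))  # same permutation for the matrix version
```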

 -- 
 Brian D. Ripley,  [EMAIL PROTECTED]
 Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
 University of Oxford, Tel:  +44 1865 272861 (self)
 1 South Parks Road, +44 1865 272866 (PA)
 Oxford OX1 3TG, UKFax:  +44 1865 272595

