Dear R community,
I am trying to understand how Ward linkage works from a quantitative point 
of view.
To test it, I have devised a simple three-member set:

                           G = c(0,2,10)

The distances between all pairs are:

d(0,2)  =  2
d(0,10) = 10
d(2,10) =  8
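
These pairwise distances can be checked directly in R. The snippet below is only a small sketch using base R's dist(); the variable name D is chosen here for illustration.

```r
# The three-member set from the text
G <- c(0, 2, 10)

# dist() uses the Euclidean metric by default; as.matrix() expands the result
# into a full symmetric matrix so entries can be indexed by row and column
D <- as.matrix(dist(G))

D[1, 2]  # distance between 0 and 2
D[1, 3]  # distance between 0 and 10
D[2, 3]  # distance between 2 and 10
```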

The smallest distance corresponds to merging 0 and 2. The corresponding ESS are:

ESS(0,2) = 2*var(c(0,2)) = 4
ESS(0,10) = 2*var(c(0,10)) = 100
ESS(2,10) = 2*var(c(2,10)) = 64

and, indeed, the smallest ESS corresponds to merging 0 and 2. The next element 
to be added to the cluster c(0,2) is obviously 10. This is where I don't 
understand how the hclust algorithm in R works. We have

> G <- c(0,2,10)
> G.dist <- dist(G)
> G.hc <- hclust(G.dist,method="ward")
> G.hc$merge
     [,1] [,2]
[1,]   -1   -2
[2,]   -3    1
> G.hc$height
[1]  2.00000 11.33333

Now, according to the standard definition, the distance between two clusters 
Rr and Rs, with Nr and Ns elements respectively, is:

                          d(Rs,Rr) = sqrt(2*Nr*Ns/(Nr+Ns))*||<Rs> - <Rr>||

where < > in the last expression indicates averages (centroids). If I carry out 
this operation to merge cluster c(0,2), whose centroid is 1, with the single 
point 10, I get:

                          d(c(0,2),10) = sqrt(2*2*1/(2+1))*|1-10| = 10.39230

This is different from 11.3333 in the R output.
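
The centroid-based formula above can be written as a small R function so that it is easy to evaluate for any pair of one-dimensional clusters. This is a sketch of the quoted definition only, not a description of what hclust does internally; ward_dist is a hypothetical helper name introduced here.

```r
# Hypothetical helper: cluster distance from the formula quoted above,
#   d(Rs, Rr) = sqrt(2*Nr*Ns/(Nr+Ns)) * |<Rs> - <Rr>|
# for one-dimensional clusters given as numeric vectors.
ward_dist <- function(r, s) {
  n_r <- length(r)
  n_s <- length(s)
  sqrt(2 * n_r * n_s / (n_r + n_s)) * abs(mean(r) - mean(s))
}

# Two singletons: the prefactor is 1, so this is the plain distance |0 - 2|
d_first <- ward_dist(0, 2)

# Cluster c(0, 2) (centroid 1) against the single point 10
d_second <- ward_dist(c(0, 2), 10)

print(c(d_first, d_second))
```

The two values can then be compared with G.hc$height[1] and G.hc$height[2] respectively.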

Does anyone know exactly how the Ward linkage value shown in the hclust 
height output is computed?

Thanks in advance for any help!

J



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.