Re: [R] cluster analysis for 80000 observations

Martin Maechler Fri, 27 Jan 2006 00:32:26 -0800

>>>>> "Markus" == Markus Preisetanz <[EMAIL PROTECTED]>
>>>>>     on Thu, 26 Jan 2006 20:48:29 +0100 writes:


    Markus> Dear R Specialists,
    Markus> when trying to cluster a data.frame with about 80.000 rows and 25
    Markus> columns I get the above error message. I tried hclust (using dist),
    Markus> agnes (entering the data.frame directly) and pam (entering the
    Markus> data.frame directly). What I actually do not want to do is generate
    Markus> a random sample from the data.

Currently all the above mentioned cluster methods work with
full distance / dissimilarity objects, even if only internally,
i.e. they store all d_{i,j} for 1 <= i < j <= n, i.e. n(n-1)/2 values,
each of them in double precision, i.e. 8 bytes.
For n = 80'000 that is about 3.2e9 values, i.e. roughly 25.6 GB.

So: no chance with the above functions and n = 80'000.
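
The arithmetic behind this can be checked directly in R. The following is
only a rough sketch: the helper name dist_gib() is made up for illustration,
and it counts nothing but the n(n-1)/2 doubles themselves, ignoring any
working copies the clustering functions may allocate on top.

```r
# Hypothetical helper: size in GiB of a full dissimilarity object
# for n observations, assuming 8 bytes per stored value.
dist_gib <- function(n) {
  n_pairs <- n * (n - 1) / 2   # number of d_{i,j} with i < j
  n_pairs * 8 / 2^30           # bytes -> GiB
}

# Compare a few sample sizes with the available RAM.
sizes <- c(5000, 20000, 80000)
data.frame(n = sizes, GiB = round(sapply(sizes, dist_gib), 2))
```

Because the size grows quadratically in n, doubling the number of
observations roughly quadruples the memory needed.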

    Markus> The machine I run R on is a Windows 2000 Server (Pentium 4) with
    Markus> 2 GB of RAM.

If you ran a machine with a 64-bit version of the OS and R
{typical case today: Linux on AMD Opteron}, you could go up
quite a bit higher than on your Windoze box
{I vaguely remember I could do 'n = a few thousand' on our
 dual Opteron with 16 GBytes}, but 80'000 is definitely too
large.

OTOH, there is clara() in the cluster package, which has been
designed for such situations,
         CLARA := [C]lustering [LAR]ge [A]pplications.
It is similar in spirit to pam() and
*does* cluster all 80'000 observations, but does so by taking
subsamples to construct the medoids
(and you can ask it to take many medium-sized subsamples instead
 of just the 5 small ones it takes by default).
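
A minimal sketch of such a call might look as follows. The number and size
of the subsamples are set through the 'samples' and 'sampsize' arguments of
clara(). Everything else here is an assumption made for illustration: the
data are synthetic stand-ins for the real 80'000 x 25 data.frame, and k = 4,
samples = 50 and sampsize = 1000 are arbitrary choices, not recommendations.
Note that clara() expects all variables to be numeric.

```r
library(cluster)

# Synthetic stand-in for the real data (assumption: all columns numeric).
set.seed(1)
x <- matrix(rnorm(80000 * 25), ncol = 25)

k <- 4  # hypothetical number of clusters

# Many medium-sized subsamples instead of the default few small ones.
# Only a 1000 x 1000 dissimilarity is needed per subsample.
fit <- clara(x, k = k, samples = 50, sampsize = 1000)

table(fit$clustering)  # cluster sizes over all 80'000 observations
fit$medoids            # the k medoids (rows of x)
```

Larger values of 'sampsize' cost more time and memory per subsample, so
these two arguments are the knobs to trade running time against the quality
of the medoids.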

Martin Maechler, ETH Zurich
maintainer of "cluster" package.

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
