You probably want to use runif() instead of rnorm() to get equal probability of selecting between i and j: abs(rnorm(1)) < 0.5 is true only about 38% of the time, so your current coin flip is biased towards dropping j.
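For example, the coin flip inside your loop could read (a minimal sketch, using the variable names from your posted code):

  ## runif(1) is uniform on [0, 1), so each branch is taken with probability 1/2
  if (runif(1) < 0.5) {
    drop <- c(drop, i)
  } else {
    drop <- c(drop, j)
  }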
Your algorithm is of order n^2 [ 294 choose 2, 293 choose 2, ... ], so it should not be too slow. But two for() loops are inefficient in R; something like this would be fairly fast in C.

What is your aim in trying to do this? Your algorithm is similar to hclust() - which has nice graphical support - except that hclust() merges the two nearest neighbours to form a new centroid instead of removing one of the neighbours. By removing columns at an early stage you are losing information.

The alternative would be to use hclust(), select a similarity/dissimilarity cutoff to create groups, and then from each group either take the average profile or randomly select one column to represent the group. (A rough sketch of this approach is appended after the quoted message below.)

-- Adaikalavan Ramasamy

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Rajarshi Guha
Sent: Friday, November 21, 2003 11:23 AM
To: R
Subject: [R] speeding up a pairwise correlation calculation

Hi, I have a data.frame with 294 columns and 211 rows. I am calculating correlations between all pairs of columns (excluding column 1), and based on these correlation values I delete one column from any pair that shows an R^2 greater than a cutoff value. (Rather than directly deleting the column, all I do is store the column number and do the deletion later.)

The code I am using is:

  ndesc <- length(names(data))
  for (i in 2:(ndesc - 1)) {
    for (j in (i + 1):ndesc) {
      if (i %in% drop || j %in% drop) next
      r2 <- cor(data[, i], data[, j])
      r2 <- r2 * r2
      if (r2 >= r2cut) {
        rnd <- abs(rnorm(1))
        if (rnd < 0.5) {
          drop <- c(drop, i)
        } else {
          drop <- c(drop, j)
        }
      }
    }
  }

drop is a vector that contains the column numbers that can be skipped; data is the data.frame.

For the data.frame mentioned above (279 columns, 211 rows) the calculation takes more than 7 minutes (after which I Ctrl-C'ed the calculation). The machine is a 1GHz Duron with 1GB RAM. The output of version is:

  platform  i686-pc-linux-gnu
  arch      i686
  os        linux-gnu
  system    i686, linux-gnu
  status
  major     1
  minor     7.1
  year      2003
  month     06
  day       16
  language  R

I'm not too sure why it takes *so* long (I had done a similar calculation in Python using list operations and it took forever), but is there any trick that could be used to make this run faster, or is this type of runtime to be expected?

Thanks,
-------------------------------------------------------------------
Rajarshi Guha <[EMAIL PROTECTED]> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
A red sign on the door of a physics professor: 'If this sign is blue, you're going too fast.'
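P.S. A rough, untested sketch of the hclust() route. It assumes all descriptor columns of data are numeric, excludes column 1 as in the posted loop, and uses r2cut as the R^2 cutoff from the post; cc, d, hc, groups, keep and reduced are illustrative names, and average linkage with a cut height of 1 - r2cut is just one reasonable choice. Note that a single call to cor() on the whole data.frame replaces the double loop.

  ## full correlation matrix of the descriptor columns in one vectorized call
  cc <- cor(data[, -1])

  ## treat 1 - R^2 as a dissimilarity, so highly correlated columns are close
  d <- as.dist(1 - cc^2)

  ## cluster and cut the tree so that groups roughly correspond to
  ## within-group R^2 above r2cut
  hc <- hclust(d, method = "average")
  groups <- cutree(hc, h = 1 - r2cut)

  ## keep one randomly chosen column name from each group; the rest are dropped
  keep <- sapply(split(names(groups), groups),
                 function(cols) sample(cols, 1))
  reduced <- data[, c(names(data)[1], keep)]

Instead of sample(cols, 1) you could take the group mean to get an average profile per group, as mentioned above.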
