Thanks for the suggestion. I found some documentation explaining why accessing a data.frame with matrix notation (e.g., [i,j]) is so expensive, which was the cause of the problem.
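For the archives, here is a minimal sketch of one way around the slowdown, assuming every column of the data frame is numeric: convert to a matrix once with as.matrix() and index the matrix inside the loops. The name dist.for.matrix is made up for this example and is not part of the original code; the computation is otherwise the same as in dist.for.

```r
# Sketch only: dist.for.matrix is a hypothetical name used for illustration.
# Assumes all columns of `data` are numeric, so as.matrix() yields a numeric matrix.
dist.for.matrix <- function(data) {
  m <- as.matrix(data)   # one-time conversion; a no-op if data is already a matrix
  r <- nrow(m)
  n <- ncol(m)
  d <- matrix(0, nrow = r, ncol = r)

  for (i in seq_len(r)) {
    for (j in seq_len(r)) {
      # m[i, ] is plain matrix indexing, so `[.data.frame` is never called here
      d[i, j] <- sum(abs(m[i, ] - m[j, ])) / n
    }
  }

  as.dist(d)
}

# Small self-contained check on a data.frame input
set.seed(1)
x <- as.data.frame(matrix(rnorm(1000), nrow = 100, ncol = 10))
d1 <- dist.for.matrix(x)
```

The same function accepts either a matrix or a data.frame, so the calling code does not need to change. Running Rprof on it, as Jim suggested, should show whether '[' on the data frame still accounts for the time.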
regards,
Roberto

On Thu, Oct 22, 2009 at 12:05 AM, Jim Holtman <jholt...@gmail.com> wrote:
> try running Rprof on the two examples to see what the difference is. what
> you will probably see is a lot of the time on the dataframe is spent in
> accessing it like a matrix ('['). Rprof is very helpful to see where time is
> spent in your scripts.
>
> Sent from my iPhone
>
> On Oct 21, 2009, at 17:17, Roberto Perdisci <roberto.perdi...@gmail.com>
> wrote:
>
>> Hi everybody,
>> I noticed a strange behavior when using loops versus apply() on a data
>> frame.
>> The example below "explicitly" computes a distance matrix given a
>> dataset. When the dataset is a matrix, everything works fine. But when
>> the dataset is a data.frame, the dist.for function written using
>> nested loops will take a lot longer than the dist.apply
>>
>> ######## USING FOR #######
>>
>> dist.for <- function(data) {
>>
>>   d <- matrix(0,nrow=nrow(data),ncol=nrow(data))
>>   n <- ncol(data)
>>   r <- nrow(data)
>>
>>   for(i in 1:r) {
>>     for(j in 1:r) {
>>       d[i,j] <- sum(abs(data[i,]-data[j,]))/n
>>     }
>>   }
>>
>>   return(as.dist(d))
>> }
>>
>> ######## USING APPLY #######
>>
>> f <- function(data.row,data.rest) {
>>
>>   r2 <- as.double(apply(data.rest,1,g,data.row))
>>
>> }
>>
>> g <- function(row2,row1) {
>>   return(sum(abs(row1-row2))/length(row1))
>> }
>>
>> dist.apply <- function(data) {
>>   d <- apply(data,1,f,data)
>>
>>   return(as.dist(d))
>> }
>>
>>
>> ######## TESTING #######
>>
>> library(mvtnorm)
>> data <- rmvnorm(100,mean=seq(1,10),sigma=diag(1,nrow=10,ncol=10))
>>
>> tf <- system.time(df <- dist.for(data))
>> ta <- system.time(da <- dist.apply(data))
>>
>> print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
>> print("tf = ")
>> print(tf)
>> print("ta = ")
>> print(ta)
>>
>> print('----------------------------------')
>> print('Same experiment on data.frame...')
>> data2 <- as.data.frame(data)
>>
>> tf <- system.time(df <- dist.for(data2))
>> ta <- system.time(da <- dist.apply(data2))
>>
>> print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
>> print("tf = ")
>> print(tf)
>> print("ta = ")
>> print(ta)
>>
>> ########################
>>
>> Here is the output I get on my system (R version 2.7.1 on a Debian lenny)
>>
>> [1] "diff = 0"
>> [1] "tf = "
>>    user  system elapsed
>>   0.088   0.000   0.087
>> [1] "ta = "
>>    user  system elapsed
>>   0.128   0.000   0.128
>> [1] "----------------------------------"
>> [1] "Same experiment on data.frame..."
>> [1] "diff = 0"
>> [1] "tf = "
>>    user  system elapsed
>>  35.031   0.000  35.029
>> [1] "ta = "
>>    user  system elapsed
>>   0.184   0.000   0.185
>>
>> Could you explain why that happens?
>>
>> thank you,
>> regards
>>
>> Roberto
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.