Roger D. Peng wrote:
> Extracting rows from data frames is tricky, since each of the columns
> could be of a different class. For your toy example, it seems a matrix
> would be a more reasonable option.
There is no doubt about this ;-)

> mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
> dat <- as.data.frame(mat)

With the matrix:

> system.time(for (i in 1:100) { row <- mat[i, ] })
   user  system elapsed
      0       0       0

With the data frame:

> system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
   user  system elapsed
 12.565   0.296  12.859

And even with a mixed-type data frame, it's very tempting to convert it
to a matrix before doing any loop on it:

> dat2 <- as.data.frame(mat, stringsAsFactors=FALSE)
> dat2 <- cbind(dat2, ii=1:300000)
> sapply(dat2, typeof)
         V1          V2          V3          V4          V5          ii
"character" "character" "character" "character" "character"   "integer"
> system.time(for (key in row.names(dat2)[1:100]) { row <- dat2[key, ] })
   user  system elapsed
 13.201   0.144  13.360
> system.time({mat2 <- as.matrix(dat2); for (i in 1:100) { row <- mat2[i, ] }})
   user  system elapsed
  0.128   0.036   0.163

Big win, isn't it? (Only if you have enough memory for it, though...)

Cheers,
H.

> R-devel has some improvements to row extraction, if I remember
> correctly. You might want to try your example there.
>
> -roger
>
> Herve Pages wrote:
>> Hi,
>>
>> I have a big data frame:
>>
>> > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>> > dat <- as.data.frame(mat)
>>
>> and I need to do some computation on each row. Currently I'm doing this:
>>
>> > for (key in row.names(dat)) { row <- dat[key, ]; ... do some
>> computation on row... }
>>
>> which could probably be considered a very natural (and R'ish) way of
>> doing it (but maybe I'm wrong and the real idiom for doing this is
>> something different).
>>
>> The problem with this "idiomatic form" is that it is _very_ slow. The
>> loop itself + the simple extraction of the rows (no computation on
>> the rows) takes 10 hours on a powerful server (quad core Linux with
>> 8G of RAM)!
>>
>> Looping over the first 100 rows takes 12 seconds:
>>
>> > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
>>    user  system elapsed
>>  12.637   0.120  12.756
>>
>> But if, instead of the above, I do this:
>>
>> > for (i in 1:nrow(dat)) { row <- sapply(dat, function(col) col[i]) }
>>
>> then it's 20 times faster!!
>>
>> > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>>    user  system elapsed
>>   0.576   0.096   0.673
>>
>> I hope you will agree that this second form is much less natural.
>>
>> So I was wondering why the "idiomatic form" is so slow? Shouldn't the
>> idiomatic form be, not only elegant and easy to read, but also
>> efficient?
>>
>> Thanks,
>> H.
>>
>> > sessionInfo()
>> R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
>> [7] "base"
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
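[Editor's note: the thread's comparison can be reproduced with the minimal, self-contained sketch below. It uses the same construction as the thread's example but with 30,000 rows instead of 300,000 so it runs quickly; absolute timings will of course vary by machine, but the matrix loop should be dramatically cheaper than the data-frame loop.]

```r
## Build a character data frame and its matrix counterpart
## (same shape as in the thread, fewer rows).
n <- 30000
mat <- matrix(rep(paste(letters, collapse = ""), 5 * n), ncol = 5)
dat <- as.data.frame(mat, stringsAsFactors = FALSE)

## Slow path: `[.data.frame` dispatches S3 methods and assembles a
## brand-new one-row data frame on every iteration.
t_df <- system.time(for (key in row.names(dat)[1:100]) row <- dat[key, ])

## Fast path: after converting to a matrix (all columns share one
## type), row extraction is a cheap internal operation.
mat2 <- as.matrix(dat)
t_mat <- system.time(for (i in 1:100) row <- mat2[i, ])

## Both approaches extract the same values.
stopifnot(identical(unlist(dat[1, ], use.names = FALSE),
                    unname(mat2[1, ])))

t_df["elapsed"]
t_mat["elapsed"]
```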