Here is an even faster one; the general point is to create a properly vectorized custom function/expression:

mymean <- function(x, y, z) (x+y+z)/3

a = data.frame(matrix(1:3e4, ncol=3))
attach(a)
print(system.time({r3 = mymean(X1,X2,X3)}))
detach(a)
# Yields:
# [1] 0.000 0.010 0.005 0.000 0.000

# r2 is the mapply() result from Wolfgang's code quoted below
print(identical(r2, r3))
# [1] TRUE

# My values for version 1 and 2 resp. were
# time for r1: [1] 29.420 23.090 60.093  0.000  0.000
# time for r2: [1]  1.400  0.050  1.505  0.000  0.000

Best wishes
Ulf

P.S. A somewhat more meaningful comparison of version 2 and 3:

a = data.frame(matrix(1:3e5, ncol=3))
# time r2e5: [1] 12.04  0.15 12.92  0.00  0.00
# time r3e5: [1] 0.030 0.020 0.051 0.000 0.000

> depending on your problem, using "mapply" might help, as in the code
> example below:
>
> a = data.frame(matrix(1:3e4, ncol=3))
>
> print(system.time({
> r1 = numeric(nrow(a))
> for(i in seq_len(nrow(a))) {
>   g = a[i,]
>   r1[i] = mean(c(g$X1, g$X2, g$X3))
> }}))
>
> print(system.time({
> f = function(X1,X2,X3) mean(c(X1, X2, X3))
> r2 = do.call("mapply", args=append(f, a))
> }))
>
> print(identical(r1, r2))
>
> #   user  system elapsed
>    6.049   0.200   6.987
>    user  system elapsed
>    0.508   0.000   0.509
> [1] TRUE
>
> Best wishes
> Wolfgang
>
> Roger D. Peng wrote:
>> Extracting rows from data frames is tricky, since each of the columns
>> could be of a different class. For your toy example, it seems a matrix
>> would be a more reasonable option.
>>
>> R-devel has some improvements to row extraction, if I remember correctly.
>> You might want to try your example there.
>>
>> -roger
>>
>> Herve Pages wrote:
>>> Hi,
>>>
>>> I have a big data frame:
>>>
>>> > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>>> > dat <- as.data.frame(mat)
>>>
>>> and I need to do some computation on each row. Currently I'm doing this:
>>>
>>> > for (key in row.names(dat)) { row <- dat[key, ]; ... do some
>>> computation on row... }
>>>
>>> which could probably be considered a very natural (and R'ish) way of
>>> doing it (but maybe I'm wrong and the real idiom for doing this is
>>> something different).
>>>
>>> The problem with this "idiomatic form" is that it is _very_ slow. The
>>> loop itself + the simple extraction of the rows (no computation on the
>>> rows) takes 10 hours on a powerful server (quad core Linux with 8G of
>>> RAM)!
>>>
>>> Looping over the first 100 rows takes 12 seconds:
>>>
>>> > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
>>>    user  system elapsed
>>>  12.637   0.120  12.756
>>>
>>> But if, instead of the above, I do this:
>>>
>>> > for (i in 1:nrow(dat)) { row <- sapply(dat, function(col) col[i]) }
>>>
>>> then it's 20 times faster!!
>>>
>>> > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>>>    user  system elapsed
>>>   0.576   0.096   0.673
>>>
>>> I hope you will agree that this second form is much less natural.
>>>
>>> So I was wondering why the "idiomatic form" is so slow? Shouldn't the
>>> idiomatic form be, not only elegant and easy to read, but also
>>> efficient?
>>>
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>> > sessionInfo()
>>> R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
>>> [7] "base"
>>>
>>> ______________________________________________
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
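
Addendum: Roger's matrix suggestion and the vectorized mymean() idea can be put side by side. What follows is only a rough sketch, not a benchmark. It assumes every column is numeric, so that as.matrix() loses nothing, and the name r4 is made up here for illustration.

```r
# Sketch only: assumes all columns of the data frame are numeric.
a <- data.frame(matrix(1:3e4, ncol = 3))

# Vectorized arithmetic on whole columns (same idea as mymean above).
r3 <- (a$X1 + a$X2 + a$X3) / 3

# Matrix route: rowMeans() computes all row means in a single call.
# r4 is a hypothetical name, not part of the original thread.
r4 <- rowMeans(as.matrix(a))

# Compare up to floating-point tolerance; unname() guards against
# row names being carried over into the result.
print(all.equal(r3, unname(r4)))
```

Both forms avoid extracting one row at a time, which is the step the thread identifies as slow. For a data frame with mixed column classes, as.matrix() would coerce everything to a common type, so the column-wise form is probably the safer of the two in that case.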