On 2014-03-17 00:36, William Dunlap wrote:
Duncan's analysis suggests another way to do this: extract the 'x' vector, operate on that vector in a loop, then insert the result into the data.frame.
Thanks Bill, that is a good improvement. Göran
I added a df="quicker" option to your df argument and made the test dataset deterministic so we could verify that the algorithms do the same thing:

    dumkoll <- function(n = 1000, df = TRUE){
        dfr <- data.frame(x = log(seq_len(n)), y = sqrt(seq_len(n)))
        if (identical(df, "quicker")) {
            x <- dfr$x
            for(i in 2:length(x)) {
                x[i] <- x[i-1]
            }
            dfr$x <- x
        } else if (df){
            for (i in 2:NROW(dfr)){
                # if (!(i %% 100)) cat("i = ", i, "\n")
                dfr$x[i] <- dfr$x[i-1]
            }
        }else{
            dm <- as.matrix(dfr)
            for (i in 2:NROW(dm)){
                # if (!(i %% 100)) cat("i = ", i, "\n")
                dm[i, 1] <- dm[i-1, 1]
            }
            dfr$x <- dm[, 1]
        }
        dfr
    }

Timings for 10^4, 2*10^4, and 4*10^4 show that the time is quadratic in n for the df=TRUE case and close to linear in the other cases, with the new method taking about 60% the time of the matrix method:

    > n <- c("10k"=1e4, "20k"=2e4, "40k"=4e4)
    > sapply(n, function(n)system.time(dumkoll(n, df=FALSE))[1:3])
               10k  20k  40k
    user.self 0.11 0.22 0.43
    sys.self  0.02 0.00 0.00
    elapsed   0.12 0.22 0.44
    > sapply(n, function(n)system.time(dumkoll(n, df=TRUE))[1:3])
               10k   20k   40k
    user.self 3.59 14.74 78.37
    sys.self  0.00  0.11  0.16
    elapsed   3.59 14.91 78.81
    > sapply(n, function(n)system.time(dumkoll(n, df="quicker"))[1:3])
               10k  20k  40k
    user.self 0.06 0.12 0.26
    sys.self  0.00 0.00 0.00
    elapsed   0.07 0.13 0.27

I also timed the 2 faster cases for n=10^6 and the time still looks linear in n, with the vector approach still taking about 60% the time of the matrix approach.

    > system.time(dumkoll(n=10^6, df=FALSE))
       user  system elapsed
      11.65    0.12   11.82
    > system.time(dumkoll(n=10^6, df="quicker"))
       user  system elapsed
       6.79    0.08    6.91

The results from each method are identical:

    > identical(dumkoll(100,df=FALSE), dumkoll(100,df=TRUE))
    [1] TRUE
    > identical(dumkoll(100,df=FALSE), dumkoll(100,df="quicker"))
    [1] TRUE

If your data.frame has columns of various types, then as.matrix will coerce them all to a common type (often character), so it may give you the wrong result in addition to being unnecessarily slow.
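To illustrate the coercion point, here is a minimal sketch. The data frame `mixed` and its column `g` are made up for this example and are not part of the thread's test data; the idea is only to show that as.matrix gives the whole matrix a single type once a character column is present, while extracting one column keeps that column's own type.

```r
# Hypothetical data frame with one numeric and one character column.
mixed <- data.frame(x = c(1.5, 2.5, 3.5),
                    g = c("a", "b", "c"),
                    stringsAsFactors = FALSE)

# A matrix holds a single type, so the numeric column is converted too.
m <- as.matrix(mixed)
is.character(m)        # the matrix as a whole is character
is.numeric(m[, "x"])   # so column x is no longer numeric

# Extracting only the column of interest keeps its type.
x <- mixed$x
for (i in 2:length(x)) {
    x[i] <- x[i - 1]
}
mixed$x <- x
is.numeric(mixed$x)    # still numeric after the loop
```

With the matrix route, any arithmetic on column x inside the loop would be working on strings, which is why the vector route is the safer choice for mixed-type data frames.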
Bill Dunlap
TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Duncan Murdoch
Sent: Sunday, March 16, 2014 3:56 PM
To: Göran Broström; [email protected]
Subject: Re: [R] data frame vs. matrix

On 14-03-16 2:57 PM, Göran Broström wrote:

I have always known that "matrices are faster than data frames", for instance this function:

    dumkoll <- function(n = 1000, df = TRUE){
        dfr <- data.frame(x = rnorm(n), y = rnorm(n))
        if (df){
            for (i in 2:NROW(dfr)){
                if (!(i %% 100)) cat("i = ", i, "\n")
                dfr$x[i] <- dfr$x[i-1]
            }
        }else{
            dm <- as.matrix(dfr)
            for (i in 2:NROW(dm)){
                if (!(i %% 100)) cat("i = ", i, "\n")
                dm[i, 1] <- dm[i-1, 1]
            }
            dfr$x <- dm[, 1]
        }
    }

--------------------
    > system.time(dumkoll())
       user  system elapsed
      0.046   0.000   0.045
    > system.time(dumkoll(df = FALSE))
       user  system elapsed
      0.007   0.000   0.008
----------------------

OK, no big deal, but I stumbled over a data frame with one million records. Then, with df = TRUE,

----------------------------
         user    system   elapsed
    44677.141  1271.544 46016.754
----------------------------

This is around 12 hours. With df = FALSE, it took only six seconds! About 7500 times faster.

I was really surprised by the huge difference, and I wonder if this is to be expected, or if it is some peculiarity with my installation: I'm running Ubuntu 13.10 on a MacBook Pro with 8 GB memory, R-3.0.3.

I don't find it surprising. The line

    dfr$x[i] <- dfr$x[i-1]

will be executed about a million times. It does the following:

1. Get a pointer to the x element of dfr. This requires R to look through all the names of dfr to figure out which one is "x".
2. Extract the i-1 element from it. Not particularly slow.
3. Get a pointer to the x element of dfr again. (R doesn't cache these things.)
4. Set the i element of it to a new value. This could require the entire column or even the entire data frame to be copied, if R hasn't kept track of the fact that it is really being changed in place. In a complex assignment like that, I wouldn't be surprised if that took place. (In the matrix equivalent, it would be easier to recognize that it is safe to change the existing value.)

Luke Tierney is making some changes in R-devel that might help a lot in cases like this, but I expect the matrix code will always be faster.

Duncan Murdoch
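As a rough illustration of the steps listed above (a sketch of the language-level semantics, not of what R does internally for memory management): the R Language Definition describes a nested replacement such as dfr$x[i] <- value as being evaluated approximately as a call to `[<-` on the extracted column followed by a call to `$<-` on the data frame. The three-row data frame and the names a and b below are made up for the example.

```r
# Small made-up data frame, just for illustration.
dfr <- data.frame(x = c(10, 20, 30), y = c(1, 2, 3))
i <- 2

# The one-line form from the thread.
a <- dfr
a$x[i] <- a$x[i - 1]

# Approximately the same work written as explicit function calls:
#   b$x            looks up column "x"           (steps 1 and 3)
#   b$x[i - 1]     extracts the previous element (step 2)
#   `[<-`(...)     builds the modified column    (step 4)
#   `$<-`(...)     puts the column back into the data frame
b <- dfr
b <- `$<-`(b, "x", value = `[<-`(b$x, i, value = b$x[i - 1]))

identical(a, b)
```

Each pass through the loop repeats this whole sequence on the data frame, whereas the vector version only calls `[<-` on a plain numeric vector.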
______________________________________________ [email protected] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

