On Sat, 2006-08-19 at 10:25 -0600, Mike Nielsen wrote:
> Wow. New respect for parse/eval.
>
> Do you think this is a special case of a more general principle? I
> suppose the cost is memory, but from time to time a speedup like this
> would be very beneficial.
>
> Any hints about how R programmers could recognize such cases would, I
> am sure, be of value to the list in general.
>
> Many thanks for your efforts, Marc!

Mike,

I think that one needs to consider where the time is being spent and
then adjust accordingly. Once you understand that, you can develop some
insight into what may be a more efficient approach. R provides good
profiling tools that facilitate this process.

In this case, almost all of the time in the first two examples using
strsplit() is spent in that function:

> repeated.measures.columns <- paste(1:100000, collapse = ",")

> library(utils)
> Rprof(tmp <- tempfile())
> res1 <- as.numeric(unlist(strsplit(repeated.measures.columns, ",")))
> Rprof()

> summaryRprof(tmp)
$by.self
                    self.time self.pct total.time total.pct
"strsplit"              23.68     99.7      23.68      99.7
"as.double.default"      0.06      0.3       0.06       0.3
"as.numeric"             0.00      0.0      23.74     100.0
"unlist"                 0.00      0.0      23.68      99.7

$by.total
                    total.time total.pct self.time self.pct
"as.numeric"             23.74     100.0      0.00      0.0
"strsplit"               23.68      99.7     23.68     99.7
"unlist"                 23.68      99.7      0.00      0.0
"as.double.default"       0.06       0.3      0.06      0.3

$sampling.time
[1] 23.74

Contrast that with Prof. Ripley's approach:

> Rprof(tmp <- tempfile())
> res3 <- eval(parse(text=paste("c(", repeated.measures.columns, ")")))
> Rprof()

> summaryRprof(tmp)
$by.self
        self.time self.pct total.time total.pct
"parse"      0.42     87.5       0.42      87.5
"eval"       0.06     12.5       0.48     100.0

$by.total
        total.time total.pct self.time self.pct
"eval"        0.48     100.0      0.06     12.5
"parse"       0.42      87.5      0.42     87.5

$sampling.time
[1] 0.48

To some extent, one could argue that my initial timing examples are
contrived, in that they specifically demonstrate a worst-case scenario
for strsplit(). Real-world examples may or may not show such gains.

For example, in Charles' initial query, the initial vector was rather
short:

> repeated.measures.columns
[1] "3,6,10"

So if this were a one-time conversion, we would not see such
significant gains.
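
As a minimal sketch of that short-vector case, one could check that the
two approaches agree and then time them over repeated calls. The names
via.strsplit, via.parse and n.reps are made up here for illustration and
do not appear earlier in the thread, and the repetition count is an
arbitrary assumption:

```r
# Short input, as in Charles' original query.
short.input <- "3,6,10"

# Approach 1: split on commas, then convert the pieces to numeric.
via.strsplit <- function(x) as.numeric(unlist(strsplit(x, ",")))

# Approach 2: build the text "c( 3,6,10 )", then parse and evaluate it.
# Only use this on trusted input: eval() runs whatever code the string holds.
via.parse <- function(x) eval(parse(text = paste("c(", x, ")")))

res.a <- via.strsplit(short.input)
res.b <- via.parse(short.input)

# Both approaches should return the same three values.
all(res.a == res.b)

# A single call is too fast to time reliably, so repeat each one.
n.reps <- 1000
system.time(for (i in seq_len(n.reps)) via.strsplit(short.input))
system.time(for (i in seq_len(n.reps)) via.parse(short.input))
```

The timings will vary by machine and R version, so no figures are given
here.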

However, what if we had a long list of shorter entries:

> repeated.measures.columns <- paste(1:10, collapse = ",")
> repeated.measures.columns
[1] "1,2,3,4,5,6,7,8,9,10"

> big.list <- replicate(10000, list(repeated.measures.columns))

> head(big.list)
[[1]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[2]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[3]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[4]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[5]]
[1] "1,2,3,4,5,6,7,8,9,10"

[[6]]
[1] "1,2,3,4,5,6,7,8,9,10"

> system.time(res1 <- t(sapply(big.list, function(x) as.numeric(unlist(strsplit(x, ","))))))
[1] 1.972 0.044 2.411 0.000 0.000

> str(res1)
 num [1:10000, 1:10] 1 1 1 1 1 1 1 1 1 1 ...

> head(res1)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]    1    2    3    4    5    6    7    8    9    10
[3,]    1    2    3    4    5    6    7    8    9    10
[4,]    1    2    3    4    5    6    7    8    9    10
[5,]    1    2    3    4    5    6    7    8    9    10
[6,]    1    2    3    4    5    6    7    8    9    10

Now use Prof. Ripley's approach:

> system.time(res3 <- t(sapply(big.list, function(x) eval(parse(text=paste("c(", x, ")"))))))
[1] 1.676 0.012 1.877 0.000 0.000

> str(res3)
 num [1:10000, 1:10] 1 1 1 1 1 1 1 1 1 1 ...

> head(res3)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]    1    2    3    4    5    6    7    8    9    10
[3,]    1    2    3    4    5    6    7    8    9    10
[4,]    1    2    3    4    5    6    7    8    9    10
[5,]    1    2    3    4    5    6    7    8    9    10
[6,]    1    2    3    4    5    6    7    8    9    10

> all(res1 == res3)
[1] TRUE

Compared with the single long string, we see a notable reduction in
time with strsplit() but a notable increase in time with eval(parse()),
even though we are converting the same net number of values (100,000).
Much of the increase with eval(parse()) is of course due to the overhead
of sapply() and navigating the list.
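
One way to get a feel for that per-call overhead, sketched here under
the assumption that every list entry holds the same number of values, is
to join all the entries into one string and call parse() only once. The
names one.row, per.element, joined, flat and single.call are
illustrative and not part of the examples above:

```r
# Rebuild the list of 10,000 identical ten-value strings used above.
one.row <- paste(1:10, collapse = ",")
big.list <- replicate(10000, list(one.row))

# Per-element version: one parse()/eval() call for each list entry.
per.element <- t(sapply(big.list, function(x)
  eval(parse(text = paste("c(", x, ")")))))

# Single-call version: join every entry with commas, parse once, then
# reshape. This relies on all entries having the same length.
joined <- paste(unlist(big.list), collapse = ",")
flat <- eval(parse(text = paste("c(", joined, ")")))
single.call <- matrix(flat, nrow = length(big.list), byrow = TRUE)

# Both versions should produce the same 10000 x 10 matrix.
all(per.element == single.call)
```

Wrapping each version in system.time() would show how the two compare
on a given machine.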

Let's increase the size of the list components to 1000:

> repeated.measures.columns <- paste(1:1000, collapse = ",")
> big.list <- replicate(10000, list(repeated.measures.columns))

> system.time(res1 <- sapply(big.list, function(x) as.numeric(unlist(strsplit(x, ",")))))
[1] 33.270 0.744 37.163 0.000 0.000

> system.time(res3 <- t(sapply(big.list, function(x) eval(parse(text=paste("c(", x, ")"))))))
[1] 15.893 0.928 18.139 0.000 0.000

So we see here that as the size of the list components increases, there
continues to be an advantage to Prof. Ripley's approach over using
strsplit().

Again, one needs to understand where the processing time is spent by
profiling, and then consider how to introduce efficiencies. In some
cases that may well require compiled C/FORTRAN code if run times become
too long.

HTH,

Marc Schwartz

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.