>>>>> "Marcus" == Marcus G Daniels <[EMAIL PROTECTED]>
>>>>>     on Tue, 12 Dec 2006 09:05:15 -0700 writes:

    Marcus> Vladimir Dergachev wrote:
    >> Here is the second iteration of data frame subset patch.
    >> It now passes make check on both 2.4.0 and 2.5.0 (svn as
    >> of a few days ago). Same speedup as before.
    >>
    Marcus> Hi,

    Marcus> I was wondering if this patch would make it into the
    Marcus> next release. I don't see it in SVN, but it's hard
    Marcus> to be sure because the mailing list apparently
    Marcus> strips attachments. If it isn't in, or going to be
    Marcus> in, is this patch available somewhere else?

I was wondering too.
http://www.r-project.org/mail.html explains what kind of attachments
are allowed on R-devel.

I'm particularly interested, since during the last several days I've
made (somewhat experimental) changes to R-devel, which make some
dealings with large data frames that have "trivial rownames" (those
represented as 1:nrow(.)) much more efficient.

Notably, as.matrix() of such data frames now no longer produces huge
row names, and e.g. dim(.) of such data frames has become lightning
fast [compared to what it was].

Some measurements:

  N <- 1e6
  set.seed(1)
  ## we round (for later dump().. reasons)
  x <- round(rnorm(N),2)
  y <- round(rnorm(N),2)
  mOrig <- cbind(x = x, y = y)
  df <- data.frame(x = x, y = y)
  mNew <- as.matrix(df)
  (sizes <- sapply(list(mOrig=mOrig, df=df, mNew=mNew), object.size))
  ## R-2.4.0 (64-bit):
  ##    mOrig       df     mNew
  ## 16000520 16000776 72000560

  ## R-2.4.1 beta (32-bit):
  ##    mOrig       df     mNew
  ## 16000296 16000448 52000320

  ## R-pre-2.5.0 (32-bit):
  ##    mOrig       df     mNew
  ## 16000296 16000448 16000296

  ##------------------------------------

  N <- 1e6
  df <- data.frame(x = 0+ 1:N, y = 1+ 1:N)
  system.time(for(i in 1:1000) d <- dim(df))

  ## R-2.4.1 beta (32-bit) [deb1]:
  ## [1] 1.920 3.748 7.810 0.000 0.000

  ## R-pre-2.5.0 (32-bit) [deb1]:
  ##    user  system elapsed
  ##   0.012   0.000   0.011

--- --- --- --- --- --- --- --- --- ---

However, currently

  df[2,]  ## still internally produces the character(1e6) row names!

something I think we should eliminate as well, i.e., at least make
sure that only seq_len(1e6) is internally produced and not the
character vector.

Note however that some of these changes are backward incompatible.
I do hope that the changes gaining efficiency for such large data
frames are worth some adaptation of current/old R source code.

Feedback on this topic is very welcome!

Martin

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
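
A rough sketch of the row-name point made in the post (integer
seq_len(N) versus a character vector of the same length). It is not
part of the original measurements. It assumes a version of R that
provides .row_names_info(), which the post itself does not mention,
and it uses a smaller N than the 1e6 above so that it runs quickly.
The names int_names, chr_names and auto are made up for illustration.

```r
## Sketch only: compare the storage cost of integer vs. character
## row names of the same length.
N <- 1e5                                # smaller than the 1e6 in the post
int_names <- seq_len(N)                 # what "trivial" row names could be
chr_names <- as.character(int_names)    # what gets produced otherwise

sizes <- sapply(list(integer = int_names, character = chr_names),
                function(v) as.numeric(object.size(v)))
print(sizes)                            # exact numbers depend on platform

## A data frame created without explicit row names has "automatic"
## row names; .row_names_info() reports this with a negative count.
## (Assumption: .row_names_info() exists in the R version in use.)
df <- data.frame(x = 0 + 1:N, y = 1 + 1:N)
auto <- .row_names_info(df) < 0
print(auto)
```

The sizes are printed rather than quoted here because they differ
between 32-bit and 64-bit builds and between R versions, as the
measurements in the post already show.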