In fact, we already had a ticket on the tracker for this, so just updating this (and that) with this thread:
https://r-forge.r-project.org/tracker/?group_id=240&atid=975&func=detail&aid=2599

On Wed, Aug 14, 2013 at 10:18 AM, Steve Lianoglou <[email protected]> wrote:
> Hi Arun,
>
> Thanks for this very detailed analysis!
>
> The slowness of transform.data.table is something that's been bugging
> me for a while but have not had the time to dig into it myself yet, so
> this is really great.
>
> I quickly tried to apply your proposed fix and recompiled/reinstalled
> data.table. It looks like there are some errors that do pop up after
> running test.data.table(), but I *think* they are trivial -- I don't
> have time to investigate further right now, but will do so in due time
> if Matthew (or you :-) don't beat me to it.
>
> Thanks again,
> -steve
>
>
> On Wed, Aug 14, 2013 at 9:26 AM, Arunkumar Srinivasan
> <[email protected]> wrote:
>> Hello,
>>
>> This question comes from a recent SO question on Why is transform.data.table
>> so much slower than transform.data.frame?
>>
>> Suppose I have,
>>
>>     DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5))
>>
>> And I want to transform this data.table by adding an extra column z = 1 (I'm
>> aware of the idiomatic way of using :=, but let's keep that aside for the
>> moment), I'd do:
>>
>>     transform(DT, z = 1)
>>
>> However, this is terribly slow. I debugged the code and found out the reason
>> for this slowness. In short, transform.data.table calls:
>>
>>     ans <- do.call("data.table", c(list(`_data`), e[!matched]))
>>
>> which calls data.table() where the slowness happens here:
>>
>>     exptxt = as.character(tt) # <~~~~~~~~ SLOW when called with `do.call`!
>>
>> Now, the point is, exptxt is only used under one other if-statement, pasted
>> below.
>>
>>     if (any(novname) && length(exptxt)==length(vnames)) {
>>         okexptxt = exptxt[novname] == make.names(exptxt[novname])
>>         vnames[novname][okexptxt] = exptxt[novname][okexptxt]
>>     }
>>     tt = vnames==""
>>
>> And this statement is basically useful, for example, if one does:
>>
>>     x <- 1:5
>>     y <- 6:10
>>     DT <- data.table(x, y)
>>        x  y
>>     1: 1  6
>>     2: 2  7
>>     3: 3  8
>>     4: 4  9
>>     5: 5 10
>>
>> This gives a data.table with column names the same as input variables
>> instead of giving V1 and V2.
>>
>> But this is what slows down the do.call("data.table", ...) call. For
>> example,
>>
>>     ll <- list(data.table(x=runif(1e5), y=runif(1e5)), z=runif(1e5), w=1)
>>     system.time(do.call("data.table", ll)) # 30 seconds on my mac
>>
>> But, this exptxt <- as.character(tt) and the above mentioned if-statement
>> can be replaced with (with help from data.frame function):
>>
>>     for (i in which(novname)) {
>>         tmp <- deparse(tt[[i]])
>>         if (tmp == make.names(tmp))
>>             vnames[i] <- tmp
>>     }
>>
>> With this replacement, running do.call("data.table", ...) takes 0.04
>> seconds. Also, data.table(x, y) gives the intended result with column names x
>> and y.
>>
>> In essence, by replacing the above mentioned lines, the desired function of
>> data.table remains unchanged while do.call("data.table", ...) is faster (and
>> hence transform and other functions that depend on it).
>>
>> What do you think? To my knowledge, this doesn't seem to break anything
>> else...
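
A small base-R sketch of what seems to be going on, in case it helps anyone following the ticket. It assumes that data.table() captures its arguments with something like as.list(substitute(list(...)))[-1L]; capture_args and infer_names below are hypothetical stand-ins written for illustration, not the package's code. Taking only the first deparsed line (nlines = 1L) is also an added assumption, so that a long expression still gives a single string to compare.

```r
# Hypothetical stand-in for the argument capture inside data.table();
# not the package's actual code.
capture_args <- function(...) as.list(substitute(list(...)))[-1L]

# Name inference in the style of the proposed loop. Using only the first
# deparsed line is an assumption added here, so that `tmp` always has
# length one when it is compared with make.names(tmp).
infer_names <- function(tt, vnames = character(length(tt))) {
  novname <- vnames == ""
  for (i in which(novname)) {
    tmp <- deparse(tt[[i]], nlines = 1L)[1L]
    if (tmp == make.names(tmp)) vnames[i] <- tmp
  }
  vnames
}

x <- 1:5
y <- 6:10

# Direct call: each captured argument is a symbol, so names are recovered.
direct <- capture_args(x, y)
direct_names <- infer_names(direct)

# Via do.call: each captured argument is the evaluated vector itself, so
# there is no symbol to recover and the names stay empty.
via_do_call <- do.call(capture_args, list(x, y))
do_call_names <- infer_names(via_do_call)

print(direct_names)
print(do_call_names)
```

With do.call the captured arguments are the full objects rather than symbols, which would explain why converting all of them to text up front gets expensive for large inputs, while a per-argument check only touches the unnamed ones.
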
>>
>> Arun
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> [email protected]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech
