Hi Arun,

Thanks for this very detailed analysis!

The slowness of transform.data.table is something that's been bugging me
for a while, but I have not had the time to dig into it myself yet, so
this is really great.

I quickly tried to apply your proposed fix and recompiled/reinstalled
data.table. Some errors do pop up after running test.data.table(), but I
*think* they are trivial. I don't have time to investigate further right
now, but will do so in due time if Matthew (or you :-) don't beat me to it.

Thanks again,
-steve

On Wed, Aug 14, 2013 at 9:26 AM, Arunkumar Srinivasan <[email protected]> wrote:
> Hello,
>
> This question comes from a recent SO question on Why is transform.data.table
> so much slower than transform.data.frame?
>
> Suppose I've,
>
> DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5))
>
> And I want to transform this data.table by adding an extra column z = 1 (I'm
> aware of the idiomatic way of using :=, but let's keep that aside for the
> moment), I'd do:
>
> transform(DT, z = 1)
>
> However, this is terribly slow. I debugged the code and found out the reason
> for this slowness. To gist the issue, transform.data.table calls:
>
> ans <- do.call("data.table", c(list(`_data`), e[!matched]))
>
> which calls data.table() where, the slowness happens here:
>
> exptxt = as.character(tt) # <~~~~~~~~ SLOW when called with `do.call`!
>
> Now, the point is, exptxt is only used under one other if-statement, pasted
> below.
>
> if (any(novname) && length(exptxt)==length(vnames)) {
>     okexptxt = exptxt[novname] == make.names(exptxt[novname])
>     vnames[novname][okexptxt] = exptxt[novname][okexptxt]
> }
> tt = vnames==""
>
> And this statement is basically useful, for example, if one does:
>
> x <- 1:5
> y <- 6:10
> DT <- data.table(x, y)
>    x  y
> 1: 1  6
> 2: 2  7
> 3: 3  8
> 4: 4  9
> 5: 5 10
>
> This gives a data.table with column names the same as input variables
> instead of giving V1 and V2.
>
> But, this is what is slowing down do.call("data.table", ...) function. For
> example,
>
> ll <- list(data.table(x=runif(1e5), y=runif(1e5)), z=runif(1e5), w=1)
> system.time(do.call("data.table", ll)) # 30 seconds on my mac
>
> But, this exptxt <- as.character(tt) and the above mentioned if-statement
> can be replaced with (with help from data.frame function):
>
> for (i in which(novname)) {
>     tmp <- deparse(tt[[i]])
>     if (tmp == make.names(tmp))
>         vnames[i] <- tmp
> }
>
> And by replacing with this and running do.call("data.table", ...) takes 0.04
> seconds. Also, data.table(x,y) gives the intended result with column names x
> and y.
>
> In essence, by replacing the above mentioned lines, the desired function of
> data.table remains unchanged while do.call("data.table", ...) is faster (and
> hence transform and other functions that depend on it).
>
> What do you think? To my knowledge, this doesn't seem to break anything
> else...
>
> Arun
>
>
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech
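
P.S. For anyone reading this in the archive, here is a minimal base-R sketch of the name-recovery step discussed above. It is not data.table's actual code: the helper name guess_names is hypothetical, and the is.symbol() guard is an addition of this sketch rather than part of the quoted loop. The idea is that do.call() builds the call from already-evaluated values, so an unnamed argument arrives as a whole inlined object instead of a symbol; checking for a symbol first means such an object never needs to be turned into text.

```r
# Hypothetical helper (not part of data.table): recover names for unnamed
# arguments from the expressions the caller typed, as data.table(x, y) does.
guess_names <- function(...) {
  # Unevaluated argument expressions; drop the leading `list` symbol.
  tt <- as.list(substitute(list(...)))[-1L]
  vnames <- names(tt)
  if (is.null(vnames)) vnames <- rep.int("", length(tt))
  novname <- vnames == ""
  for (i in which(novname)) {
    # Assumption of this sketch: only plain symbols are considered, so a
    # large object inlined by do.call() is skipped without being deparsed.
    if (is.symbol(tt[[i]])) {
      tmp <- deparse(tt[[i]])
      if (tmp == make.names(tmp)) vnames[i] <- tmp
    }
  }
  vnames
}

x <- 1:5
y <- 6:10
guess_names(x, y)                        # names taken from the symbols x and y
do.call("guess_names", list(x, z = y))   # first argument is an inlined value, so no name
```

Arguments left without a name here would then fall back to the default V1, V2, ... naming mentioned in the quoted message.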
