Hello,

This question follows from a recent SO question, "Why is transform.data.table 
so much slower than transform.data.frame?" 
(http://stackoverflow.com/questions/18216658/why-is-transform-data-table-so-much-slower-than-transform-data-frame)


Suppose I have:

DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5)) 

And I want to transform this data.table by adding an extra column z = 1 (I'm 
aware of the idiomatic way of using :=, but let's set that aside for the 
moment). I'd do:

transform(DT, z = 1) 

However, this is terribly slow. I debugged the code and found the reason for 
the slowness. In short, transform.data.table calls:

ans <- do.call("data.table", c(list(`_data`), e[!matched])) 

which calls data.table(), where the slowness occurs at this line:

exptxt = as.character(tt) # <~~~~~~~~ SLOW when called with `do.call`! 

Now, the point is that exptxt is only used in one other place, the 
if-statement pasted below.

if (any(novname) && length(exptxt)==length(vnames)) {
    okexptxt = exptxt[novname] == make.names(exptxt[novname])
    vnames[novname][okexptxt] = exptxt[novname][okexptxt]
}
tt = vnames==""

This statement is useful when, for example, one does:

x <- 1:5
y <- 6:10
DT <- data.table(x, y)
   x  y
1: 1  6
2: 2  7
3: 3  8
4: 4  9
5: 5 10

This gives a data.table with column names the same as input variables instead 
of giving V1 and V2.


But this is exactly what slows down a do.call("data.table", ...) call. For 
example,

ll <- list(data.table(x=runif(1e5), y=runif(1e5)), z=runif(1e5), w=1) 
system.time(do.call("data.table", ll)) # 30 seconds on my mac 
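
A base-R sketch of why the two call styles can differ so much, assuming
data.table() captures its arguments with something like
as.list(substitute(list(...)))[-1L] (my reading of the code, not quoted
above). The helper inspect_args is hypothetical and not part of data.table.
With a direct call the captured argument is a symbol; with do.call it is the
already-evaluated vector, so as.character has to deparse every element:

```r
# Hypothetical helper: reports what each captured argument looks like.
inspect_args <- function(...) {
    # Capture the unevaluated arguments (assumed to mirror data.table()).
    tt <- as.list(substitute(list(...)))[-1L]
    list(
        class = unname(vapply(tt, function(e) class(e)[1L], character(1))),
        nchar = unname(nchar(as.character(tt)))  # length of the deparsed text
    )
}

x <- runif(1000)

direct   <- inspect_args(x)                     # argument arrives as the symbol `x`
via_call <- do.call("inspect_args", list(x))    # argument arrives as the numeric vector

print(direct$class)      # class of the captured argument in a direct call
print(via_call$class)    # class of the captured argument under do.call
print(c(direct$nchar, via_call$nchar))
```

The deparsed text grows with the length of the vector under do.call, which is
consistent with the timing above, though I have not profiled data.table()
itself this way.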

However, the exptxt <- as.character(tt) line and the above-mentioned 
if-statement can be replaced with the following (borrowed from the data.frame 
function):

for (i in which(novname)) {
    tmp <- deparse(tt[[i]])
    if (tmp == make.names(tmp))
        vnames[i] <- tmp
}

With this replacement, running do.call("data.table", ...) takes 0.04 seconds. 
Also, data.table(x, y) still gives the intended result, with column names x 
and y.
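
To illustrate the naming behaviour in isolation, here is a small standalone
sketch of the same loop in base R. The function infer_names is hypothetical
(it is not data.table code), and it assumes the arguments are captured with
as.list(substitute(list(...)))[-1L]. Taking only the first deparsed string is
my own guard, since deparse() can return several strings for a long
expression:

```r
# Hypothetical stand-in for the name-inference step only.
infer_names <- function(...) {
    tt <- as.list(substitute(list(...)))[-1L]      # unevaluated arguments
    vnames <- names(tt)
    if (is.null(vnames)) vnames <- rep.int("", length(tt))
    novname <- vnames == ""
    for (i in which(novname)) {
        tmp <- deparse(tt[[i]])[1L]                # first line of the deparsed text
        if (tmp == make.names(tmp))                # keep only syntactically valid names
            vnames[i] <- tmp
    }
    vnames
}

x <- 1:5
y <- 6:10

print(infer_names(x, y))                       # names taken from the symbols
print(infer_names(x, w = 1))                   # explicit names are kept
print(infer_names(x + 1, y))                   # an expression is left unnamed
print(do.call(infer_names, list(x, z = y)))    # evaluated values are left unnamed
```

Arguments left as "" here would presumably fall through to the default V1, V2,
... naming in data.table(), as they do today.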


In essence, with the above-mentioned lines replaced, the behaviour of 
data.table remains unchanged while do.call("data.table", ...) becomes faster 
(and hence so do transform and other functions that depend on it).


What do you think? To my knowledge, this doesn't seem to break anything else...


Arun

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
