[datatable-help] FR #2722 testing

Arunkumar Srinivasan Tue, 18 Mar 2014 19:01:30 -0700

Hi everybody,

FR #2722 has now been implemented and committed. It'd be great if people 
who're used to running devel versions could test it and let us know whether 
everything works as expected.


Here's an explanation of what the FR is and what's being optimised: 
Assuming a data.table with 4 columns x,y,z,grp, something like:

DT[, c(sum(y), lapply(.SD, sum), .N, .I, lapply(.SD, mean)), by=grp]

will usually be quite slow because of the overhead of using eval with lapply. 
This will now be optimised to:

DT[, list(sum(y), sum(x), sum(y), sum(z), .N, .I, mean(x), mean(y), mean(z)), 
by=grp]
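
To try the rewrite on a small case, the sketch below compares the c(...) form 
with a hand-expanded list(...) form. The toy data and the names ans_c and 
ans_list are made up for illustration, and .I is left out to keep one row per 
group. Column names are dropped before comparing, since the two forms may name 
their columns differently.

```r
library(data.table)

# Toy data (hypothetical): 3 groups of 4 rows, integer columns x, y, z.
DT <- data.table(grp = rep(1:3, each = 4L), x = 1:12, y = 12:1, z = rep(1:2, 6L))

# The form FR #2722 targets: c(...) mixing plain calls and lapply(.SD, fun).
ans_c <- DT[, c(sum(y), lapply(.SD, sum), .N), by = grp]

# Hand-expanded equivalent, one call per column.
ans_list <- DT[, list(sum(y), sum(x), sum(y), sum(z), .N), by = grp]

# Compare values only, ignoring column names.
same <- isTRUE(all.equal(unname(as.list(ans_c)), unname(as.list(ans_list))))
print(same)
```
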
However, we don't optimise if .SD appears inside c(.) in j in any form other 
than lapply(.SD, fun), because there are quite a few possibilities with .SD:

DT[, c(.SD, .SD[1], .SD+a, .SD[x>1], .SD[J(.)], .SD[.(.)], lapply(.SD, sum)), 
by=grp]
Also, consider the case .SD[sample(.N, 1)]: this obviously can't be optimised 
to list(x=x[sample(.)], y=y[sample(.)], z=z[sample(.)]), because each column 
would then be sampled independently. So the expression inside .SD[...] has to 
be evaluated first and checked for its type (logical, numeric, integer, 
data.table?), and only then can the call be optimised accordingly.
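
A minimal sketch of why the per-column expansion is wrong for the sample case. 
The toy data is an assumption made for illustration: x, y and z are identical 
in every row, so any whole row drawn by .SD[sample(.N, 1)] must have 
x == y == z, while independent per-column draws need not.

```r
library(data.table)
set.seed(42L)

# Toy data (hypothetical): x, y and z hold the same value in every row.
DT <- data.table(grp = rep(1:3, each = 4L), x = 1:12, y = 1:12, z = 1:12)

# One index is drawn per group and applied to all columns at once,
# so each result row is an intact row of DT.
rowwise <- DT[, .SD[sample(.N, 1L)], by = grp]
print(rowwise[, all(x == y & y == z)])

# Naive expansion: every column draws its own index, so values from
# different rows can end up side by side in the result.
colwise <- DT[, list(x = x[sample(.N, 1L)],
                     y = y[sample(.N, 1L)],
                     z = z[sample(.N, 1L)]), by = grp]
print(colwise)
```
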

Therefore, this'll be postponed, and done only if it turns out to be possible 
in a clear way. That said, we've not come across such a case here on the 
mailing list or on SO yet. I'm therefore assuming it's a very rare case, which 
is good.

Summary: The most common cases should therefore be very fast. Here's a 
benchmark comparing the timings with and without optimisation:

require(data.table)
set.seed(1L)
dt <- data.table(x=rep(1:1e6, each=10), y=sample(10), z=sample(2))

options(datatable.verbose=TRUE) # not pasting verbose messages here.

# without optimisation
options(datatable.optimize=0L)
system.time(ans1 <- dt[, c(bla = sum(y), lapply(.SD, mean)), by=x])
#   user  system elapsed 
# 90.705   5.184 121.274 

# with optimisation
options(datatable.optimize=Inf)
system.time(ans2 <- dt[, c(bla = sum(y), lapply(.SD, mean)), by=x])
#   user  system elapsed 
#  0.450   0.128   0.690 
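
For anyone testing, a scaled-down sketch of the same comparison that checks 
the two settings return the same answer rather than timing them. The size 
n_groups is an arbitrary choice for illustration, and rep_len() is used so the 
recycling of y and z is explicit.

```r
library(data.table)
set.seed(1L)

n_groups <- 1000L   # hypothetical size, far smaller than the benchmark data
n <- n_groups * 10L
dt <- data.table(x = rep(seq_len(n_groups), each = 10L),
                 y = rep_len(sample(10), n),
                 z = rep_len(sample(2), n))

old <- options(datatable.optimize = 0L)   # optimisation off
ans1 <- dt[, c(bla = sum(y), lapply(.SD, mean)), by = x]

options(datatable.optimize = Inf)         # optimisation on
ans2 <- dt[, c(bla = sum(y), lapply(.SD, mean)), by = x]

options(old)                              # restore the previous setting

# Both runs should agree, whichever code path was taken.
print(all.equal(ans1, ans2))
```
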
Note that the case DT[, c(sum(y), lapply(.SD, sum)), by=grp, .SDcols=..] is 
still not implemented (FR #5222), so the optimised version will also result in 
an "object not found" error there. Once that FR is taken care of, the 
optimisation will automatically work for this case as well.



Arun

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
