Hi everybody,
I think a faster version of the `CJ` function is possible. The issue
currently is that the sort is done at the very end with `setkey`, which
works on the data *after* all the combinations have been generated, and
therefore sorts a huge number of entries.
However, a faster way would be to sort each input first (before working out
any combinations) and then use the hack:
setattr(l, 'sorted', names(l))
Basically, only 2 lines need to change (see the bottom of this post).
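To illustrate why sorting first should be enough, here is a minimal base-R sketch. It does not use data.table at all, and `cross_join_sketch` is a hypothetical helper written only for this illustration, not the package's code. It builds the combinations from already-sorted inputs with the same `rep(..., each = , length.out = )` pattern and then checks that the rows come out in key order, so re-sorting them changes nothing.

```r
# Hypothetical helper for illustration only; this is not data.table's CJ.
cross_join_sketch <- function(...) {
  l <- lapply(list(...), sort, na.last = TRUE)  # sort each input first
  n <- vapply(l, length, integer(1))
  nrow <- prod(n)
  # Consecutive repeats of each value in column i: the product of the
  # lengths of all columns to its right (1 for the last column).
  each <- rev(cumprod(c(1L, rev(n)[-length(n)])))
  for (i in seq_along(l)) {
    l[[i]] <- rep(l[[i]], each = each[i], length.out = nrow)
  }
  names(l) <- paste0("V", seq_along(l))
  as.data.frame(l, stringsAsFactors = FALSE)
}

set.seed(1)
a <- sample(10, 4)
b <- sample(letters, 3)
res <- cross_join_sketch(a, b)

# Ordering by (V1, V2) returns the rows in their existing order,
# i.e. the result is already sorted.
ord <- order(res$V1, res$V2)
stopifnot(identical(ord, seq_len(nrow(res))))
```

This only shows that the ordering holds for the sketch; whether setting the `sorted` attribute directly is safe inside data.table is the assumption the proposal rests on.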
---------
First, here are some benchmarks of `CJ_fast` (see below) against `CJ` on
relatively large data:
w <- sample(1e4, 1e3)
x <- sample(letters, 12)
y <- sample(letters, 12)
z <- sample(letters, 12)
system.time(t1 <- do.call(CJ, list(w,x,y,z)))
user system elapsed
0.775 0.052 0.835
system.time(t2 <- do.call(CJ_fast, list(w,x,y,z)))
user system elapsed
0.220 0.001 0.221
identical(t1, t2)
[1] TRUE
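For a sense of scale, the sizes in this benchmark work out as follows. This is plain arithmetic in base R; the variable names are mine and are not part of either function.

```r
# Lengths of the four inputs used in the benchmark above.
n <- c(w = 1e3, x = 12, y = 12, z = 12)

rows_in_result   <- prod(n)  # rows setkey has to sort after the join
values_presorted <- sum(n)   # values sorted when each input is sorted first

rows_in_result    # 1728000
values_presorted  # 1036
```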
---------
The function (there are only two changes):
CJ_fast <- function (...)
{
    l = list(...)
    if (length(l) > 1) {
        n = sapply(l, length)
        nrow = prod(n)
        x = c(rev(data.table:::take(cumprod(rev(n)))), 1L)
        # 1) SORT HERE
        for (i in seq(along = x)) l[[i]] = rep(sort(l[[i]], na.last = TRUE),
                                               each = x[i],
                                               length = nrow)
    }
    setattr(l, "row.names", .set_row_names(length(l[[1]])))
    setattr(l, "class", c("data.table", "data.frame"))
    vnames = names(l)
    if (is.null(vnames))
        vnames = rep("", length(l))
    tt = vnames == ""
    if (any(tt)) {
        vnames[tt] = paste("V", which(tt), sep = "")
        setattr(l, "names", vnames)
    }
    data.table:::settruelength(l, 0L)
    l = alloc.col(l)
    # 2) REPLACE SETKEY WITH ATTRIBUTE "SORTED"
    setattr(l, 'sorted', names(l))
    l
}
Arun