Nice example. Yes this is the way to use it and I agree more
readable. But I fear it isn't actually working as you expected. Each
component of `:=` doesn't see previous results, yet (not yet
implemented). Easier to see that in a simple example :
> DT = data.table(a=1:3,b=1:6)
> DT
a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
> DT[,`:=`(b=1L, d=sum(b)), by=a]
> DT
a b d
1: 1 1 5 # all the RHS got evaluated first, before starting to
assign the results.
2: 2 1 7
3: 3 1 9
4: 1 1 5
5: 2 1 7
6: 3 1 9
>
To get the result you want, you currently have to add an extra `<-`.
Like this :
> DT = data.table(a=1:3,b=1:6) # start fresh
> DT
a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
> DT[,`:=`(b=b<-1L, d=sum(b)), by=a] # extra b<-
> DT
a b d
1: 1 1 1
2: 2 1 1
3: 3 1 1
4: 1 1 1
5: 2 1 1
6: 3 1 1
>
Clearly in your example, since you're using earlier columns in later
ones, that becomes onerous and bug prone due to typos, but shouldn't
slow it down :
pre.coupleDT <- function(serostates, sexually.active) {
serostates[sexually.active , `:=`(
s.. = s.. <- s.. * (1-p.m.bef) * (1-p.f.bef),
mb.a1 = mb.a1 <- s.. * p.m.bef * (1-p.f.bef),
mb.a2 = mb.a2 <- mb.a1 * (1 - p.f.bef),
mb. = mb. <- mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
f.ba1 = f.bal <- s.. * p.f.bef * (1-p.m.bef),
f.ba2 = f.ba2 <- f.ba1 * (1 - p.m.bef),
f.b = f.b <- f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
hb1b2 = hb1b2 <- hb1b2 + .5 * s.. * p.m.bef * p.f.bef +
(mb.a1 + mb.a2 + mb.) * p.f.bef,
hb2b1 = hb2b1 + .5 * s.. * p.m.bef * p.f.bef +
(f.ba1 + f.ba2 + f.b) * p.m.bef)
]
return(serostates)
}
It's on the list to change it to the way you expected, and we all
want that. It involves a change quite deep down in the C code so
isn't done yet, although there's nothing particularly hard about it.
In terms of why data.table is faster here, consider the repeated :
temp[,'s..']
The `[` there is a function call; is.function(`[`)==TRUE. And each
time the 's..' string appears, it looks up which column number
corresponds to that name. There are 28 calls in your matrix version.
It isn't so much matrix vs data.table, more the access method. In the
data.table version, once you're inside scope, it's just symbol lookup
(the 28 calls to `[` are gone, as are the 28 lookups of 'colname').
There may be some copies going on as well; e.g.
serostates[sexually.active,] <- temp. Run both through Rprof() and
it might reveal more.
I can't think of a better way to use data.table. But note that the
benchmark is pretty meaningless. It's being looped 100 times
presumably because one run is so quick. This is quite a bug bear when
we see this done online. The only way to scale up, is to increase the
data size, perhaps by 100 times in this example. Then a single run
takes a measurable amount of time (say 10 seconds or more) and the
industry rule of thumb is to report the minimum of three consecutive
runs. The inferences are usually very different than when you repeat a
tiny test many times. The data has to be much much bigger than L2/L3
cache (typically 8MB but varies widely), e.g. 1GB or more. This
matrix is just 6MB and likely fits entirely in cache, depending on how
big your cache is (see output of lscpu on unix/mac, or system info on
Windows). Unless of course the nature of the task is to iterate, in
which case the overhead of the `[` call can become significant, and is
why we added set() as a loopable `:=`.
HTH
Matt
On 09/07/14 16:30, Steve Bellan wrote:
I'm trying to optimize the speed of a script that iteratively updates
state variables for several thousands of individuals through time
though only some individuals are active at each point in time. I had
been doing this with matrices but was wondering how it compared with
data.table since the latter seems to be more readable. I'm finding
that my data.table implementation is about 2-3 times faster, which
seems surprising since I thought matrices should be faster. It makes
me wonder if there are ways to speed up either implementation. Any
help is much appreciated! Here's an example of the code:
n <- 10^5
k <- 9
serostates <- matrix(0,n,k)
serostates <- as.data.table(serostates)
setnames(serostates, 1:k, c('s..', 'mb.a1', 'mb.a2', 'mb.', 'f.ba1',
'f.ba2', 'f.b', 'hb1b2', 'hb2b1'))
serostates[, `:=`(s.. = 1)]
serostates
serostatesMat <- as.matrix(serostates)
pre.coupleDT <- function(serostates, sexually.active) {
serostates[sexually.active , `:=`(
s.. = s.. * (1-p.m.bef) * (1-p.f.bef),
mb.a1 = s.. * p.m.bef * (1-p.f.bef),
mb.a2 = mb.a1 * (1 - p.f.bef),
mb. = mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
f.ba1 = s.. * p.f.bef * (1-p.m.bef),
f.ba2 = f.ba1 * (1 - p.m.bef),
f.b = f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
hb1b2 = hb1b2 + .5 * s.. * p.m.bef * p.f.bef + (mb.a1 +
mb.a2 + mb.) * p.f.bef,
hb2b1 = hb2b1 + .5 * s.. * p.m.bef * p.f.bef + (f.ba1 +
f.ba2 + f.b) * p.m.bef)
]
return(serostates)
}
pre.coupleMat <- function(serostates, sexually.active) {
temp <- serostates[sexually.active,]
temp[,'s..'] = temp[,'s..'] * (1-p.m.bef) * (1-p.f.bef)
temp[,'mb.a1'] = temp[,'s..'] * p.m.bef * (1-p.f.bef)
temp[,'mb.a2'] = temp[,'mb.a1'] * (1 - p.f.bef)
temp[,'mb.'] = temp[,'mb.a2'] * (1 - p.f.bef) + temp[,'mb.'] *
(1 - p.f.bef)
temp[,'f.ba1'] = temp[,'s..'] * p.f.bef * (1-p.m.bef)
temp[,'f.ba2'] = temp[,'f.ba1'] * (1 - p.m.bef)
temp[,'f.b'] = temp[,'f.ba2'] * (1 - p.m.bef) + temp[,'f.b'] *
(1 - p.m.bef)
temp[,'hb1b2'] = temp[,'hb1b2'] + .5 * temp[,'s..'] * p.m.bef
* p.f.bef + (temp[,'mb.a1'] + temp[,'mb.a2'] + temp[,'mb.']) * p.f.bef
temp[,'hb2b1'] = temp[,'hb2b1'] + .5 * temp[,'s..'] * p.m.bef
* p.f.bef + (temp[,'f.ba1'] + temp[,'f.ba2'] + temp[,'f.b']) * p.m.bef
serostates[sexually.active,] <- temp
return(serostates)
}
sexually.active <- rbinom(n, 1,.5)==1
p.m.bef <- .5
p.f.bef <- .8
system.time(
for(ii in 1:100) {
serostates <- pre.couple(serostates, sexually.active)
}
) ## about 2.25 seconds
system.time(
for(ii in 1:100) {
serostatesMat <- pre.coupleMat(serostatesMat, sexually.active)
}
) ## about 6 seconds
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help