Oops, that highlights that adding <- isn't quite the same once recycling comes into it. In your case each RHS returns a vector as long as the input, so adding <- should be ok. But in my example the first RHS was a single 1L, which was assigned to the symbol b before recycling, so sum(b) saw that scalar and returned 1 rather than 2.
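
One workaround in the meantime (just a sketch, I haven't run it) is to make that RHS the full group length yourself with rep() and .N, so the symbol the later components see is the same as what ends up assigned :

    DT[, `:=`(b = b <- rep(1L, .N),   # group-length vector, no recycling involved
              d = sum(b)),            # sees the new b, so 2 per group
       by = a]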

Ok, the iterative RHS is more pressing than I thought, then. Thanks for highlighting it.

Matt

On 09/07/14 17:52, Matt Dowle wrote:

Nice example. Yes, this is the way to use it and I agree it's more readable. But I fear it isn't actually working as you expected. Each component of `:=` doesn't see the results of the previous components yet (not yet implemented). It's easier to see that in a simple example :

> DT = data.table(a=1:3,b=1:6)
> DT
   a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
> DT[,`:=`(b=1L, d=sum(b)), by=a]
> DT
   a b d
1: 1 1 5 # all the RHS got evaluated first, before starting to assign the results.
2: 2 1 7
3: 3 1 9
4: 1 1 5
5: 2 1 7
6: 3 1 9
>

To get the result you want, you currently have to add an extra `<-`. Like this :

> DT = data.table(a=1:3,b=1:6)   # start fresh
> DT
   a b
1: 1 1
2: 2 2
3: 3 3
4: 1 4
5: 2 5
6: 3 6
> DT[,`:=`(b=b<-1L, d=sum(b)), by=a]   # extra b<-
> DT
   a b d
1: 1 1 1
2: 2 1 1
3: 3 1 1
4: 1 1 1
5: 2 1 1
6: 3 1 1
>

Clearly in your example, since you're using earlier columns in later ones, that becomes onerous and bug-prone due to typos, but it shouldn't slow anything down :

pre.coupleDT <- function(serostates, sexually.active) {
    serostates[sexually.active , `:=`(
        s..   = s..   <- s.. * (1-p.m.bef) * (1-p.f.bef),
        mb.a1 = mb.a1 <- s.. * p.m.bef * (1-p.f.bef),
        mb.a2 = mb.a2 <- mb.a1 * (1 - p.f.bef),
        mb.   = mb.   <- mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
        f.ba1 = f.ba1 <- s.. * p.f.bef * (1-p.m.bef),
        f.ba2 = f.ba2 <- f.ba1 * (1 - p.m.bef),
        f.b   = f.b   <- f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
        hb1b2 = hb1b2 <- hb1b2 + .5 * s.. * p.m.bef * p.f.bef + (mb.a1 + mb.a2 + mb.) * p.f.bef,
        hb2b1 = hb2b1 + .5 * s.. * p.m.bef * p.f.bef + (f.ba1 + f.ba2 + f.b) * p.m.bef)
    ]
    return(serostates)
}


It's on the list to change it to the way you expected, and we all want that. It involves a change quite deep down in the C code so isn't done yet, although there's nothing particularly hard about it.

In terms of why data.table is faster here, consider the repeated :

    temp[,'s..']

The `[` there is a function call; is.function(`[`)==TRUE. And each time the 's..' string appears, it looks up which column number corresponds to that name. There are 28 calls in your matrix version. It isn't so much matrix vs data.table, more the access method. In the data.table version, once you're inside scope, it's just symbol lookup (the 28 calls to `[` are gone, as are the 28 lookups of 'colname').
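
For example (a sketch just to illustrate the point, not something I've timed), the matrix version could mimic that by pulling each column out into a local vector once, doing the arithmetic on plain symbols, and writing back at the end :

    s..   <- temp[, 's..']     # one name lookup per column, up front
    mb.a1 <- temp[, 'mb.a1']
    # ... and so on for the other columns
    s..   <- s.. * (1 - p.m.bef) * (1 - p.f.bef)
    mb.a1 <- s.. * p.m.bef * (1 - p.f.bef)
    # ... the rest of the arithmetic on plain vectors
    temp[, 's..']   <- s..     # one write per column at the end
    temp[, 'mb.a1'] <- mb.a1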

There may be some copies going on as well; e.g. serostates[sexually.active,] <- temp. Running both through Rprof() might reveal more.
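
Something along these lines (a sketch; the file name is arbitrary) :

    Rprof("prof.out")
    for (ii in 1:100) serostatesMat <- pre.coupleMat(serostatesMat, sexually.active)
    Rprof(NULL)
    summaryRprof("prof.out")$by.self   # shows where the time goes, e.g. in `[` or `[<-`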

I can't think of a better way to use data.table. But note that the benchmark is pretty meaningless. It's being looped 100 times, presumably because one run is so quick; that's quite a bugbear when we see it done online. The only way to scale up is to increase the data size, perhaps by 100 times in this example. Then a single run takes a measurable amount of time (say 10 seconds or more), and the industry rule of thumb is to report the minimum of three consecutive runs. The inferences are usually very different from those you get by repeating a tiny test many times. The data has to be much, much bigger than L2/L3 cache (typically 8MB, but it varies widely), e.g. 1GB or more; this matrix is just 6MB and likely fits entirely in cache, depending on how big your cache is (see the output of lscpu on unix/mac, or system info on Windows). Unless of course the nature of the task is to iterate, in which case the overhead of the `[` call can become significant, which is why we added set() as a loopable `:=`.
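
For reference, set() takes the rows, the column and the value directly, so the s.. update from your example could be looped like this (a sketch, untested) :

    rows <- which(sexually.active)
    for (ii in 1:100) {
        # same update as the s.. line in pre.coupleDT, but with no `[.data.table` overhead per call
        set(serostates, i = rows, j = "s..",
            value = serostates[["s.."]][rows] * (1 - p.m.bef) * (1 - p.f.bef))
    }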

HTH
Matt


On 09/07/14 16:30, Steve Bellan wrote:
I'm trying to optimize the speed of a script that iteratively updates state variables for several thousand individuals through time, though only some individuals are active at each time point. I had been doing this with matrices, but I wondered how it compared with data.table since the latter seems more readable. I'm finding that my data.table implementation is about 2-3 times faster, which seems surprising since I thought matrices should be faster. It makes me wonder if there are ways to speed up either implementation. Any help is much appreciated! Here's an example of the code:


n <- 10^5
k <- 9
serostates <- matrix(0,n,k)
serostates <- as.data.table(serostates)
setnames(serostates, 1:k, c('s..', 'mb.a1', 'mb.a2', 'mb.', 'f.ba1', 'f.ba2', 'f.b', 'hb1b2', 'hb2b1'))
serostates[, `:=`(s.. = 1)]
serostates
serostatesMat <- as.matrix(serostates)

pre.coupleDT <- function(serostates, sexually.active) {
     serostates[sexually.active , `:=`(
         s..   = s.. * (1-p.m.bef) * (1-p.f.bef),
         mb.a1 = s.. * p.m.bef * (1-p.f.bef),
         mb.a2 = mb.a1 * (1 - p.f.bef),
         mb.   = mb.a2 * (1 - p.f.bef) + mb. * (1 - p.f.bef),
         f.ba1 = s.. * p.f.bef * (1-p.m.bef),
         f.ba2 = f.ba1 * (1 - p.m.bef),
         f.b   = f.ba2 * (1 - p.m.bef) + f.b * (1 - p.m.bef),
         hb1b2 = hb1b2 + .5 * s.. * p.m.bef * p.f.bef + (mb.a1 + mb.a2 + mb.) * p.f.bef,
         hb2b1 = hb2b1 + .5 * s.. * p.m.bef * p.f.bef + (f.ba1 + f.ba2 + f.b) * p.m.bef)
     ]
     return(serostates)
}


pre.coupleMat <- function(serostates, sexually.active) {
     temp <- serostates[sexually.active,]
     temp[,'s..']   = temp[,'s..'] * (1-p.m.bef) * (1-p.f.bef)
     temp[,'mb.a1'] = temp[,'s..'] * p.m.bef * (1-p.f.bef)
     temp[,'mb.a2'] = temp[,'mb.a1'] * (1 - p.f.bef)
     temp[,'mb.']   = temp[,'mb.a2'] * (1 - p.f.bef) + temp[,'mb.'] * (1 - p.f.bef)
     temp[,'f.ba1'] = temp[,'s..'] * p.f.bef * (1-p.m.bef)
     temp[,'f.ba2'] = temp[,'f.ba1'] * (1 - p.m.bef)
     temp[,'f.b']   = temp[,'f.ba2'] * (1 - p.m.bef) + temp[,'f.b'] * (1 - p.m.bef)
     temp[,'hb1b2'] = temp[,'hb1b2'] + .5 * temp[,'s..'] * p.m.bef * p.f.bef + (temp[,'mb.a1'] + temp[,'mb.a2'] + temp[,'mb.']) * p.f.bef
     temp[,'hb2b1'] = temp[,'hb2b1'] + .5 * temp[,'s..'] * p.m.bef * p.f.bef + (temp[,'f.ba1'] + temp[,'f.ba2'] + temp[,'f.b']) * p.m.bef
     serostates[sexually.active,] <- temp
     return(serostates)
}

sexually.active <- rbinom(n, 1,.5)==1
p.m.bef <- .5
p.f.bef <- .8

system.time(
     for(ii in 1:100) {
         serostates <- pre.coupleDT(serostates, sexually.active)
     }
     ) ## about 2.25 seconds


system.time(
     for(ii in 1:100) {
         serostatesMat <- pre.coupleMat(serostatesMat, sexually.active)
     }
     ) ## about 6 seconds

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


