Re: [datatable-help] Speeding up column references with roll

Arunkumar Srinivasan Mon, 30 Jun 2014 16:30:26 -0700

Thanks, that helped. To illustrate on your big data (from the first post), your 
question is:

require(data.table) ## 1.9.3
set.seed(12312391)
d <- data.table(
          group = sample(1e3,1e7,replace=TRUE),
          time = ceiling(runif(1e7, 0, 1e5)),
          hit = rbinom(1e7, 1, prob = 0.1) == 1L,  ## logical, so d[(hit)] subsets rows
  key=c("group","time"))

system.time(ans1 <- d[(hit)][d,list(hittime=time),roll=-20,by=.EACHI]) ## 5.4 sec
system.time(ans2 <- d[(hit)][d,time,roll=-20,by=.EACHI])  ## 3.4 sec

setnames(ans2, 3L, "hittime")
setkey(ans1, NULL)
setkey(ans2, NULL)
identical(ans1, ans2) # [1] TRUE

Why this difference? That’s a great question!

Note that this is not because you didn’t set a name: [.data.table is clever 
enough to strip names before calling dogroups. Just to be sure, we’ll do a 
check:

system.time(ans3 <- d[(hit)][d,list(time),roll=-20,by=.EACHI]) ## 5.7 sec
setnames(ans3, 3L, "hittime")
setkey(ans3, NULL)
identical(ans1, ans3) # [1] TRUE

The difference comes from the list(.) wrapper in the j-expression of both slow 
cases. The j-expression is evaluated at C level once per group: in the slow 
cases that is eval(list(time)), in the fast case eval(time). My guess is that 
the extra call to list() for every group is what accounts for the difference.

It’d be easy to test this by writing a simple C-script and evaluating both 
expressions, but I don’t have the time to do that right now. However, here’s an 
alternate “easy-route” to verify.

require(data.table) ## 1.9.3
DT <- data.table(x=rep(1:1e7, 2L), y=1L)
system.time(ans1 <- DT[, .N, by=x])            ## 3.5 sec
system.time(ans2 <- DT[, list(N = .N), by=x])  ## 5.8 sec

Basically, when the j-expression has just one entry, we could gain some 
speedup by dropping the list() wrapped around it.
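
A minimal sketch for reproducing the comparison above on a smaller table. The 
table size and the names `bare` and `wrapped` are illustrative assumptions, not 
from the original post, and the timings will depend on the data.table version 
installed (later releases may optimise both forms), so none are quoted here.

```r
library(data.table)

# Hypothetical small table; the post above used 2e7 rows.
set.seed(1)
DT <- data.table(x = sample(1e4, 1e5, replace = TRUE), y = 1L)

bare    <- DT[, .N, by = x]            # j is a bare symbol
wrapped <- DT[, list(N = .N), by = x]  # the same j wrapped in list()

# Both forms should return the same groups and counts.
identical(bare, wrapped)

# Compare the cost of the two forms on your own installation.
system.time(DT[, .N, by = x])
system.time(DT[, list(N = .N), by = x])
```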

It’d be great if you could cite this thread from the data.table mailing list 
and file an issue here: 
https://github.com/Rdatatable/data.table/issues?direction=desc&labels=&milestone=&page=1&sort=updated&state=open

Arun

From: Stavros Macrakis (Σταῦρος Μακράκης) [email protected]
Reply: Stavros Macrakis (Σταῦρος Μακράκης) [email protected]
Date: July 1, 2014 at 12:51:36 AM
To: Arunkumar Srinivasan [email protected]
Cc: [email protected] [email protected]
Subject:  Re: [datatable-help] Speeding up column references with roll  

Thanks for your reply, but your code doesn't do the same thing as mine. Here's 
a very small example of what I'm trying to do.

# Test data

> dd <- data.table(groups=rep(1:2,each=4),time=1:8,hit=1:8%%3==0,key=c("groups","time"))
> dd
   groups time   hit
1:      1    1 FALSE
2:      1    2 FALSE
3:      1    3  TRUE
4:      1    4 FALSE
5:      2    5 FALSE
6:      2    6  TRUE
7:      2    7 FALSE
8:      2    8 FALSE

# Desired output includes the time and the corresponding roll time

> (res1 <- dd[(hit)][dd,list(rolltime=time),roll=2,by=.EACHI][!is.na(rolltime)])
   groups time rolltime
1:      1    3       3
2:      1    4       3
3:      2    6       6
4:      2    7       6
5:      2    8       6

# Undesired output (without .EACHI)

> (res2 <- dd[hit==1][dd,list(rolltime=time),roll=2][!is.na(rolltime)])
   rolltime
1:       1
2:       2
3:       3
4:       4
5:       5
6:       6
7:       7
8:       8

# Undesired output (with allow.cartesian)

> res3 <- dd[hit==1][dd,list(rolltime=time),roll=2,allow.cartesian=TRUE][!is.na(rolltime)]
> identical(res2,res3)
[1] TRUE

Re rolltime vs. time, consider the following:

> dd[(hit)][dd,time,roll=2,by=.EACHI]
   groups time time
1:      1    1   NA
2:      1    2   NA
3:      1    3    3
4:      1    4    3
5:      2    5   NA
6:      2    6    6
7:      2    7    6
8:      2    8    6

There are two different output columns named 'time'. One is the time from the 
right relation of the join, the other is the time from the left relation of the 
join. There is nothing like the i.time convention for distinguishing the time 
that comes from one of the tables from the (rolled) time that comes from the 
other.
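
For readers on later data.table releases: as far as I know, an `x.` prefix was 
added to `j` after this thread (around v1.9.8), alongside the existing `i.` 
prefix. Assuming such a version, a sketch of naming both times explicitly on 
the small example above could look like this. The names `hits`, `querytime` 
and `res` are illustrative only.

```r
library(data.table)

dd <- data.table(groups = rep(1:2, each = 4), time = 1:8,
                 hit = 1:8 %% 3 == 0, key = c("groups", "time"))

hits <- dd[(hit)]
setkey(hits, groups, time)  # set the key explicitly rather than rely on it being kept

# i.time is the time of the querying row (from dd);
# x.time is the rolled time of the matching hit (from hits).
# Assumption: the x. prefix is available in the installed version.
res <- hits[dd,
            list(querytime = i.time, rolltime = x.time),
            roll = 2, by = .EACHI][!is.na(rolltime)]
res
```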

           -s

On Mon, Jun 30, 2014 at 5:34 PM, Arunkumar Srinivasan <[email protected]> 
wrote:
Your example doesn’t work without allow.cartesian=TRUE.

You shouldn’t be using by=.EACHI here. That grouping was implicit in earlier 
versions, which is what made it slow. Please re-read the README.

Here’s the function I tested on 1.9.3:

calc1 <- function(d) {
    d[ hit==1][ d,list(hittime=time),roll=-20, allow.cartesian=TRUE][ !is.na(hittime)]
}

calc2 <- function(d) {
  temp <- d[ hit==1][ d,list(time),roll=-20, allow.cartesian=TRUE]
  setnames(temp,1,"hittime")
  temp[!is.na(hittime)]
}

# Generate sample data
set.seed(12312391)
data <- data.table(
          group = sample(1e3,1e7,replace=T),
          time = ceiling(runif(1e7, 0, 1e5)),
          hit = rbinom(1e7, 1, p = 0.1),
  key=c("group","time"))

system.time(ans1 <- calc1(data))
#   user  system elapsed   
#  2.083   0.189   2.344   
system.time(ans2 <- calc2(data))
#   user  system elapsed   
#  2.012   0.241   2.426   
identical(ans1, ans2) # [1] TRUE

You write:
I also don't see any way to refer to the different time vs. hittime without 
renaming the second time column.

I don’t quite follow what this means, but IIUC I think this is what you’re 
referring to: https://github.com/Rdatatable/data.table/issues/471

You write:
You mention some FR's, but they're hard to find without the specific numbers.

I was referring to the first two points under NEW FEATURES within Changes in 
v1.9.3: the one that starts with "by=.EACHI runs j for each group in x that 
each row of i joins to." and the one that starts with "Accordingly, X[Y, j] 
now does what X[Y][, j] did."
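
A minimal sketch of the difference between those two behaviours, using 
hypothetical toy tables `X` and `Y` that are not from this thread:

```r
library(data.table)

# Hypothetical toy tables for illustration only.
X <- data.table(id = c(1L, 1L, 2L), v = 1:3, key = "id")
Y <- data.table(id = 1:2)

# Without by: j runs once over the whole join result, like X[Y][, sum(v)].
total <- X[Y, sum(v)]

# With by=.EACHI: j runs once per row of Y, on the rows of X that row joins to.
per_row <- X[Y, sum(v), by = .EACHI]

total
per_row
```

The first form returns a single sum over all joined rows; the second returns 
one row per row of Y, with the join column followed by the per-group result.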

Maybe we should start numbering the fixes for easy reference. Will note it down.

You write: Where can I find the 1.9.3 reference manual?

This version is a development version. Necessary changes will be reflected in 
the corresponding ?... help entries. And when we find some time, the 
introduction and FAQs will be updated. But that’s not yet.

If you don’t wish to keep up-to-date by looking at the NEWS, you’ll have to 
wait until the next stable release on CRAN.

You write: On my system (MacOSX), build_vignettes=TRUE gives an error in 
texi2dvi -- would that have generated the refman? If so, how do I fix that?

I’m guessing it’s a PDF latex error. If so, you’ll have to install what the 
error message says is missing on your system. Sorry, can’t help you much there.

Arun

From: Stavros Macrakis (Σταῦρος Μακράκης) [email protected]
Reply: Stavros Macrakis (Σταῦρος Μακράκης) [email protected]
Date: June 30, 2014 at 10:40:24 PM
To: Arunkumar Srinivasan [email protected]
Cc: [email protected] [email protected]
Subject:  Re: [datatable-help] Speeding up column references with roll

OK, I'm retesting in 1.9.3, adding by=.EACHI. I don't see any significant 
difference in the timings -- setnames is still 25% faster than 
list(hittime=time). What exactly was fixed?

I also don't see any way to refer to the different time vs. hittime without 
renaming the second time column.

You mention some FR's, but they're hard to find without the specific numbers.

Where can I find the 1.9.3 reference manual? I think it would be easier to 
understand for me than the incremental changes in the New Features listings. On 
my system (MacOSX), build_vignettes=TRUE gives an error in texi2dvi -- would 
that have generated the refman? If so, how do I fix that?

Thanks,

               -s

On Mon, Jun 30, 2014 at 1:00 PM, Arunkumar Srinivasan <[email protected]> 
wrote:
Once again, this has been fixed in 1.9.3. A join now requires an explicit 
`by=.EACHI` to perform a by-without-by.
https://github.com/Rdatatable/data.table/blob/master/README.md
Have a look at the first FR (by = .EACHI runs ...) that's been fixed in 1.9.3. 
These changes, which have been under discussion for quite some time, alter the 
result a join returns in order to bring more consistency to the DT[i, j, by] 
syntax. Also have a look at the second FR and the links it points to for the 
discussions.

In general, it's better to test with the devel version (and have a look at 
README) for any bugs you may encounter.

Arun

From: Stavros Macrakis (Σταῦρος Μακράκης) [email protected]
Reply: Stavros Macrakis (Σταῦρος Μακράκης) [email protected]
Date: June 30, 2014 at 5:38:10 PM
To: [email protected] [email protected]
Subject:  [datatable-help] Speeding up column references with roll

In the following example, it is about 15-25% faster to use setnames rather than 
j=list(name=var). Is there some better approach to referencing the other joined 
column when using roll?

# Use j=list(name=var)
calc1 <- function(d) {
  d[ hit==1
   ][ d,list(hittime=time),roll=-20
   ][ !is.na(hittime)
   ]
}

# Use setnames
calc2 <- function(d) {
  temp <- d[ hit==1
           ][ d,time,roll=-20
           ]
  setnames(temp,3,"hittime")
  temp[!is.na(hittime)]
}

# Generate sample data
set.seed(12312391)
data <- data.table(
          group = sample(1e3,1e7,replace=T),
          time = ceiling(runif(1e7, 0, 1e5)),
          hit = rbinom(1e7, 1, p = 0.1),
  key=c("group","time"))

# Timing

system.time(replicate(10,{gc();calc1(data)})) # => 69 sec
system.time(replicate(10,{gc();calc2(data)})) # => 52 sec
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
