Okay here we go, once again. A much more detailed look:

A) Let’s start with data.table:

require(data.table) ## 1.9.3 commit 1263
dt <- data.table(x=1:1e7, y=1:1e7)

## with optimisation - the names are removed and added at the end
system.time(dt[, list(z=y), by=x])
#   user  system elapsed  
#  7.481   0.253   8.017  
   
## without optimisation + still no external function
system.time(dt[, {list(z=y)}, by=x])
#   user  system elapsed  
#  9.913   0.076  10.408  

## without optimisation + external function with unnamed list
foo <- function(x) list(x)
system.time(dt[, foo(y), by=x])
#   user  system elapsed  
# 13.742   0.139  14.320  
  
## without optimisation + external function with named list
foo <- function(x) list(z=x)
system.time(dt[, foo(y), by=x])
#   user  system elapsed  
# 15.333   0.181  15.911  
Summary: The difference between evaluating a named and an unnamed list seems to
be around 2.4 seconds without a function and about 1.6 seconds with one.

It is the use of a function for the evaluation that accounts for most of the
~2x slowdown compared to the optimised unnamed case.

B) Let’s verify this by timing the same evaluations in isolation, without any
other factors, in a separate C file:

// test.c
#include <R.h>
#define USE_RINTERNALS
#include <Rinternals.h>
#include <Rdefines.h>

// test function - evaluates 'expr' in 'env' n times; no argument checks,
// no PROTECTs (throwaway benchmark, each result is discarded)
SEXP test(SEXP expr, SEXP env, SEXP n)
{
    R_len_t i;
    SEXP ans = R_NilValue;   // so we still return something if n == 0
    for (i=0; i<INTEGER(n)[0]; i++) {
        ans = eval(expr, env);
    }
    return(ans);
}
Save it as test.c and then, from the command line:

## From command line:
R CMD SHLIB -o test.so test.c
Now, from an R session:

## From R session
dyn.load("~/Downloads/test.so")
env <- new.env()
env$y = 1L

expr = quote(list(z=y))
system.time(.Call("test", expr, env, 1e7L))
#   user  system elapsed  
#  5.249   0.015   5.343  

expr = quote(list(y))
system.time(.Call("test", expr, env, 1e7L))
#   user  system elapsed  
#  4.030   0.010   4.054  

foo <- function(y) list(z=y)
expr = quote(foo(y))
system.time(.Call("test", expr, env, 1e7L))
#   user  system elapsed  
# 11.653   0.021  11.745  

foo <- function(y) list(y)
expr = quote(foo(y))
system.time(.Call("test", expr, env, 1e7L))
#   user  system elapsed  
# 10.064   0.022  10.224  
Summary: More or less the same as (A), but slightly faster. The named-vs-unnamed
difference is always around 1.3-1.5 seconds over 1e7 evaluations, function or
no function. But evaluating through a function still takes much longer.

@Matt, thoughts? Turning verbose on with options(datatable.verbose=TRUE) states
that using named lists is terribly inefficient, but that doesn't seem to be so
much the case here?

C) Let’s now add a call to match() and test with 1e7 groups:

// test.c
#include <R.h>
#define USE_RINTERNALS
#include <Rinternals.h>
#include <Rdefines.h>

// test function - evaluates 'expr' n times and matches each result's
// names against those of a first evaluation; no argument checks
SEXP test(SEXP expr, SEXP env, SEXP n)
{
    R_len_t i;
    SEXP tmp, nm, ans, j = R_NilValue;
    ans = PROTECT(eval(expr, env));  // also keeps 'nm' reachable across allocations
    nm = getAttrib(ans, R_NamesSymbol);
    for (i=0; i<INTEGER(n)[0]; i++) {
        ans = eval(expr, env);
        tmp = getAttrib(ans, R_NamesSymbol);
        j = match(tmp, nm, 0);       // allocates a fresh INTSXP each iteration
    }
    UNPROTECT(1);
    return(j);
}
Running it only on expressions which return a named list:

dyn.load("~/Downloads/test.so")
env <- new.env()
env$y = 1L

expr = quote(list(z=y))
system.time(.Call("test", expr, env, 1e7L))
#   user  system elapsed  
# 15.444   0.042  15.546  

foo <- function(y) list(z=y)
expr = quote(foo(y))
system.time(.Call("test", expr, env, 1e7L))
#   user  system elapsed  
# 26.969   0.062  27.199  
So, when we have to check the names (note that this still only matches the
names; it doesn't yet check whether they're in the right order, etc.), it
takes:

15.5 seconds instead of 5.3 seconds in the case of a named list, and
27.2 seconds instead of 11.7 seconds in the case of a function that returns a
named list.

If we decide to avoid calling match 1e7 times (as done here), then we have to
first collect all the names and results for each group, match once at the end,
and then rearrange the results, which would be very memory inefficient, I'd
think.

Perhaps Matt will have a better outlook on these results.


Arun

From: Arunkumar Srinivasan [email protected]
Reply: Arunkumar Srinivasan [email protected]
Date: April 16, 2014 at 6:41:50 PM
To: Clayton Stanley [email protected], 
[email protected] 
[email protected]
Subject:  Re: [datatable-help] data.table and aggregating out-of-order columns 
in result from by  

Clayton,

Thanks for posting it here. As a first follow-up, here's an example:

require(data.table) ## 1.9.3 commit 1263
dt <- data.table(x=1:1e7, y=1:1e7)

## data.table optimisation removes names
system.time(ans1 <- dt[, list(z=y), by=x])

#   user  system elapsed   
#  7.193   0.275   7.859   
    
## data.table can't optimise to remove names
foo <- function(x) list(z=x)
system.time(ans2 <- dt[, foo(y), by=x])
#   user  system elapsed   
# 16.020   0.179  16.411   

> identical(ans1, ans2)
[1] TRUE

This is without checking for names, for each of the 1e7 groups.


Arun

From: Clayton Stanley [email protected]
Reply: Clayton Stanley [email protected]
Date: April 16, 2014 at 6:23:50 PM
To: [email protected] 
[email protected]
Subject:  [datatable-help] data.table and aggregating out-of-order columns in 
result from by

Copied from this SO post: http://stackoverflow.com/questions/23097461

Here's some interesting behavior that I noticed with data.table 1.9.2

>   testFun <- function(val) {
        if (val == 'geteeee') return(data.table(x=4,y=3))
        if (val == 'get') return(data.table(y=3,x=4))
    }
>   tbl = data.table(val=c('geteeee', 'get'))
>   tbl[, testFun(val), by=val]
       val x y
1: geteeee 4 3
2:     get 3 4
>  

When the column order of the data tables returned from each call to testFun are 
mixed (but have the same name and number of columns), data.table silently binds 
the tables together without taking into account that they are out of order. 
This was probably done for speed, but I found the behavior quite unexpected, 
and would have appreciated at least a warning.

Is there a way that I can get data.table to warn or error when this situation 
happens?

This happened in my analysis code and caused values for two DVs to be 
intermixed. The reason why it happened is that in the 'testFun' there is a 
branch and the returned data table is created within both sides of the branch. 
The branch is necessary to handle the case where the data table used to create 
the final returned data table is empty. So on one side of that branch I 
basically create an empty data table with the correct columns, and on the other 
side the data table is created from the first. The point is that the column
order of the data tables returned from each side of the branch is different.
Now this is certainly a bug on my part in 'testFun'. However, I could have
caught the issue much earlier if I had received a warning from data.table when
the by operation completed and the resulting tables were bound together.

Also since there isn't a check for column order, it does make me worry that 
there are other places in my analysis code where the same thing could be 
happening. What would be ideal is if there were some way for me to tell whether
that is the case. Perhaps a warning, temporarily increasing a 'safety' level via
an
options call, etc. Usually data.table is great at warning me when things are 
not quite right, so I was surprised when I noticed the current behavior. I 
understand that this was done for speed. So maybe temporarily increasing a 
'safety' level is a way to keep things fast by default and have additional 
checks (for a speed cost) when the user wants them? This sort of mimics how 
compiler optimization declarations are done in common lisp.

-Clayton

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
