Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

Arunkumar Srinivasan Tue, 20 May 2014 13:29:24 -0700

I’ve filed FR #5690 to remind myself of the recycling feature; that’d be 
awesome to have.


One feature I forgot to point out in the previous post is that, even when there 
are duplicate names, rbind/rbindlist binds them consistently with ‘base’ when 
use.names=TRUE. With fill=TRUE, it also fills the duplicate columns properly 
(in the order of occurrence).
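
A minimal sketch of the duplicate-name behaviour described above. The tables 
dt1, dt2 and dt3 are made up for illustration and are not from the benchmark; 
use.names and fill are passed explicitly so the result does not depend on the 
installed version's defaults.

```r
library(data.table)

# Hypothetical inputs: dt1 and dt2 share the same names (with "a" duplicated)
# but in a different column order; dt3 has only one "a".
dt1 <- data.table(a = 1L, a = 2L, b = 3L)
dt2 <- data.table(b = 6L, a = 4L, a = 5L)
dt3 <- data.table(a = 7L, b = 8L)

# Bind by name: the first "a" of each table is matched to the first "a" of
# the result, the second "a" to the second, and so on.
dup_bound <- rbindlist(list(dt1, dt2), use.names = TRUE)

# Bind by name and fill: dt3 has no second "a", so that column gets NA
# for the row coming from dt3.
dup_filled <- rbindlist(list(dt1, dt3), use.names = TRUE, fill = TRUE)

print(dup_bound)
print(dup_filled)
```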

Okay, on to benchmarks. I took a set of 10,000 data.tables, each with columns 
V1 to V500 in random order (all integers for simplicity). We only need 
use.names=TRUE here; fill is unnecessary, as every column is present in every 
data.table.

I think this data is big enough to illustrate the point. Also, I was curious to 
see a comparison against dplyr’s rbind_all (commit 1504 devel version). So, 
I’ve added it as well to the benchmarks.

Here’s the data generation. Note: It takes a while for this step to finish.

require(data.table) ## 1.9.3 commit 1267
require(dplyr)      ## commit 1504 devel
set.seed(1L)
foo <- function(k) {
    ans = setDT(lapply(1:k, function(x) sample(10)))
}
bar <- function(ans, k, n) {
    bla = sample(paste0("V", 1:k), n)
    setnames(ans, bla)
}
n = 10000L
ll = vector("list", n)
for (i in 1:n) {
    bla = bar(foo(500L), 500L, 500L)
    .Call("Csetlistelt", ll, i, bla)
}
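
For anyone who wants to try the idea without waiting for the full data set, 
here is a scaled-down sketch of the same setup. The sizes (5 columns, 100 
tables) and the helper name make_dt are assumptions chosen for illustration, 
and the list is filled with plain lapply rather than the internal .Call used 
above.

```r
library(data.table)
set.seed(1L)

k <- 5L    # columns per table (500 in the benchmark above)
n <- 100L  # number of tables (10,000 in the benchmark above)

# make_dt is a hypothetical helper: k integer columns of 10 rows each,
# named V1..Vk in random order.
make_dt <- function(k) {
    dt <- setDT(lapply(seq_len(k), function(x) sample(10L)))
    setnames(dt, sample(paste0("V", seq_len(k))))
    dt
}

small_ll <- lapply(seq_len(n), function(i) make_dt(k))

# Every table has the same columns, only shuffled, so binding by name is
# enough and fill is not needed.
small_ans <- rbindlist(small_ll, use.names = TRUE)

# Sanity check: V1 of the result is V1 of every input, stacked in order.
expected_V1 <- unlist(lapply(small_ll, function(d) d$V1))
identical(small_ans$V1, expected_V1)
```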

And here are the timings:

## data.table v1.9.3 commit 1267's rbindlist
## Timings of three consecutive runs:
system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
   user  system elapsed
 10.909   0.449  11.843

   user  system elapsed
  5.219   0.386   5.640

   user  system elapsed
  5.355   0.429   5.898

## dplyr's rbind_all
## Timings for three consecutive runs
system.time(ans2 <- rbind_all(ll))
   user  system elapsed
 62.769   0.247  63.941

   user  system elapsed
 62.010   0.335  65.876

   user  system elapsed
 55.345   0.359  60.193

> identical(ans1, setDT(ans2)) # [1] TRUE
  
## data.table v1.9.2's rbind version:
## Ran only once, as it took much longer.
system.time(ans1 <- do.call("rbind", ll))
    user  system elapsed  
125.356   2.247 139.000  

> identical(ans1, setDT(ans2)) # [1] TRUE

In summary, the newer implementation is ~11–23x faster than data.table’s 
older implementation and ~5.5–10x faster than dplyr on this (relatively 
huge) data.
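
As a rough cross-check of those ratios, the elapsed times reported above can be 
divided directly. This is only a sketch: it pairs every run with every other 
run, so the exact range depends on which runs one chooses to compare, and it 
lands in the same ballpark as the rounded figures quoted in the summary.

```r
# Elapsed times (seconds) copied from the timings above.
new_elapsed   <- c(11.843, 5.640, 5.898)   # new rbindlist, three runs
dplyr_elapsed <- c(63.941, 65.876, 60.193) # dplyr's rbind_all, three runs
old_elapsed   <- 139.000                   # v1.9.2 rbind, single run

# Range of speed-up factors over all pairings of runs.
vs_old   <- range(old_elapsed / new_elapsed)
vs_dplyr <- range(outer(dplyr_elapsed, new_elapsed, "/"))

round(vs_old, 1)
round(vs_dplyr, 1)
```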

Arun

From: Arunkumar Srinivasan [email protected]
Reply: Arunkumar Srinivasan [email protected]
Date: May 20, 2014 at 9:27:56 PM
To: [email protected] 
[email protected]
Subject:  FR #5249 - rbindlist gains use.names and fill arguments  

Hello everyone,

With the latest commit #1266, the extra functionality offered via rbind 
(use.names and fill) is now also available in rbindlist. In addition, the 
implementation has moved completely to C and is therefore tremendously fast, 
especially when binding with use.names=TRUE and/or fill=TRUE. I’ll try to put 
out a benchmark comparing the speed against the older implementation ASAP.

Note that this change comes at a very low cost to rbindlist’s default speed 
(use.names=FALSE and fill=FALSE). As an example, binding 10,000 data.tables 
with 20 columns each took 0.107 seconds with the new version, whereas the 
older version took 0.095 seconds.
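
A sketch of how one might time the default path on a comparable input. The 
number of rows per table is not stated above, so 10 rows per table is an 
assumption here, and absolute timings will differ by machine and version.

```r
library(data.table)
set.seed(1L)

# 10,000 tables with 20 integer columns each; 10 rows per table is an
# assumption made for this sketch.
plain_ll <- lapply(seq_len(10000L), function(i) {
    setDT(lapply(seq_len(20L), function(x) sample(10L)))
})

# Default path: bind by position, no filling.
timing <- system.time(
    plain_ans <- rbindlist(plain_ll, use.names = FALSE, fill = FALSE)
)
print(timing)
dim(plain_ans)
```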

In addition, the documentation for ?rbindlist has also been improved (#5158 
from Alexander). Here’s the change log from NEWS:

  o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented
     entirely in C. Closes #5249
         -> use.names by default is FALSE for backwards compatibility (doesn't
            bind by names by default)
         -> rbind(...) now just calls rbindlist() internally, except that
            'use.names' is TRUE by default, for compatibility with base (and
            backwards compatibility).
         -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
         -> At least one item of the input list has to have non-null column
            names.
         -> Duplicate columns are bound in the order of occurrence, like base.
         -> Attributes that might exist in individual items would be lost in
            the bound result.
         -> Columns are coerced to the highest SEXPTYPE, if they are different,
            if/when possible.
         -> And incredibly fast ;).
         -> Documentation updated in much detail. Closes DR #5158.
     Eddi's (excellent) work on finding factor levels, type coercion of columns
     etc. are all retained.
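
To make a few of the change-log items concrete, here is a small sketch with 
made-up tables (dt_a, dt_b and dt_c are not from the thread). Both arguments 
are passed explicitly rather than relying on the defaults.

```r
library(data.table)

# Hypothetical inputs for illustration.
dt_a <- data.table(a = 1:2, b = c("x", "y"))
dt_b <- data.table(b = "z", a = 3.5)     # same names, different order and type
dt_c <- data.table(a = 4L, c = TRUE)     # lacks "b", adds "c"

# use.names=TRUE matches columns by name; "a" is integer in dt_a and double
# in dt_b, so the bound column is coerced to the higher type (double).
by_name <- rbindlist(list(dt_a, dt_b), use.names = TRUE)

# fill=TRUE (together with use.names=TRUE) adds missing columns filled
# with NA.
filled <- rbindlist(list(dt_a, dt_c), use.names = TRUE, fill = TRUE)

print(by_name)
print(filled)
```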

Please try it and write back if things aren’t working as they were before. The 
tests that had to be fixed cover extremely rare cases, so I expect minimal 
issues, if any, in this version. I do find that the changes here bring 
consistency to the function.

One (very rarely used) feature that is not available due to this 
implementation is the ability to recycle. With the older rbind:

dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
lst1 <- list(x=4, y=5, z=as.list(1:3))

rbind(dt1, lst1)
#    x y       z
# 1: 1 4     1,2
# 2: 2 5   1,2,3
# 3: 3 6 1,2,3,4
# 4: 4 5       1
# 5: 4 5       2
# 6: 4 5       3

The 4 and 5 are recycled very nicely here. This is not possible at the moment, 
because the earlier rbind implementation relied on as.data.table to convert 
each item to a data.table, and that takes a copy (very inefficient on huge / 
many tables). I’d love to add this feature in C as well, as it would help 
incredibly for use within [.data.table (now that we can fill columns and bind 
by names faster). Will add an FR.
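
Until recycling is supported directly, one possible workaround is to do the 
conversion by hand before binding. This is only a sketch: it relies on 
as.data.table recycling the length-1 items, and it pays the copy mentioned 
above, so it suits small inputs rather than huge / many tables.

```r
library(data.table)

dt1  <- data.table(x = 1:3, y = 4:6, z = list(1:2, 1:3, 1:4))
lst1 <- list(x = 4, y = 5, z = as.list(1:3))

# Convert explicitly: x and y (length 1) are recycled to the length of z.
lst1_dt <- as.data.table(lst1)

# Both items are now data.tables with matching names and row counts.
recycled <- rbindlist(list(dt1, lst1_dt), use.names = TRUE)
print(recycled)
```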

In summary, I think there should be minimal issues, if any, and things should 
be much faster (for rbind cases). Please write back with what you think, if you 
happen to try it out.



Arun

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
