If I understand this right, the table below shows the valid logical combinations in order of speed (slowest first). Is that right? If so, then if the defaults are fill = FALSE and use.names = fill, we get the fastest case by default.
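
To make the three combinations concrete, here is a minimal sketch. The toy tables d1, d2 and d3 are made up for illustration and do not come from the thread. The "fastest" and "slowest" labels in the comments repeat the ordering asked about above; the example does not measure them.

```r
library(data.table)

# Hypothetical toy inputs: d1 and d2 share columns in a different order,
# d3 lacks column b.
d1 <- data.table(a = 1:2, b = 3:4)
d2 <- data.table(b = 5:6, a = 7:8)
d3 <- data.table(a = 9L)

# fill/use.names = F/F: bind by position (presumed fastest).
# d2's b values end up under a because names are ignored.
ff <- rbindlist(list(d1, d2), use.names = FALSE, fill = FALSE)

# fill/use.names = F/T: bind by name; every table must have the same columns.
ft <- rbindlist(list(d1, d2), use.names = TRUE, fill = FALSE)

# fill/use.names = T/T: bind by name and pad missing columns with NA
# (presumed slowest).
tt <- rbindlist(list(d1, d2, d3), use.names = TRUE, fill = TRUE)
```

With these inputs, ff$a mixes values from d1's a and d2's b, ft$a holds only a values, and tt$b ends in NA for the row that came from d3.
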
Furthermore, if you were concerned that we might be T/T when F/T would be sufficient, I don't think that is likely, since getting F/T is done by setting use.names = TRUE.

fill/use.names
T/T (slowest)
F/T
F/F (fastest)

On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan <[email protected]> wrote:
> I’ve filed FR #5690 to remind myself of the recycling feature; that’d be
> awesome to have.
>
> One feature I forgot to point out in the previous post is that, even when
> there are duplicate names, rbind/rbindlist binds them consistent with ‘base’
> when use.names=TRUE. And it fills the duplicate columns properly (in the
> order of occurrence) also when fill=TRUE.
>
> Okay, on to benchmarks. I took a set of 10,000 data.tables, each with
> columns ranging from V1 to V500 in random order (all integers for
> simplicity). We’ll need to just use use.names=TRUE (as all columns are
> available in all data.tables).
>
> I think this data is big enough to illustrate the point. Also, I was curious
> to see a comparison against dplyr’s rbind_all (commit 1504 devel version).
> So, I’ve added it as well to the benchmarks.
>
> Here’s the data generation. Note: It takes a while for this step to finish.
>
> require(data.table) ## 1.9.3 commit 1267
> require(dplyr) ## commit 1504 devel
> set.seed(1L)
> foo <- function(k) {
>     ans = setDT(lapply(1:k, function(x) sample(10)))
> }
> bar <- function(ans, k, n) {
>     bla = sample(paste0("V", 1:k), n)
>     setnames(ans, bla)
> }
> n = 10000L
> ll = vector("list", n)
> for (i in 1:n) {
>     bla = bar(foo(500L), 500L, 500L)
>     .Call("Csetlistelt", ll, i, bla)
> }
>
> And here are the timings:
>
> ## data.table v1.9.3 commit 1267's rbindlist
> ## Timings of three consecutive runs:
> system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
>    user  system elapsed
>  10.909   0.449  11.843
>
>    user  system elapsed
>   5.219   0.386   5.640
>
>    user  system elapsed
>   5.355   0.429   5.898
>
> ## dplyr's rbind_all
> ## Timings for three consecutive runs
> system.time(ans2 <- rbind_all(ll))
>    user  system elapsed
>  62.769   0.247  63.941
>
>    user  system elapsed
>  62.010   0.335  65.876
>
>    user  system elapsed
>  55.345   0.359  60.193
>
>> identical(ans1, setDT(ans2)) # [1] TRUE
>
> ## data.table v1.9.2's rbind version:
> ## ran only once as it took a bit more.
> system.time(ans1 <- do.call("rbind", ll))
>    user  system elapsed
> 125.356   2.247 139.000
>
>> identical(ans1, setDT(ans2)) # [1] TRUE
>
> In summary, the newer implementation is about ~11–23x faster than
> data.table’s older implementation and is ~5.5–10x faster against dplyr on
> this (relatively huge) data.
>
> Arun
>
> From: Arunkumar Srinivasan [email protected]
> Reply: Arunkumar Srinivasan [email protected]
> Date: May 20, 2014 at 9:27:56 PM
> To: [email protected]
> [email protected]
> Subject: FR #5249 - rbindlist gains use.names and fill arguments
>
> Hello everyone,
>
> With the latest commit #1266, the extra functionality offered via rbind
> (use.names and fill) is also now available to rbindlist. In addition, the
> implementation is completely moved to C, and is therefore tremendously fast,
> especially for cases where one has to bind using with use.names=TRUE and/or
> with fill=TRUE.
> I’ll try to put out a benchmark comparing speed differences with the
> older implementation ASAP.
>
> Note that this change comes with a very low cost to the default speed to
> rbindlist - with use.names=FALSE and fill=FALSE. As an example, binding
> 10,000 data.tables with 20 columns each, resulted in the new version running
> in 0.107 seconds, whereas the older version ran in 0.095 seconds.
>
> In addition the documentation for ?rbindlist also has been improved (#5158
> from Alexander). Here’s the change log from NEWS:
>
>   o 'rbindlist' gains 'use.names' and 'fill' arguments and is now
>     implemented entirely in C. Closes #5249
>       -> use.names by default is FALSE for backwards compatibility
>          (doesn't bind by names by default)
>       -> rbind(...) now just calls rbindlist() internally, except that
>          'use.names' is TRUE by default,
>          for compatibility with base (and backwards compatibility).
>       -> fill by default is FALSE. If fill is TRUE, use.names has to be
>          TRUE.
>       -> At least one item of the input list has to have non-null column
>          names.
>       -> Duplicate columns are bound in the order of occurrence, like
>          base.
>       -> Attributes that might exist in individual items would be lost in
>          the bound result.
>       -> Columns are coerced to the highest SEXPTYPE, if they are
>          different, if/when possible.
>       -> And incredibly fast ;).
>       -> Documentation updated in much detail. Closes DR #5158.
>     Eddi's (excellent) work on finding factor levels, type coercion of
>     columns etc. are all retained.
>
> Please try it and write back if things aren’t working as it was before. The
> tests that had to be fixed are extremely rare cases. I suspect there should
> be minimal issue, if at all, in this version. However, I do find the changes
> here bring consistency to the function.
>
> One (very rare) feature that is not available due to this implementation is
> the ability to recycle.
>
> dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
> lst1 <- list(x=4, y=5, z=as.list(1:3))
>
> rbind(dt1, lst1)
> #    x y       z
> # 1: 1 4     1,2
> # 2: 2 5   1,2,3
> # 3: 3 6 1,2,3,4
> # 4: 4 5       1
> # 5: 4 5       2
> # 6: 4 5       3
>
> The 4,5 are recycled very nicely here.. This is not possible at the moment.
> This is because the earlier rbind implementation used as.data.table to
> convert to data.table, however it takes a copy (very inefficient on huge /
> many tables). I’d love to add this feature in C as well, as it would help
> incredibly for use within [.data.table (now that we can fill columns and
> bind by names faster). Will add a FR.
>
> In summary, I think there should be minimal issues, if any and should be
> much faster (for rbind cases). Please write back what you think, if you
> happen to try out.
>
>
>
> Arun
>
>
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
