In that case I suggest changing only rbindlist so that its default is use.names = fill, and leaving rbind as is.
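To make the suggestion concrete, here is a minimal sketch of how such a default would resolve. It relies only on R evaluating default arguments lazily. Note that bind_args is a made-up stand-in used for illustration, not the actual rbindlist, and the check inside it is my assumption about how the "fill needs use.names" rule could be enforced.

```r
# Hypothetical stand-in for the proposed signature; this is NOT data.table's
# rbindlist. It only reports how the two arguments resolve.
bind_args <- function(l, use.names = fill, fill = FALSE) {
  # Default arguments are evaluated lazily inside the function, so an
  # unset `use.names` takes whatever value `fill` ends up with.
  if (fill && !use.names) stop("fill = TRUE requires use.names = TRUE")
  c(use.names = use.names, fill = fill)
}

bind_args(list())                    # nothing set: both FALSE (the fast case)
bind_args(list(), fill = TRUE)       # use.names follows fill: both TRUE
bind_args(list(), use.names = TRUE)  # bind by name without filling
```

With this arrangement the dependency is visible in the argument list itself, and the only combination that needs an error is an explicit use.names = FALSE together with fill = TRUE.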

On Tue, May 20, 2014 at 5:16 PM, Arunkumar Srinivasan <[email protected]> wrote:
> In the current CRAN:
>
> rbindlist corresponds to use.names=FALSE and fill = FALSE
> rbind corresponds to use.names=TRUE and fill = FALSE
>
> Just to be clear, again, are you suggesting that I change *just* rbindlist's
> defaults to use.names=fill and fill=FALSE or for both?
>
> Arun
>
> From: Gabor Grothendieck [email protected]
> Reply: Gabor Grothendieck [email protected]
> Date: May 20, 2014 at 11:14:15 PM
> To: Arunkumar Srinivasan [email protected]
> Cc: [email protected]
> [email protected]
> Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill
> arguments
>
> Yes. That is what I intended.
>
> rbindlist on CRAN currently has no fill or use.names arguments. What
> combo of the new fill and use.names does the current CRAN rbindlist
> correspond to?
>
> On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan
> <[email protected]> wrote:
>> I think I understand now what you’re trying to say. Going back to an
>> earlier post, you wrote:
>>
>> Then why not make the default of `use.names` be `fill`. Then you don't get
>> the warning and you can tell just from the argument list what the
>> dependencies are.
>>
>> You mean to basically do?
>>
>> rbindlist <- function(l, use.names=fill, fill=FALSE)
>> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)
>>
>> Is this what you mean? If so, the defaults from the previous versions will
>> be changed. Those who use rbind directly without setting use.names will
>> get different results (assuming I understand you correctly this time).
>>
>> Arun
>>
>> From: Gabor Grothendieck [email protected]
>> Reply: Gabor Grothendieck [email protected]
>> Date: May 20, 2014 at 10:49:54 PM
>> To: Arunkumar Srinivasan [email protected]
>> Cc: [email protected]
>> [email protected]
>> Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and
>> fill arguments
>>
>> If I understand this right then the table below shows the valid
>> logical combinations in order of speed (slowest first). Is that
>> right? If so then if fill = FALSE and use.names = fill then we get
>> the fastest case by default.
>>
>> Furthermore if you were concerned that we might be T/T when F/T would
>> be sufficient I don't think that is likely since getting F/T is done
>> by setting use.names = TRUE.
>>
>> fill/use.names
>> T/T (slowest)
>> F/T
>> F/F (fastest)
>>
>> On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan
>> <[email protected]> wrote:
>>> I’ve filed FR #5690 to remind myself of the recycling feature; that’d be
>>> awesome to have.
>>>
>>> One feature I forgot to point out in the previous post is that, even when
>>> there are duplicate names, rbind/rbindlist binds them consistently with
>>> ‘base’ when use.names=TRUE. And it fills the duplicate columns properly
>>> (in the order of occurrence) also when fill=TRUE.
>>>
>>> Okay, on to benchmarks. I took a set of 10,000 data.tables, each with
>>> columns ranging from V1 to V500 in random order (all integers for
>>> simplicity). We only need use.names=TRUE here (as all columns are
>>> available in all data.tables).
>>>
>>> I think this data is big enough to illustrate the point. Also, I was
>>> curious to see a comparison against dplyr’s rbind_all (commit 1504 devel
>>> version). So, I’ve added it as well to the benchmarks.
>>>
>>> Here’s the data generation. Note: It takes a while for this step to
>>> finish.
>>>
>>> require(data.table) ## 1.9.3 commit 1267
>>> require(dplyr)      ## commit 1504 devel
>>> set.seed(1L)
>>> foo <- function(k) {
>>>     ans = setDT(lapply(1:k, function(x) sample(10)))
>>> }
>>> bar <- function(ans, k, n) {
>>>     bla = sample(paste0("V", 1:k), n)
>>>     setnames(ans, bla)
>>> }
>>> n = 10000L
>>> ll = vector("list", n)
>>> for (i in 1:n) {
>>>     bla = bar(foo(500L), 500L, 500L)
>>>     .Call("Csetlistelt", ll, i, bla)
>>> }
>>>
>>> And here are the timings:
>>>
>>> ## data.table v1.9.3 commit 1267's rbindlist
>>> ## Timings of three consecutive runs:
>>> system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
>>>    user  system elapsed
>>>  10.909   0.449  11.843
>>>
>>>    user  system elapsed
>>>   5.219   0.386   5.640
>>>
>>>    user  system elapsed
>>>   5.355   0.429   5.898
>>>
>>> ## dplyr's rbind_all
>>> ## Timings for three consecutive runs
>>> system.time(ans2 <- rbind_all(ll))
>>>    user  system elapsed
>>>  62.769   0.247  63.941
>>>
>>>    user  system elapsed
>>>  62.010   0.335  65.876
>>>
>>>    user  system elapsed
>>>  55.345   0.359  60.193
>>>
>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>>
>>> ## data.table v1.9.2's rbind version:
>>> ## ran only once as it took a bit more.
>>> system.time(ans1 <- do.call("rbind", ll))
>>>    user  system elapsed
>>> 125.356   2.247 139.000
>>>
>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>>
>>> In summary, the newer implementation is roughly 11–23x faster than
>>> data.table’s older implementation and roughly 5.5–10x faster than dplyr
>>> on this (relatively huge) data.
>>>
>>> Arun
>>>
>>> From: Arunkumar Srinivasan [email protected]
>>> Reply: Arunkumar Srinivasan [email protected]
>>> Date: May 20, 2014 at 9:27:56 PM
>>> To: [email protected]
>>> [email protected]
>>> Subject: FR #5249 - rbindlist gains use.names and fill arguments
>>>
>>> Hello everyone,
>>>
>>> With the latest commit #1266, the extra functionality offered via rbind
>>> (use.names and fill) is also now available to rbindlist.
>>> In addition, the implementation has been moved completely to C, and is
>>> therefore tremendously fast, especially for cases where one has to bind
>>> with use.names=TRUE and/or with fill=TRUE. I’ll try to put out a
>>> benchmark comparing speed differences with the older implementation ASAP.
>>>
>>> Note that this change comes at a very low cost to the default speed of
>>> rbindlist - with use.names=FALSE and fill=FALSE. As an example, binding
>>> 10,000 data.tables with 20 columns each took 0.107 seconds in the new
>>> version, whereas the older version ran in 0.095 seconds.
>>>
>>> In addition, the documentation for ?rbindlist has also been improved
>>> (#5158 from Alexander). Here’s the change log from NEWS:
>>>
>>> o 'rbindlist' gains 'use.names' and 'fill' arguments and is now
>>>   implemented entirely in C. Closes #5249
>>>   -> use.names by default is FALSE for backwards compatibility
>>>      (doesn't bind by names by default)
>>>   -> rbind(...) now just calls rbindlist() internally, except that
>>>      'use.names' is TRUE by default,
>>>      for compatibility with base (and backwards compatibility).
>>>   -> fill by default is FALSE. If fill is TRUE, use.names has to be
>>>      TRUE.
>>>   -> At least one item of the input list has to have non-null column
>>>      names.
>>>   -> Duplicate columns are bound in the order of occurrence, like
>>>      base.
>>>   -> Attributes that might exist in individual items would be lost in
>>>      the bound result.
>>>   -> Columns are coerced to the highest SEXPTYPE, if they are
>>>      different, if/when possible.
>>>   -> And incredibly fast ;).
>>>   -> Documentation updated in much detail. Closes DR #5158.
>>>   Eddi's (excellent) work on finding factor levels, type coercion of
>>>   columns etc. are all retained.
>>>
>>> Please try it and write back if things aren’t working as they were
>>> before. The tests that had to be fixed are extremely rare cases.
>>> I suspect there should be minimal issues, if any, in this version.
>>> However, I do find the changes here bring consistency to the function.
>>>
>>> One (very rare) feature that is not available due to this implementation
>>> is the ability to recycle.
>>>
>>> dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
>>> lst1 <- list(x=4, y=5, z=as.list(1:3))
>>>
>>> rbind(dt1, lst1)
>>> #    x y       z
>>> # 1: 1 4     1,2
>>> # 2: 2 5   1,2,3
>>> # 3: 3 6 1,2,3,4
>>> # 4: 4 5       1
>>> # 5: 4 5       2
>>> # 6: 4 5       3
>>>
>>> The 4,5 are recycled very nicely here. This is not possible at the
>>> moment. This is because the earlier rbind implementation used
>>> as.data.table to convert to data.table, which, however, takes a copy
>>> (very inefficient on huge / many tables). I’d love to add this feature
>>> in C as well, as it would help incredibly for use within [.data.table
>>> (now that we can fill columns and bind by names faster). Will add a FR.
>>>
>>> In summary, I think there should be minimal issues, if any, and it
>>> should be much faster (for rbind cases). Please write back what you
>>> think, if you happen to try it out.
>>>
>>> Arun

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
