Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

Gabor Grothendieck Tue, 20 May 2014 16:03:33 -0700

In that case I suggest just changing rbindlist to have use.names =
fill and leaving rbind as is.
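
To make the suggestion concrete, here is a minimal base-R sketch of how a default of use.names = fill behaves. R evaluates default arguments lazily inside the function, so use.names simply follows whatever fill turns out to be unless it is set explicitly. The function show_defaults is a hypothetical stand-in for rbindlist's signature, not the real implementation; it only reports the values it ends up with.

```r
# Hypothetical stand-in for rbindlist's argument list (assumption: not the
# real function). It does no binding and only returns the resolved arguments.
show_defaults <- function(l, use.names = fill, fill = FALSE) {
  c(use.names = use.names, fill = fill)
}

# Nothing set: use.names follows fill, so both are FALSE (the fast case).
print(show_defaults(list()))

# Only fill set: use.names follows it, so both are TRUE.
print(show_defaults(list(), fill = TRUE))

# use.names set explicitly: fill keeps its own default of FALSE.
print(show_defaults(list(), use.names = TRUE))
```

The dependency between the two arguments is then visible from the argument list alone, which was the point of the suggestion.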


On Tue, May 20, 2014 at 5:16 PM, Arunkumar Srinivasan
<[email protected]> wrote:
> In the current CRAN:
>
> rbindlist corresponds to use.names=FALSE and fill = FALSE
> rbind corresponds to use.names=TRUE and fill = FALSE
>
> Just to be clear, again, are you suggesting that I change *just* rbindlist's
> defaults to use.names=fill and fill=FALSE or for both?
> Arun
>
> From: Gabor Grothendieck [email protected]
> Reply: Gabor Grothendieck [email protected]
> Date: May 20, 2014 at 11:14:15 PM
>
> To: Arunkumar Srinivasan [email protected]
> Cc: [email protected]
> [email protected]
> Subject:  Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill
> arguments
>
> Yes. That is what I intended.
>
> rbindlist on CRAN currently has no fill or use.names arguments. What
> combo of the new fill and use.names does the current CRAN rbindlist
> correspond to?
>
>
>
> On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan
> <[email protected]> wrote:
>> I think I understand now what you’re trying to say. Going back to an
>> earlier post, you wrote:
>>
>> Then why not make the default of `use.names` be `fill`. Then you don't get
>> the warning and you can tell just from the argument list what the
>> dependencies are.
>>
>> You mean to basically do?
>>
>> rbindlist <- function(l, use.names=fill, fill=FALSE)
>> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)
>>
>> Is this what you mean? If so, the defaults from the previous versions will
>> change. Anyone who uses rbind directly without setting use.names will
>> get different results (assuming I understand you correctly this time).
>>
>>
>> Arun
>>
>> From: Gabor Grothendieck [email protected]
>> Reply: Gabor Grothendieck [email protected]
>> Date: May 20, 2014 at 10:49:54 PM
>>
>> To: Arunkumar Srinivasan [email protected]
>> Cc: [email protected]
>> [email protected]
>> Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and
>> fill
>> arguments
>>
>> If I understand this right then the table below shows the valid
>> logical combinations in order of speed (slowest first). Is that
>> right? If so then if fill = FALSE and use.names = fill then we get
>> the fastest case by default.
>>
>> Furthermore if you were concerned that we might be T/T when F/T would
>> be sufficient I don't think that is likely since getting F/T is done
>> by setting use.names = TRUE.
>>
>> fill/use.names
>> T/T (slowest)
>> F/T
>> F/F (fastest)
>>
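
As a rough illustration of the three combinations in the table, here is a small sketch. It assumes a data.table version in which rbindlist already accepts use.names and fill; the tables a, b and d are made up for this example.

```r
library(data.table)

# Two toy tables with the same columns in a different order.
a <- data.table(x = 1:2, y = 3:4)
b <- data.table(y = 5:6, x = 7:8)

# fill/use.names = F/F: bind by position; names come from the first item.
by_pos <- rbindlist(list(a, b), use.names = FALSE, fill = FALSE)

# fill/use.names = F/T: match columns by name before binding.
by_name <- rbindlist(list(a, b), use.names = TRUE, fill = FALSE)

# fill/use.names = T/T: additionally pad columns missing from an item with NA.
d <- data.table(x = 9L)
filled <- rbindlist(list(a, d), use.names = TRUE, fill = TRUE)

print(by_pos)
print(by_name)
print(filled)
```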
>>
>> On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan
>> <[email protected]> wrote:
>>> I’ve filed FR #5690 to remind myself of the recycling feature; that’d be
>>> awesome to have.
>>>
>>> One feature I forgot to point out in the previous post is that, even when
>>> there are duplicate names, rbind/rbindlist binds them consistently with
>>> ‘base’ when use.names=TRUE. It also fills the duplicate columns properly
>>> (in the order of occurrence) when fill=TRUE.
>>>
>>> Okay, on to benchmarks. I took a set of 10,000 data.tables, each with
>>> columns ranging from V1 to V500 in random order (all integers for
>>> simplicity). We only need use.names=TRUE here (as all columns are
>>> available in all data.tables).
>>>
>>> I think this data is big enough to illustrate the point. Also, I was
>>> curious to see a comparison against dplyr’s rbind_all (commit 1504 devel
>>> version), so I’ve added it to the benchmarks as well.
>>>
>>> Here’s the data generation. Note: It takes a while for this step to
>>> finish.
>>>
>>> require(data.table) ## 1.9.3 commit 1267
>>> require(dplyr) ## commit 1504 devel
>>> set.seed(1L)
>>> ## build a data.table with k integer columns of 10 rows each
>>> foo <- function(k) {
>>>     ans = setDT(lapply(1:k, function(x) sample(10)))
>>> }
>>> ## rename the columns to a random ordering of V1..Vk
>>> bar <- function(ans, k, n) {
>>>     bla = sample(paste0("V", 1:k), n)
>>>     setnames(ans, bla)
>>> }
>>> n = 10000L
>>> ll = vector("list", n)
>>> for (i in 1:n) {
>>>     bla = bar(foo(500L), 500L, 500L)
>>>     ## internal data.table C routine; stores bla as element i of ll
>>>     .Call("Csetlistelt", ll, i, bla)
>>> }
>>>
>>> And here are the timings:
>>>
>>> ## data.table v1.9.3 commit 1267's rbindlist
>>> ## Timings of three consecutive runs:
>>> system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
>>>    user  system elapsed
>>>  10.909   0.449  11.843
>>>
>>>    user  system elapsed
>>>   5.219   0.386   5.640
>>>
>>>    user  system elapsed
>>>   5.355   0.429   5.898
>>>
>>> ## dplyr's rbind_all
>>> ## Timings for three consecutive runs
>>> system.time(ans2 <- rbind_all(ll))
>>>    user  system elapsed
>>>  62.769   0.247  63.941
>>>
>>>    user  system elapsed
>>>  62.010   0.335  65.876
>>>
>>>    user  system elapsed
>>>  55.345   0.359  60.193
>>>
>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>>
>>> ## data.table v1.9.2's rbind version:
>>> ## ran only once as it took a bit longer.
>>> system.time(ans1 <- do.call("rbind", ll))
>>>    user  system elapsed
>>> 125.356   2.247 139.000
>>>
>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>>
>>> In summary, the newer implementation is about 11–23x faster than
>>> data.table’s older implementation and about 5.5–10x faster than dplyr on
>>> this (relatively huge) data.
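
For anyone wanting to check the ratios, here is a quick sketch that uses only the elapsed times quoted above. Pairing run i of rbind_all with run i of rbindlist is my own assumption; other pairings, or the user times instead of elapsed, shift the ranges somewhat, so treat the result as a ballpark.

```r
# Elapsed seconds copied from the timings quoted above.
new_rbindlist <- c(11.843, 5.640, 5.898)    # three runs, new rbindlist
dplyr_all     <- c(63.941, 65.876, 60.193)  # three runs, rbind_all
old_rbind     <- 139.000                    # single run, v1.9.2 rbind

# Speedup of the new rbindlist over the old rbind, one value per new run.
speedup_vs_old <- old_rbind / new_rbindlist

# Speedup over rbind_all, comparing run i with run i (assumed pairing).
speedup_vs_dplyr <- dplyr_all / new_rbindlist

print(round(range(speedup_vs_old), 1))
print(round(range(speedup_vs_dplyr), 1))
```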
>>>
>>> Arun
>>>
>>> From: Arunkumar Srinivasan [email protected]
>>> Reply: Arunkumar Srinivasan [email protected]
>>> Date: May 20, 2014 at 9:27:56 PM
>>> To: [email protected]
>>> [email protected]
>>> Subject: FR #5249 - rbindlist gains use.names and fill arguments
>>>
>>> Hello everyone,
>>>
>>> With the latest commit #1266, the extra functionality offered via rbind
>>> (use.names and fill) is now also available to rbindlist. In addition, the
>>> implementation has been moved completely to C and is therefore
>>> tremendously fast, especially for cases where one has to bind with
>>> use.names=TRUE and/or fill=TRUE. I’ll try to put out a benchmark
>>> comparing speed differences with the older implementation ASAP.
>>>
>>> Note that this change comes at a very low cost to the default speed of
>>> rbindlist (with use.names=FALSE and fill=FALSE). As an example, binding
>>> 10,000 data.tables with 20 columns each took 0.107 seconds with the new
>>> version, whereas the older version took 0.095 seconds.
>>>
>>> In addition, the documentation for ?rbindlist has also been improved
>>> (#5158 from Alexander). Here’s the change log from NEWS:
>>>
>>> o 'rbindlist' gains 'use.names' and 'fill' arguments and is now
>>>   implemented entirely in C. Closes #5249
>>>   -> use.names by default is FALSE for backwards compatibility
>>>      (doesn't bind by names by default)
>>>   -> rbind(...) now just calls rbindlist() internally, except that
>>>      'use.names' is TRUE by default, for compatibility with base
>>>      (and backwards compatibility).
>>>   -> fill by default is FALSE. If fill is TRUE, use.names has to be
>>>      TRUE.
>>>   -> At least one item of the input list has to have non-null column
>>>      names.
>>>   -> Duplicate columns are bound in the order of occurrence, like
>>>      base.
>>>   -> Attributes that might exist in individual items would be lost in
>>>      the bound result.
>>>   -> Columns are coerced to the highest SEXPTYPE, if they are
>>>      different, if/when possible.
>>>   -> And incredibly fast ;).
>>>   -> Documentation updated in much detail. Closes DR #5158.
>>>   Eddi's (excellent) work on finding factor levels, type coercion of
>>>   columns etc. are all retained.
>>>
>>> Please try it and write back if things aren’t working as they were
>>> before. The tests that had to be fixed are extremely rare cases. I
>>> suspect there should be minimal issues, if any, in this version.
>>> However, I do find the changes here bring consistency to the function.
>>>
>>> One (very rare) feature that is not available due to this implementation
>>> is the ability to recycle.
>>>
>>> dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
>>> lst1 <- list(x=4, y=5, z=as.list(1:3))
>>>
>>> rbind(dt1, lst1)
>>> #    x y       z
>>> # 1: 1 4     1,2
>>> # 2: 2 5   1,2,3
>>> # 3: 3 6 1,2,3,4
>>> # 4: 4 5       1
>>> # 5: 4 5       2
>>> # 6: 4 5       3
>>>
>>> The 4,5 are recycled very nicely here. This is not possible at the
>>> moment, because the earlier rbind implementation used as.data.table to
>>> convert to data.table, which takes a copy (very inefficient on huge or
>>> many tables). I’d love to add this feature in C as well, as it would help
>>> incredibly for use within [.data.table (now that we can fill columns and
>>> bind by names faster). Will add a FR.
>>>
>>> In summary, I think there should be minimal issues, if any, and it should
>>> be much faster (for rbind cases). Please write back with what you think
>>> if you happen to try it out.
>>>
>>>
>>>
>>> Arun
>>>
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> [email protected]
>>>
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
