Re: [Rd] Potential improvements of ave?

SOEIRO Thomas Tue, 16 Mar 2021 15:50:58 -0700

Dear all,

Thank you for your consideration on this topic.


I do not have enough knowledge of R internals to join the discussion about 
sorting mechanisms. In fact, I do not see how sorting could help ave, since 
the output must keep the order of the input (ave returns only x, not the 
entire data.frame).
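
To make the ordering point concrete, here is a small sketch (the vectors x 
and g are made up for illustration and do not come from the thread). It shows 
that ave writes each group's result back at the original positions of x, and 
that one could in principle sort by group first and then restore the input 
order with the inverse permutation. Whether this would be any faster inside 
ave is not something the sketch tests.

```r
x <- c(1, 10, 2, 20, 3)
g <- c("b", "a", "b", "a", "b")

# ave() returns the group means at the original positions of x
res <- ave(x, g)

# Sort by group, compute on the sorted data, then undo the sort
o <- order(g)
res_sorted <- ave(x[o], g[o])
res_restored <- res_sorted[order(o)]  # order(o) is the inverse permutation of o

identical(res, res_restored)
```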

However, while the proposed workaround (i.e. paste0 instead of interaction, cf. 
https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not solve 
the "bigger problem" of sorting, it is usable as is and solves the issue. 
Therefore, what do you think about it? (i.e. is it relevant for a patch?)
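
For readers who have not followed the link, the idea can be sketched roughly 
as follows. This is only an illustration under my own assumptions, not the 
code from the linked post: the name ave_paste is hypothetical, and I use 
paste with an explicit separator rather than paste0 so that keys such as 
("1", "12") and ("11", "2") cannot collide. The point is that the grouping 
key is built from the combinations actually observed in the data, whereas 
interaction creates one level for every possible combination.

```r
# Hypothetical variant of ave(); the name and details are illustrative only.
ave_paste <- function(x, ..., FUN = mean) {
  if (missing(...)) {
    x[] <- FUN(x)
  } else {
    # One character key per element, built from observed combinations only.
    # The separator is an assumption, chosen to avoid ambiguous keys.
    g <- paste(..., sep = "\r")
    split(x, g) <- lapply(split(x, g), FUN)
  }
  x
}

# Compare with ave() on the small data.frame from the original report
set.seed(1)
df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
                  id2 = sample(1:3, 5e2, TRUE),
                  id3 = sample(1:5, 5e2, TRUE),
                  val = sample(1:300, 5e2, TRUE))

f <- function(i) c(diff(i), 0)
res_ave   <- ave(df1$val, df1$id1, df1$id2, df1$id3, FUN = f)
res_paste <- ave_paste(df1$val, df1$id1, df1$id2, df1$id3, FUN = f)

identical(res_ave, res_paste)
```

The same call pattern should apply to the larger df2 from my first message, 
but I have not included timings or memory figures here.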

Thanks,

Thomas


> ________________________________________
> From: Abby Spurdle <spurdl...@gmail.com>
> Sent: Monday, 15 March 2021 10:22
> To: SOEIRO Thomas
> Cc: r-devel@r-project.org
> Subject: Re: [Rd] Potential improvements of ave?
>
> Hi Thomas,
>
> These are some great suggestions.
> But I can't help but feel there's a much bigger problem here.
>
> Intuitively, the ave function could (or should) sort the data.
> Then the indexing step becomes almost trivial, in terms of both time
> and space complexity.
> And the ave function is not the only example of where a problem
> becomes much simpler, if the data is sorted.
>
> Historically, I've never found base R functions user-friendly for
> aggregation purposes, or for sorting.
> (At least, not by comparison to SQL).
>
> But that's not the main problem.
> It would seem preferable to sort the data, only once.
> (Rather than sorting it repeatedly, or not at all).
>
> Perhaps, objects such as vectors and data.frame(s) could have a
> boolean attribute, to indicate if they're sorted.
> Or functions such as ave could have a sorted argument.
> In either case, if true, the function assumes the data is sorted and
> applies a more efficient algorithm.
>
>
> B.
>
>
> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <thomas.soe...@ap-hm.fr> wrote:
>>
>> Dear all,
>>
>> I have two questions/suggestions about ave, but I am not sure if it's 
>> relevant for bug reports.
>>
>>
>>
>> 1) I have performance issues with ave in a case where I didn't expect it. 
>> The following code runs as expected:
>>
>> set.seed(1)
>>
>> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>>                   id2 = sample(1:3, 5e2, TRUE),
>>                   id3 = sample(1:5, 5e2, TRUE),
>>                   val = sample(1:300, 5e2, TRUE))
>>
>> df1$diff <- ave(df1$val,
>>                 df1$id1,
>>                 df1$id2,
>>                 df1$id3,
>>                 FUN = function(i) c(diff(i), 0))
>>
>> head(df1[order(df1$id1,
>>                df1$id2,
>>                df1$id3), ])
>>
>> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate 
>> vector of size 1110.0 Gb):
>>
>> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>>                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>>                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>>                   val = sample(1:300, 5e2 * 1e4, TRUE))
>>
>> df2$diff <- ave(df2$val,
>>                 df2$id1,
>>                 df2$id2,
>>                 df2$id3,
>>                 FUN = function(i) c(diff(i), 0))
>>
>> This use case does not seem extreme to me (e.g. aggregate et al work 
>> perfectly on this data.frame).
>> So my question is: Is this expected/intended/reasonable? i.e. Does ave need 
>> to be optimized?
>>
>>
>>
>> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to 
>> avoid warnings in case of unused levels 
>> (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
>> Is it relevant/possible to expose the drop argument explicitly?
>>
>>
>>
>> Thanks,
>>
>> Thomas
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
