Dear all,

Thank you for your consideration of this topic.

I do not know enough about R internals to join the discussion about sorting mechanisms. In fact, I do not see how ordering could help ave, since the output must keep the order of the input (ave returns only x, not the entire data.frame).

However, while the proposed workaround (i.e. paste0 instead of interaction, cf. https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not solve the "bigger problem" of sorting, it is usable as is and solves the issue.

Therefore, what do you think about it? (i.e. is it relevant for a patch?)

Thanks,

Thomas

> ________________________________________
> From: Abby Spurdle <spurdl...@gmail.com>
> Sent: Monday, March 15, 2021 10:22
> To: SOEIRO Thomas
> Cc: r-devel@r-project.org
> Subject: Re: [Rd] Potential improvements of ave?
>
> Hi Thomas,
>
> These are some great suggestions.
> But I can't help but feel there's a much bigger problem here.
>
> Intuitively, the ave function could (or should) sort the data.
> Then the indexing step becomes almost trivial, in terms of both time
> and space complexity.
> And the ave function is not the only example of where a problem
> becomes much simpler, if the data is sorted.
>
> Historically, I've never found base R functions user-friendly for
> aggregation purposes, or for sorting.
> (At least, not by comparison to SQL).
>
> But that's not the main problem.
> It would seem preferable to sort the data, only once.
> (Rather than sorting it repeatedly, or not at all).
>
> Perhaps, objects such as vectors and data.frame(s) could have a
> boolean attribute, to indicate if they're sorted.
> Or functions such as ave could have a sorted argument.
> In either case, if true, the function assumes the data is sorted and
> applies a more efficient algorithm.
>
>
> B.
>
>
> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas <thomas.soe...@ap-hm.fr> wrote:
>>
>> Dear all,
>>
>> I have two questions/suggestions about ave, but I am not sure if it's
>> relevant for bug reports.
>>
>>
>>
>> 1) I have performance issues with ave in a case where I didn't expect it.
>> The following code runs as expected:
>>
>> set.seed(1)
>>
>> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>>                   id2 = sample(1:3, 5e2, TRUE),
>>                   id3 = sample(1:5, 5e2, TRUE),
>>                   val = sample(1:300, 5e2, TRUE))
>>
>> df1$diff <- ave(df1$val,
>>                 df1$id1,
>>                 df1$id2,
>>                 df1$id3,
>>                 FUN = function(i) c(diff(i), 0))
>>
>> head(df1[order(df1$id1,
>>                df1$id2,
>>                df1$id3), ])
>>
>> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate
>> vector of size 1110.0 Gb):
>>
>> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>>                   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>>                   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>>                   val = sample(1:300, 5e2 * 1e4, TRUE))
>>
>> df2$diff <- ave(df2$val,
>>                 df2$id1,
>>                 df2$id2,
>>                 df2$id3,
>>                 FUN = function(i) c(diff(i), 0))
>>
>> This use case does not seem extreme to me (e.g. aggregate et al work
>> perfectly on this data.frame).
>> So my question is: Is this expected/intended/reasonable? i.e. Does ave need
>> to be optimized?
>>
>>
>>
>> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to
>> avoid warnings in case of unused levels
>> (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
>> Is it relevant/possible to expose the drop argument explicitly?
>>
>>
>>
>> Thanks,
>>
>> Thomas

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
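
For readers following the thread, the workaround mentioned in the reply (building the grouping key by pasting the grouping vectors rather than calling interaction) can be sketched roughly as below. This is an illustration only, not the patch from the linked message: the name ave_paste is hypothetical, and the sketch uses paste() with an explicit separator instead of bare paste0() so that keys such as (1, 23) and (12, 3) cannot collide; that choice is an assumption of this sketch. The idea is that interaction() with its default drop = FALSE creates one level per possible combination of the grouping values (on the order of 1e6 * 3 * 5e4 for df2), whereas a pasted key only produces the combinations that actually occur.

```r
# Hypothetical variant of ave(); the name ave_paste is illustrative only.
ave_paste <- function(x, ..., FUN = mean) {
  if (missing(...)) {
    x[] <- FUN(x)
  } else {
    # One character key per row, built from the observed combinations only.
    # The separator is an assumption of this sketch, added to avoid
    # collisions that bare paste0() could produce.
    g <- paste(..., sep = "\r")
    split(x, g) <- lapply(split(x, g), FUN)
  }
  x
}

# Compare against ave() on the small example from the original message.
set.seed(1)
df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
                  id2 = sample(1:3, 5e2, TRUE),
                  id3 = sample(1:5, 5e2, TRUE),
                  val = sample(1:300, 5e2, TRUE))

f <- function(i) c(diff(i), 0)
res_ave   <- ave(df1$val, df1$id1, df1$id2, df1$id3, FUN = f)
res_paste <- ave_paste(df1$val, df1$id1, df1$id2, df1$id3, FUN = f)

# The two results should agree on this data.
all.equal(res_ave, res_paste)
```

Because the output is still written back through split<-, the result keeps the order of the input, as ave does.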