>> Hi all, >> >> I am trying to understand the performance of functions applied to integer >sequences. Consider the following: >> >> ### begin example ### >> >> library(lobstr) >> library(microbenchmark) >> >> x <- sample(1e6) >> obj_size(x) >> # 4,000,048 B >> >> y <- 1:1e6 >> obj_size(y) >> # 680 B >> >> # So we can see that 'y' uses ALTREP. These are, as expected, the same: >> >> sum(x) >> # [1] 500000500000 >> sum(y) >> # [1] 500000500000 >> >> # For 'x', we have to go through the trouble of actually summing up 1e6 >integers. >> # For 'y', knowing its form, we really just need to do: >> >> 1e6*(1e6+1)/2 >> # [1] 500000500000 >> >> # which should be a whole lot faster. And indeed, it is: >> >> microbenchmark(sum(x),sum(y)) >> >> # Unit: nanoseconds >> # expr min lq mean median uq max neval cld >> # sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519 100 b >> # sum(y) 183 245.5 446.09 338.5 447.0 3233 100 a >> >> # Now what about mean()? >> >> mean(x) >> # [1] 500000.5 >> mean(y) >> # [1] 500000.5 >> >> # which is the same as >> >> (1e6+1)/2 >> # [1] 500000.5 >> >> # But this surprised me: >> >> microbenchmark(mean(x),mean(y)) >> >> # Unit: microseconds >> # expr min lq mean median uq max neval cld >> # mean(x) 935.389 943.4795 1021.423 954.689 985.122 2065.974 100 a >> # mean(y) 3500.262 3581.9530 3814.664 3637.984 3734.598 5866.768 100 b >> >> ### end example ### >> >> So why is mean() on an ALTREP sequence slower when sum() is faster? >> >> And more generally, when using sum() on an ALTREP integer sequence, does R >actually use something like n*(n+1)/2 (or generalized to sequences a:b -- >(a+b)*(b-a+1)/2) for computing the sum? If so, why not (it seems) for mean()? > >The mean.default function looks like this: > >function (x, trim = 0, na.rm = FALSE, ...) >{ > if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) { > warning("argument is not numeric or logical: returning NA") > return(NA_real_) > } > if (na.rm) > x <- x[!is.na(x)] > if (!is.numeric(trim) || length(trim) != 1L) > stop("'trim' must be numeric of length one") > n <- length(x) > if (trim > 0 && n) { > if (is.complex(x)) > stop("trimmed means are not defined for complex data") > if (anyNA(x)) > return(NA_real_) > if (trim >= 0.5) > return(stats::median(x, na.rm = FALSE)) > lo <- floor(n * trim) + 1 > hi <- n + 1 - lo > x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi] > } > .Internal(mean(x)) >} > >So it does fixups for trimming and NA removal, then calls an internal >function. The internal function is the first part of do_summary, here: > >https://github.com/wch/r- >source/blob/f9c955fc6699a1f0482e4281ba658215c0e0b949/src/main/summary.c#L541-L556 > >It is using separate functions for the mean by type. The real_mean >function here: > >https://github.com/wch/r- >source/blob/f9c955fc6699a1f0482e4281ba658215c0e0b949/src/main/summary.c#L476-L515 > >makes a big effort to avoid overflows. > >So I suspect the reason mean.default doesn't use sum(x)/length(x) at the >end is that on a long vector sum(x) could overflow when mean(x) shouldn't. > >So why not take the ALTREP into account? I suspect it's just too much >trouble for a rare case. > >Duncan Murdoch
Hi Duncan, Thanks for pointing me to the appropriate places in the C code. And ok, makes sense. Best, Wolfgang ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel