I'm observing that base::sum(x, na.rm=FALSE) for typeof(x) == "double" is much more time consuming when there are missing values versus when there are not. I'm observing this on both Window and Linux, but it's quite surprising to me. Currently, my main suspect is settings in on how R was built. The second suspect is my brain. I hope that someone can clarify the below results and confirm or not whether they see the same. Note, this is for "doubles", so I'm not expecting early-stopping as for "integers" (where testing for NA is cheap).
On R 3.2.0, on Windows (using the official CRAN builds), on Linux (local built), and on OS X (official AT&T builds), I get: > x <- rep(0, 1e8) > stopifnot(typeof(x) == "double") > system.time(sum(x, na.rm=FALSE)) user system elapsed 0.19 0.01 0.20 > y <- rep(NA_real_, 1e8) > stopifnot(typeof(y) == "double") > system.time(sum(y, na.rm=FALSE)) user system elapsed 9.54 0.00 9.55 > z <- x; z[length(z)/2] <- NA_real_ > stopifnot(typeof(z) == "double") > system.time(sum(z, na.rm=FALSE)) user system elapsed 4.49 0.00 4.51 Following the source code, I'm pretty sure the code (https://github.com/wch/r-source/blob/trunk/src/main/summary.c#L112-L128) performing the calculation is: static Rboolean rsum(double *x, R_xlen_t n, double *value, Rboolean narm) { LDOUBLE s = 0.0; Rboolean updated = FALSE; for (R_xlen_t i = 0; i < n; i++) { if (!narm || !ISNAN(x[i])) { if(!updated) updated = TRUE; s += x[i]; } } if(s > DBL_MAX) *value = R_PosInf; else if (s < -DBL_MAX) *value = R_NegInf; else *value = (double) s; return updated; } In other words, when na.rm=FALSE, that inner for loop: for (R_xlen_t i = 0; i < n; i++) { if (!narm || !ISNAN(x[i])) { if(!updated) updated = TRUE; s += x[i]; } } should effectively become (because !ISNAN(x[i]) "does not make a difference"): for (R_xlen_t i = 0; i < n; i++) { if (!narm) { if(!updated) updated = TRUE; s += x[i]; } } That is, sum(x, na.rm=FALSE) basically spends time on `s += x[i]`. Now, I have always been under impression that summing with NA:s is *not* more expensive that summing over regular (double) values, which is confirmed by the below example, but the above benchmarking disagree. It looks like there is a big overhead keeping track of the sum `s` being NA, which is supported by the fact that summing over 'z' is costs half of 'y'. Now, I *cannot* reproduce the above using the following 'inline' example: > sum2 <- inline::cfunction(sig=c(x="double", narm="logical"), body=' double *x_ = REAL(x); int narm_ = asLogical(narm); int n = length(x); double sum = 0; for (R_xlen_t i = 0; i < n; i++) { if (!narm_ || !ISNAN(x_[i])) sum += x_[i]; } return ScalarReal(sum); ') > x <- rep(0, 1e8) > stopifnot(typeof(x) == "double") > system.time(sum2(x, narm=FALSE)) user system elapsed 0.16 0.00 0.16 > y <- rep(NA_real_, 1e8) > stopifnot(typeof(y) == "double") > system.time(sum2(y, narm=FALSE)) user system elapsed 0.16 0.00 0.15 > z <- x; z[length(z)/2] <- NA_real_ > stopifnot(typeof(z) == "double") > system.time(sum2(z, narm=FALSE)) user system elapsed 0.16 0.00 0.15 This is why I suspect it's related to how R was configured when it was built. What's going on? Can someone please bring some light on this? Thanks Henrik ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel