While profiling some C code, I rolled my own nchar function which appears to be much faster than base R's (25 times faster for a 10M length vector). Obviously base::nchar provides significantly more features than my barebones function (C snippet below); however, for argument type = "bytes" it seems that the R_nchar and do_nchar functions do not actually do anything more than this function.
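To make the bytes case concrete, here is a minimal standalone sketch in plain C, with no R headers. It illustrates why I would expect type = "bytes" to be the cheap path: the byte length is just the stored length, whereas a character count has to inspect the encoding. The helper names `byte_len` and `utf8_chars` are hypothetical and are not part of R's API, and the character counter assumes valid UTF-8 input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length in bytes, the analogue of type = "bytes". */
static size_t byte_len(const char *s) {
    return strlen(s);
}

/* Hypothetical helper: number of UTF-8 code points, roughly the analogue
   of type = "chars". Assumes the input is valid UTF-8. */
static size_t utf8_chars(const char *s) {
    size_t n = 0;
    for (; *s; ++s) {
        /* Continuation bytes have the form 10xxxxxx; count all other bytes. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the ASCII string "12345" both helpers give 5; for "caf\xc3\xa9" ("café" encoded as UTF-8) `byte_len` gives 5 and `utf8_chars` gives 4.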
My suspicion is that I have overlooked some subtlety in the base R code, or that my benchmarks are not representative. Alternatively, preparing the potential error message in `do_nchar` before it is passed to `R_nchar` may be quite costly. Or the compiler may be unable to unswitch the loop from the more complex width and chars cases. If I haven't missed something, would a patch be warranted?

    #include <Rinternals.h>

    SEXP Cnchar(SEXP x) {
      R_xlen_t N = xlength(x);
      SEXP ans = PROTECT(allocVector(INTSXP, N));
      int * restrict ansp = INTEGER(ans);
      // Ignoring NA to avoid the branch has a very small
      // impact on performance.
      for (R_xlen_t i = 0; i < N; ++i) {
        SEXP sxi = STRING_ELT(x, i);
        if (sxi == NA_STRING) {
          ansp[i] = NA_INTEGER;
          continue;
        }
        // For a CHARSXP, length() is the length in bytes.
        ansp[i] = length(sxi);
      }
      UNPROTECT(1);
      return ans;
    }

    x <- rep_len(c(as.character(c(5L, 1:1e6)), NA_character_, 1e6:15e5), 1e7)

    Cnchar(x)                   #   90 ms
    nchar(x, type = "bytes")    # 2500 ms

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel