On 03/13/2018 09:23 AM, Doran, Harold wrote:
While working with sapply, I noticed the documentation states that the simplify argument will yield a vector,
matrix, etc. "when possible". I was curious how the code actually defines "when
possible" and found this within the function:
if (!identical(simplify, FALSE) && length(answer))
This seems superfluous to me, in particular this part:
!identical(simplify, FALSE)
The preceding code could be reduced to
if (simplify && length(answer))
and it would not need to execute the call to identical() to decide whether to
simplify, since that is already known from the user's simplify = TRUE or FALSE
input. I *think* the extra call to identical() is just unnecessary overhead in
this instance.
Take, for example, the following toy code, benchmark results, and a
small modification to sapply:
library(microbenchmark)
myList <- list(a = rnorm(100), b = rnorm(100))
mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
    FUN <- match.fun(FUN)
    answer <- lapply(X = X, FUN = FUN, ...)
    if (USE.NAMES && is.character(X) && is.null(names(answer)))
        names(answer) <- X
    if (simplify && length(answer))
        simplify2array(answer, higher = (simplify == "array"))
    else answer
}
microbenchmark(sapply(myList, length), times = 10000L)
Unit: microseconds
                   expr    min     lq     mean median     uq    max neval
 sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46 10000

microbenchmark(mySapply(myList, length), times = 10000L)
Unit: microseconds
                     expr    min     lq     mean median     uq      max neval
 mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804 10000
My benchmark timings show an improvement from only that small change. It is
seemingly nominal, but in my actual work the sapply function is called
millions of times, and this additional overhead propagates into some overall
additional computing time.
I have done some limited testing on various real data to verify that the
objects produced under both variants of sapply (base R and my modified version)
are identical when simplify is either TRUE or FALSE.
Perhaps someone else sees a counterexample where my proposed fix causes
sapply not to behave as expected.
Check out ?sapply for the possible values of `simplify=` (it can also be the
string "array", not just TRUE or FALSE) to see why your proposal is not
adequate.
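A minimal sketch of the failure mode (the exact error message may vary by R
version):

```r
x <- list(a = matrix(1:4, 2), b = matrix(5:8, 2))

# ?sapply documents simplify = "array": a list of equal-dimension
# matrices is then simplified to a higher-dimensional array.
dim(sapply(x, identity, simplify = "array"))
#> [1] 2 2 2

# The proposed `if (simplify && length(answer))` cannot handle that
# value, because `&&` rejects a character operand outright:
tryCatch("array" && TRUE, error = function(e) conditionMessage(e))
```

Note that `!identical(simplify, FALSE)` is TRUE both for simplify = TRUE and
for simplify = "array", which is exactly what the documented interface
requires.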
For your example, lengths() is an order of magnitude faster than
sapply(., length). This is an example of the advantage of vectorization
(a single call to an R function implemented in C) versus iteration (`for`
loops, but also the *apply family calling an R function many times).
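With the toy list from your post, a quick sketch of the equivalence (timings
omitted; run microbenchmark() locally to compare):

```r
myList <- list(a = rnorm(100), b = rnorm(100))

# lengths() is a single call to a primitive implemented in C, so it
# avoids one R-level call to FUN per list element:
lengths(myList)

# It returns the same named integer vector as the iterated version:
stopifnot(identical(lengths(myList), sapply(myList, length)))
```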
vapply() might also be relevant.
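A sketch of the vapply() version: FUN.VALUE declares the type and length of
each result, so vapply() fills a pre-typed vector instead of inferring how to
simplify afterwards:

```r
myList <- list(a = rnorm(100), b = rnorm(100))

# Each call to length() must return a single integer, or vapply() errors:
vapply(myList, length, FUN.VALUE = integer(1))
```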
Often performance improvements come from looking one layer up from where
the problem occurs and re-thinking the algorithm. Why would one need to
call sapply() millions of times, in a situation where this becomes
rate-limiting? Can the algorithm be re-implemented to avoid this step?
Martin Morgan
Harold
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.