Re: [R] Possible Improvement to sapply

Doran, Harold Tue, 13 Mar 2018 09:22:00 -0700

Quite possibly, and I’ll look into that. Aside from the work I was doing, 
however, I wonder whether sapply could avoid the overhead of calling 
identical() just to decide which conditional path to take.

From: William Dunlap [mailto:wdun...@tibco.com]
Sent: Tuesday, March 13, 2018 12:14 PM
To: Doran, Harold <hdo...@air.org>
Cc: Martin Morgan <martin.mor...@roswellpark.org>; r-help@r-project.org
Subject: Re: [R] Possible Improvement to sapply

Could your code use vapply instead of sapply?  vapply forces you to declare
the type and dimensions of FUN's output and stops if any call to FUN does not
match the declaration.  It can use much less memory and time than sapply
because it fills in the output array as it goes instead of calling lapply()
and seeing how it could be simplified.
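
A minimal sketch of that suggestion, assuming the toy list from the original
post (the real scoring code is not shown in this thread, so the data and the
use of length() and range() here are only illustrative):

```r
# Toy data mirroring the example in the original post (illustrative only).
myList <- list(a = rnorm(100), b = rnorm(100))

# FUN.VALUE = integer(1) declares that every call returns a single integer.
n <- vapply(myList, length, FUN.VALUE = integer(1))

# For this input the result matches what sapply() returns.
stopifnot(identical(n, sapply(myList, length)))

# A call whose output does not match the declaration stops with an error
# instead of silently changing the shape of the result.
res <- tryCatch(
  vapply(myList, range, FUN.VALUE = numeric(1)),  # range() returns length 2
  error = function(e) "error"
)
```

Here `res` ends up holding the string "error", because range() returns two
values per element while the declaration promised one.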

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Mar 13, 2018 at 7:06 AM, Doran, Harold <hdo...@air.org> wrote:
Martin

In terms of context of the actual problem, sapply is called millions of times 
because the work involves scoring individual students who took a test. A score 
for student A is generated and then student B and such and there are millions 
of students. The psychometric process of scoring students is complex and our 
code makes use of sapply many times for each student.

The toy example used length just to illustrate; our actual code doesn't do 
that. But your point is well taken: there may be a very good counterexample 
showing why my proposal doesn't achieve the goal in a generalizable way.

-----Original Message-----
From: Martin Morgan [mailto:martin.mor...@roswellpark.org]
Sent: Tuesday, March 13, 2018 9:43 AM
To: Doran, Harold <hdo...@air.org>; 'r-help@r-project.org' 
<r-help@r-project.org>
Subject: Re: [R] Possible Improvement to sapply

On 03/13/2018 09:23 AM, Doran, Harold wrote:
> While working with sapply, the documentation states that the simplify
> argument will yield a vector, matrix etc "when possible". I was
> curious how the code actually defines "when possible" and found this
> within the function:
>
> if (!identical(simplify, FALSE) && length(answer))
>
> This seems superfluous to me, in particular this part:
>
> !identical(simplify, FALSE)
>
> The preceding code could be reduced to
>
> if (simplify && length(answer))
>
> and it would not need to execute the call to identical in order to trigger 
> the conditional execution, which is known from the user's simplify = TRUE or 
> FALSE inputs. I *think* the extra call to identical is just unnecessary 
> overhead in this instance.
>
> Take for example, the following toy example code and benchmark results and a 
> small modification to sapply:
>
> myList <- list(a = rnorm(100), b = rnorm(100))
>
> answer <- lapply(X = myList, FUN = length)
> simplify = TRUE
>
> library(microbenchmark)
>
> mySapply <- function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE){
>      FUN <- match.fun(FUN)
>      answer <- lapply(X = X, FUN = FUN, ...)
>      if (USE.NAMES && is.character(X) && is.null(names(answer)))
>          names(answer) <- X
>      if (simplify && length(answer))
>          simplify2array(answer, higher = (simplify == "array"))
>      else answer
> }
>
>
>> microbenchmark(sapply(myList, length), times = 10000L)
> Unit: microseconds
>                     expr    min     lq     mean median     uq    max neval
>   sapply(myList, length) 14.156 15.572 16.67603 15.926 16.634 650.46 10000
>> microbenchmark(mySapply(myList, length), times = 10000L)
> Unit: microseconds
>                       expr    min     lq     mean median     uq      max neval
>   mySapply(myList, length) 13.095 14.864 16.02964 15.218 15.573 1671.804 10000
>
> My benchmark timings show a small improvement from that one change alone 
> (median 15.926 vs. 15.218 microseconds). The gain per call is nominal, but 
> in my actual work sapply is called millions of times, so the extra overhead 
> adds up to a noticeable amount of computing time.
>
> I have done some limited testing on various real data to verify that both 
> variants of sapply (base R and my modified version) yield identical objects 
> when simplify is either TRUE or FALSE.
>
> Perhaps someone else sees a counterexample where my proposed change causes 
> sapply not to behave as expected.
>

Check out ?sapply for possible values of `simplify=` to see why your proposal 
is not adequate.
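
To make the point concrete, here is a small sketch of one such counterexample.
It copies mySapply() from the post above; the input `x` and the anonymous
function are made up for illustration. ?sapply documents simplify = "array"
as a valid value, and a character string cannot be used as an operand of `&&`:

```r
# mySapply() as proposed in the original post.
mySapply <- function(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE) {
  FUN <- match.fun(FUN)
  answer <- lapply(X = X, FUN = FUN, ...)
  if (USE.NAMES && is.character(X) && is.null(names(answer)))
    names(answer) <- X
  if (simplify && length(answer))
    simplify2array(answer, higher = (simplify == "array"))
  else answer
}

# Illustrative input: each call returns a 1 x 3 matrix.
x <- list(a = 1:3, b = 4:6)
to_row <- function(v) matrix(v, nrow = 1)

# Base sapply() accepts simplify = "array" and returns a 3-d array.
base_result <- sapply(x, to_row, simplify = "array")

# The modified version fails: "array" && <integer> is not a valid
# logical expression, so the if() condition signals an error.
my_result <- tryCatch(
  mySapply(x, to_row, simplify = "array"),
  error = function(e) e
)
```

With simplify = TRUE or FALSE the two functions agree, which is presumably
why the limited testing described above did not expose the difference.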

For your example, lengths() is an order of magnitude faster than 
sapply(myList, length). This is an example of the advantages of vectorization 
(a single call to an R function implemented in C) versus iteration (`for` 
loops, but also the *apply family calling an R function many times).
vapply() might also be relevant.
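
A short sketch of the comparison, using the toy list from the post (timings
depend on the machine, so none are shown here; the microbenchmark call is
optional and only runs if that package is installed):

```r
# Toy data from the original post.
myList <- list(a = rnorm(100), b = rnorm(100))

# lengths() does the whole job in one vectorized call ...
n_vec <- lengths(myList)

# ... and returns the same named integer vector as the iterated version.
n_loop <- sapply(myList, length)
stopifnot(identical(n_vec, n_loop))

# Optional timing comparison, mirroring the benchmark style used above.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    lengths(myList),
    sapply(myList, length),
    times = 1000L
  ))
}
```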

Often performance improvements come from looking one layer up from where the 
problem occurs and re-thinking the algorithm. Why would one need to call 
sapply() millions of times, in a situation where this becomes rate-limiting? 
Can the algorithm be re-implemented to avoid this step?

Martin Morgan

> Harold
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



