Re: [R] Done: Fast way of finding top-n values of a long vector

Stavros Macrakis Fri, 05 Jun 2009 14:57:06 -0700

On Fri, Jun 5, 2009 at 4:09 AM, Allan Engelhardt <[email protected]> wrote:


> I'm all done now.  The "max2" version below is what I went with in the end
> for my proposed change to caret::nearZeroVar (which used the "sort" method).
> Max Kuhn will make it available on CRAN soon.  It speeds up that routine by
> a factor of 2-5 on my test cases and uses much less memory.


You can save a little in max2 like this:

max2a = {w<-which.max(x); x[w]/max(x[-w], na.rm=TRUE);}
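
The saving is that which.max already skips NA and NaN, so x[w] is the
maximum and a second max(x, na.rm=TRUE) pass is not needed. Here is a small
sketch to check this against the sort method; top2_ratio is just a name made
up here for illustration, not part of caret:

```r
# top2_ratio is a hypothetical wrapper; its body is the max2a expression.
top2_ratio <- function(x) {
  w <- which.max(x)                  # index of the first maximum, NA/NaN skipped
  x[w] / max(x[-w], na.rm = TRUE)    # largest over the largest of the rest
}

x <- c(3, NA, 10, 7, 4)
by_sort <- sort(x, decreasing = TRUE, na.last = NA)[1:2]

# Both methods should give the same ratio of largest to second largest.
stopifnot(isTRUE(all.equal(top2_ratio(x), by_sort[1] / by_sort[2])))
```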

If you don't need to handle NAs (or if you know a priori how many there
are), you can also speed up part by sorting x itself rather than -x and
reading the two largest values off the end:

  parta = {sel <- length(x)+c(-1,0);
           a<-sort.int(x, partial=sel, na.last=NA)[sel[2:1]];
           a[1]/a[2];}

which becomes about as fast as max2.
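
To see why indexing at the end works: a partial sort at positions n-1 and n
puts the second-largest and largest values exactly there, and guarantees
nothing about the order of the rest. A minimal check, assuming x has no NAs
(the variable names are only illustrative):

```r
set.seed(1)
x <- runif(1000)                      # no NAs, as this approach assumes

sel <- length(x) + c(-1, 0)           # positions n-1 and n
p <- sort.int(x, partial = sel)       # only positions n-1 and n are guaranteed in place
a <- p[rev(sel)]                      # largest first, then second largest

s <- sort(x, decreasing = TRUE)[1:2]  # reference answer from a full sort
stopifnot(isTRUE(all.equal(a, s)))

a[1] / a[2]                           # ratio of largest to second largest
```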

library("rbenchmark")
set.seed(1); x <- runif(1e7, max=1e8);
benchmark(
  replications=20,
  columns=c("test","elapsed"),
  order="elapsed"
, sort = {a<-sort(x, decreasing=TRUE, na.last=NA)[1:2];
          a[1]/a[2];}
, qsrt = {a<-sort(x, decreasing=TRUE, na.last=NA, method="quick")[1:2];
          a[1]/a[2];}
, part = {a<-sort.int(-x, partial=1:2, na.last=NA)[1:2];
          a[1]/a[2];}
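, parta = {end<-length(x)+c(-1,0);
           a<-sort.int(x, partial=end, na.last=FALSE)[end];
           # a[1] is the second largest here, so this ratio is the
           # reciprocal of the others; the timing is unaffected
           a[1]/a[2]; }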
, max1 = {m<-max(x, na.rm=TRUE);
          w<-which(x==m)[1];
          m/max(x[-w],na.rm=TRUE);}
, max2 = {w<-which.max(x);
          max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);}
, max2a = {w<-which.max(x);
          x[w]/max(x[-w], na.rm=TRUE);}
)

   test elapsed
7 max2a    7.80
6  max2    8.94
4 parta    9.05
3  part   10.72
5  max1   20.21
2  qsrt   49.33
1  sort   94.18
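
The same partial-sort idea extends to the top n values from the subject
line. What follows is only a sketch under stated assumptions: no NAs in x,
and n small relative to length(x). top_n is a hypothetical name, not a
function from base R or caret.

```r
# Hypothetical helper: the n largest values of x, in decreasing order.
top_n <- function(x, n) {
  len <- length(x)
  stopifnot(n >= 1, n <= len, !anyNA(x))
  k <- len - n + 1
  # Partial sort at position k: everything after k is >= the value at k,
  # so the tail holds the n largest values in some order.
  p <- sort.int(x, partial = k)
  # Fully sorting only the short tail puts them in decreasing order.
  sort(p[k:len], decreasing = TRUE)
}

set.seed(1)
x <- runif(1e5)
stopifnot(isTRUE(all.equal(top_n(x, 5), sort(x, decreasing = TRUE)[1:5])))
```

With n = 2 this gives the two values whose ratio the benchmark computes.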


> For what it is worth, I also made a C version ("cmax" below) which of
> course is faster yet again and scales nicely for returning the top n values
> of the array:
>
> cmax <- function (v) {max <- vector("double",2); max <- .C("test",
> as.double(v), as.integer(length(v)), max, NAOK=TRUE)[[3]];
> return(max[1]/max[2]);}
>
> library("rbenchmark")
> set.seed(1); x <- runif(1e7, max=1e8); x[1] <- NA;
> benchmark(
> replications=20,
> columns=c("test","elapsed"),
> order="elapsed"
> , sort = {a<-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];}
> , qsrt = {a<-sort(x, decreasing=TRUE, na.last=NA, method="quick")[1:2];
> a[1]/a[2];}
> , part = {a<-sort.int(-x, partial=1:2, na.last=NA)[1:2]; a[1]/a[2];}
> , max1 = {m<-max(x, na.rm=TRUE); w<-which(x==m)[1];
> m/max(x[-w],na.rm=TRUE);}
> , max2 = {w<-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);}
> , cmax = {cmax(x);}
> )
> #   test elapsed
> # 6 cmax   4.394
> # 5 max2   8.954
> # 4 max1  18.835
> # 3 part  21.749
> # 2 qsrt  46.692
> # 1 sort  77.679
>
> Thanks for all the suggestions and comments.
>
> Allan.
>
>
> PS: Slightly off-topic but is there a way within the syntax of R to set up
> things so that 'sort' (or any function) would know it is called in a partial
> list context in sort(x)[1:2] and it therefore could choose to use the
> "partial" argument automatically for small [] lists?  The R interpreter of
> course knows full well that it is going to drop all but the first two values
> of the result before it calls 'sort'.  Perl has 'use Want' where howmany()
> and want(n) provide a subset of this functionality (essentially for []
> lists of the form 1:n).
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>