[R] Done: Fast way of finding top-n values of a long vector

2009-06-05 Thread Allan Engelhardt
I'm all done now.  The max2 version below is what I went with in the 
end for my proposed change to caret::nearZeroVar (which used the sort 
method). Max Kuhn will make it available on CRAN soon.  It speeds up 
that routine by a factor 2-5 on my test cases and uses much less 
memory.  For what it is worth, I also made a C version (cmax below) 
which of course is faster yet again and scales nicely for returning the 
top n values of the array:


cmax - function (v) {max - vector(double,2); max - .C(test, 
as.double(v), as.integer(length(v)), max, NAOK=TRUE)[[3]]; 
return(max[1]/max[2]);}


library(rbenchmark)
set.seed(1); x - runif(1e7, max=1e8); x[1] - NA;
benchmark(
replications=20,
columns=c(test,elapsed),
order=elapsed
, sort = {a-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];}
, qsrt = {a-sort(x, decreasing=TRUE, na.last=NA, method=quick)[1:2]; 
a[1]/a[2];}

, part = {a-sort.int(-x, partial=1:2, na.last=NA)[1:2]; a[1]/a[2];}
, max1 = {m-max(x, na.rm=TRUE); w-which(x==m)[1]; 
m/max(x[-w],na.rm=TRUE);}

, max2 = {w-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);}
, cmax = {cmax(x);}
)
#   test elapsed
# 6 cmax   4.394
# 5 max2   8.954
# 4 max1  18.835
# 3 part  21.749
# 2 qsrt  46.692
# 1 sort  77.679

Thanks for all the suggestions and comments.

Allan.


PS: Slightly off-topic but is there a way within the syntax of R to set 
up things so that 'sort' (or any function) would know it is called in a 
partial list context in sort(x)[1:2] and it therefore could choose to 
use the partial argument automatically for small [] lists?  The R 
interpreter of course knows full well that it is going to drop all but 
the first two values of the result before it calls 'sort'.  Perl has 
'use Want' where howmany() and want(n) provides a subset of this 
functionality (essentially for [] lists of the form 1:n).


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Done: Fast way of finding top-n values of a long vector

2009-06-05 Thread Stavros Macrakis
On Fri, Jun 5, 2009 at 4:09 AM, Allan Engelhardt all...@cybaea.com wrote:

 I'm all done now.  The max2 version below is what I went with in the end
 for my proposed change to caret::nearZeroVar (which used the sort method).
 Max Kuhn will make it available on CRAN soon.  It speeds up that routine by
 a factor 2-5 on my test cases and uses much less memory.


You can save a little in max2 like this:

max2a = {w-which.max(x); x[w]/max(x[-w], na.rm=TRUE);}

If you don't need to handle NA's (or if you know a priori how many there
are), you can also speed up part:

  parta = {sel - length(x)+c(-1,0); a-sort.int(x, partial=sel,
na.last=NA)[2:1]; a[1]/a[2];}

which becomes about as fast as max2.

 library(rbenchmark)
set.seed(1); x - runif(1e7, max=1e8);
benchmark(
  replications=20,
  columns=c(test,elapsed),
  order=elapsed
, sort = {a-sort(x, decreasing=TRUE, na.last=NA)[1:2];
  a[1]/a[2];}
, qsrt = {a-sort(x, decreasing=TRUE, na.last=NA, method=quick)[1:2];
  a[1]/a[2];}
, part = {a-sort.int(-x, partial=1:2, na.last=NA)[1:2];
  a[1]/a[2];}
, parta = {end-length(x)+c(-1,0);
   a-sort.int(x, partial=end, na.last=FALSE)[end];
   a[1]/a[2]; }
, max1 = {m-max(x, na.rm=TRUE);
  w-which(x==m)[1];
  m/max(x[-w],na.rm=TRUE);}
, max2 = {w-which.max(x);
  max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);}
, max2a = {w-which.max(x);
  x[w]/max(x[-w], na.rm=TRUE);}
)

   test elapsed
7 max2a7.80
6  max28.94
4 parta9.05
3  part   10.72
5  max1   20.21
2  qsrt   49.33
1  sort   94.18


 For what it is worth, I also made a C version (cmax below) which of
 course is faster yet again and scales nicely for returning the top n values
 of the array:

 cmax - function (v) {max - vector(double,2); max - .C(test,
 as.double(v), as.integer(length(v)), max, NAOK=TRUE)[[3]];
 return(max[1]/max[2]);}

 library(rbenchmark)
 set.seed(1); x - runif(1e7, max=1e8); x[1] - NA;
 benchmark(
 replications=20,
 columns=c(test,elapsed),
 order=elapsed
 , sort = {a-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];}
 , qsrt = {a-sort(x, decreasing=TRUE, na.last=NA, method=quick)[1:2];
 a[1]/a[2];}
 , part = {a-sort.int(-x, partial=1:2, na.last=NA)[1:2]; a[1]/a[2];}
 , max1 = {m-max(x, na.rm=TRUE); w-which(x==m)[1];
 m/max(x[-w],na.rm=TRUE);}
 , max2 = {w-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);}
 , cmax = {cmax(x);}
 )
 #   test elapsed
 # 6 cmax   4.394
 # 5 max2   8.954
 # 4 max1  18.835
 # 3 part  21.749
 # 2 qsrt  46.692
 # 1 sort  77.679

 Thanks for all the suggestions and comments.

 Allan.


 PS: Slightly off-topic but is there a way within the syntax of R to set up
 things so that 'sort' (or any function) would know it is called in a partial
 list context in sort(x)[1:2] and it therefore could choose to use the
 partial argument automatically for small [] lists?  The R interpreter of
 course knows full well that it is going to drop all but the first two values
 of the result before it calls 'sort'.  Perl has 'use Want' where howmany()
 and want(n) provides a subset of this functionality (essentially for []
 lists of the form 1:n).

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.