Hi:

>From what I can tell, Henrik efficiently finds the 50 largest values without
the matrix
indices and Peter efficiently finds the matrix indices without the
corresponding values.
Let's combine the two:

x <- rnorm(8e6)
is.na(x) <- sample(8e6, 1e6)
n <- 50
x1 <- sort(x, decreasing=TRUE)[1:n]
# Find the indices that match the 50 largest values
ix <- which(x %in% x1)
# Map these to rows and columns assuming that x is shaped
# into a 4000 * 2000 matrix columnwise:
ro <- ix %% 4000
ro <- ifelse(ro == 0L, 4000, ro)  # just in case we get row 4000
co <- 1 + ix %/% 4000
# Find the x values corresponding to the indices in ix, in their proper
order
x2 <- x[ix]
d <- data.frame(row = ro, col = co, x = x2)
d[rev(order(d$x)), ]    # sort in decreasing order

To make sure this works, set up
xm <- matrix(x, nrow = 4000)

In my test,
> head(d)     # before sorting
   row col        x
1  280  17 4.375632
2 1346 179 4.506947
3 2661 214 4.438979
4 1870 308 4.447505
5 1951 344 4.497253
6   12 365 4.465841
> xm[280, 17]
[1] 4.375632
> xm[1870, 308]
[1] 4.447505

Looks OK. Total time to run the entire block of code:
   user  system elapsed
   3.38    0.19    3.57

Now let's try it with Henrik's second method:

system.time({
x <- rnorm(8e6)
is.na(x) <- sample(8e6, 1e6)
n <- 50
x2 <-  sort(-x, partial = n)
x2h <-  -sort(x2[1:n])
ix <- which(x %in% x2h)
ro <- ix %% 4000
ro <- ifelse(ro == 0L, 4000, ro)
co <- 1 + ix %/% 4000
x3 <- x[ix]
d <- data.frame(row = ro, col = co, x = x3)
d[rev(order(d$x)), ]
})
   user  system elapsed
   2.24    0.17    2.42

Not bad for 8M observations :)

HTH,
Dennis

On Fri, Jun 18, 2010 at 4:39 AM, Henrik Bengtsson <h...@stat.berkeley.edu>wrote:

> You might also want to consider _partial sorting_ by using the
> 'partial' argument of sort(), especially when the number of data
> points is really large.
>
> Since argument 'decreasing=FALSE' is not supported when using
> 'partial', you have to flip it yourself by negating the values, e.g.
>
> x <- rnorm(8e6);
> is.na(x) <- sample(length(x), size=1e6);
>
> n <- 50;
> t1 <- system.time({
>  x1 <- sort(x, decreasing=TRUE);
>  x1h <- x1[1:n];
> });
>
> t2 <- system.time({
>  x2 <- sort(-x, partial=n);
>  x2h <- -sort(x2[1:n]);
> });
>
> stopifnot(identical(x2h, x1h));
> print(t2/t1);
>     user    system   elapsed
> 0.3076923 0.7777778 0.3491525
>
> /Henrik
>
> On Fri, Jun 18, 2010 at 1:20 PM, Peter Ehlers <ehl...@ucalgary.ca> wrote:
> >
> > m <- matrix(round(rnorm(4000 * 2000), 4), nr = 4000)
> > is.na(m) <- sample(8e6, 1e6)
> >
> > system.time(
> >  idx <- which(
> >    matrix(m %in% head(sort(m, TRUE), 50),
> >           nr = nrow(m)), arr.ind = TRUE))
> >
> > #   user  system elapsed
> > #   3.12    0.19    3.18
> >
> >  -Peter Ehlers
> >
> >
> > On 2010-06-18 5:13, Dennis Murphy wrote:
> >>
> >> Hi:
> >>
> >> Here's a faked up example:
> >>
> >> a<- matrix(rnorm(4000*2000), 4000, 2000)
> >> # Generate some NAs in the matrix
> >> nr<- sample(50, 1:4000)
> >> nc<- sample(50, 1:2000)
> >> a[nr, nc]<- NA
> >>
> >> # convert to data frame:
> >> b<- data.frame(row = rep(1:4000, 2000), col = rep(1:2000, each = 4000),
> >>                           x = as.vector(a))
> >> # relatively time consuming...about 13.5 s on my machine
> >> bb<- b[rev(order(b$x, na.last = FALSE)), ]
> >>>
> >>> bb[1:10, ]
> >>
> >>          row  col        x
> >> 691269  3269  173 5.103704
> >> 7815076 3076 1954 4.961544
> >> 4999621 3621 1250 4.953265
> >> 500469   469  126 4.937655
> >> 5878224 2224 1470 4.929150
> >> 4287270 3270 1072 4.913791
> >> 4442521 2521 1111 4.896869
> >> 4668867  867 1168 4.863504
> >> 5716575  575 1430 4.760778
> >> 3055274 3274  764 4.758995
> >>
> >> HTH,
> >> Dennis
> >>
> >>
> >> On Thu, Jun 17, 2010 at 10:41 PM,
> >> uschlecht<ulrich.schle...@stanford.edu>wrote:
> >>
> >>>
> >>> Hi,
> >>>
> >>> I have a huge matrix (4000 * 2000 data points) and I would like to
> >>> retrieve
> >>> the coordinates (column and row) for the top 50 (or x) values. Some
> >>> positions in the matrix have NA as a value. These should be discarded.
> >>>
> >>> My current method is to replace all NAs by 0, then rank all the values
> >>> and
> >>> then extract the positions with the 50 highest ranks. It is very
> >>> time-consuming!
> >>>
> >>> Is there a simpler way to do this?
> >>>
> >>> Thank you,
> >>> Ulrich
> >>>
> >>> --
> >>> View this message in context:
> >>>
> >>>
> http://r.789695.n4.nabble.com/Find-the-50-highest-values-in-a-matrix-tp2259721p2259721.html
> >>> Sent from the R help mailing list archive at Nabble.com.
> >>>
> >>> ______________________________________________
> >>> R-help@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>
> >>        [[alternative HTML version deleted]]
> >>
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to