Hi: >From what I can tell, Henrik efficiently finds the 50 largest values without the matrix indices and Peter efficiently finds the matrix indices without the corresponding values. Let's combine the two:
x <- rnorm(8e6) is.na(x) <- sample(8e6, 1e6) n <- 50 x1 <- sort(x, decreasing=TRUE)[1:n] # Find the indices that match the 50 largest values ix <- which(x %in% x1) # Map these to rows and columns assuming that x is shaped # into a 4000 * 2000 matrix columnwise: ro <- ix %% 4000 ro <- ifelse(ro == 0L, 4000, ro) # just in case we get row 4000 co <- 1 + ix %/% 4000 # Find the x values corresponding to the indices in ix, in their proper order x2 <- x[ix] d <- data.frame(row = ro, col = co, x = x2) d[rev(order(d$x)), ] # sort in decreasing order To make sure this works, set up xm <- matrix(x, nrow = 4000) In my test, > head(d) # before sorting row col x 1 280 17 4.375632 2 1346 179 4.506947 3 2661 214 4.438979 4 1870 308 4.447505 5 1951 344 4.497253 6 12 365 4.465841 > xm[280, 17] [1] 4.375632 > xm[1870, 308] [1] 4.447505 Looks OK. Total time to run the entire block of code: user system elapsed 3.38 0.19 3.57 Now let's try it with Henrik's second method: system.time({ x <- rnorm(8e6) is.na(x) <- sample(8e6, 1e6) n <- 50 x2 <- sort(-x, partial = n) x2h <- -sort(x2[1:n]) ix <- which(x %in% x2h) ro <- ix %% 4000 ro <- ifelse(ro == 0L, 4000, ro) co <- 1 + ix %/% 4000 x3 <- x[ix] d <- data.frame(row = ro, col = co, x = x3) d[rev(order(d$x)), ] }) user system elapsed 2.24 0.17 2.42 Not bad for 8M observations :) HTH, Dennis On Fri, Jun 18, 2010 at 4:39 AM, Henrik Bengtsson <h...@stat.berkeley.edu>wrote: > You might also want to consider _partial sorting_ by using the > 'partial' argument of sort(), especially when the number of data > points is really large. > > Since argument 'decreasing=FALSE' is not supported when using > 'partial', you have to flip it yourself by negating the values, e.g. > > x <- rnorm(8e6); > is.na(x) <- sample(length(x), size=1e6); > > n <- 50; > t1 <- system.time({ > x1 <- sort(x, decreasing=TRUE); > x1h <- x1[1:n]; > }); > > t2 <- system.time({ > x2 <- sort(-x, partial=n); > x2h <- -sort(x2[1:n]); > }); > > stopifnot(identical(x2h, x1h)); > print(t2/t1); > user system elapsed > 0.3076923 0.7777778 0.3491525 > > /Henrik > > On Fri, Jun 18, 2010 at 1:20 PM, Peter Ehlers <ehl...@ucalgary.ca> wrote: > > > > m <- matrix(round(rnorm(4000 * 2000), 4), nr = 4000) > > is.na(m) <- sample(8e6, 1e6) > > > > system.time( > > idx <- which( > > matrix(m %in% head(sort(m, TRUE), 50), > > nr = nrow(m)), arr.ind = TRUE)) > > > > # user system elapsed > > # 3.12 0.19 3.18 > > > > -Peter Ehlers > > > > > > On 2010-06-18 5:13, Dennis Murphy wrote: > >> > >> Hi: > >> > >> Here's a faked up example: > >> > >> a<- matrix(rnorm(4000*2000), 4000, 2000) > >> # Generate some NAs in the matrix > >> nr<- sample(50, 1:4000) > >> nc<- sample(50, 1:2000) > >> a[nr, nc]<- NA > >> > >> # convert to data frame: > >> b<- data.frame(row = rep(1:4000, 2000), col = rep(1:2000, each = 4000), > >> x = as.vector(a)) > >> # relatively time consuming...about 13.5 s on my machine > >> bb<- b[rev(order(b$x, na.last = FALSE)), ] > >>> > >>> bb[1:10, ] > >> > >> row col x > >> 691269 3269 173 5.103704 > >> 7815076 3076 1954 4.961544 > >> 4999621 3621 1250 4.953265 > >> 500469 469 126 4.937655 > >> 5878224 2224 1470 4.929150 > >> 4287270 3270 1072 4.913791 > >> 4442521 2521 1111 4.896869 > >> 4668867 867 1168 4.863504 > >> 5716575 575 1430 4.760778 > >> 3055274 3274 764 4.758995 > >> > >> HTH, > >> Dennis > >> > >> > >> On Thu, Jun 17, 2010 at 10:41 PM, > >> uschlecht<ulrich.schle...@stanford.edu>wrote: > >> > >>> > >>> Hi, > >>> > >>> I have a huge matrix (4000 * 2000 data points) and I would like to > >>> retrieve > >>> the coordinates (column and row) for the top 50 (or x) values. Some > >>> positions in the matrix have NA as a value. These should be discarded. > >>> > >>> My current method is to replace all NAs by 0, then rank all the values > >>> and > >>> then extract the positions with the 50 highest ranks. It is very > >>> time-consuming! > >>> > >>> Is there a simpler way to do this? > >>> > >>> Thank you, > >>> Ulrich > >>> > >>> -- > >>> View this message in context: > >>> > >>> > http://r.789695.n4.nabble.com/Find-the-50-highest-values-in-a-matrix-tp2259721p2259721.html > >>> Sent from the R help mailing list archive at Nabble.com. > >>> > >>> ______________________________________________ > >>> R-help@r-project.org mailing list > >>> https://stat.ethz.ch/mailman/listinfo/r-help > >>> PLEASE do read the posting guide > >>> http://www.R-project.org/posting-guide.html > >>> and provide commented, minimal, self-contained, reproducible code. > >>> > >> > >> [[alternative HTML version deleted]] > >> > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.