Quoting [EMAIL PROTECTED]:
> In [.data.frame if you replace this:
>
>     ...
>     if (is.character(i)) {
>         rows <- attr(xx, "row.names")
>         i <- pmatch(i, rows, duplicates.ok = TRUE)
>     }
>     ...
>
> by this
>
>     ...
>     if (is.character(i)) {
>         rows <- attr(xx, "row.names")
>         if (typeof(rows) == "integer")
>             i <- as.integer(i)
>         else
>             i <- pmatch(i, rows, duplicates.ok = TRUE)
>     }
>     ...
>
> then you get a huge boost:
>
>   - with current [.data.frame
>     > system.time(for (i in 1:100) dat["1", ])
>        user  system elapsed
>      34.994   1.084  37.915
>
>   - with "patched" [.data.frame
>     > system.time(for (i in 1:100) dat["1", ])
>        user  system elapsed
>       0.264   0.068   0.364
>

Hmm, replacing

    i <- pmatch(i, rows, duplicates.ok = TRUE)

by just

    i <- as.integer(i)

was a bit naive. It gives wrong results if rows is not a "seq_len"
sequence (i.e. not 1, 2, ..., n). So I need to be more careful: first
call 'match' to find the exact matches, then call 'pmatch' _only_ on
those indices that don't have an exact match. For example, something
like this:

    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        if (typeof(rows) == "integer") {
            i2 <- match(as.integer(i), rows)
            if (any(is.na(i2)))
                i2[is.na(i2)] <- pmatch(i[is.na(i2)], rows,
                                        duplicates.ok = TRUE)
            i <- i2
        } else {
            i <- pmatch(i, rows, duplicates.ok = TRUE)
        }
    }

Correctness:

  > dat2 <- data.frame(aa=c('a', 'b', 'c', 'd'), bb=1:4,
  +                    row.names=c(11,25,1,3))
  > dat2
     aa bb
  11  a  1
  25  b  2
  1   c  3
  3   d  4
  > dat2["1",]
    aa bb
  1  c  3
  > dat2["3",]
    aa bb
  3  d  4
  > dat2["2",]
     aa bb
  25  b  2

Performance:

  > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
  > dat <- as.data.frame(mat)
  > system.time(for (i in 1:100) dat["1", ])
     user  system elapsed
    2.036   0.880   2.917

That is still about 17 times faster (in user time) than with the
non-patched [.data.frame.

Maybe 'pmatch(x, table, ...)' itself could be made more efficient when
'x' is a character vector and 'table' an integer vector, so that the
above trick is not needed anymore.

My point is that something can probably be done to improve the
performance of 'dat[i, ]' when the row names are integer and 'i' is a
character vector. I'm assuming that, in the typical use case, there is
an exact match for 'i' in the row names, so converting those row names
to a character vector in order to find this match is (most of the time)
a waste of time.

Cheers,
H.

> but maybe I'm missing something...
>
> Cheers,
> H.
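
The exact-match-first lookup proposed in the reply above can also be tried outside of [.data.frame. What follows is only a minimal sketch under stated assumptions: lookup_rows() is a hypothetical helper name (not part of base R, and not the actual patch), and the suppressWarnings() call is an addition that is not in the code posted above. It silences the coercion warning when 'i' contains non-numeric strings.

```r
# Hypothetical helper (not in base R): resolve character indices 'i'
# against row names 'rows' without coercing integer row names to character
# unless a partial match is actually needed.
lookup_rows <- function(i, rows) {
    stopifnot(is.character(i))
    if (typeof(rows) != "integer")
        return(pmatch(i, rows, duplicates.ok = TRUE))
    # Step 1: exact matches, compared as integers (cheap).
    # suppressWarnings() is an assumption added here, see the lead-in.
    idx <- match(suppressWarnings(as.integer(i)), rows)
    # Step 2: partial matching only for the indices with no exact match.
    miss <- is.na(idx)
    if (any(miss))
        idx[miss] <- pmatch(i[miss], rows, duplicates.ok = TRUE)
    idx
}

# Same row names as dat2 above: "1" and "3" match exactly,
# "2" falls back to a partial match on 25.
lookup_rows(c("1", "3", "2"), c(11L, 25L, 1L, 3L))
# [1] 3 4 2
```

One caveat with this approach: as.integer() also accepts strings such as "1.0", so the integer comparison is slightly more permissive than an exact string comparison against the row names would be.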
>
> >
> > If you assign character row names, things will be a bit faster:
> >
> >   # before
> >   system.time(for (i in 1:25) dat["2", ])
> >      user  system elapsed
> >     9.337   0.404  10.731
> >
> >   # this looks funny, but has the desired result
> >   rownames(dat) <- rownames(dat)
> >   typeof(attr(dat, "row.names"))
> >
> >   # after
> >   system.time(for (i in 1:25) dat["2", ])
> >      user  system elapsed
> >     0.343   0.226   0.608
> >
> > And you probably would have seen this if you had looked at the
> > profiling data:
> >
> >   Rprof()
> >   for (i in 1:25) dat["2", ]
> >   Rprof(NULL)
> >   summaryRprof()
> >
> >
> > + seth
> >

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel