On 3/3/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Quoting [EMAIL PROTECTED]:
> > In [.data.frame, if you replace this:
> >
> >     ...
> >     if (is.character(i)) {
> >         rows <- attr(xx, "row.names")
> >         i <- pmatch(i, rows, duplicates.ok = TRUE)
> >     }
> >     ...
> >
> > by this:
> >
> >     ...
> >     if (is.character(i)) {
> >         rows <- attr(xx, "row.names")
> >         if (typeof(rows) == "integer")
> >             i <- as.integer(i)
> >         else
> >             i <- pmatch(i, rows, duplicates.ok = TRUE)
> >     }
> >     ...
> >
> > then you get a huge boost:
> >
> > - with the current [.data.frame:
> >
> >     > system.time(for (i in 1:100) dat["1", ])
> >        user  system elapsed
> >      34.994   1.084  37.915
> >
> > - with the "patched" [.data.frame:
> >
> >     > system.time(for (i in 1:100) dat["1", ])
> >        user  system elapsed
> >       0.264   0.068   0.364
>
> Mmmh, replacing
>
>     i <- pmatch(i, rows, duplicates.ok = TRUE)
>
> by just
>
>     i <- as.integer(i)
>
> was a bit naive. It will be wrong if 'rows' is not a "seq_len" sequence.
>
> So I need to be more careful: first call 'match' to find the exact
> matches, then call 'pmatch' _only_ on those indices that don't have an
> exact match. For example, something like this:
>
>     if (is.character(i)) {
>         rows <- attr(xx, "row.names")
>         if (typeof(rows) == "integer") {
>             i2 <- match(as.integer(i), rows)
>             if (any(is.na(i2)))
>                 i2[is.na(i2)] <- pmatch(i[is.na(i2)], rows,
>                                         duplicates.ok = TRUE)
>             i <- i2
>         } else {
>             i <- pmatch(i, rows, duplicates.ok = TRUE)
>         }
>     }
>
> Correctness:
>
>     > dat2 <- data.frame(aa=c('a', 'b', 'c', 'd'), bb=1:4,
>     +                    row.names=c(11, 25, 1, 3))
>     > dat2
>        aa bb
>     11  a  1
>     25  b  2
>     1   c  3
>     3   d  4
>
>     > dat2["1", ]
>       aa bb
>     1  c  3
>
>     > dat2["3", ]
>       aa bb
>     3  d  4
>
>     > dat2["2", ]
>        aa bb
>     25  b  2
>
> Performance:
>
>     > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>     > dat <- as.data.frame(mat)
>     > system.time(for (i in 1:100) dat["1", ])
>        user  system elapsed
>       2.036   0.880   2.917
>
> Still 17 times faster than with the non-patched [.data.frame.
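[Editor's note: the exact-match-first trick quoted above can be tried in isolation, outside [.data.frame. The sketch below wraps it in a hypothetical helper, 'match_rows' (not part of R or of the proposed patch), and adds suppressWarnings() to silence the coercion warning that as.integer() raises for non-numeric strings; the quoted patch does not do that.]

```r
## Standalone sketch of the exact-match-first lookup: try exact integer
## matches with match(), then fall back to pmatch() only for the indices
## that had no exact match.
match_rows <- function(i, rows) {
    if (typeof(rows) == "integer") {
        ## exact integer matches first
        i2 <- match(suppressWarnings(as.integer(i)), rows)
        ## partial matching only where no exact match was found
        if (any(is.na(i2)))
            i2[is.na(i2)] <- pmatch(i[is.na(i2)], rows, duplicates.ok = TRUE)
        i2
    } else {
        pmatch(i, rows, duplicates.ok = TRUE)
    }
}

rows <- c(11L, 25L, 1L, 3L)          # same row names as the dat2 example
match_rows(c("1", "3", "2"), rows)   # 3 4 2: "1" and "3" exact, "2" partial
```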
>
> Maybe 'pmatch(x, table, ...)' itself could be improved to be more
> efficient when 'x' is a character vector and 'table' an integer vector,
> so that the above trick is not needed anymore.
>
> My point is that something can probably be done to improve the
> performance of 'dat[i, ]' when the row names are integer and 'i' is a
> character vector. I'm assuming that, in the typical use case, there is
> an exact match for 'i' in the row names, so converting those row names
> to a character vector in order to find this match is (most of the time)
> a waste of time.
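[Editor's note: the matching rules at issue can be seen directly with pmatch(), which coerces an integer 'table' to character before matching -- the very coercion whose cost is being discussed -- and prefers exact matches over partial (prefix) matches.]

```r
## pmatch() coerces the integer table to character ("11" "25" "1" "3"),
## then tries exact matches before unique partial (prefix) matches --
## the behaviour any fast path has to preserve.
pmatch("1", c(11L, 25L, 1L, 3L))   # 3: exact match to "1" beats the "11" prefix
pmatch("2", c(11L, 25L, 1L, 3L))   # 2: unique partial match against "25"
```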
But why bother? If you know the index of the row, why not index with a
numeric vector rather than a string? The behaviour in that case seems
obvious and fast.

Hadley

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
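[Editor's note: positional and character indexing only coincide for default row names, which is presumably why the character path still matters; a minimal illustration, not from the thread.]

```r
d1 <- data.frame(x = 1:4, y = letters[1:4])   # default row names "1".."4"
identical(d1[2, ], d1["2", ])                 # TRUE: position and name agree

d2 <- data.frame(x = 1:4, row.names = c(11, 25, 1, 3))
d2[1, "x"]     # 1: first row by position (the row named "11")
d2["1", "x"]   # 3: the row *named* "1" (third by position)
```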