Quoting hadley wickham <[EMAIL PROTECTED]>:

> On 3/3/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > Quoting [EMAIL PROTECTED]:
> > > In [.data.frame if you replace this:
> > >
> > >   ...
> > >   if (is.character(i)) {
> > >       rows <- attr(xx, "row.names")
> > >       i <- pmatch(i, rows, duplicates.ok = TRUE)
> > >   }
> > >   ...
> > >
> > > by this
> > >
> > >   ...
> > >   if (is.character(i)) {
> > >       rows <- attr(xx, "row.names")
> > >       if (typeof(rows) == "integer")
> > >           i <- as.integer(i)
> > >       else
> > >           i <- pmatch(i, rows, duplicates.ok = TRUE)
> > >   }
> > >   ...
> > >
> > > then you get a huge boost:
> > >
> > > - with current [.data.frame
> > >
> > >   > system.time(for (i in 1:100) dat["1", ])
> > >      user  system elapsed
> > >    34.994   1.084  37.915
> > >
> > > - with "patched" [.data.frame
> > >
> > >   > system.time(for (i in 1:100) dat["1", ])
> > >      user  system elapsed
> > >     0.264   0.068   0.364
> >
> > mmmh, replacing
> >
> >     i <- pmatch(i, rows, duplicates.ok = TRUE)
> >
> > by just
> >
> >     i <- as.integer(i)
> >
> > was a bit naive. It will be wrong if rows is not a "seq_len" sequence.
> >
> > So I need to be more careful by first calling 'match' to find the exact
> > matches and then calling 'pmatch' _only_ on those indices that don't have
> > an exact match.
> > For example by doing something like this:
> >
> >     if (is.character(i)) {
> >         rows <- attr(xx, "row.names")
> >         if (typeof(rows) == "integer") {
> >             i2 <- match(as.integer(i), rows)
> >             if (any(is.na(i2)))
> >                 i2[is.na(i2)] <- pmatch(i[is.na(i2)], rows,
> >                                         duplicates.ok = TRUE)
> >             i <- i2
> >         } else {
> >             i <- pmatch(i, rows, duplicates.ok = TRUE)
> >         }
> >     }
> >
> > Correctness:
> >
> >   > dat2 <- data.frame(aa=c('a', 'b', 'c', 'd'), bb=1:4,
> >                        row.names=c(11,25,1,3))
> >   > dat2
> >      aa bb
> >   11  a  1
> >   25  b  2
> >   1   c  3
> >   3   d  4
> >
> >   > dat2["1",]
> >     aa bb
> >   1  c  3
> >
> >   > dat2["3",]
> >     aa bb
> >   3  d  4
> >
> >   > dat2["2",]
> >      aa bb
> >   25  b  2
> >
> > Performance:
> >
> >   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
> >   > dat <- as.data.frame(mat)
> >   > system.time(for (i in 1:100) dat["1", ])
> >      user  system elapsed
> >     2.036   0.880   2.917
> >
> > Still 17 times faster than with the non-patched [.data.frame.
> >
> > Maybe 'pmatch(x, table, ...)' itself could be improved to be
> > more efficient when 'x' is a character vector and 'table' an
> > integer vector, so the above trick is not needed anymore.
> >
> > My point is that something can probably be done to improve the
> > performance of 'dat[i, ]' when the row names are integer and 'i'
> > a character vector. I'm assuming that, in the typical use case,
> > there is an exact match for 'i' in the row names, so converting
> > those row names to a character vector in order to find this match
> > is (most of the time) a waste of time.
>
> But why bother? If you know the index of the row, why not index with
> a numeric vector rather than a string? The behaviour in that case
> seems obvious and fast.
Because if I want to access a given row by its key (row name), then I _must_
use a string:

> dat <- data.frame(aa=letters[1:6], bb=1:6,
                    row.names=as.integer(c(51, 52, 11, 25, 1, 3)))
> dat
   aa bb
51  a  1
52  b  2
11  c  3
25  d  4
1   e  5
3   f  6

If my key is "1":

> dat["1", ]
  aa bb
1  e  5

OK. But I can't use a numeric index:

> dat[1, ]
   aa bb
51  a  1

Not what I want!

With a big data frame (e.g. 10**6 rows), every time I do 'dat["1", ]' I'm
charged the price of coercing the 10**6-element integer row.names vector to
a character vector. A very high (and unreasonable) price that could easily
be avoided.

You could argue that I can still work around this by extracting
'attr(dat, "row.names")' myself, checking its mode, and then, if its mode is
integer, using 'match' to find the position (i2) of my key in the row names,
and finally calling 'dat[i2, ]'. But is it unreasonable to expect
[.data.frame to do that for me?

Cheers,
H.

> Hadley

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
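For reference, the workaround described in the thread (match the key against the integer row.names directly, falling back to pmatch only for keys without an exact match) can be sketched as a standalone helper. This is just an illustration of the proposed logic outside of [.data.frame; `fast_extract` is a hypothetical name, not an existing R function:

```r
## Hypothetical helper sketching the workaround discussed above:
## when row.names are stored as an integer vector, look the key up
## with match() instead of letting [.data.frame coerce all row names
## to character and run pmatch() over them.
fast_extract <- function(dat, key) {
  rows <- attr(dat, "row.names")
  if (typeof(rows) == "integer") {
    i2 <- match(as.integer(key), rows)   # exact matches, no coercion of 'rows'
    if (any(is.na(i2)))                  # fall back to partial matching
      i2[is.na(i2)] <- pmatch(key[is.na(i2)], rows, duplicates.ok = TRUE)
  } else {
    i2 <- pmatch(key, rows, duplicates.ok = TRUE)  # current [.data.frame behaviour
  }
  dat[i2, , drop = FALSE]
}

dat <- data.frame(aa = letters[1:6], bb = 1:6,
                  row.names = as.integer(c(51, 52, 11, 25, 1, 3)))
fast_extract(dat, "1")   # the row named 1 (fifth row), same as dat["1", ]
```

If the row.names are not integer, the helper simply does what [.data.frame does today, so results agree in both cases; only the integer branch avoids the large integer-to-character coercion.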