I have an extremely large data frame (~13 million rows) that resembles the 
structure of the object tmp in the reproducible code below. In my real data, 
the variable 'id' may or may not be ordered, but I believe that is irrelevant.

I have a process that requires subsetting the data by id and then running each 
smaller data frame through a set of functions. Of the two examples below, one 
uses indexing and the other an explicit call to subset(). Both return the same 
result, but indexing is faster.

The problem is that in my real data, indexing must still scan millions of rows 
to evaluate the condition, and this is expensive and a bottleneck in my code. 
Can anyone recommend an approach that would be less expensive and faster?

Thank you
Harold


tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

idList <- unique(tmp$id)

### Fast, but not fast enough
system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))

### Not fast at all, a big bottleneck
system.time(replicate(500, subset(tmp, id == idList[1])))
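For comparison, one way to avoid rescanning all rows on every lookup is to 
split the data frame by id once up front and then pull each piece out of the 
resulting list by name, which is a constant-time lookup. This is only a sketch 
using the tmp/idList objects defined above; whether it helps depends on how 
often each id is revisited.

```r
## Build the example data as above
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
idList <- unique(tmp$id)

## Pay the scan cost once: split() makes one pass over the data and
## returns a named list with one data frame per id
byId <- split(tmp, tmp$id)

## Each subsequent lookup is a list access by name, no row scan needed
system.time(replicate(500, byId[[as.character(idList[1])]]))
```

The list element byId[["1"]] contains the same rows as tmp[tmp$id == 1, ], so 
the downstream functions can be applied to the list elements directly (e.g. 
with lapply) instead of re-subsetting inside a loop.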

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
