On Wed, 28 Sep 2016, "Doran, Harold" <hdo...@air.org> writes:
> I have an extremely large data frame (~13 million rows) that resembles
> the structure of the object tmp below in the reproducible code. In my
> real data, the variable, 'id' may or may not be ordered, but I think
> that is irrelevant.
>
> I have a process that requires subsetting the data by id and then
> running each smaller data frame through a set of functions. One
> example below uses indexing and the other uses an explicit call to
> subset(), both return the same result, but indexing is faster.
>
> Problem is in my real data, indexing must parse through millions of
> rows to evaluate the condition and this is expensive and a bottleneck
> in my code. I'm curious if anyone can recommend an improvement that
> would somehow be less expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>

If you really need only one column, it will be faster to
extract that column and then to take a subset of it:

  system.time(replicate(500, tmp[[2L]][tmp$id == idList[1L]]))

(A data.frame is a list of atomic vectors, and it is
typically faster to first extract the component of interest,
i.e. the specific column, and then to subset this vector.
The result will, of course, be a vector, not a data.frame.)

-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
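
If every id has to be processed in turn, the column-first idea from the reply can be taken one step further: partition the column once with split() and then look up each group by name, so the condition tmp$id == ... is not re-evaluated over all rows for every id. This is only a sketch under assumptions, not something stated in the thread: the names fooById, x1 and x2 are made up for illustration, set.seed() is there only to make the example reproducible, and whether this helps on the real 13-million-row data would need to be timed.

```r
set.seed(1)  # assumption: fixed seed, only so the example is reproducible
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
idList <- unique(tmp$id)

# Approach from the reply: extract the column, then subset the vector.
# The comparison tmp$id == ... scans every row on each call.
x1 <- tmp[[2L]][tmp$id == idList[1L]]

# Sketch: partition the column by id once (a single pass over the data),
# giving a named list with one numeric vector per id.
fooById <- split(tmp[[2L]], tmp$id)

# Each later lookup is then a list access by name; split() names the
# elements after the ids, so the id is converted to character.
x2 <- fooById[[as.character(idList[1L])]]

# Both routes should return the same values for the first id.
print(identical(x1, x2))
```

As in the reply, the result for each id is a plain vector, not a data.frame. If whole rows are needed, split(tmp, tmp$id) partitions the data frame itself in the same way, at a higher cost per group.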