On Wed, 28 Sep 2016, "Doran, Harold" <hdo...@air.org> writes:

> I have an extremely large data frame (~13 million rows) that resembles
> the structure of the object tmp below in the reproducible code. In my
> real data, the variable, 'id' may or may not be ordered, but I think
> that is irrelevant.
>
> I have a process that requires subsetting the data by id and then
> running each smaller data frame through a set of functions. One
> example below uses indexing and the other uses an explicit call to
> subset(), both return the same result, but indexing is faster.
>
> Problem is in my real data, indexing must parse through millions of
> rows to evaluate the condition and this is expensive and a bottleneck
> in my code.  I'm curious if anyone can recommend an improvement that
> would somehow be less expensive and faster?
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))
>

If you really need only one column, it will be faster
to extract that column and then to take a subset of it:

  system.time(replicate(500, tmp[[2L]][tmp$id == idList[1L]]))

(A data.frame is a list of columns, usually atomic
 vectors, so it is typically faster to first extract
 the component of interest, i.e. the specific column,
 and then to subset this vector. The result will, of
 course, be a vector, not a data.frame.)
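
As a rough sketch of the point above, the following checks that both routes give the same values for the example data from the question. The seed and the names via_df, via_vec and by_id are only illustrative assumptions. The last step, splitting the column once with split(), is an additional idea and not part of the timing comparison above; it may help when every id has to be processed, but whether it suits the real 13-million-row data is untested here.

```r
# Example data with the same shape as in the question.
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
idList <- unique(tmp$id)

# Route 1: subset the data.frame by row, then take the column.
via_df <- tmp[tmp$id == idList[1L], ]$foo

# Route 2: extract the column first, then subset the plain vector.
via_vec <- tmp[[2L]][tmp$id == idList[1L]]

# Both routes return the same numeric vector.
stopifnot(identical(via_df, via_vec))

# Additional idea (not from the reply above): if all ids are needed,
# split the column once and look up each group by name.
by_id <- split(tmp$foo, tmp$id)
stopifnot(identical(by_id[[as.character(idList[1L])]], via_vec))
```

With by_id in hand, something like lapply(by_id, mean) would run a function over every group without re-evaluating the condition for each id.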


-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
