On Fri, 28 Jul 2006, Kevin B. Hendricks wrote: > > In my particular case, the problem was my data frame had over 1 > million lines had probably over 500,000 unique sort keys (ie. think > of it as an R factor with over 500,000 levels). The implementation > of "by" uses "tapply" which in turn uses "split". So "split" simply > ate up all the time trying to create 500,000 vectors each of short > length 1, 2, or 3; and the associated garbage collection. > > I simple loop that walked the short sequence of values (since the > data frame was already sorted) calculating what it needed, would work > much faster than splitting the original vector into so very many > smaller vectors (and the associated copying of data). > > That problem is very similar problem to the calculation of basic > stats on a short moving window over a very long vector. > > I will look into that package and maybe use it for a model for what I > want to do.
Splus8.0 has something like what you are talking about that provides a fast way to compute sapply(split(xVector, integerGroupCode), summaryFunction) for some common summary functions. The 'integerGroupCode' is typically the codes from a factor, but you could compute it in other ways. It needs to be a "small" integer in the range 1:ngroups (like the 'bin' argument to tabulate). Like tabulate(), which is called from table(), these are meant to be called from other functions that can set up appropriate group codes. E.g., groupSums or rowSums or fancier things could be based on this. They do not insist you sort the input in any way. That would really only be useful for group medians and we haven't written that one yet. The typical prototype is > igroupSums function(x, group = NULL, na.rm = F, weights = NULL, ngroups = if(is.null( group)) 1 else max(as.integer(group), na.rm = T)) and the currently supported summary functions are mean : igroupMeans sum : igroupSums prod : igroupProds min : igroupMins max : igroupMaxs range : igroupRanges any : igroupAnys all : igroupAlls (The plurals are not all properly spelled and they are plural so match related functions like rowSums.) It might be useful to also have one to count the number of non-missing values in each group of x's. They are fast: > x<-runif(2e6) > i<-rep(1:1e6, 2) > sys.time(sx <- igroupSums(x,i)) [1] 0.66 0.67 > length(sx) [1] 1000000 On that machine R takes 44 seconds to go the lapply/split route: > unix.time(unlist(lapply(split(x,i), sum))) [1] 43.24 0.78 44.11 0.00 0.00 To save setup time in the S code, out-of-range values in the group argument (negatives, values greater than ngroup, and NA's), mean that the correponding element in x is ignored. ---------------------------------------------------------------------------- Bill Dunlap Insightful Corporation bill at insightful dot com 360-428-8146 "All statements in this message represent the opinions of the author and do not necessarily reflect Insightful Corporation policy or position." ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel