On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:

> Hi Bill,
>
> > Splus8.0 has something like what you are talking about
> > that provides a fast way to compute
> >     sapply(split(xVector, integerGroupCode), summaryFunction)
> > for some common summary functions.  The 'integerGroupCode'
> > is typically the codes from a factor, but you could compute
> > it in other ways.  It needs to be a "small" integer in
> > the range 1:ngroups (like the 'bin' argument to tabulate).
> > Like tabulate(), which is called from table(), these are
> > meant to be called from other functions that can set up
> > appropriate group codes.  E.g., groupSums or rowSums or
> > fancier things could be based on this.
> >
> > They do not insist you sort the input in any way.  That
> > would really only be useful for group medians, and we haven't
> > written that one yet.
>
> The sort is also useful for recoding each group into subgroups based
> on some other numeric vector.  This is the problem I run into trying
> to build portfolios that can be used as benchmarks for long-term
> stock returns.
>
> Another issue I have is that recoding a long character string that I
> use as a sort key, for accessing a subgroup of the data in the
> data.frame, to a set of small integers is not fast.  I can make a
> fast implementation if the data is sorted by the key, but without
> the sort, just converting my sort keys to the required small integer
> codes would be expensive for very long vectors, since my small
> integer codes would have to reflect the order of the data (i.e. be
> increasing subportfolio numbers).
True, but the underlying grouped summary code shouldn't require you to
do the sorting.  If

    codes <- match(char, sort(unique(char)))

is too slow, then you could try sorting the data set by the 'char'
column and doing

    codes <- cumsum(c(TRUE, char[-1] != char[-length(char)]))

(loop over the columns, combining the change indicators, before doing
the cumsum if the key involves several columns).  Both idioms are
demonstrated after the quoted SAS code below.  If that isn't fast
enough, then you could sort in the C code as well, but I think there
could be lots of cases where that is slower.  I've used this code for
out-of-core applications, where I definitely do not want to sort the
whole dataset.

> More specifically, I am now converting all of my SAS code to R code,
> and the problem is that I have lots of snippets of SAS that do the
> following ...
>
>     PROC SORT;
>       BY MDSIZ FSIZ;
>
>     /* WRITE OUT THE MIN SIZE CUTOFF VALUES */
>     PROC UNIVARIATE NOPRINT;
>       VAR FSIZ;
>       BY MDSIZ;
>       OUTPUT OUT=TMPS1 MIN=XMIN;
>
> where my sort key MDSIZ is a character string that is the
> concatenation of the month-ending date (MD) and the size portfolio
> of a particular firm (SIZ), and I want to find the cutoff points
> (the mins) for each of the portfolios for every month-end date
> across all traded firms.
>
> > The typical prototype is
> >
> > > igroupSums
> > function(x, group = NULL, na.rm = F, weights = NULL,
> >     ngroups = if (is.null(group)) 1
> >               else max(as.integer(group), na.rm = T))
> >
> > and the currently supported summary functions are
> >     mean  : igroupMeans
> >     sum   : igroupSums
> >     prod  : igroupProds
> >     min   : igroupMins
> >     max   : igroupMaxs
> >     range : igroupRanges
> >     any   : igroupAnys
> >     all   : igroupAlls
>
> SAS is similar in that it also has a specific list of functions you
> can request, including all of the basic stats from a PROC UNIVARIATE
> (higher-moment stuff such as skewness and kurtosis, robust
> statistics, and even statistical test results) for each coded
> subgroup, and the nice thing is that all combinations can be done
> with one call.
>
> To do that, SAS does require the presorting, but it runs really fast
> even for super long vectors with lots of sort keys.
>
> Similarly, the next snippet of code will take the file and resort it
> by the portfolio key and then the market-to-book ratio (MTB) for all
> trading firms for all monthly periods since 1980.  It will then
> split each size portfolio for each month-ending date into 5 equal
> portfolios based on market-to-book ratios (thus the need for the
> sort).  SAS returns a coded integer vector PMTB (made up of 1s to
> 5s, with 1s for the smallest MTB and 5s for the largest MTB),
> repeated for each subgroup of MDSIZ.  PMTB matches the original
> vector in length and therefore fits right into the data frame.
>
>     /* SPLIT INTO MARKET TO BOOK QUINTILES BY MDSIZ */
>     PROC SORT;
>       BY MDSIZ MTB;
>     PROC RANK GROUPS=5 OUT=TMPS0;
>       VAR MTB;
>       RANKS PMTB;
>       BY MDSIZ;
>
> The problem of assigning elements of a long data vector to
> portfolios and subportfolios, based on the values of specific data
> columns that must be calculated at each step and are not fixed or
> hardcoded, is one that finance can run into (and therefore I run
> into it).
>
> So by sorting I could handle the need for "small integer" recoding,
> and the small integers would have meaning (i.e. higher values
> represent larger-MTB firms, etc.).
>
> That just leaves the problem of calculating stats on short sequences
> of a longer vector.
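To make the two recoding idioms above concrete, here is a minimal,
self-contained sketch; the 'char' values are made-up stand-ins for a
key like MDSIZ:

    ## A made-up character key standing in for something like MDSIZ.
    char <- c("200601.2", "200601.1", "200602.1", "200601.1", "200602.1")

    ## Unsorted data: match() against the sorted unique keys gives
    ## small integer codes in the range 1:ngroups.
    codes1 <- match(char, sort(unique(char)))

    ## Sorted data: a new group starts wherever the key changes, so a
    ## cumsum() over the change indicator gives the same codes.
    ord    <- order(char)
    schar  <- char[ord]
    codes2 <- cumsum(c(TRUE, schar[-1] != schar[-length(schar)]))

    ## codes2 is in sorted order; inverting the permutation recovers
    ## the codes for the original row order.
    stopifnot(all(codes1 == codes2[order(ord)]))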
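For what it's worth, here is a rough base-R sketch of what those two
SAS steps compute, using tapply() and ave().  The data frame 'firms'
and its values are hypothetical, and the quintile rule shown (ceiling
of 5*rank/n, ties broken by order of appearance) is one common
convention, not a claim about PROC RANK's exact tie handling:

    ## Hypothetical data frame standing in for the SAS data set.
    firms <- data.frame(MDSIZ = c("200601.1", "200601.1", "200601.2",
                                  "200601.2", "200601.2"),
                        FSIZ  = c(10, 12, 50, 70, 65),
                        MTB   = c(0.8, 1.4, 2.1, 0.5, 1.0))

    ## PROC UNIVARIATE ... OUTPUT MIN=XMIN, BY MDSIZ:
    ## one minimum per group, no presorting required.
    xmin <- tapply(firms$FSIZ, firms$MDSIZ, min)

    ## PROC RANK GROUPS=5 ... BY MDSIZ:
    ## within-group quintile codes 1..5 (1 = smallest MTB), the same
    ## length as the input, so they drop straight into the data frame.
    firms$PMTB <- ave(firms$MTB, firms$MDSIZ,
                      FUN = function(v)
                          ceiling(5 * rank(v, ties.method = "first") /
                                  length(v)))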
> > They are fast:
> >
> > > x <- runif(2e6)
> > > i <- rep(1:1e6, 2)
> > > sys.time(sx <- igroupSums(x, i))
> > [1] 0.66 0.67
> > > length(sx)
> > [1] 1000000
> >
> > On that machine R takes 44 seconds to go the lapply/split route:
> >
> > > unix.time(unlist(lapply(split(x, i), sum)))
> > [1] 43.24  0.78 44.11  0.00  0.00
>
> Yes!  That is exactly what I need.
>
> Are there plans for adding something like that to R?
>
> Thanks,
>
> Kevin

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author
  and do not necessarily reflect Insightful Corporation policy or
  position."

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
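A postscript on the timing comparison above: base R's rowsum() also
computes grouped sums in compiled code, without an explicit split(),
so it is a third point of comparison.  A minimal sketch (no timings
claimed here, and it covers only sums, not the whole igroup* family):

    x <- runif(2e6)
    i <- rep(1:1e6, 2)

    ## rowsum() treats a vector as a one-column matrix and sums its
    ## rows by group, entirely in C.
    sx <- rowsum(x, i)          # a 1e6 x 1 matrix of group sums

    ## For comparison, the slow route from the timings above:
    ## unlist(lapply(split(x, i), sum))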