Which version of R are you looking at? R-devel has o merge() works more efficiently when there are relatively few matches between the data frames (for example, for 1-1 matching). The order of the result is changed for 'sort = FALSE'.
On Thu, 27 Jul 2006, Kevin B. Hendricks wrote: > Hi Developers, > > I am looking for another new project to help me get more up to speed > on R and to learn something outside of R internals. One recent R > issue I have run into is finding a fast implementations of the > equivalent to the following SAS code: > > /* MDPC is an integer sort key made from two integer columns */ > MDPC = (MD * 100000) + PCO; > > /* sort the dataset by the key */ > PROC SORT; > BY MDPC; > > /* print out count and sum for each unique sort key (subgroup) */ > /* use of BY MDPC requires sorting that data set by MDPC first in SAS */ > PROC UNIVARIATE NOPRINT; > VAR MVE; > BY MDPC; > OUTPUT OUT=TMP0 N=XN SUM=XSUM; > > Easy to do in R but the problem is the data set this is being run on > has 1,742,201 lines in it and takes up 196,868,713 bytes to store as > character data. The sort key has easily has over 200,000 unique keys > (if not twice that). > > My first R attempt was a simple > > # sort the data.frame gd and the sort key > sorder <- order(MDPC) > gd <- gd[sorder,] > MDPC <- MDPC[sorder] > attach(gd) > > # find the length and sum for each unique sort key > XN <- by(MVE, MDPC, length) > XSUM <- by(MVE, MDPC, sum) > GRPS <- levels(as.factor(MDPC)) > > Well the ordering and sorting was reasonably fast but the first "by" > statement was still running 4 hours later on my machine (a dual 2.6 > gig Opteron with 4 gig of main memory). This same snippet of code in > SAS running on a slower machine takes about 5 minutes of system time. > > I tried various simple R implementations of a "by_sorted" that I > thought might help > > # walk sorted array once and keep track of beginning > # and ending points for each unique sort key value in p > # and run function fcn on that sub sequence in vector v > # store the results in a vector > > by_sorted <- function(v, p, fcn) { > key <- p[[1]] > bp <- 1 > r <- NULL > for (i in 2:length(p)) { > if (key != p[[i]]) { > r <- c(r,fcn(v[bp:i-1])) > bp <- i > key <- p[[i]] > } > } > r <- c(r,fcn(v[bp:i])) > } > > but they took "forever" to run also (read that I killed those > attempts at 15 minutes of cpu time). > > I literally had the same issue when trying with "tapply". > > So unless it already exists someplace, I need a really fast > implementation of "by" for very large sorted data sets (probably > written in fortran or c) that will do the equivalent of what SAS does > with its "proc univariate by" approach with close to the same > performance. The code should only have to walk the array once (ie. > be linear in time with the number of rows in the vector). I have > similar issues with "merge" as well since merging data frames already > sorted by the same sort key should be fast as well and does not > appear to be. > > Before I jump into this and create my own "sorted large data set" > versions of "by" and "merge", I wanted to be sure it would be of > interest to others. If they work well and are well implemented (a > big if since I am really still just learning this - the whole point > of the project!) would something like this be of any interest for > internal use of R? Or is this something too specialized? > Is there an R function implemented in c or fortran that would make a > good "model" to follow for implementing something like this? > Would/should they be extensions of current implementations of "merge" > and "by" with an additions of a sorted=TRUE (defaulting to FALSE) > extra parameter. > > Or am I simply barking up the wrong tree here? > > Thanks, > > Kevin > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595 ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel