Your 2-million-row loop is overkill: in the (vast) majority of cases you don't need to loop at all. You could try something like this:

1. Split the price by id, e.g.

   price.list <- split(price, id)

Then, for each id:

2a. When price is not NA, assign it to next.price _without_ using a for loop, e.g.

   next.price[!is.na(price)] <- price[!is.na(price)]

2b. Use a for loop only where price is NA, but even then work with vectors as much as you can, for example (untested):

   for (i in setdiff(which(is.na(price)), length(price))) {
     remaining.prices <- price[(i+1):length(price)]
     of.interest <- head(remaining.prices[!is.na(remaining.prices)], 1)
     if (length(of.interest) == 0) next.price[i] <- NA
     else next.price[i] <- of.interest
   }

(Note the length() test: when no later non-NA price exists, head() returns a zero-length numeric vector, so checking its class is unreliable and assigning it would fail.)

To run (2a) and (2b) for every id you could use lapply(); to paste the bits back together, try unlist() -- or, if you split the whole data frame rather than just the price vector, do.call("rbind", price.list). You might also want to take a look at ?Rprof and check the archives for efficiency suggestions.
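Putting steps (2a) and (2b) together, here is a minimal sketch for a single id; the function name next_price_one_id and the toy vectors are my own for illustration, not from the original post:

```r
# Sketch: next_price for one id, assuming `price` is that stock's
# numeric price vector, already sorted by date.
next_price_one_id <- function(price) {
  n <- length(price)
  next.price <- rep(NA_real_, n)
  ok <- !is.na(price)
  # (2a) where price is known, next_price is simply price -- no loop
  next.price[ok] <- price[ok]
  # (2b) loop only over NA positions; the last row has no later price
  for (i in setdiff(which(!ok), n)) {
    remaining <- price[(i + 1):n]
    of.interest <- head(remaining[!is.na(remaining)], 1)
    # zero-length result means the stock never trades again: leave NA
    if (length(of.interest) == 1) next.price[i] <- of.interest
  }
  next.price
}

# e.g. next_price_one_id(c(1, NA, 3, NA)) gives c(1, 3, 3, NA)
```

To apply it over all ids and restore the original (id, date) row order, something like `unsplit(lapply(split(price, id), next_price_one_id), id)` should work, since split() and unsplit() use the same grouping.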
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of r user
> Sent: Tuesday, January 03, 2006 11:59 AM
> To: rhelp
> Subject: [R] For loop gets exponentially slower as dataset
> gets larger...
>
> I am running R 2.1.1 in a Microsoft Windows XP environment.
>
> I have a matrix with three vectors ("columns") and ~2 million
> "rows". The three vectors are date_, id, and price. The data
> is ordered (sorted) by code and date_.
>
> (The matrix contains daily prices for several thousand stocks,
> and has ~2 million "rows". If a stock did not trade on a
> particular date, its price is set to "NA".)
>
> I wish to add a fourth vector that is "next_price". ("Next
> price" is the current price as long as the current price is
> not "NA". If the current price is NA, the "next_price" is the
> next price at which the security with this same ID trades.
> If the stock does not trade again, "next_price" is set to NA.)
>
> I wrote the following loop to calculate next_price. It works
> as intended, but I have one problem. When I have only 10,000
> rows of data, the calculations are very fast. However, when I
> run the loop on the full 2 million rows, it seems to take
> ~1 second per row.
>
> Why is this happening? What can I do to speed the calculations
> when running the loop on the full 2 million rows?
> (I am not running low on memory, but I am maxing out my CPU
> at 100%.)
>
> Here is my code and some sample data:
>
> data <- data[order(data$code, data$date_), ]
> l <- dim(data)[1]
> w <- 3
> data[l, w+1] <- NA
>
> for (i in (l-1):1) {
>   data[i, w+1] <- ifelse(is.na(data[i, w]) == F, data[i, w],
>                     ifelse(data[i, 2] == data[i+1, 2],
>                            data[i+1, w+1], NA))
> }
>
> date       id   price    next_price
> 6/24/2005  1635 444.7838 444.7838
> 6/27/2005  1635 448.4756 448.4756
> 6/28/2005  1635 455.4161 455.4161
> 6/29/2005  1635 454.6658 454.6658
> 6/30/2005  1635 453.9155 453.9155
> 7/1/2005   1635 453.3153 453.3153
> 7/4/2005   1635 NA       453.9155
> 7/5/2005   1635 453.9155 453.9155
> 7/6/2005   1635 453.0152 453.0152
> 7/7/2005   1635 452.8651 452.8651
> 7/8/2005   1635 456.0163 456.0163
> 12/19/2005 1635 442.6982 442.6982
> 12/20/2005 1635 446.5159 446.5159
> 12/21/2005 1635 452.4714 452.4714
> 12/22/2005 1635 451.074  451.074
> 12/23/2005 1635 454.6453 454.6453
> 12/27/2005 1635 NA       NA
> 12/28/2005 1635 NA       NA
> 12/1/2003  1881 66.1562  66.1562
> 12/2/2003  1881 64.9192  64.9192
> 12/3/2003  1881 66.0078  66.0078
> 12/4/2003  1881 65.8098  65.8098
> 12/5/2003  1881 64.1275  64.1275
> 12/8/2003  1881 64.8697  64.8697
> 12/9/2003  1881 63.5337  63.5337
> 12/10/2003 1881 62.9399  62.9399
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html