Dear R Helpers,

A few weeks ago I asked for help with modifying the data in a set of data frames. As part of that request I mentioned that one way to accomplish my goal was to put the data frames together in a list, but that I was looking for a way to do it with data frames and a loop because I "believe the better thing is to work df by df for my particular situation".
A couple of posters asked me to provide more detail on what it is about my situation that makes altering data frames in a loop more appropriate than using a list. Life and the scoring of many exams intervened in the last several days, but with grades filed I am now able to return to this issue.

First, let me provide some particulars regarding my situation. I am working with 5,863 data frames, each with 7 columns and between 21 and 5,686 rows of data. Each data frame contains the daily stock price history for an equity traded on one of the U.S. markets. I wanted to get a historical price change for each of the days on file. If one were working with a single data frame for IBM, the command would be

if(nrow(IBM)>129){IBM$Mo129<-ROC(IBM[,"Close"],n=129)}

to get the Rate Of Change of the stock price relative to 129 trading days ago. This function is in the TTR package, which is loaded by quantmod.

So it strikes me that in one sense this is a simple fixed costs vs. variable costs question: is it worth it to assemble the data frames into a list and then process them, putatively more quickly than going data frame by data frame, which does not require the up-front assembly?

A look at the empirical results shows that executing this set of functions df by df consumes 44.15 seconds of elapsed time.

> ptm <- proc.time()
>
> ROCFunc<-function(DF){
+   if(nrow(DF)>129){DF$Mo129<-ROC(DF[,"Close"],n=129)}
+   if(nrow(DF)> 65){DF$Mo65 <-ROC(DF[,"Close"],n= 65)}
+   if(nrow(DF)> 21){DF$Mo21 <-ROC(DF[,"Close"],n= 21)}
+   if(nrow(DF)> 10){DF$Mo10 <-ROC(DF[,"Close"],n= 10)}
+   if(nrow(DF)>  5){DF$Mo5  <-ROC(DF[,"Close"],n=  5)}
+   return(DF)
+ }
> for(i in symbols) assign( i, ROCFunc(get(i)))
>
> time<-proc.time() - ptm
> time
   user  system elapsed
  43.52    0.58   44.15

Using a list approach, the assembly of the list requires 8.44 seconds and the processing then requires 39.20 seconds, totaling 47.64. So a slight win for the data frame approach.
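For readers without TTR installed, the per-frame step described above can be sketched in base R. This is only an illustration under stated assumptions: roc_base() is a hypothetical stand-in that mimics the default (continuous, log-ratio) behaviour of TTR's ROC, add_momentum() is a hypothetical rewrite of ROCFunc that loops over the lags instead of repeating one if() per lag, and the toy price series is made up.

```r
# Hypothetical stand-in for TTR::ROC with its default continuous type:
# log(x[t] / x[t - n]), padded with NA for the first n observations.
roc_base <- function(x, n) {
  c(rep(NA_real_, n), diff(log(x), lag = n))
}

# Add one momentum column per lag, but only when the frame has more rows
# than the lag (the same guard used in ROCFunc).
add_momentum <- function(DF, lags = c(129, 65, 21, 10, 5)) {
  for (n in lags) {
    if (nrow(DF) > n) {
      DF[[paste0("Mo", n)]] <- roc_base(DF[["Close"]], n)
    }
  }
  DF
}

# Toy data (made up): 30 rows, so the 129- and 65-day lags are skipped.
set.seed(1)
toy <- data.frame(Close = 100 * exp(cumsum(rnorm(30, sd = 0.01))))
toy <- add_momentum(toy)
names(toy)
```

Looping over a vector of lags keeps each column name tied to the lag actually used, which may make mismatches between the name and the n argument less likely.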
List assembly:

> ptm <- proc.time()
>
> list.object <- quote(list())
> list.object[ symbols ] <- lapply( symbols, as.name )
> biglist<-eval(list.object)
>
> for (i in seq_along(biglist))
+ {
+   biglist[[i]]<-subset(biglist[[i]],select=-c(Open,High,Low))
+   #biglist[[i]]<-biglist[[i]][as.character(biglist[[i]]$Index) > "2007-01-01", ]
+   #biglist[[i]]$Index<- as.Date(biglist[[i]]$Index,format="%Y-%m-%d")
+   #biglist[[i]]<-xts(biglist[[i]][,-1],biglist[[i]][,1])
+   #biglist[[i]]<-biglist[[i]]['2005-01-01/']
+ }
>
> proc.time() - ptm
   user  system elapsed
   8.03    0.40    8.44

List processing:

> ptm <- proc.time()
>
> rm(list=ls(pattern="^[A-Z]"))
>
> for (i in seq_along(biglist))
+ {
+   if(nrow(biglist[[i]])>180)
+   {
+     biglist[[i]][["Mo180"]]<-ROC(biglist[[i]][["Close"]],n=129)
+   }
+   if(nrow(biglist[[i]])>90)
+   {
+     biglist[[i]][["Mo90"]] <-ROC(biglist[[i]][["Close"]],n=65)
+   }
+   if(nrow(biglist[[i]])>30)
+   {
+     biglist[[i]][["Mo30"]] <-ROC(biglist[[i]][["Close"]],n=21)
+   }
+   if(nrow(biglist[[i]])>10)
+   {
+     biglist[[i]][["Mo10"]] <-ROC(biglist[[i]][["Close"]],n=10)
+   }
+   if(nrow(biglist[[i]])>5)
+   {
+     biglist[[i]][["Mo5"]] <-ROC(biglist[[i]][["Close"]],n=5)
+   }
+ }
> proc.time() - ptm
   user  system elapsed
  39.19    0.00   39.20

The larger issue for me, however, is recovering the set of data frames with the new calculations completed inside each one. For this I used the following syntax that I gleaned from the web:

data.frame(lapply(data.frame(t(sapply(biglist, `[`))), unlist))

But this results in

Error in FUN(X[[2003L]], ...) :
  promise already under evaluation: recursive default argument reference or earlier problems?
Calls: data.frame -> lapply -> FUN
Execution halted

In previous executions I have seen the all too familiar error message 'unable to allocate a vector of size...', indicating to me that I have run out of usable RAM at this last step. I have 8G on my machine, so RAM constraints are rarely a problem.
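If the goal is simply to end up with one object per symbol again, one possible route is a round trip through base R's mget(), lapply() and list2env(). The sketch below has not been run against the 5,863 frames described here; the tickers, the toy prices, the Chg1 column and the separate environment env are all made up for illustration (in an interactive session the target would presumably be the global environment instead).

```r
# Toy stand-ins for the per-symbol data frames, kept in their own
# environment so the example does not touch the workspace.
env <- new.env()
symbols <- c("AAA", "BBB")              # hypothetical tickers
for (s in symbols) {
  assign(s, data.frame(Close = c(10, 11, 12, 13)), envir = env)
}

# 1. Assemble: mget() returns a named list, one element per symbol.
biglist <- mget(symbols, envir = env)

# 2. Process: lapply() returns a new named list of modified frames.
biglist <- lapply(biglist, function(DF) {
  DF$Chg1 <- c(NA, diff(log(DF$Close)))  # placeholder calculation
  DF
})

# 3. Write back: each list element becomes an object with the same name,
#    replacing the original frame in env.
invisible(list2env(biglist, envir = env))

names(get("AAA", envir = env))
```

Because list2env() assigns each element by name rather than combining the elements into one large data frame, it may avoid the memory demand of the data.frame(lapply(...)) expression, though that is an assumption rather than something measured here.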
This is the main reason I said that I believed a list approach was not the best for my situation: going that route will not result in a finished job.

I hope that this demonstration answers the questions the posters raised and can serve as an example to those who, like me recently, are beginning to explore how to operate on multiple data frames. I hope that this outweighs the fact that I have neither asked a specific question nor provided reproducible code.

Positive comments to advance the state of knowledge or improve my knowledge of the processes and syntax are invited. Flaming comments along the lines that I should RTFM are strongly discouraged.

And many thanks to those who have improved my understanding of R through this list in the last few years.

--John J. Sparks, Ph.D.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.