[R] To List or Not To List

Sparks, John James Thu, 16 May 2013 10:15:55 -0700

Dear R Helpers,

A few weeks ago I asked for some help on how to accomplish modifications
to data in a set of data frames.  As part of that request I mentioned that
I realized that one way to accomplish my goal was to put the data frames
together in a list but that I was looking for a way to do it with data
frames and a loop because I "believe the better thing is to work df by df
for my particular situation".


A couple of posters asked me to provide more detail as to what it is about
my situation that made data frame alterations in a loop more appropriate
than a list.

Life and the scoring of many exams intervened in the last several days,
but with grades filed I am now able to return to this issue.

First, let me provide some particulars regarding my situation.  I am
working with 5,863 data frames, each with 7 columns and between 21 and
5,686 rows of data.  Each data frame contains the daily stock price history
for an equity traded on one of the U.S. markets.  I wanted to get a
historical price change for each of the days on file.  If one were
working with a single data frame for IBM, then the command would be

if(nrow(IBM)>129){IBM$Mo129<-ROC(IBM[,"Close"],n=129)}

to get the Rate Of Change of the stock price relative to 129 trading days
ago.  This function is in the TTR package, which is loaded by quantmod.
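
For readers without TTR installed, the calculation can be sketched in
base R.  This assumes ROC()'s default type = "continuous", i.e. the log of
the current close over the close n rows earlier, padded with NA at the
front.  The name roc_base and the toy prices are invented for
illustration and are not part of the original code.

```r
# Hypothetical base-R stand-in for TTR::ROC(x, n), assuming the default
# type = "continuous": log(x[t] / x[t - n]), with n leading NAs.
roc_base <- function(x, n = 1) {
  stopifnot(n >= 1, length(x) > n)
  c(rep(NA_real_, n), diff(log(x), lag = n))
}

# Toy closing prices (invented, not real IBM data)
close <- c(100, 102, 101, 105, 110)
roc2 <- roc_base(close, n = 2)
print(roc2)
```

The first n entries are NA because there is no price n rows back to
compare against, which is why the nrow() guard is needed before adding
each column.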

So it strikes me that in one sense this is a simple fixed costs vs.
variable costs question:  Is it worth it to assemble the data frames into
a list and then process them, putatively more quickly than going data
frame by data frame, which does not require the up-front assembly?

A look at the empirical results shows that executing this set of functions
df by df consumes 44.15 seconds of elapsed time.

> ptm <- proc.time()
>
>
>       ROCFunc<-function(DF){
+ if(nrow(DF)>129){DF$Mo129<-ROC(DF[,"Close"],n=129)}
+ if(nrow(DF)> 65){DF$Mo65 <-ROC(DF[,"Close"],n= 65)}
+ if(nrow(DF)> 21){DF$Mo21 <-ROC(DF[,"Close"],n= 21)}
+ if(nrow(DF)> 10){DF$Mo10 <-ROC(DF[,"Close"],n= 10)}
+ if(nrow(DF)>  5){DF$Mo5  <-ROC(DF[,"Close"],n=  5)}
+ return(DF)
+ }
> for(i in symbols) assign( i, ROCFunc(get(i)))
>
>
> time<-proc.time() - ptm
> time
   user  system elapsed
  43.52    0.58   44.15


Using a list approach, the assembly of the list requires 8.44 seconds and
then the processing requires 39.20 seconds, totaling 47.64 seconds.  So a
slight win for the data frame approach.

> ptm <- proc.time()
>
> list.object <- quote(list())
> list.object[ symbols ] <- lapply( symbols, as.name )
> biglist<-eval(list.object)
>
>
> for (i in seq_along(biglist))
+       {
+        biglist[[i]]<-subset(biglist[[i]],select=-c(Open,High,Low))
+        #biglist[[i]]<-biglist[[i]][as.character(biglist[[i]]$Index) > "2007-01-01", ]
+        #biglist[[i]]$Index<- as.Date(biglist[[i]]$Index,format="%Y-%m-%d")
+        #biglist[[i]]<-xts(biglist[[i]][,-1],biglist[[i]][,1])
+        #biglist[[i]]<-biglist[[i]]['2005-01-01/']
+        }
>
>  proc.time() - ptm
   user  system elapsed
   8.03    0.40    8.44
>  ptm <- proc.time()
>
> rm(list=ls(pattern="^[A-Z]"))
>
> for (i in seq_along(biglist))
+ {
+        if(nrow(biglist[[i]])>180)
+               {
+               biglist[[i]][["Mo180"]]<-ROC(biglist[[i]][["Close"]],n=129)
+               }
+       if(nrow(biglist[[i]])>90)
+               {
+               biglist[[i]][["Mo90"]] <-ROC(biglist[[i]][["Close"]],n=65)
+               }
+       if(nrow(biglist[[i]])>30)
+               {
+               biglist[[i]][["Mo30"]] <-ROC(biglist[[i]][["Close"]],n=21)
+               }
+       if(nrow(biglist[[i]])>10)
+               {
+               biglist[[i]][["Mo10"]] <-ROC(biglist[[i]][["Close"]],n=10)
+               }
+               if(nrow(biglist[[i]])>5)
+               {
+               biglist[[i]][["Mo5"]] <-ROC(biglist[[i]][["Close"]],n=5)
+               }
+ }
> proc.time() - ptm
   user  system elapsed
  39.19    0.00   39.20
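
For comparison, base R's mget() can collect the named data frames into a
list in one call, and lapply() can then apply a per-frame function.  The
sketch below uses two invented data frames and a base-R stand-in for
ROC(), so the names AAA, BBB, roc_base and add_roc are all assumptions,
and no timing claim is made for it.

```r
# Two invented data frames standing in for the per-symbol price histories
set.seed(1)
AAA <- data.frame(Close = 50 * cumprod(1 + rnorm(30, 0, 0.01)))
BBB <- data.frame(Close = 20 * cumprod(1 + rnorm(8,  0, 0.01)))
symbols <- c("AAA", "BBB")

# Hypothetical stand-in for TTR::ROC (continuous rate of change)
roc_base <- function(x, n) c(rep(NA_real_, n), diff(log(x), lag = n))

# Add one MoN column for each horizon the frame is long enough for
add_roc <- function(DF, horizons = c(21, 10, 5)) {
  for (n in horizons) {
    if (nrow(DF) > n) DF[[paste0("Mo", n)]] <- roc_base(DF[["Close"]], n)
  }
  DF
}

# mget() returns a list named by symbol; lapply() keeps those names
biglist <- lapply(mget(symbols), add_roc)
str(biglist, max.level = 1)
```

Looping over a vector of horizons keeps each row-count threshold tied to
the n that is actually passed on, so the guard, the column name and the
lag cannot drift apart.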


The larger issue for me, however, is getting back to the set of data frames
with the new calculations completed inside each one.  For this I used the
following syntax that I gleaned from the web:

data.frame(lapply(data.frame(t(sapply(biglist, `[`))), unlist))

But this results in
Error in FUN(X[[2003L]], ...) :
  promise already under evaluation: recursive default argument reference
or earlier problems?
Calls: data.frame -> lapply -> FUN
Execution halted

In previous executions I have seen the all too familiar error message
'unable to allocate a vector of size...' indicating to me that I have run
out of usable RAM at this last step.  I have 8G on my machine, so RAM
constraints are rarely a problem.  This is the main reason that I said
that I believed that a list approach was not the best for my situation:
going that route will not result in a finished job.
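
One possible way back from a named list to individual data frames,
offered as a sketch rather than a tested fix for the 5,863-frame case, is
base R's list2env(), which assigns each list element to a variable named
after it.  The expression quoted above appears to try to combine
everything into a single data frame, which is a different operation.  The
frames below are invented.

```r
# Invented named list standing in for biglist
biglist <- list(
  AAA = data.frame(Close = c(10, 11, 12), Mo1 = c(NA, 0.0953, 0.0870)),
  BBB = data.frame(Close = c(20, 19),     Mo1 = c(NA, -0.0513))
)

# Assign each element to a variable of the same name in a target
# environment.  A fresh environment is used here to avoid overwriting
# anything; passing envir = globalenv() would instead recreate the
# top-level objects AAA and BBB.
out <- new.env()
list2env(biglist, envir = out)

ls(out)
```

This relies on the list being named by symbol, which is the case when it
is built by assigning into list.object[ symbols ] as shown earlier.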

I hope that this demonstration answers the questions of the posters who
posed the question and can potentially serve to provide an example to
those who, like me recently, are beginning to explore how to execute on
multiple data frames.  I hope that this outweighs the fact that I have not
asked a specific question nor provided reproducible code.  Positive
comments to advance the state of knowledge or improve my knowledge of the
processes and syntax are invited.  Flaming comments along the lines that I
should RTFM are strongly discouraged.

And many thanks to those who have improved my understanding of R through
this list in the last few years.

--John J. Sparks, Ph.D.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
