All interesting suggestions. I guess a better example of the code would have been a good idea. So, I'll put a relevant snippet here.
Rows are cases. There are multiple cases for each ID, each marked with a date. I'm trying to calculate a time-recency-weighted score for a covariate and add it as a new column in the data.frame. So, for each row, I need to see which ID it belongs to, get all the scores prior to this row's date, and then compute the recency-weighted summary.

Right now, I do this in an obvious but very, very slow way. Here is my slow code:

======================
for(i in seq_len(nrow(d))){
    # reset the accumulators for every row
    temp  <- 0
    wTemp <- 0
    # all earlier rows that share this row's id
    for(j in which( d$id == d$id[i] & d$date < d$date[i] )){
        days_since = as.numeric( d$date[i] - d$date[j] )
        w <- exp( -days_since/decay )
        temp <- temp + w * as.numeric(d[j,'score'])
        wTemp <- wTemp + w
    }
    # weighted mean of the earlier scores (NaN if there are none)
    d$newScore[i] <- temp / wTemp
}
======================

One immediate thought was to turn the "date" into an integer. That should save a few cycles of date math.

I need to do this process for a bunch of scores. A grid search over different time decay levels might be nice. So any speedup to this routine will save me a ton of time.

Ideas?

----
Noah Silverman, M.S., C.Phil
UCLA Department of Statistics
8117 Math Sciences Building
Los Angeles, CA 90095

On 11/21/13, 5:51 AM, Rainer M Krug wrote:
>
> On 11/21/13, 12:34 , Jim Holtman wrote:
> > you need to show the statement in context with the rest of the
> > script.  you need to tell us what you want to do, not how you want
> > to do it.
>
> Agreed - a few details will result in guesses (see my guess below)
>
> >
> > Sent from my iPad
> >
> > On Nov 20, 2013, at 15:16, Noah Silverman
> > <noahsilver...@g.ucla.edu> wrote:
> >
> >> Hello,
> >>
> >> I have a fairly large data.frame.  (About 150,000 rows of 100
> >> variables.) There are case IDs, and multiple entries for each ID,
> >> with a date stamp.  (i.e. records of people's activity.)
> >>
> >> I need to iterate over each person (record ID) in the data set,
> >> and then process their data for each date.  The processing part
> >> is fast, the date part is fast.  Locating the records is slow.
> >> I've even tried using data.table, with ID set as the index, and
> >> it is still slow.
> >>
> >> The line with the slow process (According to Rprof) is:
> >>
> >> j <- which( d$id == person )
>
> Possibly use
>
> d_by_id <- split(d, d$id)
>
> which splits the data.frame d into a list, where each element is
> the data.frame of one id.
>
> But: Just a guess.
>
> Cheers,
>
> Rainer
>
> >>
> >> (I then process all the records indexed by j, which seems fast
> >> enough.)
> >>
> >> where d is my data.frame or data.table
> >>
> >> I thought that using the data.table indexing would speed things
> >> up, but not in this case.
> >>
> >> Any ideas on how to speed this up?
> >>
> >> Thanks!
> >>
> >> -- Noah Silverman, M.S., C.Phil UCLA Department of Statistics
> >> 8117 Math Sciences Building Los Angeles, CA 90095
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
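
For what it's worth, here is a minimal sketch of how the split() idea quoted above might be combined with the recency weighting described at the top of this message. It is an illustration under assumptions, not a tested solution for the real data: the helper name recency_score, the toy data.frame, and the decay value are all made up, and it assumes the score column is numeric and that only strictly earlier dates should count. It builds one n-by-n weight matrix per id, so it is only sensible when no single id has a very large number of cases.

```r
# Hypothetical helper: recency-weighted mean of the earlier scores within one id.
# date_num: dates as numeric days; score: the covariate; decay: time scale in days.
recency_score <- function(date_num, score, decay) {
  # dt[i, j] = days elapsed from case j to case i
  dt <- outer(date_num, date_num, "-")
  # only strictly earlier cases get a positive weight
  w <- ifelse(dt > 0, exp(-dt / decay), 0)
  # weighted mean per row; 0/0 yields NaN for rows with no earlier case
  as.vector(w %*% score) / rowSums(w)
}

# Toy data, invented for illustration only
d <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  date  = as.Date(c("2013-01-01", "2013-01-11", "2013-01-21",
                    "2013-02-01", "2013-02-03")),
  score = c(10, 20, 30, 5, 7)
)
decay <- 10

d$newScore <- NA_real_
# split the row indices once, instead of calling which() for every row
rows_by_id <- split(seq_len(nrow(d)), d$id)
for (rows in rows_by_id) {
  d$newScore[rows] <- recency_score(as.numeric(d$date[rows]),
                                    d$score[rows], decay)
}
print(d)
```

Since rows_by_id does not depend on decay, a grid search over decay levels could reuse it and repeat only the final loop for each candidate value.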