All interesting suggestions.

I should have included a better code example in the first place, so
here is the relevant snippet.

Rows are cases.  Each ID has multiple cases, each marked with a
date.  I'm trying to calculate a recency-weighted score for a
covariate and add it as a new column in the data.frame.

So, for each row, I need to see which ID it belongs to, then get all the
scores prior to this row's date, then compute the recency weighted summary.
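
Written out, the summary for one row looks roughly like the following. The symbols are my own labels, not anything from the data: s_j is the score on row j, t is the date in days, and tau is the decay constant. The exponential form of the weight is taken from the `exp( -days_since/decay )` term in the loop further down.

```latex
\[
\mathrm{newScore}_i
  = \frac{\sum_{j \in P_i} w_{ij}\, s_j}{\sum_{j \in P_i} w_{ij}},
\qquad
w_{ij} = \exp\!\left(-\frac{t_i - t_j}{\tau}\right),
\qquad
P_i = \{\, j : \mathrm{id}_j = \mathrm{id}_i,\ t_j < t_i \,\}
\]
```

When P_i is empty (the first case for an ID) the ratio is undefined, so that row has no score.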

Right now, I do this in an obvious but very slow way.

Here is my slow code:
======================
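d$newScore <- NA_real_    # stays NA when a row has no earlier cases
for (i in seq_len(nrow(d))) {
    temp  <- 0            # weighted sum of earlier scores, reset for each row
    wTemp <- 0            # sum of the weights, reset for each row
    # earlier rows with the same ID (compare the whole date column)
    for (j in which(d$id == d$id[i] & d$date < d$date[i])) {
        days_since <- as.numeric(d$date[i] - d$date[j])
        w <- exp(-days_since / decay)
        temp <- temp + w * as.numeric(d[j, 'score'])
        wTemp <- wTemp + w
    }

    if (wTemp > 0) d$newScore[i] <- temp / wTemp
}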
======================
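
One possible way to drop the inner loop is to work on one ID at a time and build all pairwise day differences with outer(). This is only a sketch under assumptions: the toy data, the decay value, and the helper name recency_score are made up for illustration, and the n-by-n matrix per ID is only reasonable if no single ID has a huge number of rows.

```r
# Recency-weighted mean of all strictly earlier scores within ONE id.
# days: numeric day numbers; scores: numeric; decay: positive constant.
# (recency_score is a hypothetical helper name, not from any package.)
recency_score <- function(days, scores, decay) {
    diff_days <- outer(days, days, "-")      # diff_days[i, j] = days[i] - days[j]
    # weight is zero unless row j is strictly earlier than row i
    w <- ifelse(diff_days > 0, exp(-diff_days / decay), 0)
    num <- as.vector(w %*% scores)           # weighted sum of earlier scores
    den <- rowSums(w)                        # sum of the weights
    ifelse(den > 0, num / den, NA_real_)     # NA when there are no earlier rows
}

# Toy data, invented for illustration only
d <- data.frame(
    id    = c(1, 1, 1, 2, 2),
    date  = as.Date(c("2013-01-01", "2013-01-11", "2013-01-21",
                      "2013-01-05", "2013-01-06")),
    score = c(10, 20, 30, 5, 7)
)
decay <- 10   # assumed value

# ave() runs the function once per id and returns results in row order
d$newScore <- ave(seq_len(nrow(d)), d$id, FUN = function(idx) {
    recency_score(as.numeric(d$date[idx]), d$score[idx], decay)
})
```

The which() scan over the full data.frame then happens once per ID (inside ave) instead of once per row.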

One immediate thought was to turn the "date" into an integer.  That
should save a few cycles of date math.
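
A small sketch of that conversion, assuming the column is of class Date (the example dates are made up). as.integer() on a Date gives whole days since 1970-01-01, so differences become plain integer subtraction:

```r
# Made-up example dates
dates <- as.Date(c("2013-11-20", "2013-11-21", "2013-12-01"))

# Whole days since 1970-01-01, stored as a plain integer vector
day_num <- as.integer(dates)

# Day differences are now ordinary integer arithmetic
days_since <- day_num[3] - day_num[1]
```

Doing the conversion once, before any loop, avoids repeated Date method dispatch inside it.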

I need to do this process for a bunch of scores.  A grid search over
different time decay levels might be nice.  So any speedup to this
routine will save me a ton of time.
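
For the grid search, one thing that might help is that the day differences do not depend on the decay level, so they can be computed once per ID and reused for every decay value. A rough sketch under assumptions: the toy data, the decay grid, and the newScore_decay column names are all invented for illustration.

```r
# Toy data, invented for illustration only
d <- data.frame(
    id    = c(1, 1, 1, 2, 2),
    date  = as.Date(c("2013-01-01", "2013-01-11", "2013-01-21",
                      "2013-01-05", "2013-01-06")),
    score = c(10, 20, 30, 5, 7)
)
decays <- c(5, 10, 30)             # assumed grid of decay levels
days   <- as.numeric(d$date)       # convert dates to day numbers once

# One result column per decay level, NA until filled in
for (k in decays) d[[paste0("newScore_decay", k)]] <- NA_real_

# Row indices for each id, located once instead of once per row
groups <- split(seq_len(nrow(d)), d$id)

for (idx in groups) {
    diff_days <- outer(days[idx], days[idx], "-")   # reused for every decay
    earlier   <- diff_days > 0                      # TRUE where row j precedes row i
    for (k in decays) {
        w   <- ifelse(earlier, exp(-diff_days / k), 0)
        den <- rowSums(w)
        num <- as.vector(w %*% d$score[idx])
        d[[paste0("newScore_decay", k)]][idx] <- ifelse(den > 0, num / den, NA_real_)
    }
}
```

The same pattern should extend to several score columns by adding another loop inside the per-ID block, since the weights only depend on the dates.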

Ideas?

----

Noah Silverman, M.S., C.Phil
UCLA Department of Statistics
8117 Math Sciences Building
Los Angeles, CA 90095

On 11/21/13, 5:51 AM, Rainer M Krug wrote:
>
>
> On 11/21/13, 12:34 , Jim Holtman wrote:
> > You need to show the statement in context with the rest of the
> > script.  You need to tell us what you want to do, not how you want
> > to do it.
>
> Agreed - with so few details, all we can offer is guesses (see my guess below)
>
>
> > On Nov 20, 2013, at 15:16, Noah Silverman
> > <noahsilver...@g.ucla.edu> wrote:
>
> >> Hello,
> >>
> >> I have a fairly large data.frame.  (About 150,000 rows of 100
> >> variables.) There are case IDs, and multiple entries for each ID,
> >> with a date stamp.  (i.e. records of people's activity.)
> >>
> >>
> >> I need to iterate over each person (record ID) in the data set,
> >> and then process their data for each date.  The processing part
> >> is fast, the date part is fast.  Locating the records is slow.
> >> I've even tried using data.table, with ID set as the index, and
> >> it is still slow.
> >>
> >> The line with the slow process (According to Rprof) is:
> >>
> >>
> >> j <- which( d$id == person )
>
> Possibly use
>
> d_by_id <- split(d, d$id)
>
> which splits the data.frame d into a list, where each element is
> the data.frame for one id.
>
> But: Just a guess.
>
> Cheers,
>
> Rainer
>
> >>
> >> (I then process all the records indexed by j, which seems fast
> >> enough.)
> >>
> >> where d is my data.frame or data.table
> >>
> >> I thought that using the data.table indexing would speed things
> >> up, but not in this case.
> >>
> >> Any ideas on how to speed this up?
> >>
> >>
> >> Thanks!
> >>
> >> -- Noah Silverman, M.S., C.Phil UCLA Department of Statistics
> >> 8117 Math Sciences Building Los Angeles, CA 90095
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the
> >> posting guide http://www.R-project.org/posting-guide.html and
> >> provide commented, minimal, self-contained, reproducible code.
>

