Re: [R] avoiding loop

parkbomee Tue, 03 Nov 2009 19:05:01 -0800


Thanks for your help.



> Date: Mon, 2 Nov 2009 18:50:42 -0500
> Subject: Re: [R] avoiding loop
> From: jholt...@gmail.com
> To: bbom...@hotmail.com
> CC: mtmor...@fhcrc.org; r-help@r-project.org
> 
> The first thing I would suggest is to convert your dataframes to matrices
> so that you are not having to continually convert them in the calls to
> the functions.  Also I am not sure what the code:
> 
>       realized_prob = with(DF, {
>           ind <- (CHOSEN == 1)
>           n <- tapply(theta_multiple[ind], CS[ind], sum)
>           d <- tapply(theta_multiple, CS, sum)
>           n / d
>       })
> 
> is doing.  It looks like 'n' and 'd' might have different lengths
> since they are being created by two different (CS & CS[ind])
> sequences.  I have no idea why you are converting to the "DF"
> dataframe.  There is no need for that.  You could just leave the
> vectors (e.g., theta_multiple, CS and ind) as they are and work with
> them.  This is probably where most of your time is being spent.  So if
> you start with matrices and leave the dataframes out of the main loop
> you will probably see an increase in performance.
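
The advice above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data, the column names x1 and x2, and the form of theta_multiple (taken here to be exp(X %*% beta), guessed from the "%*%" and "exp" entries in the profile) are hypothetical and not from the original code. The point is that the matrix and the plain vectors are built once, outside the function handed to optim().

```r
# Hypothetical data standing in for the poster's columns (assumption)
set.seed(1)
n_obs <- 1000
DF <- data.frame(CS = sort(sample(1:100, n_obs, replace = TRUE)),
                 CHOSEN = 0L,
                 x1 = rnorm(n_obs),
                 x2 = rnorm(n_obs))
# Mark the first row of each choice set as the chosen one
DF$CHOSEN[!duplicated(DF$CS)] <- 1L

# Conversions done once, outside the objective function
X   <- as.matrix(DF[, c("x1", "x2")])
cs  <- DF$CS
ind <- DF$CHOSEN == 1

# The objective works on the matrix and plain vectors only;
# no data frame is built or converted on each evaluation
neg_loglik <- function(beta) {
    theta <- exp(drop(X %*% beta))   # assumed form of theta_multiple
    n <- tapply(theta[ind], cs[ind], sum)
    d <- tapply(theta, cs, sum)
    -sum(log(n / d))
}

fit <- optim(c(0, 0), neg_loglik)
fit$par
```

Because every choice set in this made-up data has exactly one chosen row, n and d cover the same groups; real data may not guarantee that.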
> 
> 2009/11/2 parkbomee <bbom...@hotmail.com>:
> > This is the Rprof() report by self time.
> > Is it also possible that these routines, which take long self.time, are
> > causing the optim() to be slow?
> >
> >
> > $by.self
> >                         self.time self.pct total.time total.pct
> > "FUN"                       94.16     16.5      94.16      16.5
> > "unlist"                    80.46     14.1     120.54      21.1
> > "lapply"                    76.94     13.5     255.48      44.7
> > "match"                     60.76     10.6      60.88      10.7
> > "as.matrix.data.frame"      31.00      5.4      51.12       8.9
> > "as.character"              29.28      5.1      29.28       5.1
> > "unique.default"            24.36      4.3      24.40       4.3
> > "data.frame"                21.06      3.7      55.78       9.8
> > "split.default"             20.42      3.6      84.38      14.8
> > "tapply"                    13.84      2.4     414.28      72.5
> > "structure"                 11.32      2.0      22.36       3.9
> > "factor"                    11.08      1.9     127.68      22.3
> > "attributes<-"              11.00      1.9      11.00       1.9
> > "=="                        10.56      1.8      10.56       1.8
> > "%*%"                       10.30      1.8      10.30       1.8
> > "as.vector"                 10.22      1.8      10.22       1.8
> > "as.integer"                 9.86      1.7       9.86       1.7
> > "list"                       9.64      1.7       9.64       1.7
> > "exp"                        7.12      1.2       7.12       1.2
> > "as.data.frame.integer"      5.98      1.0       8.10       1.4
> >
> >> To: bbom...@hotmail.com
> >> CC: jholt...@gmail.com; r-help@r-project.org
> >> Subject: Re: [R] avoiding loop
> >> From: mtmor...@fhcrc.org
> >> Date: Sun, 1 Nov 2009 22:14:09 -0800
> >>
> >> parkbomee <bbom...@hotmail.com> writes:
> >>
> >> > Thank you all.
> >> >
> >> > What Chuck has suggested might not be applicable since the number of
> >> > different times is around 40,000.
> >> >
> >> > What varies during the optimization is "value", which is basically
> >> > data * parameter; "parameter" is what is being optimized.
> >> >
> >> > And from the R profiling with a subset of data,
> >> > I got this report. Any idea what "<Anonymous>" is?
> >> >
> >> >
> >> > $by.total
> >> >                total.time total.pct self.time self.pct
> >> > "<Anonymous>"      571.56     100.0      0.02      0.0
> >> > "optim"            571.56     100.0      0.00      0.0
> >> > "fn"               571.54     100.0      0.98      0.2
> >>
> >> You're giving us 'by.total', so these are saying that all the time was
> >> spent in these functions or the functions they called. Probably all
> >> are in 'optim' and its arguments; since little self.time is spent
> >> here, there isn't much to work with.
> >>
> >> > "eval"             423.74      74.1      0.00      0.0
> >> > "with.default"     423.74      74.1      0.00      0.0
> >> > "with"             423.74      74.1      0.00      0.0
> >>
> >> These are probably in the internals of optim, where the function
> >> you're trying to optimize is being set up for evaluation. Again
> >> there's little self.time, and all these say is that a big piece of the
> >> time is being spent in code called by this code.
> >>
> >> > "tapply"           414.28      72.5     13.84      2.4
> >> > "lapply"           255.48      44.7     76.94     13.5
> >> > "factor"           127.68      22.3     11.08      1.9
> >> > "unlist"           120.54      21.1     80.46     14.1
> >> > "FUN"               94.16      16.5     94.16     16.5
> >>
> >> These look like they are tapply-related calls (looking at the code for
> >> tapply, it calls lapply, factor, and unlist, and FUN is the function
> >> argument to tapply), perhaps from the function you're optimizing (did
> >> you implement this as suggested below? it would really help to have a
> >> possibly simplified version of the code you're calling).
> >>
> >> There is material to work with here, as apparently a fairly large
> >> amount of self.time is being spent in each of these functions. So
> >> here's a sample data set
> >>
> >> n <- 100000
> >> set.seed(123)
> >> df <- data.frame(time=sort(as.integer(ceiling(runif(n)*n/5))),
> >>                  value=ceiling(runif(n)*5))
> >>
> >> It would have been helpful for you to provide reproducible code like
> >> that above, so that the characteristics of your data were easily
> >> reproducible. Let's time tapply
> >>
> >> > replicate(5, {
> >> + system.time(x0 <<- tapply(df$value, df$time, sum), gcFirst=TRUE)[[1]]
> >> + })
> >> [1] 0.316 0.316 0.308 0.320 0.304
> >>
> >> tapply is quite general, but in your case I think you'd be happy with
> >>
> >> tapply1 <- function(X, INDEX, FUN)
> >>     unlist(lapply(split(X, INDEX), FUN), use.names=FALSE)
> >>
> >> > replicate(5, {
> >> + system.time(x1 <<- tapply1(df$value, df$time, sum), gcFirst=TRUE)[[1]]
> >> + })
> >> [1] 0.156 0.148 0.152 0.144 0.152
> >>
> >> so about twice the speed (timing depends quite a bit on what 'time' is,
> >> integer or numeric or character or factor). The vector values of the
> >> two calculations are identical, though tapply presents the data as a
> >> one-dimensional array with names
> >>
> >> > identical(as.vector(x0), x1)
> >> [1] TRUE
> >>
> >> tapply allows FUN to be anything, but if the interest is in the sum of
> >> each time interval, and the time intervals can be assumed to be sorted
> >> (sorting is not expensive, so could be done on the fly), then
> >>
> >> tapply2 <- function(X, INDEX)
> >> {
> >>     csum <- cumsum(c(0, X))
> >>     idx <- diff(INDEX) != 0
> >>     csum[c(FALSE, idx, TRUE)] - csum[c(TRUE, idx, FALSE)]
> >> }
> >>
> >> calculates the cumulative sum and the points in INDEX where the time
> >> intervals change. It then takes the difference over the appropriate
> >> interval.
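
To make the cumulative-sum trick described above concrete, here is a small worked example on made-up numbers (the vectors value and time are illustrative, not the poster's data). It repeats the tapply2 definition so that it runs on its own and prints the intermediate pieces.

```r
# tapply2 as given in the message; INDEX must already be sorted
tapply2 <- function(X, INDEX) {
    csum <- cumsum(c(0, X))
    idx <- diff(INDEX) != 0
    csum[c(FALSE, idx, TRUE)] - csum[c(TRUE, idx, FALSE)]
}

value <- c(3, 2, 3, 3, 4, 4, 2)
time  <- c(1, 1, 1, 1, 2, 2, 3)

csum <- cumsum(c(0, value))   # 0 3 5 8 11 15 19 21
idx  <- diff(time) != 0       # TRUE where the next row starts a new group

ends   <- csum[c(FALSE, idx, TRUE)]   # running total at the end of each group
starts <- csum[c(TRUE, idx, FALSE)]   # running total just before each group
ends - starts                         # per-group sums: 11 8 2

tapply2(value, time)                  # same result
```

The group sums are simply differences of the running total taken at the group boundaries, which is why no per-group function call is needed.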
> >>
> >> > replicate(5, {
> >> + system.time(x2 <<- tapply2(df$value, df$time), gcFirst=TRUE)[[1]]
> >> + })
> >> [1] 0.024 0.024 0.024 0.024 0.024
> >> > identical(as.vector(x0), x2)
> >> [1] TRUE
> >>
> >> This approach could be subject to rounding error (if csum gets very
> >> large and the intervals remain small). To calculate values where
> >> choice == 1 I think you'd want to
> >>
> >> tapply2(df$value * (df$choice==1), df$time)
> >>
> >> rather than sub-setting, so that the result of tapply2 is always a
> >> vector of the same length even when some time intervals never have
> >> choice==1.
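
Putting the masking idea above together, the ratio the original question asks for might be computed along these lines. This is a sketch on simulated data: the choice column and its distribution are assumptions, since the sample data frame earlier in the message has only time and value. The result is checked against plain tapply() as a reference.

```r
# tapply2 as given in the message; INDEX must already be sorted
tapply2 <- function(X, INDEX) {
    csum <- cumsum(c(0, X))
    idx <- diff(INDEX) != 0
    csum[c(FALSE, idx, TRUE)] - csum[c(TRUE, idx, FALSE)]
}

# Hypothetical data: 'choice' is an assumed 0/1 column
set.seed(42)
n_obs <- 1000
df <- data.frame(time = sort(sample(1:50, n_obs, replace = TRUE)),
                 choice = rbinom(n_obs, 1, 0.1),
                 value = runif(n_obs))

# Masking keeps one entry per time interval in both vectors
num <- tapply2(df$value * (df$choice == 1), df$time)
den <- tapply2(df$value, df$time)
ratio <- num / den

# Reference computed with tapply on the same masked values
ref <- as.vector(tapply(df$value * (df$choice == 1), df$time, sum) /
                 tapply(df$value, df$time, sum))
all.equal(ratio, ref)
```

Intervals with no chosen row simply get a ratio of 0 instead of dropping out of the result.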
> >>
> >> Because tapply in these examples seems so fast compared to your
> >> calculation, I wonder whether optim is evaluating your function many
> >> times, and that reformulating the optimization might lead to a very
> >> substantial speed-up?
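
One way to check how often optim() evaluates the objective is to look at the counts component of its return value, or to count calls by hand. The toy quadratic objective below is purely illustrative and has nothing to do with the poster's model.

```r
calls <- 0

# Toy objective with its minimum at (1, 2); increments a counter per call
fn <- function(par) {
    calls <<- calls + 1
    sum((par - c(1, 2))^2)
}

res <- optim(c(0, 0), fn)   # default method is Nelder-Mead

res$counts   # evaluations of fn (and of the gradient, if one was used)
calls        # counter maintained by hand
res$par      # should be close to c(1, 2)
```

Multiplying the evaluation count by the time of a single call gives a rough idea of whether the objective itself or the number of evaluations dominates the run time.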
> >>
> >> Martin
> >>
> >> > .
> >> > .
> >> > .
> >> > .
> >> > .
> >> >
> >> >
> >> >> Date: Sun, 1 Nov 2009 15:35:41 -0400
> >> >> Subject: Re: [R] avoiding loop
> >> >> From: jholt...@gmail.com
> >> >> To: bbom...@hotmail.com
> >> >> CC: dwinsem...@comcast.net; d.rizopou...@erasmusmc.nl;
> >> >> r-help@r-project.org
> >> >>
> >> >> What you need to do is to understand how to use Rprof so that you can
> >> >> determine where the time is being spent. It probably indicates that
> >> >> this is not the source of slowness in your optimization function. How
> >> >> much time are we talking about? You may spend more time trying to
> >> >> optimize the function than just running the current version even if it
> >> >> is "slow" (slow is a relative term and does not hold much meaning
> >> >> without some context around it).
> >> >>
> >> >> On Sat, Oct 31, 2009 at 11:36 PM, parkbomee <bbom...@hotmail.com>
> >> >> wrote:
> >> >> >
> >> >> > Thank you both.
> >> >> >
> >> >> > However, using tapply() instead of a loop does not seem to improve my
> >> >> > code much.
> >> >> > I am using this inside of an optimization function,
> >> >> > and it still takes more time than it needs...
> >> >> >
> >> >> >
> >> >> >
> >> >> >> CC: bbom...@hotmail.com; r-help@r-project.org
> >> >> >> From: dwinsem...@comcast.net
> >> >> >> To: d.rizopou...@erasmusmc.nl
> >> >> >> Subject: Re: [R] avoiding loop
> >> >> >> Date: Sat, 31 Oct 2009 22:26:17 -0400
> >> >> >>
> >> >> >> This is pretty much equivalent:
> >> >> >>
> >> >> >> tapply(DF$value[DF$choice==1], DF$time[DF$choice==1], sum) /
> >> >> >> tapply(DF$value, DF$time, sum)
> >> >> >>
> >> >> >> And both will probably fail if the number of groups with
> >> >> >> choice==1 is different than the number overall.
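
The caveat above can be seen on a tiny made-up data frame in which one time value has no row with choice==1. The numbers are illustrative only. Subsetting gives a numerator with fewer groups than the denominator, while multiplying by the logical mask keeps the two aligned.

```r
# Hypothetical data: time 2 has no row with choice == 1
DF <- data.frame(time   = c(1, 1, 2, 2, 3, 3),
                 choice = c(1, 0, 0, 0, 1, 0),
                 value  = c(3, 2, 4, 2, 5, 5))
sel <- DF$choice == 1

n <- tapply(DF$value[sel], DF$time[sel], sum)   # groups "1" and "3" only
d <- tapply(DF$value, DF$time, sum)             # groups "1", "2" and "3"
length(n)
length(d)

# Masking instead of subsetting keeps one entry per time value
n_masked <- tapply(DF$value * sel, DF$time, sum)
ratio <- n_masked / d
ratio
```

With the masked numerator, time 2 contributes a ratio of 0 rather than shifting or breaking the division.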
> >> >> >>
> >> >> >> --
> >> >> >> David.
> >> >> >>
> >> >> >> On Oct 31, 2009, at 5:14 PM, Dimitris Rizopoulos wrote:
> >> >> >>
> >> >> >> > one approach is the following:
> >> >> >> >
> >> >> >> > # say 'DF' is your data frame, then
> >> >> >> > with(DF, {
> >> >> >> >     ind <- choice == 1
> >> >> >> >     n <- tapply(value[ind], time[ind], sum)
> >> >> >> >     d <- tapply(value, time, sum)
> >> >> >> >     n / d
> >> >> >> > })
> >> >> >> >
> >> >> >> >
> >> >> >> > I hope it helps.
> >> >> >> >
> >> >> >> > Best,
> >> >> >> > Dimitris
> >> >> >> >
> >> >> >> >
> >> >> >> > parkbomee wrote:
> >> >> >> >> Hi all,
> >> >> >> >> I am trying to figure out a way to improve my code's efficiency
> >> >> >> >> by avoiding the use of a loop.
> >> >> >> >> I want to calculate a conditional mean(?) given time.
> >> >> >> >> For example, from the data below, I want to calculate sum((value|
> >> >> >> >> choice==1)/sum(value)) across time.
> >> >> >> >> Is there a way to do it without using a loop?
> >> >> >> >> time cum_time choice value
> >> >> >> >>    1        4      1     3
> >> >> >> >>    1        4      0     2
> >> >> >> >>    1        4      0     3
> >> >> >> >>    1        4      0     3
> >> >> >> >>    2        6      1     4
> >> >> >> >>    2        6      0     4
> >> >> >> >>    2        6      0     2
> >> >> >> >>    2        6      0     4
> >> >> >> >>    2        6      0     2
> >> >> >> >>    2        6      0     2
> >> >> >> >>    3        4      1     2
> >> >> >> >>    3        4      0     3
> >> >> >> >>    3        4      0     5
> >> >> >> >>    3        4      0     2
> >> >> >> >> My code looks like
> >> >> >> >> objective[1] = value[1] / sum(value[1:cum_time[1]])
> >> >> >> >> for (i in 2:max(time)){
> >> >> >> >>   objective[i] = value[cum_time[i-1]+1] /
> >> >> >> >>     sum(value[(cum_time[i-1]+1):cum_time[i]])
> >> >> >> >> }
> >> >> >> >> sum(objective)
> >> >> >> >> Anyone have an idea that I can do this without using a loop??
> >> >> >> >> Thanks.
> >> >> >> >>
> >> >> >> >> ______________________________________________
> >> >> >> >> R-help@r-project.org mailing list
> >> >> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> >> >> PLEASE do read the posting guide
> >> >> >> >> http://www.R-project.org/posting-guide.html
> >> >> >> >> and provide commented, minimal, self-contained, reproducible
> >> >> >> >> code.
> >> >> >> >
> >> >> >> > --
> >> >> >> > Dimitris Rizopoulos
> >> >> >> > Assistant Professor
> >> >> >> > Department of Biostatistics
> >> >> >> > Erasmus University Medical Center
> >> >> >> >
> >> >> >> > Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
> >> >> >> > Tel: +31/(0)10/7043478
> >> >> >> > Fax: +31/(0)10/7043014
> >> >> >> >
> >> >> >>
> >> >> >> David Winsemius, MD
> >> >> >> Heritage Laboratories
> >> >> >> West Hartford, CT
> >> >> >>
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Jim Holtman
> >> >> Cincinnati, OH
> >> >> +1 513 646 9390
> >> >>
> >> >> What is the problem that you are trying to solve?
> >> >
> >>
> >> --
> >> Martin Morgan
> >> Computational Biology / Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N.
> >> PO Box 19024 Seattle, WA 98109
> >>
> >> Location: Arnold Building M1 B861
> >> Phone: (206) 667-2793
> >
> 
> 
> 
> -- 
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
> 
> What is the problem that you are trying to solve?
                                          
