Thanks for your help.
> Date: Mon, 2 Nov 2009 18:50:42 -0500 > Subject: Re: [R] avoiding loop > From: jholt...@gmail.com > To: bbom...@hotmail.com > CC: mtmor...@fhcrc.org; r-help@r-project.org > > The first thing I would suggest is convert your dataframes to matrices > so that you are not having to continually convert them in the calls to > the functions. Also I am not sure what the code: > > realized_prob = with(DF, { > ind <- (CHOSEN == 1) > n <- tapply(theta_multiple[ind], > CS[ind], sum) > d <- tapply(theta_multiple, CS, sum) > n / d > }) > > is doing. It looks like 'n' and 'd' might have different lengths > since they are being created by two different (CS & CS[ind]) > sequences. I have no idea why you are converting to the "DF" > dataframe. THere is no need for that. You could just leave the > vectors (e.g., theta_multiple, CS and ind) as they are and work with > them. This is probably where most of your time is being spent. So if > you start with matrices and leave the dataframes out of the main loop > you will probably see an increase in performance. > > 2009/11/2 parkbomee <bbom...@hotmail.com>: > > This is the Rprof() report by self time. > > Is it also possible that these routines, which take long self.time, are > > causing the optim() to be slow? > > > > > > $by.self > > self.time self.pct total.time total.pct > > "FUN" 94.16 16.5 94.16 16.5 > > "unlist" 80.46 14.1 120.54 21.1 > > "lapply" 76.94 13.5 255.48 44.7 > > "match" 60.76 10.6 60.88 10.7 > > "as.matrix.data.frame" 31.00 5.4 51.12 8.9 > > "as.character" 29.28 5.1 29.28 5.1 > > "unique.default" 24.36 4.3 24.40 4.3 > > "data.frame" 21.06 3.7 55.78 9.8 > > "split.default" 20.42 3.6 84.38 14.8 > > "tapply" 13.84 2.4 414.28 72.5 > > "structure" 11.32 2.0 22.36 3.9 > > "factor" 11.08 1.9 127.68 22.3 > > "attributes<-" 11.00 1.9 11.00 1.9 > > "==" 10.56 1.8 10.56 1.8 > > "%*%" 10.30 1.8 10.30 1.8 > > "as.vector" 10.22 1.8 10.22 1.8 > > "as.integer" 9.86 1.7 9.86 1.7 > > "list" 9.64 1.7 9.64 1.7 > > "exp" 7.12 1.2 7.12 1.2 > > "as.data.frame.integer" 5.98 1.0 8.10 1.4 > > > >> To: bbom...@hotmail.com > >> CC: jholt...@gmail.com; r-help@r-project.org > >> Subject: Re: [R] avoiding loop > >> From: mtmor...@fhcrc.org > >> Date: Sun, 1 Nov 2009 22:14:09 -0800 > >> > >> parkbomee <bbom...@hotmail.com> writes: > >> > >> > Thank you all. > >> > > >> > What Chuck has suggested might not be applicable since the number of > >> > different times is around 40,000. > >> > > >> > The object of optimization in my function is the varying "value", > >> > which is basically data * parameter, of which "parameter" is the > >> > object of optimization.. > >> > > >> > And from the r profiling with a subset of data, > >> > I got this report..any idea what "<Anonymous>" is? > >> > > >> > > >> > $by.total > >> > total.time total.pct self.time self.pct > >> > "<Anonymous>" 571.56 100.0 0.02 0.0 > >> > "optim" 571.56 100.0 0.00 0.0 > >> > "fn" 571.54 100.0 0.98 0.2 > >> > >> You're giving us 'by.total', so these are saying that all the time was > >> spent in these functions or the functions they called. Probably all > >> are in 'optim' and its arguments; since little self.time is spent > >> here, there isn't much to work with > >> > >> > "eval" 423.74 74.1 0.00 0.0 > >> > "with.default" 423.74 74.1 0.00 0.0 > >> > "with" 423.74 74.1 0.00 0.0 > >> > >> These are probably in the internals of optim, where the function > >> you're trying to optimize is being set up for evaluation. Again > >> there's little self.time, and all these say is that a big piece of the > >> time is being spent in code called by this code. > >> > >> > "tapply" 414.28 72.5 13.84 2.4 > >> > "lapply" 255.48 44.7 76.94 13.5 > >> > "factor" 127.68 22.3 11.08 1.9 > >> > "unlist" 120.54 21.1 80.46 14.1 > >> > "FUN" 94.16 16.5 94.16 16.5 > >> > >> these look like they are tapply-related calls (looking at the code for > >> tapply, it calls lapply, factor, and unlist, and FUN is the function > >> argument to tapply), perhaps from the function you're optimizing (did > >> you implement this as suggested below? it would really help to have a > >> possibly simplified version of the code you're calling). > >> > >> There is material to work with here, as apparently a fairly large > >> amount of self.time is being spent in each of these functions. So > >> here's a sample data set > >> > >> n <- 100000 > >> set.seed(123) > >> df <- data.frame(time=sort(as.integer(ceiling(runif(n)*n/5))), > >> value=ceiling(runif(n)*5)) > >> > >> It would have been helpful for you to provide reproducible code like > >> that above, so that the characteristics of your data were easily > >> reproducible. Let's time tapply > >> > >> > replicate(5, { > >> + system.time(x0 <<- tapply0(df$value, df$time, sum), gcFirst=TRUE)[[1]] > >> + }) > >> [1] 0.316 0.316 0.308 0.320 0.304 > >> > >> tapply is quite general, but in your case I think you'd be happy with > >> > >> tapply1 <- function(X, INDEX, FUN) > >> unlist(lapply(split(X, INDEX), FUN), use.names=FALSE) > >> > >> > replicate(5, { > >> + system.time(x1 <<- tapply1(df$value, df$time, sum), gcFirst=TRUE)[[1]] > >> + }) > >> [1] 0.156 0.148 0.152 0.144 0.152 > >> > >> so about twice the speed (timing depends quite a bit on what 'time' is, > >> integer or numeric or character or factor). The vector values of the > >> two calculations are identical, though tapply presents the data as an > >> array with column names > >> > >> > identical(as.vector(x0), x1) > >> [1] TRUE > >> > >> tapply allows FUN to be anything, but if the interest is in the sum of > >> each time interval, and the time intervals can be assumed to be sorted > >> (sorting is not expensive, so could be done on the fly), then > >> > >> tapply2 <- function(X, INDEX) > >> { > >> csum <- cumsum(c(0, X)) > >> idx <- diff(INDEX) != 0 > >> csum[c(FALSE, idx, TRUE)] - csum[c(TRUE, idx, FALSE)] > >> } > >> > >> calculates the cumulative sum and the points in INDEX where the time > >> intervals change. It then takes the difference over the appropriate > >> interval. > >> > >> > replicate(5, { > >> + system.time(x2 <<- tapply2(df$value, df$time), gcFirst=TRUE)[[1]] > >> + }) > >> [1] 0.024 0.024 0.024 0.024 0.024 > >> > identical(as.vector(x0), x2) > >> [1] TRUE > >> > >> This approach could be subject to rounding error (if csum gets very > >> large and the intervals remain small). To calculate values where > >> choice == 1 I think you'd want to > >> > >> tapply2(df$value * (df$choice==1), df$time) > >> > >> rather than sub-setting, so that the result of tapply2 is always a > >> vector of the same length even when some time intervals never have > >> choice==1. > >> > >> Because tapply in these examples seems so fast compared to your > >> calculation, I wonder whether optim is evaluating your function many > >> times, and that reformulating the optimization might lead to a very > >> substantial speed-up? > >> > >> Martin > >> > >> > . > >> > . > >> > . > >> > . > >> > . > >> > > >> > > >> >> Date: Sun, 1 Nov 2009 15:35:41 -0400 > >> >> Subject: Re: [R] avoiding loop > >> >> From: jholt...@gmail.com > >> >> To: bbom...@hotmail.com > >> >> CC: dwinsem...@comcast.net; d.rizopou...@erasmusmc.nl; > >> >> r-help@r-project.org > >> >> > >> >> What you need to do is to understand how to use Rprof so that you can > >> >> determine where the time is being spent. It probably indicates that > >> >> this is not the source of slowness in your optimization function. How > >> >> much time are we talking about? You may spent more time trying to > >> >> optimize the function than just running the current version even if it > >> >> is "slow" (slow is a relative term and does not hold much meaning > >> >> without some context round it). > >> >> > >> >> On Sat, Oct 31, 2009 at 11:36 PM, parkbomee <bbom...@hotmail.com> > >> >> wrote: > >> >> > > >> >> > Thank you both. > >> >> > > >> >> > However, using tapply() instead of a loop does not seem to improve my > >> >> > code much. > >> >> > I am using this inside of an optimization function, > >> >> > and it still takes more than it needs... > >> >> > > >> >> > > >> >> > > >> >> >> CC: bbom...@hotmail.com; r-help@r-project.org > >> >> >> From: dwinsem...@comcast.net > >> >> >> To: d.rizopou...@erasmusmc.nl > >> >> >> Subject: Re: [R] avoiding loop > >> >> >> Date: Sat, 31 Oct 2009 22:26:17 -0400 > >> >> >> > >> >> >> This is pretty much equivalent: > >> >> >> > >> >> >> tapply(DF$value[DF$choice==1], DF$time[DF$choice==1], sum) / > >> >> >> tapply(DF$value, DF$time, sum) > >> >> >> > >> >> >> And both will probably fail if the number of groups with choice==1 > >> >> >> is > >> >> >> different than the number overall. > >> >> >> > >> >> >> -- > >> >> >> David. > >> >> >> > >> >> >> On Oct 31, 2009, at 5:14 PM, Dimitris Rizopoulos wrote: > >> >> >> > >> >> >> > one approach is the following: > >> >> >> > > >> >> >> > # say 'DF' is your data frame, then > >> >> >> > with(DF, { > >> >> >> > ind <- choice == 1 > >> >> >> > n <- tapply(value[ind], time[ind], sum) > >> >> >> > d <- tapply(value, time, sum) > >> >> >> > n / d > >> >> >> > }) > >> >> >> > > >> >> >> > > >> >> >> > I hope it helps. > >> >> >> > > >> >> >> > Best, > >> >> >> > Dimitris > >> >> >> > > >> >> >> > > >> >> >> > parkbomee wrote: > >> >> >> >> Hi all, > >> >> >> >> I am trying to figure out a way to improve my code's efficiency > >> >> >> >> by > >> >> >> >> avoiding the use of loop. > >> >> >> >> I want to calculate a conditional mean(?) given time. > >> >> >> >> For example, from the data below, I want to calculate sum((value| > >> >> >> >> choice==1)/sum(value)) across time. > >> >> >> >> Is there a way to do it without using a loop? > >> >> >> >> time cum_time choice value > >> >> >> >> 1 4 1 3 > >> >> >> >> 1 4 0 2 > >> >> >> >> 1 4 0 3 > >> >> >> >> 1 4 0 3 > >> >> >> >> 2 6 1 4 > >> >> >> >> 2 6 0 4 > >> >> >> >> 2 6 0 2 > >> >> >> >> 2 6 0 4 > >> >> >> >> 2 6 0 2 > >> >> >> >> 2 6 0 2 3 4 > >> >> >> >> 1 2 3 4 0 3 3 > >> >> >> >> 4 0 5 3 4 0 2 > >> >> >> >> My code looks like > >> >> >> >> objective[1] = value[1] / sum(value[1:cum_time[1]) > >> >> >> >> for (i in 2:max(time)){ > >> >> >> >> objective[i] = value[cum_time[i-1]+1] / > >> >> >> >> sum(value[(cum_time[i-1]+1) : cum_time[i])]) > >> >> >> >> } > >> >> >> >> sum(objective) > >> >> >> >> Anyone have an idea that I can do this without using a loop?? > >> >> >> >> Thanks. > >> >> >> >> > >> >> >> >> _________________________________________________________________ > >> >> >> >> [[elided Hotmail spam]] > >> >> >> >> [[alternative HTML version deleted]] > >> >> >> >> ______________________________________________ > >> >> >> >> R-help@r-project.org mailing list > >> >> >> >> https://stat.ethz.ch/mailman/listinfo/r-help > >> >> >> >> PLEASE do read the posting guide > >> >> >> >> http://www.R-project.org/posting-guide.html > >> >> >> >> and provide commented, minimal, self-contained, reproducible > >> >> >> >> code. > >> >> >> > > >> >> >> > -- > >> >> >> > Dimitris Rizopoulos > >> >> >> > Assistant Professor > >> >> >> > Department of Biostatistics > >> >> >> > Erasmus University Medical Center > >> >> >> > > >> >> >> > Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands > >> >> >> > Tel: +31/(0)10/7043478 > >> >> >> > Fax: +31/(0)10/7043014 > >> >> >> > > >> >> >> > ______________________________________________ > >> >> >> > R-help@r-project.org mailing list > >> >> >> > https://stat.ethz.ch/mailman/listinfo/r-help > >> >> >> > PLEASE do read the posting guide > >> >> >> > http://www.R-project.org/posting-guide.html > >> >> >> > and provide commented, minimal, self-contained, reproducible code. > >> >> >> > >> >> >> David Winsemius, MD > >> >> >> Heritage Laboratories > >> >> >> West Hartford, CT > >> >> >> > >> >> > > >> >> > _________________________________________________________________ > >> >> > [[elided Hotmail spam]] > >> >> > [[alternative HTML version deleted]] > >> >> > > >> >> > ______________________________________________ > >> >> > R-help@r-project.org mailing list > >> >> > https://stat.ethz.ch/mailman/listinfo/r-help > >> >> > PLEASE do read the posting guide > >> >> > http://www.R-project.org/posting-guide.html > >> >> > and provide commented, minimal, self-contained, reproducible code. > >> >> > > >> >> > >> >> > >> >> > >> >> -- > >> >> Jim Holtman > >> >> Cincinnati, OH > >> >> +1 513 646 9390 > >> >> > >> >> What is the problem that you are trying to solve? > >> > > >> > _________________________________________________________________ > >> > [[elided Hotmail spam]] > >> > > >> > [[alternative HTML version deleted]] > >> > > >> > ______________________________________________ > >> > R-help@r-project.org mailing list > >> > https://stat.ethz.ch/mailman/listinfo/r-help > >> > PLEASE do read the posting guide > >> > http://www.R-project.org/posting-guide.html > >> > and provide commented, minimal, self-contained, reproducible code. > >> > >> -- > >> Martin Morgan > >> Computational Biology / Fred Hutchinson Cancer Research Center > >> 1100 Fairview Ave. N. > >> PO Box 19024 Seattle, WA 98109 > >> > >> Location: Arnold Building M1 B861 > >> Phone: (206) 667-2793 > > > > ________________________________ > > ³» ¸¾´ë·Î ¹Ù²Ù´Â ¹è°æ È¸é »ö, ´õ¿í ´Ù¾çÇØÁø ȨÆäÀÌÁö ½½¶óÀÌµå ¼î ±¸¼º! È® > > ´Þ¶óÁø MSN ȨÆäÀÌÁö! > > > > -- > Jim Holtman > Cincinnati, OH > +1 513 646 9390 > > What is the problem that you are trying to solve? _________________________________________________________________ À©µµ¿ì ¶óÀ̺ê ÄûÁîÇ®°í~¼±¹° ¹ÞÀ¸¼¼¿ä~ [[alternative HTML version deleted]]
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.