Hi all,

After reading this interesting discussion I dug a bit deeper into the subject. The snippet of code at the end of my mail compares three ways of performing this task: ddply, ave and one option not yet mentioned, data.table (a package). The code generates mock datasets that vary in size and in the number of levels of the grouping factor. The results look like this (the script also draws a ggplot plot that summarises the table):
> res
   datsize noClasses   tave  tddply tdata.table
...note that I cut out part of the table for readability...
17   1e+07        10  9.160   3.500       1.064
18   1e+07        50 10.126   4.483       1.364
19   1e+07       100 10.485   5.016       1.407
20   1e+07       200 10.680   6.901       1.435
21   1e+07       500 10.801  12.569       1.474
22   1e+07      1000 10.923  21.001       1.540
23   1e+07      2500 11.514  51.020       1.622
24   1e+07     10000 12.158 182.752       1.737

All timings are elapsed seconds as reported by system.time. It is clear that data.table is by far the fastest of the three and scales quite nicely with the number of factor levels, in contrast to ddply, whose run time grows from 3.5 s at 10 levels to about 183 s at 10000 levels. It is also faster than ave by up to a factor of 10 (roughly 7 to 9 times in the rows shown here).

cheers,
Paul

# plyr (ddply) and reshape (melt) were attached automatically by ggplot2 in
# my session (see sessionInfo below); load them explicitly if yours does not.
library(data.table)
library(plyr)
library(reshape)
library(ggplot2)
theme_set(theme_bw())

datsize = c(10e4, 10e5, 10e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)

# For each combination of dataset size and number of classes,
# build a mock dataset and time the three approaches.
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)

  t1 = system.time(res1 <- with(expdata, ave(value, cat, FUN = sum)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), summarise, val = sum(value)))
  t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])

  # Element 3 of system.time's result is the elapsed time
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')

res

ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
  facet_wrap(~ datsize) + geom_line()

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] data.table_1.6.3 ggplot2_0.8.9    proto_0.3-8      reshape_0.8.4
[5] plyr_1.5.2       fortunes_1.4-1

loaded via a namespace (and not attached):
[1] digest_0.4.2 tcltk_2.13.0 tools_2.13.0

On 08/03/2011 01:25 PM, Caroline Faisst wrote:
> Hello there,
>
> I'm computing the total value of an order from the price of the order items
> using a "for" loop and the "ifelse" function. I do this on a large dataframe
> (close to 1m lines). The computation of this function is painfully slow: in
> 1min only about 90 rows are calculated.
>
> The computation time taken for a given number of rows increases with the
> size of the dataset, see the example with my function below:
>
> # small dataset: function performs well
>
> exampledata<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4),itemPrice=c(10,17,9,12,25,10,1,9,7))
>
> exampledata[1,"orderAmount"]<-exampledata[1,"itemPrice"]
>
> system.time(for (i in 2:length(exampledata[,1]))
> {exampledata[i,"orderAmount"]<-ifelse(exampledata[i,"orderID"]==exampledata[i-1,"orderID"],exampledata[i-1,"orderAmount"]+exampledata[i,"itemPrice"],exampledata[i,"itemPrice"])})
>
> # large dataset: the very same computational task takes much longer
>
> exampledata2<-data.frame(orderID=c(1,1,1,2,2,3,3,3,4,5:2000000),itemPrice=c(10,17,9,12,25,10,1,9,7,25:2000020))
>
> exampledata2[1,"orderAmount"]<-exampledata2[1,"itemPrice"]
>
> system.time(for (i in 2:9)
> {exampledata2[i,"orderAmount"]<-ifelse(exampledata2[i,"orderID"]==exampledata2[i-1,"orderID"],exampledata2[i-1,"orderAmount"]+exampledata2[i,"itemPrice"],exampledata2[i,"itemPrice"])})
>
> Does someone know a way to increase the speed?
>
> Thank you very much!
>
> Caroline
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494
http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
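
A side note on the original question: the loop in the quoted message builds a running total within each order, whereas the benchmark in this mail times per-group sums. A minimal base-R sketch of the running total, using the small example data from the quoted message, could look like the following. The column name orderTotal is my own addition for illustration, and I have not timed this on the large dataset.

```r
# Small example data, copied from the quoted message.
exampledata <- data.frame(
  orderID   = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  itemPrice = c(10, 17, 9, 12, 25, 10, 1, 9, 7)
)

# Running total within each order: ave() splits itemPrice by orderID,
# applies cumsum() to each group and puts the results back in row order.
exampledata$orderAmount <- with(exampledata,
                                ave(itemPrice, orderID, FUN = cumsum))

# Assumption: if a single total per order is wanted instead,
# swap cumsum for sum (orderTotal is a hypothetical column name).
exampledata$orderTotal <- with(exampledata,
                               ave(itemPrice, orderID, FUN = sum))

print(exampledata)
```

Unlike the loop, which compares each row with the one before it, ave groups by orderID wherever the rows sit, so the two agree only when the rows of an order are adjacent, as they are in the example data.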