I would like to thank once more everyone who helped me with this question. I compared the speed for different approaches. Below are the results of my comparisons - in case anyone is interested:
### Building an EXAMPLE FRAME with N rows - with groups and a lot of NAs: N<-100000 set.seed(1234) frame<-data.frame(group=rep(paste("group",1:10),N/10),a=rnorm(1:N),b=rnorm(1:N),c=rnorm(1:N),d=rnorm(1:N),e=rnorm(1:N),f=rnorm(1:N),g=rnorm(1:N)) frame<-frame[order(frame$group),] ## Introducing 60% NAs: names.used<-names(frame)[2:length(frame)] set.seed(1234) for(i in names.used){ i.for.NA<-sample(1:N,round((N*.6),0)) frame[[i]][i.for.NA]<-NA } lapply(frame[2:8], function(x) length(x[is.na(x)])) # Checking that it worked ORIGframe<-frame ## placeholder for the unchanged original frame ####### Objective of the code - divide each value by its group mean #### ### METHOD 1 - the FASTEST - using ave():############################## frame<-ORIGframe f2 <- function(frame) { for(i in 2:ncol(frame)) { frame[,i] <- ave(frame[,i], frame[,1], FUN=function(x)x/mean(x,na.rm=TRUE)) } frame } system.time({new.frame<-f2(frame)}) # Took me 0.23-0.27 sec ####################################### ### METHOD 2 - fast, just a bit slower - using data.table: ############################## # If you don't have it - install the package - NOT from CRAN: install.packages("data.table",repos="http://R-Forge.R-project.org") library(data.table) frame<-ORIGframe system.time({ table<-data.table(frame) colMeanFunction<-function(data,key){ data[[key]]=NULL ret=as.matrix(data)/matrix(rep(as.numeric(colMeans(as.data.frame(data),na.rm=T)),nrow(data)),nrow=nrow(data),ncol=ncol(data),byrow=T) return(ret) } groupedMeans = table[,colMeanFunction(.SD, "group"), by="group"] names.to.use<-names(groupedMeans) for(i in 1:length(groupedMeans)){groupedMeans[[i]]<-as.data.frame(groupedMeans[[i]])} groupedMeans<-do.call(cbind, groupedMeans) names(groupedMeans)<-names.to.use }) # Took me 0.37-.45 sec ####################################### ### METHOD 3 - fast, a tad slower (using model.matrix & matrix multiplication):############################## frame<-ORIGframe system.time({ mat <- as.matrix(frame[,-1]) mm <- model.matrix(~0+group,frame) col.grp.N <- crossprod( !is.na(mat), mm ) # Use this line if don't want to use NAs for mean calculations # col.grp.N <- crossprod( mat != 0 , mm ) # Use this line if don't want to use zeros for mean calculations mat[is.na(mat)] <- 0.0 col.grp.sum <- crossprod( mat, mm ) mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group,] ) is.na(mat) <- is.na(frame[,-1]) mat<-as.data.frame(mat) }) # Took me 0.44-0.50 sec ####################################### ### METHOD 5- much slower - it's the one I started with:############################## frame<-ORIGframe system.time({ frame <- do.call(cbind, lapply(names.used, function(x){ unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=T))) })) }) # Took me 1.25-1.32 min ####################################### ### METHOD 6 - the slowest; using "plyr" and "ddply":############################## frame<-ORIGframe library(plyr) function3 <- function(x) x / mean(x, na.rm = TRUE) system.time({ grouping.factor<-"group" myvariables<-names(frame)[2:8] frame3<-ddply(frame, grouping.factor, colwise(function3, myvariables)) }) # Took me 1.36-1.47 min ####################################### Thanks again! Dimitri On Wed, Mar 31, 2010 at 8:29 PM, William Dunlap <wdun...@tibco.com> wrote: > Dimitri, > > You might try applying ave() to each column. E.g., use > > f2 <- function(frame) { > for(i in 2:ncol(frame)) { > frame[,i] <- ave(frame[,i], frame[,1], > FUN=function(x)x/mean(x,na.rm=TRUE)) > } > frame > } > > Note that this returns a data.frame and retains the > grouping column (the first) while your original > code returns a matrix without the grouping column. > > Bill Dunlap > Spotfire, TIBCO Software > wdunlap tibco.com > >> -----Original Message----- >> From: r-help-boun...@r-project.org >> [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter >> Sent: Tuesday, March 30, 2010 10:52 AM >> To: 'Dimitri Liakhovitski'; 'r-help' >> Subject: Re: [R] Code is too slow: mean-centering variables >> in a data framebysubgroup >> >> ?scale >> >> Bert Gunter >> Genentech Nonclinical Biostatistics >> >> >> >> -----Original Message----- >> From: r-help-boun...@r-project.org >> [mailto:r-help-boun...@r-project.org] On >> Behalf Of Dimitri Liakhovitski >> Sent: Tuesday, March 30, 2010 8:05 AM >> To: r-help >> Subject: [R] Code is too slow: mean-centering variables in a >> data frame >> bysubgroup >> >> Dear R-ers, >> >> I have a large data frame (several thousands of rows and about 2.5 >> thousand columns). One variable ("group") is a grouping variable with >> over 30 levels. And I have a lot of NAs. >> For each variable, I need to divide each value by variable mean - by >> subgroup. I have the code but it's way too slow - takes me about 1.5 >> hours. >> Below is a data example and my code that is too slow. Is there a >> different, faster way of doing the same thing? >> Thanks a lot for your advice! >> >> Dimitri >> >> >> # Building an example frame - with groups and a lot of NAs: >> set.seed(1234) >> frame<-data.frame(group=rep(paste("group",1:10),10),a=rnorm(1: > 100),b=rnorm(1 >> :100),c=rnorm(1:100),d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:1 >> 00),g=rnorm(1: >> 100)) >> frame<-frame[order(frame$group),] >> names.used<-names(frame)[2:length(frame)] >> set.seed(1234) >> for(i in names.used){ >> i.for.NA<-sample(1:100,60) >> frame[[i]][i.for.NA]<-NA >> } >> frame >> >> ### Code that does what's needed but is too slow: >> Start<-Sys.time() >> frame <- do.call(cbind, lapply(names.used, function(x){ >> unlist(by(frame, frame$group, function(y) y[,x] / >> mean(y[,x],na.rm=T))) >> })) >> Finish<-Sys.time() >> print(Finish-Start) # Takes too long >> >> -- >> Dimitri Liakhovitski >> Ninah.com >> dimitri.liakhovit...@ninah.com >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > -- Dimitri Liakhovitski Ninah.com dimitri.liakhovit...@ninah.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.