OK, so you want to find some summary statistics for each column, where some columns could be completely missing.
Writing a small wrapper should help. When you use apply(), you are actually applying a function to every column (or row). First, let us simulate a dataset with 15 days/rows and 10 stations/columns ### simulate data set.seed(1) # for reproducibility mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10) mat[ mat > 45 ] <- NA # create some missing values mat[ ,9 ] <- NA # station 9's data is completely missing Here are two example of such wrappers : find.stats1 <- function( data, threshold=c(37,39,41) ){ n <- length(threshold) out <- matrix( nrow=(n + 1), ncol=ncol(data) ) # initialise out[1, ] <- apply(data, 2, function(x) ifelse( all(is.na(x)), NA, max(x, na.rm=T) )) for(i in 1:n) out[ i+1, ] <- colSums( data > threshold[i], na.rm=T ) rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") ) colnames(out) <- rownames(data) # name of the stations return( out ) } find.stats2 <- function( data, threshold=c(37,39,41) ){ n <- length(threshold) excess <- numeric( n ) out <- matrix( nrow=(n + 1), ncol=ncol(data) ) # initialise good <- which( apply( data, 2, function(x) !all(is.na(x)) ) ) # colums that are not completely missing out[ , good] <- apply( data[ , good], 2, function(x){ m <- max( x, na.rm=T ) for(i in 1:n){ excess[i] <- sum( x > threshold[i], na.rm=TRUE ) } return( c(m, excess) ) } ) rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") ) colnames(out) <- rownames(data) # name of the stations return( out ) } find.stats1( mat ) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] daily_max 44 42 39 41 45 43 42 45 NA 42 above_37 2 1 2 1 3 2 2 1 0 1 above_39 2 1 0 1 3 2 1 1 0 1 above_41 2 1 0 0 2 2 1 1 0 1 find.stats2( mat ) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] daily_max 44 42 39 41 45 43 42 45 NA 42 above_37 2 1 2 1 3 2 2 1 NA 1 above_39 2 1 0 1 3 2 1 1 NA 1 above_41 2 1 0 0 2 2 1 1 NA 1 On my laptop 'find.stats1' and 'find.stats2' (which is more flexible) takes 7 and 6 seconds respectively to execute on a dataset with 10000 stations and 365 days. Regards, Adai On Fri, 2005-06-10 at 20:05 +0200, Sander Oom wrote: > Dear all, > > Dimitris and Andy, thanks for your great help. I have progressed to the > following code which runs very fast and effective: > > mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10) > mat[mat>45] <- NA > mat<-NA > mat > temps <- c(35, 37, 39) > ind <- rbind( > t(sapply(temps, function(temp) > rowSums(mat > temp, na.rm=TRUE) )), > rowSums(!is.na(mat), na.rm=FALSE), > apply(mat, 1, max, na.rm=TRUE)) > ind <- t(ind) > ind > > However, some weather stations have missing values for the whole year. > Unfortunately, the code breaks down (when uncommenting mat<-NA). > > I have tried 'ifelse' statements in the functions, but it becomes even > more of a mess. I could subset the matrix before hand, but this would > mean merging with a complete matrix afterwards to make it compatible > with other years. That would slow things down. > > How can I make the code robust for rows containing all missing values? > > Thanks for your help, > > Sander. > > Dimitris Rizopoulos wrote: > > for the maximum you could use something like: > > > > ind[, 1] <- apply(mat, 2, max) > > > > I hope it helps. > > > > Best, > > Dimitris > > > > ---- > > Dimitris Rizopoulos > > Ph.D. Student > > Biostatistical Centre > > School of Public Health > > Catholic University of Leuven > > > > Address: Kapucijnenvoer 35, Leuven, Belgium > > Tel: +32/16/336899 > > Fax: +32/16/337015 > > Web: http://www.med.kuleuven.ac.be/biostat/ > > http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm > > > > > > > > ----- Original Message ----- > > From: "Sander Oom" <[EMAIL PROTECTED]> > > To: "Dimitris Rizopoulos" <[EMAIL PROTECTED]> > > Cc: <r-help@stat.math.ethz.ch> > > Sent: Friday, June 10, 2005 12:10 PM > > Subject: Re: [R] Replacing for loop with tapply!? > > > > > >>Thanks Dimitris, > >> > >>Very impressive! Much faster than before. > >> > >>Thanks to new found R.basic, I can simply rotate the result with > >>rotate270{R.basic}: > >> > >>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000) > >>>temps <- c(37, 39, 41) > >>>################# > >>>#ind <- matrix(0, length(temps), ncol(mat)) > >>>ind <- matrix(0, 4, ncol(mat)) > >>>(startDate <- date()) > >>[1] "Fri Jun 10 12:08:01 2005" > >>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i]) > >>>ind[4, ] <- colMeans(max(mat)) > >>Error in colMeans(max(mat)) : 'x' must be an array of at least two > >>dimensions > >>>(endDate <- date()) > >>[1] "Fri Jun 10 12:08:02 2005" > >>>ind <- rotate270(ind) > >>>ind[1:10,] > >> V4 V3 V2 V1 > >>1 0 56 75 80 > >>2 0 46 53 60 > >>3 0 50 58 67 > >>4 0 60 72 80 > >>5 0 59 68 76 > >>6 0 55 67 74 > >>7 0 62 77 93 > >>8 0 45 57 67 > >>9 0 57 68 75 > >>10 0 61 66 76 > >> > >>However, I have not managed to get the row maximum using your > >>method? It > >>should be 50 for most rows, but my first guess code gives an error! > >> > >>Any suggestions? > >> > >>Sander > >> > >> > >> > >>Dimitris Rizopoulos wrote: > >>>maybe you are looking for something along these lines: > >>> > >>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000) > >>>temps <- c(37, 39, 41) > >>>################# > >>>ind <- matrix(0, length(temps), ncol(mat)) > >>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i]) > >>>ind > >>> > >>> > >>>I hope it helps. > >>> > >>>Best, > >>>Dimitris > >>> > >>>---- > >>>Dimitris Rizopoulos > >>>Ph.D. Student > >>>Biostatistical Centre > >>>School of Public Health > >>>Catholic University of Leuven > >>> > >>>Address: Kapucijnenvoer 35, Leuven, Belgium > >>>Tel: +32/16/336899 > >>>Fax: +32/16/337015 > >>>Web: http://www.med.kuleuven.ac.be/biostat/ > >>> http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm > >>> > >>> > >>>----- Original Message ----- > >>>From: "Sander Oom" <[EMAIL PROTECTED]> > >>>To: <r-help@stat.math.ethz.ch> > >>>Sent: Friday, June 10, 2005 10:50 AM > >>>Subject: [R] Replacing for loop with tapply!? > >>> > >>> > >>>>Dear all, > >>>> > >>>>We have a large data set with temperature data for weather stations > >>>>across the globe (15000 stations). > >>>> > >>>>For each station, we need to calculate the number of days a certain > >>>>temperature is exceeded. > >>>> > >>>>So far we used the following S code, where mat88 is a matrix > >>>>containing > >>>>rows of 365 daily temperatures for each of 15000 weather stations: > >>>> > >>>>m <- 37 > >>>>n <- 2 > >>>>outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88)) > >>>>for(i in 1:nrow(mat88)) { > >>>># i <- 3 > >>>>row1 <- as.data.frame(df88[i, ]) > >>>>temprow37 <- select.rows(row1, row1 > m) > >>>>temprow39 <- select.rows(row1, row1 > m + n) > >>>>temprow41 <- select.rows(row1, row1 > m + 2 * n) > >>>>outmat88[i, 1] <- max(row1, na.rm = T) > >>>>outmat88[i, 2] <- count.rows(temprow37) > >>>>outmat88[i, 3] <- count.rows(temprow39) > >>>>outmat88[i, 4] <- count.rows(temprow41) > >>>>} > >>>>outmat88 > >>>> > >>>>We have transferred the data to a more potent Linux box running R, > >>>>but > >>>>still hope to speed up the code. > >>>> > >>>>I know a for loop should be avoided when looking for speed. I also > >>>>know > >>>>the answer is in something like tapply, but my understanding of > >>>>these > >>>>commands is still to limited to see the solution. Could someone > >>>>show > >>>>me > >>>>the way!? > >>>> > >>>>Thanks in advance, > >>>> > >>>>Sander. > >>>>-- > > ______________________________________________ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html > ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html