Doug is right, I think, that this would be easier with full indexing using the matrix.coo classe, if you want to use SparseM. But then the tapply seems to be the way to go.
url: www.econ.uiuc.edu/~roger Roger Koenker email [EMAIL PROTECTED] Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820 On Feb 1, 2007, at 7:22 AM, Douglas Bates wrote: > On 1/31/07, Jon Stearley <[EMAIL PROTECTED]> wrote: >> I need to sum the columns of a sparse matrix according to a factor - >> ie given a sparse matrix X and a factor fac of length ncol(X), sum >> the elements by column factors and return the sparse matrix Y of size >> nrow(X) by nlevels(f). The appended code does the job, but is >> unacceptably memory-bound because tapply() uses a non-sparse >> representation. Can anyone suggest a more memory and cpu efficient >> approach? Eg, a sparse matrix tapply method? Thanks. > > This is the sort of operation that is much more easily performed in > the triplet representation of a sparse matrix where each nonzero > element is represented by its row index, column index and value. > Using that representation you could map the column indices according > to the factor then convert back to one of the other representations. > The only question would be what to do about nonzeros in different > columns of the original matrix that get mapped to the same element in > the result. It turns out that in the sparse matrix code used by the > Matrix package the triplet representation allows for duplicate index > positions with the convention that the resulting value at a position > is the sum of the values of any triplets with that index pair. > > If you decide to use this approach please be aware that the indices > for the triplet representation in the Matrix package are 0-based (as > in C code) not 1-based (as in R code). (I imagine that Martin is > thinking "we really should change that" as he reads this part.) > >> >> -- >> +--------------------------------------------------------------+ >> | Jon Stearley (505) 845-7571 (FAX 844-9297) | >> | Sandia National Laboratories Scalable Systems Integration | >> +--------------------------------------------------------------+ >> >> >> # x and y are of SparseM class matrix.csr >> "aggregate.csr" <- >> function(x, fac) { >> # make a vector indicating the row of each nonzero >> rows <- integer(length=length([EMAIL PROTECTED])) >> [EMAIL PROTECTED]:nrow(x)]] <- 1 # put a 1 at start of each row >> rows <- as.integer(cumsum(rows)) # and finish with a cumsum >> >> # make a vector indicating the column factor of each nonzero >> f <- [EMAIL PROTECTED] >> >> # aggregate by row,f >> y <- tapply([EMAIL PROTECTED], list(rows,f), sum) >> >> # sparsify it >> y[is.na(y)] <- 0 # change tapply NAs to as.matrix.csr 0s >> y <- as.matrix.csr(y) >> >> y >> } >> ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.