Re: [R] memory-efficient column aggregation of a sparse matrix

roger koenker Thu, 01 Feb 2007 05:33:39 -0800

Doug is right, I think, that this would be easier with full indexing
using the matrix.coo class, if you want to use SparseM.  But
then tapply seems to be the way to go.
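
A minimal sketch of that idea in base R, assuming the nonzeros are
already available as full-index triplets (row, column, value), for
example after converting to a coordinate format. The toy data and the
names i, j, v, nr and agg are hypothetical. The point it illustrates is
that tapply over a single combined (row, group) key only allocates one
entry per occupied cell, rather than the full nrow by nlevels table.

```r
# Hypothetical toy data: triplets (i, j, v) of a 3 x 4 sparse matrix,
# with 1-based full row and column indices.
i <- c(1, 1, 2, 3, 3)
j <- c(1, 2, 3, 1, 4)
v <- c(10, 20, 30, 40, 50)
nr <- 3
fac <- factor(c("a", "a", "b", "b"))   # one level per column

g <- as.integer(fac)[j]                # group index of each nonzero
key <- (g - 1) * nr + i                # one numeric key per (row, group) cell

# A single-key tapply returns a vector with one entry per distinct key.
s <- tapply(v, key, sum)
k <- as.numeric(names(s))

# Decode the keys back into row and group indices.
agg <- data.frame(row   = (k - 1) %% nr + 1,
                  group = (k - 1) %/% nr + 1,
                  value = as.vector(s))
```

The rows of agg are again triplets, so they can be handed to whatever
sparse constructor is in use. The key is kept as a double rather than an
integer so that it does not overflow for large dimensions.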


url:    www.econ.uiuc.edu/~roger            Roger Koenker
email    [EMAIL PROTECTED]            Department of Economics
vox:     217-333-4558                University of Illinois
fax:       217-244-6678                Champaign, IL 61820


On Feb 1, 2007, at 7:22 AM, Douglas Bates wrote:

> On 1/31/07, Jon Stearley <[EMAIL PROTECTED]> wrote:
>> I need to sum the columns of a sparse matrix according to a factor -
>> i.e. given a sparse matrix X and a factor fac of length ncol(X), sum
>> the elements by column factors and return the sparse matrix Y of size
>> nrow(X) by nlevels(fac).  The appended code does the job, but is
>> unacceptably memory-bound because tapply() uses a non-sparse
>> representation.  Can anyone suggest a more memory and cpu efficient
>> approach?  E.g., a sparse matrix tapply method?  Thanks.
>
> This is the sort of operation that is much more easily performed in
> the triplet representation of a sparse matrix where each nonzero
> element is represented by its row index, column index and value.
> Using that representation you could map the column indices according
> to the factor then convert back to one of the other representations.
> The only question would be what to do about nonzeros in different
> columns of the original matrix that get mapped to the same element in
> the result.  It turns out that in the sparse matrix code used by the
> Matrix package the triplet representation allows for duplicate index
> positions with the convention that the resulting value at a position
> is the sum of the values of any triplets with that index pair.
>
> If you decide to use this approach please be aware that the indices
> for the triplet representation in the Matrix package are 0-based (as
> in C code) not 1-based (as in R code).  (I imagine that Martin is
> thinking "we really should change that" as he reads this part.)
>
>>
>> --
>> +--------------------------------------------------------------+
>> | Jon Stearley                  (505) 845-7571  (FAX 844-9297) |
>> | Sandia National Laboratories  Scalable Systems Integration   |
>> +--------------------------------------------------------------+
>>
>>
>> # x and y are of SparseM class matrix.csr
>> "aggregate.csr" <-
>> function(x, fac) {
>>          # make a vector indicating the row of each nonzero
>>          rows <- integer(length=length([EMAIL PROTECTED]))
>>          [EMAIL PROTECTED]:nrow(x)]] <- 1 # put a 1 at start of each row
>>          rows <- as.integer(cumsum(rows)) # and finish with a cumsum
>>
>>          # make a vector indicating the column factor of each nonzero
>>          f <- [EMAIL PROTECTED]
>>
>>          # aggregate by row,f
>>          y <- tapply([EMAIL PROTECTED], list(rows,f), sum)
>>
>>          # sparsify it
>>          y[is.na(y)] <- 0  # change tapply NAs to as.matrix.csr 0s
>>          y <- as.matrix.csr(y)
>>
>>          y
>> }
>>
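
For comparison, the triplet approach Doug describes above might look
roughly like this with the Matrix package. This is a sketch under
assumptions, not code from the thread: the toy matrix X and the factor
fac are made up, and it relies on sparseMatrix() summing the values of
duplicated (i, j) pairs, which is its documented default for numeric x.

```r
library(Matrix)

# Hypothetical toy data: a 3 x 4 sparse matrix, one factor level per column.
X <- sparseMatrix(i = c(1, 1, 2, 3, 3),
                  j = c(1, 2, 3, 1, 4),
                  x = c(10, 20, 30, 40, 50),
                  dims = c(3, 4))
fac <- factor(c("a", "a", "b", "b"))

# Triplet form: slots i and j hold 0-based indices, slot x the values.
Tx <- as(X, "TsparseMatrix")

# Map each nonzero's column to its factor level. sparseMatrix() takes
# 1-based indices by default, hence the + 1L, and it adds up triplets
# that land on the same (row, level) position.
Y <- sparseMatrix(i = Tx@i + 1L,
                  j = as.integer(fac)[Tx@j + 1L],
                  x = Tx@x,
                  dims = c(nrow(X), nlevels(fac)),
                  dimnames = list(NULL, levels(fac)))
```

Nothing dense is created along the way; Y is nrow(X) by nlevels(fac) and
stays sparse.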

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
