Hey,
On 09/24/2013 03:53 PM, Jed Brown wrote:
> Karl Rupp <[email protected]> writes:
>> I'm not talking about CSR vs. COO from the SpMV point of view, but
>> rather on how to store the actual data in global memory without
>> expensive subsequent sorts.
>
> Sure, but this seems like such a minor detail. With PetscScalar=double
> and PetscInt=int, we have 16 bytes/entry for COO and (nominally) 12
> bytes/entry for CSR, and it only needs to go to GPU global memory and
> back, not across to the CPU. I doubt the difference between 12 and 16
> bytes/entry during assembly is a bottleneck.

I'm not worried about 12 bytes vs. 16 bytes, but rather about the
ordering of the entries as a whole. If one assembles into something
CSR-like, then one can either run the SpMV right away, or first merge
the entries in each row that share the same column index. Such a merge
can usually be done in shared memory, so the memory cost is one read
and one write of the matrix nonzero entries in the worst case.
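
As a rough illustration of the per-row merge, here is a serial C sketch.
The function name csr_merge_duplicates and the plain-array layout are
made up for this example and are not PETSc API. It assumes the entries
are already grouped by row but unordered within a row. The inner loops
stand in for what a thread block would do on a row held in shared
memory on a GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not PETSc API): merge entries with equal column
 * index within each row of a CSR-like matrix, in place.
 * row_ptr has nrows+1 entries; col and val have row_ptr[nrows] entries.
 * Returns the number of entries left after merging. */
size_t csr_merge_duplicates(int nrows, int *row_ptr, int *col, double *val)
{
    assert(nrows >= 0 && row_ptr[0] == 0);
    size_t out = 0;            /* write position in the compacted arrays */
    int begin = row_ptr[0];    /* start of the current row (old layout)  */
    for (int r = 0; r < nrows; ++r) {
        int end = row_ptr[r + 1];
        /* Sort this row by column index (insertion sort: rows are short). */
        for (int i = begin + 1; i < end; ++i) {
            int c = col[i];
            double v = val[i];
            int j = i - 1;
            while (j >= begin && col[j] > c) {
                col[j + 1] = col[j];
                val[j + 1] = val[j];
                --j;
            }
            col[j + 1] = c;
            val[j + 1] = v;
        }
        /* Sum neighbours with equal column index and compact the row. */
        size_t row_start = out;
        for (int i = begin; i < end; ++i) {
            if (out > row_start && col[out - 1] == col[i]) {
                val[out - 1] += val[i];
            } else {
                col[out] = col[i];
                val[out] = val[i];
                ++out;
            }
        }
        row_ptr[r] = (int)row_start;
        begin = end;
    }
    row_ptr[nrows] = (int)out;
    return out;
}
```

Each row is read once and written once, and no entry ever moves to a
different row, which is the point: the global ordering is already in
place and only local work remains.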

In contrast, if everything is assembled into a general COO format,
then one needs to sort the triplets by row before one can even run
SpMVs. The memory transactions required for this are O(N log(N)),
with N being the number of nonzeros. N is larger than 10^6 in almost
all cases, so for a comparison-based sort log2(N) is already about 20
and that factor hurts...
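
For comparison, a serial C sketch of the extra step that unsorted COO
triplets need before a row-wise SpMV is possible. The coo_entry struct
and the function names are invented for this example and are not PETSc
API. qsort from the C standard library stands in for whatever parallel
sort one would use on the GPU.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical triplet type for illustration only. */
typedef struct { int row, col; double val; } coo_entry;

/* Order triplets by row, then by column. */
static int coo_cmp(const void *a, const void *b)
{
    const coo_entry *x = a, *y = b;
    if (x->row != y->row) return (x->row > y->row) - (x->row < y->row);
    return (x->col > y->col) - (x->col < y->col);
}

/* Sort the triplets globally, then build CSR row pointers
 * (row_ptr has nrows+1 entries) by counting entries per row. */
void coo_sort_to_row_ptr(int nrows, size_t nnz, coo_entry *e, int *row_ptr)
{
    qsort(e, nnz, sizeof *e, coo_cmp);   /* the global sort over all N entries */
    for (int r = 0; r <= nrows; ++r) row_ptr[r] = 0;
    for (size_t k = 0; k < nnz; ++k) {
        assert(e[k].row >= 0 && e[k].row < nrows);
        row_ptr[e[k].row + 1]++;         /* count entries in each row */
    }
    for (int r = 0; r < nrows; ++r) row_ptr[r + 1] += row_ptr[r];  /* prefix sum */
}
```

The counting and prefix-sum passes are linear in N. The global sort is
the part that touches every triplet repeatedly, and unlike the per-row
merge it cannot be confined to shared memory because entries may have
to travel anywhere in the array.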
Best regards,
Karli