Karl Rupp <[email protected]> writes:

>>> This can obviously be done incrementally, so storing a batch of
>>> element matrices to global memory is not a problem.
>>
>> If you store element matrices to global memory, you're using a ton of
>> bandwidth (about 20x the size of the matrix if using P1 tets).
>>
>> What if you do the sort/reduce thing within thread blocks, and only
>> write the reduced version to global storage?
>
> My primary metric for GPU kernels is memory transfers from global memory
> ('flops are free'), hence what I suggest for the assembly stage is to go
> with something CSR-like rather than COO. Pure CSR may be too expensive
> in terms of element lookup if there are several fields involved
> (particularly 3d), so one could push (column-index, value) pairs for
> each row, making the merge-by-key much cheaper than for arbitrary COO
> matrices.
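[Editor's note: a minimal sketch of the merge-by-key reduction discussed above, assuming duplicate COO triples (row, col, value) produced by elements sharing dofs; the function name and data are illustrative, not from any particular library. Sorting by (row, col) first turns the reduction into a single linear pass, which is why pre-ordered per-row (column-index, value) pairs make the merge much cheaper than arbitrary COO.]

```python
from itertools import groupby

def merge_by_key(triples):
    """Sum duplicate (row, col) entries, as a sorted merge-by-key would.

    After sorting by (row, col), all contributions to the same matrix
    entry are adjacent, so one linear scan reduces them.
    """
    triples = sorted(triples)  # order by (row, col)
    merged = []
    for (r, c), group in groupby(triples, key=lambda t: (t[0], t[1])):
        merged.append((r, c, sum(v for _, _, v in group)))
    return merged

# Two P1 elements sharing an edge contribute to the same (row, col) slots.
coo = [(0, 0, 1.0), (0, 1, -0.5), (0, 0, 2.0), (1, 1, 1.5), (0, 1, -0.5)]
print(merge_by_key(coo))  # -> [(0, 0, 3.0), (0, 1, -1.0), (1, 1, 1.5)]
```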
I think CSR vs. COO is a second-order optimization to be considered
after the 20x redundancy has been eliminated and a synchronization
strategy has been chosen (e.g., coloring vs. redundant storage with
later compression).

> This, of course, requires knowledge of the nonzero pattern and
> couplings among elements, yet this is reasonably cheap to extract for a
> large number of problems (for example, (non)linear PDEs without
> adaptivity). Also, the nonzero pattern is rather cheap to obtain if one
> uses coloring to avoid expensive atomic writes to global memory.

At this point, I don't mind having the nonzero pattern set ahead of time
using CPU code. It's reassembly in time-dependent problems with no
adaptivity or only occasional adaptivity that I'm more concerned with.
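[Editor's note: a toy sketch of the coloring strategy mentioned above, on a hypothetical 1D mesh where element i couples dofs (i, i+1). Elements of the same color share no dof, so contributions within one color can be accumulated without atomics; only the color boundaries need synchronization. The greedy coloring and the per-dof contribution value are illustrative assumptions.]

```python
def greedy_color(elements):
    """Assign each element a color such that no two same-color elements
    touch a common dof (so same-color writes never race)."""
    colors = []
    for elem in elements:
        # Colors already used by previously colored neighbors.
        used = {c for e, c in zip(elements, colors) if set(e) & set(elem)}
        c = 0
        while c in used:
            c += 1
        colors.append(c)
    return colors

# 1D chain of 4 line elements on 5 dofs; neighbors share one dof.
elements = [(i, i + 1) for i in range(4)]
colors = greedy_color(elements)

# Assemble a lumped vector color by color: within a color, no two
# elements write the same entry, so no atomic operations are needed.
vec = [0.0] * 5
for color in sorted(set(colors)):
    for elem, c in zip(elements, colors):
        if c == color:
            for dof in elem:
                vec[dof] += 0.5  # hypothetical element contribution
print(colors)  # -> [0, 1, 0, 1]  (alternating colors along the chain)
print(vec)     # -> [0.5, 1.0, 1.0, 1.0, 0.5]
```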
