This can obviously be done incrementally, so storing a batch of
element matrices to global memory is not a problem.

If you store element matrices to global memory, you're using a ton of
bandwidth (roughly 20x the size of the assembled matrix for P1 tets).

What if you do the sort/reduce step within thread blocks, and write only
the reduced version to global memory?
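To make the block-local sort/reduce idea concrete, here is a minimal sketch in Python/NumPy standing in for what a thread block would do in shared memory: contributions arrive as (global-index, value) pairs, duplicates from neighboring elements are summed after a key sort, and only the compacted result would be written out. The function name and data layout are illustrative, not from any particular library.

```python
import numpy as np

def block_sort_reduce(keys, vals):
    """Sort (key, value) contribution pairs by key and sum duplicates,
    mimicking a per-thread-block segmented reduction in shared memory."""
    order = np.argsort(keys, kind="stable")   # sort contributions by global index
    k, v = keys[order], vals[order]
    # find where each run of equal keys starts, then sum each run
    uniq, idx = np.unique(k, return_index=True)
    summed = np.add.reduceat(v, idx)
    return uniq, summed

# two elements touching the same global DOF produce duplicate key 3;
# their values are summed before the (single) global write
keys = np.array([3, 7, 3, 5])
vals = np.array([1.0, 2.0, 0.5, 4.0])
k, v = block_sort_reduce(keys, vals)
```

On a GPU this would be a shared-memory bitonic sort plus a segmented scan per block; the payoff is that each block emits far fewer (index, value) pairs than it generated.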

My primary metric for GPU kernels is memory transfers from global memory ('flops are free'), hence what I suggest for the assembly stage is to go with something CSR-like rather than COO. Pure CSR may be too expensive in terms of element lookup if there are several fields involved (particularly in 3d), so one could push (column-index, value) pairs for each row, which makes the merge-by-key much cheaper than for arbitrary COO matrices.
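A rough sketch of that row-wise scheme, again in plain Python/NumPy for clarity: each row collects its own (column-index, value) pairs, so the merge is over short row-local lists rather than one globally sorted COO array. All names here are hypothetical.

```python
import numpy as np

def assemble_rows(row_pairs, n_cols):
    """Merge per-row (column, value) pairs into CSR arrays (indptr,
    indices, data). Each row is merged independently, so the sort/merge
    is over short row-local lists instead of a global COO array."""
    indptr = [0]
    indices, data = [], []
    for pairs in row_pairs:
        if not pairs:
            indptr.append(len(indices))
            continue
        cols = np.array([c for c, _ in pairs])
        vals = np.array([v for _, v in pairs])
        order = np.argsort(cols, kind="stable")
        uniq, idx = np.unique(cols[order], return_index=True)
        summed = np.add.reduceat(vals[order], idx)
        indices.extend(uniq.tolist())
        data.extend(summed.tolist())
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

# row 0 receives two contributions to column 0 (from different elements)
row_pairs = [[(0, 1.0), (2, 2.0), (0, 3.0)], [(1, 5.0)]]
indptr, indices, data = assemble_rows(row_pairs, n_cols=3)
```

On the GPU one would assign a thread (or warp) per row; since a row's pair list is short and nearly sorted, the per-row merge is cheap compared with a device-wide sort.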

This, of course, requires knowledge of the nonzero pattern and the couplings among elements, yet this is reasonably cheap to extract for a large class of problems (for example, (non)linear PDEs without adaptivity). Also, the nonzero pattern is rather cheap to obtain if one uses coloring to avoid expensive atomic writes to global memory.
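The coloring mentioned above can be as simple as a greedy pass over the elements: any two elements sharing a global DOF get different colors, so all elements of one color can scatter-add into the global matrix concurrently without atomics. A hedged sketch (the element list as DOF tuples is an assumed input format):

```python
def color_elements(elements):
    """Greedily color elements so that no two elements of the same
    color share a global DOF. Elements of one color can then be
    assembled in parallel without atomic writes."""
    color_dofs = []   # DOFs already claimed by each color
    colors = []
    for dofs in elements:
        d = set(dofs)
        for c, claimed in enumerate(color_dofs):
            if not (d & claimed):   # no conflict with this color
                claimed |= d
                colors.append(c)
                break
        else:                       # conflicts with every color: open a new one
            color_dofs.append(set(d))
            colors.append(len(color_dofs) - 1)
    return colors

# triangles sharing vertices must land in different colors
elements = [(0, 1, 2), (2, 3, 4), (4, 5, 6), (7, 8, 9)]
colors = color_elements(elements)
```

The assembly kernel is then launched once per color, with each launch free of write conflicts; the number of colors is typically small (bounded by the maximum number of elements meeting at a DOF).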

Best regards,
Karli
