This can obviously be done incrementally, so storing a batch of
element matrices to global memory is not a problem.

If you store element matrices to global memory, you're using a ton of
bandwidth (roughly 20x the size of the assembled matrix for P1 tets).

What if you do the sort/reduce step within thread blocks, and write only
the reduced version to global memory?
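To make the block-local sort/reduce idea concrete, here is a minimal sketch in Python/NumPy standing in for what a thread block would do in shared memory: contributions arrive as (global-index, value) pairs, duplicates from neighboring elements are summed after a key sort, and only the compacted result would be written out. The function name and data layout are illustrative, not from any particular library.

```python
import numpy as np

def block_sort_reduce(keys, vals):
    """Sort (key, value) contribution pairs by key and sum duplicates,
    mimicking a per-thread-block segmented reduction in shared memory."""
    order = np.argsort(keys, kind="stable")   # sort contributions by global index
    k, v = keys[order], vals[order]
    # find where each run of equal keys starts, then sum each run
    uniq, idx = np.unique(k, return_index=True)
    summed = np.add.reduceat(v, idx)
    return uniq, summed

# two elements touching the same global DOF produce duplicate key 3;
# their values are summed before the (single) global write
keys = np.array([3, 7, 3, 5])
vals = np.array([1.0, 2.0, 0.5, 4.0])
k, v = block_sort_reduce(keys, vals)
```

On a GPU this would be a shared-memory bitonic sort plus a segmented scan per block; the payoff is that each block emits far fewer (index, value) pairs than it generated.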

My primary metric for GPU kernels is memory transfers from global memory ('flops are free'), hence what I suggest for the assembly stage is to go with something CSR-like rather than COO. Pure CSR may be too expensive in terms of element lookup if there are several fields involved (particularly in 3d), so one could push (column-index, value) pairs for each row, which makes the merge-by-key much cheaper than for arbitrary COO matrices.
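A rough sketch of that row-wise scheme, again in plain Python/NumPy for clarity: each row collects its own (column-index, value) pairs, so the merge is over short row-local lists rather than one globally sorted COO array. All names here are hypothetical.

```python
import numpy as np

def assemble_rows(row_pairs, n_cols):
    """Merge per-row (column, value) pairs into CSR arrays (indptr,
    indices, data). Each row is merged independently, so the sort/merge
    is over short row-local lists instead of a global COO array."""
    indptr = [0]
    indices, data = [], []
    for pairs in row_pairs:
        if not pairs:
            indptr.append(len(indices))
            continue
        cols = np.array([c for c, _ in pairs])
        vals = np.array([v for _, v in pairs])
        order = np.argsort(cols, kind="stable")
        uniq, idx = np.unique(cols[order], return_index=True)
        summed = np.add.reduceat(vals[order], idx)
        indices.extend(uniq.tolist())
        data.extend(summed.tolist())
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

# row 0 receives two contributions to column 0 (from different elements)
row_pairs = [[(0, 1.0), (2, 2.0), (0, 3.0)], [(1, 5.0)]]
indptr, indices, data = assemble_rows(row_pairs, n_cols=3)
```

On the GPU one would assign a thread (or warp) per row; since a row's pair list is short and nearly sorted, the per-row merge is cheap compared with a device-wide sort.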

This, of course, requires knowledge of the nonzero pattern and the couplings among elements, yet this is reasonably cheap to extract for a large class of problems (for example, (non)linear PDEs without adaptivity). Also, the nonzero pattern is rather cheap to obtain if one uses coloring to avoid expensive atomic writes to global memory.
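The coloring mentioned above can be as simple as a greedy pass over the elements: any two elements sharing a global DOF get different colors, so all elements of one color can scatter-add into the global matrix concurrently without atomics. A hedged sketch (the element list as DOF tuples is an assumed input format):

```python
def color_elements(elements):
    """Greedily color elements so that no two elements of the same
    color share a global DOF. Elements of one color can then be
    assembled in parallel without atomic writes."""
    color_dofs = []   # DOFs already claimed by each color
    colors = []
    for dofs in elements:
        d = set(dofs)
        for c, claimed in enumerate(color_dofs):
            if not (d & claimed):   # no conflict with this color
                claimed |= d
                colors.append(c)
                break
        else:                       # conflicts with every color: open a new one
            color_dofs.append(set(d))
            colors.append(len(color_dofs) - 1)
    return colors

# triangles sharing vertices must land in different colors
elements = [(0, 1, 2), (2, 3, 4), (4, 5, 6), (7, 8, 9)]
colors = color_elements(elements)
```

The assembly kernel is then launched once per color, with each launch free of write conflicts; the number of colors is typically small (bounded by the maximum number of elements meeting at a DOF).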

Best regards,
Karli
