Karl Rupp <[email protected]> writes:

>>> This can obviously be done incrementally, so storing a batch of
>>> element matrices to global memory is not a problem.
>>
>> If you store element matrices to global memory, you're using a ton of
>> bandwidth (about 20x the size of the matrix if using P1 tets).
>>
>> What if you do the sort/reduce thing within thread blocks, and only
>> write the reduced version to global storage?
>
> My primary metric for GPU kernels is memory transfers from global memory
> ('flops are free'), hence what I suggest for the assembly stage is to go
> with something CSR-like rather than COO. Pure CSR may be too expensive
> in terms of element lookup if there are several fields involved
> (particularly 3d), so one could push (column-index, value) pairs for
> each row, making the merge-by-key much cheaper than for arbitrary COO
> matrices.
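[Editor's note: a minimal sketch of the merge-by-key reduction discussed above, assuming duplicate COO triples (row, col, value) produced by elements sharing dofs; the function name and data are illustrative, not from any particular library. Sorting by (row, col) first turns the reduction into a single linear pass, which is why pre-ordered per-row (column-index, value) pairs make the merge much cheaper than arbitrary COO.]

```python
from itertools import groupby

def merge_by_key(triples):
    """Sum duplicate (row, col) entries, as a sorted merge-by-key would.

    After sorting by (row, col), all contributions to the same matrix
    entry are adjacent, so one linear scan reduces them.
    """
    triples = sorted(triples)  # order by (row, col)
    merged = []
    for (r, c), group in groupby(triples, key=lambda t: (t[0], t[1])):
        merged.append((r, c, sum(v for _, _, v in group)))
    return merged

# Two P1 elements sharing an edge contribute to the same (row, col) slots.
coo = [(0, 0, 1.0), (0, 1, -0.5), (0, 0, 2.0), (1, 1, 1.5), (0, 1, -0.5)]
print(merge_by_key(coo))  # -> [(0, 0, 3.0), (0, 1, -1.0), (1, 1, 1.5)]
```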
I think CSR vs. COO is a second-order optimization to be considered
after the 20x redundancy has been eliminated and a synchronization
strategy has been chosen (e.g., coloring vs. redundant storage with
later compression).

> This, of course, requires knowledge of the nonzero pattern and
> couplings among elements, yet this is reasonably cheap to extract for a
> large number of problems (for example, (non)linear PDEs without
> adaptivity). Also, the nonzero pattern is rather cheap to obtain if one
> uses coloring to avoid expensive atomic writes to global memory.

At this point, I don't mind having the nonzero pattern set ahead of time
using CPU code. It's reassembly in time-dependent problems with no
adaptivity or only occasional adaptivity that I'm more concerned with.
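[Editor's note: a toy sketch of the coloring strategy mentioned above, on a hypothetical 1D mesh where element i couples dofs (i, i+1). Elements of the same color share no dof, so contributions within one color can be accumulated without atomics; only the color boundaries need synchronization. The greedy coloring and the per-dof contribution value are illustrative assumptions.]

```python
def greedy_color(elements):
    """Assign each element a color such that no two same-color elements
    touch a common dof (so same-color writes never race)."""
    colors = []
    for elem in elements:
        # Colors already used by previously colored neighbors.
        used = {c for e, c in zip(elements, colors) if set(e) & set(elem)}
        c = 0
        while c in used:
            c += 1
        colors.append(c)
    return colors

# 1D chain of 4 line elements on 5 dofs; neighbors share one dof.
elements = [(i, i + 1) for i in range(4)]
colors = greedy_color(elements)

# Assemble a lumped vector color by color: within a color, no two
# elements write the same entry, so no atomic operations are needed.
vec = [0.0] * 5
for color in sorted(set(colors)):
    for elem, c in zip(elements, colors):
        if c == color:
            for dof in elem:
                vec[dof] += 0.5  # hypothetical element contribution
print(colors)  # -> [0, 1, 0, 1]  (alternating colors along the chain)
print(vec)     # -> [0.5, 1.0, 1.0, 1.0, 0.5]
```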
