Matthew Knepley <[email protected]> writes:

> Okay, here is how I understand GPU matrix assembly. The only way it
> makes sense to me is in COO format which you may later convert. In
> mpiaijAssemble.cu I have code that
>
>   - Produces COO rows
>   - Segregates them into on and off-process rows

Note that some users compute redundantly and set MAT_NO_OFF_PROC_ENTRIES,
in which case there are no off-process rows to segregate.

>   - Sorts and reduces by key

... then insert into diagonal and off-diagonal parts of owned matrices.
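To make the sort/reduce-by-key step concrete, here is a minimal CPU sketch of
what that pass does to the COO triplets (the real code would do this on the
device, e.g. with Thrust's sort_by_key/reduce_by_key; `Entry` and
`SortReduceCOO` are hypothetical names, not from mpiaijAssemble.cu):

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

struct Entry { int row, col; double val; };

// Sort COO triplets by the (row, col) key and sum duplicates, mimicking
// the GPU sort/reduce-by-key pass. After this, each (row, col) appears
// once with the accumulated element-matrix contributions.
std::vector<Entry> SortReduceCOO(std::vector<Entry> coo) {
  std::sort(coo.begin(), coo.end(), [](const Entry &a, const Entry &b) {
    return std::tie(a.row, a.col) < std::tie(b.row, b.col);
  });
  std::vector<Entry> reduced;
  for (const Entry &e : coo) {
    if (!reduced.empty() && reduced.back().row == e.row &&
        reduced.back().col == e.col)
      reduced.back().val += e.val;  // duplicate key: accumulate
    else
      reduced.push_back(e);
  }
  return reduced;
}
```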

> This can obviously be done incrementally, so storing a batch of
> element matrices to global memory is not a problem. 

If you store the element matrices to global memory, you're using a ton of
bandwidth (roughly 20x the size of the assembled matrix if using P1 tets).

What if you do the sort/reduce thing within thread blocks, and only
write the reduced version to global storage?
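A hypothetical CPU model of that suggestion, with a fixed-size batch standing
in for one thread block's shared-memory tile (names and the batching scheme
are assumptions, not from the actual code): each tile is sorted and reduced
locally, and only the reduced entries are written to the global output stream.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

struct Entry { int row, col; double val; };

// Reduce within each "thread block" (here: a batch of blockSize entries)
// before touching global memory, so only already-combined entries pay
// global bandwidth. A later global pass would merge duplicates that
// straddle tile boundaries.
std::vector<Entry> BlockLocalReduce(const std::vector<Entry> &coo,
                                    std::size_t blockSize) {
  std::vector<Entry> out;
  for (std::size_t start = 0; start < coo.size(); start += blockSize) {
    std::size_t end = std::min(start + blockSize, coo.size());
    std::vector<Entry> tile(coo.begin() + start, coo.begin() + end);
    std::sort(tile.begin(), tile.end(), [](const Entry &a, const Entry &b) {
      return std::tie(a.row, a.col) < std::tie(b.row, b.col);
    });
    std::vector<Entry> reduced;  // tile-local reduction only
    for (const Entry &e : tile) {
      if (!reduced.empty() && reduced.back().row == e.row &&
          reduced.back().col == e.col)
        reduced.back().val += e.val;
      else
        reduced.push_back(e);
    }
    out.insert(out.end(), reduced.begin(), reduced.end());
  }
  return out;
}
```

When many element matrices in the same block touch the same entries (as
neighboring elements do), the write to global storage shrinks accordingly.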
