Matthew Knepley <[email protected]> writes:

> Okay, here is how I understand GPU matrix assembly. The only way it
> makes sense to me is in COO format which you may later convert. In
> mpiaijAssemble.cu I have code that
>
>   - Produces COO rows
>   - Segregates them into on and off-process rows

These users compute redundantly and set MAT_NO_OFF_PROC_ENTRIES.

>   - Sorts and reduces by key ... then insert into diagonal and
>     off-diagonal parts of owned matrices.
>
> This can obviously be done incrementally, so storing a batch of
> element matrices to global memory is not a problem.

If you store element matrices to global memory, you're using a ton of
bandwidth (about 20x the size of the matrix if using P1 tets). What if
you do the sort/reduce thing within thread blocks, and only write the
reduced version to global storage?
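
To make the sort/reduce step concrete, here is a minimal host-side sketch in C++ of what I mean by reducing a batch of COO entries by key and then splitting the result into diagonal and off-diagonal parts. This is not the code in mpiaijAssemble.cu; the names `Coo`, `sortAndReduce`, and `splitDiagOffdiag` are made up for illustration, and the owned column range `[colStart, colEnd)` is an assumed input. On the GPU the same two steps would presumably map to a sort-by-key followed by a reduce-by-key, applied per thread block before anything is written to global storage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical COO entry: global row, global column, value.
struct Coo {
  int row;
  int col;
  double val;
};

// Sort a batch of COO entries by (row, col) and sum duplicates in place.
// After this call the batch holds one entry per distinct (row, col) key.
inline void sortAndReduce(std::vector<Coo> &batch) {
  std::sort(batch.begin(), batch.end(), [](const Coo &a, const Coo &b) {
    return a.row != b.row ? a.row < b.row : a.col < b.col;
  });
  std::size_t out = 0;  // number of reduced entries written so far
  for (std::size_t i = 0; i < batch.size(); ++i) {
    if (out > 0 && batch[out - 1].row == batch[i].row &&
        batch[out - 1].col == batch[i].col) {
      batch[out - 1].val += batch[i].val;  // same key: accumulate
    } else {
      batch[out++] = batch[i];  // new key: start a new reduced entry
    }
  }
  batch.resize(out);
}

// Split reduced entries into the diagonal block (columns in
// [colStart, colEnd), an assumed owned range) and the off-diagonal block.
inline void splitDiagOffdiag(const std::vector<Coo> &reduced, int colStart,
                             int colEnd, std::vector<Coo> &diag,
                             std::vector<Coo> &offdiag) {
  assert(colStart <= colEnd);
  for (const Coo &e : reduced) {
    if (e.col >= colStart && e.col < colEnd) {
      diag.push_back(e);
    } else {
      offdiag.push_back(e);
    }
  }
}
```

The point of doing this per batch is that only the reduced entries, one per distinct (row, col), need to leave the block, rather than every element-matrix contribution.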
