We have some motivated users who would like a way to assemble matrices on a device without storing all the element matrices in global memory or transferring them to the CPU. Given GPU execution models, this means the insertion has to happen on the spot, inside kernels. So what about a function that can be called by device threads?

    PetscErrorCode MatOpenCLGetSetValuesSource(Mat, synchronization_mechanism, char **);

The user concatenates this type-specialized code into their own kernel source and then calls MatSetValues() from device threads. The users I'm talking to synchronize by coordinating threads with a coloring of sorts. The user still needs to call MatAssemblyBegin/End from outside a kernel, though that function may or may not need to launch a kernel of its own. Crazy?
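
To make the coloring idea concrete, here is a minimal host-side sketch in plain C. It is not PETSc code and not the proposed device source. The names set_values_add and assemble_colored are hypothetical, and the dense row-major matrix, the 1D mesh of two-node elements, and the two-color (even/odd) scheme are all assumptions chosen to keep the example small. The point is only that elements of one color share no nodes, so their insertions never touch the same matrix entry and need no atomics or locks.

```c
#include <stddef.h>

/* Hypothetical stand-in for a device-callable MatSetValues() in ADD mode:
 * adds an nr x nc block of values into a dense, row-major n x n matrix.
 * It uses no atomics or locks; correctness relies on the caller's coloring. */
void set_values_add(double *A, int n,
                    int nr, const int *rows,
                    int nc, const int *cols,
                    const double *vals)
{
  for (int i = 0; i < nr; i++) {
    for (int j = 0; j < nc; j++) {
      A[(size_t)rows[i] * n + cols[j]] += vals[i * nc + j];
    }
  }
}

/* Assemble a 1D Laplacian-like matrix on nel two-node elements
 * (nel + 1 nodes). A must be zero-initialized with (nel+1)^2 entries.
 * Elements are colored by parity: two elements of the same color never
 * share a node, so within one color the insertions are independent. */
void assemble_colored(double *A, int nel)
{
  const int n = nel + 1;
  const double ke[4] = { 1.0, -1.0,
                        -1.0,  1.0}; /* element matrix, computed "on the spot" */

  for (int color = 0; color < 2; color++) {
    /* On a device, this inner loop would be one kernel launch with one
     * thread per element of the current color; the launch boundary acts
     * as the synchronization point between colors. */
    for (int e = color; e < nel; e += 2) {
      const int idx[2] = {e, e + 1};
      set_values_add(A, n, 2, idx, 2, idx, ke);
    }
  }
}
```

In this sketch the element matrix never leaves the "thread" that computed it, which is the property the users are after. A real device version would also have to handle the sparse storage format and preallocation, which is where the type-specialized source would come in.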
