Hi guys,

as our discussion of memory is drifting more and more towards runtime and scheduling aspects, I'll try to wrap up the key points of the memory part of the discussion and postpone all runtime/execution aspects to 'Part 2' of the series.
* The proposed unification of memory handles (CPU and GPU) within *data of Vec did not find any support; rather, the GPU handles should remain in GPUarray (or any equivalent for OpenCL/CUDA). However, it is not yet clear whether we want to stick with library-specific names such as Vec_CUSP, or whether we want to go with runtime-specific names such as Vec_CUDA and Vec_OpenCL and probably dispatch into library-specific routines from there. Jed pointed out that Vec_OpenCL is probably too fuzzy, suggesting that Vec_LIBRARYNAME is the better option.

* Barry backs my suggestion to have multi-GPU support for a single process, whereas Jed and Matt suggest mapping one GPU to one MPI process for reasons of simplicity. As the usual application of multi-GPU is within sparse matrix-vector products and block-based preconditioners, I note the following:
  - Such implementations are basically available out-of-the-box with MPI.
  - According to the manual, block-based preconditioners can also be configured on a per-process basis, which allows the individual streaming processors on a GPU to be used efficiently (there is no native synchronization possible between streaming processors within a single kernel!).
  - The current multi-GPU support using txpetscgpu focuses on sparse matrix-vector products only (there are some hints in src/ksp/pc/impls/factor/ilu that forward-backward substitutions for ILU preconditioners on GPUs may also be available, yet I haven't found any actual code/kernels for that).
  Consequently, judging from the available functionality, it seems that we can live with a one-GPU-per-process option.

* Adding a bit of meta information to arrays in main RAM (without splitting up the actual buffer) for increased cache-awareness requires a demonstration of significant performance benefits before any further consideration.

If my wrap-up missed some part of the discussion, please let me/us know.
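
To make the one-GPU-per-MPI-process option from the second point a bit more tangible, here is a minimal sketch of how a process could pick its device. This is only an illustration under assumptions, not anything that exists in PETSc: the function name device_for_rank and both parameters are hypothetical, the node-local rank and the number of GPUs per node are assumed to be known already, and the sketch deliberately makes no MPI or CUDA/OpenCL calls.

```python
def device_for_rank(local_rank: int, num_devices: int) -> int:
    """Return the GPU index for a process with the given node-local rank.

    Hypothetical helper: local_rank is the rank of the process among the
    processes on the same node, num_devices the number of GPUs on that node.
    """
    if num_devices <= 0:
        raise ValueError("at least one GPU is required")
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    # Round-robin assignment: with as many processes per node as GPUs this
    # is a one-to-one mapping; with more processes than GPUs, several
    # processes end up sharing a device.
    return local_rank % num_devices


if __name__ == "__main__":
    # Assumed example configuration: 4 processes on a node with 2 GPUs.
    for rank in range(4):
        print(f"local rank {rank} -> GPU {device_for_rank(rank, 2)}")
```

With one process per GPU, each process simply owns the device with its own local index, so all inter-GPU communication goes through the existing MPI layer.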
I'll now move on to the actual runtime and come up with more concrete ideas in 'Part 2' :-)

Best regards,
Karli
