On Oct 9, 2012, at 10:21 AM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> Hi guys,
>
> as our discussion of memory is drifting more and more towards runtime
> and scheduling aspects, I'll try to wrap up the key points of the memory part
> of the discussion and postpone all runtime/execution aspects to 'Part 2' of
> the series.
>
> * The proposed unification of memory handles (CPU and GPU) within *data of
> Vec did not find any support; rather, the GPU handles should remain in
> GPUarray (or any equivalent for OpenCL/CUDA). However, it is not yet clear
> whether we want to stick with library-specific names such as Vec_CUSP, or
> whether we want to go with runtime-specific names such as Vec_CUDA and
> Vec_OpenCL and probably dispatch into library-specific routines from there.
> Jed pointed out that Vec_OpenCL is probably too fuzzy, suggesting that
> Vec_LIBRARYNAME is the better option.
Vec_CUSP is most definitely built on top of CUSP and is not built around
generic CUDA, hence going from Vec_CUSP to Vec_CUDA doesn't make sense to
me. If we had (have? as an alternative) a Vec class that was built directly on
CUDA, then it could be called Vec_CUDA. Similarly, if Vec_OpenCL is built
directly on generic OpenCL, then that name is fine; if it is built on top of
something like ViennaCL, then Vec_ViennaCL would be the way to go.
Barry
Paul has put in some code based on cusparse; I haven't had the energy to see
how that works. Perhaps there should be a Vec_CUSparse for that.
>
> * Barry backs up my suggestion to have multi-GPU support for a single process,
> whereas Jed and Matt suggest to map one GPU to one MPI-process for reasons of
> simplicity. As the usual application of multi-GPU is within sparse
> matrix-vector products and block-based preconditioners, I note the following:
> - Such implementations are basically available out-of-the-box with MPI.
> - According to the manual, block-based preconditioners can also be configured
> on a per-process basis, thus allowing efficient use of the individual streaming
> processors on a GPU (there is no native synchronization possible
> between streaming processors within a single kernel!).
> - The current multi-GPU support using txpetscgpu focuses on sparse
> matrix-vector products only (there are some hints in
> src/ksp/pc/impls/factor/ilu that forward-backward substitutions for ILU
> preconditioners on GPUs may also be available, yet I haven't found any actual
> code/kernels for that).
> Consequently, from the available functionality it seems that we can live with
> a one-GPU-per-process option.
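
A minimal sketch of the one-GPU-per-process mapping, assuming each process
knows its rank among the processes on its own node and the number of GPUs
visible there. pick_device is a hypothetical helper, not part of PETSc, CUDA,
or OpenCL; a real code would hand the result to the device-selection call of
whichever runtime is in use.

```c
#include <assert.h>

/* Hypothetical helper: map a node-local process rank to a device index.
 * Ranks are assigned to devices round-robin, so with as many processes
 * per node as GPUs, each process gets its own device. */
static int pick_device(int local_rank, int num_devices)
{
    assert(local_rank >= 0);   /* rank among processes on this node */
    assert(num_devices > 0);   /* at least one GPU must be visible */
    return local_rank % num_devices;
}
```

With two GPUs per node, local ranks 0 and 1 get devices 0 and 1; if more
processes than GPUs are started on a node, the extra ranks wrap around and
share a device.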
>
> * Adding a bit of meta information to arrays in main RAM (without splitting
> up the actual buffer) for increased cache-awareness requires a demonstration
> of significant performance benefits for any further consideration.
>
> If my wrap-up missed some part of the discussion, please let me/us know. I'll
> now move on to the actual runtime and come up with more concrete ideas in
> 'Part 2' :-)
>
> Best regards,
> Karli
>
>