CUDA: Part 1: Memory

Barry Smith Tue, 9 Oct 2012 13:30:18 -0500

On Oct 9, 2012, at 10:21 AM, Karl Rupp <rupp at mcs.anl.gov> wrote:

> Hi guys,
> 
> as our discussion of memory is drifting more and more towards runtime 
> and scheduling aspects, I'll try to wrap up the key points of the memory part 
> of the discussion and postpone all runtime/execution aspects to 'Part 2' of 
> the series.
> 
> * The proposed unification of memory handles (CPU and GPU) within *data of 
> Vec found no support; rather, the GPU handles should remain in 
> GPUarray (or any equivalent for OpenCL/CUDA). However, it is not yet clear 
> whether we want to stick with library-specific names such as Vec_CUSP, or 
> whether we want to go with runtime-specific names such as Vec_CUDA and 
> Vec_OpenCL and probably dispatch into library-specific routines from there. 
> Jed pointed out that Vec_OpenCL is probably too fuzzy, suggesting that 
> Vec_LIBRARYNAME is the better option.



     The Vec_CUSP is most definitely built on top of CUSP and is not built 
around generic CUDA, hence going to Vec_CUDA from Vec_CUSP doesn't make sense to 
me. If we had (have? as an alternative) a Vec class that was built directly on 
CUDA then it could be called Vec_CUDA. Similarly, if Vec_OpenCL is built 
directly on generic OpenCL then that name is fine; if it is built on top of 
something like ViennaCL then Vec_ViennaCL would be the way to go.

    Barry


Paul has put in some code based on cusparse; I haven't had the energy to see 
how that works. Perhaps there should be a Vec_CUSparse for that. 
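
To make the naming argument concrete, here is a minimal sketch in C. It is not PETSc's actual type registry: the struct, the table, and both function names are hypothetical, the "cusparse" entry only stands in for the Vec_CUSparse idea floated above, and listing ViennaCL under OpenCL is an assumption made for illustration. The point it tries to show is that several libraries can sit on the same runtime, so a runtime name alone does not identify an implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, NOT PETSc's registry: each vector implementation is
   keyed by the library it is built on, with the runtime kept as metadata. */
typedef struct {
    const char *library; /* what the implementation is built on top of */
    const char *runtime; /* runtime underneath that library */
} VecImplSketch;

static const VecImplSketch vec_impls[] = {
    { "cusp",     "cuda"   },
    { "cusparse", "cuda"   }, /* stands in for a possible Vec_CUSparse */
    { "viennacl", "opencl" }, /* assumption: ViennaCL used via OpenCL */
};

static const size_t vec_impl_count = sizeof vec_impls / sizeof vec_impls[0];

/* Look up an implementation by library name; returns NULL if unknown. */
const VecImplSketch *vec_impl_lookup(const char *library)
{
    assert(library != NULL);
    for (size_t i = 0; i < vec_impl_count; ++i) {
        if (strcmp(vec_impls[i].library, library) == 0) {
            return &vec_impls[i];
        }
    }
    return NULL;
}

/* Count how many registered implementations share the given runtime. */
size_t vec_impls_on_runtime(const char *runtime)
{
    size_t n = 0;
    assert(runtime != NULL);
    for (size_t i = 0; i < vec_impl_count; ++i) {
        if (strcmp(vec_impls[i].runtime, runtime) == 0) {
            ++n;
        }
    }
    return n;
}
```

In this sketch a lookup of "cusp" resolves to exactly one entry, whereas asking for everything on "cuda" matches more than one, which is the ambiguity a runtime-specific name such as Vec_CUDA would have to resolve by further dispatch.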

> 
> * Barry backs my suggestion to have multi-GPU support for a single process, 
> whereas Jed and Matt suggest mapping one GPU to one MPI process for reasons of 
> simplicity. As the usual application of multi-GPU is within sparse 
> matrix-vector products and block-based preconditioners, I note the following:
> - Such implementations are basically available out-of-the-box with MPI.
> - According to the manual, block-based preconditioners can also be configured 
> on a per-process basis, which allows the individual streaming 
> processors on a GPU to be used efficiently (there is no native synchronization 
> possible between streaming processors within a single kernel!).
> - The current multi-GPU support using txpetscgpu focuses on sparse 
> matrix-vector products only (there are some hints in 
> src/ksp/pc/impls/factor/ilu that forward-backward substitutions for ILU 
> preconditioners on GPUs may also be available, yet I haven't found any actual 
> code/kernels for that).
> Consequently, from the available functionality it seems that we can live with 
> a one-GPU-per-process option.
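
The one-GPU-per-process option summarized above can be sketched as a small device-selection rule. This is only an illustration under stated assumptions: the helper name is hypothetical and belongs to neither PETSc, MPI, nor CUDA. In a real program the node-local rank would come from MPI, the device count from `cudaGetDeviceCount`, and the result would be passed to `cudaSetDevice`; those calls are left out so the sketch stays self-contained.

```c
#include <assert.h>

/* Hypothetical helper: pick the GPU for a process from its rank among the
   processes running on the same node. */
int pick_device_for_rank(int local_rank, int device_count)
{
    assert(local_rank >= 0);
    assert(device_count > 0);
    /* Round-robin assignment: with as many processes as GPUs per node this
       gives one GPU per process; with more processes, devices are shared. */
    return local_rank % device_count;
}
```

With two GPUs on a node, local ranks 0 and 1 get devices 0 and 1; a third process on the same node would wrap around to device 0, which is the case the one-GPU-per-process setup is meant to avoid by launching no more processes per node than there are GPUs.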
> 
> * Adding a bit of meta information to arrays in main RAM (without splitting 
> up the actual buffer) for increased cache-awareness requires a demonstration 
> of significant performance benefits for any further consideration.
> 
> If my wrap-up missed some part of the discussion, please let me/us know. I'll 
> now move on to the actual runtime and come up with more concrete ideas in 
> 'Part 2' :-)
> 
> Best regards,
> Karli
> 
>

[petsc-dev] Unification approach for OpenMP/Threads/OpenCL/CUDA: Part 1: Memory