Hi Paul,

> No problem. I have some code that will be helpful for these classes. In
> particular, in txpetscgpu, I have code that figures out which data to
> message to/from the GPU in the parallel SpMV. It's done using
> cudaStreams, which allows the comm to be overlapped with the computation
> kernel.
>
> If you make the hierarchy as described below, it would be natural to
> move the txpetscgpu code into the Vec_CUDA class.
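For readers following along, the split described in the quoted paragraph can be sketched on the host. This is only an illustration under stated assumptions: the `Csr` struct and the functions `ghost_columns`, `spmv_add` and `split_spmv` are hypothetical names, not txpetscgpu or PETSc code, and the three steps run sequentially here. In a stream-based version, step 1 would sit on a communication stream so that it can overlap with step 2.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR block. Hypothetical type, not a PETSc or txpetscgpu structure.
struct Csr {
    std::vector<int> row_ptr;    // size = rows + 1
    std::vector<int> col;        // column index of each nonzero
    std::vector<double> val;     // value of each nonzero
};

// Sorted, unique columns referenced by the off-process block: these are the
// only entries of x that have to be messaged to this process.
std::vector<int> ghost_columns(const Csr& off) {
    std::vector<int> cols(off.col);
    std::sort(cols.begin(), cols.end());
    cols.erase(std::unique(cols.begin(), cols.end()), cols.end());
    return cols;
}

// y += A * x for one CSR block.
void spmv_add(const Csr& a, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i + 1 < a.row_ptr.size(); ++i)
        for (int k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k)
            y[i] += a.val[k] * x[a.col[k]];
}

// diag uses local column indices into x_local; off uses global column
// indices. x_global stands in for the data owned by other processes.
std::vector<double> split_spmv(const Csr& diag, const Csr& off,
                               const std::vector<double>& x_local,
                               const std::vector<double>& x_global) {
    assert(!diag.row_ptr.empty() && diag.row_ptr.size() == off.row_ptr.size());
    std::vector<double> y(diag.row_ptr.size() - 1, 0.0);

    // Step 1 (communication): fetch only the ghost entries that are needed.
    std::vector<double> x_ghost(x_global.size(), 0.0);
    for (int g : ghost_columns(off)) x_ghost[g] = x_global[g];

    // Step 2 (compute): local block, independent of step 1.
    spmv_add(diag, x_local, y);

    // Step 3: off-process block, valid only once the ghost values are present.
    spmv_add(off, x_ghost, y);
    return y;
}
```

Because step 2 never reads `x_ghost`, it is the part that can hide the transfer latency; only step 3 has to wait for the message.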
Great, that should lower the barrier for interfacing with other
CUDA-based libraries significantly, and also simplify the maintenance of
txpetscgpu.

Best regards,
Karli


>> Hi Paul,
>>
>> thanks for the comments. I'll have a look at whether we can have an
>> intermediate layer for CUDA and OpenCL, e.g.
>> Vec_Seq -> Vec_CUDA -> Vec_Thrust.
>> This should allow us to define a broader set of operations on Vec_CUDA
>> (and similarly for matrices), particularly those not covered by
>> CUSparse and Thrust.
>>
>> Best regards,
>> Karli
>>
>>
>> On 10/09/2012 03:48 PM, Paul Mullowney wrote:
>>> I think the current vector class should be Vec_Thrust with
>>>
>>> -vec_type thrust (not cusp)
>>>
>>> First, most of the vector functions are computed from kernels in the
>>> Thrust library (although there may be an occasional CUSP or CUBLAS
>>> function call). Second, it is not clear how long CUSP is going to
>>> survive ... and I think Nvidia puts more energy into CUSPARSE and
>>> Thrust.
>>>
>>> I think a Vec_CUDA would be very useful ... there is a lot you could do
>>> with this that you can't currently do with Thrust.
>>>
>>> I think separating the Mat types into CUSP and CUSPARSE is sensible.
>>>
>>> -Paul
>>>
>>>
>>>
>>>>> Hi guys,
>>>>>
>>>>> as our discussion of memory is drifting more and more towards
>>>>> runtime and scheduling aspects, I'll try to wrap up the key points of
>>>>> the memory part of the discussion and postpone all runtime/execution
>>>>> aspects to 'Part 2' of the series.
>>>>>
>>>>> * The proposed unification of memory handles (CPU and GPU) within
>>>>> *data of Vec did not find any support; rather, the GPU handles should
>>>>> remain in GPUarray (or any equivalent for OpenCL/CUDA). However, it
>>>>> is not yet clear whether we want to stick with library-specific names
>>>>> such as Vec_CUSP, or whether we want to go with runtime-specific
>>>>> names such as Vec_CUDA and Vec_OpenCL and probably dispatch into
>>>>> library-specific routines from there.
>>>>> Jed pointed out that Vec_OpenCL
>>>>> is probably too fuzzy, suggesting that Vec_LIBRARYNAME is the better
>>>>> option.
>>>>
>>>> The Vec_CUSP is most definitely built on top of CUSP and is not
>>>> built around generic CUDA, hence going to Vec_CUDA from Vec_CUSP
>>>> doesn't make sense to me. If we had (have? as an alternative) a Vec
>>>> class that was built directly on CUDA then it could be called
>>>> Vec_CUDA. Similarly, if Vec_OpenCL is built directly on generic OpenCL
>>>> then that name is fine; if it is built on top of something like
>>>> ViennaCL then Vec_ViennaCL would be the way to go.
>>>>
>>>> Barry
>>>>
>>>>
>>>> Paul has put in some code based on cusparse; I haven't had the energy
>>>> to see how that works. Perhaps there should be a Vec_CUSparse for that.
>>>>
>>>>> * Barry backs my suggestion to have multi-GPU support for a single
>>>>> process, whereas Jed and Matt suggest mapping one GPU to one
>>>>> MPI process for reasons of simplicity. As the usual application of
>>>>> multi-GPU is within sparse matrix-vector products and block-based
>>>>> preconditioners, I note the following:
>>>>> - Such implementations are basically available out-of-the-box with
>>>>> MPI.
>>>>> - According to the manual, block-based preconditioners can also be
>>>>> configured on a per-process basis, thus allowing the individual
>>>>> streaming processors on a GPU to be used efficiently (there is no
>>>>> native synchronization possible between streaming processors within a
>>>>> single kernel!).
>>>>> - The current multi-GPU support using txpetscgpu focuses on sparse
>>>>> matrix-vector products only (there are some hints in
>>>>> src/ksp/pc/impls/factor/ilu that forward-backward substitutions for
>>>>> ILU preconditioners on GPUs may also be available, yet I haven't
>>>>> found any actual code/kernels for that).
>>>>> Consequently, from the available functionality it seems that we can
>>>>> live with a one-GPU-per-process option.
>>>>>
>>>>> * Adding a bit of meta information to arrays in main RAM (without
>>>>> splitting up the actual buffer) for increased cache-awareness
>>>>> requires a demonstration of significant performance benefits for any
>>>>> further consideration.
>>>>>
>>>>> If my wrap-up missed some part of the discussion, please let me/us
>>>>> know. I'll now move on to the actual runtime and come up with more
>>>>> concrete ideas in 'Part 2' :-)
>>>>>
>>>>> Best regards,
>>>>> Karli
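
The layering discussed in this thread (Vec_Seq -> Vec_CUDA -> Vec_Thrust, selected with something like -vec_type thrust) can be sketched as follows. This is a rough illustration only, not the PETSc implementation: PETSc itself is written in C, and the classes `VecSeq`, `VecCUDA`, `VecThrust` and the function `create_vec` are hypothetical stand-ins. The point is that runtime-level bookkeeping sits in the middle layer, with library-specific code derived from it.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Host storage only (plays the role of Vec_Seq).
class VecSeq {
public:
    explicit VecSeq(std::size_t n) : host_(n, 0.0) {}
    virtual ~VecSeq() = default;
    virtual std::string type_name() const { return "seq"; }
    std::size_t size() const { return host_.size(); }
protected:
    std::vector<double> host_;
};

// Runtime-level layer (plays the role of Vec_CUDA): tracks whether a device
// copy is current. A real class would also own the device buffer.
class VecCUDA : public VecSeq {
public:
    explicit VecCUDA(std::size_t n) : VecSeq(n) {}
    std::string type_name() const override { return "cuda"; }
    bool device_copy_valid() const { return device_valid_; }
protected:
    bool device_valid_ = false;
};

// Library-specific layer (plays the role of Vec_Thrust): would dispatch its
// operations to Thrust kernels and inherit everything else from VecCUDA.
class VecThrust : public VecCUDA {
public:
    explicit VecThrust(std::size_t n) : VecCUDA(n) {}
    std::string type_name() const override { return "thrust"; }
};

// Maps a type string (as one would pass to -vec_type) to a constructor.
std::unique_ptr<VecSeq> create_vec(const std::string& type, std::size_t n) {
    using Maker = std::function<std::unique_ptr<VecSeq>(std::size_t)>;
    static const std::map<std::string, Maker> registry = {
        {"seq",    [](std::size_t k) { return std::unique_ptr<VecSeq>(new VecSeq(k)); }},
        {"cuda",   [](std::size_t k) { return std::unique_ptr<VecSeq>(new VecCUDA(k)); }},
        {"thrust", [](std::size_t k) { return std::unique_ptr<VecSeq>(new VecThrust(k)); }},
    };
    const auto it = registry.find(type);
    if (it == registry.end()) throw std::invalid_argument("unknown vec type: " + type);
    return it->second(n);
}
```

With this shape, code that only needs the CUDA runtime (such as the message packing from txpetscgpu) could be written against the middle layer and would work for any library-specific type derived from it.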
