Hi Paul,

>> * Reduce CUSP dependency: The current elementary operations are
>> mainly realized via CUSP. With better support via CUSPARSE and
>> CUBLAS, I'd add a separate 'native' CUDA backend so that we can
>> provide a full set of vector and sparse matrix operations out of the
>> default NVIDIA toolchain. We will still keep CUSP for its
>> preconditioners, yet we no longer depend on it.
> Agreed. In the past, I've suggested a -vec_type cuda (not cusp). All the
> CUSP operations can be done with Thrust algorithms. Since Thrust comes
> default with CUDA, one can have only a CUDA dependency.

Yes, I opt for
 -vec_type cuda
if everything needed is shipped with the CUDA toolkit. I even tend to avoid Thrust as much as possible and go with CUBLAS/CUSPARSE because we get faster compilation and fewer compiler warnings this way, but that's an implementation detail :-)


>> * Integrate last bits of txpetscgpu package. I assume Paul will
>> provide a helping hand here.
> Of course. This will go much faster as much of the hard work is done. Do
> people want support for different matrix formats in the CUSP classes :
> i.e. diagonal, ellpack, hybrid? I think the CUSP preconditioners can be
> derived from matrices stored in non-csr format (although they're likely
> just doing a convert under the hood).

Since people keep asking for fast SpMV, we should provide these other formats as well (actually, they are partially provided with your update to the CUSPARSE bindings already). The main reason for CUSP is the SA preconditioner, for which SpMV performance doesn't really matter.
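
To make the format question concrete, here is a minimal host-side sketch of the same SpMV in CSR and in ELLPACK. It is plain C++ rather than CUDA, and the names (CsrMatrix, EllMatrix, csr_to_ell, spmv_csr, spmv_ell) are made up for illustration; they are not PETSc, CUSP, or CUSPARSE identifiers. The point is only the data layout: ELLPACK pads every row to the same width and stores entries column-major, which is what gives coalesced loads when one GPU thread handles one row.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// CSR: row_ptr has n_rows + 1 entries; row i owns entries [row_ptr[i], row_ptr[i+1]).
struct CsrMatrix {
  std::size_t n_rows;
  std::vector<std::size_t> row_ptr;
  std::vector<std::size_t> col_idx;
  std::vector<double> values;
};

// ELLPACK: every row is padded to `width` entries, stored column-major,
// i.e. slot s of row i lives at index s * n_rows + i.
struct EllMatrix {
  std::size_t n_rows;
  std::size_t width;
  std::vector<std::size_t> col_idx;  // padding slots point at column 0
  std::vector<double> values;        // padding slots hold 0.0
};

std::vector<double> spmv_csr(const CsrMatrix& A, const std::vector<double>& x) {
  std::vector<double> y(A.n_rows, 0.0);
  for (std::size_t i = 0; i < A.n_rows; ++i)
    for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
      y[i] += A.values[k] * x[A.col_idx[k]];
  return y;
}

// The "convert under the hood" step: pad all rows to the longest row.
EllMatrix csr_to_ell(const CsrMatrix& A) {
  std::size_t width = 0;
  for (std::size_t i = 0; i < A.n_rows; ++i)
    width = std::max(width, A.row_ptr[i + 1] - A.row_ptr[i]);

  const std::size_t n = A.n_rows * width;
  EllMatrix E{A.n_rows, width, std::vector<std::size_t>(n, 0),
              std::vector<double>(n, 0.0)};
  for (std::size_t i = 0; i < A.n_rows; ++i) {
    for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
      const std::size_t slot = k - A.row_ptr[i];
      E.col_idx[slot * A.n_rows + i] = A.col_idx[k];
      E.values[slot * A.n_rows + i] = A.values[k];
    }
  }
  return E;
}

std::vector<double> spmv_ell(const EllMatrix& E, const std::vector<double>& x) {
  std::vector<double> y(E.n_rows, 0.0);
  // On a GPU the outer loop would be the thread index; the padded zeros
  // contribute nothing to the sum but cost memory and bandwidth.
  for (std::size_t i = 0; i < E.n_rows; ++i)
    for (std::size_t s = 0; s < E.width; ++s)
      y[i] += E.values[s * E.n_rows + i] * x[E.col_idx[s * E.n_rows + i]];
  return y;
}
```

The padding is also why a hybrid format is attractive: a few long rows blow up the ELLPACK width, so those extra entries are better kept in a separate list.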


>> * Documentation: Add a chapter on GPUs to the manual, particularly on
>> what to expect and what not to expect. Update documentation on
>> webpage regarding installation.
> I will help with the manual.

Cheers :-)


>> * Integration of FEM quadrature from SNES ex52. The CUDA part
>> requiring code generation is not very elegant, while the OpenCL
>> approach is better suited for a library integration thanks to JIT.
>> However, this requires user code to be provided as a string (again
>> not very elegant) or loaded from file (more reasonable). How much FEM
>> functionality do we want to provide via PETSc?
> Multi-GPU is a highly pressing need, IMO. Need to figure out how to make
> Block Jacobi and ASM run efficiently.

The tricky part here is to balance processes vs. threads vs. GPUs. If we use more than one GPU per process, we will duplicate more and more of the current MPI logic over time just to move data between GPUs. However, if we just use one GPU per process, we will under-utilize the CPU unless we have a good interaction with threadcomm.
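
As a rough illustration of the one-GPU-per-process option, here is a sketch of the usual round-robin assignment of node-local ranks to devices. The helper names (device_for_rank, ranks_per_device) are hypothetical and nothing here is existing PETSc code. In a real run the node-local rank would typically come from MPI_Comm_split_type with MPI_COMM_TYPE_SHARED, the device count from cudaGetDeviceCount, and the result would be passed to cudaSetDevice; the sketch leaves those calls out so it stays self-contained.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Round-robin mapping: node-local rank r uses device r mod num_devices.
int device_for_rank(int local_rank, int num_devices) {
  if (num_devices <= 0) throw std::invalid_argument("need at least one device");
  if (local_rank < 0) throw std::invalid_argument("rank must be non-negative");
  return local_rank % num_devices;
}

// How many node-local ranks end up sharing each device under that mapping.
std::vector<int> ranks_per_device(int ranks_on_node, int num_devices) {
  if (num_devices <= 0) throw std::invalid_argument("need at least one device");
  if (ranks_on_node < 0) throw std::invalid_argument("rank count must be non-negative");
  std::vector<int> counts(static_cast<std::size_t>(num_devices), 0);
  for (int r = 0; r < ranks_on_node; ++r)
    ++counts[static_cast<std::size_t>(device_for_rank(r, num_devices))];
  return counts;
}
```

With 16 ranks and 4 GPUs on a node this puts 4 ranks on each device, so either the GPUs are shared between processes or we launch only 4 ranks and leave the remaining cores to threads, which is exactly the trade-off described above.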

Best regards,
Karli
