Steve, are you subscribed to the petsc-dev mailing list?

Karl Rupp <[email protected]> writes:
> Hi Paul,
>
>>>>> * Reduce CUSP dependency: The current elementary operations are
>>>>> mainly realized via CUSP. With better support via CUSPARSE and
>>>>> CUBLAS, I'd add a separate 'native' CUDA backend so that we can
>>>>> provide a full set of vector and sparse matrix operations out of the
>>>>> default NVIDIA toolchain. We will still keep CUSP for its
>>>>> preconditioners, yet we no longer depend on it.
>> Agreed. In the past, I've suggested a -vec_type cuda (not cusp). All the
>> CUSP operations can be done with Thrust algorithms. Since Thrust comes
>> default with CUDA, one can have only a CUDA dependency.
>
> Yes, I opt for
>   -vec_type cuda
> if everything needed is shipped with the CUDA toolkit. I even tend to
> avoid Thrust as much as possible and go with CUBLAS/CUSPARSE because we
> get faster compilation and less compiler warnings this way, but that's
> an implementation detail :-)
>
>
>>>>> * Integrate last bits of txpetscgpu package. I assume Paul will
>>>>> provide a helping hand here.
>> Of course. This will go much faster as much of the hard work is done. Do
>> people want support for different matrix formats in the CUSP classes :
>> i.e. diagonal, ellpack, hybrid? I think the CUSP preconditioners can be
>> derived from matrices stored in non-csr format (although they're likely
>> just doing a convert under the hood).
>
> Since people keep asking for fast SpMV, we should provide these other
> formats as well (actually, they are partially provided with your update
> to the CUSPARSE bindings already). The main reason for CUSP is the SA
> preconditioner, for which SpMV performance doesn't really matter.

Well, SpMV affects cycle time, but setup is primarily sparse matrix-matrix.

>>>>> * Documentation: Add a chapter on GPUs to the manual, particularly on
>>>>> what to expect and what not to expect. Update documentation on
>>>>> webpage regarding installation.
>> I will help with the manual.
>
> Cheers :-)
>
>
>>>>> * Integration of FEM quadrature from SNES ex52. The CUDA part
>>>>> requiring code generation is not very elegant, while the OpenCL
>>>>> approach is better suited for a library integration thanks to JIT.
>>>>> However, this requires user code to be provided as a string (again
>>>>> not very elegant) or loaded from file (more reasonable). How much FEM
>>>>> functionality do we want to provide via PETSc?
>> Multi-GPU is a highly pressing need, IMO. Need to figure out how to make
>> Block Jacobi and ASM run efficiently.
>
> The tricky part here is to balance processes vs. threads vs. GPUs. If we
> use more than one GPU per process, we will duplicate more and more of
> the current MPI logic over time just to move data between GPUs. However,
> if we just use one GPU per process, we will under-utilize the CPU unless
> we have a good interaction with threadcomm.
>
> Best regards,
> Karli
