Zisheng Ye via petsc-users <petsc-users@mcs.anl.gov> writes:

> Dear PETSc Team
>
> We are testing the GPU support in PETSc's KSPSolve, especially for the GAMG 
> and Hypre preconditioners. We have encountered several issues that we would 
> like to ask for your suggestions.
>
> First, we have a couple of questions when working with a single MPI rank:
>
>   1.  We have tested two backends, CUDA and Kokkos. One commonly encountered 
> error is related to SpGEMM in CUDA when the matrix is large, as shown below:
>
> cudaMalloc((void **)&buffer2, bufferSize2) error( cudaErrorMemoryAllocation): 
> out of memory
>
> For the CUDA backend, one can use "-matmatmult_backend_cpu -matptap_backend_cpu" 
> to avoid these problems. However, there seem to be no equivalent options for the 
> Kokkos backend. Is there a good practice for avoiding this error with both 
> backends, and can it be avoided with the Kokkos backend in particular?

Junchao will know more about Kokkos Kernels tuning, but the faster GPU 
matrix-matrix algorithms use extra memory. We should be able to make the host 
option available with Kokkos as well.
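
For reference, an invocation that keeps those products on the host with the CUDA 
backend would look roughly like the following; the executable name and the 
aijcusparse/cuda type options are placeholders for whatever your setup already 
uses:

  ./your_app -mat_type aijcusparse -vec_type cuda -pc_type gamg \
             -matmatmult_backend_cpu -matptap_backend_cpu

The same options can also be set programmatically with PetscOptionsSetValue() 
before the preconditioner is set up.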

>   2.  We have tested the combination of Hypre and the Kokkos backend. It looks 
> like the two are not compatible: we observed that KSPSolve takes a greater 
> number of iterations to exit, and the residual norm in the post-check is much 
> larger than the one obtained with the CUDA backend. This happens for matrices 
> with block size larger than 1. Is there any explanation for this behavior?
>
> Second, we have a couple more questions when working with multiple MPI ranks:
>
>   1.  We are currently using OpenMPI, as we couldn't get Intel MPI to work as a 
> GPU-aware MPI. Is this a known issue with Intel MPI?

As far as I know, Intel MPI's GPU support is only for SYCL/Intel GPUs. In 
general, GPU-aware MPI has been incredibly flaky on all HPC systems despite 
being introduced ten years ago.

>   2.  With OpenMPI we currently see a slowdown when increasing the MPI rank 
> count, as shown in the figure below. Is this normal?

Could you share -log_view output from a couple of representative runs? You could 
send those here or to petsc-ma...@mcs.anl.gov. We need to see what kind of work 
is not scaling before we can say what is causing it.
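
For example (the executable name and the solver options are just placeholders), 
one run per rank count is enough:

  mpiexec -n 1 ./your_app <your usual options> -log_view > log_np1.txt
  mpiexec -n 8 ./your_app <your usual options> -log_view > log_np8.txt

The stage/event timing tables at the end of each log are what we need to compare.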
