Zisheng Ye via petsc-users <petsc-users@mcs.anl.gov> writes:

> Dear PETSc Team,
>
> We are testing the GPU support in PETSc's KSPSolve, especially for the GAMG
> and Hypre preconditioners. We have encountered several issues that we would
> like to ask for your suggestions on.
>
> First, we have a couple of questions when working with a single MPI rank:
>
> 1. We have tested two backends, CUDA and Kokkos. One commonly encountered
> error is related to SpGEMM in CUDA when the matrix is large, as listed below:
>
>   cudaMalloc((void **)&buffer2, bufferSize2) error( cudaErrorMemoryAllocation):
>   out of memory
>
> For the CUDA backend, one can use "-matmatmult_backend_cpu -matptap_backend_cpu"
> to avoid these problems. However, there seem to be no equivalent options in the
> Kokkos backend. Is there any good practice to avoid this error for both
> backends, and can we avoid this error in the Kokkos backend?
Junchao will know more about Kokkos Kernels tuning, but the faster GPU
matrix-matrix algorithms use extra memory. We should be able to make the host
option available with Kokkos.

> 2. We have tested the combination of Hypre and Kokkos as the backend. It
> looks like this combination is not compatible: we observed that KSPSolve
> takes a greater number of iterations to exit, and the residual norm in the
> post-checking is much larger than the one obtained when working with the
> CUDA backend. This happens for matrices with block size larger than 1. Is
> there any explanation for the error?
>
> Second, we have a couple more questions when working with multiple MPI ranks:
>
> 1. We are currently using OpenMPI as we couldn't get Intel MPI to work as a
> GPU-aware MPI. Is this a known issue with Intel MPI?

As far as I know, Intel's MPI is only GPU-aware for SYCL/Intel GPUs. In
general, GPU-aware MPI has been incredibly flaky on all HPC systems despite
being introduced ten years ago.

> 2. With OpenMPI we currently see a slowdown when increasing the MPI count,
> as shown in the figure below. Is this normal?

Could you share -log_view output from a couple of representative runs? You
could send those here or to petsc-ma...@mcs.anl.gov. We need to see what kind
of work is not scaling in order to attribute what may be causing it.
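For reference, the host-fallback workaround for the CUDA backend and the
-log_view collection can be combined on one command line. This is only a
sketch: the executable name (./ex_app), rank count, and problem setup are
placeholders for your application, not something from the original report.

```shell
# Sketch only: "./ex_app" and "-n 4" are placeholders for your application.
# CUDA backend with host fallback for SpGEMM (MatMatMult) and PtAP to avoid
# device out-of-memory in the coarse-grid setup, plus -log_view so the
# per-event timings can be inspected for what is (not) scaling.
mpiexec -n 4 ./ex_app \
    -mat_type aijcusparse -vec_type cuda \
    -pc_type gamg \
    -matmatmult_backend_cpu -matptap_backend_cpu \
    -log_view
```

Sending the resulting -log_view tables from, say, 1, 2, and 4 ranks makes it
much easier to see which events stop scaling.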