Hi, I'm trying to run a parallel matrix/vector build and linear solve with PETSc on 2 MPI processes plus one V100 GPU. I have verified that the matrix build and solve succeed on CPUs only. I'm using CUDA 11.5, a CUDA-enabled OpenMPI, and GCC 9.3. When I run the job with the GPU enabled I get the following error:
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

I'm new to submitting jobs in Slurm that also use GPU resources, so I might be doing something wrong in my submission script. This is it:

#!/bin/bash
#SBATCH -J test
#SBATCH -e /home/Issues/PETSc/test.err
#SBATCH -o /home/Issues/PETSc/test.log
#SBATCH --partition=batch
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1

export OMP_NUM_THREADS=1
module load cuda/11.5
module load openmpi/4.1.1

cd /home/Issues/PETSc
mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg

If anyone has any suggestions on how to troubleshoot this, please let me know.

Thanks!
Marcos
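
P.S. In case it's useful, a minimal check like the one below could go right before the mpirun line to confirm what GPU each rank actually sees. This is only a sketch: it assumes nvidia-smi is available on the compute node and relies on OpenMPI's mpirun exporting OMPI_COMM_WORLD_RANK to the launched processes; I haven't verified this is the right way to diagnose the problem.

# Print GPU visibility on the allocated node, then what each MPI rank sees
echo "CUDA_VISIBLE_DEVICES on launch node: ${CUDA_VISIBLE_DEVICES:-unset}"
nvidia-smi -L
mpirun -n 2 bash -c 'echo "rank ${OMPI_COMM_WORLD_RANK}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"'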