So the PETSc tests all run, including the test that uses a GPU. The hypre test is failing, and it is impossible to tell from the output why.
You can run it manually:

    cd src/snes/tutorials
    make ex19
    mpiexec -n 1 ./ex19 -dm_vec_type cuda -dm_mat_type aijcusparse -da_refine 3 -snes_monitor_short -ksp_norm_type unpreconditioned -pc_type hypre -info > somefile

then take a look at the output in somefile and send it to us.

  Barry

> On Jul 14, 2022, at 12:32 PM, Juan Pablo de Lima Costa Salazar via petsc-users <[email protected]> wrote:
>
> Hello,
>
> I was hoping to get help regarding a runtime error I am encountering on a cluster node with 4 Tesla K40m GPUs after configuring PETSc with the following command:
>
> $ ./configure --force \
>       --with-precision=double \
>       --with-debugging=0 \
>       --COPTFLAGS=-O3 \
>       --CXXOPTFLAGS=-O3 \
>       --FOPTFLAGS=-O3 \
>       PETSC_ARCH=linux64GccDPInt32-spack \
>       --download-fblaslapack \
>       --download-openblas \
>       --download-hypre \
>       --download-hypre-configure-arguments=--enable-unified-memory \
>       --with-mpi-dir=/opt/ohpc/pub/mpi/openmpi4-gnu9/4.0.4 \
>       --with-cuda=1 \
>       --download-suitesparse \
>       --download-dir=downloads \
>       --with-cudac=/opt/ohpc/admin/spack/0.15.0/opt/spack/linux-centos8-ivybridge/gcc-9.3.0/cuda-11.7.0-hel25vgwc7fixnvfl5ipvnh34fnskw3m/bin/nvcc \
>       --with-packages-download-dir=downloads \
>       --download-sowing=downloads/v1.1.26-p4.tar.gz \
>       --with-cuda-arch=35
>
> When I run
>
> $ make PETSC_DIR=/home/juan/OpenFOAM/juan-v2206/petsc-cuda PETSC_ARCH=linux64GccDPInt32-spack check
> Running check examples to verify correct installation
> Using PETSC_DIR=/home/juan/OpenFOAM/juan-v2206/petsc-cuda and PETSC_ARCH=linux64GccDPInt32-spack
> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
> 3,5c3,15
> <   1 SNES Function norm 4.12227e-06
> <   2 SNES Function norm 6.098e-11
> < Number of SNES iterations = 2
> ---
> > CUDA ERROR (code = 101, invalid device ordinal) at memory.c:139
> > CUDA ERROR (code = 101, invalid device ordinal) at memory.c:139
> > --------------------------------------------------------------------------
> > Primary job terminated normally, but 1 process returned
> > a non-zero exit code. Per user-direction, the job has been aborted.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpiexec detected that one or more processes exited with non-zero status,
> > thus causing the job to be terminated. The first process to do so was:
> >
> >   Process name: [[52712,1],0]
> >   Exit code:    1
> > --------------------------------------------------------------------------
> /home/juan/OpenFOAM/juan-v2206/petsc-cuda/src/snes/tutorials
> Possible problem with ex19 running with hypre, diffs above
> =========================================
> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
> C/C++ example src/snes/tutorials/ex19 run successfully with suitesparse
> Fortran example src/snes/tutorials/ex5f run successfully with 1 MPI process
> Completed test examples
>
> I have compiled the code on the head node (without GPUs) and on the compute node where there are 4 GPUs.
>
> $ nvidia-debugdump -l
> Found 4 NVIDIA devices
>     Device ID:       0
>     Device name:     Tesla K40m
>     GPU internal ID: 0320717032250
>
>     Device ID:       1
>     Device name:     Tesla K40m
>     GPU internal ID: 0320717031968
>
>     Device ID:       2
>     Device name:     Tesla K40m
>     GPU internal ID: 0320717032246
>
>     Device ID:       3
>     Device name:     Tesla K40m
>     GPU internal ID: 0320717032235
>
> Attached are the log files from configure and make.
>
> Any pointers are highly appreciated. My intention is to use PETSc as a linear solver for OpenFOAM, leveraging the availability of GPUs at the same time. Currently I can run PETSc without GPU support.
>
> Cheers,
> Juan S.
>
> <configure.log.tar.gz><make.log.tar.gz>
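[Editor's note: the manual reproduction steps above can be collected into a small script. This is only a sketch under stated assumptions: it assumes PETSC_DIR points at the CUDA-enabled PETSc source tree and that it is run on the GPU node; the ex19 options are taken verbatim from the thread, and the actual build/run commands are left as comments since they require the cluster environment.]

```shell
# Sketch of Barry's manual reproduction of the failing hypre test.
# Assumes PETSC_DIR points at the CUDA-enabled PETSc tree (site-specific).
# Solver options copied verbatim from the thread:
EX19_OPTS="-dm_vec_type cuda -dm_mat_type aijcusparse -da_refine 3 \
-snes_monitor_short -ksp_norm_type unpreconditioned -pc_type hypre -info"

# On the GPU node, one would run:
#   cd "$PETSC_DIR/src/snes/tutorials"
#   make ex19
#   mpiexec -n 1 ./ex19 $EX19_OPTS > somefile 2>&1
# and then inspect somefile for the first CUDA error reported by -info.
echo "mpiexec -n 1 ./ex19 $EX19_OPTS > somefile"
```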
