Looks like you are tripping the following check:

cerr = cupmGetDeviceCount(&ndev);
if (PetscUnlikely(cerr == cupmErrorStubLibrary)) {
  … // handle missing driver or stub library
} else {CHKERRCUPM(cerr);} // your error here

Is it an error if a user configures with CUDA (i.e. signals intent to use CUDA) 
but disables all the devices? On the one hand, yes: the user may have disabled 
the devices inadvertently via this environment variable. On the other hand, 
they should be able to set this variable freely without PETSc crashing. Should 
we warn users, or handle this silently?
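
To make the options concrete, here is a rough sketch of the three policies (error, warn, silent). This is not PETSc code: DevErr, NoDevicePolicy and resolveDeviceCount are hypothetical names made up for illustration, and only the value 100 for the no-device case is taken from the log below. Treating the stub-library case as "zero devices" is also just an assumption.

```cpp
#include <cstdio>
#include <stdexcept>

// Hypothetical stand-ins, not PETSc or CUDA types. Only NoDevice = 100 is
// taken from the log (cudaErrorNoDevice); the other values are arbitrary.
enum class DevErr { Success = 0, StubLibrary = 1, NoDevice = 100, Other = 999 };

// The three ways we could treat "CUDA configured, but zero visible devices".
enum class NoDevicePolicy { Error, Warn, Silent };

// Returns the number of usable devices, or throws when the error is fatal
// under the chosen policy.
int resolveDeviceCount(DevErr err, int ndev, NoDevicePolicy policy)
{
  switch (err) {
  case DevErr::Success:
    return ndev;
  case DevErr::StubLibrary:
    // Assumption: a missing driver or stub library means "no devices".
    return 0;
  case DevErr::NoDevice:
    if (policy == NoDevicePolicy::Error) throw std::runtime_error("no CUDA-capable device is detected");
    if (policy == NoDevicePolicy::Warn) std::fprintf(stderr, "warning: CUDA enabled at configure time but no devices are visible\n");
    return 0; // Warn and Silent both fall back to running on the CPU
  default:
    break;
  }
  // Anything else stays a hard error, as it is today.
  throw std::runtime_error("unhandled device error");
}
```

With NoDevicePolicy::Error the no-device case throws, which is roughly the current behavior; with Warn or Silent the call returns 0 and the run would continue on the CPU.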

Note that PETSc does provide the '-device_enable none' option to disable all 
devices, or, if you only want to disable CUDA devices, '-device_enable_cuda none', 
which should achieve the same effect as CUDA_VISIBLE_DEVICES=-1. But maybe it 
is too obscure to ask users to know about and use these flags instead of 
setting the CUDA env variables. (Btw, can you test that using 
'-device_enable_cuda none' does not crash when setting CUDA_VISIBLE_DEVICES=-1?)

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Nov 1, 2021, at 10:09, Stefano Zampini <[email protected]> wrote:
> 
> Just found out that if we configure with CUDA and then want to run on CPU 
> only using CUDA_VISIBLE_DEVICES=-1, PETSc errors out. Is this intended 
> behavior? I assumed it should work.
> This is with main.
> 
> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check
> Running check examples to verify correct installation
> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and 
> PETSC_ARCH=arch-ecrcml-cuda-double
> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
> Completed test examples
> 
> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check 
> CUDA_VISIBLE_DEVICES=1
> Running check examples to verify correct installation
> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and 
> PETSC_ARCH=arch-ecrcml-cuda-double
> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
> Completed test examples
> 
> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check 
> CUDA_VISIBLE_DEVICES=-1
> Running check examples to verify correct installation
> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and 
> PETSC_ARCH=arch-ecrcml-cuda-double
> Possible error running C/C++ src/snes/tutorials/ex19 with 1 MPI process
> See http://www.mcs.anl.gov/petsc/documentation/faq.html
> [0]PETSC ERROR: --------------------- Error Message 
> --------------------------------------------------------------
> [0]PETSC ERROR: GPU error 
> [0]PETSC ERROR: cuda error 100 (cudaErrorNoDevice) : no CUDA-capable device 
> is detected
> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-368-g72b201b202  GIT 
> Date: 2021-10-29 14:48:19 +0300
> [0]PETSC ERROR: ./ex19 on a arch-ecrcml-cuda-double named qaysar.kaust.edu.sa 
> by zampins Mon Nov  1 18:06:12 2021
> [0]PETSC ERROR: Configure options 
> --with-blaslapack-include=/home/zampins/miniforge/envs/ecrcml-cuda/include 
> --with-blaslapack-lib=/home/zampins/miniforge/envs/ecrcml-cuda/lib/libmkl_rt.so
>  --download-h2opus --with-cuda 
> --with-kblas-dir=/home/zampins/miniforge/envs/ecrcml-cuda 
> --with-magma-dir=/home/zampins/miniforge/envs/ecrcml-cuda 
> --LDFLAGS=/usr/lib/x86_64-linux-gnu/libcuda.so --with-debugging=1 
> --with-openmp --with-precision=double --with-fc=0 
> PETSC_ARCH=arch-ecrcml-cuda-double 
> PETSC_DIR=/home/zampins/miniforge/Devel/petsc
> [0]PETSC ERROR: #1 initialize() at 
> /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:302
> [0]PETSC ERROR: #2 PetscDeviceInitializeTypeFromOptions_Private() at 
> /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/interface/device.cxx:292
> [0]PETSC ERROR: #3 PetscDeviceInitializeFromOptions_Internal() at 
> /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/interface/device.cxx:417
> [0]PETSC ERROR: #4 PetscInitialize_Common() at 
> /home/zampins/miniforge/Devel/petsc/src/sys/objects/pinit.c:956
> [0]PETSC ERROR: #5 PetscInitialize() at 
> /home/zampins/miniforge/Devel/petsc/src/sys/objects/pinit.c:1231
> --------------------------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> 
> [