https://gitlab.com/petsc/petsc/-/merge_requests/4512

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Nov 1, 2021, at 11:00, Barry Smith <bsm...@petsc.dev> wrote:
>
> PETSc code could check for the environmental variable
> CUDA_VISIBLE_DEVICES=-1 if that makes sense to resolve the situation.
>
>> On Nov 1, 2021, at 11:43 AM, Jacob Faibussowitsch <jacob....@gmail.com> wrote:
>>
>> Looks like you are tripping up the following:
>>
>> cerr = cupmGetDeviceCount(&ndev);
>> if (PetscUnlikely(cerr == cupmErrorStubLibrary)) {
>>   … // handle missing driver or stub library
>> } else {CHKERRCUPM(cerr);} // your error here
>>
>> Is it an error if a user configures with cuda (i.e. signals intent to use
>> cuda) but disables all the devices? On the one hand, yes this can be
>> considered an error if the user inadvertently disables the devices via this
>> environment variable without knowing, but on the other hand they should be
>> able to freely set this variable without petsc crashing… Should we warn
>> users? Handle this silently?
>>
>> Note that petsc does provide '-device_enable none' option to disable all
>> devices, or if you only want to disable cuda devices '-device_enable_cuda
>> none' which should achieve the same effect as CUDA_VISIBLE_DEVICES=-1. But
>> maybe it is too obscure to ask users to know about and use these flags
>> instead of setting the cuda env variables. (Btw, can you test that using
>> '-device_enable_cuda none' does not crash when setting
>> CUDA_VISIBLE_DEVICES=-1?)
>>
>> Best regards,
>>
>> Jacob Faibussowitsch
>> (Jacob Fai - booss - oh - vitch)
>>
>>> On Nov 1, 2021, at 10:09, Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>
>>> Just found out that if we configure with cuda and then want to run on CPU
>>> only using CUDA_VISIBLE_DEVICES=-1 PETSc errors out. Is this intended
>>> behavior? I supposed it should work
>>> This is with main
>>>
>>> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check
>>> Running check examples to verify correct installation
>>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and
>>> PETSC_ARCH=arch-ecrcml-cuda-double
>>> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>>> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>>> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>>> Completed test examples
>>>
>>> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check CUDA_VISIBLE_DEVICES=1
>>> Running check examples to verify correct installation
>>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and
>>> PETSC_ARCH=arch-ecrcml-cuda-double
>>> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>>> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>>> C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>>> Completed test examples
>>>
>>> (ecrcml-cuda) zampins@qaysar:~/miniforge/Devel/petsc$ make check CUDA_VISIBLE_DEVICES=-1
>>> Running check examples to verify correct installation
>>> Using PETSC_DIR=/home/zampins/miniforge/Devel/petsc and
>>> PETSC_ARCH=arch-ecrcml-cuda-double
>>> Possible error running C/C++ src/snes/tutorials/ex19 with 1 MPI process
>>> See http://www.mcs.anl.gov/petsc/documentation/faq.html
>>> [0]PETSC ERROR: --------------------- Error Message
>>> --------------------------------------------------------------
>>> [0]PETSC ERROR: GPU error
>>> [0]PETSC ERROR: cuda error 100 (cudaErrorNoDevice) : no CUDA-capable device
>>> is detected
>>> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>>> [0]PETSC ERROR: Petsc Development GIT revision: v3.16.0-368-g72b201b202
>>> GIT Date: 2021-10-29 14:48:19 +0300
>>> [0]PETSC ERROR: ./ex19 on a arch-ecrcml-cuda-double named
>>> qaysar.kaust.edu.sa by zampins Mon Nov 1
>>> 18:06:12 2021
>>> [0]PETSC ERROR: Configure options
>>> --with-blaslapack-include=/home/zampins/miniforge/envs/ecrcml-cuda/include
>>> --with-blaslapack-lib=/home/zampins/miniforge/envs/ecrcml-cuda/lib/libmkl_rt.so
>>> --download-h2opus --with-cuda
>>> --with-kblas-dir=/home/zampins/miniforge/envs/ecrcml-cuda
>>> --with-magma-dir=/home/zampins/miniforge/envs/ecrcml-cuda
>>> --LDFLAGS=/usr/lib/x86_64-linux-gnu/libcuda.so --with-debugging=1
>>> --with-openmp --with-precision=double --with-fc=0
>>> PETSC_ARCH=arch-ecrcml-cuda-double
>>> PETSC_DIR=/home/zampins/miniforge/Devel/petsc
>>> [0]PETSC ERROR: #1 initialize() at
>>> /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:302
>>> [0]PETSC ERROR: #2 PetscDeviceInitializeTypeFromOptions_Private() at
>>> /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/interface/device.cxx:292
>>> [0]PETSC ERROR: #3 PetscDeviceInitializeFromOptions_Internal() at
>>> /home/zampins/miniforge/Devel/petsc/src/sys/objects/device/interface/device.cxx:417
>>> [0]PETSC ERROR: #4 PetscInitialize_Common() at
>>> /home/zampins/miniforge/Devel/petsc/src/sys/objects/pinit.c:956
>>> [0]PETSC ERROR: #5 PetscInitialize() at
>>> /home/zampins/miniforge/Devel/petsc/src/sys/objects/pinit.c:1231
>>> --------------------------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code. Per user-direction, the job has been aborted.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>>
>>> [
>>
>
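
For illustration only, the environment-variable check suggested in the thread might look roughly like the sketch below. This is not PETSc code: the helper name cuda_env_hides_all_devices is hypothetical, and the sketch assumes the CUDA runtime behaviour that an invalid index in CUDA_VISIBLE_DEVICES hides that entry and every entry after it, so a leading -1 leaves no visible devices. It looks only at the first comma-separated entry and treats an unset variable as "not hidden".

```c
#include <string.h>

/* Hypothetical helper, not part of PETSc.
   Returns 1 if the first comma-separated entry of the given
   CUDA_VISIBLE_DEVICES value is exactly "-1", and 0 otherwise.
   Assumption: the CUDA runtime ignores everything from the first
   invalid index onwards, so a leading -1 hides every device.
   A NULL value (variable not set) returns 0. */
int cuda_env_hides_all_devices(const char *value)
{
  size_t len;

  if (!value) return 0;
  while (*value == ' ') ++value;                  /* skip leading spaces */
  len = strcspn(value, ",");                      /* length of the first entry */
  while (len > 0 && value[len - 1] == ' ') --len; /* trim trailing spaces */
  return len == 2 && strncmp(value, "-1", 2) == 0;
}
```

A caller would pass in getenv("CUDA_VISIBLE_DEVICES") (from <stdlib.h>) before querying the device count, and could then skip CUDA initialization or print a warning instead of raising the cudaErrorNoDevice error shown in the log. Whether to warn or stay silent is the open question from the thread; the sketch does not settle it.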