On Fri, May 28, 2021 at 10:40 AM Barry Smith <[email protected]> wrote:
>
> Thanks. On machines such as this one where you have to use $MPIEXEC to
> run code you will still need to provide the generation with
> -with-cuda-gencodearch=70. On systems where it can directly query the GPU
> without MPIEXEC it will automatically produce the correct result. Otherwise
> it will guess by compiling for different generations but this can produce
> an incorrect answer.
>
Yes, on Summit with CUDA-11, the script guesses sm_80, but it should actually be sm_70. Probably we can test the hostname and then set the correct CUDA arch for common machines, but that would be kind of an overreaction.

>
> Barry
>
>
> On May 28, 2021, at 7:59 AM, Mark Adams <[email protected]> wrote:
>
> On Thu, May 27, 2021 at 11:50 PM Barry Smith <[email protected]> wrote:
>
>>
>> Mark,
>>
>> Where did you run the little test program I sent you
>>
>> 1) when it produced
>>
>> The 1120 and negative number and (was this on the compile server or
>> on a compute node?)
>>
>
> This is fine now. Look at my last email. I was not using srun.
>
>
>> 2) when it produced the correct answer? (compile server or compute node?)
>>
>> Do you run configure on a compile server (that has no GPUs) or a compute
>> server that has GPUs
>>
>
> You have to do everything on the compute nodes on Cori/gpu.
>
>
>> Don't spend your time bisecting PETSc; we know exactly where the problem
>> is, we just don't see how it happens.
>>
>
>> cuda.py, if it cannot find deviceQuery and if you did not provide a
>> generation arch with -with-cuda-gencodearch=70,
>>
>
> I thought I was not supposed to use that anymore. It sounds like it is
> optional.
>
>
>> runs a version of the little code I sent you to get the number but it is
>> ??apparently?? producing garbage or not running on the compiler server and
>> gives the wrong number 1120.
>>
>
> Does PETSc use MPIEXEC to run this?
>
> Note, I have not been able to get 'make check' to work on Cori/gpu. I use
> '-with-mpiexec=srun -G1 [-c 20]' and it fails to execute the tests.
>
> OK, putting -with-cuda-gencodearch=70 back in has fixed this problem. It
> is running now.
>
> Thanks,
>
>>
>> Just use the option -with-cuda-gencodearch=70 (you do not need to
>> pass this information to any flags any more, just with this option and it
>> will use it).
>>
>> Barry
>>
>> Ideally we want it to figure it out automatically and this little test
>> program in configure is supposed to do this but since that is not always
>> working yet you should just use -with-cuda-gencodearch=70
>>
>>
>> On May 27, 2021, at 5:45 AM, Mark Adams <[email protected]> wrote:
>>
>> FYI, I was running the test incorrectly:
>> 03:38 cgpu12 ~/petsc_install$ srun -n 1 -G 1 ./a.out
>> 70
>> 70
>>
>> On Wed, May 26, 2021 at 10:21 PM Mark Adams <[email protected]> wrote:
>>
>>> I had git bisect working and was 4 steps away when I got a new crash.
>>> configure.log is empty.
>>>
>>> 19:15 1 cgpu02 (a531cba26b...)|BISECTING ~/petsc$ git bisect bad
>>> Bisecting: 19 revisions left to test after this (roughly 4 steps)
>>> [149e269f455574fbe8ce3ebaf42121ae7fdf0635] Merge branch 'tisaac/feature-spqr' into 'main'
>>> 19:16 cgpu02 (149e269f45...)|BISECTING ~/petsc$ ../arch-cori-gpu-opt-gcc.py PETSC_DIR=$PWD
>>> ===============================================================================
>>>              Configuring PETSc to compile on your system
>>> ===============================================================================
>>> *******************************************************************************
>>>         CONFIGURATION CRASH (Please send configure.log to [email protected])
>>> *******************************************************************************
>>>
>>> EOL while scanning string literal (cuda.py, line 176)
>>>   File "/global/u2/m/madams/petsc/config/configure.py", line 455, in petsc_configure
>>>     framework = config.framework.Framework(['--configModules=PETSc.Configure','--optionsModule=config.compilerOptions']+sys.argv[1:], loadArgDB = 0)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 107, in __init__
>>>     self.createChildren()
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 344, in createChildren
>>>     self.getChild(moduleName)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>>     config.setupDependencies(self)
>>>   File "/global/u2/m/madams/petsc/config/PETSc/Configure.py", line 80, in setupDependencies
>>>     self.blasLapack = framework.require('config.packages.BlasLapack',self)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>>     config = self.getChild(moduleName, keywordArgs)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>>     config.setupDependencies(self)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/BlasLapack.py", line 21, in setupDependencies
>>>     config.package.Package.setupDependencies(self, framework)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/package.py", line 151, in setupDependencies
>>>     self.mpi = framework.require('config.packages.MPI',self)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>>     config = self.getChild(moduleName, keywordArgs)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>>     config.setupDependencies(self)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/MPI.py", line 73, in setupDependencies
>>>     self.mpich = framework.require('config.packages.MPICH', self)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>>     config = self.getChild(moduleName, keywordArgs)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 329, in getChild
>>>     config.setupDependencies(self)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/packages/MPICH.py", line 16, in setupDependencies
>>>     self.cuda = framework.require('config.packages.cuda',self)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 349, in require
>>>     config = self.getChild(moduleName, keywordArgs)
>>>   File "/global/u2/m/madams/petsc/config/BuildSystem/config/framework.py", line 302, in getChild
>>>     type = __import__(moduleName, globals(), locals(), ['Configure']).Configure
>>> 19:16 cgpu02 (149e269f45...)|BISECTING ~/petsc$ ../arch-cori-gpu-opt-gcc.py PETSC_DIR=$PWD
>>>
>>> On Wed, May 26, 2021 at 10:10 PM Junchao Zhang <[email protected]> wrote:
>>>
>>>> On Wed, May 26, 2021 at 6:13 PM Barry Smith <[email protected]> wrote:
>>>>
>>>>>
>>>>> What is HOST=cori09? Does it have GPUs?
>>>>>
>>>>> https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_164490976c8e07e028a8f1ce1f5cd42d6
>>>>>
>>>>> Seems to clearly state
>>>>>
>>>>> int cudaDeviceProp::major [inherited]
>>>>>
>>>>> Major compute capability
>>>>>
>>>>> Mark, please compile and run this program on the machine you are
>>>>> running configure on
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <cuda.h>
>>>>> #include <cuda_runtime.h>
>>>>> #include <cuda_runtime_api.h>
>>>>> #include <cuda_device_runtime_api.h>
>>>>> int main(int arg,char **args)
>>>>> {
>>>>>   struct cudaDeviceProp dp;
>>>>>   cudaGetDeviceProperties(&dp, 0);
>>>>>   printf("%d\n",10*dp.major+dp.minor);
>>>>>
>>>>>   int major,minor;
>>>>>   cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, 0);
>>>>>   cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, 0);
>>>>>   printf("%d\n",10*major+minor);
>>>>>   return(0);
>>>>>
>>>> Probably, you need to check the return code of these two function calls
>>>> to make sure they are correct.
>>>>
>>>>> }
>>>>>
>>>>> This is what I get
>>>>>
>>>>> $ nvcc mytest.c -lcuda
>>>>> ~/petsc* (main=)* arch-main
>>>>> $ ./a.out
>>>>> 70
>>>>> 70
>>>>>
>>>>> Which is exactly what it is supposed to do.
>>>>>
>>>>> Barry
>>>>>
>>>>> On May 26, 2021, at 5:31 PM, Barry Smith <[email protected]> wrote:
>>>>>
>>>>> Yes, this code which I guess never got hit before
>>>>>
>>>>> cudaDeviceProp dp; cudaGetDeviceProperties(&dp, 0);
>>>>> printf("%d\n",10*dp.major+dp.minor);
>>>>> return(0);;
>>>>>
>>>>> is using the wrong property for the generation.
>>>>>
>>>>> Back to the CUDA documentation for the correct information.
>>>>>
>>>>> On May 26, 2021, at 3:47 PM, Jacob Faibussowitsch <[email protected]> wrote:
>>>>>
>>>>> 1120 sounds suspiciously like some CUDA version rather than
>>>>> architecture or compute capability…
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Jacob Faibussowitsch
>>>>> (Jacob Fai - booss - oh - vitch)
>>>>> Cell: +1 (312) 694-3391
>>>>>
>>>>> On May 26, 2021, at 22:29, Mark Adams <[email protected]> wrote:
>>>>>
>>>>> I started to get this error today on Cori.
>>>>>
>>>>> nvcc fatal : Unsupported gpu architecture 'compute_1120'
>>>>>
>>>>> I am pretty sure I had a clean build but I can redo it if you don't
>>>>> know where this is from.
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>> <configure.log>
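
On the bogus 1120 value discussed in this thread: a minimal Python sketch of a sanity check that a configure-style script could apply to the little test program's output before turning it into an nvcc architecture flag. The names parse_gencodearch and MAX_PLAUSIBLE_MAJOR are hypothetical, and the upper bound is an assumption chosen only for illustration. This is not what cuda.py actually does.

```python
from typing import Optional

# Assumed loose upper bound on the major compute capability.
# Illustrative only; not taken from PETSc or from NVIDIA documentation.
MAX_PLAUSIBLE_MAJOR = 20


def parse_gencodearch(stdout: str) -> Optional[int]:
    """Return the compute capability (e.g. 70) printed by the test program,
    or None if the output looks like garbage.

    Hypothetical helper: the caller would fall back to asking the user for
    -with-cuda-gencodearch when this returns None.
    """
    values = []
    for token in stdout.split():
        try:
            values.append(int(token))
        except ValueError:
            return None  # non-numeric output, e.g. an error message
    if not values:
        return None  # the program printed nothing (it may not have run)
    # The test program prints the value once per API; the copies must agree.
    if len(set(values)) != 1:
        return None
    arch = values[0]
    major = arch // 10
    if not 1 <= major <= MAX_PLAUSIBLE_MAJOR:
        return None  # rejects values such as 1120 or negative numbers
    return arch
```

With a check along these lines, output such as "1120" followed by a negative number would be rejected instead of being passed to nvcc as compute_1120, while "70" printed twice would be accepted.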
