Fande,

From your configure_main.log:

  cuda:
    Version:  10.1
    Includes: -I/apps/local/spack/software/gcc-7.5.0/cuda-10.1.243-v4ymjqcrr7f72qfiuzsstuy5jiajbuey/include
    Library:  -Wl,-rpath,/apps/local/spack/software/gcc-7.5.0/cuda-10.1.243-v4ymjqcrr7f72qfiuzsstuy5jiajbuey/lib64 -L/apps/local/spack/software/gcc-7.5.0/cuda-10.1.243-v4ymjqcrr7f72qfiuzsstuy5jiajbuey/lib64 -L/apps/local/spack/software/gcc-7.5.0/cuda-10.1.243-v4ymjqcrr7f72qfiuzsstuy5jiajbuey/lib64/stubs -lcudart -lcufft -lcublas -lcusparse -lcusolver -lcurand -lcuda

You can see the `stubs` directory is not in the rpath. We took a lot of effort to achieve that. You need to double-check the reason.

--Junchao Zhang

On Mon, Jan 31, 2022 at 9:40 AM Fande Kong <fdkong...@gmail.com> wrote:

OK, finally we resolved the issue. The issue was that there were two libcuda libraries on a GPU compute node: /usr/lib64/libcuda and /apps/local/spack/software/gcc-7.5.0/cuda-10.1.243-v4ymjqcrr7f72qfiuzsstuy5jiajbuey/lib64/stubs/libcuda. On a login node there is only one libcuda library: /apps/local/spack/software/gcc-7.5.0/cuda-10.1.243-v4ymjqcrr7f72qfiuzsstuy5jiajbuey/lib64/stubs/libcuda. We cannot see /usr/lib64/libcuda from a login node, which is where I was compiling the code.

Before Junchao's commit, we did not have "-Wl,-rpath" forcing PETSc to take /apps/local/spack/software/gcc-7.5.0/cuda-10.1.243-v4ymjqcrr7f72qfiuzsstuy5jiajbuey/lib64/stubs/libcuda, so a code compiled on a login node could correctly pick up the CUDA library from /usr/lib64/libcuda at runtime. With "-Wl,-rpath", the code always takes the CUDA library from /apps/local/spack/software/gcc-7.5.0/cuda-10.1.243-v4ymjqcrr7f72qfiuzsstuy5jiajbuey/lib64/stubs/libcuda, which is a bad library.

Right now, I just compiled the code on a compute node instead of a login node; PETSc was able to pick up the correct library from /usr/lib64/libcuda, and everything ran fine.

I am not sure whether it is a good idea to search for "stubs", since the system might have the correct libraries in other places. Should I instead do the compiling in a batch job on a compute node?

Thanks,

Fande

On Wed, Jan 26, 2022 at 1:49 PM Fande Kong <fdkong...@gmail.com> wrote:

Yes, please see the attached file.

Fande

On Wed, Jan 26, 2022 at 11:49 AM Junchao Zhang <junchao.zh...@gmail.com> wrote:

Do you have the configure.log with main?

--Junchao Zhang

On Wed, Jan 26, 2022 at 12:26 PM Fande Kong <fdkong...@gmail.com> wrote:

I am on petsc-main:

  commit 1390d3a27d88add7d79c9b38bf1a895ae5e67af6
  Merge: 96c919c d5f3255
  Author: Satish Balay <ba...@mcs.anl.gov>
  Date:   Wed Jan 26 10:28:32 2022 -0600

      Merge remote-tracking branch 'origin/release'

It is still broken.

Thanks,

Fande

On Wed, Jan 26, 2022 at 7:40 AM Junchao Zhang <junchao.zh...@gmail.com> wrote:

The good one uses the compiler's default library/header path. The bad one searches the CUDA toolkit path and uses rpath linking. Though the paths look the same on the login node, they could behave differently on a compute node depending on its environment. I think we fixed the issue in cuda.py (i.e., first try the compiler's default, then the toolkit). That's why I wanted Fande to use petsc/main.

--Junchao Zhang

On Tue, Jan 25, 2022 at 11:59 PM Barry Smith <bsm...@petsc.dev> wrote:

bad has extra

  -L/apps/local/spack/software/gcc-7.5.0/cuda-10.1.243-v4ymjqcrr7f72qfiuzsstuy5jiajbuey/lib64/stubs -lcuda

good does not.

Try removing the stubs directory and -lcuda from the bad $PETSC_ARCH/lib/petsc/conf/variables, and likely the bad will start working.

Barry

I never liked the stubs stuff.
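A minimal diagnostic along these lines (a sketch, assuming only the CUDA runtime API; it is not one of the logs attached to this thread) can show at runtime whether a binary picked up the real /usr/lib64/libcuda or the stub: with the stub loaded, the driver version typically reports as 0 and cudaGetDeviceCount() fails with cudaErrorInsufficientDriver (error 35).

```c
/* Hedged sketch: report driver/runtime versions and device count to tell
 * whether the real libcuda or the link-time stub was loaded at runtime. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
  int driverVersion = 0, runtimeVersion = 0, ndev = 0;

  /* Version reported by the loaded libcuda; typically 0 when no usable driver (e.g. the stub) is found. */
  cudaDriverGetVersion(&driverVersion);
  /* Version of the CUDA runtime (libcudart) the binary uses. */
  cudaRuntimeGetVersion(&runtimeVersion);

  cudaError_t err = cudaGetDeviceCount(&ndev);

  printf("driver version  : %d\n", driverVersion);
  printf("runtime version : %d\n", runtimeVersion);
  printf("device count    : %d (error %d: %s)\n", ndev, (int)err, cudaGetErrorString(err));
  return (int)err; /* nonzero, e.g. 35 (cudaErrorInsufficientDriver), suggests the wrong libcuda */
}
```

Running this, together with `ldd ./a.out | grep libcuda`, on both a login node and a compute node would make the difference between the two libcuda libraries visible.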
On Jan 25, 2022, at 11:29 PM, Fande Kong <fdkong...@gmail.com> wrote:

Hi Junchao,

I attached a "bad" configure log and a "good" configure log. The "bad" one was produced at 246ba74192519a5f34fb6e227d1c64364e19ce2c and the "good" one at 384645a00975869a1aacbd3169de62ba40cad683. The good hash is the last good one, immediately before the bad one.

I think you could compare these two logs and check what the differences are.

Thanks,

Fande

On Tue, Jan 25, 2022 at 8:21 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:

Fande, could you send the configure.log that works (i.e., before this offending commit)?

--Junchao Zhang

On Tue, Jan 25, 2022 at 8:21 PM Fande Kong <fdkong...@gmail.com> wrote:

Not sure if this is helpful. I did "git bisect", and here is the result:

  [kongf@sawtooth2 petsc]$ git bisect bad
  246ba74192519a5f34fb6e227d1c64364e19ce2c is the first bad commit
  commit 246ba74192519a5f34fb6e227d1c64364e19ce2c
  Author: Junchao Zhang <jczh...@mcs.anl.gov>
  Date:   Wed Oct 13 05:32:43 2021 +0000

      Config: fix CUDA library and header dirs

  :040000 040000 187c86055adb80f53c1d0565a8888704fec43a96 ea1efd7f594fd5e8df54170bc1bc7b00f35e4d5f M      config

Starting from this commit, GPU did not work for me on our HPC.

Thanks,

Fande

On Tue, Jan 25, 2022 at 7:18 PM Fande Kong <fdkong...@gmail.com> wrote:

On Tue, Jan 25, 2022 at 9:04 AM Jacob Faibussowitsch <jacob....@gmail.com> wrote:

Configure should not have an impact here, I think. The reason I had you run `cudaGetDeviceCount()` is that this is the CUDA call (and in fact the only CUDA call) in the initialization sequence that returns the error code. There should be no prior CUDA calls. Maybe this is a problem with oversubscribing GPUs? In the runs that crash, how many ranks are using any given GPU at once? Maybe MPS is required.

I used one MPI rank.

Fande

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 21, 2022, at 12:01, Fande Kong <fdkong...@gmail.com> wrote:

Thanks Jacob,

On Thu, Jan 20, 2022 at 6:25 PM Jacob Faibussowitsch <jacob....@gmail.com> wrote:

The segfault is caused by the following check at src/sys/objects/device/impls/cupm/cupmdevice.cxx:349 being a PetscUnlikelyDebug() rather than just PetscUnlikely():

```
if (PetscUnlikelyDebug(_defaultDevice < 0)) { // _defaultDevice is in fact < 0 here and uncaught
```

To clarify:

"Lazy" initialization is not that lazy after all; it still does some 50% of the initialization that "eager" initialization does. It stops short of initializing the CUDA runtime, checking CUDA-aware MPI, gathering device data, and initializing cuBLAS and friends. Lazy also, importantly, swallows any errors that crop up during initialization, storing the resulting error code for later (specifically _defaultDevice = -init_error_value;).

So whether you initialize lazily or eagerly makes no difference here, as _defaultDevice will always contain -35.
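In other words, a simplified sketch of the pattern being described (hypothetical names, not the actual PETSc source): the failure from cudaGetDeviceCount() is recorded as a negative sentinel during lazy initialization and only surfaces when a device is first requested.

```c
/* Hedged sketch of the error-swallowing pattern; the real logic lives in
 * src/sys/objects/device/impls/cupm/cupmdevice.cxx. */
#include <cuda_runtime.h>

static int defaultDevice = 0; /* >= 0: usable device id, < 0: negated CUDA error code */

static void lazy_initialize(void)
{
  int ndev = 0;
  cudaError_t cerr = cudaGetDeviceCount(&ndev);
  /* Swallow the failure: record it instead of erroring out immediately. */
  if (cerr != cudaSuccess) defaultDevice = -(int)cerr; /* e.g. -35 for cudaErrorInsufficientDriver */
}

static int get_device(int *device)
{
  /* The stored failure only matters once a device is actually requested. If this
   * check is compiled out of optimized builds (the PetscUnlikelyDebug() issue
   * described above), the negative value escapes as a "device id" and the caller
   * crashes later. */
  if (defaultDevice < 0) return -defaultDevice; /* report the saved CUDA error */
  *device = defaultDevice;
  return 0;
}

int main(void)
{
  int dev = -1;
  lazy_initialize();       /* no visible failure yet, even without a usable driver */
  return get_device(&dev); /* the saved error (e.g. 35) surfaces only here */
}
```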
The bigger question is why cudaGetDeviceCount() is returning cudaErrorInsufficientDriver. Can you compile and run

```
#include <cuda_runtime.h>

int main()
{
  int ndev;
  return cudaGetDeviceCount(&ndev);
}
```

and then show the value of "echo $?"?

I modified your code a little to get more information:

```
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  int ndev;
  int error = cudaGetDeviceCount(&ndev);
  printf("ndev %d \n", ndev);
  printf("error %d \n", error);
  return 0;
}
```

Results:

  $ ./a.out
  ndev 4
  error 0

I have not read the PETSc CUDA initialization code yet. If I had to guess at what is happening, I would naively think that PETSc did not get correct GPU information during configuration because the compile node does not have GPUs, so there was no way to get any GPU device information. At runtime on the GPU nodes, PETSc might then use the incorrect information grabbed during configuration and produce this kind of false error message.

Thanks,

Fande

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

On Jan 20, 2022, at 17:47, Matthew Knepley <knep...@gmail.com> wrote:

On Thu, Jan 20, 2022 at 6:44 PM Fande Kong <fdkong...@gmail.com> wrote:

Thanks, Jed

On Thu, Jan 20, 2022 at 4:34 PM Jed Brown <j...@jedbrown.org> wrote:

You can't create CUDA or Kokkos Vecs if you're running on a node without a GPU.

I am running the code on compute nodes that do have GPUs.

If you are actually running on GPUs, why would you need lazy initialization? It would not break with GPUs present.

Matt

With PETSc-3.16.1, I got good speedup by running GAMG on GPUs. That might be a bug in PETSc-main.
Thanks,

Fande

```
KSPSetUp              13 1.0 6.4400e-01 1.0 2.02e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  3140   64630   15 1.05e+02    5 3.49e+01 100
KSPSolve               1 1.0 1.0109e+00 1.0 3.49e+10 1.0 0.0e+00 0.0e+00 0.0e+00  0 87  0  0  0   0 87  0  0  0 34522   69556    4 4.35e-03    1 2.38e-03 100
KSPGMRESOrthog       142 1.0 1.2674e-01 1.0 1.06e+10 1.0 0.0e+00 0.0e+00 0.0e+00  0 27  0  0  0   0 27  0  0  0 83755   87801    0 0.00e+00    0 0.00e+00 100
SNESSolve              1 1.0 4.4402e+01 1.0 4.00e+10 1.0 0.0e+00 0.0e+00 0.0e+00 21100  0  0  0  21100  0  0  0   901   51365   57 1.10e+03   52 8.78e+02 100
SNESSetUp              1 1.0 3.9101e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0    0 0.00e+00    0 0.00e+00   0
SNESFunctionEval       2 1.0 1.7097e+01 1.0 1.60e+07 1.0 0.0e+00 0.0e+00 0.0e+00  8  0  0  0  0   8  0  0  0  0     1       0    0 0.00e+00    6 1.92e+02   0
SNESJacobianEval       1 1.0 1.6213e+01 1.0 2.80e+07 1.0 0.0e+00 0.0e+00 0.0e+00  8  0  0  0  0   8  0  0  0  0     2       0    0 0.00e+00    1 3.20e+01   0
SNESLineSearch         1 1.0 8.5582e+00 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 0.0e+00  4  0  0  0  0   4  0  0  0  0    14   64153    1 3.20e+01    3 9.61e+01  94
PCGAMGGraph_AGG        5 1.0 3.0509e+00 1.0 8.19e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0    27       0    5 3.49e+01    9 7.43e+01   0
PCGAMGCoarse_AGG       5 1.0 3.8711e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0       0    0 0.00e+00    0 0.00e+00   0
PCGAMGProl_AGG         5 1.0 7.0748e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0    0 0.00e+00    0 0.00e+00   0
PCGAMGPOpt_AGG         5 1.0 1.2904e+00 1.0 2.14e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1  5  0  0  0   1  5  0  0  0  1661   29807   26 7.15e+02   20 2.90e+02  99
GAMG: createProl       5 1.0 8.9489e+00 1.0 2.22e+09 1.0 0.0e+00 0.0e+00 0.0e+00  4  6  0  0  0   4  6  0  0  0   249   29666   31 7.50e+02   29 3.64e+02  96
  Graph               10 1.0 3.0478e+00 1.0 8.19e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0    27       0    5 3.49e+01    9 7.43e+01   0
  MIS/Agg              5 1.0 4.1290e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0    0 0.00e+00    0 0.00e+00   0
  SA: col data         5 1.0 1.9127e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0    0 0.00e+00    0 0.00e+00   0
  SA: frmProl0         5 1.0 6.2662e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0    0 0.00e+00    0 0.00e+00   0
  SA: smooth           5 1.0 4.9595e-01 1.0 1.21e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   244    2709   15 1.97e+02   15 2.55e+02  90
GAMG: partLevel        5 1.0 4.7330e-01 1.0 6.98e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  1475    4120    5 1.78e+02   10 2.55e+02 100
PCGAMG Squ l00         1 1.0 2.6027e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0    0 0.00e+00    0 0.00e+00   0
PCGAMG Gal l00         1 1.0 3.8406e-01 1.0 5.48e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  1426    4270    1 1.48e+02    2 2.11e+02 100
PCGAMG Opt l00         1 1.0 2.4932e-01 1.0 7.20e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   289    2653    1 6.41e+01    1 1.13e+02 100
PCGAMG Gal l01         1 1.0 6.6279e-02 1.0 1.09e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1645    3851    1 2.40e+01    2 3.64e+01 100
PCGAMG Opt l01         1 1.0 2.9544e-02 1.0 7.15e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   242    1671    1 4.84e+00    1 1.23e+01 100
PCGAMG Gal l02         1 1.0 1.8874e-02 1.0 3.72e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1974    3636    1 5.04e+00    2 6.58e+00 100
PCGAMG Opt l02         1 1.0 7.4353e-03 1.0 2.40e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   323    1457    1 7.71e-01    1 2.30e+00 100
PCGAMG Gal l03         1 1.0 2.8479e-03 1.0 4.10e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1440    2266    1 4.44e-01    2 5.51e-01 100
PCGAMG Opt l03         1 1.0 8.2684e-04 1.0 2.80e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   339    1667    1 6.72e-02    1 2.03e-01 100
PCGAMG Gal l04         1 1.0 1.2238e-03 1.0 2.09e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   170     244    1 2.05e-02    2 2.53e-02 100
PCGAMG Opt l04         1 1.0 4.1008e-04 1.0 1.77e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    43     165    1 4.49e-03    1 1.19e-02 100
PCSetUp                2 1.0 9.9632e+00 1.0 4.95e+09 1.0 0.0e+00 0.0e+00 0.0e+00  5 12  0  0  0   5 12  0  0  0   496   17826   55 1.03e+03   45 6.54e+02  98
PCSetUpOnBlocks       44 1.0 9.9087e-04 1.0 2.88e+03 1.0
```

The point of lazy initialization is to make it possible to run a solve that doesn't use a GPU in a PETSC_ARCH that supports GPUs, regardless of whether a GPU is actually present.

Fande Kong <fdkong...@gmail.com> writes:

I spoke too soon. It seems that we have trouble creating cuda/kokkos vecs now. Got Segmentation fault.

Thanks,

Fande

Program received signal SIGSEGV, Segmentation fault.
```
0x00002aaab5558b11 in Petsc::CUPMDevice<(Petsc::CUPMDeviceType)0>::CUPMDeviceInternal::initialize (this=0x1) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:54
54      PetscErrorCode CUPMDevice<T>::CUPMDeviceInternal::initialize() noexcept
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64 libX11-1.6.7-4.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libibmad-5.4.0.MLNX20190423.1d917ae-0.1.49224.x86_64 libibumad-43.1.1.MLNX20200211.078947f-0.1.49224.x86_64 libibverbs-41mlnx1-OFED.4.9.0.0.7.49224.x86_64 libmlx4-41mlnx1-OFED.4.7.3.0.3.49224.x86_64 libmlx5-41mlnx1-OFED.4.9.0.1.2.49224.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-41mlnx1-OFED.4.7.3.0.6.49224.x86_64 librxe-41mlnx1-OFED.4.4.2.4.6.49224.x86_64 libxcb-1.13-1.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 numactl-libs-2.0.12-5.el7.x86_64 systemd-libs-219-78.el7_9.3.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x00002aaab5558b11 in Petsc::CUPMDevice<(Petsc::CUPMDeviceType)0>::CUPMDeviceInternal::initialize (this=0x1) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:54
#1  0x00002aaab5558db7 in Petsc::CUPMDevice<(Petsc::CUPMDeviceType)0>::getDevice (this=this@entry=0x2aaab7f37b70 <CUDADevice>, device=0x115da00, id=-35, id@entry=-1) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/impls/cupm/cupmdevice.cxx:344
#2  0x00002aaab55577de in PetscDeviceCreate (type=type@entry=PETSC_DEVICE_CUDA, devid=devid@entry=-1, device=device@entry=0x2aaab7f37b48 <defaultDevices+8>) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/interface/device.cxx:107
#3  0x00002aaab5557b3a in PetscDeviceInitializeDefaultDevice_Internal (type=type@entry=PETSC_DEVICE_CUDA, defaultDeviceId=defaultDeviceId@entry=-1) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/interface/device.cxx:273
#4  0x00002aaab5557bf6 in PetscDeviceInitialize (type=type@entry=PETSC_DEVICE_CUDA) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/sys/objects/device/interface/device.cxx:234
#5  0x00002aaab5661fcd in VecCreate_SeqCUDA (V=0x115d150) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/impls/seq/seqcuda/veccuda.c:244
#6  0x00002aaab5649b40 in VecSetType (vec=vec@entry=0x115d150, method=method@entry=0x2aaab70b45b8 "seqcuda") at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vecreg.c:93
#7  0x00002aaab579c33f in VecCreate_CUDA (v=0x115d150) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/impls/mpi/mpicuda/mpicuda.cu:214
#8  0x00002aaab5649b40 in VecSetType (vec=vec@entry=0x115d150, method=method@entry=0x7fffffff9260 "cuda") at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vecreg.c:93
#9  0x00002aaab5648bf1 in VecSetTypeFromOptions_Private (vec=0x115d150, PetscOptionsObject=0x7fffffff9210) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vector.c:1263
#10 VecSetFromOptions (vec=0x115d150) at /home/kongf/workhome/sawtooth/moosegpu/petsc/src/vec/vec/interface/vector.c:1297
#11 0x00002aaab02ef227 in libMesh::PetscVector<double>::init (this=0x11cd1a0, n=441, n_local=441, fast=false, ptype=libMesh::PARALLEL) at /home/kongf/workhome/sawtooth/moosegpu/scripts/../libmesh/installed/include/libmesh/petsc_vector.h:693
```

On Thu, Jan 20, 2022 at 1:09 PM Fande Kong <fdkong...@gmail.com> wrote:

Thanks, Jed,

This worked!

Fande

On Wed, Jan 19, 2022 at 11:03 PM Jed Brown <j...@jedbrown.org> wrote:

Fande Kong <fdkong...@gmail.com> writes:

On Wed, Jan 19, 2022 at 11:39 AM Jacob Faibussowitsch <jacob....@gmail.com> wrote:

Are you running on login nodes or compute nodes (I can't seem to tell from the configure.log)?

I was compiling codes on login nodes, and running codes on compute nodes. Login nodes do not have GPUs, but compute nodes do have GPUs.

Just to be clear, the same thing (code, machine) with PETSc-3.16.1 worked perfectly. I have this trouble with PETSc-main.

I assume you can

  export PETSC_OPTIONS='-device_enable lazy'

and it'll work.

I think this should be the default. The main complaint is that timing the first GPU-using event isn't accurate if it includes initialization, but I think this is mostly hypothetical, because you can't trust any timing that doesn't preload in some form, and the first GPU-using event will almost always be something uninteresting, so I think it will rarely lead to confusion. Meanwhile, eager initialization is viscerally disruptive for lots of people.
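For reference, a minimal sketch of the kind of program affected here (assuming a PETSc build configured with CUDA): under `-device_enable lazy`, PetscInitialize() should not touch the GPU, and the device is only initialized when the CUDA vector type is first set; under eager initialization the same device error would presumably surface inside PetscInitialize() instead.

```c
/* Hedged sketch: the device is touched only at VecSetType() under lazy
 * initialization, which is the same path the libMesh backtrace above hits. */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return (int)ierr; /* no GPU contact yet under lazy init */
  ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, 441);CHKERRQ(ierr);
  ierr = VecSetType(x, VECCUDA);CHKERRQ(ierr); /* first GPU-using call: device initialized here */
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return (int)ierr;
}
```

Setting `export PETSC_OPTIONS='-device_enable lazy'` before running, as suggested above, exercises the lazy path without changing the code.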
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/

<configure_bad.log><configure_good.log>