The docs get the job done: marketing is happy because they say you can do whatever you want, and legal is happy because they say it may not always work. Win-win.
But I did add an array of handles, and that works.

On Fri, Jan 22, 2021 at 10:31 AM Stefano Zampini <[email protected]> wrote:

> I like NVIDIA docs. Here is the message I got from what you posted: multiple threads can call cuBLAS with the same handle, but I would not do it if I were you....
> In other words, everything is doable until you encounter an error. Very clear.
>
> On Jan 22, 2021, at 4:51 PM, Mark Adams <[email protected]> wrote:
>
>> OK, I found the problem. It is in cuBLAS. This is the code for VecNorm in VecCuda, with my print statement added:
>>
>>   cberr = cublasXnrm2(cublasv2handle,bn,xarray,one,z);CHKERRCUBLAS(cberr);
>>   /* added: */
>>   PetscScalar h_val;
>>   cudaMemcpy(&h_val, &xarray[0], sizeof(PetscScalar), cudaMemcpyDeviceToHost);
>>   PetscPrintf(PETSC_COMM_SELF,"VecNorm_SeqCUDA %d) x[0]=%g |z|=%g\n",omp_get_thread_num(),h_val,*z);
>>
>> After running a small job several times (this is not deterministic), I got a run with a different result, and the first VecNorm in an OMP loop gives:
>>
>>   VecNorm_SeqCUDA 0) x[0]=-8.38153e-08 |z|=0.
>>
>> Clearly wrong. The cuBLAS doc says:
>>
>>   2.1.3. Thread Safety
>>   <https://docs.nvidia.com/cuda/cublas/index.html#thread-safety2>
>>
>>   The library is thread safe and its functions can be called from multiple host threads, even with the same handle. When multiple threads share the same handle, extreme care needs to be taken when the handle configuration is changed because that change will affect potentially subsequent cuBLAS calls in all threads. It is even more true for the destruction of the handle. So it is not recommended that multiple thread share the same cuBLAS handle.
>>
>> There are static handles in src/sys/objects/cuda/handle.c. Do you think I should make these arrays of handles, one per OMP thread?
>>
>> If so, should I make a global #define PETSC_MAX_THREADS, assuming there is nothing like this already? (A sketch of this idea follows below.)
>>
>> Mark
>>
>> On Thu, Jan 21, 2021 at 6:37 PM Mark Adams <[email protected]> wrote:
>>
>>> This did not work. I verified that MPI_Init_thread is being called correctly and that MPI reports that it supports this highest level of thread safety.
>>>
>>> I am going to ask ORNL.
>>>
>>> And if I use
>>>
>>>   -fieldsplit_i1_ksp_norm_type none
>>>   -fieldsplit_i1_ksp_max_it 300
>>>
>>> for all 9 "i" variables, I can run normal iterations on the 10th variable in a 10-species problem, and it works perfectly with 10 threads.
>>>
>>> So it is definitely VecNorm that is not thread safe.
>>>
>>> Also, I want to call SuperLU_dist, which can use threads, but I don't want SuperLU to start using them. Is there a way to tell SuperLU that there are no threads but have PETSc use them?
>>>
>>> Thanks,
>>> Mark
>>>
>>> On Thu, Jan 21, 2021 at 5:19 PM Mark Adams <[email protected]> wrote:
>>>
>>>> OK, the problem is probably:
>>>>
>>>>   PetscMPIInt PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_FUNNELED;
>>>>
>>>> There is an example that sets:
>>>>
>>>>   PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_MULTIPLE;
>>>>
>>>> This is what I need.
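Putting the two pieces above together (requesting MPI_THREAD_MULTIPLE at startup, and giving each OMP thread its own cuBLAS handle), here is a minimal standalone sketch. PETSC_MAX_THREADS is the hypothetical #define floated above, not an existing PETSc symbol, and the handle array only stands in for whatever src/sys/objects/cuda/handle.c would actually do:

  #include <mpi.h>
  #include <omp.h>
  #include <cuda_runtime.h>
  #include <cublas_v2.h>
  #include <stdio.h>

  #define PETSC_MAX_THREADS 64                      /* hypothetical cap, as floated above */
  static cublasHandle_t handles[PETSC_MAX_THREADS]; /* one cuBLAS handle per OMP thread */

  int main(int argc, char **argv)
  {
    int    provided, n = 4, nt;
    double h_x[4] = {1.0, 2.0, 3.0, 4.0}, *d_x;

    /* request full thread support and verify that the MPI actually provides it */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) fprintf(stderr, "MPI_THREAD_MULTIPLE not provided\n");

    cudaMalloc((void**)&d_x, n * sizeof(double));
    cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);

    nt = omp_get_max_threads();
    if (nt > PETSC_MAX_THREADS) nt = PETSC_MAX_THREADS;
    for (int t = 0; t < nt; t++) cublasCreate(&handles[t]); /* serial setup, before any OMP region */

    #pragma omp parallel num_threads(nt)
    {
      double nrm;
      /* each thread touches only its own handle, so no shared-handle hazards */
      cublasDnrm2(handles[omp_get_thread_num()], n, d_x, 1, &nrm);
      printf("thread %d: |x| = %g\n", omp_get_thread_num(), nrm);
    }

    for (int t = 0; t < nt; t++) cublasDestroy(handles[t]);
    cudaFree(d_x);
    MPI_Finalize();
    return 0;
  }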
>>>> On Thu, Jan 21, 2021 at 2:26 PM Mark Adams <[email protected]> wrote:
>>>>
>>>>> On Thu, Jan 21, 2021 at 2:11 PM Matthew Knepley <[email protected]> wrote:
>>>>>
>>>>>> On Thu, Jan 21, 2021 at 2:02 PM Mark Adams <[email protected]> wrote:
>>>>>>
>>>>>>> On Thu, Jan 21, 2021 at 1:44 PM Matthew Knepley <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Thu, Jan 21, 2021 at 11:16 AM Mark Adams <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, the problem is that each KSP solver is running in an OMP thread. (So at this point it only works for SELF, and it's Landau, so that is all I need.) It looks like MPI reductions called with a comm_self are not thread safe (e.g., they could say: this is one proc, thus just copy send --> recv; but they don't).
>>>>>>>>
>>>>>>>> Instead of using SELF, how about Comm_dup() for each thread?
>>>>>>>
>>>>>>> OK, raw MPI_Comm_dup. I tried PetscCommDup. Let me try this.
>>>>>>> Thanks,
>>>>>>
>>>>>> You would have to dup them all outside the OMP section, since it is not threadsafe. Then each thread uses one, I think.
>>>>>
>>>>> Yea, sure. I do it in SetUp.
>>>>>
>>>>> Well, that worked to get *different Comms*, finally, but I still get the same problem: the iteration counts differ wildly. This is two species and two threads (13 SNES iterations; it is not deterministic). Way below is one thread (8 iterations), with fairly uniform iteration counts.
>>>>>
>>>>> Maybe this MPI is just not thread safe at all. Let me look into it. Thanks anyway,
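For concreteness, a sketch of the pattern just described: dup all the communicators in SetUp, on the main thread, and let each OMP thread drive only its own KSP. The names (SetUpThreadKSPs, subcomm, subksp) are illustrative, not PETSc API:

  #include <petscksp.h>
  #include <omp.h>

  /* Sketch: one duplicated communicator (and one KSP) per OMP thread, created
     serially so that MPI_Comm_dup never runs inside a parallel region. */
  PetscErrorCode SetUpThreadKSPs(PetscInt nthreads, MPI_Comm subcomm[], KSP subksp[])
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    for (PetscInt t = 0; t < nthreads; t++) {
      ierr = MPI_Comm_dup(MPI_COMM_SELF, &subcomm[t]);CHKERRQ(ierr); /* raw MPI dup */
      ierr = KSPCreate(subcomm[t], &subksp[t]);CHKERRQ(ierr);        /* KSP lives on its own comm */
    }
    PetscFunctionReturn(0);
  }

  /* Solve phase: each thread's reductions (VecNorm, etc.) then run on a
     communicator no other thread touches, e.g.
       #pragma omp parallel
       {
         KSPSolve(subksp[omp_get_thread_num()], b, x);
       }
  */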
>>>>>   0 SNES Function norm 4.974994975313e-03
>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x80017c60. Comms pc=0x67ad27c0 ksp=*0x7ffe1600* newcomm=0x8014b6e0
>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7ffdabc0. Comms pc=0x67ad27c0 ksp=*0x7fff70d0* newcomm=0x7ffe9980
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 282
>>>>>   1 SNES Function norm 1.836376279964e-05
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 19
>>>>>   2 SNES Function norm 3.059930074740e-07
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 15
>>>>>   3 SNES Function norm 4.744275398121e-08
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>>>>>   4 SNES Function norm 4.014828563316e-08
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 456
>>>>>   5 SNES Function norm 5.670836337808e-09
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 2
>>>>>   6 SNES Function norm 2.410421401323e-09
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 18
>>>>>   7 SNES Function norm 6.533948191791e-10
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 458
>>>>>   8 SNES Function norm 1.008133815842e-10
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 9
>>>>>   9 SNES Function norm 1.690450876038e-11
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>>>>>  10 SNES Function norm 1.336383986009e-11
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 463
>>>>>  11 SNES Function norm 1.873022410774e-12
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 113
>>>>>  12 SNES Function norm 1.801834606518e-13
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 1
>>>>>  13 SNES Function norm 1.004397317339e-13
>>>>> Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 13
>>>>>   0 SNES Function norm 4.974994975313e-03
>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e265010. Comms pc=0x56450340 ksp=0x6e2168d0 newcomm=0x6e265090
>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e25bc40. Comms pc=0x56450340 ksp=0x6e22c1d0 newcomm=0x6e21e8f0
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 282
>>>>>   1 SNES Function norm 1.836376279963e-05
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 380
>>>>>   2 SNES Function norm 3.018499983019e-07
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 387
>>>>>   3 SNES Function norm 1.826353175637e-08
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 391
>>>>>   4 SNES Function norm 1.378600599548e-09
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 392
>>>>>   5 SNES Function norm 1.077289085611e-10
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 394
>>>>>   6 SNES Function norm 8.571891727748e-12
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 395
>>>>>   7 SNES Function norm 6.897647643450e-13
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 395
>>>>>   8 SNES Function norm 5.606434614114e-14
>>>>> Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 8
>>>>>
>>>>>>>>> On Thu, Jan 21, 2021 at 10:46 AM Matthew Knepley <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On Thu, Jan 21, 2021 at 10:34 AM Mark Adams <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> It looks like PETSc is just too clever for me. I am trying to get a different MPI_Comm into each block, but PETSc is thwarting me:
>>>>>>>>>>
>>>>>>>>>> It looks like you are using SELF. Is that what you want? Do you want a bunch of comms with the same group, but independent somehow? I am confused.
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>>   if (jac->use_openmp) {
>>>>>>>>>>>     /* experimental: each split's KSP on its own (serial) communicator */
>>>>>>>>>>>     ierr = KSPCreate(MPI_COMM_SELF,&ilink->ksp);CHKERRQ(ierr);
>>>>>>>>>>>     PetscPrintf(PETSC_COMM_SELF,"In PCFieldSplitSetFields_FieldSplit with -------------- link: %p. Comms %p %p\n",ilink,PetscObjectComm((PetscObject)pc),PetscObjectComm((PetscObject)ilink->ksp));
>>>>>>>>>>>   } else {
>>>>>>>>>>>     ierr = KSPCreate(PetscObjectComm((PetscObject)pc),&ilink->ksp);CHKERRQ(ierr);
>>>>>>>>>>>   }
>>>>>>>>>>>
>>>>>>>>>>> produces:
>>>>>>>>>>>
>>>>>>>>>>>   In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e9cb4f0. Comms 0x660c6ad0 0x660c6ad0
>>>>>>>>>>>   In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e88f7d0. Comms 0x660c6ad0 0x660c6ad0
>>>>>>>>>>>
>>>>>>>>>>> That is, the pc and both new KSPs all report the same communicator. How can I work around this?
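The eventual workaround, per the exchange further up-thread, is a raw MPI_Comm_dup: KSPCreate on MPI_COMM_SELF goes through PETSc's communicator caching (PetscCommDuplicate), which hands every object on the same user communicator the same inner communicator, hence the identical pointers above. A sketch of the kind of change, against the branch shown (freeing the duped comm when the link is destroyed is omitted here):

  if (jac->use_openmp) {
    MPI_Comm threadcomm;
    /* a raw dup gives each split a genuinely distinct communicator, so the
       norm reductions of different threads can never collide */
    ierr = MPI_Comm_dup(PETSC_COMM_SELF,&threadcomm);CHKERRQ(ierr);
    ierr = KSPCreate(threadcomm,&ilink->ksp);CHKERRQ(ierr);
  } else {
    ierr = KSPCreate(PetscObjectComm((PetscObject)pc),&ilink->ksp);CHKERRQ(ierr);
  }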
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 21, 2021 at 7:41 AM Mark Adams <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 20, 2021 at 6:21 PM Barry Smith <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 20, 2021, at 3:09 PM, Mark Adams <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I put in a temporary hack to get the first Fieldsplit apply to NOT use OMP, and it sort of works.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Preonly/LU is fine. GMRES calls vector creates/dups in every solve, so that is a big problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It should definitely not be creating vectors "in every" solve. But it does do lazy allocation of needed restart vectors, which may make it look like it is creating them in every solve. You can use -ksp_gmres_preallocate to force it to create all the restart vectors up front at KSPSetUp(). (A snippet showing this follows at the end of this exchange.)
>>>>>>>>>>>>
>>>>>>>>>>>> Well, I ran the first solve w/o OMP and I see Vec dups in cuSparse Vecs in the 2nd solve.
>>>>>>>>>>>>
>>>>>>>>>>>>> Why is creating vectors "at every solve" a problem? It is not thread safe, I guess?
>>>>>>>>>>>>
>>>>>>>>>>>> It dies when it looks at the options database, in a free in the get-options method to be exact (see the stack below).
>>>>>>>>>>>>
>>>>>>>>>>>>   ======= Backtrace: =========
>>>>>>>>>>>>   /lib64/libc.so.6(cfree+0x4a0)[0x200021839be0]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscFreeAlign+0x4c)[0x2000002a368c]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscOptionsEnd_Private+0xf4)[0x2000002e53f0]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x7c6c28)[0x2000008b6c28]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreate_SeqCUDA+0x11c)[0x20000052c510]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecSetType+0x670)[0x200000549664]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreateSeqCUDA+0x150)[0x20000052c0b0]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x43c198)[0x20000052c198]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicate+0x44)[0x200000542168]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs_Default+0x148)[0x200000543820]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs+0x54)[0x2000005425f4]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(KSPCreateVecs+0x4b4)[0x2000016f0aec]
>>>>>>>>>>>>
>>>>>>>>>>>>>> Richardson works, except the convergence test gets confused, presumably because MPI reductions with PETSC_COMM_SELF are not threadsafe.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One fix for the norms might be to create each subdomain solver with a different communicator.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, you could do that. It might actually be the correct thing to do also: if you have multiple threads call MPI reductions on the same communicator, that would be a problem. Each KSP should get a new MPI_Comm.
>>>>>>>>>>>>
>>>>>>>>>>>> OK. I will only do this.
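For reference, the preallocation Barry mentions can be requested either with the runtime option -ksp_gmres_preallocate or programmatically; a minimal sketch, where ksp is whatever GMRES solver has already been created:

  /* force GMRES to create all its restart vectors at KSPSetUp() time, so no
     VecDuplicate (and no options-database traffic) happens inside the
     threaded solve region */
  ierr = KSPSetType(ksp, KSPGMRES);CHKERRQ(ierr);
  ierr = KSPGMRESSetPreAllocateVectors(ksp);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);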
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>>>   -- Norbert Wiener
>>>>>>>>>>
>>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
