The docs get the job done: marketing is happy because they say you can do whatever you want, and legal is happy because they say it may not always work. Win-win.
But I did add an array of handles, and that works.

On Fri, Jan 22, 2021 at 10:31 AM Stefano Zampini <[email protected]> wrote:

> I like NVIDIA docs. Here is the message I got from what you posted: multiple threads can call cuBLAS with the same handle, but I would not do it if I were you....
> In other words, everything is doable until you encounter an error. Very clear.
>
> On Jan 22, 2021, at 4:51 PM, Mark Adams <[email protected]> wrote:
>
>> OK, I found the problem. It is in cuBLAS. This is the code for VecNorm in VecCuda, with my print statement added:
>>
>>   cberr = cublasXnrm2(cublasv2handle,bn,xarray,one,z);CHKERRCUBLAS(cberr);
>>   /* added: */
>>   PetscScalar h_val;
>>   cudaMemcpy(&h_val, &xarray[0], sizeof(PetscScalar), cudaMemcpyDeviceToHost);
>>   PetscPrintf(PETSC_COMM_SELF,"VecNorm_SeqCUDA %d) x[0]=%g |z|=%g\n",omp_get_thread_num(),h_val,*z);
>>
>> After running a small job several times (this is not deterministic), I got a run with a different result, and the first VecNorm in an OMP loop gives:
>>
>>   VecNorm_SeqCUDA 0) x[0]=-8.38153e-08 |z|=0.
>>
>> Clearly wrong. The cuBLAS doc says:
>>
>>   2.1.3. Thread Safety
>>   <https://docs.nvidia.com/cuda/cublas/index.html#thread-safety2>
>>
>>   The library is thread safe and its functions can be called from multiple host threads, even with the same handle. When multiple threads share the same handle, extreme care needs to be taken when the handle configuration is changed because that change will affect potentially subsequent cuBLAS calls in all threads. It is even more true for the destruction of the handle. So it is not recommended that multiple thread share the same cuBLAS handle.
>>
>> There are static handles in src/sys/objects/cuda/handle.c. Do you think I should make these arrays of handles, one per OMP thread?
>>
>> If so, should I make a global #define PETSC_MAX_THREADS, assuming there is nothing like this already? (A sketch of this idea follows below.)
>>
>> Mark
>>
>> On Thu, Jan 21, 2021 at 6:37 PM Mark Adams <[email protected]> wrote:
>>
>>> This did not work. I verified that MPI_Init_thread is being called correctly and that MPI reports that it supports this highest level of thread safety.
>>>
>>> I am going to ask ORNL.
>>>
>>> And if I use
>>>
>>>   -fieldsplit_i1_ksp_norm_type none
>>>   -fieldsplit_i1_ksp_max_it 300
>>>
>>> for all 9 "i" variables, I can run normal iterations on the 10th variable in a 10-species problem, and it works perfectly with 10 threads.
>>>
>>> So it is definitely VecNorm that is not thread safe.
>>>
>>> Also, I want to call SuperLU_dist, which can use threads, but I don't want SuperLU to start using them. Is there a way to tell SuperLU that there are no threads but have PETSc use them?
>>>
>>> Thanks,
>>> Mark
>>>
>>> On Thu, Jan 21, 2021 at 5:19 PM Mark Adams <[email protected]> wrote:
>>>
>>>> OK, the problem is probably:
>>>>
>>>>   PetscMPIInt PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_FUNNELED;
>>>>
>>>> There is an example that sets:
>>>>
>>>>   PETSC_MPI_THREAD_REQUIRED = MPI_THREAD_MULTIPLE;
>>>>
>>>> This is what I need.
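Putting the two pieces above together (requesting MPI_THREAD_MULTIPLE at startup, and giving each OMP thread its own cuBLAS handle), here is a minimal standalone sketch. PETSC_MAX_THREADS is the hypothetical #define floated above, not an existing PETSc symbol, and the handle array only stands in for whatever src/sys/objects/cuda/handle.c would actually do:

  #include <mpi.h>
  #include <omp.h>
  #include <cuda_runtime.h>
  #include <cublas_v2.h>
  #include <stdio.h>

  #define PETSC_MAX_THREADS 64                      /* hypothetical cap, as floated above */
  static cublasHandle_t handles[PETSC_MAX_THREADS]; /* one cuBLAS handle per OMP thread */

  int main(int argc, char **argv)
  {
    int    provided, n = 4, nt;
    double h_x[4] = {1.0, 2.0, 3.0, 4.0}, *d_x;

    /* request full thread support and verify that the MPI actually provides it */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) fprintf(stderr, "MPI_THREAD_MULTIPLE not provided\n");

    cudaMalloc((void**)&d_x, n * sizeof(double));
    cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);

    nt = omp_get_max_threads();
    if (nt > PETSC_MAX_THREADS) nt = PETSC_MAX_THREADS;
    for (int t = 0; t < nt; t++) cublasCreate(&handles[t]); /* serial setup, before any OMP region */

    #pragma omp parallel num_threads(nt)
    {
      double nrm;
      /* each thread touches only its own handle, so no shared-handle hazards */
      cublasDnrm2(handles[omp_get_thread_num()], n, d_x, 1, &nrm);
      printf("thread %d: |x| = %g\n", omp_get_thread_num(), nrm);
    }

    for (int t = 0; t < nt; t++) cublasDestroy(handles[t]);
    cudaFree(d_x);
    MPI_Finalize();
    return 0;
  }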
>>>> On Thu, Jan 21, 2021 at 2:26 PM Mark Adams <[email protected]> wrote:
>>>>
>>>>> On Thu, Jan 21, 2021 at 2:11 PM Matthew Knepley <[email protected]> wrote:
>>>>>
>>>>>> On Thu, Jan 21, 2021 at 2:02 PM Mark Adams <[email protected]> wrote:
>>>>>>
>>>>>>> On Thu, Jan 21, 2021 at 1:44 PM Matthew Knepley <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Thu, Jan 21, 2021 at 11:16 AM Mark Adams <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, the problem is that each KSP solver is running in an OMP thread. (So at this point it only works for SELF, and it's Landau, so that is all I need.) It looks like MPI reductions called with a comm_self are not thread safe (e.g., they could say: this is one proc, thus just copy send --> recv; but they don't).
>>>>>>>>
>>>>>>>> Instead of using SELF, how about Comm_dup() for each thread?
>>>>>>>
>>>>>>> OK, raw MPI_Comm_dup. I tried PetscCommDup. Let me try this.
>>>>>>> Thanks,
>>>>>>
>>>>>> You would have to dup them all outside the OMP section, since it is not threadsafe. Then each thread uses one, I think.
>>>>>
>>>>> Yea, sure. I do it in SetUp.
>>>>>
>>>>> Well, that worked to get *different Comms*, finally, but I still get the same problem: the iteration counts differ wildly. This is two species and two threads (13 SNES iterations; it is not deterministic). Way below is one thread (8 iterations), with fairly uniform iteration counts.
>>>>>
>>>>> Maybe this MPI is just not thread safe at all. Let me look into it. Thanks anyway,
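For concreteness, a sketch of the pattern just described: dup all the communicators in SetUp, on the main thread, and let each OMP thread drive only its own KSP. The names (SetUpThreadKSPs, subcomm, subksp) are illustrative, not PETSc API:

  #include <petscksp.h>
  #include <omp.h>

  /* Sketch: one duplicated communicator (and one KSP) per OMP thread, created
     serially so that MPI_Comm_dup never runs inside a parallel region. */
  PetscErrorCode SetUpThreadKSPs(PetscInt nthreads, MPI_Comm subcomm[], KSP subksp[])
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    for (PetscInt t = 0; t < nthreads; t++) {
      ierr = MPI_Comm_dup(MPI_COMM_SELF, &subcomm[t]);CHKERRQ(ierr); /* raw MPI dup */
      ierr = KSPCreate(subcomm[t], &subksp[t]);CHKERRQ(ierr);        /* KSP lives on its own comm */
    }
    PetscFunctionReturn(0);
  }

  /* Solve phase: each thread's reductions (VecNorm, etc.) then run on a
     communicator no other thread touches, e.g.
       #pragma omp parallel
       {
         KSPSolve(subksp[omp_get_thread_num()], b, x);
       }
  */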
>>>>>   0 SNES Function norm 4.974994975313e-03
>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x80017c60. Comms pc=0x67ad27c0 ksp=*0x7ffe1600* newcomm=0x8014b6e0
>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7ffdabc0. Comms pc=0x67ad27c0 ksp=*0x7fff70d0* newcomm=0x7ffe9980
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 282
>>>>>   1 SNES Function norm 1.836376279964e-05
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 19
>>>>>   2 SNES Function norm 3.059930074740e-07
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 15
>>>>>   3 SNES Function norm 4.744275398121e-08
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>>>>>   4 SNES Function norm 4.014828563316e-08
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 456
>>>>>   5 SNES Function norm 5.670836337808e-09
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 2
>>>>>   6 SNES Function norm 2.410421401323e-09
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 18
>>>>>   7 SNES Function norm 6.533948191791e-10
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 458
>>>>>   8 SNES Function norm 1.008133815842e-10
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 9
>>>>>   9 SNES Function norm 1.690450876038e-11
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 4
>>>>>  10 SNES Function norm 1.336383986009e-11
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 463
>>>>>  11 SNES Function norm 1.873022410774e-12
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 113
>>>>>  12 SNES Function norm 1.801834606518e-13
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_ATOL iterations 1
>>>>>  13 SNES Function norm 1.004397317339e-13
>>>>> Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 13
>>>>>   0 SNES Function norm 4.974994975313e-03
>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e265010. Comms pc=0x56450340 ksp=0x6e2168d0 newcomm=0x6e265090
>>>>> In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x6e25bc40. Comms pc=0x56450340 ksp=0x6e22c1d0 newcomm=0x6e21e8f0
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 282
>>>>>   1 SNES Function norm 1.836376279963e-05
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 380
>>>>>   2 SNES Function norm 3.018499983019e-07
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 387
>>>>>   3 SNES Function norm 1.826353175637e-08
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 391
>>>>>   4 SNES Function norm 1.378600599548e-09
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 392
>>>>>   5 SNES Function norm 1.077289085611e-10
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 394
>>>>>   6 SNES Function norm 8.571891727748e-12
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 395
>>>>>   7 SNES Function norm 6.897647643450e-13
>>>>>     Linear fieldsplit_e_ solve converged due to CONVERGED_RTOL iterations 395
>>>>>   8 SNES Function norm 5.606434614114e-14
>>>>> Nonlinear solve converged due to CONVERGED_SNORM_RELATIVE iterations 8
>>>>>
>>>>>>>>> On Thu, Jan 21, 2021 at 10:46 AM Matthew Knepley <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On Thu, Jan 21, 2021 at 10:34 AM Mark Adams <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> It looks like PETSc is just too clever for me. I am trying to get a different MPI_Comm into each block, but PETSc is thwarting me:
>>>>>>>>>>
>>>>>>>>>> It looks like you are using SELF. Is that what you want? Do you want a bunch of comms with the same group, but independent somehow? I am confused.
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>>   if (jac->use_openmp) {
>>>>>>>>>>>     /* experimental: each split's KSP on its own (serial) communicator */
>>>>>>>>>>>     ierr = KSPCreate(MPI_COMM_SELF,&ilink->ksp);CHKERRQ(ierr);
>>>>>>>>>>>     PetscPrintf(PETSC_COMM_SELF,"In PCFieldSplitSetFields_FieldSplit with -------------- link: %p. Comms %p %p\n",ilink,PetscObjectComm((PetscObject)pc),PetscObjectComm((PetscObject)ilink->ksp));
>>>>>>>>>>>   } else {
>>>>>>>>>>>     ierr = KSPCreate(PetscObjectComm((PetscObject)pc),&ilink->ksp);CHKERRQ(ierr);
>>>>>>>>>>>   }
>>>>>>>>>>>
>>>>>>>>>>> produces:
>>>>>>>>>>>
>>>>>>>>>>>   In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e9cb4f0. Comms 0x660c6ad0 0x660c6ad0
>>>>>>>>>>>   In PCFieldSplitSetFields_FieldSplit with -------------- link: 0x7e88f7d0. Comms 0x660c6ad0 0x660c6ad0
>>>>>>>>>>>
>>>>>>>>>>> That is, the pc and both new KSPs all report the same communicator. How can I work around this?
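The eventual workaround, per the exchange further up-thread, is a raw MPI_Comm_dup: KSPCreate on MPI_COMM_SELF goes through PETSc's communicator caching (PetscCommDuplicate), which hands every object on the same user communicator the same inner communicator, hence the identical pointers above. A sketch of the kind of change, against the branch shown (freeing the duped comm when the link is destroyed is omitted here):

  if (jac->use_openmp) {
    MPI_Comm threadcomm;
    /* a raw dup gives each split a genuinely distinct communicator, so the
       norm reductions of different threads can never collide */
    ierr = MPI_Comm_dup(PETSC_COMM_SELF,&threadcomm);CHKERRQ(ierr);
    ierr = KSPCreate(threadcomm,&ilink->ksp);CHKERRQ(ierr);
  } else {
    ierr = KSPCreate(PetscObjectComm((PetscObject)pc),&ilink->ksp);CHKERRQ(ierr);
  }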
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 21, 2021 at 7:41 AM Mark Adams <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 20, 2021 at 6:21 PM Barry Smith <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 20, 2021, at 3:09 PM, Mark Adams <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So I put in a temporary hack to get the first Fieldsplit apply to NOT use OMP, and it sort of works.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Preonly/LU is fine. GMRES calls vector creates/dups in every solve, so that is a big problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It should definitely not be creating vectors "in every" solve. But it does do lazy allocation of needed restart vectors, which may make it look like it is creating them in every solve. You can use -ksp_gmres_preallocate to force it to create all the restart vectors up front at KSPSetUp(). (A snippet showing this follows at the end of this exchange.)
>>>>>>>>>>>>
>>>>>>>>>>>> Well, I ran the first solve w/o OMP and I see Vec dups in cuSparse Vecs in the 2nd solve.
>>>>>>>>>>>>
>>>>>>>>>>>>> Why is creating vectors "at every solve" a problem? It is not thread safe, I guess?
>>>>>>>>>>>>
>>>>>>>>>>>> It dies when it looks at the options database, in a free in the get-options method to be exact (see the stack below).
>>>>>>>>>>>>
>>>>>>>>>>>>   ======= Backtrace: =========
>>>>>>>>>>>>   /lib64/libc.so.6(cfree+0x4a0)[0x200021839be0]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscFreeAlign+0x4c)[0x2000002a368c]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(PetscOptionsEnd_Private+0xf4)[0x2000002e53f0]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x7c6c28)[0x2000008b6c28]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreate_SeqCUDA+0x11c)[0x20000052c510]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecSetType+0x670)[0x200000549664]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecCreateSeqCUDA+0x150)[0x20000052c0b0]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(+0x43c198)[0x20000052c198]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicate+0x44)[0x200000542168]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs_Default+0x148)[0x200000543820]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(VecDuplicateVecs+0x54)[0x2000005425f4]
>>>>>>>>>>>>   /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-cuda-omp/lib/libpetsc.so.3.014(KSPCreateVecs+0x4b4)[0x2000016f0aec]
>>>>>>>>>>>>
>>>>>>>>>>>>>> Richardson works, except the convergence test gets confused, presumably because MPI reductions with PETSC_COMM_SELF are not threadsafe.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One fix for the norms might be to create each subdomain solver with a different communicator.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, you could do that. It might actually be the correct thing to do also: if you have multiple threads call MPI reductions on the same communicator, that would be a problem. Each KSP should get a new MPI_Comm.
>>>>>>>>>>>>
>>>>>>>>>>>> OK. I will only do this.
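For reference, the preallocation Barry mentions can be requested either with the runtime option -ksp_gmres_preallocate or programmatically; a minimal sketch, where ksp is whatever GMRES solver has already been created:

  /* force GMRES to create all its restart vectors at KSPSetUp() time, so no
     VecDuplicate (and no options-database traffic) happens inside the
     threaded solve region */
  ierr = KSPSetType(ksp, KSPGMRES);CHKERRQ(ierr);
  ierr = KSPGMRESSetPreAllocateVectors(ksp);CHKERRQ(ierr);
  ierr = KSPSetUp(ksp);CHKERRQ(ierr);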
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>>>   -- Norbert Wiener
>>>>>>>>>>
>>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
