> On Apr 3, 2018, at 5:43 PM, Fande Kong <fdkong...@gmail.com> wrote:
>
> On Tue, Apr 3, 2018 at 9:12 AM, Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>
>> On Apr 3, 2018, at 4:58 PM, Satish Balay <ba...@mcs.anl.gov> wrote:
>>
>> On Tue, 3 Apr 2018, Kong, Fande wrote:
>>
>>> On Tue, Apr 3, 2018 at 1:17 AM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>
>>>> Each external package definitely needs its own duplicated communicator;
>>>> it cannot be shared between packages.
>>>>
>>>> The only problem with the dups below is if they are in a loop and get
>>>> called many times.
>>>
>>> The "standard test" that has this issue actually has 1K fields. MOOSE
>>> creates its own field-split preconditioner (not based on the PETSc
>>> fieldsplit), and each field is associated with one PCHYPRE. If PETSc
>>> duplicates communicators, we would easily reach the limit of 2048.
>>>
>>> I also want to confirm which extra communicators are introduced by the
>>> bad commit.
>>
>> To me it looks like there is 1 extra comm created [for MATHYPRE] for each
>> PCHYPRE that is created [which also creates one comm for this object].
>
> You're right; however, it was the same before the commit.
> I don't understand how this specific commit is related to this issue, since
> the error is not in the MPI_Comm_dup which is inside MatCreate_MATHYPRE.
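[Editor's note: the arithmetic behind hitting the limit can be sketched with a toy model. This is illustration only, not MPICH or PETSc code; "1K fields" is taken as 1024, and the two-comms-per-PCHYPRE count comes from Satish's observation above and the "0/2048 free" error reported later in the thread.]

```python
# Toy model of MPICH's per-process pool of communicator context IDs.
# Numbers come from the thread: MPICH reports 2048 context IDs per
# process, and with the new MATHYPRE type each PCHYPRE accounts for two
# duplicated communicators (one for the PC, one for the Mat).

POOL_SIZE = 2048          # "Too many communicators (0/2048 free ...)"

def comms_needed(num_fields, comms_per_pc):
    """Context IDs consumed if every field gets its own PCHYPRE."""
    return num_fields * comms_per_pc

# MOOSE's "standard test" has 1K fields (taken here as 1024), each with
# its own PCHYPRE in the MOOSE field-split preconditioner.
fields = 1024
print(comms_needed(fields, 1))  # one dup per PCHYPRE: 1024, fits in the pool
print(comms_needed(fields, 2))  # PCHYPRE + MATHYPRE: 2048, exhausts the pool

assert comms_needed(fields, 1) <= POOL_SIZE
assert comms_needed(fields, 2) >= POOL_SIZE  # -> "Too many communicators"
```

This is why one extra duplication per object, harmless in most codes, tips this particular test over the edge.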
> Actually, the error comes from MPI_Comm_create:
>
>   frame #5: 0x00000001068defd4 libmpi.12.dylib`MPI_Comm_create + 3492
>   frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>   frame #7: 0x000000010618f8ba libpetsc.3.07.dylib`hypre_GaussElimSetup(amg_data=0x00007fe7ff857a00, level=<unavailable>, relax_type=9) + 74 at par_relax.c:4209 [opt]
>   frame #8: 0x0000000106140e93 libpetsc.3.07.dylib`hypre_BoomerAMGSetup(amg_vdata=<unavailable>, A=0x00007fe80842aff0, f=0x00007fe80842a980, u=0x00007fe80842a510) + 17699 at par_amg_setup.c:2108 [opt]
>   frame #9: 0x0000000105ec773c libpetsc.3.07.dylib`PCSetUp_HYPRE(pc=<unavailable>) + 2540 at hypre.c:226 [opt]
>
> How did you perform the bisection? make clean + make all? Which version of HYPRE are you using?
>
> I did it more aggressively:
>
> "rm -rf arch-darwin-c-opt-bisect"
>
> "./configure --optionsModule=config.compilerOptions -with-debugging=no --with-shared-libraries=1 --with-mpi=1 --download-fblaslapack=1 --download-metis=1 --download-parmetis=1 --download-superlu_dist=1 --download-hypre=1 --download-mumps=1 --download-scalapack=1 PETSC_ARCH=arch-darwin-c-opt-bisect"
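[Editor's note: for readers unfamiliar with the bisection being discussed, git bisect is a binary search over history, which is why even a full clean reconfigure-and-rebuild at every step, as done above, stays tractable. A generic sketch of the search itself, not of git or of the actual PETSc session:]

```python
# Toy model of what "git bisect" automates: binary search for the first
# "bad" commit in a linear history, testing as few commits as possible.
# (Illustration only - a real session runs the actual failing test, and
# here a full clean reconfigure + rebuild was done at every step.)

def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is True.
    Assumes commits[0] is good and commits[-1] is bad."""
    lo, hi = 0, len(commits) - 1       # known-good, known-bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid                    # bug introduced at mid or earlier
        else:
            lo = mid                    # bug introduced after mid
    return commits[hi]

history = [f"commit-{i}" for i in range(100)]
# Pretend commit-42 introduced the regression.
bad_from_42 = lambda c: int(c.split("-")[1]) >= 42
print(first_bad(history, bad_from_42))  # -> commit-42
```

About log2(N) test runs suffice to pin the first bad commit out of N candidates, which is how the thread converged on a single 2016 commit.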
Good, so this removes some possible sources of errors.

> HYPRE version:
>
> self.gitcommit = 'v2.11.1-55-g2ea0e43'
> self.download  = ['git://https://github.com/LLNL/hypre','https://github.com/LLNL/hypre/archive/'+self.gitcommit+'.tar.gz']

When reconfiguring, the HYPRE version can be different too (that commit is from 11/2016, so the HYPRE version used by the PETSc configure may have been upgraded as well).

> I do not think this is caused by HYPRE.
>
> Fande,

>> But you might want to verify [by linking with an MPI trace library?]
>>
>> There are some debugging hints at
>> https://lists.mpich.org/pipermail/discuss/2012-December/000148.html
>> [wrt mpich] - which I haven't checked..
>>
>> Satish
>>
>>> Fande,
>>>
>>>> To debug the hypre/duplication issue in MOOSE I would run in the
>>>> debugger with a break point in MPI_Comm_dup() and see
>>>> who keeps calling it an unreasonable number of times. (My guess is this is
>>>> a new "feature" in hypre that they will need to fix, but only debugging
>>>> will tell.)
>>>>
>>>>    Barry
>>>>
>>>>> On Apr 2, 2018, at 7:44 PM, Balay, Satish <ba...@mcs.anl.gov> wrote:
>>>>>
>>>>> We do an MPI_Comm_dup() for objects related to external packages.
>>>>>
>>>>> Looks like we added a new mat type MATHYPRE in 3.8 that PCHYPRE is
>>>>> using.
>>>>> Previously there was one MPI_Comm_dup() per PCHYPRE - now I think there
>>>>> is one more for MATHYPRE - so more calls to MPI_Comm_dup() in 3.8 vs 3.7.
>>>>>
>>>>> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
>>>>> src/dm/impls/da/hypre/mhyp.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
>>>>> src/dm/impls/swarm/data_ex.c:  ierr = MPI_Comm_dup(comm,&d->comm);CHKERRQ(ierr);
>>>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(jac->comm_hypre));CHKERRQ(ierr);
>>>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
>>>>> src/ksp/pc/impls/hypre/hypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
>>>>> src/ksp/pc/impls/spai/ispai.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ispai->comm_spai));CHKERRQ(ierr);
>>>>> src/mat/examples/tests/ex152.c:  ierr = MPI_Comm_dup(MPI_COMM_WORLD,&comm);CHKERRQ(ierr);
>>>>> src/mat/impls/aij/mpi/mkl_cpardiso/mkl_cpardiso.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mat_mkl_cpardiso->comm_mkl_cpardiso));CHKERRQ(ierr);
>>>>> src/mat/impls/aij/mpi/mumps/mumps.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mumps->comm_mumps));CHKERRQ(ierr);
>>>>> src/mat/impls/aij/mpi/pastix/pastix.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->pastix_comm));CHKERRQ(ierr);
>>>>> src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->comm_superlu));CHKERRQ(ierr);
>>>>> src/mat/impls/hypre/mhypre.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
>>>>> src/mat/partition/impls/pmetis/pmetis.c:  ierr = MPI_Comm_dup(pcomm,&comm);CHKERRQ(ierr);
>>>>> src/sys/mpiuni/mpi.c:  MPI_COMM_SELF, MPI_COMM_WORLD, and a MPI_Comm_dup() of each of these (duplicates of duplicates return the same communicator)
>>>>> src/sys/mpiuni/mpi.c:int MPI_Comm_dup(MPI_Comm comm,MPI_Comm *out)
>>>>> src/sys/objects/pinit.c:  ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
>>>>> src/sys/objects/pinit.c:  ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
>>>>> src/sys/objects/tagm.c:  ierr = MPI_Comm_dup(comm_in,comm_out);CHKERRQ(ierr);
>>>>> src/sys/utils/mpiu.c:  ierr = MPI_Comm_dup(comm,&local_comm);CHKERRQ(ierr);
>>>>> src/ts/impls/implicit/sundials/sundials.c:  ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)ts),&(cvode->comm_sundials));CHKERRQ(ierr);
>>>>>
>>>>> Perhaps we need a PetscCommDuplicateExternalPkg() to somehow avoid
>>>>> these MPI_Comm_dup() calls?
>>>>>
>>>>> Satish
>>>>>
>>>>> On Tue, 3 Apr 2018, Smith, Barry F. wrote:
>>>>>
>>>>>> Are we sure this is a PETSc comm issue and not a hypre comm
>>>>>> duplication issue?
>>>>>>
>>>>>> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
>>>>>>
>>>>>> Looks like hypre needs to generate subcomms; perhaps it generates
>>>>>> too many?
>>>>>>
>>>>>>    Barry
>>>>>>
>>>>>>> On Apr 2, 2018, at 7:07 PM, Derek Gaston <fried...@gmail.com> wrote:
>>>>>>>
>>>>>>> I'm working with Fande on this and I would like to add a bit more.
>>>>>>> There are many circumstances where we aren't working on COMM_WORLD
>>>>>>> at all (e.g. working on a sub-communicator) but PETSc was initialized
>>>>>>> using MPI_COMM_WORLD (think multi-level solves)… and we need to create
>>>>>>> arbitrarily many PETSc vecs/mats/solvers/preconditioners and solve. We
>>>>>>> definitely can't rely on using PETSC_COMM_WORLD to avoid triggering
>>>>>>> duplication.
>>>>>>>
>>>>>>> Can you explain why PETSc needs to duplicate the communicator so much?
>>>>>>>
>>>>>>> Thanks for your help in tracking this down!
>>>>>>>
>>>>>>> Derek
>>>>>>>
>>>>>>> On Mon, Apr 2, 2018 at 5:44 PM Kong, Fande <fande.k...@inl.gov> wrote:
>>>>>>> Why do we not use user-level MPI communicators directly? What are the
>>>>>>> potential risks here?
>>>>>>>
>>>>>>> Fande,
>>>>>>>
>>>>>>> On Mon, Apr 2, 2018 at 5:08 PM, Satish Balay <ba...@mcs.anl.gov> wrote:
>>>>>>> PETSC_COMM_WORLD [via PetscCommDuplicate()] attempts to minimize calls
>>>>>>> to MPI_Comm_dup() - thus potentially avoiding such errors
>>>>>>>
>>>>>>> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html
>>>>>>>
>>>>>>> Satish
>>>>>>>
>>>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
>>>>>>>
>>>>>>>> On Mon, Apr 2, 2018 at 4:23 PM, Satish Balay <ba...@mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>>> Does this 'standard test' use MPI_COMM_WORLD to create PETSc objects?
>>>>>>>>>
>>>>>>>>> If so - you could try changing to PETSC_COMM_WORLD
>>>>>>>>
>>>>>>>> I do not think we are using PETSC_COMM_WORLD when creating PETSc
>>>>>>>> objects. Why can we not use MPI_COMM_WORLD?
>>>>>>>>
>>>>>>>> Fande,
>>>>>>>>
>>>>>>>>> Satish
>>>>>>>>>
>>>>>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I am trying to upgrade PETSc from 3.7.6 to 3.8.3 for MOOSE and its
>>>>>>>>>> applications.
>>>>>>>>>> I got an error message for a standard test:
>>>>>>>>>>
>>>>>>>>>> preconditioners/pbp.lots_of_variables: MPI had an error
>>>>>>>>>> preconditioners/pbp.lots_of_variables: ------------------------------------------------
>>>>>>>>>> preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
>>>>>>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
>>>>>>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
>>>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
>>>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
>>>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)
>>>>>>>>>>
>>>>>>>>>> I did "git bisect", and the following commit introduces this issue:
>>>>>>>>>>
>>>>>>>>>> commit 49a781f5cee36db85e8d5b951eec29f10ac13593
>>>>>>>>>> Author: Stefano Zampini <stefano.zamp...@gmail.com>
>>>>>>>>>> Date:   Sat Nov 5 20:15:19 2016 +0300
>>>>>>>>>>
>>>>>>>>>>     PCHYPRE: use internal Mat of type MatHYPRE
>>>>>>>>>>
>>>>>>>>>>     hpmat already stores two HYPRE vectors
>>>>>>>>>>
>>>>>>>>>> Before I debug line-by-line, does anyone have a clue about this?
>>>>>>>>>>
>>>>>>>>>> Fande,
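[Editor's note: the PetscCommDuplicate() mechanism Satish points to above can be illustrated with a toy model. This is a sketch of the idea only, not PETSc code: PETSc caches the duplicated communicator on the user's communicator (via an MPI attribute), so creating many PETSc objects on the same communicator costs only one MPI_Comm_dup().]

```python
# Toy model of PetscCommDuplicate()-style caching (illustration only,
# not PETSc source). The first request for a given user communicator
# pays for one duplication; every later request on the same communicator
# reuses the cached inner communicator instead of calling MPI_Comm_dup().

dup_calls = 0   # stand-in for actual MPI_Comm_dup() invocations
_cache = {}     # stand-in for the MPI attribute PETSc caches on the comm

def petsc_comm_duplicate(user_comm):
    """Return an inner comm for user_comm, duplicating at most once."""
    global dup_calls
    if user_comm not in _cache:
        dup_calls += 1                          # the only real MPI_Comm_dup()
        _cache[user_comm] = ("inner", user_comm)
    return _cache[user_comm]

# Creating 1024 objects on the same communicator costs one duplication
# instead of 1024 - which is what keeps PETSc's own objects well clear
# of MPICH's 2048 context-ID pool.
for _ in range(1024):
    petsc_comm_duplicate("MPI_COMM_WORLD")
print(dup_calls)  # -> 1
```

The external-package paths listed earlier bypass this caching (each object calls MPI_Comm_dup() directly), which is why the thread converges on those call sites as the problem.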