On Tue, Apr 3, 2018 at 9:32 AM, Satish Balay <ba...@mcs.anl.gov> wrote:
> On Tue, 3 Apr 2018, Stefano Zampini wrote:
>
> > > On Apr 3, 2018, at 4:58 PM, Satish Balay <ba...@mcs.anl.gov> wrote:
> > >
> > > On Tue, 3 Apr 2018, Kong, Fande wrote:
> > >
> > >> On Tue, Apr 3, 2018 at 1:17 AM, Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> > >>
> > >>> Each external package definitely needs its own duplicated communicator; they cannot be shared between packages.
> > >>>
> > >>> The only problem with the dups below is if they are in a loop and get called many times.
> > >>
> > >> The "standard test" that has this issue actually has 1K fields. MOOSE creates its own field-split preconditioner (not based on the PETSc fieldsplit), and each field is associated with one PCHYPRE. If PETSc duplicates communicators, we should easily reach the limit of 2048.
> > >>
> > >> I also want to confirm what extra communicators are introduced in the bad commit.
> > >
> > > To me it looks like there is 1 extra comm created [for MATHYPRE] for each PCHYPRE that is created [which also creates one comm for this object].
> >
> > You're right; however, it was the same before the commit.
> > https://bitbucket.org/petsc/petsc/commits/49a781f5cee36db85e8d5b951eec29f10ac13593
>
> Before the commit, PCHYPRE was not calling MatConvert(MATHYPRE) [this results in an additional call to MPI_Comm_dup() for hypre calls]. PCHYPRE was calling MatHYPRE_IJMatrixCreate() directly [which I presume reuses the comm created by the call to MPI_Comm_dup() in PCHYPRE - for hypre calls].
>
> > I don't understand how this specific commit is related to this issue, since the error is not in the MPI_Comm_dup which is inside MatCreate_MATHYPRE. Actually, the error comes from MPI_Comm_create
> >
> >     frame #5: 0x00000001068defd4 libmpi.12.dylib`MPI_Comm_create + 3492
> >     frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
> >     frame #7: 0x000000010618f8ba libpetsc.3.07.dylib`hypre_GaussElimSetup(amg_data=0x00007fe7ff857a00, level=<unavailable>, relax_type=9) + 74 at par_relax.c:4209 [opt]
> >     frame #8: 0x0000000106140e93 libpetsc.3.07.dylib`hypre_BoomerAMGSetup(amg_vdata=<unavailable>, A=0x00007fe80842aff0, f=0x00007fe80842a980, u=0x00007fe80842a510) + 17699 at par_amg_setup.c:2108 [opt]
> >     frame #9: 0x0000000105ec773c libpetsc.3.07.dylib`PCSetUp_HYPRE(pc=<unavailable>) + 2540 at hypre.c:226 [opt]
>
> I thought this trace comes up after applying your patch
>

This trace comes from a Mac.

> -    ierr = MatDestroy(&jac->hpmat);CHKERRQ(ierr);
> -    ierr = MatConvert(pc->pmat,MATHYPRE,MAT_INITIAL_MATRIX,&jac->hpmat);CHKERRQ(ierr);
> +    ierr = MatConvert(pc->pmat,MATHYPRE,jac->hpmat ? MAT_REUSE_MATRIX : MAT_INITIAL_MATRIX,&jac->hpmat);CHKERRQ(ierr);
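Restating the patch above in context, as I read it (only jac->hpmat and pc->pmat come from the snippet; the surrounding PCSetUp_HYPRE code and error handling are assumed, not copied from PETSc):

/* Old behaviour: destroy and re-create the MATHYPRE copy on every setup.
 * Each MAT_INITIAL_MATRIX conversion builds a fresh MATHYPRE object, which
 * duplicates a communicator for the hypre side. */
ierr = MatDestroy(&jac->hpmat);CHKERRQ(ierr);
ierr = MatConvert(pc->pmat,MATHYPRE,MAT_INITIAL_MATRIX,&jac->hpmat);CHKERRQ(ierr);

/* Patched behaviour: convert in place after the first setup, so the existing
 * MATHYPRE (and the communicator it already holds) is reused. */
ierr = MatConvert(pc->pmat,MATHYPRE,jac->hpmat ? MAT_REUSE_MATRIX : MAT_INITIAL_MATRIX,&jac->hpmat);CHKERRQ(ierr);

Either way there is at most one MATHYPRE per PC alive at a time; the difference is whether its communicator is torn down and re-duplicated on every setup or kept across setups.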
> The stack before this patch was: [it's a different format - so it was obtained in a different way than the above method?]
>
> preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
> preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
> preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
> preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)
>

This comes from a Linux box (it is a test box), and I do not have access to it.

Fande,

> Satish
>
> > How did you perform the bisection? make clean + make all? Which version of HYPRE are you using?
>
> > > But you might want to verify [by linking with an MPI trace library?]
> > >
> > > There are some debugging hints at https://lists.mpich.org/pipermail/discuss/2012-December/000148.html [wrt MPICH] - which I haven't checked..
> > >
> > > Satish
> > >
> > >> Fande,
> > >>
> > >>> To debug the hypre/duplication issue in MOOSE I would run in the debugger with a break point in MPI_Comm_dup() and see who keeps calling it an unreasonable number of times. (My guess is this is a new "feature" in hypre that they will need to fix, but only debugging will tell.)
> > >>>
> > >>> Barry
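As a concrete version of the "break point in MPI_Comm_dup()" / "MPI trace library" suggestions above, here is a minimal sketch of an interposer built on the standard MPI profiling interface (the file name and message format are made up; this is not part of PETSc or MOOSE). Linking or LD_PRELOADing it ahead of the MPI library prints a running count, which makes it easy to see which solve keeps duplicating communicators:

/* comm_dup_trace.c - hypothetical helper: intercept MPI_Comm_dup() via PMPI
 * and report every call, then forward to the real implementation. */
#include <mpi.h>
#include <stdio.h>

static int dup_count = 0;

int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
{
  int rank = 0;
  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  dup_count++;
  if (rank == 0) fprintf(stderr, "[comm_dup_trace] MPI_Comm_dup call #%d\n", dup_count);
  return PMPI_Comm_dup(comm, newcomm);   /* call the real MPI_Comm_dup */
}

The same information can be had interactively by setting a debugger breakpoint on MPI_Comm_dup and inspecting the backtrace at each hit, as Barry suggests above.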
> > >>>> On Apr 2, 2018, at 7:44 PM, Balay, Satish <ba...@mcs.anl.gov> wrote:
> > >>>>
> > >>>> We do an MPI_Comm_dup() for objects related to external packages.
> > >>>>
> > >>>> Looks like we added a new Mat type, MATHYPRE, in 3.8, which PCHYPRE is using. Previously there was one MPI_Comm_dup() for PCHYPRE - now I think there is one more for MATHYPRE - so there are more calls to MPI_Comm_dup() in 3.8 vs 3.7.
> > >>>>
> > >>>> src/dm/impls/da/hypre/mhyp.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
> > >>>> src/dm/impls/da/hypre/mhyp.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&(ex->hcomm));CHKERRQ(ierr);
> > >>>> src/dm/impls/swarm/data_ex.c: ierr = MPI_Comm_dup(comm,&d->comm);CHKERRQ(ierr);
> > >>>> src/ksp/pc/impls/hypre/hypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(jac->comm_hypre));CHKERRQ(ierr);
> > >>>> src/ksp/pc/impls/hypre/hypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
> > >>>> src/ksp/pc/impls/hypre/hypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ex->hcomm));CHKERRQ(ierr);
> > >>>> src/ksp/pc/impls/spai/ispai.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)pc),&(ispai->comm_spai));CHKERRQ(ierr);
> > >>>> src/mat/examples/tests/ex152.c: ierr = MPI_Comm_dup(MPI_COMM_WORLD, &comm);CHKERRQ(ierr);
> > >>>> src/mat/impls/aij/mpi/mkl_cpardiso/mkl_cpardiso.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mat_mkl_cpardiso->comm_mkl_cpardiso));CHKERRQ(ierr);
> > >>>> src/mat/impls/aij/mpi/mumps/mumps.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(mumps->comm_mumps));CHKERRQ(ierr);
> > >>>> src/mat/impls/aij/mpi/pastix/pastix.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->pastix_comm));CHKERRQ(ierr);
> > >>>> src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)A),&(lu->comm_superlu));CHKERRQ(ierr);
> > >>>> src/mat/impls/hypre/mhypre.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)B),&hB->comm);CHKERRQ(ierr);
> > >>>> src/mat/partition/impls/pmetis/pmetis.c: ierr = MPI_Comm_dup(pcomm,&comm);CHKERRQ(ierr);
> > >>>> src/sys/mpiuni/mpi.c: MPI_COMM_SELF, MPI_COMM_WORLD, and a MPI_Comm_dup() of each of these (duplicates of duplicates return the same communicator)
> > >>>> src/sys/mpiuni/mpi.c: int MPI_Comm_dup(MPI_Comm comm,MPI_Comm *out)
> > >>>> src/sys/objects/pinit.c: ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
> > >>>> src/sys/objects/pinit.c: ierr = MPI_Comm_dup(MPI_COMM_WORLD,&local_comm);CHKERRQ(ierr);
> > >>>> src/sys/objects/tagm.c: ierr = MPI_Comm_dup(comm_in,comm_out);CHKERRQ(ierr);
> > >>>> src/sys/utils/mpiu.c: ierr = MPI_Comm_dup(comm,&local_comm);CHKERRQ(ierr);
> > >>>> src/ts/impls/implicit/sundials/sundials.c: ierr = MPI_Comm_dup(PetscObjectComm((PetscObject)ts),&(cvode->comm_sundials));CHKERRQ(ierr);
> > >>>>
> > >>>> Perhaps we need a PetscCommDuplicateExternalPkg() to somehow avoid these MPI_Comm_dup() calls?
> > >>>>
> > >>>> Satish
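In case it helps the discussion, here is a minimal sketch of what such a helper could look like (every name below is hypothetical; none of this is existing PETSc API). The idea is the same caching trick PetscCommDuplicate() uses: stash one duplicated communicator on the incoming comm through an MPI attribute, and hand the cached duplicate back on later requests instead of calling MPI_Comm_dup() again.

/* Hypothetical sketch only, not PETSc source. */
#include <mpi.h>
#include <stdlib.h>

static int pkg_keyval = MPI_KEYVAL_INVALID;

int CommDupExternalPkgOnce(MPI_Comm comm, MPI_Comm *pkgcomm)
{
  MPI_Comm *cached;
  int       found, err;

  if (pkg_keyval == MPI_KEYVAL_INVALID) {
    err = MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN, &pkg_keyval, NULL);
    if (err) return err;
  }
  err = MPI_Comm_get_attr(comm, pkg_keyval, &cached, &found);
  if (err) return err;
  if (!found) {                                   /* first request on this comm: dup once */
    cached = (MPI_Comm *)malloc(sizeof(MPI_Comm));
    err = MPI_Comm_dup(comm, cached);
    if (err) return err;
    err = MPI_Comm_set_attr(comm, pkg_keyval, cached);
    if (err) return err;
  }
  *pkgcomm = *cached;                             /* later requests reuse the cached dup */
  return 0;
}

Given Barry's point above that packages cannot share a duplicated communicator, a real version would presumably keep one keyval (one cache) per external package, and would install a delete callback so the cached comm is freed when the outer comm goes away.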
> > >>>> On Tue, 3 Apr 2018, Smith, Barry F. wrote:
> > >>>>
> > >>>>> Are we sure this is a PETSc comm issue and not a hypre comm duplication issue?
> > >>>>>
> > >>>>> frame #6: 0x00000001061345d9 libpetsc.3.07.dylib`hypre_GenerateSubComm(comm=-1006627852, participate=<unavailable>, new_comm_ptr=<unavailable>) + 409 at gen_redcs_mat.c:531 [opt]
> > >>>>>
> > >>>>> It looks like hypre needs to generate subcomms; perhaps it generates too many?
> > >>>>>
> > >>>>> Barry
> > >>>>>
> > >>>>>> On Apr 2, 2018, at 7:07 PM, Derek Gaston <fried...@gmail.com> wrote:
> > >>>>>>
> > >>>>>> I'm working with Fande on this and I would like to add a bit more. There are many circumstances where we aren't working on COMM_WORLD at all (e.g. working on a sub-communicator) but PETSc was initialized using MPI_COMM_WORLD (think multi-level solves)… and we need to create arbitrarily many PETSc vecs/mats/solvers/preconditioners and solve. We definitely can't rely on using PETSC_COMM_WORLD to avoid triggering duplication.
> > >>>>>>
> > >>>>>> Can you explain why PETSc needs to duplicate the communicator so much?
> > >>>>>>
> > >>>>>> Thanks for your help in tracking this down!
> > >>>>>>
> > >>>>>> Derek
> > >>>>>>
> > >>>>>> On Mon, Apr 2, 2018 at 5:44 PM Kong, Fande <fande.k...@inl.gov> wrote:
> > >>>>>> Why do we not use user-level MPI communicators directly? What are the potential risks here?
> > >>>>>>
> > >>>>>> Fande,
> > >>>>>>
> > >>>>>> On Mon, Apr 2, 2018 at 5:08 PM, Satish Balay <ba...@mcs.anl.gov> wrote:
> > >>>>>> PETSC_COMM_WORLD [via PetscCommDuplicate()] attempts to minimize calls to MPI_Comm_dup() - thus potentially avoiding such errors.
> > >>>>>>
> > >>>>>> http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscCommDuplicate.html
> > >>>>>>
> > >>>>>> Satish
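To make that point concrete, here is a minimal sketch (a hypothetical standalone test, not MOOSE code) of what PetscCommDuplicate() buys as I read the manual page: objects created on the same outer communicator reuse the single inner communicator PETSc attached to it, so the number of live PETSc objects is not bounded by the MPICH context-id limit.

/* Hypothetical check: thousands of live objects, one duplicated inner comm. */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            v[3000];              /* more objects than MPICH's ~2048 context ids */
  PetscErrorCode ierr;
  PetscInt       i;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  for (i = 0; i < 3000; i++) {
    ierr = VecCreateMPI(PETSC_COMM_WORLD, 10, PETSC_DETERMINE, &v[i]);CHKERRQ(ierr);
  }
  /* all 3000 Vecs share the one inner comm PETSc duplicated for PETSC_COMM_WORLD */
  for (i = 0; i < 3000; i++) {
    ierr = VecDestroy(&v[i]);CHKERRQ(ierr);
  }
  ierr = PetscFinalize();
  return ierr;
}

The 1K-field case presumably still blows up because, per the discussion above, each PCHYPRE/MATHYPRE pair does its own raw MPI_Comm_dup() for the hypre side on top of this.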
> > >>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
> > >>>>>>
> > >>>>>>> On Mon, Apr 2, 2018 at 4:23 PM, Satish Balay <ba...@mcs.anl.gov> wrote:
> > >>>>>>>
> > >>>>>>>> Does this 'standard test' use MPI_COMM_WORLD to create PETSc objects?
> > >>>>>>>>
> > >>>>>>>> If so - you could try changing to PETSC_COMM_WORLD
> > >>>>>>>
> > >>>>>>> I do not think we are using PETSC_COMM_WORLD when creating PETSc objects. Why can we not use MPI_COMM_WORLD?
> > >>>>>>>
> > >>>>>>> Fande,
> > >>>>>>>
> > >>>>>>>> Satish
> > >>>>>>>>
> > >>>>>>>> On Mon, 2 Apr 2018, Kong, Fande wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi All,
> > >>>>>>>>>
> > >>>>>>>>> I am trying to upgrade PETSc from 3.7.6 to 3.8.3 for MOOSE and its applications. I have an error message for a standard test:
> > >>>>>>>>>
> > >>>>>>>>> preconditioners/pbp.lots_of_variables: MPI had an error
> > >>>>>>>>> preconditioners/pbp.lots_of_variables: ------------------------------------------------
> > >>>>>>>>> preconditioners/pbp.lots_of_variables: Other MPI error, error stack:
> > >>>>>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(177)..................: MPI_Comm_dup(comm=0x84000001, new_comm=0x97d1068) failed
> > >>>>>>>>> preconditioners/pbp.lots_of_variables: PMPI_Comm_dup(162)..................:
> > >>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_dup_impl(57)..............:
> > >>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Comm_copy(739).................:
> > >>>>>>>>> preconditioners/pbp.lots_of_variables: MPIR_Get_contextid_sparse_group(614): Too many communicators (0/2048 free on this process; ignore_id=0)
> > >>>>>>>>>
> > >>>>>>>>> I did "git bisect", and the following commit introduces this issue:
> > >>>>>>>>>
> > >>>>>>>>> commit 49a781f5cee36db85e8d5b951eec29f10ac13593
> > >>>>>>>>> Author: Stefano Zampini <stefano.zamp...@gmail.com>
> > >>>>>>>>> Date:   Sat Nov 5 20:15:19 2016 +0300
> > >>>>>>>>>
> > >>>>>>>>>     PCHYPRE: use internal Mat of type MatHYPRE
> > >>>>>>>>>
> > >>>>>>>>>     hpmat already stores two HYPRE vectors
> > >>>>>>>>>
> > >>>>>>>>> Before I debug line-by-line, does anyone have a clue about this?
> > >>>>>>>>>
> > >>>>>>>>> Fande,
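For anyone who wants to see the MPICH limit in isolation, here is a minimal sketch of a standalone reproducer (hypothetical test code, not from MOOSE or PETSc): duplicating a communicator repeatedly without freeing runs out of context ids at roughly 2048 per process, producing the same "Too many communicators" failure as the test above.

/* Hypothetical reproducer for the MPICH context-id exhaustion. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Comm dup;
  int      i, err;

  MPI_Init(&argc, &argv);
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
  for (i = 0; i < 5000; i++) {
    err = MPI_Comm_dup(MPI_COMM_WORLD, &dup);     /* never freed, on purpose */
    if (err != MPI_SUCCESS) {
      printf("MPI_Comm_dup failed after %d successful duplicates\n", i);
      break;
    }
  }
  MPI_Finalize();
  return 0;
}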