Thanks! I tried the patch - and this testcase doesn't hang anymore.

Satish
On Tue, 17 Apr 2018, Min Si wrote:

> Hi all,
>
> Thanks for narrowing down the problem. I checked the MPICH code and believe
> this is a bug in MPICH. I just created a PR to fix it:
> https://github.com/pmodels/mpich/pull/3097
>
> It should be merged into MPICH master branch soon.
>
> Thanks,
> Min
>
> On 2018/04/17 14:10, Eric Chamberland wrote:
> > Hi,
> >
> > are we talking about the "tag" passed to MPI_Isend for example?
> >
> > but does that mean there is something to change for any MPI call which
> > involves tags usage or is it only a PETSc "bad" tag usage?
> >
> > thanks Satish for your finding!
> >
> > Eric
> >
> > On 16/04/18 11:31 PM, Satish Balay wrote:
> >> On Tue, 13 Mar 2018, Eric Chamberland wrote:
> >>
> >>> Hi,
> >>>
> >>> each night we are testing mpich/master with our petsc-based code. I don't
> >>> know if PETSc team is doing the same thing with mpich/master? (Maybe it
> >>> is a good idea?)
> >>>
> >>> Everything was fine (except the issue
> >>> https://github.com/pmodels/mpich/issues/2892) up to commit 7b8d64debd, but
> >>> since commit mpich:a8a2b30fd21), I have a segfault on any parallel
> >>> nightly test.
> >>
> >> I attempted a bisect of the above range of commits - and narrowed down to:
> >>
> >> >>>>>>>
> >> db11d4c4a70e39a28be88ed32f00542301699e08 is the first bad commit
> >> <<<<<<<
> >>
> >> >>>>>>>>
> >> balay@asterix /home/balay/soft/build/mpich ((db11d4c4a...)|BISECTING)
> >> $ git show db11d4c4a70e39a28be88ed32f00542301699e08
> >> commit db11d4c4a70e39a28be88ed32f00542301699e08 (HEAD, refs/bisect/bad)
> >> Author: Ken Raffenetti <raffe...@mcs.anl.gov>
> >> Date: Thu Feb 15 11:37:59 2018 -0600
> >>
> >>     init: Fix tag upper limit initialization
> >>
> >>     The starting point for this value is equivalent to the usable tag bits
> >>     macro. This value should be set before device initialization,
> >>     otherwise devices will assume they have more bits than are actually
> >>     available.
> >>
> >>     Signed-off-by: Wesley Bland <wesley.bl...@intel.com>
> >>
> >> diff --git a/src/mpi/init/initthread.c b/src/mpi/init/initthread.c
> >> index cbc41f4d5..b31ae2f07 100644
> >> --- a/src/mpi/init/initthread.c
> >> +++ b/src/mpi/init/initthread.c
> >> @@ -403,7 +403,7 @@ int MPIR_Init_thread(int *argc, char ***argv, int required, int *provided)
> >>      MPIR_Process.attrs.host = MPI_PROC_NULL;
> >>      MPIR_Process.attrs.io = MPI_PROC_NULL;
> >>      MPIR_Process.attrs.lastusedcode = MPI_ERR_LASTCODE;
> >> -    MPIR_Process.attrs.tag_ub = 0;
> >> +    MPIR_Process.attrs.tag_ub = MPIR_TAG_USABLE_BITS;
> >>      MPIR_Process.attrs.universe = MPIR_UNIVERSE_SIZE_NOT_SET;
> >>      MPIR_Process.attrs.wtime_is_global = 0;
> >> @@ -531,13 +531,6 @@ int MPIR_Init_thread(int *argc, char ***argv, int required, int *provided)
> >>      MPIR_Assert(((unsigned) MPIR_Process.
> >>                   attrs.tag_ub & ((unsigned) MPIR_Process.attrs.tag_ub + 1)) == 0);
> >> -    /* Set aside tag space for tagged collectives and failure notification */
> >> -#ifdef HAVE_TAG_ERROR_BITS
> >> -    MPIR_Process.attrs.tag_ub >>= 3;
> >> -#else
> >> -    MPIR_Process.attrs.tag_ub >>= 1;
> >> -#endif
> >> -
> >>      /* Assert: tag_ub is at least the minimum asked for in the MPI spec */
> >>      MPIR_Assert(MPIR_Process.attrs.tag_ub >= 32767);
> >> <<<<<<<<<<<<<<<<<
> >>
> >> Reverting this patch gets mpich-3.3b2 working with petsc
> >>
> >> Satish
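
For anyone following the bit arithmetic in the quoted diff, here is a minimal sketch of the two checks it asserts on tag_ub and of what the removed shifts did. The helper names are made up for illustration and are not MPICH functions. No particular value of MPIR_TAG_USABLE_BITS is assumed; that value is not shown anywhere in this thread.

```c
#include <assert.h>

/* Hypothetical helper (not an MPICH function): nonzero when v has the form
 * 2^k - 1, i.e. every bit below some position is set. This is the same bit
 * test as the first MPIR_Assert in the quoted diff. */
static int is_low_bit_mask(unsigned v)
{
    return (v & (v + 1u)) == 0;
}

/* Hypothetical helper: nonzero when tag_ub reaches the minimum that the
 * second quoted assertion checks for (32767). */
static int meets_minimum(int tag_ub)
{
    return tag_ub >= 32767;
}

/* Hypothetical helper: give up `reserved` high bits of the tag range, as the
 * removed `tag_ub >>= 3` and `tag_ub >>= 1` lines did. */
static int reserve_tag_bits(int tag_ub, int reserved)
{
    return tag_ub >> reserved;
}
```

With an illustrative starting value of 0x7fffffff (an assumption, not MPICH's actual macro value), reserve_tag_bits(0x7fffffff, 3) gives 0x0fffffff, which still passes both checks. A value of 0, the old initializer, passes the mask test but not the minimum test.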