FWIW: those tests hang for me with TCP (I don't have openib on my cluster). I'll check it with your change as well.
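(For the record, forcing the TCP path explicitly would be the equivalent of running with "mpirun --mca btl tcp,self ..."; that is illustrative, not my exact command line.)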
> On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Ralph,
> 
> this looks like another hang :-(
> 
> I ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8 cores per socket) with infiniband,
> and I always observe the same hang at the same place.
> 
> Surprisingly, I do not get any hang if I blacklist the openib btl (i.e. run with mpirun --mca btl ^openib ...).
> 
> The patch below can be used to avoid the hang with infiniband, or for debugging purposes.
> The hang occurs in communicator 6, and if I skip the tests on communicator 2, no hang happens.
> 
> The hang occurs on an intercomm:
> task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm
> task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm
> task 0 MPI_Issend's to task 1, task 1 MPI_Irecv's from task 0, and then both hang in MPI_Wait()
> (a minimal standalone sketch of this pattern is appended at the end of this mail).
> 
> Surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling that the hang only occurs with the openib btl,
> since vader should be used here.
> 
> diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
> index 8b26f84..b9a704b 100644
> --- a/intel_tests/src/MPI_Issend_rtoa_c.c
> +++ b/intel_tests/src/MPI_Issend_rtoa_c.c
> @@ -173,8 +177,9 @@ int main(int argc, char *argv[])
>  
>      for (comm_count = 0; comm_count < MPITEST_num_comm_sizes(); comm_count++) {
>          comm_index = MPITEST_get_comm_index(comm_count);
>          comm_type = MPITEST_get_comm_type(comm_count);
> +        if (2 == comm_count) continue;
>  
>          /*
> @@ -312,6 +330,9 @@ int main(int argc, char *argv[])
>           * left sub-communicator
>           */
>  
> +        if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
> +            /* insert a breakpoint here */
> +        }
>          * Reset a bunch of variables that will be set when we get our
> 
> As a side note, which is very unlikely to be related to this issue, I noticed the following program works fine,
> though it would be reasonable to expect a hang.
> The root cause is that MPI_Send uses the eager protocol, and though the communicators used by MPI_Send and MPI_Recv
> are different, they have the same (recycled) CID.
> FWIW, the test also completes with MPICH.
> 
> If not already done, should we provide an option not to recycle CIDs?
> Or flush unexpected/unmatched messages when a communicator is freed?
> 
> Cheers,
> 
> Gilles
> 
> #include <stdio.h>
> #include <mpi.h>
> 
> /* send a message (eager mode) in a communicator, and then
>  * receive it in another communicator, but with the same (recycled) CID
>  */
> int main(int argc, char *argv[]) {
>     int rank, size;
>     int b;
>     MPI_Comm comm;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
> 
>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>     if (0 == rank) {
>         b = 0x55555555;
>         MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
>     }
>     MPI_Comm_free(&comm);
> 
>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>     if (1 == rank) {
>         b = 0xAAAAAAAA;
>         MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
>         if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
>     }
>     MPI_Comm_free(&comm);
> 
>     MPI_Finalize();
> 
>     return 0;
> }
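> For reference, the hanging pattern described above boils down to something like the following minimal sketch.
> This is not the actual Intel test code; the intercomm construction here is only an illustration of the
> communicator layout (group A = even world ranks, group B = odd world ranks):
> 
> #include <stdio.h>
> #include <mpi.h>
> 
> /* sketch: rank 0 of group A MPI_Issend's to rank 0 of group B over an
>  * intercomm, and both sides then block in MPI_Wait(), which is where
>  * the reported hang shows up with the openib btl
>  */
> int main(int argc, char *argv[]) {
>     int rank, size, b = 42;
>     MPI_Comm intra, inter;
>     MPI_Request req;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
> 
>     /* split even/odd world ranks into two groups and bridge them:
>      * world rank 0 becomes rank 0 of group A,
>      * world rank 1 becomes rank 0 of group B */
>     MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &intra);
>     MPI_Intercomm_create(intra, 0, MPI_COMM_WORLD, 1 - (rank % 2), 0, &inter);
> 
>     if (0 == rank) {
>         MPI_Issend(&b, 1, MPI_INT, 0, 0, inter, &req);
>         MPI_Wait(&req, MPI_STATUS_IGNORE);    /* hangs here */
>     } else if (1 == rank) {
>         MPI_Irecv(&b, 1, MPI_INT, 0, 0, inter, &req);
>         MPI_Wait(&req, MPI_STATUS_IGNORE);    /* hangs here */
>     }
> 
>     MPI_Comm_free(&inter);
>     MPI_Comm_free(&intra);
>     MPI_Finalize();
> 
>     return 0;
> }
> 
> Keep in mind the hang also seems to depend on the earlier communicator history (hence the comm_count == 2
> skip in the patch above), so this sketch on its own may well not reproduce it.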
> 
> On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
>> OK, will double-check tomorrow that this was the very same hang I fixed earlier.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Monday, September 5, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> I was just looking at the overnight MTT report, and these were present going back a long ways in both branches. They are in the Intel test suite.
>> 
>> If you have already addressed them, then thanks!
>> 
>> > On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>> > 
>> > Ralph,
>> > 
>> > I fixed a hang earlier today in master, and the PR for v2.x is at
>> > https://github.com/open-mpi/ompi-release/pull/1368
>> > 
>> > Can you please make sure you are running the latest master?
>> > 
>> > Which test suite do these tests come from?
>> > I will have a look tomorrow if the hang is still there.
>> > 
>> > Cheers,
>> > 
>> > Gilles
>> > 
>> > r...@open-mpi.org wrote:
>> >> Hey folks
>> >> 
>> >> All of the tests that involve either ISsend_ator, SSend_ator, ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does anyone know what these tests do, and why we never seem to pass them?
>> >> 
>> >> Do we care?
>> >> Ralph
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel