FWIW: those tests hang for me with TCP (I don’t have openib on my cluster). 
I’ll check it with your change as well


> On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Ralph,
> 
> 
> this looks like another hang :-(
> 
> 
> I ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8 cores per socket) with InfiniBand, and I always observe the same hang at the same place.
> 
> 
> Surprisingly, I do not get any hang if I blacklist the openib btl.
> 
> 
> The patch below can be used to avoid the hang with InfiniBand, or for debugging purposes.
> 
> The hang occurs in communicator 6, and if I skip the tests on communicator 2, no hang happens.
> 
> The hang occurs on an intercomm:
> task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm,
> task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm.
> Task 0 does an MPI_Issend to task 1, task 1 does an MPI_Irecv from task 0, and then both hang in MPI_Wait().
> 
> Surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling that the hang only occurs with the openib btl, since vader should be used here.
> 
> 
> diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
> index 8b26f84..b9a704b 100644
> --- a/intel_tests/src/MPI_Issend_rtoa_c.c
> +++ b/intel_tests/src/MPI_Issend_rtoa_c.c
> @@ -173,8 +177,9 @@ int main(int argc, char *argv[])
>  
>      for (comm_count = 0; comm_count < MPITEST_num_comm_sizes();
>           comm_count++) {
>          comm_index = MPITEST_get_comm_index(comm_count);
>          comm_type = MPITEST_get_comm_type(comm_count);
> +        if (2 == comm_count) continue;
>  
>          /*
> @@ -312,6 +330,9 @@ int main(int argc, char *argv[])
>                       * left sub-communicator
>                       */
>  
> +                    if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
> +                        /* insert a breakpoint here */
> +                    }
>           * Reset a bunch of variables that will be set when we get our
> 
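> For reference, here is a minimal standalone sketch of the communication pattern that hangs (a hypothetical reconstruction of what the test does at that point, not an extract from it; the even/odd group split and the leader ranks below are my own assumptions):
> 
> #include <mpi.h>
> 
> /* world rank 0 is rank 0 in group A of an intercomm, world rank 1 is rank 0
>  * in group B; rank 0 MPI_Issend's to the remote rank 0, rank 1 MPI_Irecv's
>  * from the remote rank 0, and both block in MPI_Wait()
>  */
> int main(int argc, char *argv[]) {
>     int rank, size, b = 0;
>     MPI_Comm local, inter;
>     MPI_Request req;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
> 
>     /* split even/odd world ranks into two local groups */
>     MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &local);
>     /* local leader is local rank 0; the remote leader is world rank 1 for
>      * the even group and world rank 0 for the odd group */
>     MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, 1 - (rank % 2), 0, &inter);
> 
>     if (0 == rank) {
>         b = 0x55555555;
>         MPI_Issend(&b, 1, MPI_INT, 0, 0, inter, &req);
>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>     } else if (1 == rank) {
>         MPI_Irecv(&b, 1, MPI_INT, 0, 0, inter, &req);
>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>     }
> 
>     MPI_Comm_free(&inter);
>     MPI_Comm_free(&local);
>     MPI_Finalize();
> 
>     return 0;
> }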
> 
> As a side note, which is very unlikely to be related to this issue, I noticed that the following program works fine, though it would be reasonable to expect a hang.
> The root cause is that MPI_Send uses the eager protocol, and though the communicators used by MPI_Send and MPI_Recv are different, they have the same (recycled) CID.
> 
> FWIW, the test also completes with mpich.
> 
> If not already done, should we provide an option not to recycle CIDs?
> Or flush unexpected/unmatched messages when a communicator is freed?
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> #include <stdio.h>
> #include <mpi.h>
> 
> /* send a message (eager mode) in a communicator, and then
>  * receive it in another communicator, but with the same CID
>  */
> int main(int argc, char *argv[]) {
>     int rank, size;
>     int b;
>     MPI_Comm comm;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
> 
>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>     if (0 == rank) {
>         b = 0x55555555;
>         MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
>     }
>     MPI_Comm_free(&comm);
> 
>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>     if (1 == rank) {
>         b = 0xAAAAAAAA;
>         MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
>         if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
>     }
>     MPI_Comm_free(&comm);
> 
>     MPI_Finalize();
> 
>     return 0;
> }
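> 
> (A typical way to build and run the reproducer above; the file name is just an example:)
> 
> mpicc cid_recycle.c -o cid_recycle
> mpirun -np 2 ./cid_recycle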
> 
> 
> On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
>> OK, I will double-check tomorrow that this was the very same hang I fixed earlier.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Monday, September 5, 2016, r...@open-mpi.org wrote:
>> I was just looking at the overnight MTT report, and these were present going back a long way in both branches. They are in the Intel test suite.
>> 
>> If you have already addressed them, then thanks!
>> 
>> > On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>> >
>> > Ralph,
>> >
>> > I fixed a hang earlier today in master, and the PR for v2.x is at
>> > https://github.com/open-mpi/ompi-release/pull/1368
>> >
>> > Can you please make sure you are running the latest master?
>> >
>> > Which test suite do these tests come from?
>> > I will have a look tomorrow to see if the hang is still there.
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > r...@open-mpi.org wrote:
>> >> Hey folks
>> >>
>> >> All of the tests that involve ISsend_ator, SSend_ator, ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does anyone know what these tests do, and why we never seem to pass them?
>> >>
>> >> Do we care?
>> >> Ralph
>> >>

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
