I can make MPI_Issend_rtoa deadlock with vader and sm.
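For reference, forcing one of those BTLs with the usual MCA parameter (task count and binary name are placeholders):

  mpirun -np 2 --mca btl self,vader ./MPI_Issend_rtoa_c
  mpirun -np 2 --mca btl self,sm    ./MPI_Issend_rtoa_c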

  George.


On Tue, Sep 6, 2016 at 12:06 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> FWIW: those tests hang for me with TCP (I don’t have openib on my
> cluster). I’ll check it with your change as well.
>
>
> On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph,
>
>
> this looks like another hang :-(
>
>
> i ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8 cores
> per socket) with infiniband,
>
> and i always observe the same hang at the same place.
>
>
> surprisingly, i do not get any hang if i blacklist the openib btl
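> e.g. with something along the lines of (hostfile and mapping options omitted, binary name as in the intel test suite):
>
> mpirun -np 32 --mca btl ^openib ./MPI_Issend_rtoa_c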
>
>
> the patch below can be used to avoid the hang with infiniband or for
> debugging purposes
>
> the hang occurs in communicator 6, and if i skip tests on communicator 2,
> no hang happens.
>
> the hang occurs on an intercomm:
>
> task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm
>
> task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm
>
> task 0 MPI_Issend to task 1, and task 1 MPI_Irecv from task 0, and then
> both hang in MPI_Wait()
>
> surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling
> that the hang only occurs with the openib btl,
>
> since vader should be used here.
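> the communication pattern boils down to something like the standalone sketch below (illustration only, not the actual intel test code: split MPI_COMM_WORLD into two groups, create an intercomm whose leaders are tasks 0 and 1, then send one message from one leader to the other with MPI_Issend/MPI_Irecv and wait):
>
> #include <stdio.h>
> #include <mpi.h>
>
> /* illustration only: tasks 0 and 1 of MPI_COMM_WORLD become rank 0
>  * of group A and rank 0 of group B of an intercommunicator,
>  * then one leader sends a single message to the other and both wait
>  */
> int main(int argc, char *argv[]) {
>     int rank, size, b = 0;
>     MPI_Comm split, inter;
>     MPI_Request req;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
>
>     /* even tasks in one group, odd tasks in the other,
>      * so tasks 0 and 1 are the leaders (rank 0) of their groups */
>     MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split);
>     MPI_Intercomm_create(split, 0, MPI_COMM_WORLD, 1 - rank % 2, 0, &inter);
>
>     if (0 == rank) {
>         b = 0x55555555;
>         /* send to rank 0 of the remote group (task 1 of MPI_COMM_WORLD) */
>         MPI_Issend(&b, 1, MPI_INT, 0, 0, inter, &req);
>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>     } else if (1 == rank) {
>         /* receive from rank 0 of the remote group (task 0 of MPI_COMM_WORLD) */
>         MPI_Irecv(&b, 1, MPI_INT, 0, 0, inter, &req);
>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>     }
>
>     MPI_Comm_free(&inter);
>     MPI_Comm_free(&split);
>     MPI_Finalize();
>
>     return 0;
> }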
>
>
> diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
> index 8b26f84..b9a704b 100644
> --- a/intel_tests/src/MPI_Issend_rtoa_c.c
> +++ b/intel_tests/src/MPI_Issend_rtoa_c.c
> @@ -173,8 +177,9 @@ int main(int argc, char *argv[])
>
>      for (comm_count = 0; comm_count < MPITEST_num_comm_sizes();
>           comm_count++) {
>          comm_index = MPITEST_get_comm_index(comm_count);
>          comm_type = MPITEST_get_comm_type(comm_count);
> +        if (2 == comm_count) continue;
>
>          /*
> @@ -312,6 +330,9 @@ int main(int argc, char *argv[])
>                       * left sub-communicator
>                       */
>
> +                    if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
> +                        /* insert a breakpoint here */
> +                    }
>           * Reset a bunch of variables that will be set when we get our
>
>
>
> as a side note, which is very unlikely to be related to this issue, i noticed
> the following program works fine,
>
> though it is reasonable to expect a hang.
>
> the root cause is that MPI_Send uses the eager protocol, and though the
> communicators used by MPI_Send and MPI_Recv
>
> are different, they have the same (recycled) CID.
>
> fwiw, the test also completes with mpich.
>
>
> if not already done, should we provide an option not to recycle CIDs ?
>
> or flush unexpected/unmatched messages when a communicator is freed ?
>
>
> Cheers,
>
>
> Gilles
>
>
> #include <stdio.h>
> #include <mpi.h>
>
> /* send a message (eager mode) in a communicator, and then
>  * receive it in another communicator, but with the same CID
>  */
> int main(int argc, char *argv[]) {
>     int rank, size;
>     int b;
>     MPI_Comm comm;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
>
>     /* first communicator: the eager send below is never matched
>      * before the communicator is freed */
>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>     if (0 == rank) {
>         b = 0x55555555;
>         MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
>     }
>     MPI_Comm_free(&comm);
>
>     /* second communicator: it gets the same (recycled) CID, so the
>      * receive matches the message sent on the previous communicator */
>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>     if (1 == rank) {
>         b = 0xAAAAAAAA;
>         MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
>         if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
>     }
>     MPI_Comm_free(&comm);
>
>     MPI_Finalize();
>
>     return 0;
> }
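>
> (the reproducer can be built and run with e.g. mpicc cid_recycle.c -o cid_recycle && mpirun -np 2 ./cid_recycle, the file name being arbitrary)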
>
>
> On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
>
> ok, will double check tomorrow that this was the very same hang i fixed
> earlier
>
> Cheers,
>
> Gilles
>
> On Monday, September 5, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>
>> I was just looking at the overnight MTT report, and these were present
>> going back a long way in both branches. They are in the Intel test suite.
>>
>> If you have already addressed them, then thanks!
>>
>> > On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>> >
>> > Ralph,
>> >
>> > I fixed a hang earlier today in master, and the PR for v2.x is at https://github.com/open-mpi/ompi-release/pull/1368
>> >
>> > Can you please make sure you are running the latest master ?
>> >
>> > Which testsuite do these tests come from ?
>> > I will have a look tomorrow and see if the hang is still there
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > r...@open-mpi.org wrote:
>> >> Hey folks
>> >>
>> >> All of the tests that involve either ISsend_ator, SSend_ator,
>> >> ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does anyone know
>> >> what these tests do, and why we never seem to pass them?
>> >>
>> >> Do we care?
>> >> Ralph
>> >>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
