Hi folks,
I cloned a fresh copy of OMPI master this morning at ~8:30am EDT and rebuilt. I'm running a very simple test code on three CentOS 7.[56] nodes named microway[123] over TCP. I'm seeing a fatal error similar to the following:
[microway3.totalviewtech.com:227713] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
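For context, I'm not attaching the test source, so the sketch below is a reconstruction of what tx_basic_mpi does based on its output (exact details may differ): each rank looks up its hostname, the non-zero ranks send theirs to rank 0, and rank 0 prints a greeting per peer.

    /* Hypothetical reconstruction of tx_basic_mpi (real source not attached):
     * rank 0 prints a banner, echoes its hostname when MESSAGE=name is set,
     * then receives and prints each peer's hostname. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char host[256] = "";

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        gethostname(host, sizeof(host));

        if (rank == 0) {
            const char *msg = getenv("MESSAGE");  /* forwarded by -x MESSAGE=name */
            printf("tx_basic_mpi\n");
            printf("Hello from proc (0)\n");
            if (msg != NULL && strcmp(msg, "name") == 0) {
                printf("MESSAGE: %s\n", host);
            }
            /* Collect and print each peer's hostname; this send/recv exchange
             * is the first cross-node traffic in the test. */
            for (int i = 1; i < size; i++) {
                char peer[256];
                MPI_Recv(peer, sizeof(peer), MPI_CHAR, i, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("Hello from proc (%d): %s\n", i, peer);
            }
            printf("All Done!\n");
        } else {
            MPI_Send(host, sizeof(host), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

That first cross-node point-to-point exchange is where the failure fires, which matches the pml_ob1_sendreq.c location in the error above.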
The case of prun launching an OMPI code does not work correctly. The MPI processes seem to launch OK, but the following OMPI error occurs at the point where the processes communicate. In the case below, I have a DVM running on three nodes, microway[123]:
mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name --personality ompi ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway3.totalviewtech.com
Hello from proc (1): microway1
Hello from proc (2): microway2.totalviewtech.com
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
  Local host: microway1
  PID:        282716
  Message:    connect() to 10.71.2.58:1024 failed
  Error:      No route to host (113)
--------------------------------------------------------------------------
[microway1:282716] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
  Local host: microway3
  Local PID:  214271
  Peer host:  microway1
--------------------------------------------------------------------------
mic:/amd/home/jdelsign/PMIx>
If I use mpirun to launch the program, it works whether or not a DVM is already running (first without a DVM, then with a DVM):
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name --personality ompi --hostfile myhostfile ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway2.totalviewtech.com
Hello from proc (2): microway3.totalviewtech.com
All Done!
mic:/amd/home/jdelsign/PMIx>
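(I haven't quoted myhostfile itself; it lists all three nodes in the same format as the myhostfile2 file shown further down, i.e. something like

    microway1 slots=16
    microway2 slots=16
    microway3 slots=16

with the slots count being an assumption here.)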
mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile --daemonize
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name --personality ompi --hostfile myhostfile ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway1
Hello from proc (1): microway2.totalviewtech.com
Hello from proc (2): microway3.totalviewtech.com
All Done!
mic:/amd/home/jdelsign/PMIx>
But if I use mpirun on microway3 to launch 3 processes with a hostfile that contains only microway[23], I get a failure similar to the prun case:
mic:/amd/home/jdelsign/PMIx>hostname
microway3.totalviewtech.com
mic:/amd/home/jdelsign/PMIx>cat myhostfile2
microway2 slots=16
microway3 slots=16
mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name --personality ompi --hostfile myhostfile2 ./tx_basic_mpi
tx_basic_mpi
Hello from proc (0)
MESSAGE: microway2.totalviewtech.com
Hello from proc (1): microway2.totalviewtech.com
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
  Local host: microway3
  PID:        227713
  Message:    connect() to 10.71.2.58:1024 failed
  Error:      No route to host (113)
--------------------------------------------------------------------------
[microway3.totalviewtech.com:227713] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
[microway2][[32270,1],1][../../../../../../ompi/opal/mca/btl/tcp/btl_tcp.c:566:mca_btl_tcp_recv_blocking] recv(13) failed: Connection reset by peer (104)
mic:/amd/home/jdelsign/PMIx>
I asked my therapist (Ralph) about it, and he said,
"It looks to me like the btl/tcp component is having trouble correctly selecting a route to use when opening communications across hosts. I've seen this in my docker setup too, but thought perhaps it was just a docker-related issue.
What's weird in your last example is that both procs are on the same node, and therefore they should only be using shared memory to communicate - the btl/tcp component shouldn't be trying to create a connection at all."
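If anyone wants to poke at that theory, two quick experiments come to mind. The 10.71.2.0/24 subnet below is just a guess based on the 10.71.2.58 address in the errors; substitute whatever subnet the nodes actually share. First, pin btl/tcp to a single subnet instead of letting it pick an interface:

    mpirun -n 3 --map-by node --mca btl_tcp_if_include 10.71.2.0/24 --hostfile myhostfile2 ./tx_basic_mpi

Second, disable btl/tcp entirely so same-node procs can only use shared memory (vader is the shared-memory BTL):

    mpirun -n 2 --mca btl self,vader ./tx_basic_mpi

If the first run succeeds and the second never attempts a TCP connection, that would line up with the interface-selection suspicion.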
Cheers, John D.