John,

The common denominator across all of these errors is a connect() failure while trying to reach 10.71.2.58 on port 1024. Which host is 10.71.2.58? Is the firewall open? Are connections to port 1024 allowed?
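If it would help narrow that down, a minimal connect() probe like the sketch below should reproduce the same errno outside of Open MPI and tell you quickly whether this is a firewall or routing problem rather than an OMPI bug. The address and port are just the ones from your error output; everything else here is purely illustrative. Build with "cc -o connect_probe connect_probe.c" and run it from each node:

/* Minimal TCP connect() probe; illustrative sketch only.
 * The peer address and port come from the Open MPI warnings below.
 */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in peer;

    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(1024);                 /* port from the OMPI warning */
    inet_pton(AF_INET, "10.71.2.58", &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        /* "No route to host" is errno 113 (EHOSTUNREACH) on Linux; a host
         * firewall that REJECTs with icmp-host-prohibited produces it too. */
        printf("connect() to 10.71.2.58:1024 failed: %s (%d)\n",
               strerror(errno), errno);
    } else {
        printf("connect() to 10.71.2.58:1024 succeeded\n");
    }
    close(fd);
    return 0;
}

If the probe fails with errno 113 from microway1 and microway3 as well, the problem is in the network or firewall configuration rather than in the btl/tcp code.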
George.

On Mon, May 4, 2020 at 11:36 AM John DelSignore via devel <devel@lists.open-mpi.org> wrote:

> Inline below...
>
> On 2020-05-04 11:09, Ralph Castain via devel wrote:
>
> Staring at this some more, I do have the following questions:
>
> * in your first case, it looks like "prte" was started from microway3 - correct?
>
> Yes, "prte" was started from microway3.
>
> * in the second case, that worked, it looks like "mpirun" was executed from microway1 - correct?
>
> No, "mpirun" was executed from microway3.
>
> * in the third case, you state that "mpirun" was again executed from microway3, and the process output confirms that
>
> Yes, "mpirun" was started from microway3.
>
> I'm wondering if the issue here might actually be that PRRTE expects the ordering of hosts in the hostfile to start with the host it is sitting on - i.e., if the node index number between the various daemons is getting confused. Can you perhaps see what happens with the failing cases if you put microway3 at the top of the hostfile and execute prte/mpirun from microway3 as before?
>
> OK, the first failing case:
>
> mic:/amd/home/jdelsign/PMIx>pterm
> pterm failed to initialize, likely due to no DVM being available
> mic:/amd/home/jdelsign/PMIx>cat myhostfile3
> microway3 slots=16
> microway1 slots=16
> microway2 slots=16
> mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile3 --daemonize
> mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name --personality ompi ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway3.totalviewtech.com
> Hello from proc (1): microway1
> Hello from proc (2): microway2.totalviewtech.com
> --------------------------------------------------------------------------
> WARNING: Open MPI failed to TCP connect to a peer MPI process. This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
> Local host: microway1
> PID: 292266
> Message: connect() to 10.71.2.58:1024 failed
> Error: No route to host (113)
> --------------------------------------------------------------------------
> [microway1:292266] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
> mic:/amd/home/jdelsign/PMIx>hostname
> microway3.totalviewtech.com
> mic:/amd/home/jdelsign/PMIx>
>
> And the second failing test case:
>
> mic:/amd/home/jdelsign/PMIx>pterm
> pterm failed to initialize, likely due to no DVM being available
> mic:/amd/home/jdelsign/PMIx>cat myhostfile3+2
> microway3 slots=16
> microway2 slots=16
> mic:/amd/home/jdelsign/PMIx>
> mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name --personality ompi --hostfile myhostfile3+2 ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway3.totalviewtech.com
> Hello from proc (1): microway3.totalviewtech.com
> --------------------------------------------------------------------------
> WARNING: Open MPI failed to TCP connect to a peer MPI process. This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
> Local host: microway3
> PID: 271144
> Message: connect() to 10.71.2.58:1024 failed
> Error: No route to host (113)
> --------------------------------------------------------------------------
> [microway3.totalviewtech.com:271144] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
> Hello from proc (2): microway2.totalviewtech.com
> mic:/amd/home/jdelsign/PMIx>
>
> So, AFAICT, host name order didn't matter.
>
> Cheers, John D.
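Since the hostfile order made no difference, it would also be worth pinning down which machine actually owns 10.71.2.58. The getifaddrs() sketch below, run on each microway node, lists every IPv4 address the node has configured; if 10.71.2.58 turns out to be a secondary or virtual interface on one node that the others cannot route to, that would explain the "No route to host". This is an illustrative sketch only (plain "ip addr" on each node gives the same information):

/* List all local IPv4 addresses so we can see which node, if any,
 * owns 10.71.2.58. Illustrative sketch; build with: cc -o list_addrs list_addrs.c
 */
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    struct ifaddrs *ifa_list, *ifa;
    char buf[INET_ADDRSTRLEN];

    if (getifaddrs(&ifa_list) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifa_list; ifa != NULL; ifa = ifa->ifa_next) {
        /* Skip entries with no address and anything that is not IPv4. */
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;
        struct sockaddr_in *sin = (struct sockaddr_in *)ifa->ifa_addr;
        inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf));
        printf("%-12s %s\n", ifa->ifa_name, buf);
    }
    freeifaddrs(ifa_list);
    return 0;
}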
> On May 4, 2020, at 7:34 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:
>
> Hi folks,
>
> I cloned a fresh copy of OMPI master this morning at ~8:30am EDT and rebuilt. I'm running a very simple test code on three CentOS 7.[56] nodes named microway[123] over TCP. I'm seeing a fatal error similar to the following:
>
> [microway3.totalviewtech.com:227713] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
>
> The case of prun launching an OMPI code does not work correctly. The MPI processes seem to launch OK, but there is the following OMPI error at the point where the processes communicate. In the following case, I have a DVM running on three nodes "microway[123]":
>
> mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name --personality ompi ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway3.totalviewtech.com
> Hello from proc (1): microway1
> Hello from proc (2): microway2.totalviewtech.com
> --------------------------------------------------------------------------
> WARNING: Open MPI failed to TCP connect to a peer MPI process. This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
> Local host: microway1
> PID: 282716
> Message: connect() to 10.71.2.58:1024 failed
> Error: No route to host (113)
> --------------------------------------------------------------------------
> [microway1:282716] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
> --------------------------------------------------------------------------
> An MPI communication peer process has unexpectedly disconnected. This
> usually indicates a failure in the peer process (e.g., a crash or
> otherwise exiting without calling MPI_FINALIZE first).
>
> Although this local MPI process will likely now behave unpredictably
> (it may even hang or crash), the root cause of this problem is the
> failure of the peer -- that is what you need to investigate. For
> example, there may be a core file that you can examine. More
> generally: such peer hangups are frequently caused by application bugs
> or other external events.
>
> Local host: microway3
> Local PID: 214271
> Peer host: microway1
> --------------------------------------------------------------------------
> mic:/amd/home/jdelsign/PMIx>
>
> If I use mpirun to launch the program it works whether or not a DVM is already running (first without a DVM, then with a DVM):
>
> mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name --personality ompi --hostfile myhostfile ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway1
> Hello from proc (1): microway2.totalviewtech.com
> Hello from proc (2): microway3.totalviewtech.com
> All Done!
> mic:/amd/home/jdelsign/PMIx>
> mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile --daemonize
> mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name --personality ompi --hostfile myhostfile ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway1
> Hello from proc (1): microway2.totalviewtech.com
> Hello from proc (2): microway3.totalviewtech.com
> All Done!
> mic:/amd/home/jdelsign/PMIx>
>
> But if I use mpirun to launch 3 processes from microway3 and use a hostfile that contains only microway[23], I get a failure similar to the prun case:
>
> mic:/amd/home/jdelsign/PMIx>hostname
> microway3.totalviewtech.com
> mic:/amd/home/jdelsign/PMIx>cat myhostfile2
> microway2 slots=16
> microway3 slots=16
> mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name --personality ompi --hostfile myhostfile2 ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway2.totalviewtech.com
> Hello from proc (1): microway2.totalviewtech.com
> --------------------------------------------------------------------------
> WARNING: Open MPI failed to TCP connect to a peer MPI process. This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
> Local host: microway3
> PID: 227713
> Message: connect() to 10.71.2.58:1024 failed
> Error: No route to host (113)
> --------------------------------------------------------------------------
> [microway3.totalviewtech.com:227713] ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
> [microway2][[32270,1],1][../../../../../../ompi/opal/mca/btl/tcp/btl_tcp.c:566:mca_btl_tcp_recv_blocking] recv(13) failed: Connection reset by peer (104)
> mic:/amd/home/jdelsign/PMIx>
>
> I asked my therapist (Ralph) about it, and he said:
>
> "It looks to me like the btl/tcp component is having trouble correctly selecting a route to use when opening communications across hosts. I've seen this in my docker setup too, but thought perhaps it was just a docker-related issue.
> What's weird in your last example is that both procs are on the same node, and therefore they should only be using shared memory to communicate - the btl/tcp component shouldn't be trying to create a connection at all."
>
> Cheers, John D.
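Regarding Ralph's comment about route selection: a quick way to see which local address the kernel's routing table would actually pick for the failing peer is the classic UDP-connect trick sketched below. connect() on a UDP socket sends no packets; it only performs the route lookup and fixes the local address, which getsockname() then reports. If there is no usable route at all, the connect() itself fails with a "no route"/"network unreachable" class of error. The peer address and port are the ones from the failures above; the rest is an illustrative sketch, not Open MPI code:

/* Ask the kernel which local source address its routing table selects
 * for 10.71.2.58:1024, without sending any traffic. Illustrative sketch;
 * build with: cc -o which_route which_route.c
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer, local;
    socklen_t len = sizeof(local);
    char buf[INET_ADDRSTRLEN];

    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(1024);               /* port from the failures above */
    inet_pton(AF_INET, "10.71.2.58", &peer.sin_addr);

    /* UDP connect() only does the route lookup; no packets are sent. */
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }
    if (getsockname(fd, (struct sockaddr *)&local, &len) == 0) {
        inet_ntop(AF_INET, &local.sin_addr, buf, sizeof(buf));
        printf("traffic to 10.71.2.58 would be sourced from %s\n", buf);
    }
    close(fd);
    return 0;
}

Comparing that output on microway1 and microway3 against the addresses the peers actually advertise should show whether btl/tcp is pairing addresses on networks that cannot reach each other.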