John,

The common denominator across all of these errors is a connect() failure
while trying to reach 10.71.2.58 on port 1024. Which host is 10.71.2.58? Is
the firewall open? Is port 1024 allowed through?
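A quick way to check those questions from one of the failing nodes might be something like the following (a sketch only; the commands assume CentOS 7 with firewalld, as in John's setup, and that you run them from the node reporting "No route to host"):

```shell
# Is 10.71.2.58 reachable at all from this node?
ping -c 3 10.71.2.58

# Does port 1024 on that address accept TCP connections?
# (-z: scan only, -v: verbose, -w 5: 5-second timeout)
nc -zv -w 5 10.71.2.58 1024

# On CentOS 7, show the active firewalld zone and its allowed
# ports/services, to see whether 1024/tcp would be blocked:
sudo firewall-cmd --list-all

# Which node/interface owns 10.71.2.58? Run on each microway node:
ip addr | grep 10.71.2.58
```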

  George.


On Mon, May 4, 2020 at 11:36 AM John DelSignore via devel <
devel@lists.open-mpi.org> wrote:

> Inline below...
>
> On 2020-05-04 11:09, Ralph Castain via devel wrote:
>
> Staring at this some more, I do have the following questions:
>
> * in your first case, it looks like "prte" was started from microway3 -
> correct?
>
> Yes, "prte" was started from microway3.
>
>
> * in the second case, that worked, it looks like "mpirun" was executed
> from microway1 - correct?
>
> No, "mpirun" was executed from microway3.
>
>
> * in the third case, you state that "mpirun" was again executed from
> microway3, and the process output confirms that
>
> Yes, "mpirun" was started from microway3.
>
>
> I'm wondering if the issue here might actually be that PRRTE expects the
> ordering of hosts in the hostfile to start with the host it is sitting on -
> i.e., if the node index number between the various daemons is getting
> confused. Can you perhaps see what happens with the failing cases if you
> put microway3 at the top of the hostfile and execute prte/mpirun from
> microway3 as before?
>
> OK, the first failing case:
>
> mic:/amd/home/jdelsign/PMIx>pterm
> pterm failed to initialize, likely due to no DVM being available
> mic:/amd/home/jdelsign/PMIx>cat myhostfile3
> microway3 slots=16
> microway1 slots=16
> microway2 slots=16
> mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile3 --daemonize
> mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name
> --personality ompi ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway3.totalviewtech.com
> Hello from proc (1): microway1
> Hello from proc (2): microway2.totalviewtech.com
> --------------------------------------------------------------------------
> WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
>   Local host: microway1
>   PID:        292266
>   Message:    connect() to 10.71.2.58:1024 failed
>   Error:      No route to host (113)
> --------------------------------------------------------------------------
> [microway1:292266]
> ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
> mic:/amd/home/jdelsign/PMIx>hostname
> microway3.totalviewtech.com
> mic:/amd/home/jdelsign/PMIx>
>
> And the second failing test case:
>
> mic:/amd/home/jdelsign/PMIx>pterm
> pterm failed to initialize, likely due to no DVM being available
> mic:/amd/home/jdelsign/PMIx>cat myhostfile3+2
> microway3 slots=16
> microway2 slots=16
> mic:/amd/home/jdelsign/PMIx>
> mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name
> --personality ompi --hostfile myhostfile3+2 ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway3.totalviewtech.com
> Hello from proc (1): microway3.totalviewtech.com
> --------------------------------------------------------------------------
> WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
>   Local host: microway3
>   PID:        271144
>   Message:    connect() to 10.71.2.58:1024 failed
>   Error:      No route to host (113)
> --------------------------------------------------------------------------
> [microway3.totalviewtech.com:271144]
> ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
> Hello from proc (2): microway2.totalviewtech.com
> mic:/amd/home/jdelsign/PMIx>
>
> So, AFAICT, host name order didn't matter.
>
> Cheers, John D.
>
>
>
>
>
> On May 4, 2020, at 7:34 AM, John DelSignore via devel <
> devel@lists.open-mpi.org> wrote:
>
> Hi folks,
>
> I cloned a fresh copy of OMPI master this morning at ~8:30am EDT and
> rebuilt. I'm running a very simple test code on three Centos 7.[56] nodes
> named microway[123] over TCP. I'm seeing a fatal error similar to the
> following:
>
> [microway3.totalviewtech.com:227713]
> ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
>
> The case of prun launching an OMPI code does not work correctly. The MPI
> processes seem to launch OK, but there is the following OMPI error at the
> point where the processes communicate. In the following case, I have a DVM
> running on three nodes "microway[123]":
>
> mic:/amd/home/jdelsign/PMIx>prun -n 3 --map-by node -x MESSAGE=name
> --personality ompi ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway3.totalviewtech.com
> Hello from proc (1): microway1
> Hello from proc (2): microway2.totalviewtech.com
> --------------------------------------------------------------------------
> WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
>   Local host: microway1
>   PID:        282716
>   Message:    connect() to 10.71.2.58:1024 failed
>   Error:      No route to host (113)
> --------------------------------------------------------------------------
> [microway1:282716]
> ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
> --------------------------------------------------------------------------
> An MPI communication peer process has unexpectedly disconnected.  This
> usually indicates a failure in the peer process (e.g., a crash or
> otherwise exiting without calling MPI_FINALIZE first).
>
> Although this local MPI process will likely now behave unpredictably
> (it may even hang or crash), the root cause of this problem is the
> failure of the peer -- that is what you need to investigate.  For
> example, there may be a core file that you can examine.  More
> generally: such peer hangups are frequently caused by application bugs
> or other external events.
>
>   Local host: microway3
>   Local PID:  214271
>   Peer host:  microway1
> --------------------------------------------------------------------------
> mic:/amd/home/jdelsign/PMIx>
>
> If I use mpirun to launch the program it works whether or not a DVM is
> already running (first without a DVM, then with a DVM):
>
> mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name
> --personality ompi --hostfile myhostfile ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway1
> Hello from proc (1): microway2.totalviewtech.com
> Hello from proc (2): microway3.totalviewtech.com
> All Done!
> mic:/amd/home/jdelsign/PMIx>
> mic:/amd/home/jdelsign/PMIx>prte --hostfile ./myhostfile --daemonize
> mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name
> --personality ompi --hostfile myhostfile ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway1
> Hello from proc (1): microway2.totalviewtech.com
> Hello from proc (2): microway3.totalviewtech.com
> All Done!
> mic:/amd/home/jdelsign/PMIx>
>
> But if I use mpirun to launch 3 processes from microway3 and use a
> hostfile that contains only microway[23], I get a similar failure as the
> prun case:
>
> mic:/amd/home/jdelsign/PMIx>hostname
> microway3.totalviewtech.com
> mic:/amd/home/jdelsign/PMIx>cat myhostfile2
> microway2 slots=16
> microway3 slots=16
> mic:/amd/home/jdelsign/PMIx>mpirun -n 3 --map-by node -x MESSAGE=name
> --personality ompi --hostfile myhostfile2 ./tx_basic_mpi
> tx_basic_mpi
> Hello from proc (0)
> MESSAGE: microway2.totalviewtech.com
> Hello from proc (1): microway2.totalviewtech.com
> --------------------------------------------------------------------------
> WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
> should not happen.
>
> Your Open MPI job may now hang or fail.
>
>   Local host: microway3
>   PID:        227713
>   Message:    connect() to 10.71.2.58:1024 failed
>   Error:      No route to host (113)
> --------------------------------------------------------------------------
> [microway3.totalviewtech.com:227713]
> ../../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:189 FATAL
> [microway2][[32270,1],1][../../../../../../ompi/opal/mca/btl/tcp/btl_tcp.c:566:mca_btl_tcp_recv_blocking]
> recv(13) failed: Connection reset by peer (104)
> mic:/amd/home/jdelsign/PMIx>
>
> I asked my therapist (Ralph) about it, and he said,
>
> "It looks to me like the btl/tcp component is having trouble correctly
> selecting a route to use when opening communications across hosts. I've
> seen this in my docker setup too, but thought perhaps it was just a
> docker-related issue.
> What's weird in your last example is that both procs are on the same node,
> and therefore they should only be using shared memory to communicate - the
> btl/tcp component shouldn't be trying to create a connection at all."
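One way to probe that theory is to restrict which interfaces and BTLs Open MPI may use, via standard MCA parameters. A sketch (the interface names eth0 and docker0 are assumptions; substitute the NIC that actually carries the microway[123] subnet, and whatever interface owns 10.71.2.58):

```shell
# Restrict btl/tcp to a single known-good interface:
mpirun -n 3 --map-by node --mca btl_tcp_if_include eth0 ./tx_basic_mpi

# Or exclude the suspect interface (e.g. a docker bridge) and loopback:
mpirun -n 3 --map-by node --mca btl_tcp_if_exclude docker0,lo ./tx_basic_mpi

# To confirm on-node pairs use shared memory, spell out the BTL list
# explicitly (self + shared memory + TCP):
mpirun -n 3 --map-by node --mca btl self,vader,tcp ./tx_basic_mpi
```

If the failure disappears with btl_tcp_if_include, that would point at btl/tcp picking an unroutable interface rather than at the firewall.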
>
> Cheers, John D.
>
>
>
