On Jun 3, 2014, at 3:06 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> Ralph,
>
> i get no more complaints about rtc :-)
>
> but MPI_Abort still hangs :-(
>
> i reviewed my configuration and the hang is not related to one node having
> one IB port and the other node having two IB ports.
>
> the two nodes can establish TCP connections via :
> - eth0 (but they are *not* on the same subnet)
> - ib0 (and they *are* on the same subnet)
>
> from the logs, it seems eth0 is "discarded" and only ib0 is used.

That would be correct - we don't really "discard" eth0, but default to using the interfaces on the common subnet to avoid routing.

> when the task aborts, it hangs ...
>
> i attached the logs i took on two VMs with a "simpler" config :
> - slurm0 has one eth port (eth0)
>   * eth0 is on 192.168.122.100/24 (network 0)
>   * eth0:1 is on 10.0.0.1/24 (network 0)
> - slurm3 has two eth ports (eth0 and eth1)
>   * eth0 is on 192.168.222.0/24 (network 1)
>   * eth1 is on 10.0.0.2/24 (network 0)
>
> network0 and network1 are connected to a router.
>
> from slurm0, i launch :
>
> mpirun -np 1 -host slurm3 --mca btl tcp,self --mca oob_base_verbose 10 ./abort

Is this running under slurm? Or are you running under rsh?

> the oob logs are attached
>
> Cheers,
>
> Gilles
>
> On Tue, Jun 3, 2014 at 12:10 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
> Thanks Ralph,
>
> i will try this tomorrow
>
> Cheers,
>
> Gilles
>
> On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain <r...@open-mpi.org> wrote:
> I think I have this fixed with r31928, but have no way to test it on my
> machine. Please see if it works for you.
>
> On Jun 2, 2014, at 7:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> This is indeed the problem - we are trying to send a message and don't know
>> how to get it somewhere. I'll break the loop, and then ask that you run this
>> again with -mca oob_base_verbose 10 so we can see the intended recipient.
>>
>> On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>>> #7 0x00007fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from
>>> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14954.php
>
> <abort.oob.log.gz>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14964.php
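
For anyone following the thread, here is a rough sketch of the "common subnet" idea mentioned above, applied to the two-VM addresses Gilles listed. It is only an illustration using Python's standard ipaddress module, not the actual oob/tcp selection code; the function name common_subnet_pairs and the dictionaries are hypothetical, and the addresses are copied as written in the message.

```python
import ipaddress

def common_subnet_pairs(local_ifaces, remote_ifaces):
    """Return (local_name, remote_name) pairs whose addresses fall in the same network.

    Hypothetical helper for illustration only; this is not how Open MPI
    is implemented, just the subnet comparison described in the thread.
    """
    pairs = []
    for lname, lcidr in local_ifaces.items():
        lnet = ipaddress.ip_interface(lcidr).network
        for rname, rcidr in remote_ifaces.items():
            if ipaddress.ip_interface(rcidr).network == lnet:
                pairs.append((lname, rname))
    return pairs

# Addresses as reported for the two VMs in the message above.
slurm0 = {"eth0": "192.168.122.100/24", "eth0:1": "10.0.0.1/24"}
slurm3 = {"eth0": "192.168.222.0/24", "eth1": "10.0.0.2/24"}

pairs = common_subnet_pairs(slurm0, slurm3)
print(pairs)
```

With these inputs only the 10.0.0.0/24 addresses match, so a peer reached over any other interface pair would need to go through the router.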