Ralph, i get no more complaints about rtc :-)
but MPI_Abort still hangs :-(

i reviewed my configuration and the hang is not related to one node having
one IB port and the other node having two IB ports.

the two nodes can establish TCP connections via :
- eth0 (but they are *not* on the same subnet)
- ib0 (and they *are* on the same subnet)

from the logs, it seems eth0 is "discarded" and only ib0 is used.
when the task aborts, it hangs ...

i attached the logs i took on two VMs with a "simpler" config :
- slurm0 has one eth port (eth0)
  * eth0 is on 192.168.122.100/24 (network 0)
  * eth0:1 is on 10.0.0.1/24 (network 0)
- slurm3 has two eth ports (eth0 and eth1)
  * eth0 is on 192.168.222.0/24 (network 1)
  * eth1 is on 10.0.0.2/24 (network 0)

network0 and network1 are connected to a router.

from slurm0, i launch :
mpirun -np 1 -host slurm3 --mca btl tcp,self --mca oob_base_verbose 10 ./abort

the oob logs are attached

Cheers,

Gilles

On Tue, Jun 3, 2014 at 12:10 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Thanks Ralph,
>
> i will try this tomorrow
>
> Cheers,
>
> Gilles
>
>
> On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I think I have this fixed with r31928, but have no way to test it on my
>> machine. Please see if it works for you.
>>
>>
>> On Jun 2, 2014, at 7:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> This is indeed the problem - we are trying to send a message and don't
>> know how to get it somewhere. I'll break the loop, and then ask that you
>> run this again with -mca oob_base_verbose 10 so we can see the intended
>> recipient.
>>
>> On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>> #7 0x00007fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from
>> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/06/14954.php
>>
>
abort.oob.log.gz
Description: GNU Zip compressed data
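
For readers following the VM setup in the message, here is a minimal sketch that works out which slurm0/slurm3 interface pairs sit on the same subnet, using only the addresses quoted in the message. The `INTERFACES` table and the `shared_subnets` helper are hypothetical names introduced for illustration; this is not how the oob/tcp component selects interfaces, only a way to check the topology by hand. Note that slurm3 eth0 is given in the message as a network (192.168.222.0/24), not a host address, and is used as-is.

```python
import ipaddress

# Addresses copied from the message; slurm3 eth0 is stated only as a network.
INTERFACES = {
    "slurm0": {"eth0": "192.168.122.100/24", "eth0:1": "10.0.0.1/24"},
    "slurm3": {"eth0": "192.168.222.0/24", "eth1": "10.0.0.2/24"},
}


def shared_subnets(host_a, host_b):
    """Return (iface_a, iface_b, network) for each interface pair on the same subnet."""
    pairs = []
    for name_a, addr_a in INTERFACES[host_a].items():
        net_a = ipaddress.ip_interface(addr_a).network
        for name_b, addr_b in INTERFACES[host_b].items():
            net_b = ipaddress.ip_interface(addr_b).network
            if net_a == net_b:
                pairs.append((name_a, name_b, str(net_a)))
    return pairs


if __name__ == "__main__":
    for iface_a, iface_b, net in shared_subnets("slurm0", "slurm3"):
        print(f"slurm0 {iface_a} <-> slurm3 {iface_b} on {net}")
```

With the addresses as quoted, the only directly shared subnet is 10.0.0.0/24 (slurm0 eth0:1 and slurm3 eth1); the 192.168.x.x addresses can reach each other only through the router.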