On Jun 3, 2014, at 3:06 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> 
wrote:

> Ralph,
> 
> i get no more complaints about rtc :-)
> 
> but MPI_Abort still hangs :-(
> 
> i reviewed my configuration and the hang is not related to one node having 
> one IB port and the other node having two IB ports.
> 
> the two nodes can establish TCP connections via :
> - eth0 (but they are *not* on the same subnet)
> - ib0 (and they *are* on the same subnet)
> 
> from the logs, it seems eth0 is "discarded" and only ib0 is used.

That would be correct - we don't really "discard" eth0; we default to using 
the interfaces on the common subnet to avoid routing.
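
To make the "common subnet" preference concrete, here is a minimal sketch of 
the idea. It is not Open MPI's actual selection code: the function names 
common_subnet_pairs and preferred_pairs are hypothetical, and the fallback to 
routed pairs is an assumption. The addresses are the ones from the VM 
configuration quoted further down in this message.

```python
import ipaddress

# Addresses as quoted in the VM configuration later in this message.
LOCAL = {"eth0": "192.168.122.100/24", "eth0:1": "10.0.0.1/24"}  # slurm0
REMOTE = {"eth0": "192.168.222.0/24", "eth1": "10.0.0.2/24"}     # slurm3


def common_subnet_pairs(local, remote):
    """Return (local_if, remote_if, network) for interfaces sharing a subnet."""
    pairs = []
    for l_name, l_addr in local.items():
        l_net = ipaddress.ip_interface(l_addr).network
        for r_name, r_addr in remote.items():
            r_net = ipaddress.ip_interface(r_addr).network
            if l_net == r_net:
                pairs.append((l_name, r_name, str(l_net)))
    return pairs


def preferred_pairs(local, remote):
    """Prefer same-subnet pairs; otherwise return every pair (assumed routed)."""
    direct = common_subnet_pairs(local, remote)
    if direct:
        return direct
    # Hypothetical fallback: no shared subnet, so any pair needs a router.
    return [(l_name, r_name, None) for l_name in local for r_name in remote]


if __name__ == "__main__":
    for pair in preferred_pairs(LOCAL, REMOTE):
        print(pair)
```

With those addresses, only the 10.0.0.0/24 pair (eth0:1 on slurm0, eth1 on 
slurm3) shares a subnet, so the 192.168.x.x interfaces are passed over rather 
than discarded.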

> when the task aborts, it hangs ...
> 
> 
> 
> i attached the logs i took on two VMs with a "simpler" config :
> - slurm0 has one eth port (eth0)
>   * eth0 is on 192.168.122.100/24 (network 0)
>   * eth0:1 is on 10.0.0.1/24 (network 0)
> - slurm3 has two eth ports (eth0 and eth1)
>   * eth0 is on 192.168.222.0/24 (network 1)
>   * eth1 is on 10.0.0.2/24 (network 0)
> 
> network0 and network1 are connected to a router.
> 
> 
> from slurm0, i launch :
> 
> mpirun -np 1 -host slurm3 --mca btl tcp,self --mca oob_base_verbose 10 ./abort

Is this running under slurm? Or are you running under rsh?

> 
> the oob logs are attached
> 
> Cheers,
> 
> Gilles
> 
> On Tue, Jun 3, 2014 at 12:10 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> Thanks Ralph,
> 
> i will try this tomorrow
> 
> Cheers,
> 
> Gilles
> 
> 
> 
> On Tue, Jun 3, 2014 at 12:03 AM, Ralph Castain <r...@open-mpi.org> wrote:
> I think I have this fixed with r31928, but have no way to test it on my 
> machine. Please see if it works for you.
> 
> 
> On Jun 2, 2014, at 7:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> This is indeed the problem - we are trying to send a message and don't know 
>> how to get it somewhere. I'll break the loop, and then ask that you run this 
>> again with -mca oob_base_verbose 10 so we can see the intended recipient.
>> 
>> On Jun 2, 2014, at 3:55 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>> 
>>> #7  0x00007fe8fab67ce3 in mca_oob_tcp_component_hop_unknown () from 
>>> /.../local/ompi-trunk/lib/openmpi/mca_oob_tcp.so
>> 
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/14954.php
> 
> 
> <abort.oob.log.gz>
