Re: [OMPI devel] trunk hang when nodes have similar but private network

Jeff Squyres (jsquyres) Wed, 13 Aug 2014 15:39:38 -0400 (EDT)

Paul: I think this is a slippery slope.

As I understand it, these private/on-host IP addresses are generated somewhat 
randomly (e.g., for on-host VM networking -- I don't know if the IPs for Phi 
on-host networking are pseudo-random or [effectively] fixed).  So you might end 
up in a situation like this:


server A: has br0 on-host IP address 10.0.0.23/8 ***same as server C
server B: has br0 on-host IP address 10.0.0.25/8
server C: has br0 on-host IP address 10.0.0.23/8 ***same as server A
server D: has br0 on-host IP address 10.0.0.107/8

In this case, servers A and C will detect that they have the same IP.  "Ah ha!" 
they say. "I'll just not use br0, because clearly this is erroneous".

But how will servers B and D know this?

You'll likely get the same "hang" behavior that we currently have, because B 
may try to send to A on 10.0.0.23/8.

Hence, the additional logic may not actually solve the problem.
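
To make the A/B/C/D scenario concrete, here is a toy sketch in Python. It is 
not Open MPI code; the function name and the host table are hypothetical and 
exist only to illustrate the argument. It assumes each server disqualifies br0 
only when its own address shows up on another server.

```python
from collections import defaultdict

def find_duplicate_addresses(addr_by_host):
    """Group hosts by reported address; keep only addresses shared by 2+ hosts."""
    hosts_by_addr = defaultdict(list)
    for host, addr in addr_by_host.items():
        hosts_by_addr[addr].append(host)
    return {addr: sorted(hosts)
            for addr, hosts in hosts_by_addr.items()
            if len(hosts) > 1}

# Hypothetical table of br0 addresses, mirroring the example above.
br0 = {"A": "10.0.0.23", "B": "10.0.0.25", "C": "10.0.0.23", "D": "10.0.0.107"}

dups = find_duplicate_addresses(br0)

# Servers whose own address is duplicated would stop using br0 ...
droppers = {host for hosts in dups.values() for host in hosts}
# ... while the others see nothing wrong with their own br0 and keep it.
keepers = set(br0) - droppers

print(dups)
print(sorted(droppers), sorted(keepers))
```

Under that assumption only A and C drop br0; B and D still consider it usable, 
which is where the hang can come back.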

I'm thinking that this is a human-configuration issue -- there may not be a 
good way to detect this automatically.

...unless there's a bit in Linux interfaces that says "this is an on-host 
network".  Does that exist?  Because that would be a better way to disqualify 
Linux IP interfaces.
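
As a partial heuristic only (it does not answer the question above): Linux 
sysfs exposes a "bridge" directory under /sys/class/net/<ifname> for bridge 
devices such as the br0 in the reproducer. A minimal Python sketch, with 
hypothetical helper names, could list them. Note that a bridge is not 
necessarily an on-host network (it may have physical ports), so this is a 
hint for a human, not a safe automatic disqualifier.

```python
import os

def is_bridge(ifname, sysfs_root="/sys/class/net"):
    """Return True if the interface has a 'bridge' directory in sysfs."""
    return os.path.isdir(os.path.join(sysfs_root, ifname, "bridge"))

def list_bridges(sysfs_root="/sys/class/net"):
    """Return the sorted names of all bridge interfaces found under sysfs_root."""
    if not os.path.isdir(sysfs_root):
        return []  # not Linux, or sysfs is not mounted
    return sorted(name for name in os.listdir(sysfs_root)
                  if is_bridge(name, sysfs_root))

if __name__ == "__main__":
    # Candidate interfaces to review before excluding them by hand.
    print(list_bridges())
```

The names it prints are candidates for the existing include/exclude mechanism, 
nothing more.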


On Aug 13, 2014, at 1:57 PM, Paul Hargrove <[email protected]> wrote:

> I think that in this case one *could* add logic that would disqualify the 
> subnet because every compute node in the job has the SAME address.  In fact, 
> any subnet on which two or more compute nodes have the same address must be 
> suspect.  If this logic were introduced, the 127.0.0.1 loopback address 
> wouldn't need to be a special case.
> 
> This is just an observation, not a feature request (at least not on my part).
> 
> -Paul
> 
> 
> On Wed, Aug 13, 2014 at 7:55 AM, Jeff Squyres (jsquyres) <[email protected]> 
> wrote:
> I think this is expected behavior.
> 
> If you have networks that you need Open MPI to ignore (e.g., a private 
> network that *looks* reachable between multiple servers -- because the 
> interfaces are on the same subnet -- but actually *isn't*), then the 
> include/exclude mechanism is the right way to exclude them.
> 
> That being said, I'm not sure why the behavior is different between trunk and 
> v1.8.
> 
> 
> On Aug 13, 2014, at 1:41 AM, Gilles Gouaillardet 
> <[email protected]> wrote:
> 
> > Folks,
> >
> > I noticed mpirun (trunk) hangs when running any MPI program on two nodes
> > *and* each node has a private network with the same IP
> > (in my case, each node has a private network to a MIC)
> >
> > in order to reproduce the problem, you can simply run (as root) on the
> > two compute nodes
> > brctl addbr br0
> > ifconfig br0 192.168.255.1 netmask 255.255.255.0
> >
> > mpirun will hang
> >
> > a workaround is to add --mca btl_tcp_if_include eth0
> >
> > v1.8 does not hang in this case
> >
> > Cheers,
> >
> > Gilles
> > _______________________________________________
> > devel mailing list
> > [email protected]
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/08/15623.php
> 
> 
> --
> Jeff Squyres
> [email protected]
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15631.php
> 
> 
> 
> -- 
> Paul H. Hargrove                          [email protected]
> Future Technologies Group
> Computer and Data Sciences Department     Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15636.php


-- 
Jeff Squyres
[email protected]
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
