Paul: I think this is a slippery slope. As I understand it, these private/on-host IP addresses are generated somewhat randomly (e.g., for on-host VM networking -- I don't know if the IPs for Phi on-host networking are pseudo-random or [effectively] fixed). So you might end up in a situation like this:

server A: has br0 on-host IP address 10.0.0.23/8   ***same as server C
server B: has br0 on-host IP address 10.0.0.25/8
server C: has br0 on-host IP address 10.0.0.23/8   ***same as server A
server D: has br0 on-host IP address 10.0.0.107/8

In this case, servers A and C will detect that they have the same IP. "Ah ha!" they say. "I'll just not use br0, because clearly this is erroneous". But how will servers B and D know this? You'll likely get the same "hang" behavior that we currently have, because B may try to send to A on 10.0.0.23/8.

Hence, the additional logic may not actually solve the problem. I'm thinking that this is a human-configuration issue -- there may not be a good way to detect this automatically.

...unless there's a bit in Linux interfaces that says "this is an on-host network". Does that exist? Because that would be a better way to disqualify Linux IP interfaces.


On Aug 13, 2014, at 1:57 PM, Paul Hargrove <[email protected]> wrote:

> I think that in this case one *could* add logic that would disqualify the
> subnet because every compute node in the job has the SAME address. In fact,
> any subnet on which two or more compute nodes have the same address must be
> suspect. If this logic were introduced, the 127.0.0.1 loopback address
> wouldn't need to be a special case.
>
> This is just an observation, not a feature request (at least not on my part).
>
> -Paul
>
>
> On Wed, Aug 13, 2014 at 7:55 AM, Jeff Squyres (jsquyres) <[email protected]>
> wrote:
> I think this is expected behavior.
>
> If you have networks that you need Open MPI to ignore (e.g., a private
> network that *looks* reachable between multiple servers -- because the
> interfaces are on the same subnet -- but actually *isn't*), then the
> include/exclude mechanism is the right way to exclude them.
>
> That being said, I'm not sure why the behavior is different between trunk and
> v1.8.
>
>
> On Aug 13, 2014, at 1:41 AM, Gilles Gouaillardet
> <[email protected]> wrote:
>
> > Folks,
> >
> > i noticed mpirun (trunk) hangs when running any mpi program on two nodes
> > *and* each node has a private network with the same ip
> > (in my case, each node has a private network to a MIC)
> >
> > in order to reproduce the problem, you can simply run (as root) on the
> > two compute nodes
> > brctl addbr br0
> > ifconfig br0 192.168.255.1 netmask 255.255.255.0
> >
> > mpirun will hang
> >
> > a workaround is to add --mca btl_tcp_if_include eth0
> >
> > v1.8 does not hang in this case
> >
> > Cheers,
> >
> > Gilles
> > _______________________________________________
> > devel mailing list
> > [email protected]
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/08/15623.php
>
>
> --
> Jeff Squyres
> [email protected]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15631.php
>
>
>
> --
> Paul H. Hargrove                          [email protected]
> Future Technologies Group
> Computer and Data Sciences Department     Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15636.php


--
Jeff Squyres
[email protected]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
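
For illustration only, the four-server scenario in this thread can be sketched in Python. This is not Open MPI code: the function names and the `hosts` dictionary are hypothetical, and the sketch assumes every node's interface addresses have already been gathered in one place, which is exactly the part that is hard in practice. It compares the "any subnet with a repeated address is suspect" rule against what each server could conclude from its own address alone.

```python
import ipaddress
from collections import defaultdict


def _owners(hosts):
    """Map (network, address) -> set of hostnames reporting that address."""
    owners = defaultdict(set)
    for host, addrs in hosts.items():
        for text in addrs:
            iface = ipaddress.ip_interface(text)  # e.g. "10.0.0.23/8"
            owners[(iface.network, iface.ip)].add(host)
    return owners


def suspect_subnets(hosts):
    """Networks on which two or more hosts report the same address."""
    return {net for (net, _ip), names in _owners(hosts).items()
            if len(names) > 1}


def hosts_with_duplicate_address(hosts):
    """Hosts whose own address collides with some other host's address."""
    clashing = set()
    for names in _owners(hosts).values():
        if len(names) > 1:
            clashing |= names
    return clashing


if __name__ == "__main__":
    # Hypothetical data mirroring the example in this message.
    hosts = {
        "A": ["10.0.0.23/8"],
        "B": ["10.0.0.25/8"],
        "C": ["10.0.0.23/8"],
        "D": ["10.0.0.107/8"],
    }
    print(suspect_subnets(hosts))
    print(hosts_with_duplicate_address(hosts))
```

With the full table in hand, the whole 10.0.0.0/8 network is flagged, but only A and C own a colliding address. B and D would need the same global view to reach the same verdict, so a check that each server runs against its own address alone would not be enough.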
