Another random thought for Gilles' situation: why not oob_tcp_if_include ib0
(and not eth0)?

That should solve his problem, but not the larger issue I raised in my previous 
email.

Sent from my phone. No type good.

On Jun 4, 2014, at 9:32 PM, "Gilles Gouaillardet" 
<gilles.gouaillar...@gmail.com> wrote:

Thanks Ralph,

For the time being, I just found a workaround:
--mca oob_tcp_if_include eth0

Generally speaking, is Open MPI doing the wise thing here?
Here is what I mean:
on the cluster I work on (4k+ nodes), each node has two IP interfaces:
 * eth0 (gigabit Ethernet): because of the cluster size, several subnets are 
used.
 * ib0 (IP over IB): only one subnet
I can easily understand that such a large cluster is not so common, but on the 
other hand I do not believe the IP configuration (subnetted gigE and 
single-subnet IPoIB) can be called exotic.

If nodes from different eth0 subnets are used, and if I understand your 
previous replies correctly, ORTE will "discard" eth0 because the nodes cannot 
contact each other "directly".
Here "directly" means that the nodes are on the same subnet. That being said, 
nodes on different subnets can still communicate over IP thanks to IP routing 
(I mean IP routing, I do *not* mean ORTE routing).
That means ORTE communications will use IPoIB, which might not be the best 
choice since establishing an IPoIB connection can take a long time (especially 
at scale *and* if the ARP table is not populated).
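
To make the "same subnet" notion above concrete, here is a minimal Python 
sketch. It is only an illustration and not Open MPI's actual logic; the helper 
name, the addresses and the prefix lengths are all made-up assumptions, not 
values from the cluster discussed here.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str) -> bool:
    """Return True if two interface addresses (CIDR notation) share a subnet."""
    net_a = ipaddress.ip_interface(addr_a).network
    net_b = ipaddress.ip_interface(addr_b).network
    return net_a == net_b


# Hypothetical addresses: eth0 is split into several /24 subnets,
# while ib0 (IPoIB) uses one large subnet for the whole cluster.
node1 = {"eth0": "10.1.0.5/24", "ib0": "192.168.0.5/16"}
node2 = {"eth0": "10.2.0.7/24", "ib0": "192.168.3.7/16"}

for ifname in ("eth0", "ib0"):
    # eth0 -> False (different subnets, reachable only via IP routing)
    # ib0  -> True  (same subnet, "directly" reachable)
    print(ifname, same_subnet(node1[ifname], node2[ifname]))
```

With these assumed addresses, the two nodes are "directly" reachable only over 
ib0, even though routed eth0 traffic between them would still work.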

Is my understanding correct so far?

Bottom line, I would have expected Open MPI to use eth0 regardless of whether 
IP routing is required, with ib0 simply not used (or possibly used as a 
fallback option).

This leads to my next question: is the current default OK? If not, should we 
change it, and how?
/*
IMHO:
 - IP routing is not always a bad/slow thing
 - gigE can sometimes be better than IPoIB
*/

I am fine if, in the end:
- this issue is fixed
- we decide it is up to the sysadmin to make --mca oob_tcp_if_include eth0 the 
default if this is really thought to be best for the cluster (and I can try to 
draft a FAQ entry if needed).

Cheers,

Gilles

On Wed, Jun 4, 2014 at 11:50 PM, Ralph Castain <r...@open-mpi.org> wrote:

I'll work on it. It may take a day or two to really fix. It only impacts 
systems with mismatched interfaces, which is why we aren't generally seeing it.

_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/06/14972.php