I keep explaining that we don't "discard" anything, but there really isn't any point in continuing to try to explain the system. With the announced intention of completing the move of the BTLs to OPAL, I no longer need the multi-module complexity in the OOB/TCP. So I have removed it and gone back to the single module that connects to everything.

Try r31956 - hopefully it will resolve your connectivity issues. Still looking at the MPI_Abort hang, as I'm having trouble replicating it.

On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Jeff,
>
> as pointed out by Ralph, i do wish to use eth0 for oob messages.
>
> i work on a 4k+ nodes cluster with a very decent gigabit ethernet
> network (reasonable oversubscription + switches from a reputable
> vendor you are familiar with ;-) )
> my experience is that IPoIB can be very slow at establishing a
> connection, especially if the arp table is not populated
> (as far as i understand, this involves the subnet manager, and
> performance can be very random, especially if all nodes issue
> arp requests at the same time)
> on the other hand, performance is much more stable when using the
> subnetted IP network.
>
> as Ralph also pointed out, i can imagine some architects neglect their
> ethernet network (e.g. highly oversubscribed + low end switches),
> and in this case ib0 is the best fit for oob messages.
>
>> As a simple solution, there could be a TCP oob MCA param that says
>> "regardless of peer IP address, I can connect to them" (i.e., assume IP
>> routing will make everything work out ok).
> +1 and/or an option to tell oob mca "do not discard the interface simply
> because the peer IP is not in the same subnet"
>
> Cheers,
>
> Gilles
>
> On 2014/06/05 23:01, Ralph Castain wrote:
>> Because Gilles wants to avoid using IB for TCP messages, and using eth0 also
>> solves the problem (the messages just route)
>>
>> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>> wrote:
>>
>>> Another random thought for Gilles' situation: why not oob-TCP-if-include
>>> ib0? (And not eth0)
>>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14982.php
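
For readers following the subnet discussion in the quoted thread, here is a rough sketch of the two behaviours being contrasted: picking interfaces only when the peer IP is in the same subnet, versus assuming IP routing will make the connection work. This is not the OOB/TCP code and does not reflect how Open MPI actually selects interfaces; the function names, the assume_routable switch, and the addresses are all made up for illustration.

```python
import ipaddress


def same_subnet(local_if: str, peer_ip: str) -> bool:
    """Return True if peer_ip lies inside the network of local_if.

    local_if is given as 'address/prefix', e.g. '10.1.0.5/16'.
    """
    iface = ipaddress.ip_interface(local_if)
    return ipaddress.ip_address(peer_ip) in iface.network


def usable_interfaces(local_ifs, peer_ip, assume_routable=False):
    """Return the names of local interfaces considered usable for peer_ip.

    local_ifs maps interface name -> 'address/prefix'.
    assume_routable is a hypothetical switch standing in for the idea of
    "regardless of peer IP address, I can connect to them".
    """
    if assume_routable:
        # Trust IP routing: every local interface is a candidate.
        return list(local_ifs)
    # Strict matching: keep only interfaces on the peer's subnet.
    return [name for name, cidr in local_ifs.items()
            if same_subnet(cidr, peer_ip)]


# Made-up addresses: the peer sits on a different subnet of a routed network.
local = {"eth0": "10.1.0.5/16", "ib0": "192.168.0.5/24"}
peer = "10.2.0.7"

strict = usable_interfaces(local, peer)
relaxed = usable_interfaces(local, peer, assume_routable=True)
print("strict :", strict)
print("relaxed:", relaxed)
```

With strict matching, no interface qualifies for a peer on another subnet even though the messages would simply route, which is the situation described for the subnetted ethernet network above.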