Ralph, sorry for my poor understanding ...
i tried r31956 and it solved both issues:
- MPI_Abort does not hang any more if nodes are on different eth0 subnets
- MPI_Init does not hang any more if hosts have different numbers of IB ports

this likely explains why you are having trouble replicating it ;-)

Thanks a lot !

Gilles

On Fri, Jun 6, 2014 at 11:45 AM, Ralph Castain <r...@open-mpi.org> wrote:

> I keep explaining that we don't "discard" anything, but there really isn't
> any point to continuing trying to explain the system. With the announced
> intention of completing the move of the BTLs to OPAL, I no longer need the
> multi-module complexity in the OOB/TCP. So I have removed it and gone back
> to the single module that connects to everything.
>
> Try r31956 - hopefully will resolve your connectivity issues.
>
> Still looking at the MPI_Abort hang as I'm having trouble replicating it.
>
>
> On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> > Jeff,
> >
> > as Ralph pointed out, i do wish to use eth0 for oob messages.
> >
> > i work on a 4k+ nodes cluster with a very decent gigabit ethernet
> > network (reasonable oversubscription + switches
> > from a reputable vendor you are familiar with ;-) )
> > my experience is that IPoIB can be very slow at establishing a
> > connection, especially if the arp table is not populated
> > (as far as i understand, this involves the subnet manager and
> > performance can be very random especially if all nodes issue
> > arp requests at the same time)
> > on the other hand, performance is much more stable when using the
> > subnetted IP network.
> >
> > as Ralph also pointed out, i can imagine some architects neglect their
> > ethernet network (e.g. highly oversubscribed + low end switches)
> > and in this case ib0 is the best fit for oob messages.
> >
> >> As a simple solution, there could be a TCP oob MCA param that says
> >> "regardless of peer IP address, I can connect to them" (i.e., assume IP
> >> routing will make everything work out ok).
> >
> > +1 and/or an option to tell oob mca "do not discard the interface simply
> > because the peer IP is not in the same subnet"
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/06/05 23:01, Ralph Castain wrote:
> >> Because Gilles wants to avoid using IB for TCP messages, and using eth0
> >> also solves the problem (the messages just route)
> >>
> >> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> >> wrote:
> >>
> >>> Another random thought for Gilles' situation: why not
> >>> oob-TCP-if-include ib0? (And not eth0)
> >>>
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/06/14982.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14983.php
>
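
For anyone reading the archive later, the subnet question debated in this thread can be illustrated with a small sketch. This is not Open MPI code and does not reflect how the OOB/TCP component actually selects interfaces; the function names and the assume_routable flag are hypothetical, chosen only to mirror the suggested "regardless of peer IP address, I can connect to them" option.

```python
import ipaddress


def same_subnet(local_if: str, peer_ip: str) -> bool:
    """Return True if peer_ip lies inside the network of local_if.

    local_if is an address with prefix length, e.g. "10.1.0.5/24".
    """
    iface = ipaddress.ip_interface(local_if)
    return ipaddress.ip_address(peer_ip) in iface.network


def usable_interfaces(local_ifs, peer_ip, assume_routable=False):
    """Pick the local interfaces considered usable for reaching peer_ip.

    Hypothetical rule for illustration only:
    - assume_routable=False: keep only interfaces on the peer's subnet
    - assume_routable=True: keep every interface and let IP routing decide
    """
    if assume_routable:
        return list(local_ifs)
    return [i for i in local_ifs if same_subnet(i, peer_ip)]


if __name__ == "__main__":
    # Two nodes whose eth0 addresses sit on different subnets.
    local = ["10.1.0.5/24"]
    peer = "10.2.0.7"
    print(usable_interfaces(local, peer))
    print(usable_interfaces(local, peer, assume_routable=True))
```

With a strict same-subnet rule, a node whose eth0 is on another subnet ends up with no usable interface, which is the situation described above; the override simply trusts routing.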