Ralph,

sorry for my poor understanding ...

I tried r31956 and it solved both issues:
- MPI_Abort does not hang any more if nodes are on different eth0 subnets
- MPI_Init does not hang any more if hosts have different number of IB ports

this likely explains why you are having trouble replicating it ;-)

Thanks a lot !

Gilles


On Fri, Jun 6, 2014 at 11:45 AM, Ralph Castain <r...@open-mpi.org> wrote:

> I keep explaining that we don't "discard" anything, but there really isn't
> any point to continuing trying to explain the system. With the announced
> intention of completing the move of the BTLs to OPAL, I no longer need the
> multi-module complexity in the OOB/TCP. So I have removed it and gone back
> to the single module that connects to everything.
>
> Try r31956 - hopefully will resolve your connectivity issues.
>
> Still looking at the MPI_Abort hang as I'm having trouble replicating it.
>
>
> On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>
> > Jeff,
> >
> > as Ralph pointed out, I do want to use eth0 for oob messages.
> >
> > I work on a 4k+ node cluster with a very decent gigabit ethernet
> > network (reasonable oversubscription + switches
> > from a reputable vendor you are familiar with ;-) ).
> > My experience is that IPoIB can be very slow at establishing a
> > connection, especially if the ARP table is not populated
> > (as far as I understand, this involves the subnet manager, and
> > performance can be very erratic, especially if all nodes issue
> > ARP requests at the same time).
> > On the other hand, performance is much more stable when using the
> > subnetted IP network.
> >
> > As Ralph also pointed out, I can imagine some architects neglect their
> > ethernet network (e.g. highly oversubscribed + low-end switches),
> > and in that case ib0 is the best fit for oob messages.
> >
> >> As a simple solution, there could be a TCP oob MCA param that says
> >> "regardless of peer IP address, I can connect to them" (i.e., assume IP
> >> routing will make everything work out ok).
> > +1 and/or an option to tell the oob MCA "do not discard the interface simply
> > because the peer IP is not in the same subnet"
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/06/05 23:01, Ralph Castain wrote:
> >> Because Gilles wants to avoid using IB for TCP messages, and using eth0
> >> also solves the problem (the messages just route)
> >>
> >> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >>
> >>> Another random thought for Gilles' situation: why not
> >>> oob-TCP-if-include ib0?  (And not eth0)
> >>>
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/06/14982.php
>
>
