Kewl - thanks!

On Jun 5, 2014, at 9:28 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> 
wrote:

> Ralph,
> 
> sorry for my poor understanding ...
> 
> i tried r31956 and it solved both issues:
> - MPI_Abort no longer hangs if nodes are on different eth0 subnets
> - MPI_Init no longer hangs if hosts have a different number of IB ports
> 
> this likely explains why you are having trouble replicating it ;-)
> 
> Thanks a lot !
> 
> Gilles
> 
> 
> On Fri, Jun 6, 2014 at 11:45 AM, Ralph Castain <r...@open-mpi.org> wrote:
> I keep explaining that we don't "discard" anything, but there really isn't
> any point in continuing to try to explain the system. With the announced
> intention of completing the move of the BTLs to OPAL, I no longer need the 
> multi-module complexity in the OOB/TCP. So I have removed it and gone back to 
> the single module that connects to everything.
> 
> Try r31956 - hopefully will resolve your connectivity issues.
> 
> Still looking at the MPI_Abort hang as I'm having trouble replicating it.
> 
> 
> On Jun 5, 2014, at 7:16 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
> 
> > Jeff,
> >
> > as Ralph pointed out, i do want to use eth0 for oob messages.
> >
> > i work on a 4k+ node cluster with a very decent gigabit ethernet
> > network (reasonable oversubscription + switches
> > from a reputable vendor you are familiar with ;-) ).
> > my experience is that IPoIB can be very slow at establishing a
> > connection, especially if the arp table is not populated
> > (as far as i understand, this involves the subnet manager, and
> > performance can be very random, especially if all nodes issue
> > arp requests at the same time).
> > on the other hand, performance is much more stable when using the
> > subnetted IP network.
> >
> > as Ralph also pointed out, i can imagine some architects neglect their
> > ethernet network (e.g. highly oversubscribed + low end switches),
> > and in that case ib0 is the best fit for oob messages.
> >
> >> As a simple solution, there could be a TCP oob MCA param that says
> >> "regardless of peer IP address, I can connect to them" (i.e., assume IP 
> >> routing will make everything work out ok).
> > +1 and/or an option to tell oob mca "do not discard the interface simply
> > because the peer IP is not in the same subnet"
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/06/05 23:01, Ralph Castain wrote:
> >> Because Gilles wants to avoid using IB for TCP messages, and using eth0 
> >> also solves the problem (the messages just route)
> >>
> >> On Jun 5, 2014, at 5:00 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> >> wrote:
> >>
> >>> Another random thought for Gilles' situation: why not oob_tcp_if_include
> >>> ib0?  (And not eth0)
> >>>
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/06/14982.php
> 

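For readers following the subnet discussion quoted above, here is a rough sketch of the kind of decision being debated. It is not Open MPI's actual OOB/TCP code. The function names and the assume_routable switch are hypothetical, made up only to illustrate the suggested "regardless of peer IP address, I can connect to them" option.

```python
import ipaddress


def same_subnet(local_if: str, peer_ip: str) -> bool:
    """Return True if peer_ip lies inside the network of local_if.

    local_if is an address with prefix length, e.g. "10.1.0.5/24".
    """
    iface = ipaddress.ip_interface(local_if)
    return ipaddress.ip_address(peer_ip) in iface.network


def usable_interfaces(local_ifs, peer_ip, assume_routable=False):
    """Return the local interfaces considered able to reach peer_ip.

    assume_routable is a hypothetical switch modelling the option proposed
    in the thread: when set, no interface is skipped merely because the
    peer is on a different subnet (IP routing is assumed to work).
    """
    if assume_routable:
        return list(local_ifs)
    return [i for i in local_ifs if same_subnet(i, peer_ip)]


if __name__ == "__main__":
    ifs = ["10.1.0.5/24", "192.168.7.5/24"]
    # Peer on another eth0 subnet: a strict subnet match finds nothing.
    print(usable_interfaces(ifs, "10.2.0.9"))
    # With the hypothetical switch, every interface is kept.
    print(usable_interfaces(ifs, "10.2.0.9", assume_routable=True))
```

With a strict subnet match, a peer on a different eth0 subnet leaves no candidate interface, which is the situation the proposed option is meant to avoid.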