It turns out that the current OMPI behavior diverges from what is written in the README, where we already explicitly state:
- If specified, the "btl_tcp_if_exclude" parameter must include the
  loopback device ("lo" on many Linux platforms), or Open MPI will
  not be able to route MPI messages using the TCP BTL. For example:
  "mpirun --mca btl_tcp_if_exclude lo,eth1 ..."

So, with this patch we are now README compliant!

  George.

On Fri, Sep 23, 2016 at 7:03 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> George,
>
> OK then,
> I recommend we explicitly state in the README that the loopback interface
> can no longer be omitted from btl_tcp_if_exclude when running on multiple
> nodes.
>
> Cheers,
>
> Gilles
>
> On Thursday, September 22, 2016, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> Thanks for clarifying, I now understand what your objection/suggestion
>> was. We have all misconfigured OMPI at least once, but that allowed us
>> to learn how to do it right.
>>
>> Instead of adding extra protections for corner cases, maybe we should fix
>> our exclusivity flag so that the scenario you describe would not happen.
>>
>> George.
>>
>> PS: "btl_tcp_if_exclude = ^ib0" qualifies as an honest mistake. I
>> wouldn't dare propose a new MCA param to prevent this ...
>>
>> On Wed, Sep 21, 2016 at 10:54 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>
>>> ok, i was not clear
>>>
>>> by "let's consider the case where "lo" is *not* excluded via the
>>> btl_tcp_if_exclude MCA param" i really meant
>>> "let's consider the case where the value of the btl_tcp_if_exclude MCA
>>> param has been forced to a list of networks/interfaces that does not
>>> contain any reference (e.g. name or subnet) to the loopback
>>> interface"
>>> /* in a previous example, i did mpirun --mca btl_tcp_if_exclude ^ib0 */
>>>
>>> my concern is the case where openmpi-mca-params.conf contains
>>> btl_tcp_if_exclude = ^ib0
>>>
>>> then hiccups will start when Open MPI is updated, and i expect some
>>> complaints.
>>> of course we can reply that the doc should have been read and the advice
>>> followed, so one cannot complain just because he has been lucky so
>>> far.
>>> or we can do things a bit differently so we do not run into this case
>>>
>>> /* if btl/self is excluded, the app will not start and it is trivial
>>> to append to the error message a note asking to ensure btl/self was
>>> not excluded.
>>> in this case, i do not think we have a mechanism to issue a warning
>>> message (e.g. "ensure lo is excluded") when hiccups occur. */
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Thu, Sep 22, 2016 at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> > On Wednesday, September 21, 2016, Gilles Gouaillardet
>>> > <gilles.gouaillar...@gmail.com> wrote:
>>> >>
>>> >> George,
>>> >>
>>> >> let's consider the case where "lo" is *not* excluded via the
>>> >> btl_tcp_if_exclude MCA param
>>> >> (if i understand correctly, the following is also true if "lo" is
>>> >> included via the btl_tcp_if_include MCA param)
>>> >>
>>> >> currently, and because of/thanks to the test that is done "deep inside"
>>> >> 1) on a disconnected laptop, mpirun --mca btl tcp,self ... fails with
>>> >> 2 tasks or more because tasks cannot reach each other
>>> >> 2) on a (connected) cluster, "lo" is never used and mpirun --mca btl
>>> >> tcp,self ... does not hang when tasks are running on two nodes or more
>>> >>
>>> >> with your proposal :
>>> >> 3) on a disconnected laptop, mpirun --mca btl tcp,self ... works with
>>> >> any number of tasks, because "lo" is used by btl/tcp
>>> >> 4) on a (connected) cluster, "lo" is used and mpirun --mca btl
>>> >> tcp,self ... will very likely hang when tasks are running on two nodes
>>> >> or more
>>> >>
>>> >> am i right so far ?
>>> >
>>> >
>>> > No, you are missing the fact that thanks to our if_exclude (which
>>> > contains 127.0.0.0/8 by default) we will never use lo (not even with
>>> > my patch). Thus, local interfaces will remain out of reach for most
>>> > users, with the exception of those who manually force the inclusion
>>> > of lo via if_include.
>>> >
>>> > On a cluster where a user explicitly enables lo, there will be some
>>> > hiccups during startup. However, as Paul states, we explicitly
>>> > discourage people from doing that in the README. Second, the
>>> > connection over lo will eventually time out, lo will be dropped, and
>>> > all pending communications will be redirected through another TCP
>>> > interface.
>>> >
>>> > Cheers,
>>> > George.
>>> >
>>> >
>>> >>
>>> >> my concern is 4)
>>> >> as Paul pointed out, we can consider this is not an issue since this
>>> >> is a user/admin mistake, and we do not care whether this is an honest
>>> >> one or not. that being said, this is not very friendly since something
>>> >> that is working fine today will (likely) start hanging when your patch
>>> >> is merged.
>>> >>
>>> >> my suggestion differs since it is basically 2) and 3), which can be
>>> >> seen as the best of both worlds
>>> >>
>>> >> makes sense ?
>>> >>
>>> >> as a side note, there were some discussions about automatically adding
>>> >> the self btl,
>>> >> and even offering a user friendly alternative to --mca btl xxx
>>> >> (for example --networks shm,infiniband. today Open MPI does not
>>> >> provide any alternative to btl/self. also infiniband can be used via
>>> >> btl/openib, mtl/mxm or libfabric, which makes it painful to
>>> >> blacklist). i cannot remember the outcome of the discussion (if any).
>>> >>
>>> >> Cheers,
>>> >>
>>> >> Gilles
>>> >>
>>> >> On Thu, Sep 22, 2016 at 4:57 AM, George Bosilca <bosi...@icl.utk.edu>
>>> >> wrote:
>>> >> > Gilles,
>>> >> >
>>> >> > I don't understand how your proposal is any different from what we
>>> >> > have today. I quote: "If [locality flag is set], then we could keep
>>> >> > a hard coded test so 127.x.y.z addresses (and IPv6 equivalent) are
>>> >> > never used (even if included or not excluded) for inter node
>>> >> > communication". We already have a hardcoded test to prevent
>>> >> > 127.x.y.z addresses from being used. In fact we have 2 tests: one
>>> >> > because this address range is part of our default if_exclude, and
>>> >> > a second test (that only does something useful in case you
>>> >> > manually added lo* to if_include) deep inside the IP matching
>>> >> > logic.
>>> >> >
>>> >> > George.
>>> >> >
>>> >> >
>>> >> > On Wed, Sep 21, 2016 at 12:36 PM, Gilles Gouaillardet
>>> >> > <gilles.gouaillar...@gmail.com> wrote:
>>> >> >>
>>> >> >> George,
>>> >> >>
>>> >> >> i got that, and i consider my suggestion as an improvement to your
>>> >> >> proposal.
>>> >> >>
>>> >> >> if i want to exclude ib0, i might want to
>>> >> >> mpirun --mca btl_tcp_if_exclude ib0 ...
>>> >> >>
>>> >> >> to me, this is an honest mistake, but with your proposal, i would be
>>> >> >> screwed when
>>> >> >> running on more than one node because i should have
>>> >> >> mpirun --mca btl_tcp_if_exclude ib0,lo ...
>>> >> >>
>>> >> >> and if this parameter is set by the admin in the system-wide config,
>>> >> >> then this configuration must be adapted by the admin, and that could
>>> >> >> generate some confusion.
>>> >> >>
>>> >> >> my suggestion simply adds a "safety net" to your proposal
>>> >> >>
>>> >> >> for the sake of completeness, i do not really care whether there
>>> >> >> should be a safety net or not if localhost is explicitly included
>>> >> >> via the btl_tcp_if_include MCA parameter
>>> >> >>
>>> >> >> a different and safe/friendly proposal is to add a new
>>> >> >> btl_tcp_if_exclude_localhost MCA param, which is true by default, so
>>> >> >> you would simply force it to false if you want to MPI_Comm_spawn or
>>> >> >> use the tcp btl on your disconnected laptop.
>>> >> >>
>>> >> >> as a side note, this reminds me that the openib/btl is used by
>>> >> >> default for intra node communication between two tasks from
>>> >> >> different jobs (neither sm nor vader can be used yet, and
>>> >> >> btl/openib has a higher exclusivity than btl/tcp). my first
>>> >> >> impression is that i am not so comfortable with that, and we could
>>> >> >> add yet another MCA parameter so btl/openib disqualifies itself for
>>> >> >> intra node communications.
>>> >> >>
>>> >> >>
>>> >> >> Cheers,
>>> >> >>
>>> >> >> Gilles
>>> >> >>
>>> >> >> On Thu, Sep 22, 2016 at 12:56 AM, George Bosilca <bosi...@icl.utk.edu>
>>> >> >> wrote:
>>> >> >> > My proposal is not about adding new ways of deciding what is
>>> >> >> > local and what is not. I proposed to use the corresponding MCA
>>> >> >> > parameters to allow the user to decide. More specifically, I
>>> >> >> > want to be able to change the exclude and include MCA
>>> >> >> > parameters to enable TCP over local addresses.
>>> >> >> >
>>> >> >> > George
>>> >> >> >
>>> >> >> >
>>> >> >> > On Sep 21, 2016 4:32 PM, "Gilles Gouaillardet"
>>> >> >> > <gilles.gouaillar...@gmail.com> wrote:
>>> >> >> >>
>>> >> >> >> George,
>>> >> >> >>
>>> >> >> >> Is proc locality already set at that time ?
>>> >> >> >>
>>> >> >> >> If yes, then we could keep a hard coded test so 127.x.y.z
>>> >> >> >> addresses (and IPv6 equivalent) are never used (even if
>>> >> >> >> included or not excluded) for inter node communication
>>> >> >> >>
>>> >> >> >> Cheers,
>>> >> >> >>
>>> >> >> >> Gilles
>>> >> >> >>
>>> >> >> >> "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>> >> >> >> >On Sep 21, 2016, at 10:56 AM, George Bosilca <bosi...@icl.utk.edu>
>>> >> >> >> > wrote:
>>> >> >> >> >>
>>> >> >> >> >> No, because 127.x.x.x is by default part of the exclude, so
>>> >> >> >> >> it will never get into the modex. The problem today is that
>>> >> >> >> >> even if you manually remove it from the exclude and add it
>>> >> >> >> >> to the include, it will not work, because of the hardcoded
>>> >> >> >> >> checks. Once we remove those checks, things will work the
>>> >> >> >> >> way we expect: interfaces are removed because they don't
>>> >> >> >> >> match the provided addresses.
>>> >> >> >> >
>>> >> >> >> >Gotcha.
>>> >> >> >> >
>>> >> >> >> >> I would have agreed with you if the current code was making
>>> >> >> >> >> a better decision of what is local and what is not. But it
>>> >> >> >> >> is not, it simply removes all 127.x.x.x interfaces
>>> >> >> >> >> (opal/util/net.c:222). Thus, the only thing the current
>>> >> >> >> >> code does is prevent a power-user from using the loopback
>>> >> >> >> >> (despite it being explicitly enabled via the corresponding
>>> >> >> >> >> MCA parameters).
>>> >> >> >> >
>>> >> >> >> >Fair enough.
>>> >> >> >> >
>>> >> >> >> >Should we have a keyword that can be used in the
>>> >> >> >> > btl_tcp_if_include/exclude (e.g., "local") that removes all
>>> >> >> >> > local-only interfaces? I.e., all 127.x.x.x/8 interfaces *and*
>>> >> >> >> > all local-only interfaces (e.g., bridging interfaces to local
>>> >> >> >> > VMs and the like)?
>>> >> >> >> >
>>> >> >> >> >We could then replace the default "127.0.0.0/8" value in
>>> >> >> >> > btl_tcp_if_exclude with this token, and therefore actually
>>> >> >> >> > exclude the VM-only interfaces (which have caused some users
>>> >> >> >> > problems in the past).
>>> >> >> >> >
>>> >> >> >> >--
>>> >> >> >> >Jeff Squyres
>>> >> >> >> >jsquy...@cisco.com
>>> >> >> >> >For corporate legal information go to:
>>> >> >> >> > http://www.cisco.com/web/about/doing_business/legal/cri/
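For admins who want to audit an existing openmpi-mca-params.conf against the README rule quoted in this thread, here is a minimal sketch of such a check. It is not Open MPI code and does not reproduce Open MPI's own parsing: the helper name excludes_loopback and the loopback_names default are hypothetical, and the sketch assumes the value is a comma-separated list of interface names and CIDR subnets, with an optional leading "^" as in the "^ib0" example above.

```python
import ipaddress

# Loopback addresses used to test whether a subnet token covers loopback.
LOOPBACK_ADDRS = (
    ipaddress.ip_address("127.0.0.1"),
    ipaddress.ip_address("::1"),
)


def excludes_loopback(if_exclude, loopback_names=("lo",)):
    """Return True if a btl_tcp_if_exclude-style value mentions loopback.

    Hypothetical helper: a token counts if it is a known loopback
    interface name or a subnet that contains 127.0.0.1 or ::1.
    """
    for token in if_exclude.split(","):
        token = token.strip().lstrip("^")
        if not token:
            continue
        if token in loopback_names:
            return True
        try:
            # strict=False accepts values with host bits set, e.g. 127.0.0.1/8
            net = ipaddress.ip_network(token, strict=False)
        except ValueError:
            continue  # not a subnet, so an interface name such as "eth1"
        if any(addr in net for addr in LOOPBACK_ADDRS):
            return True
    return False


if __name__ == "__main__":
    for value in ("lo,eth1", "ib0", "^ib0", "127.0.0.0/8"):
        print(value, "->", excludes_loopback(value))
```

With this sketch, "lo,eth1" and "127.0.0.0/8" pass the check, while "ib0" and "^ib0" are flagged as missing loopback, which is exactly the honest-mistake case discussed above.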
_______________________________________________ devel mailing list devel@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/devel