I vote for Ralph's proposal.

2015-09-03 10:05 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
> As we discussed on the phone, I prefer the bullet #3 approach - ask Red Hat to build/distribute 1.10.0 without PSM2 support, and let Intel provide a PSM2-enabled variant via their current proprietary distribution channel until they can provide a “clean” solution to the community.
>
> If that hasn’t happened prior to a 1.10.1 release, we can then remove PSM2 at that time. I’m hoping the solution will appear prior to that point :-)
>
>
> > On Sep 3, 2015, at 8:46 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >
> > Ralph and I just chatted about this on the phone. I think I understand his position better now.
> >
> > Just to be clear/put some context in this conversation:
> >
> > 1. PSM (aka "PSM1") supports TrueScale Intel networks
> > 2. PSM2 supports OmniScale Intel networks
> >
> > ------
> >
> > The following three solutions are more-or-less equivalent:
> >
> > a. add "mtl=^psm2" in the mca-params.conf file (George's proposal)
> > b. configure --without-psm2 (similar to George's proposal)
> > c. we release 1.10.1 with no PSM2 MTL (Ralph's proposal)
> >
> > In all 3 cases, the OmniScale end user will not have support for their network (and will likely fall back to TCP?). TrueScale users are unaffected.
> >
> > Technically, there's a 4th solution (proposed by Red Hat): the distro provides 2 different Open MPI installations -- one for (everything+PSM1), another for (everything+PSM2). I agree that this is (very) undesirable. In this case, *all* users are penalized -- not just TrueScale/OmniScale users -- because all users will now wonder "Which Open MPI should I use?" (even if they're not TS/OS users, and it doesn't matter which one they use, they still have to expend unnecessary mental energy trying to understand why there are two, and which they should use). Meh.
> >
> > Hence, we're back to the three possible "more-or-less equivalent" solutions: a, b, or c.
> > I say "more-or-less" because there *is* a semantic difference between a/b and c:
> >
> > 1. For a/b: packagers are responsible for the solution, and also responsible for *documenting* the solution (so that OmniScale users can figure out why they are getting lousy performance).
> > 2. For c: Open MPI is responsible for the solution; we'll likely note in NEWS that PSM2 support was removed.
> >
> > Hence, for the "let's release 1.10.1 without PSM2" solution, users have a (potentially) easier way of figuring out why they're not getting good performance.
> >
> > That being said:
> >
> > 1. I'm not 100% convinced that users will go to the NEWS file to figure out why they're not getting good performance. True, it's our officially-sanctioned method for publishing information to users, but I don't think that it's the first place that comes to mind when you're diagnosing a performance problem.
> >
> > 2. It seems like we have handled this kind of situation differently in the past.
> >
> > 2a. E.g., when we had the hcol/ml conflict, we asked Mellanox for a solution. They promised to release a new libhcol that fixed the problem, and in the meantime, told their customers to get Mellanox Open MPI from mellanox.com, which immediately fixed the problem.
> >
> > 2b. Similarly, Cisco distributed its own Cisco Open MPI when we wanted to have libfabric support in the Open MPI v1.8.x series.
> >
> > 2c. This case is not entirely the same as the above two examples, but I think it's similar in spirit: a distro is trying to be all-inclusive with other freely-distributed software in that distro (i.e., both PSM1 and PSM2), and a vendor-specific issue is causing a problem with that plan.
> >
> > 3. I therefore think we should take the same approach that we have taken with other vendors in the past:
> >
> > 3a. Red Hat (and other packagers) can do whatever they need to do to package Open MPI 1.10.0.
> > In this case, Red Hat is asking our advice as to how to package it (because they include both PSM1 and PSM2 support in their distro, and this creates a conflict in Open MPI).
> >
> > ==> My $0.02: we should tell Red Hat to build --without-psm2, because then users can see that "ompi_info | grep psm2" will be empty. That's a dead giveaway that that Open MPI installation has no PSM2 support.
> >
> > 3b. Intel can support its customers by having an "Intel Open MPI" distribution (or whatever they want to name it, just as long as it is not named plain/vanilla "Open MPI") that is configured/built to support both PSM1/PSM2 via their normal software distribution mechanism.
> >
> > 3c. If there's some solution Intel would like to push upstream to the Open MPI community, great -- it can go through the normal review process and be accepted upstream (i.e., just like we work every day). That solution can then be included in future releases.
> >
> > How does that sound?
> >
> >
> >> On Sep 3, 2015, at 10:48 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> >>
> >> Ralph,
> >>
> >> if I correctly read between the lines of your second point, omnipath (PSM2) is working out of the box. I am not sure this is the case, and/or my extrapolation might be incorrect.
> >>
> >> if I understood correctly, psm2 is a new feature.
> >> from a distro point of view, that could be a new package (known not to support PSM), or a mpirun-psm2 wrapper, or a release note (e.g. use --mca mtl ^psm or a psm2 param file)
> >>
> >> I still do not get how removing PSM2 makes things better
> >> (and the same result can be achieved by configuring with --without-psm2)
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On Thursday, September 3, 2015, Ralph Castain <r...@open-mpi.org> wrote:
> >> I guess I didn’t make it clear in my prior comment, so let me try again.
> >> I understand about dlopen and the fix that George proposed - we had internally discussed this as well. However, the questions that raises are:
> >>
> >> 1. how does the distro (Michal) decide which PSM module to disable by default in their package?
> >>
> >> 2. how does the user “discover” that their fabric has automatically been disabled, especially since this has never been the case before?
> >>
> >> I’ll raise the procedural question at our next telecon. I certainly take no pleasure out of generating releases, so if we have a better solution, I’m all for it!
> >>
> >>
> >>> On Sep 3, 2015, at 5:55 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >>>
> >>> I agree with what George says.
> >>>
> >>> AFAIK, Red Hat builds Open MPI with support for dlopen, so the config file option is probably suitable.
> >>>
> >>> However, I have to admit that I resent the fact that PSM's poor upgrade path design is forcing both the Open MPI and libfabric communities to have similar confusing conversations (e.g., see https://github.com/ofiwg/libfabric/issues/1258#issuecomment-137426271).
> >>>
> >>> Specifically: because of the design of PSM1/PSM2, both Open MPI and libfabric will have to adjust their configury and use dlopen/function pointer indirection to "solve" the problem of supporting both PSM1 and PSM2.
> >>>
> >>> Does that seem weird to anyone else?
> >>>
> >>> IMNSHO, if you have to have extremely confusing conversations in multiple software communities explaining your configury, function-pointer-indirection code (i.e., PR https://github.com/ofiwg/libfabric/pull/1259), compilation, and linking scheme to upgrade to a new library, you're doing it wrong.
> >>>
> >>>
> >>>> On Sep 3, 2015, at 7:19 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>>>
> >>>> Hi Michal,
> >>>>
> >>>> I might have missed some context when proposing this solution.
> >>>> As Gilles suggested, if you build Open MPI without support for dlopen (configure option --disable-dlopen), this simple solution will not work because the symbol conflict issue is generated deep inside the constructors of the 2 libraries.
> >>>>
> >>>> Yes, the "mtl = ^psm" (or ^psm2 depending on which one you want to disable) should go in the openmpi-mca-params.conf that gets installed in the $(sysconfdir).
> >>>>
> >>>> Thanks,
> >>>> George.
> >>>>
> >>>>
> >>>> On Thu, Sep 3, 2015 at 5:14 AM, Michal Schmidt <mschm...@redhat.com> wrote:
> >>>> [I apologize for not threading the email properly. I was not subscribed before and found the conversation in the web archive.]
> >>>>
> >>>> Hello,
> >>>>
> >>>> I am the one who discovered the PSM vs. PSM2 library conflict and proposed the temporary workaround of having two builds of the openmpi package.
> >>>>
> >>>> George Bosilca wrote:
> >>>>> 3. Except if the distro builds OMPI statically, I see no reason to have 2 builds of OMPI due to conflicting symbols between two shared libraries that OMPI MCA loads willingly. Why is a simple "mtl = ^psm" in the OMPI system-wide configuration file not enough to solve the issue?
> >>>>
> >>>> Thank you for this suggestion. It would go into openmpi-mca-params.conf, right? I will try it.
> >>>>
> >>>> Regards,
> >>>> Michal
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/17927.php
> >>>
> >>> --
> >>> Jeff Squyres
> >>> jsquy...@cisco.com
> >>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
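
For reference, option (a) from the thread could look roughly like the fragment below in the system-wide parameter file. This is only a sketch: the file name and the "mtl = ^psm2" setting are taken from the messages above, the comments are illustrative, and it assumes an Open MPI build with dlopen support (per George's caveat about --disable-dlopen).

```
# openmpi-mca-params.conf (system-wide MCA parameter file)
#
# Exclude the PSM2 MTL so that it is not selected alongside PSM1.
# To keep PSM2 and drop PSM1 instead, use "^psm" here.
mtl = ^psm2
```

The same exclusion can be tried for a single run with "--mca mtl ^psm2" on the mpirun command line, as Gilles mentions, before committing it to the file. Option (b) is the build-time equivalent (configure --without-psm2), after which "ompi_info | grep psm2" should print nothing.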