As we discussed on the phone, I prefer the bullet #3 approach - ask Red Hat to build/distribute 1.10.0 without PSM2 support, and let Intel provide a PSM2-enabled variant via their current proprietary distribution channel until they can provide a “clean” solution to the community.
If that hasn’t happened prior to a 1.10.1 release, we can then remove PSM2 at that time. I’m hoping the solution will appear prior to that point :-)

> On Sep 3, 2015, at 8:46 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> Ralph and I just chatted about this on the phone. I think I understand his position better now.
>
> Just to be clear/put some context in this conversation:
>
> 1. PSM (aka "PSM1") supports TrueScale Intel networks
> 2. PSM2 supports OmniScale Intel networks
>
> ------
>
> The following three solutions are more-or-less equivalent:
>
> a. add "mtl=^psm2" in the mca-params.conf file (George's proposal)
> b. configure --without-psm2 (similar to George's proposal)
> c. we release 1.10.1 with no PSM2 MTL (Ralph's proposal)
>
> In all 3 cases, the OmniScale end user will not have support for their network (and will likely fall back to TCP?). TrueScale users are unaffected.
>
> Technically, there's a 4th solution (proposed by Red Hat): the distro provides 2 different Open MPI installations -- one for (everything+PSM1), another for (everything+PSM2). I agree that this is (very) undesirable. In this case, *all* users are penalized -- not just TrueScale/OmniScale users -- because all users will now wonder "Which Open MPI should I use?" (even if they're not TS/OS users, and it doesn't matter which one they use, they still have to expend unnecessary mental energy trying to understand why there are two, and which they should use). Meh.
>
> Hence, we're back to the three possible "more-or-less equivalent" solutions: a, b, or c. I say "more-or-less" because there *is* a semantic difference between a/b and c:
>
> 1. For a/b: packagers are responsible for the solution, and also responsible for *documenting* the solution (so that OmniScale users can figure out why they are getting lousy performance).
> 2. For c: Open MPI is responsible for the solution; we'll likely note in NEWS that PSM2 support was removed.
>
> Hence, for the "let's release 1.10.1 without PSM2" solution, users have a (potentially) easier way of figuring out why they're not getting good performance.
>
> That being said:
>
> 1. I'm not 100% convinced that users will go to the NEWS file to figure out why they're not getting good performance. True, it's our officially-sanctioned method for publishing information to users, but I don't think that it's the first place that comes to mind when you're diagnosing a performance problem.
>
> 2. It seems like we have handled this kind of situation differently in the past.
>
> 2a. E.g., when we had the hcol/ml conflict, we asked Mellanox for a solution. They promised to release a new libhcol that fixed the problem, and in the meantime, told their customers to get Mellanox Open MPI from mellanox.com that immediately fixed the problem.
>
> 2b. Similarly, Cisco distributed its own Cisco Open MPI when we wanted to have libfabric support in the Open MPI v1.8.x series.
>
> 2c. This case is not entirely the same as the above two examples, but I think it's similar in spirit: a distro is trying to be all-inclusive with other freely-distributed software in that distro (i.e., both PSM1 and PSM2), and a vendor-specific issue is causing a problem with that plan.
>
> 3. I therefore think we should take the same approach that we have taken with other vendors in the past:
>
> 3a. Red Hat (and other packagers) can do whatever they need to do to package Open MPI 1.10.0. In this case, Red Hat is asking our advice as to how to package it (because they include both PSM1 and PSM2 support in their distro, and this creates a conflict in Open MPI).
>
> ==> My $0.02: we should tell Red Hat to build --without-psm2, because then users can see that "ompi_info | grep psm2" will be empty. That's a dead giveaway that that Open MPI installation has no PSM2 support.
>
> 3b. Intel can support its customers by having an "Intel Open MPI" distribution (or whatever they want to name it, just as long as it is not named plain/vanilla "Open MPI") that is configured/built to support both PSM1/PSM2 via their normal software distribution mechanism.
>
> 3c. If there's some solution Intel would like to push upstream to the Open MPI community, great -- it can go through the normal review process and be accepted upstream (i.e., just like we work every day). That solution can then be included in future releases.
>
> How does that sound?
>
>
>> On Sep 3, 2015, at 10:48 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>
>> Ralph,
>>
>> if I correctly read between the lines of your second point, omnipath (PSM2) is working out of the box. I am not sure this is the case, and/or my extrapolation might be incorrect.
>>
>> if I understood correctly, psm2 is a new feature.
>> from a distro point of view, that could be a new package (known not to support PSM), or a mpirun-psm2 wrapper, or a release note (e.g. use --mca mtl ^psm or a psm2 param file)
>>
>> I still do not get how removing PSM2 makes things better
>> (and the same result can be achieved by configuring with --without-psm2)
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thursday, September 3, 2015, Ralph Castain <r...@open-mpi.org> wrote:
>> I guess I didn’t make it clear in my prior comment, so let me try again. I understand about dlopen and the fix that George proposed - we had internally discussed this as well. However, the questions that raises are:
>>
>> 1. how does the distro (Michal) decide which PSM module to disable by default in their package?
>>
>> 2. how does the user “discover” that their fabric has automatically been disabled, especially since this has never been the case before?
>>
>> I’ll raise the procedural question at our next telecon. I certainly take no pleasure out of generating releases, so if we have a better solution, I’m all for it!
>>
>>
>>> On Sep 3, 2015, at 5:55 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>
>>> I agree with what George says.
>>>
>>> AFAIK, Red Hat builds Open MPI with support for dlopen, so the config file option is probably suitable.
>>>
>>> However, I have to admit that I resent the fact that PSM's poor upgrade path design is forcing both the Open MPI and libfabric communities to have similar confusing conversations (e.g., see https://github.com/ofiwg/libfabric/issues/1258#issuecomment-137426271).
>>>
>>> Specifically: because of the design of PSM1/PSM2, both Open MPI and libfabric will have to adjust their configury and use dlopen/function pointer indirection to "solve" the problem of supporting both PSM1 and PSM2.
>>>
>>> Does that seem weird to anyone else?
>>>
>>> IMNSHO, if you have to have extremely confusing conversations in multiple software communities explaining your configury, function-pointer-indirection code (i.e., PR https://github.com/ofiwg/libfabric/pull/1259), compilation, and linking scheme to upgrade to a new library, you're doing it wrong.
>>>
>>>
>>>> On Sep 3, 2015, at 7:19 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>
>>>> Hi Michal,
>>>>
>>>> I might have missed some context when proposing this solution. As Gilles suggested if you build Open MPI without support for dlopen (configure option --disable-dlopen) this simple solution will not work because the symbol conflict issue is generated deep inside the constructors of the 2 libraries.
>>>>
>>>> Yes, the "mtl = ^psm" (or ^psm2 depending on which one you want to disable) should go in the openmpi-mca-params.conf that gets installed in the $(sysconfdir).
>>>>
>>>> Thanks,
>>>> George.
>>>>
>>>>
>>>> On Thu, Sep 3, 2015 at 5:14 AM, Michal Schmidt <mschm...@redhat.com> wrote:
>>>> [I apologize for not threading the email properly. I was not subscribed before and found the conversation in the web archive.]
>>>>
>>>> Hello,
>>>>
>>>> I am the one who discovered the PSM vs. PSM2 library conflict and proposed the temporary workaround of having two builds of the openmpi package.
>>>>
>>>> George Bosilca wrote:
>>>>> 3. Except if the distro builds OMPI statically, I see no reason to have 2 builds of OMPI due to conflicting symbols between two shared libraries that OMPI MCA loads willingly. Why is a simple "mtl = ^psm" in the OMPI system wide configuration file not enough to solve the issue?
>>>>
>>>> Thank you for this suggestion. It would go into openmpi-mca-params.conf, right? I will try it.
>>>>
>>>> Regards,
>>>> Michal
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/17927.php
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/17928.php
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/17931.php
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/17933.php
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/17937.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/17939.php