I vote for Ralph's proposal.

2015-09-03 10:05 GMT-06:00 Ralph Castain <r...@open-mpi.org>:

> As we discussed on the phone, I prefer the bullet #3 approach - ask Red Hat
> to build/distribute 1.10.0 without PSM2 support, and let Intel provide a
> PSM2-enabled variant via their current proprietary distribution channel
> until they can provide a “clean” solution to the community.
>
> If that hasn’t happened prior to a 1.10.1 release, we can then remove PSM2
> at that time. I’m hoping the solution will appear prior to that point :-)
>
>
> > On Sep 3, 2015, at 8:46 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> wrote:
> >
> > Ralph and I just chatted about this on the phone.  I think I understand
> his position better now.
> >
> > Just to be clear/put some context in this conversation:
> >
> > 1. PSM (aka "PSM1") supports TrueScale Intel networks
> > 2. PSM2 supports OmniScale Intel networks
> >
> > ------
> >
> > The following three solutions are more-or-less equivalent:
> >
> > a. add "mtl=^psm2" in the mca-params.conf file (George's proposal)
> > b. configure --without-psm2 (similar to George's proposal)
> > c. we release 1.10.1 with no PSM2 MTL (Ralph's proposal)
> >
> > In all 3 cases, the OmniScale end user will not have support for their
> network (and will likely fall back to TCP?).  TrueScale users are
> unaffected.
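
For reference, a minimal sketch of what options a and b look like in practice. The temporary directory is only a hypothetical stand-in for the package's real configuration directory, and the configure line is left as a comment because it needs an Open MPI source tree.

```shell
#!/bin/sh
# Hypothetical staging directory standing in for the installed config dir.
conf_dir=$(mktemp -d)
conf_file="$conf_dir/openmpi-mca-params.conf"

# Option a: exclude the psm2 MTL by default; all other MTLs stay eligible.
printf 'mtl = ^psm2\n' > "$conf_file"
cat "$conf_file"

# Option b (build time, not run here because it needs a source tree):
#   ./configure --without-psm2
#
# Remove "$conf_dir" by hand once you are done experimenting.
```

With option a the psm2 component is still built and installed, so a site could re-enable it by editing the file; with option b it is never built.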
> >
> > Technically, there's a 4th solution (proposed by Red Hat): the distro
> provides 2 different Open MPI installations -- one for (everything+PSM1),
> another for (everything+PSM2).  I agree that this is (very) undesirable.
> In this case, *all* users are penalized -- not just TrueScale/OmniScale
> users -- because all users will now wonder "Which Open MPI should I use?"
> (even if they're not TS/OS users, and it doesn't matter which one they use,
> they still have to expend unnecessary mental energy trying to understand
> why there are two, and which they should use).  Meh.
> >
> > Hence, we're back to the three possible "more-or-less equivalent"
> solutions: a, b, or c.  I say "more-or-less" because there *is* a semantic
> difference between a/b and c:
> >
> > 1. For a/b: packagers are responsible for the solution, and also
> responsible for *documenting* the solution (so that OmniScale users can
> figure out why they are getting lousy performance).
> > 2. For c: Open MPI is responsible for the solution; we'll likely note in
> NEWS that PSM2 support was removed.
> >
> > Hence, for the "let's release 1.10.1 without PSM2" solution, users have
> a (potentially) easier way of figuring out why they're not getting good
> performance.
> >
> > That being said:
> >
> > 1. I'm not 100% convinced that users will go to the NEWS file to figure
> out why they're not getting good performance.  True, it's our
> officially-sanctioned method for publishing information to users, but I
> don't think that it's the first place that comes to mind when you're
> diagnosing a performance problem.
> >
> > 2. It seems like we have handled this kind of situation differently in
> the past.
> >
> > 2a. E.g., when we had the hcol/ml conflict, we asked Mellanox for a
> solution.  They promised to release a new libhcol that fixed the problem,
> and in the meantime, told their customers to get Mellanox Open MPI from
> mellanox.com that immediately fixed the problem.
> >
> > 2b. Similarly, Cisco distributed its own Cisco Open MPI when we wanted
> to have libfabric support in the Open MPI v1.8.x series.
> >
> > 2c. This case is not entirely the same as the above two examples, but I
> think it's similar in spirit: a distro is trying to be all-inclusive with
> other freely-distributed software in that distro (i.e., both PSM1 and
> PSM2), and a vendor-specific issue is causing a problem with that plan.
> >
> > 3. I therefore think we should take the same approach that we have taken
> with other vendors in the past:
> >
> > 3a. Red Hat (and other packagers) can do whatever they need to do to
> package Open MPI 1.10.0.  In this case, Red Hat is asking our advice as to
> how to package it (because they include both PSM1 and PSM2 support in their
> distro, and this creates a conflict in Open MPI).
> >
> > ==> My $0.02: we should tell Red Hat to build --without-psm2, because
> then users can see that "ompi_info | grep psm2" will be empty.  That's a
> dead giveaway that that Open MPI installation has no PSM2 support.
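
The check above can be wrapped in a small helper. This is only a sketch: `has_psm2` is a hypothetical name, and it assumes that any line mentioning `psm2` in the `ompi_info` output means the component is present.

```shell
#!/bin/sh
# Reads `ompi_info` output on stdin; succeeds if a psm2 component is listed.
has_psm2() {
    grep -q 'psm2'
}

# Run the check only when an Open MPI installation is actually on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_psm2; then
        echo "PSM2 support present"
    else
        echo "no PSM2 support in this Open MPI installation"
    fi
fi
```

Note that a plain `grep psm` would also match `psm2` lines, so the pattern has to be `psm2` to tell the two apart.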
> >
> > 3b. Intel can support its customers by having an "Intel Open MPI"
> distribution (or whatever they want to name it, just as long as it is not
> named plain/vanilla "Open MPI") that is configured/built to support both
> PSM1/PSM2 via their normal software distribution mechanism.
> >
> > 3c. If there's some solution Intel would like to push upstream to the
> Open MPI community, great -- it can go through the normal review process
> and be accepted upstream (i.e., just like we work every day).  That
> solution can then be included in future releases.
> >
> > How does that sound?
> >
> >
> >
> >
> >> On Sep 3, 2015, at 10:48 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >>
> >> Ralph,
> >>
> >> if I correctly read between the lines of your second point, omnipath
> (PSM2) is working out of the box. I am not sure this is the case, and/or my
> extrapolation might be incorrect.
> >>
> >> if I understood correctly, psm2 is a new feature.
> >> from a distro point of view, that could be a new package (known not to
> support PSM), or a mpirun-psm2 wrapper, or a release note (e.g. use --mca
> mtl ^psm or a psm2 param file)
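
A rough sketch of the wrapper idea, assuming the wrapper only needs to prepend the exclusion to the command line. The function name `mpirun_psm2` and the `MPIRUN` override variable are hypothetical; they are not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical wrapper: exclude the psm MTL for this run only and pass all
# other arguments through unchanged. MPIRUN lets a packager (or a test)
# substitute a different launcher path.
mpirun_psm2() {
    "${MPIRUN:-mpirun}" --mca mtl '^psm' "$@"
}

# Example (requires a real Open MPI installation):
#   mpirun_psm2 -np 4 ./my_app
```

The single quotes around `^psm` keep the caret literal in shells that treat it specially.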
> >>
> >> I still do not get how removing PSM2 makes things better
> >> (and the same result can be achieved by configuring with --without-psm2)
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On Thursday, September 3, 2015, Ralph Castain <r...@open-mpi.org> wrote:
> >> I guess I didn’t make it clear in my prior comment, so let me try
> again. I understand about dlopen and the fix that George proposed - we had
> internally discussed this as well. However, the questions that raises are:
> >>
> >> 1. how does the distro (Michal) decide which PSM module to disable by
> default in their package?
> >>
> >> 2. how does the user “discover” that their fabric has automatically
> been disabled, especially since this has never been the case before?
> >>
> >> I’ll raise the procedural question at our next telecon. I certainly
> take no pleasure out of generating releases, so if we have a better
> solution, I’m all for it!
> >>
> >>
> >>> On Sep 3, 2015, at 5:55 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> >>>
> >>> I agree with what George says.
> >>>
> >>> AFAIK, Red Hat builds Open MPI with support for dlopen, so the config file
> option is probably suitable.
> >>>
> >>> However, I have to admit that I resent the fact that PSM's poor
> upgrade path design is forcing both the Open MPI and libfabric communities
> to have similar confusing conversations (e.g., see
> https://github.com/ofiwg/libfabric/issues/1258#issuecomment-137426271).
> >>>
> >>> Specifically: because of the design of PSM1/PSM2, both Open MPI and
> libfabric will have to adjust their configury and use dlopen/function
> pointer indirection to "solve" the problem of supporting both PSM1 and PSM2.
> >>>
> >>> Does that seem weird to anyone else?
> >>>
> >>> IMNSHO, if you have to have extremely confusing conversations in
> multiple software communities explaining your configury,
> function-pointer-indirection code (i.e., PR
> https://github.com/ofiwg/libfabric/pull/1259), compilation, and linking
> scheme to upgrade to a new library, you're doing it wrong.
> >>>
> >>>
> >>>
> >>>
> >>>> On Sep 3, 2015, at 7:19 AM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
> >>>>
> >>>> Hi Michal,
> >>>>
> >>>> I might have missed some context when proposing this solution. As
> Gilles suggested if you build Open MPI without support for dlopen
> (configure option --disable-dlopen) this simple solution will not work
> because the symbol conflict issue is generated deep inside the constructors
> of the 2 libraries.
> >>>>
> >>>> Yes, the "mtl = ^psm" (or ^psm2 depending on which one you want to
> disable) should go in the openmpi-mca-params.conf that gets installed in
> the $(sysconfdir).
> >>>>
> >>>> Thanks,
> >>>> George.
> >>>>
> >>>>
> >>>> On Thu, Sep 3, 2015 at 5:14 AM, Michal Schmidt <mschm...@redhat.com>
> wrote:
> >>>> [I apologize for not threading the email properly. I was not
> subscribed
> >>>> before and found the conversation in the web archive.]
> >>>>
> >>>> Hello,
> >>>>
> >>>> I am the one who discovered the PSM vs. PSM2 library conflict and
> >>>> proposed the temporary workaround of having two builds of the openmpi
> >>>> package.
> >>>>
> >>>> George Bosilca wrote:
> >>>>> 3. Except if the distro builds OMPI statically, I see no reason to
> >>>>> have 2 builds of OMPI due to conflicting symbols between two shared
> >>>>> libraries that OMPI MCA loads willingly. Why is a simple "mtl = ^psm" in
> >>>>> the OMPI system-wide configuration file not enough to solve the
> >>>>> issue?
> >>>>
> >>>> Thank you for this suggestion. It would go into
> openmpi-mca-params.conf,
> >>>> right? I will try it.
> >>>>
> >>>> Regards,
> >>>> Michal
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17927.php
> >>>>
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17928.php
> >>>
> >>>
> >>> --
> >>> Jeff Squyres
> >>> jsquy...@cisco.com
> >>> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>
> >>> _______________________________________________
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17931.php
> >>
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17933.php
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17937.php
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17939.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17940.php
