As we discussed on the phone, I prefer the bullet #3 approach - ask Red Hat to 
build/distribute 1.10.0 without PSM2 support, and let Intel provide a 
PSM2-enabled variant via their current proprietary distribution channel until 
they can provide a “clean” solution to the community.

If that hasn’t happened prior to a 1.10.1 release, we can then remove PSM2 at 
that time. I’m hoping the solution will appear prior to that point :-)


> On Sep 3, 2015, at 8:46 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> Ralph and I just chatted about this on the phone.  I think I understand his 
> position better now.
> 
> Just to be clear/put some context in this conversation:
> 
> 1. PSM (aka "PSM1") supports TrueScale Intel networks
> 2. PSM2 supports OmniScale Intel networks
> 
> ------
> 
> The following three solutions are more-or-less equivalent:
> 
> a. add "mtl=^psm2" in the mca-params.conf file (George's proposal)
> b. configure --without-psm2 (similar to George's proposal)
> c. we release 1.10.1 with no PSM2 MTL (Ralph's proposal)
> 
> In all 3 cases, the OmniScale end user will not have support for their 
> network (and will likely fall back to TCP?).  TrueScale users are unaffected.
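
For anyone following along, options (a) and (b) could be sketched roughly as below. This is only an illustration: the helper name disable_psm2_mtl is made up for this note, and the file path is passed as an argument because the location of openmpi-mca-params.conf depends on the install prefix. The "mtl = ^psm2" setting and the --without-psm2 configure option are the ones quoted above.

```shell
# Option (a): exclude the psm2 MTL through the system-wide MCA parameter file.
# disable_psm2_mtl is a hypothetical helper; $1 is the path to
# openmpi-mca-params.conf (its location is an assumption left to the caller).
disable_psm2_mtl() {
    conf_file="$1"
    # Append only when no "mtl = ..." line exists yet, so reruns are idempotent
    # and an existing site setting is never overridden.
    if ! grep -Eq '^[[:space:]]*mtl[[:space:]]*=' "$conf_file" 2>/dev/null; then
        printf 'mtl = ^psm2\n' >> "$conf_file"
    fi
}

# Option (b): leave PSM2 out at build time instead (run in the source tree):
#   ./configure --without-psm2 && make && make install
```

Either way the result for an OmniScale user is the same as in (c): no psm2 MTL is selected at run time.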
> 
> Technically, there's a 4th solution (proposed by Red Hat): the distro 
> provides 2 different Open MPI installations -- one for (everything+PSM1), 
> another for (everything+PSM2).  I agree that this is (very) undesirable.  In 
> this case, *all* users are penalized -- not just TrueScale/OmniScale users -- 
> because all users will now wonder "Which Open MPI should I use?" (even if 
> they're not TS/OS users, and it doesn't matter which one they use, they still 
> have to expend unnecessary mental energy trying to understand why there are 
> two, and which they should use).  Meh.
> 
> Hence, we're back to the three possible "more-or-less equivalent" solutions: 
> a, b, or c.  I say "more-or-less" because there *is* a semantic difference 
> between a/b and c:
> 
> 1. For a/b: packagers are responsible for the solution, and also responsible 
> for *documenting* the solution (so that OmniScale users can figure out why 
> they are getting lousy performance).
> 2. For c: Open MPI is responsible for the solution; we'll likely note in NEWS 
> that PSM2 support was removed.
> 
> Hence, for the "let's release 1.10.1 without PSM2" solution, users have a 
> (potentially) easier way of figuring out why they're not getting good 
> performance.
> 
> That being said:
> 
> 1. I'm not 100% convinced that users will go to the NEWS file to figure out 
> why they're not getting good performance.  True, it's our 
> officially-sanctioned method for publishing information to users, but I don't 
> think that it's the first place that comes to mind when you're diagnosing a 
> performance problem.
> 
> 2. It seems like we have handled this kind of situation differently in the 
> past.  
> 
> 2a. E.g., when we had the hcol/ml conflict, we asked Mellanox for a solution. 
>  They promised to release a new libhcol that fixed the problem, and in the 
> meantime, told their customers to get Mellanox Open MPI from mellanox.com 
> that immediately fixed the problem.
> 
> 2b. Similarly, Cisco distributed its own Cisco Open MPI when we wanted to 
> have libfabric support in the Open MPI v1.8.x series.
> 
> 2c. This case is not entirely the same as the above two examples, but I think 
> it's similar in spirit: a distro is trying to be all-inclusive with other 
> freely-distributed software in that distro (i.e., both PSM1 and PSM2), and a 
> vendor-specific issue is causing a problem with that plan.
> 
> 3. I therefore think we should take the same approach that we have taken with 
> other vendors in the past:
> 
> 3a. Red Hat (and other packagers) can do whatever they need to do to package 
> Open MPI 1.10.0.  In this case, Red Hat is asking our advice as to how to 
> package it (because they include both PSM1 and PSM2 support in their distro, 
> and this creates a conflict in Open MPI).
> 
> ==> My $0.02: we should tell Red Hat to build --without-psm2, because then 
> users can see that "ompi_info | grep psm2" will be empty.  That's a dead 
> giveaway that that Open MPI installation has no PSM2 support.
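
The "ompi_info | grep psm2" check could be wrapped so that scripts can test for it. A rough sketch follows; it assumes ompi_info lists each component on a line of the form "MCA mtl: psm2 (...)", and the helper name has_psm2_mtl is invented for this note.

```shell
# has_psm2_mtl (hypothetical helper): reads ompi_info-style output on stdin and
# succeeds only if a psm2 MTL component is listed. The "MCA mtl: <name>" line
# format is an assumption about ompi_info's output.
has_psm2_mtl() {
    grep -Eq 'mtl:[[:space:]]*psm2([^[:alnum:]_]|$)'
}

# Typical use against a real installation:
#   if ompi_info | has_psm2_mtl; then
#       echo "this Open MPI has PSM2 support"
#   else
#       echo "this Open MPI was built without PSM2 support"
#   fi
```

Matching on the "mtl:" prefix rather than a bare "psm2" avoids false positives from unrelated lines, such as configure command lines that mention --without-psm2.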
> 
> 3b. Intel can support its customers by having an "Intel Open MPI" 
> distribution (or whatever they want to name it, just as long as it is not 
> named plain/vanilla "Open MPI") that is configured/built to support both 
> PSM1/PSM2 via their normal software distribution mechanism.
> 
> 3c. If there's some solution Intel would like to push upstream to the Open 
> MPI community, great -- it can go through the normal review process and be 
> accepted upstream (i.e., just like we work every day).  That solution can 
> then be included in future releases.
> 
> How does that sound?
> 
> 
> 
> 
>> On Sep 3, 2015, at 10:48 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>> 
>> Ralph,
>> 
>> if I correctly read between the lines of your second point, omnipath (PSM2) 
>> is working out of the box. I am not sure this is the case, and/or my 
>> extrapolation might be incorrect.
>> 
>> if I understood correctly, psm2 is a new feature.
>> from a distro point of view, that could be a new package (known not to 
>> support PSM), or a mpirun-psm2 wrapper, or a release note (e.g. use --mca 
>> mtl ^psm or a psm2 param file)
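
The "mpirun-psm2 wrapper" idea might look something like the sketch below. It is only a guess at what such a wrapper could be: the function name and the MPIRUN override variable are hypothetical, and the only piece taken from the message above is the "--mca mtl ^psm" option.

```shell
# mpirun_psm2 (hypothetical wrapper): launch with the PSM (PSM1) MTL excluded,
# leaving psm2 eligible for selection. MPIRUN is an invented override variable
# naming the launcher to call; it defaults to plain "mpirun".
mpirun_psm2() {
    "${MPIRUN:-mpirun}" --mca mtl '^psm' "$@"
}

# Example:
#   mpirun_psm2 -np 4 ./my_app
```

The quotes around ^psm keep shells that treat a caret specially from altering it.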
>> 
>> I still do not get how removing PSM2 makes things better
>> (and the same result can be achieved by configuring with --without-psm2)
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Thursday, September 3, 2015, Ralph Castain <r...@open-mpi.org> wrote:
>> I guess I didn’t make it clear in my prior comment, so let me try again. I 
>> understand about dlopen and the fix that George proposed - we had internally 
>> discussed this as well. However, the questions that raises are:
>> 
>> 1. how does the distro (Michal) decide which PSM module to disable by 
>> default in their package?
>> 
>> 2. how does the user “discover” that their fabric has automatically been 
>> disabled, especially since this has never been the case before?
>> 
>> I’ll raise the procedural question at our next telecon. I certainly take no 
>> pleasure out of generating releases, so if we have a better solution, I’m 
>> all for it!
>> 
>> 
>>> On Sep 3, 2015, at 5:55 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>>> wrote:
>>> 
>>> I agree with what George says.
>>> 
>>> AFAIK, Red Hat builds Open MPI with support for dlopen, so the config file 
>>> option is probably suitable.
>>> 
>>> However, I have to admit that I resent the fact that PSM's poor upgrade 
>>> path design is forcing both the Open MPI and libfabric communities to have 
>>> similar confusing conversations (e.g., see 
>>> https://github.com/ofiwg/libfabric/issues/1258#issuecomment-137426271).
>>> 
>>> Specifically: because of the design of PSM1/PSM2, both Open MPI and 
>>> libfabric will have to adjust their configury and use dlopen/function 
>>> pointer indirection to "solve" the problem of supporting both PSM1 and PSM2.
>>> 
>>> Does that seem weird to anyone else?
>>> 
>>> IMNSHO, if you have to have extremely confusing conversations in multiple 
>>> software communities explaining your configury, 
>>> function-pointer-indirection code (i.e., PR 
>>> https://github.com/ofiwg/libfabric/pull/1259), compilation, and linking 
>>> scheme to upgrade to a new library, you're doing it wrong.
>>> 
>>> 
>>> 
>>> 
>>>> On Sep 3, 2015, at 7:19 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>> 
>>>> Hi Michal,
>>>> 
>>>> I might have missed some context when proposing this solution. As Gilles 
>>>> suggested if you build Open MPI without support for dlopen (configure 
>>>> option --disable-dlopen) this simple solution will not work because the 
>>>> symbol conflict issue is generated deep inside the constructors of the 2 
>>>> libraries.
>>>> 
>>>> Yes, the "mtl = ^psm" (or ^psm2 depending on which one you want to 
>>>> disable) should go in the openmpi-mca-params.conf that gets installed in 
>>>> the $(sysconfdir).
>>>> 
>>>> Thanks,
>>>> George.
>>>> 
>>>> 
>>>> On Thu, Sep 3, 2015 at 5:14 AM, Michal Schmidt <mschm...@redhat.com> wrote:
>>>> [I apologize for not threading the email properly. I was not subscribed
>>>> before and found the conversation in the web archive.]
>>>> 
>>>> Hello,
>>>> 
>>>> I am the one who discovered the PSM vs. PSM2 library conflict and
>>>> proposed the temporary workaround of having two builds of the openmpi
>>>> package.
>>>> 
>>>> George Bosilca wrote:
>>>>> 3. Except if the distro builds OMPI statically, I see no reason to
>>>>> have 2 build of OMPI due to conflicting symbols between two shared
>>>>> libraries that OMPI MCA load willingly. Why a simple "mtl = ^psm" in
>>>>> the OMPI system wide configuration file is not enough to solve the
>>>>> issue?
>>>> 
>>>> Thank you for this suggestion. It would go into openmpi-mca-params.conf,
>>>> right? I will try it.
>>>> 
>>>> Regards,
>>>> Michal
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/devel/2015/09/17927.php
>>>> 
>>> 
>>> 
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
