Re: [OMPI devel] PML selection logic

Ralph H Castain Thu, 26 Jun 2008 09:26:37 -0400

Just to complete this thread...

Brian raised a very good point, so we identified it on the weekly telecon as
a subject that really should be discussed at next week's technical meeting.
I think we can find a reasonable answer, but there are several ways it can
be done. So rather than doing our usual piecemeal approach to the solution,
it makes sense to begin talking about a more holistic design for
accommodating both needs.


Thanks Brian for pointing out the bigger picture.
Ralph



On 6/24/08 8:22 AM, "Brian W. Barrett" <brbar...@open-mpi.org> wrote:

> yeah, that could be a problem, but it's such a minority case and we've got
> to draw the line somewhere.
> 
> Of course, it seems like this is a never ending battle between two
> opposing forces...  The desire to do the "right thing" all the time at
> small and medium scale and the desire to scale out to the "big thing".
> It seems like in the quest to kill off the modex, we've run into these
> pretty often.
> 
> The modex doesn't hurt us at small scale (indeed, we're probably ok with
> the routed communication pattern up to 512 nodes or so if we don't do
> anything stupid, maybe further).  Is it time to admit defeat in this
> argument and have a configure option that turns off the modex (at the cost
> of some of these correctness checks) for the large machines, but keeps
> things simple for the common case?  I'm sure there are other things where
> this will come up, so perhaps a --enable-large-scale?  Maybe it's a dumb
> idea, but it seems like we've made a lot of compromises lately around
> this, where no one ends up really happy with the solution :/.
> 
> Brian
> 
> 
> On Tue, 24 Jun 2008, George Bosilca wrote:
> 
>> Brian hinted a possible bug in one of his replies. How does this work in the
>> case of dynamic processes? We can envision several scenarios, but lets take a
>> simple: 2 jobs that get connected with connect/accept. One might publish the
>> PML name (simply because the -mca argument was on) and one might not?
>> 
>> george.
>> 
>> On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
>> 
>>> Also sounds good to me.
>>> 
>>> Note that the most difficult part of the forward-looking plan is that we
>>> usually can't tell the difference between "something failed to initialize"
>>> and "you don't have support for feature X".
>>> 
>>> I like the general philosophy of: running out of the box always works just
>>> fine, but if you/the sysadmin is smart, you can get performance
>>> improvements.
>>> 
>>> 
>>> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>>> 
>>>> I concur
>>>> - galen
>>>> 
>>>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>>> 
>>>>> That sounds like a reasonable plan to me.
>>>>> 
>>>>> Brian
>>>>> 
>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>> 
>>>>>> Okay, so let's explore an alternative that preserves the support you are
>>>>>> seeking for the "ignorant user", but doesn't penalize everyone else.
>>>>>> What we
>>>>>> could do is simply set things up so that:
>>>>>> 
>>>>>> 1. if -mca plm xyz is provided, then no modex data is added
>>>>>> 
>>>>>> 2. if it is not provided, then only rank=0 inserts the data. All other
>>>>>> procs
>>>>>> simply check their own selection against the one given by rank=0
>>>>>> 
>>>>>> Now, if a knowledgeable user or sys admin specifies what to use for
>>>>>> their
>>>>>> system, we won't penalize their startup time. A user who doesn't know
>>>>>> what
>>>>>> to do gets to run, albeit less scalably on startup.
>>>>>> 
>>>>>> Looking forward from there, we can look to a day where failing to
>>>>>> initialize
>>>>>> something that exists on the system could be detected in some other
>>>>>> fashion,
>>>>>> letting the local proc abort since it would know that other procs that
>>>>>> detected similar capabilities may well have selected that PML. For now,
>>>>>> though, this would solve the problem.
>>>>>> 
>>>>>> Make sense?
>>>>>> Ralph
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 6/23/08 1:31 PM, "Brian W. Barrett" <brbar...@open-mpi.org> wrote:
>>>>>> 
>>>>>>> The problem is that we default to OB1, but that's not the right choice
>>>>>>> for
>>>>>>> some platforms (like Pathscale / PSM), where there's a huge performance
>>>>>>> hit for using OB1.  So we run into a situation where user installs Open
>>>>>>> MPI, starts running, gets horrible performance, bad mouths Open MPI,
>>>>>>> and
>>>>>>> now we're in that game again.  Yeah, the sys admin should know what to
>>>>>>> do,
>>>>>>> but it doesn't always work that way.
>>>>>>> 
>>>>>>> Brian
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>> 
>>>>>>>> My fault - I should be more precise in my language. ;-/
>>>>>>>> 
>>>>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It
>>>>>>>> seems
>>>>>>>> to me that a simpler solution to what you describe is for the user to
>>>>>>>> specify -mca pml ob1, or -mca pml cm. If the latter, then you could
>>>>>>>> deal
>>>>>>>> with the failed-to-initialize problem cleanly by having the proc
>>>>>>>> directly
>>>>>>>> abort.
>>>>>>>> 
>>>>>>>> Again, sometimes I think we attempt to automate too many things. This
>>>>>>>> seems
>>>>>>>> like a pretty clear case where you know what you want - the sys admin,
>>>>>>>> if
>>>>>>>> nobody else, can certainly set that mca param in the default param
>>>>>>>> file!
>>>>>>>> 
>>>>>>>> Otherwise, it seems to me that you are relying on the modex to detect
>>>>>>>> that
>>>>>>>> your proc failed to init the correct subsystem. I hate to force a
>>>>>>>> modex just
>>>>>>>> for that - if so, then perhaps this could again be a settable option
>>>>>>>> to
>>>>>>>> avoid requiring non-scalable behavior for those of us who want
>>>>>>>> scalability?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbar...@open-mpi.org> wrote:
>>>>>>>> 
>>>>>>>>> The selection code was added because frequently high speed
>>>>>>>>> interconnects
>>>>>>>>> fail to initialize properly due to random stuff happening (yes,
>>>>>>>>> that's a
>>>>>>>>> horrible statement, but true).  We ran into a situation with some
>>>>>>>>> really
>>>>>>>>> flaky machines where most of the processes would chose CM, but a
>>>>>>>>> couple
>>>>>>>>> would fail to initialize the MTL and therefore chose OB1.  This lead
>>>>>>>>> to a
>>>>>>>>> hang situation, which is the worst of the worst.
>>>>>>>>> 
>>>>>>>>> I think #1 is adequate, although it doesn't handle spawn particularly
>>>>>>>>> well.  And spawn is generally used in environments where such network
>>>>>>>>> mismatches are most likely to occur.
>>>>>>>>> 
>>>>>>>>> Brian
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>> 
>>>>>>>>>> Since my goal is to eliminate the modex completely for managed
>>>>>>>>>> installations, could you give me a brief understanding of this
>>>>>>>>>> eventual PML
>>>>>>>>>> selection logic? It would help to hear an example of how and why
>>>>>>>>>> different
>>>>>>>>>> procs could get different answers - and why we would want to allow
>>>>>>>>>> them to
>>>>>>>>>> do so.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Ralph
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <boute...@eecs.utk.edu>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> The first approach sounds fair enough to me. We should avoid 2 and
>>>>>>>>>>> 3
>>>>>>>>>>> as the pml selection mechanism used to be
>>>>>>>>>>> more complex before we reduced it to accommodate a major design bug
>>>>>>>>>>> in
>>>>>>>>>>> the BTL selection process. When using the complete PML selection,
>>>>>>>>>>> BTL
>>>>>>>>>>> would be initialized several times, leading to a variety of bugs.
>>>>>>>>>>> Eventually the PML selection should return to its old self, when
>>>>>>>>>>> the
>>>>>>>>>>> BTL bug gets fixed.
>>>>>>>>>>> 
>>>>>>>>>>> Aurelien
>>>>>>>>>>> 
>>>>>>>>>>> Le 23 juin 08 à 12:36, Ralph H Castain a écrit :
>>>>>>>>>>> 
>>>>>>>>>>>> Yo all
>>>>>>>>>>>> 
>>>>>>>>>>>> I've been doing further research into the modex and came across
>>>>>>>>>>>> something I
>>>>>>>>>>>> don't fully understand. It seems we have each process insert into
>>>>>>>>>>>> the modex
>>>>>>>>>>>> the name of the PML module that it selected. Once the modex has
>>>>>>>>>>>> exchanged
>>>>>>>>>>>> that info, it then loops across all procs in the job to check
>>>>>>>>>>>> their
>>>>>>>>>>>> selection, and aborts if any proc picked a different PML module.
>>>>>>>>>>>> 
>>>>>>>>>>>> All well and good...assuming that procs actually -can- choose
>>>>>>>>>>>> different PML
>>>>>>>>>>>> modules and hence create an "abort" scenario. However, if I look
>>>>>>>>>>>> inside the
>>>>>>>>>>>> PML's at their selection logic, I find that a proc can ONLY pick a
>>>>>>>>>>>> module
>>>>>>>>>>>> other than ob1 if:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. the user specifies the module to use via -mca pml xyz or by
>>>>>>>>>>>> using a
>>>>>>>>>>>> module specific mca param to adjust its priority. In this case,
>>>>>>>>>>>> since the
>>>>>>>>>>>> mca param is propagated, ALL procs have no choice but to pick that
>>>>>>>>>>>> same
>>>>>>>>>>>> module, so that can't cause us to abort (we will have already
>>>>>>>>>>>> returned an
>>>>>>>>>>>> error and aborted if the specified module can't run).
>>>>>>>>>>>> 
>>>>>>>>>>>> 2. the pml/cm module detects that an MTL module was selected, and
>>>>>>>>>>>> that it is
>>>>>>>>>>>> other than "psm". In this case, the CM module will be selected
>>>>>>>>>>>> because its
>>>>>>>>>>>> default priority is higher than that of OB1.
>>>>>>>>>>>> 
>>>>>>>>>>>> In looking deeper into the MTL selection logic, it appears to me
>>>>>>>>>>>> that you
>>>>>>>>>>>> either have the required capability or you don't. I can see that
>>>>>>>>>>>> in
>>>>>>>>>>>> some
>>>>>>>>>>>> environments (e.g., rsh across unmanaged collections of machines),
>>>>>>>>>>>> it might
>>>>>>>>>>>> be possible for someone to launch across a set of machines where
>>>>>>>>>>>> some do and
>>>>>>>>>>>> some don't have the required support. However, in all other cases,
>>>>>>>>>>>> this will
>>>>>>>>>>>> be homogeneous across the system.
>>>>>>>>>>>> 
>>>>>>>>>>>> Given this analysis (and someone more familiar with the PML should
>>>>>>>>>>>> feel free
>>>>>>>>>>>> to confirm or correct it), it seems to me that this could be
>>>>>>>>>>>> streamlined via
>>>>>>>>>>>> one or more means:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. at the most, we could have rank=0 add the PML module name to
>>>>>>>>>>>> the
>>>>>>>>>>>> modex,
>>>>>>>>>>>> and other procs simply check it against their own and return an
>>>>>>>>>>>> error if
>>>>>>>>>>>> they differ. This accomplishes the identical functionality to what
>>>>>>>>>>>> we have
>>>>>>>>>>>> today, but with much less info in the modex.
>>>>>>>>>>>> 
>>>>>>>>>>>> 2. we could eliminate this info from the modex altogether by
>>>>>>>>>>>> requiring the
>>>>>>>>>>>> user to specify the PML module if they want something other than
>>>>>>>>>>>> the
>>>>>>>>>>>> default
>>>>>>>>>>>> OB1. In this case, there can be no confusion over what each proc
>>>>>>>>>>>> is
>>>>>>>>>>>> to use.
>>>>>>>>>>>> The CM module will attempt to init the MTL - if it cannot do so,
>>>>>>>>>>>> then the
>>>>>>>>>>>> job will return the correct error and tell the user that CM/MTL
>>>>>>>>>>>> support is
>>>>>>>>>>>> unavailable.
>>>>>>>>>>>> 
>>>>>>>>>>>> 3. we could again eliminate the info by not inserting it into the
>>>>>>>>>>>> modex if
>>>>>>>>>>>> (a) the default PML module is selected, or (b) the user specified
>>>>>>>>>>>> the PML
>>>>>>>>>>>> module to be used. In the first case, each proc can simply check
>>>>>>>>>>>> to
>>>>>>>>>>>> see if
>>>>>>>>>>>> they picked the default - if not, then we can insert the info to
>>>>>>>>>>>> indicate
>>>>>>>>>>>> the difference. Thus, in the "standard" case, no info will be
>>>>>>>>>>>> inserted.
>>>>>>>>>>>> 
>>>>>>>>>>>> In the second case, we will already get an error if the specified
>>>>>>>>>>>> PML module
>>>>>>>>>>>> could not be used. Hence, the modex check provides no additional
>>>>>>>>>>>> info or
>>>>>>>>>>>> value.
>>>>>>>>>>>> 
>>>>>>>>>>>> I understand the motivation to support automation. However, in
>>>>>>>>>>>> this
>>>>>>>>>>>> case,
>>>>>>>>>>>> the automation actually doesn't seem to buy us very much, and it
>>>>>>>>>>>> isn't
>>>>>>>>>>>> coming "free". So perhaps some change in how this is done would be
>>>>>>>>>>>> in order?
>>>>>>>>>>>> 
>>>>>>>>>>>> Ralph
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> _______________________________________________
>>>>>>>>>> devel mailing list
>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> de...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>> 
>>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> 
>>>>>> 
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> 
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> Cisco Systems
>>> 
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

Re: [OMPI devel] PML selection logic

Reply via email to