Just to complete this thread... Brian raised a very good point, so we identified it on the weekly telecon as a subject that really should be discussed at next week's technical meeting. I think we can find a reasonable answer, but there are several ways it can be done. So rather than doing our usual piecemeal approach to the solution, it makes sense to begin talking about a more holistic design for accommodating both needs.
Thanks, Brian, for pointing out the bigger picture.

Ralph


On 6/24/08 8:22 AM, "Brian W. Barrett" <brbar...@open-mpi.org> wrote:

> Yeah, that could be a problem, but it's such a minority case and we've got
> to draw the line somewhere.
>
> Of course, it seems like this is a never-ending battle between two
> opposing forces... the desire to do the "right thing" all the time at
> small and medium scale, and the desire to scale out to the "big thing".
> It seems like in the quest to kill off the modex, we've run into these
> pretty often.
>
> The modex doesn't hurt us at small scale (indeed, we're probably OK with
> the routed communication pattern up to 512 nodes or so if we don't do
> anything stupid, maybe further). Is it time to admit defeat in this
> argument and have a configure option that turns off the modex (at the
> cost of some of these correctness checks) for the large machines, but
> keeps things simple for the common case? I'm sure there are other things
> where this will come up, so perhaps a --enable-large-scale? Maybe it's a
> dumb idea, but it seems like we've made a lot of compromises lately
> around this, where no one ends up really happy with the solution :/.
>
> Brian
>
>
> On Tue, 24 Jun 2008, George Bosilca wrote:
>
>> Brian hinted at a possible bug in one of his replies. How does this work
>> in the case of dynamic processes? We can envision several scenarios, but
>> let's take a simple one: two jobs that get connected with connect/accept.
>> One might publish the PML name (simply because the -mca argument was on)
>> and one might not?
>>
>> george.
>>
>> On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
>>
>>> Also sounds good to me.
>>>
>>> Note that the most difficult part of the forward-looking plan is that
>>> we usually can't tell the difference between "something failed to
>>> initialize" and "you don't have support for feature X".
>>>
>>> I like the general philosophy of: running out of the box always works
>>> just fine, but if you/the sysadmin is smart, you can get performance
>>> improvements.
>>>
>>>
>>> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>>>
>>>> I concur
>>>> - galen
>>>>
>>>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>>>
>>>>> That sounds like a reasonable plan to me.
>>>>>
>>>>> Brian
>>>>>
>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>
>>>>>> Okay, so let's explore an alternative that preserves the support
>>>>>> you are seeking for the "ignorant user", but doesn't penalize
>>>>>> everyone else. What we could do is simply set things up so that:
>>>>>>
>>>>>> 1. if -mca pml xyz is provided, then no modex data is added
>>>>>>
>>>>>> 2. if it is not provided, then only rank=0 inserts the data. All
>>>>>> other procs simply check their own selection against the one given
>>>>>> by rank=0
>>>>>>
>>>>>> Now, if a knowledgeable user or sys admin specifies what to use for
>>>>>> their system, we won't penalize their startup time. A user who
>>>>>> doesn't know what to do gets to run, albeit less scalably on
>>>>>> startup.
>>>>>>
>>>>>> Looking forward from there, we can look to a day where failing to
>>>>>> initialize something that exists on the system could be detected in
>>>>>> some other fashion, letting the local proc abort since it would
>>>>>> know that other procs that detected similar capabilities may well
>>>>>> have selected that PML. For now, though, this would solve the
>>>>>> problem.
>>>>>>
>>>>>> Make sense?
>>>>>> Ralph
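For concreteness, the two rules in Ralph's proposal above might look roughly like the sketch below. This is only an illustration, not the actual Open MPI code: modex_send()/modex_recv() are hypothetical stand-ins for the real modex interface, and the "pml.name" key is invented.

    #include <stdio.h>
    #include <string.h>

    /* hypothetical stand-ins for the real modex exchange calls */
    int modex_send(const char *key, const void *data, size_t size);
    int modex_recv(const char *key, int rank, void *data, size_t max_size);

    int pml_check_selection(int my_rank, const char *my_pml, int user_forced)
    {
        char rank0_pml[64];

        /* Rule 1: -mca pml <name> was given, so every proc made the same
         * choice (or already aborted during selection); skip the modex. */
        if (user_forced) {
            return 0;
        }

        /* Rule 2: only rank 0 publishes its selection... */
        if (0 == my_rank) {
            return modex_send("pml.name", my_pml, strlen(my_pml) + 1);
        }

        /* ...and every other proc checks itself against rank 0's choice. */
        if (0 != modex_recv("pml.name", 0, rank0_pml, sizeof(rank0_pml))) {
            return -1;
        }
        if (0 != strcmp(rank0_pml, my_pml)) {
            fprintf(stderr, "selected PML %s differs from rank 0's %s\n",
                    my_pml, rank0_pml);
            return -1;  /* caller aborts this proc */
        }
        return 0;
    }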
Barrett" <brbar...@open-mpi.org> wrote: >>>>>> >>>>>>> The problem is that we default to OB1, but that's not the right choice >>>>>>> for >>>>>>> some platforms (like Pathscale / PSM), where there's a huge performance >>>>>>> hit for using OB1. So we run into a situation where user installs Open >>>>>>> MPI, starts running, gets horrible performance, bad mouths Open MPI, >>>>>>> and >>>>>>> now we're in that game again. Yeah, the sys admin should know what to >>>>>>> do, >>>>>>> but it doesn't always work that way. >>>>>>> >>>>>>> Brian >>>>>>> >>>>>>> >>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote: >>>>>>> >>>>>>>> My fault - I should be more precise in my language. ;-/ >>>>>>>> >>>>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It >>>>>>>> seems >>>>>>>> to me that a simpler solution to what you describe is for the user to >>>>>>>> specify -mca pml ob1, or -mca pml cm. If the latter, then you could >>>>>>>> deal >>>>>>>> with the failed-to-initialize problem cleanly by having the proc >>>>>>>> directly >>>>>>>> abort. >>>>>>>> >>>>>>>> Again, sometimes I think we attempt to automate too many things. This >>>>>>>> seems >>>>>>>> like a pretty clear case where you know what you want - the sys admin, >>>>>>>> if >>>>>>>> nobody else, can certainly set that mca param in the default param >>>>>>>> file! >>>>>>>> >>>>>>>> Otherwise, it seems to me that you are relying on the modex to detect >>>>>>>> that >>>>>>>> your proc failed to init the correct subsystem. I hate to force a >>>>>>>> modex just >>>>>>>> for that - if so, then perhaps this could again be a settable option >>>>>>>> to >>>>>>>> avoid requiring non-scalable behavior for those of us who want >>>>>>>> scalability? >>>>>>>> >>>>>>>> >>>>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbar...@open-mpi.org> wrote: >>>>>>>> >>>>>>>>> The selection code was added because frequently high speed >>>>>>>>> interconnects >>>>>>>>> fail to initialize properly due to random stuff happening (yes, >>>>>>>>> that's a >>>>>>>>> horrible statement, but true). We ran into a situation with some >>>>>>>>> really >>>>>>>>> flaky machines where most of the processes would chose CM, but a >>>>>>>>> couple >>>>>>>>> would fail to initialize the MTL and therefore chose OB1. This lead >>>>>>>>> to a >>>>>>>>> hang situation, which is the worst of the worst. >>>>>>>>> >>>>>>>>> I think #1 is adequate, although it doesn't handle spawn particularly >>>>>>>>> well. And spawn is generally used in environments where such network >>>>>>>>> mismatches are most likely to occur. >>>>>>>>> >>>>>>>>> Brian >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote: >>>>>>>>> >>>>>>>>>> Since my goal is to eliminate the modex completely for managed >>>>>>>>>> installations, could you give me a brief understanding of this >>>>>>>>>> eventual PML >>>>>>>>>> selection logic? It would help to hear an example of how and why >>>>>>>>>> different >>>>>>>>>> procs could get different answers - and why we would want to allow >>>>>>>>>> them to >>>>>>>>>> do so. >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> Ralph >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> The first approach sounds fair enough to me. We should avoid 2 and >>>>>>>>>>> 3 >>>>>>>>>>> as the pml selection mechanism used to be >>>>>>>>>>> more complex before we reduced it to accommodate a major design bug >>>>>>>>>>> in >>>>>>>>>>> the BTL selection process. 
>>>>>>>>
>>>>>>>>
>>>>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbar...@open-mpi.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The selection code was added because frequently high speed
>>>>>>>>> interconnects fail to initialize properly due to random stuff
>>>>>>>>> happening (yes, that's a horrible statement, but true). We ran
>>>>>>>>> into a situation with some really flaky machines where most of
>>>>>>>>> the processes would choose CM, but a couple would fail to
>>>>>>>>> initialize the MTL and therefore choose OB1. This led to a hang
>>>>>>>>> situation, which is the worst of the worst.
>>>>>>>>>
>>>>>>>>> I think #1 is adequate, although it doesn't handle spawn
>>>>>>>>> particularly well. And spawn is generally used in environments
>>>>>>>>> where such network mismatches are most likely to occur.
>>>>>>>>>
>>>>>>>>> Brian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>>
>>>>>>>>>> Since my goal is to eliminate the modex completely for managed
>>>>>>>>>> installations, could you give me a brief understanding of this
>>>>>>>>>> eventual PML selection logic? It would help to hear an example
>>>>>>>>>> of how and why different procs could get different answers - and
>>>>>>>>>> why we would want to allow them to do so.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <boute...@eecs.utk.edu>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The first approach sounds fair enough to me. We should avoid 2
>>>>>>>>>>> and 3, as the PML selection mechanism used to be more complex
>>>>>>>>>>> before we reduced it to accommodate a major design bug in the
>>>>>>>>>>> BTL selection process. When using the complete PML selection,
>>>>>>>>>>> BTLs would be initialized several times, leading to a variety
>>>>>>>>>>> of bugs. Eventually the PML selection should return to its old
>>>>>>>>>>> self, once the BTL bug gets fixed.
>>>>>>>>>>>
>>>>>>>>>>> Aurelien
>>>>>>>>>>>
>>>>>>>>>>> On Jun 23, 2008, at 12:36 PM, Ralph H Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yo all
>>>>>>>>>>>>
>>>>>>>>>>>> I've been doing further research into the modex and came
>>>>>>>>>>>> across something I don't fully understand. It seems we have
>>>>>>>>>>>> each process insert into the modex the name of the PML module
>>>>>>>>>>>> that it selected. Once the modex has exchanged that info, it
>>>>>>>>>>>> then loops across all procs in the job to check their
>>>>>>>>>>>> selection, and aborts if any proc picked a different PML
>>>>>>>>>>>> module.
>>>>>>>>>>>>
>>>>>>>>>>>> All well and good... assuming that procs actually -can- choose
>>>>>>>>>>>> different PML modules and hence create an "abort" scenario.
>>>>>>>>>>>> However, if I look inside the PMLs at their selection logic, I
>>>>>>>>>>>> find that a proc can ONLY pick a module other than ob1 if:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. the user specifies the module to use via -mca pml xyz or by
>>>>>>>>>>>> using a module-specific mca param to adjust its priority. In
>>>>>>>>>>>> this case, since the mca param is propagated, ALL procs have
>>>>>>>>>>>> no choice but to pick that same module, so that can't cause us
>>>>>>>>>>>> to abort (we will have already returned an error and aborted
>>>>>>>>>>>> if the specified module can't run).
>>>>>>>>>>>>
>>>>>>>>>>>> 2. the pml/cm module detects that an MTL module was selected,
>>>>>>>>>>>> and that it is other than "psm". In this case, the CM module
>>>>>>>>>>>> will be selected because its default priority is higher than
>>>>>>>>>>>> that of OB1.
>>>>>>>>>>>>
>>>>>>>>>>>> In looking deeper into the MTL selection logic, it appears to
>>>>>>>>>>>> me that you either have the required capability or you don't.
>>>>>>>>>>>> I can see that in some environments (e.g., rsh across
>>>>>>>>>>>> unmanaged collections of machines), it might be possible for
>>>>>>>>>>>> someone to launch across a set of machines where some do and
>>>>>>>>>>>> some don't have the required support. However, in all other
>>>>>>>>>>>> cases, this will be homogeneous across the system.
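The existing check Ralph describes above boils down to something like the following sketch (same hypothetical modex helpers as in the earlier sketch, not the actual code):

    /* Every proc published its PML name under "pml.name"; each one then
     * walks the whole job and aborts on any mismatch. Note the O(nprocs)
     * entries in the modex -- that is the scaling cost under discussion. */
    int pml_check_all(int my_rank, int nprocs, const char *my_pml)
    {
        char peer_pml[64];
        int rank;

        for (rank = 0; rank < nprocs; rank++) {
            if (rank == my_rank) {
                continue;
            }
            if (0 != modex_recv("pml.name", rank, peer_pml, sizeof(peer_pml)) ||
                0 != strcmp(peer_pml, my_pml)) {
                return -1;  /* missing entry or different PML: abort */
            }
        }
        return 0;
    }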
>>>>>>>>>>>>
>>>>>>>>>>>> Given this analysis (and someone more familiar with the PML
>>>>>>>>>>>> should feel free to confirm or correct it), it seems to me
>>>>>>>>>>>> that this could be streamlined via one or more means:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. at the most, we could have rank=0 add the PML module name
>>>>>>>>>>>> to the modex, and other procs simply check it against their
>>>>>>>>>>>> own and return an error if they differ. This accomplishes the
>>>>>>>>>>>> identical functionality to what we have today, but with much
>>>>>>>>>>>> less info in the modex.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. we could eliminate this info from the modex altogether by
>>>>>>>>>>>> requiring the user to specify the PML module if they want
>>>>>>>>>>>> something other than the default OB1. In this case, there can
>>>>>>>>>>>> be no confusion over what each proc is to use. The CM module
>>>>>>>>>>>> will attempt to init the MTL - if it cannot do so, then the
>>>>>>>>>>>> job will return the correct error and tell the user that
>>>>>>>>>>>> CM/MTL support is unavailable.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. we could again eliminate the info by not inserting it into
>>>>>>>>>>>> the modex if (a) the default PML module is selected, or (b)
>>>>>>>>>>>> the user specified the PML module to be used. In the first
>>>>>>>>>>>> case, each proc can simply check to see if it picked the
>>>>>>>>>>>> default - if not, then we can insert the info to indicate the
>>>>>>>>>>>> difference. Thus, in the "standard" case, no info will be
>>>>>>>>>>>> inserted.
>>>>>>>>>>>>
>>>>>>>>>>>> In the second case, we will already get an error if the
>>>>>>>>>>>> specified PML module could not be used. Hence, the modex check
>>>>>>>>>>>> provides no additional info or value.
>>>>>>>>>>>>
>>>>>>>>>>>> I understand the motivation to support automation. However, in
>>>>>>>>>>>> this case, the automation actually doesn't seem to buy us very
>>>>>>>>>>>> much, and it isn't coming "free". So perhaps some change in
>>>>>>>>>>>> how this is done would be in order?
>>>>>>>>>>>>
>>>>>>>>>>>> Ralph
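Option 3 above would reduce to a conditional publish on the sending side, roughly as follows (again using the hypothetical modex_send() from the earlier sketches):

    int pml_maybe_publish(const char *my_pml, int user_forced)
    {
        /* (b) user forced the PML: a failure already produced an error,
         *     so a modex entry adds nothing; and
         * (a) default case: every proc can compare itself against "ob1"
         *     locally, so nothing needs to be exchanged. */
        if (user_forced || 0 == strcmp(my_pml, "ob1")) {
            return 0;   /* "standard" case: no modex entry */
        }
        /* only the unusual selection pays for a modex entry */
        return modex_send("pml.name", my_pml, strlen(my_pml) + 1);
    }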