We can also provide a few different parameter files for typical setups (large cluster, minimum latency, maximum bandwidth, etc.). The desired parameter file can be chosen with a configure flag and installed as $prefix/etc/openmpi-mca-params.conf.
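
A minimal sketch of how such parameter-file profiles might be rendered, assuming the usual `name = value` syntax of openmpi-mca-params.conf. The profile names, the `render_param_file` helper and the values are hypothetical placeholders for illustration, not tuning advice; only the `pml` parameter and its `ob1`/`cm` values come from this thread.

```python
# Hypothetical profiles: names and values are illustrative placeholders,
# not recommended Open MPI settings.
PROFILES = {
    "large-cluster": {"pml": "ob1", "btl": "openib,sm,self"},
    "min-latency": {"pml": "cm"},
    "max-bandwidth": {"pml": "ob1", "btl": "openib,self"},
}


def render_param_file(profile):
    """Return the text of an MCA parameter file for the named profile."""
    params = PROFILES[profile]
    lines = ["# Generated from profile: " + profile]
    # One "name = value" entry per line, sorted for a stable result.
    lines += [f"{name} = {value}" for name, value in sorted(params.items())]
    return "\n".join(lines) + "\n"


print(render_param_file("min-latency"))
```

A configure flag would then only need to pick one profile name and install the rendered text, and a sysadmin could still edit the installed file afterwards.
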
On Sat, Jun 28, 2008 at 3:55 PM, Jeff Squyres <jsquy...@cisco.com> wrote:

> Agreed. I have a few ideas in this direction as well (random thoughts that might as well be transcribed somewhere):
>
> - some kind of configure --enable-large-system (whatever) option is a Good Thing
>
> - it would be good if the configure option simply set [MCA parameter?] defaults wherever possible (vs. #if-selecting code). I think one of the biggest lessons learned from Open MPI is that everyone's setup is different -- having the ability to mix and match various run-time options, while not widely used, is absolutely critical in some scenarios. So it might be good if --enable-large-system sets a bunch of default parameters that some sysadmins may still want/need to override.
>
> - decision to run the modex: I haven't seen all of Ralph's work in this area, but I wonder if it's similar to the MPI handle parameter checks: it could be a multi-value MCA parameter, such as: "never", "always", "when-ompi-determines-its-necessary", etc., where the last value can use multiple criteria to know if it's necessary to do a modex (e.g., job size, when spawn occurs, whether the "pml" [or other critical] MCA param[s] were specified, ...etc.).
>
> On Jun 26, 2008, at 9:26 AM, Ralph H Castain wrote:
>
>> Just to complete this thread...
>>
>> Brian raised a very good point, so we identified it on the weekly telecon as a subject that really should be discussed at next week's technical meeting. I think we can find a reasonable answer, but there are several ways it can be done. So rather than doing our usual piecemeal approach to the solution, it makes sense to begin talking about a more holistic design for accommodating both needs.
>>
>> Thanks Brian for pointing out the bigger picture.
>> Ralph
>>
>> On 6/24/08 8:22 AM, "Brian W.
>> Barrett" <brbar...@open-mpi.org> wrote:
>>
>>> yeah, that could be a problem, but it's such a minority case and we've got to draw the line somewhere.
>>>
>>> Of course, it seems like this is a never ending battle between two opposing forces... The desire to do the "right thing" all the time at small and medium scale and the desire to scale out to the "big thing". It seems like in the quest to kill off the modex, we've run into these pretty often.
>>>
>>> The modex doesn't hurt us at small scale (indeed, we're probably ok with the routed communication pattern up to 512 nodes or so if we don't do anything stupid, maybe further). Is it time to admit defeat in this argument and have a configure option that turns off the modex (at the cost of some of these correctness checks) for the large machines, but keeps things simple for the common case? I'm sure there are other things where this will come up, so perhaps a --enable-large-scale? Maybe it's a dumb idea, but it seems like we've made a lot of compromises lately around this, where no one ends up really happy with the solution :/.
>>>
>>> Brian
>>>
>>> On Tue, 24 Jun 2008, George Bosilca wrote:
>>>
>>>> Brian hinted a possible bug in one of his replies. How does this work in the case of dynamic processes? We can envision several scenarios, but lets take a simple: 2 jobs that get connected with connect/accept. One might publish the PML name (simply because the -mca argument was on) and one might not?
>>>>
>>>> george.
>>>>
>>>> On Jun 24, 2008, at 8:28 AM, Jeff Squyres wrote:
>>>>
>>>>> Also sounds good to me.
>>>>>
>>>>> Note that the most difficult part of the forward-looking plan is that we usually can't tell the difference between "something failed to initialize" and "you don't have support for feature X".
>>>>>
>>>>> I like the general philosophy of: running out of the box always works just fine, but if you/the sysadmin is smart, you can get performance improvements.
>>>>>
>>>>> On Jun 23, 2008, at 4:18 PM, Shipman, Galen M. wrote:
>>>>>
>>>>>> I concur
>>>>>> - galen
>>>>>>
>>>>>> On Jun 23, 2008, at 3:44 PM, Brian W. Barrett wrote:
>>>>>>
>>>>>>> That sounds like a reasonable plan to me.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>
>>>>>>>> Okay, so let's explore an alternative that preserves the support you are seeking for the "ignorant user", but doesn't penalize everyone else. What we could do is simply set things up so that:
>>>>>>>>
>>>>>>>> 1. if -mca plm xyz is provided, then no modex data is added
>>>>>>>>
>>>>>>>> 2. if it is not provided, then only rank=0 inserts the data. All other procs simply check their own selection against the one given by rank=0
>>>>>>>>
>>>>>>>> Now, if a knowledgeable user or sys admin specifies what to use for their system, we won't penalize their startup time. A user who doesn't know what to do gets to run, albeit less scalably on startup.
>>>>>>>>
>>>>>>>> Looking forward from there, we can look to a day where failing to initialize something that exists on the system could be detected in some other fashion, letting the local proc abort since it would know that other procs that detected similar capabilities may well have selected that PML. For now, though, this would solve the problem.
>>>>>>>>
>>>>>>>> Make sense?
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>> On 6/23/08 1:31 PM, "Brian W.
>>>>>>>> Barrett" <brbar...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>>> The problem is that we default to OB1, but that's not the right choice for some platforms (like Pathscale / PSM), where there's a huge performance hit for using OB1. So we run into a situation where user installs Open MPI, starts running, gets horrible performance, bad mouths Open MPI, and now we're in that game again. Yeah, the sys admin should know what to do, but it doesn't always work that way.
>>>>>>>>>
>>>>>>>>> Brian
>>>>>>>>>
>>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>>
>>>>>>>>>> My fault - I should be more precise in my language. ;-/
>>>>>>>>>>
>>>>>>>>>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems to me that a simpler solution to what you describe is for the user to specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal with the failed-to-initialize problem cleanly by having the proc directly abort.
>>>>>>>>>>
>>>>>>>>>> Again, sometimes I think we attempt to automate too many things. This seems like a pretty clear case where you know what you want - the sys admin, if nobody else, can certainly set that mca param in the default param file!
>>>>>>>>>>
>>>>>>>>>> Otherwise, it seems to me that you are relying on the modex to detect that your proc failed to init the correct subsystem. I hate to force a modex just for that - if so, then perhaps this could again be a settable option to avoid requiring non-scalable behavior for those of us who want scalability?
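
The settable option asked for just above, combined with the multi-value parameter Jeff suggests ("never", "always", "when-ompi-determines-its-necessary"), could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `ModexPolicy` and `should_run_modex` are hypothetical, "auto" stands in for the third value, the 512 limit borrows Brian's rough estimate (which he gives in nodes), and the particular combination of criteria is a guess rather than anything the thread settles.

```python
from enum import Enum


class ModexPolicy(Enum):
    NEVER = "never"
    ALWAYS = "always"
    AUTO = "auto"  # stands in for "when-ompi-determines-its-necessary"


def should_run_modex(policy, job_size, uses_spawn, pml_specified,
                     small_job_limit=512):
    """Decide whether to run the modex (hypothetical criteria)."""
    if policy is ModexPolicy.NEVER:
        return False
    if policy is ModexPolicy.ALWAYS:
        return True
    # AUTO: the modex is cheap at small scale, so keep the safety check there.
    if job_size <= small_job_limit:
        return True
    # Dynamic processes may connect jobs that were configured differently.
    if uses_spawn:
        return True
    # At large scale, skip the modex only when the user fixed the PML.
    return not pml_specified


print(should_run_modex(ModexPolicy.AUTO, 4096, uses_spawn=False, pml_specified=True))
```

With these assumptions a large job whose PML was given explicitly skips the modex, while small jobs and jobs that use spawn keep it.
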
>>>>>>>>>>
>>>>>>>>>> On 6/23/08 1:21 PM, "Brian W. Barrett" <brbar...@open-mpi.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> The selection code was added because frequently high speed interconnects fail to initialize properly due to random stuff happening (yes, that's a horrible statement, but true). We ran into a situation with some really flaky machines where most of the processes would chose CM, but a couple would fail to initialize the MTL and therefore chose OB1. This lead to a hang situation, which is the worst of the worst.
>>>>>>>>>>>
>>>>>>>>>>> I think #1 is adequate, although it doesn't handle spawn particularly well. And spawn is generally used in environments where such network mismatches are most likely to occur.
>>>>>>>>>>>
>>>>>>>>>>> Brian
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Since my goal is to eliminate the modex completely for managed installations, could you give me a brief understanding of this eventual PML selection logic? It would help to hear an example of how and why different procs could get different answers - and why we would want to allow them to do so.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Ralph
>>>>>>>>>>>>
>>>>>>>>>>>> On 6/23/08 11:59 AM, "Aurélien Bouteiller" <boute...@eecs.utk.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The first approach sounds fair enough to me. We should avoid 2 and 3 as the pml selection mechanism used to be more complex before we reduced it to accommodate a major design bug in the BTL selection process.
>>>>>>>>>>>>> When using the complete PML selection, BTL would be initialized several times, leading to a variety of bugs. Eventually the PML selection should return to its old self, when the BTL bug gets fixed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Aurelien
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 23 June 08 at 12:36, Ralph H Castain wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yo all
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've been doing further research into the modex and came across something I don't fully understand. It seems we have each process insert into the modex the name of the PML module that it selected. Once the modex has exchanged that info, it then loops across all procs in the job to check their selection, and aborts if any proc picked a different PML module.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> All well and good...assuming that procs actually -can- choose different PML modules and hence create an "abort" scenario. However, if I look inside the PML's at their selection logic, I find that a proc can ONLY pick a module other than ob1 if:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. the user specifies the module to use via -mca pml xyz or by using a module specific mca param to adjust its priority. In this case, since the mca param is propagated, ALL procs have no choice but to pick that same module, so that can't cause us to abort (we will have already returned an error and aborted if the specified module can't run).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>> the pml/cm module detects that an MTL module was selected, and that it is other than "psm". In this case, the CM module will be selected because its default priority is higher than that of OB1.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In looking deeper into the MTL selection logic, it appears to me that you either have the required capability or you don't. I can see that in some environments (e.g., rsh across unmanaged collections of machines), it might be possible for someone to launch across a set of machines where some do and some don't have the required support. However, in all other cases, this will be homogeneous across the system.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Given this analysis (and someone more familiar with the PML should feel free to confirm or correct it), it seems to me that this could be streamlined via one or more means:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. at the most, we could have rank=0 add the PML module name to the modex, and other procs simply check it against their own and return an error if they differ. This accomplishes the identical functionality to what we have today, but with much less info in the modex.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. we could eliminate this info from the modex altogether by requiring the user to specify the PML module if they want something other than the default OB1.
>>>>>>>>>>>>>> In this case, there can be no confusion over what each proc is to use. The CM module will attempt to init the MTL - if it cannot do so, then the job will return the correct error and tell the user that CM/MTL support is unavailable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. we could again eliminate the info by not inserting it into the modex if (a) the default PML module is selected, or (b) the user specified the PML module to be used. In the first case, each proc can simply check to see if they picked the default - if not, then we can insert the info to indicate the difference. Thus, in the "standard" case, no info will be inserted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the second case, we will already get an error if the specified PML module could not be used. Hence, the modex check provides no additional info or value.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I understand the motivation to support automation. However, in this case, the automation actually doesn't seem to buy us very much, and it isn't coming "free". So perhaps some change in how this is done would be in order?
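
To make options 1 and 3 from the original proposal concrete, here is a small simulation of which ranks would publish a PML name and which would detect a mismatch. It is a sketch, not Open MPI code: the function names are hypothetical, a job is modelled as a plain list of per-rank selections, and the modex is modelled as a dictionary keyed by rank.

```python
DEFAULT_PML = "ob1"


def option1_published(selections):
    """Option 1: only rank 0 publishes its PML name in the modex."""
    return {0: selections[0]}


def option1_mismatches(selections, published):
    """Ranks whose own selection differs from the one rank 0 published."""
    reference = published[0]
    return [rank for rank, pml in enumerate(selections) if pml != reference]


def option3_published(selections, user_specified_pml):
    """Option 3: publish nothing if the user chose the PML; otherwise only
    ranks that did not pick the default publish their selection."""
    if user_specified_pml:
        return {}
    return {rank: pml for rank, pml in enumerate(selections)
            if pml != DEFAULT_PML}


# Flaky-machine scenario from the thread: rank 2 fell back to ob1.
selections = ["cm", "cm", "ob1", "cm"]
published = option1_published(selections)
print(option1_mismatches(selections, published))  # prints [2]
print(option3_published(["ob1"] * 4, user_specified_pml=False))  # prints {}
```

In the sketch, option 1 keeps the mismatch detection with a single published entry, while option 3 publishes nothing in the standard all-default case.
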
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> --
>>>>> Jeff Squyres
>>>>> Cisco Systems

> --
> Jeff Squyres
> Cisco Systems