Forgot to mention this tip for debugging paffinity: there is a test module in the paffinity framework. The module has MCA params that let you define the number of sockets per node (default: 4) and the number of cores per socket (also default: 4). So by setting -mca paffinity test and adjusting those two parameters, you can test a fairly wide range of configurations without being constrained by the available hardware.
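For example, to emulate a 2-socket/6-core node on whatever machine you happen to be sitting at (the param names below are illustrative only - use the ompi_info command below to get the exact names):

    # fake 2 sockets x 6 cores/socket, regardless of the real hardware
    # (param names shown here are illustrative - check ompi_info for the real ones)
    mpirun -np 12 -mca paffinity test \
           -mca paffinity_test_num_sockets 2 \
           -mca paffinity_test_num_cores 6 \
           -bysocket -bind-to-socket ./my_app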
Get the param names with:

    ompi_info --param paffinity test

Any contributions to that module that extend its range are welcome.

Ralph

On Apr 16, 2010, at 7:59 PM, Ralph Castain wrote:

> Well, I guess I got sucked back into paffinity again...sigh.
>
> I have committed a solution to this issue in r22984 and r22985. I have tested it against a range of scenarios, but hardly an exhaustive test. So please do stress it.
>
> The following comments are by no means intended as criticism, but rather as me taking advantage of an opportunity to educate the general community regarding this topic. Since my available time to maintain this capability has diminished, the more people who understand all the nuances of it, the more likely we are to efficiently execute changes such as this one.
>
> I couldn't use the provided patch for several reasons:
>
> * You cannot print a message out of an odls module after the fork occurs unless you also report an error - i.e., you cannot print a message out as a warning and then continue processing. If you do so, everything will appear correct when you are operating in an environment where no processes are local to mpirun - e.g., when running under slurm as it is normally configured. However, when processes are local to mpirun, then using orte_show_help after the fork causes the print statement to occur in separate process instances. This prevents mpirun from realizing that multiple copies of the message are being printed, and thus it cannot aggregate them.
>
> As a result, the provided patch generated one warning for every local process, plus one aggregated warning for all the remote processes. This isn't what we wanted users to see.
>
> The correct solution was to write an integer back to the parent process indicating the warning to be issued, and then let that process output the actual warning. This allows mpirun to aggregate the result.
>
> Nadia wasn't the only one to make this mistake. I found that I had also made it in an earlier revision when reporting the "could not bind" message. So it is an easy mistake to make, but one we need to avoid.
>
> * It didn't address the full range of binding scenarios - it only addressed bind-to-socket. While I know that solved Nadia's immediate problem, it helps if we try to address the broader issue when making such changes. Otherwise, we wind up with a piecemeal approach to the problem. So I added support for all the binding methods in the odls_default module.
>
> * It missed the use-case where processes are launched outside of mpirun with paffinity_alone or slot-list set - e.g., when direct-launching processes in slurm. In this case, MPI_Init actually attempts to set the process affinity - the odls is never called.
>
> Here is why it is important to remember that use-case. While implementing the warning message there, I discovered that the code in ompi_mpi_init.c would actually deal with Nadia's scenario incorrectly. It would identify the process as unbound because it had been "bound" to all available processors. Since paffinity_alone is set, it would then have automatically bound the process to a single core based on that process' node rank.
>
> So even though the odls had "bound" us to the socket, mpi_init would turn around and bind us to a core - which is not at all what Nadia wanted to have happen.
>
> The solution here was to pass a parameter to the spawned process indicating that mpirun had "bound" it, even if the "binding" was a no-op. This value is then checked in mpi_init - if set, mpi_init makes no attempt to re-bind the process. If not set, then mpi_init is free to do whatever it deems appropriate.
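> Roughly, that handshake has the following shape (a sketch only - the flag name below is hypothetical, and the actual parameter added in r22984/r22985 may well be spelled and plumbed differently):
>
>     #include <stdlib.h>
>
>     /* hypothetical flag name - for illustration only */
>     #define BOUND_BY_MPIRUN_FLAG "OMPI_BOUND_AT_LAUNCH"
>
>     /* odls side, in the child just before the exec: record that mpirun
>      * already dealt with the binding, even when that "binding" was a no-op */
>     static void mark_bound_by_mpirun(void)
>     {
>         setenv(BOUND_BY_MPIRUN_FLAG, "1", 1);
>     }
>
>     /* ompi_mpi_init side: if the flag is present, make no attempt to
>      * re-bind, even if paffinity_alone or a slot-list is set */
>     static int bound_by_mpirun(void)
>     {
>         return NULL != getenv(BOUND_BY_MPIRUN_FLAG);
>     }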
> So looking at all the use-cases can expose some unintended interactions. Unfortunately, I suspect that many people are unaware of this second method of setting affinity, and so wouldn't realize that their intended actions were not getting the desired result.
>
> Again, no criticism intended here. Hopefully, the above explanation will help future changes!
>
> Ralph
>
> On Apr 13, 2010, at 5:34 AM, Nadia Derbey wrote:
>
>> On Tue, 2010-04-13 at 01:27 -0600, Ralph Castain wrote:
>>
>>> On Apr 13, 2010, at 1:02 AM, Nadia Derbey wrote:
>>>
>>>> On Mon, 2010-04-12 at 10:07 -0600, Ralph Castain wrote:
>>>>
>>>>> By definition, if you bind to all available cpus in the OS, you are bound to nothing (i.e., "unbound"), as your process runs on any available cpu.
>>>>>
>>>>> PLPA doesn't care, and I personally don't care. I was just explaining why it generates an error in the odls.
>>>>>
>>>>> A user app would detect its binding by (a) getting the affinity mask from the OS, and then (b) seeing if the bits are set to '1' for all available processors. If they are, then you are not bound - there is no mechanism available for checking "are the bits set only for the processors I asked to be bound to". The OS doesn't track what you asked for, it only tracks where you are bound - and a mask with all '1's is defined as "unbound".
>>>>>
>>>>> So the reason for my question was simple: a user asked us to "bind" their process. If their process checks to see if it is bound, it will return "no". The user would therefore be led to believe that OMPI had failed to execute their request, when in fact we did execute it - but the result was (as Nadia says) a "no-op".
>>>>>
>>>>> After talking with Jeff, I think he has the right answer. It is a method we have used elsewhere, so it isn't unexpected behavior. Basically, he proposed that we use an MCA param to control this behavior:
>>>>>
>>>>> * default: generate an error message as the "bind" results in a no-op - this is our current behavior
>>>>>
>>>>> * warn: generate a warning that the binding wound up being a "no-op", but continue working
>>>>>
>>>>> * quiet: just ignore it and keep going
>>>>
>>>> Excellent, I completely agree (though I would have put the 2nd star as the default behavior, but never mind, I don't want to restart the discussion ;-) )
>>>
>>> I actually went back/forth on that as well - I personally think it might be better to just have warn and quiet, with warn being the default. The warning could be generated with orte_show_help so the messages would be consolidated across nodes. Given that the enhanced paffinity behavior is fairly new, and that no-one has previously raised this issue, I don't think the prior behavior is relevant.
>>>
>>> Would that make sense? If so, we could extend that to the other binding options for consistency.
>>
>> Sure!
>>
>> Patch proposal attached.
>>
>> Regards,
>> Nadia
>>
>>>> Also this is a good opportunity to fix the other issue I talked about in the first message in this thread: the tag "odls-default:could-not-bind-to-socket" does not exist in orte/mca/odls/default/help-odls-default.txt
>>>
>>> I'll take that one - my fault for missing it.
>>> I'll cross-check the other messages as well. Thanks for catching it!
>>>
>>> As for your other change: let me think on it. I -think- I understand your logic, but honestly haven't had time to really walk through it properly. Got an ORCM deadline to meet, but hope to break free towards the end of this week.
>>>
>>>> Regards,
>>>> Nadia
>>>>
>>>>> Fairly trivial to implement, and Bull could set the default MCA param file to "quiet" to get what they want. I'm not sure if that's what the community wants or not - like I said, it makes no diff to me so long as the code logic is understandable.
>>>>>
>>>>> On Apr 12, 2010, at 8:27 AM, Terry Dontje wrote:
>>>>>
>>>>>> Ralph, I guess I am curious: why is it that if there is only one socket we cannot bind to it? Does PLPA actually error on this, or is this a condition we decided was an error at the odls level?
>>>>>>
>>>>>> I am somewhat torn on whether this makes sense. On the one hand, the result is definitely useless if you allow it. However, if you don't allow it and you have a script or are running tests on multiple systems, it would be nice to have this run, because you are not really running into a resource starvation issue.
>>>>>>
>>>>>> At a minimum I think the error condition/message needs to be spelled out (defined). As to whether we allow binding when only one socket exists, I could go either way, slightly leaning towards allowing such a specification to work.
>>>>>>
>>>>>> --td
>>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> Guess I'll jump in here as I finally had a few minutes to look at the code and think about your original note. In fact, I believe your original statement is the source of contention.
>>>>>>>
>>>>>>> If someone tells us -bind-to-socket, but there is only one socket, then we really cannot bind them to anything. Any check by their code would reveal that they had not, in fact, been bound - raising questions as to whether or not OMPI is performing the request. Our operating standard has been to error out if the user specifies something we cannot do, to avoid that kind of confusion. This is what generated the code in the system today.
>>>>>>>
>>>>>>> Now I can see an argument that -bind-to-socket with one socket maybe shouldn't generate an error, but that decision then has to get reflected in other code areas as well.
>>>>>>>
>>>>>>> As for the test you cite - it actually performs a valuable function and was added to catch specific scenarios. In particular, if you follow the code flow up just a little, you will see that it is possible to complete the loop without ever actually setting a bit in the mask. This happens when none of the cpus in that socket have been assigned to us via an external bind. People actually use that as a means of suballocating nodes, so the test needs to be there. Again, if the user said "bind to socket", but none of that socket's cores are assigned for our use, that is an error.
>>>>>>>
>>>>>>> I haven't looked at your specific fix, but I agree with Terry's question. It seems to me that whether or not we were externally bound is irrelevant. Even if the overall result is what you want, I think a more logically understandable test would help others reading the code.
>>>>>>>
>>>>>>> But first we need to resolve the question: should this scenario return an error or not?
>>>>>>>
>>>>>>> On Apr 12, 2010, at 1:43 AM, Nadia Derbey wrote:
>>>>>>>
>>>>>>>> On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
>>>>>>>>
>>>>>>>>> Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>>> Okay, just wanted to ensure everyone was working from the same base code.
>>>>>>>>>>
>>>>>>>>>> Terry, Brad: you might want to look this proposed change over. Something doesn't quite look right to me, but I haven't really walked through the code to check it.
>>>>>>>>>
>>>>>>>>> At first blush I don't really get the usage of orte_odls_globals.bound in your patch. It would seem to me that the insertion of that conditional would prevent the check it surrounds from being done when the process has not been bound prior to startup, which is a common case.
>>>>>>>>
>>>>>>>> Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path (odls_default_fork_local_proc() in odls_default_module.c):
>>>>>>>>
>>>>>>>> <set target_socket depending on the desired mapping>
>>>>>>>> <set my paffinity mask to 0> (line 715)
>>>>>>>> <for each core in the socket> {
>>>>>>>>     <get the associated phys_core>
>>>>>>>>     <get the associated phys_cpu>
>>>>>>>>     <if we are bound (orte_odls_globals.bound)> {
>>>>>>>>         <if phys_cpu does not belong to the cpus I'm bound to>
>>>>>>>>             continue
>>>>>>>>     }
>>>>>>>>     <set phys_cpu bit in my affinity mask>
>>>>>>>> }
>>>>>>>> <check if something is set in my affinity mask>
>>>>>>>> ...
>>>>>>>>
>>>>>>>> What I'm saying is that the only way to have nothing set in the affinity mask (which would justify the last test) is to have never called the <set phys_cpu bit in my affinity mask> instruction. This means:
>>>>>>>> . the test on orte_odls_globals.bound is true
>>>>>>>> . <continue> is called for all the cores in the socket
>>>>>>>>
>>>>>>>> In the other path, what we are doing is checking whether we have set one or more bits in a mask after having actually set them: don't you think that's useless?
>>>>>>>>
>>>>>>>> That's why I'm suggesting to do the last check only if orte_odls_globals.bound is true.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Nadia
>>>>>>>>
>>>>>>>>> --td
>>>>>>>>>
>>>>>>>>>> On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
>>>>>>>>>>
>>>>>>>>>>> Nadia Derbey wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Just to check: is this with the latest trunk? Brad and Terry have been making changes to this section of code, including modifying the PROCESS_IS_BOUND test...
>>>>>>>>>>>>
>>>>>>>>>>>> Well, it was on v1.5. But I just checked: looks like
>>>>>>>>>>>> 1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in odls_default_fork_local_proc()
>>>>>>>>>>>> 2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
>>>>>>>>>>>>
>>>>>>>>>>>> But I'll give it a try with the latest trunk.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Nadia
>>>>>>>>>>>
>>>>>>>>>>> The changes I've done do not touch OPAL_PAFFINITY_PROCESS_IS_BOUND at all.
>>>>>>>>>>> Also, I am only touching code related to the "bind-to-core" option, so I really doubt that my changes are causing the issues here.
>>>>>>>>>>>
>>>>>>>>>>> --td
>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am facing a problem with a test that runs fine on some nodes, and fails on others.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a heterogeneous cluster, with 3 types of nodes:
>>>>>>>>>>>>>> 1) single socket, 4 cores
>>>>>>>>>>>>>> 2) 2 sockets, 4 cores per socket
>>>>>>>>>>>>>> 3) 2 sockets, 6 cores per socket
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am using:
>>>>>>>>>>>>>> . salloc to allocate the nodes,
>>>>>>>>>>>>>> . the mpirun binding/mapping options "-bind-to-socket -bysocket"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This command fails if the allocated node is of type #1 (single socket / 4 cpus), while it succeeds on nodes of type #2 or #3. BTW, in the failing case orte_show_help references a tag ("could-not-bind-to-socket") that does not exist in help-odls-default.txt.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think a "bind to socket" should not return an error on a single-socket machine, but rather be a no-op.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem comes from the test OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound); called in odls_default_fork_local_proc() after the binding to the socket's processors has been done:
>>>>>>>>>>>>>> ========
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>> OPAL_PAFFINITY_CPU_ZERO(mask);
>>>>>>>>>>>>>> for (n=0; n < orte_default_num_cores_per_socket; n++) {
>>>>>>>>>>>>>>     <snip>
>>>>>>>>>>>>>>     OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> /* if we did not bind it anywhere, then that is an error */
>>>>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>>>>>>>>>>>>>> if (!bound) {
>>>>>>>>>>>>>>     orte_show_help("help-odls-default.txt",
>>>>>>>>>>>>>>                    "odls-default:could-not-bind-to-socket", true);
>>>>>>>>>>>>>>     ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> ========
>>>>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits set in the mask *AND* the number of bits set is less than the number of cpus on the machine. Thus on a single-socket, 4-core machine the test will fail, while on the other kinds of machines it will succeed (the small stand-alone model below makes this concrete).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Again, I think the problem could be solved by changing the algorithm and assuming that ORTE_BIND_TO_SOCKET on a single-socket machine = no-op.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Another solution could be to call the OPAL_PAFFINITY_PROCESS_IS_BOUND() test at the end of the loop only if we are already bound (orte_odls_globals.bound). Actually, that is the only case where I see a justification for this test (see attached patch).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And maybe both solutions could be mixed.
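>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To make the failure mode concrete, here is a tiny stand-alone model of those semantics (not the real OPAL macro - just the check it boils down to, with an illustrative helper name):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #include <stdbool.h>
>>>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /* illustrative stand-in: "bound" = at least one bit set, but fewer than all cpus */
>>>>>>>>>>>>>> static bool model_is_bound(unsigned mask, int ncpus)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>     int nset = 0;
>>>>>>>>>>>>>>     for (int i = 0; i < ncpus; i++) {
>>>>>>>>>>>>>>         if (mask & (1u << i)) nset++;
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>     return 0 < nset && nset < ncpus;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> int main(void)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>     /* node type #1: 1 socket x 4 cores, "bound" to the whole socket (mask 0xf) */
>>>>>>>>>>>>>>     printf("type #1: %d\n", model_is_bound(0xf, 4));  /* 0 -> "unbound" -> error path */
>>>>>>>>>>>>>>     /* node type #2: 2 sockets x 4 cores, bound to one socket (mask 0xf) */
>>>>>>>>>>>>>>     printf("type #2: %d\n", model_is_bound(0xf, 8));  /* 1 -> considered bound */
>>>>>>>>>>>>>>     return 0;
>>>>>>>>>>>>>> }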
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Nadia
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nadia Derbey <nadia.der...@bull.net>
>>>>>>>>>>>>>> <001_fix_process_binding_test.patch>
>>
>> --
>> Nadia Derbey <nadia.der...@bull.net>
>> <003_bind_to_socket_on_single_socket.patch>