Forgot to mention this tip for debugging paffinity:
There is a test module in the paffinity framework. The module has mca params
that let you define the number of sockets/node (default: 4) and the
#cores/socket (also default: 4). So by setting -mca paffinity test and
adjusting those two parameters, you can test a fairly wide range of
configurations without being constrained by available hardware.
Get the param names with: ompi_info --param paffinity test
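
For example (the param names below are illustrative - verify the exact names
with the ompi_info command above):

  mpirun -n 16 -mca paffinity test \
         -mca paffinity_test_num_sockets 2 \
         -mca paffinity_test_num_cores 6 \
         -bind-to-socket -bysocket hostname

This would exercise the bind-to-socket logic as if each node had 2 sockets
with 6 cores apiece, regardless of the hardware you are actually running on.
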
Any contributions to that module that extend its range are welcome.
Ralph
On Apr 16, 2010, at 7:59 PM, Ralph Castain wrote:
> Well, I guess I got sucked back into paffinity again...sigh.
>
> I have committed a solution to this issue in r22984 and r22985. I have tested
> it against a range of scenarios, but hardly an exhaustive one. So please do
> stress it.
>
> The following comments are by no means intended as criticism, but rather as
> me taking advantage of an opportunity to educate the general community
> regarding this topic. Since my available time to maintain this capability has
> diminished, the more people who understand all of its nuances, the more likely
> we are to execute changes such as this one efficiently.
>
> I couldn't use the provided patch for several reasons:
>
> * you cannot print a message out of an odls module after the fork occurs
> unless you also report an error - i.e., you cannot print a message out as a
> warning and then continue processing. If you do so, everything will appear
> correct when you are operating in an environment where no processes are local
> to mpirun - e.g., when running under slurm as it is normally configured.
> However, when processes are local to mpirun, then using orte_show_help after
> the fork causes the print statement to occur in separate process instances.
> This prevents mpirun from realizing that multiple copies of the message are
> being printed, and thus it cannot aggregate them.
>
> As a result, the provided patch generated one warning for every local
> process, plus one aggregated warning for all the remote processes. This isn't
> what we wanted users to see.
>
> The correct solution was to write an integer indicating the warning to be
> issued back to the parent process, and then let that process output the
> actual warning. This allows mpirun to aggregate the result.
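>
> To make that pattern concrete, here is a generic, self-contained sketch of
> the idea in plain POSIX C (for illustration only - it is not the actual odls
> code, and the warning code and message here are made up):
>
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <unistd.h>
>
> #define WARN_BIND_NOOP 1  /* hypothetical code meaning "binding was a no-op" */
>
> int main(void)
> {
>     int fd[2], code = 0;
>     pid_t pid;
>
>     if (0 != pipe(fd)) return 1;
>
>     if (0 == (pid = fork())) {            /* child: would exec the MPI proc */
>         close(fd[0]);
>         code = WARN_BIND_NOOP;            /* report the condition back...   */
>         if (write(fd[1], &code, sizeof(code)) < 0) {
>             _exit(1);                     /* ...but do NOT print it here    */
>         }
>         close(fd[1]);
>         _exit(0);
>     }
>
>     close(fd[1]);                         /* parent: the only one that prints */
>     if ((ssize_t)sizeof(code) == read(fd[0], &code, sizeof(code)) &&
>         WARN_BIND_NOOP == code) {
>         fprintf(stderr, "warning: the requested binding was a no-op\n");
>     }
>     close(fd[0]);
>     waitpid(pid, NULL, 0);
>     return 0;
> }
>
> In the real code, the parent-side print would of course go through
> orte_show_help so that mpirun can aggregate identical messages.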
>
> Nadia wasn't the only one to make this mistake. I found that I had also made
> it in an earlier revision when reporting the "could not bind" message. So it
> is an easy mistake to make, but one we need to avoid.
>
> * it didn't address the full range of binding scenarios - it only addressed
> bind-to-socket. While I know that solved Nadia's immediate problem, it helps
> if we try to address the broader issue when making such changes. Otherwise,
> we wind up with a piecemeal approach to the problem. So I added support for
> all the binding methods in the odls_default module.
>
> * it missed the use-case where processes are launched outside of mpirun with
> paffinity_alone or slot-list set - e.g., when direct-launching processes in
> slurm. In this case, MPI_Init actually attempts to set the process affinity -
> the odls is never called.
>
> Here is why it is important to remember that use-case. While implementing the
> warning message there, I discovered that the code in ompi_mpi_init.c would
> actually deal with Nadia's scenario incorrectly. It would identify the
> process as unbound because it had been "bound" to all available processors.
> Since paffinity_alone is set, it would then automatically bind the
> process to a single core based on that process' node rank.
>
> So even though the odls had "bound" us to the socket, mpi_init would turn
> around and bind us to a core - which is not at all what Nadia wanted to have
> happen.
>
> The solution here was to pass a parameter to the spawned process indicating
> that mpirun had "bound" it, even if the "binding" was a no-op. This value is
> then checked in mpi_init - if set, mpi_init makes no attempt to re-bind the
> process. If not set, then mpi_init is free to do whatever it deems
> appropriate.
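>
> As a rough sketch of what that check looks like on the MPI_Init side (the
> environment variable name below is purely illustrative - the real parameter
> used in r22984/r22985 may be named differently):
>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(void)
> {
>     /* hypothetical name: mpirun would put something like this in the
>      * child's environment once it has handled the binding, even when
>      * that "binding" was a no-op */
>     if (NULL != getenv("OMPI_MCA_orte_bound_at_launch")) {
>         printf("launcher handled binding - MPI_Init will not re-bind\n");
>     } else {
>         /* direct launch (e.g. srun): MPI_Init is free to bind the proc
>          * itself, e.g. per paffinity_alone and the proc's node rank */
>         printf("no launcher binding - MPI_Init may bind as it sees fit\n");
>     }
>     return 0;
> }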
>
> So looking at all the use-cases can expose some unintended interactions.
> Unfortunately, I suspect that many people are unaware of this second method
> of setting affinity, and so wouldn't realize that their intended actions were
> not getting the desired result.
>
> Again, no criticism intended here. Hopefully, the above explanation will help
> future changes!
> Ralph
>
>
> On Apr 13, 2010, at 5:34 AM, Nadia Derbey wrote:
>
>> On Tue, 2010-04-13 at 01:27 -0600, Ralph Castain wrote:
>>> On Apr 13, 2010, at 1:02 AM, Nadia Derbey wrote:
>>>
>>>> On Mon, 2010-04-12 at 10:07 -0600, Ralph Castain wrote:
>>>>> By definition, if you bind to all available cpus in the OS, you are
>>>>> bound to nothing (i.e., "unbound") as your process runs on any
>>>>> available cpu.
>>>>>
>>>>>
>>>>> PLPA doesn't care, and I personally don't care. I was just explaining
>>>>> why it generates an error in the odls.
>>>>>
>>>>>
>>>>> A user app would detect its binding by (a) getting the affinity mask
>>>>> from the OS, and then (b) seeing if the bits are set to '1' for all
>>>>> available processors. If they are, then you are not bound - there is no
>>>>> mechanism available for checking "are the bits set only for the
>>>>> processors I asked to be bound to". The OS doesn't track what you
>>>>> asked for, it only tracks where you are bound - and a mask with all
>>>>> '1's is defined as "unbound".
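>>>>>
>>>>> As a plain-Linux illustration of that check (using sched_getaffinity
>>>>> directly rather than OMPI's paffinity/PLPA layer - just to show the
>>>>> idea, not the actual OMPI code):
>>>>>
>>>>> #define _GNU_SOURCE
>>>>> #include <sched.h>
>>>>> #include <stdio.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> int main(void)
>>>>> {
>>>>>     cpu_set_t mask;
>>>>>     long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
>>>>>
>>>>>     if (0 != sched_getaffinity(0, sizeof(mask), &mask)) {
>>>>>         perror("sched_getaffinity");
>>>>>         return 1;
>>>>>     }
>>>>>     /* a mask covering every available processor reads as "unbound" */
>>>>>     if (CPU_COUNT(&mask) == ncpus) {
>>>>>         printf("unbound (all %ld cpus allowed)\n", ncpus);
>>>>>     } else {
>>>>>         printf("bound to %d of %ld cpus\n", CPU_COUNT(&mask), ncpus);
>>>>>     }
>>>>>     return 0;
>>>>> }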
>>>>>
>>>>>
>>>>> So the reason for my question was simple: a user asked us to "bind"
>>>>> their process. If their process checks to see if it is bound, it will
>>>>> return "no". The user would therefore be led to believe that OMPI had
>>>>> failed to execute their request, when in fact we did execute it - but
>>>>> the result was (as Nadia says) a "no-op".
>>>>>
>>>>>
>>>>> After talking with Jeff, I think he has the right answer. It is a
>>>>> method we have used elsewhere, so it isn't unexpected behavior.
>>>>> Basically, he proposed that we use an mca param to control this
>>>>> behavior:
>>>>>
>>>>>
>>>>> * default: generate an error message as the "bind" results in a no-op,
>>>>> and this is our current behavior
>>>>>
>>>>>
>>>>> * warn: generate a warning that the binding wound up being a "no-op",
>>>>> but continue working
>>>>>
>>>>>
>>>>> * quiet: just ignore it and keep going
>>>>
>>>> Excellent, I completely agree (though I would have put the 2nd star as
>>>> the default behavior, but never mind, I don't want to restart the
>>>> discussion ;-) )
>>>
>>> I actually went back/forth on that as well - I personally think it might be
>>> better to just have warn and quiet, with warn being the default. The
>>> warning could be generated with orte_show_help so the messages would be
>>> consolidated across nodes. Given that the enhanced paffinity behavior is
>>> fairly new, and that no-one has previously raised this issue, I don't think
>>> the prior behavior is relevant.
>>>
>>> Would that make sense? If so, we could extend that to the other binding
>>> options for consistency.
>>
>> Sure!
>>
>> Patch proposal attached.
>>
>> Regards,
>> Nadia
>>>
>>>>
>>>> Also this is a good opportunity to fix the other issue I talked about in
>>>> the first message in this thread: the tag
>>>> "odls-default:could-not-bind-to-socket" does not exist in
>>>> orte/mca/odls/default/help-odls-default.txt
>>>
>>> I'll take that one - my fault for missing it. I'll cross-check the other
>>> messages as well. Thanks for catching it!
>>>
>>> As for your other change: let me think on it. I -think- I understand your
>>> logic, but honestly haven't had time to really walk through it properly.
>>> Got an ORCM deadline to meet, but hope to break free towards the end of
>>> this week.
>>>
>>>
>>>>
>>>> Regards,
>>>> Nadia
>>>>>
>>>>>
>>>>> Fairly trivial to implement, and Bull could set the default mca param
>>>>> file to "quiet" to get what they want. I'm not sure if that's what the
>>>>> community wants or not - like I said, it makes no diff to me so long
>>>>> as the code logic is understandable.
>>>>>
>>>>>
>>>>>
>>>>> On Apr 12, 2010, at 8:27 AM, Terry Dontje wrote:
>>>>>
>>>>>> Ralph, I guess I am curious why, if there is only one socket, we cannot
>>>>>> bind to it. Does plpa actually error on this, or is this a condition we
>>>>>> decided was an error at the odls level?
>>>>>>
>>>>>> I am somewhat torn on whether this makes sense. On the one hand, the
>>>>>> result is definitely useless if you allow it. On the other hand, if you
>>>>>> don't allow it and you have a script or are running tests on multiple
>>>>>> systems, it would be nice to have this run, because you are not really
>>>>>> running into a resource starvation issue.
>>>>>>
>>>>>> At a minimum I think the error condition/message needs to be spelled
>>>>>> out (defined). As to whether we allow binding when only one socket
>>>>>> exists, I could go either way, leaning slightly towards allowing such a
>>>>>> specification to work.
>>>>>>
>>>>>> --td
>>>>>>
>>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> Guess I'll jump in here as I finally had a few minutes to look at the
>>>>>>> code and think about your original note. In fact, I believe your
>>>>>>> original statement is the source of contention.
>>>>>>>
>>>>>>> If someone tells us -bind-to-socket, but there is only one socket, then
>>>>>>> we really cannot bind them to anything. Any check by their code would
>>>>>>> reveal that they had not, in fact, been bound - raising questions as to
>>>>>>> whether or not OMPI is performing the request. To avoid that kind of
>>>>>>> confusion, our operating standard has been to error out if the user
>>>>>>> specifies something we cannot do. This is what generated the code in the
>>>>>>> system today.
>>>>>>>
>>>>>>> Now I can see an argument that -bind-to-socket with one socket maybe
>>>>>>> shouldn't generate an error, but that decision then has to get
>>>>>>> reflected in other code areas as well.
>>>>>>>
>>>>>>> As for the test you cite - it actually performs a valuable function
>>>>>>> and was added to catch specific scenarios. In particular, if you follow
>>>>>>> the code flow up just a little, you will see that it is possible to
>>>>>>> complete the loop without ever actually setting a bit in the mask. This
>>>>>>> happens when none of the cpus in that socket have been assigned to us
>>>>>>> via an external bind. People actually use that as a means of
>>>>>>> suballocating nodes, so the test needs to be there. Again, if the user
>>>>>>> said "bind to socket", but none of that socket's cores are assigned for
>>>>>>> our use, that is an error.
>>>>>>>
>>>>>>> I haven't looked at your specific fix, but I agree with Terry's
>>>>>>> question. It seems to me that whether or not we were externally bound
>>>>>>> is irrelevant. Even if the overall result is what you want, I think a
>>>>>>> more logically understandable test would help others reading the code.
>>>>>>>
>>>>>>> But first we need to resolve the question: should this scenario return
>>>>>>> an error or not?
>>>>>>>
>>>>>>>
>>>>>>> On Apr 12, 2010, at 1:43 AM, Nadia Derbey wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
>>>>>>>>
>>>>>>>>> Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>>> Okay, just wanted to ensure everyone was working from the same base
>>>>>>>>>> code.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Terry, Brad: you might want to look this proposed change over.
>>>>>>>>>> Something doesn't quite look right to me, but I haven't really
>>>>>>>>>> walked through the code to check it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> At first blush I don't really get the usage of orte_odls_globals.bound
>>>>>>>>> in your patch. It would seem to me that the insertion of that
>>>>>>>>> conditional would prevent the check it surrounds from being done when
>>>>>>>>> the process has not been bound prior to startup, which is a common case.
>>>>>>>>>
>>>>>>>> Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path
>>>>>>>> (odls_default_fork_local_proc() in odls_default_module.c):
>>>>>>>>
>>>>>>>> <set target_socket depending on the desired mapping>
>>>>>>>> <set my paffinity mask to 0> (line 715)
>>>>>>>> <for each core in the socket> {
>>>>>>>> <get the associated phys_core>
>>>>>>>> <get the associated phys_cpu>
>>>>>>>> <if we are bound (orte_odls_globals.bound)> {
>>>>>>>> <if phys_cpu does not belong to the cpus I'm bound to>
>>>>>>>> continue
>>>>>>>> }
>>>>>>>> <set phys_cpu bit in my affinity mask>
>>>>>>>> }
>>>>>>>> <check if something is set in my affinity mask>
>>>>>>>> ...
>>>>>>>>
>>>>>>>>
>>>>>>>> What I'm saying is that the only way to have nothing set in the
>>>>>>>> affinity
>>>>>>>> mask (which would justify the last test) is to have never called the
>>>>>>>> <set phys_cpu in my affinity mask> instruction. This means:
>>>>>>>> . the test on orte_odls_globals.bound is true, and
>>>>>>>> . <continue> is called for all the cores in the socket.
>>>>>>>>
>>>>>>>> In the unbound case, on the other hand, we are checking whether we have
>>>>>>>> set one or more bits in a mask right after having actually set them:
>>>>>>>> don't you think that's useless?
>>>>>>>>
>>>>>>>> That's why I'm suggesting to call the last check only if
>>>>>>>> orte_odls_globals.bound is true.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Nadia
>>>>>>>>
>>>>>>>>> --td
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Nadia Derbey wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Just to check: is this with the latest trunk? Brad and Terry have
>>>>>>>>>>>>> been making changes to this section of code, including modifying
>>>>>>>>>>>>> the PROCESS_IS_BOUND test...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> Well, it was on the v1.5. But I just checked: looks like
>>>>>>>>>>>> 1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
>>>>>>>>>>>> odls_default_fork_local_proc()
>>>>>>>>>>>> 2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
>>>>>>>>>>>>
>>>>>>>>>>>> But, I'll give it a try with the latest trunk.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Nadia
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> The changes I've done do not touch
>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND at all. Also, I am only touching
>>>>>>>>>>> code related to the "bind-to-core" option, so I really doubt that my
>>>>>>>>>>> changes are causing issues here.
>>>>>>>>>>>
>>>>>>>>>>> --td
>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am facing a problem with a test that runs fine on some nodes,
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> fails on others.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a heterogeneous cluster, with 3 types of nodes:
>>>>>>>>>>>>>> 1) single socket, 4 cores
>>>>>>>>>>>>>> 2) 2 sockets, 4 cores per socket
>>>>>>>>>>>>>> 3) 2 sockets, 6 cores per socket
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am using:
>>>>>>>>>>>>>> . salloc to allocate the nodes,
>>>>>>>>>>>>>> . mpirun binding/mapping options "-bind-to-socket -bysocket"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This command fails if the allocated node is of type #1 (single
>>>>>>>>>>>>>> socket, 4 cpus).
>>>>>>>>>>>>>> BTW, in that case orte_show_help is referencing a tag
>>>>>>>>>>>>>> ("could-not-bind-to-socket") that does not exist in
>>>>>>>>>>>>>> help-odls-default.txt.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It succeeds, however, when run on nodes of type #2 or #3.
>>>>>>>>>>>>>> I think a "bind to socket" should not return an error on a
>>>>>>>>>>>>>> single-socket machine, but rather be a noop.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem comes from the test
>>>>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>>>>>>>>>>>>>> called in odls_default_fork_local_proc() after the binding to the
>>>>>>>>>>>>>> target socket has been done:
>>>>>>>>>>>>>> ========
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>> OPAL_PAFFINITY_CPU_ZERO(mask);
>>>>>>>>>>>>>> for (n=0; n < orte_default_num_cores_per_socket; n++) {
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>> OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> /* if we did not bind it anywhere, then that is an error */
>>>>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>>>>>>>>>>>>>> if (!bound) {
>>>>>>>>>>>>>> orte_show_help("help-odls-default.txt",
>>>>>>>>>>>>>> "odls-default:could-not-bind-to-socket", true);
>>>>>>>>>>>>>> ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> ========
>>>>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits
>>>>>>>>>>>>>> set in the mask *AND* the number of bits set is less than the number
>>>>>>>>>>>>>> of cpus on the machine. Thus, on a single-socket, 4-core machine the
>>>>>>>>>>>>>> test will fail, while on the other kinds of machines it will succeed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Again, I think the problem could be solved by changing the algorithm
>>>>>>>>>>>>>> and treating ORTE_BIND_TO_SOCKET on a single-socket machine as a
>>>>>>>>>>>>>> noop.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Another solution could be to call the test
>>>>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we
>>>>>>>>>>>>>> are bound (orte_odls_globals.bound). Actually, that is the only case
>>>>>>>>>>>>>> where I see a justification for this test (see attached patch).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And maybe both solutions could be combined.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Nadia
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nadia Derbey <[email protected]>
>>>>>>>>>>>>>> <001_fix_process_binding_test.patch>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Terry D. Dontje | Principal Software Engineer
>>>>>>>>>>> Developer Tools Engineering | +1.650.633.7054
>>>>>>>>>>> Oracle - Performance Technologies
>>>>>>>>>>> 95 Network Drive, Burlington, MA 01803
>>>>>>>>>>> Email [email protected]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Oracle
>>>>>>>>> Terry D. Dontje | Principal Software Engineer
>>>>>>>>> Developer Tools Engineering | +1.650.633.7054
>>>>>>>>> Oracle - Performance Technologies
>>>>>>>>> 95 Network Drive, Burlington, MA 01803
>>>>>>>>> Email [email protected]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Nadia Derbey <[email protected]>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Terry D. Dontje | Principal Software Engineer
>>>>>> Developer Tools Engineering | +1.650.633.7054
>>>>>> Oracle - Performance Technologies
>>>>>> 95 Network Drive, Burlington, MA 01803
>>>>>> Email [email protected]
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> --
>>>> Nadia Derbey <[email protected]>
>>>>
>>>
>>>
>>>
>> --
>> Nadia Derbey <[email protected]>
>> <003_bind_to_socket_on_single_socket.patch>
>