Well, I guess I got sucked back into paffinity again...sigh.

I have committed a solution to this issue in r22984 and r22985. I have tested 
it against a range of scenarios, though hardly an exhaustive set, so please do 
stress it.

The following comments are by no means intended as criticism, but rather as me 
taking advantage of an opportunity to educate the general community about this 
topic. Since my available time to maintain this capability has diminished, the 
more people who understand all of its nuances, the more likely we are to 
execute changes such as this one efficiently.

I couldn't use the provided patch for several reasons:

* you cannot print a message out of an odls module after the fork occurs unless 
you also report an error - i.e., you cannot emit a warning and then continue 
processing. If you do, everything appears correct in environments where no 
processes are local to mpirun - e.g., when running under slurm as it is 
normally configured. However, when processes are local to mpirun, calling 
orte_show_help after the fork causes the message to be printed from separate 
process instances. mpirun then has no way of recognizing that multiple copies 
of the same message are being printed, and thus it cannot aggregate them.

As a result, the provided patch generated one warning for every local process, 
plus one aggregated warning for all the remote processes. This isn't what we 
wanted users to see.

The correct solution was to have the child write an integer back to the parent 
process indicating which warning should be issued, and to let the parent output 
the actual warning. This allows mpirun to aggregate the result.
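
Roughly, the pattern looks like this (a minimal sketch only - the warning 
codes, fd names, and help topic below are illustrative, not the actual odls 
symbols):

========
/* Sketch only - names and the help topic are invented for illustration. */
#include <unistd.h>
#include "orte/util/show_help.h"

enum { WARN_NONE = 0, WARN_BINDING_NOOP = 1 };

/* child side, after fork(): never print - just send a code to the parent */
static void child_report_warning(int write_fd, int warning_code)
{
    write(write_fd, &warning_code, sizeof(warning_code));
}

/* parent side (the daemon): read the code and do the actual reporting,
 * so mpirun sees one show_help per daemon and can aggregate the output */
static void parent_check_warning(int read_fd)
{
    int code = WARN_NONE;
    if ((ssize_t)sizeof(code) == read(read_fd, &code, sizeof(code)) &&
        WARN_BINDING_NOOP == code) {
        orte_show_help("help-odls-default.txt",
                       "odls-default:binding-was-a-noop", true);
    }
}
========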

Nadia wasn't the only one to make this mistake. I found that I had also made it 
in an earlier revision when reporting the "could not bind" message. It is an 
easy mistake to make, but one we need to avoid.

* it didn't address the full range of binding scenarios - it only addressed 
bind-to-socket. While I know that solved Nadia's immediate problem, it helps if 
we try to address the broader issue when making such changes. Otherwise, we 
wind up with a piecemeal approach to the problem. So I added support for all 
the binding methods in the odls_default module.

* it missed the use-case where processes are launched outside of mpirun with 
paffinity_alone or slot-list set - e.g., when direct-launching processes in 
slurm. In this case, MPI_Init actually attempts to set the process affinity - 
the odls is never called.

Here is why it is important to remember that use-case. While implementing the 
warning message there, I discovered that the code in ompi_mpi_init.c would 
actually deal with Nadia's scenario incorrectly. It would identify the process 
as unbound because it had been "bound" to all available processors. Since 
paffinity_alone is set, it would then automatically bind the process to a 
single core based on that process's node rank.

So even though the odls had "bound" us to the socket, mpi_init would turn 
around and bind us to a core - which is not at all what Nadia wanted to have 
happen.

The solution here was to pass a parameter to the spawned process indicating 
that mpirun had "bound" it, even if the "binding" was a no-op. This value is 
then checked in mpi_init - if set, mpi_init makes no attempt to re-bind the 
process. If not set, then mpi_init is free to do whatever it deems appropriate.
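
A minimal sketch of that handshake (the environment variable name here is 
invented for illustration - the real code uses an internal parameter):

========
/* Hypothetical sketch - the variable name is invented for illustration. */
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include "opal/util/opal_environ.h"

/* parent side: mark the child as "bound" by mpirun/orted, even when the
 * binding was effectively a no-op (e.g. bind-to-socket on one socket) */
static void mark_child_as_bound(char ***child_env)
{
    opal_setenv("OMPI_MCA_orte_bound_at_launch", "1", true, child_env);
}

/* inside ompi_mpi_init(): skip the paffinity_alone / slot-list binding
 * if the launcher already took care of binding for us */
static bool launcher_already_bound(void)
{
    const char *val = getenv("OMPI_MCA_orte_bound_at_launch");
    return (NULL != val && 0 == strcmp(val, "1"));
}
========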

So looking at all the use-cases can expose some unintended interactions. 
Unfortunately, I suspect that many people are unaware of this second method of 
setting affinity, and so wouldn't realize that their intended actions were not 
producing the desired result.

Again, no criticism intended here. Hopefully, the above explanation will help 
future changes!
Ralph


On Apr 13, 2010, at 5:34 AM, Nadia Derbey wrote:

> On Tue, 2010-04-13 at 01:27 -0600, Ralph Castain wrote:
>> On Apr 13, 2010, at 1:02 AM, Nadia Derbey wrote:
>> 
>>> On Mon, 2010-04-12 at 10:07 -0600, Ralph Castain wrote:
>>>> By definition, if you bind to all available cpus in the OS, you are
>>>> bound to nothing (i.e., "unbound") as your process runs on any
>>>> available cpu.
>>>> 
>>>> 
>>>> PLPA doesn't care, and I personally don't care. I was just explaining
>>>> why it generates an error in the odls.
>>>> 
>>>> 
>>>> A user app would detect its binding by (a) getting the affinity mask
>>>> from the OS, and then (b) seeing if the bits are set to '1' for all
>>>> available processors. If they are, then you are not bound - there is no
>>>> mechanism available for checking "are the bits set only for the
>>>> processors I asked to be bound to". The OS doesn't track what you
>>>> asked for, it only tracks where you are bound - and a mask with all
>>>> '1's is defined as "unbound".
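
(As a concrete illustration of that check - a Linux-specific sketch only, not 
OMPI code:)

========
/* A mask with every online cpu set is indistinguishable from "unbound". */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdbool.h>

static bool process_appears_bound(void)
{
    cpu_set_t mask;
    long online = sysconf(_SC_NPROCESSORS_ONLN);

    if (0 != sched_getaffinity(0, sizeof(mask), &mask)) {
        return false;               /* cannot tell - treat as unbound */
    }
    /* "bound" only if we run on a strict subset of the online processors */
    return (CPU_COUNT(&mask) > 0 && CPU_COUNT(&mask) < online);
}
========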
>>>> 
>>>> 
>>>> So the reason for my question was simple: a user asked us to "bind"
>>>> their process. If their process checks to see if it is bound, it will
>>>> return "no". The user would therefore be led to believe that OMPI had
>>>> failed to execute their request, when in fact we did execute it - but
>>>> the result was (as Nadia says) a "no-op".
>>>> 
>>>> 
>>>> After talking with Jeff, I think he has the right answer. It is a
>>>> method we have used elsewhere, so it isn't unexpected behavior.
>>>> Basically, he proposed that we use an mca param to control this
>>>> behavior:
>>>> 
>>>> 
>>>> * default: generate an error message as the "bind" results in a no-op,
>>>> and this is our current behavior
>>>> 
>>>> 
>>>> * warn: generate a warning that the binding wound up being a "no-op",
>>>> but continue working
>>>> 
>>>> 
>>>> * quiet: just ignore it and keep going
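
(As a rough illustration of how that three-way control could behave - the 
policy values, and the idea of feeding them from an mca param, are a sketch 
only, not a final design:)

========
/* Hypothetical sketch - policy names/values are assumptions. */
#include <string.h>
#include <stdbool.h>

static const char *bind_noop_policy = "error"; /* "error" | "warn" | "quiet" */

/* returns true if the launch should abort because the requested binding
 * turned out to be a no-op */
static bool binding_noop_is_fatal(void)
{
    if (0 == strcmp(bind_noop_policy, "quiet")) {
        return false;                  /* silently keep going */
    }
    if (0 == strcmp(bind_noop_policy, "warn")) {
        /* report a warning code back to the parent and keep going */
        return false;
    }
    return true;                       /* default: current error behavior */
}
========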
>>> 
>>> Excellent, I completely agree (though I would have put the 2nd star as
>>> the default behavior, but never mind, I don't want to restart the
>>> discussion ;-) )
>> 
>> I actually went back/forth on that as well - I personally think it might be 
>> better to just have warn and quiet, with warn being the default. The warning 
>> could be generated with orte_show_help so the messages would be consolidated 
>> across nodes. Given that the enhanced paffinity behavior is fairly new, and 
>> that no-one has previously raised this issue, I don't think the prior 
>> behavior is relevant.
>> 
>> Would that make sense? If so, we could extend that to the other binding 
>> options for consistency.
> 
> Sure!
> 
> Patch proposal attached.
> 
> Regards,
> Nadia
>> 
>>> 
>>> Also this is a good opportunity to fix the other issue I talked about in
>>> the first message in this thread: the tag
>>> "odls-default:could-not-bind-to-socket" does not exist in
>>> orte/mca/odls/default/help-odls-default.txt
>> 
>> I'll take that one - my fault for missing it. I'll cross-check the other 
>> messages as well. Thanks for catching it!
>> 
>> As for your other change: let me think on it. I -think- I understand your 
>> logic, but honestly haven't had time to really walk through it properly. Got 
>> an ORCM deadline to meet, but hope to break free towards the end of this 
>> week.
>> 
>> 
>>> 
>>> Regards,
>>> Nadia
>>>> 
>>>> 
>>>> Fairly trivial to implement, and Bull could set the default mca param
>>>> file to "quiet" to get what they want. I'm not sure if that's what the
>>>> community wants or not - like I said, it makes no diff to me so long
>>>> as the code logic is understandable.
>>>> 
>>>> 
>>>> 
>>>> On Apr 12, 2010, at 8:27 AM, Terry Dontje wrote:
>>>> 
>>>>> Ralph, I guess I am curious why it is that, if there is only one
>>>>> socket, we cannot bind to it?  Does plpa actually error on this, or is
>>>>> this a condition we decided was an error at the odls level?
>>>>> 
>>>>> I am somewhat torn on whether this makes sense.  On the one hand, the
>>>>> result is definitely useless if you allow it.  On the other hand, if
>>>>> you don't allow it and you have a script running tests on multiple
>>>>> systems, it would be nice to have this run, because you are not really
>>>>> running into a resource starvation issue.
>>>>> 
>>>>> At a minimum I think the error condition/message needs to be spelled
>>>>> out (defined).  As to whether we allow binding when only one socket
>>>>> exists, I could go either way, leaning slightly towards allowing such
>>>>> a specification to work.
>>>>> 
>>>>> --td
>>>>> 
>>>>> 
>>>>> Ralph Castain wrote: 
>>>>>> Guess I'll jump in here as I finally had a few minutes to look at the 
>>>>>> code and think about your original note. In fact, I believe your 
>>>>>> original statement is the source of contention.
>>>>>> 
>>>>>> If someone tells us -bind-to-socket, but there is only one socket, then 
>>>>>> we really cannot bind them to anything. Any check by their code would 
>>>>>> reveal that they had not, in fact, been bound - raising questions as to 
>>>>>> whether or not OMPI is performing the request. Our operating standard 
>>>>>> has been to error out if the user specifies something we cannot do, in 
>>>>>> order to avoid that kind of confusion. This is what generated the code 
>>>>>> in the system today.
>>>>>> 
>>>>>> Now I can see an argument that -bind-to-socket with one socket maybe 
>>>>>> shouldn't generate an error, but that decision then has to get reflected 
>>>>>> in other code areas as well.
>>>>>> 
>>>>>> As for the test you cite -  it actually performs a valuable function and 
>>>>>> was added to catch specific scenarios. In particular, if you follow the 
>>>>>> code flow up just a little, you will see that it is possible to complete 
>>>>>> the loop without ever actually setting a bit in the mask. This happens 
>>>>>> when none of the cpus in that socket have been assigned to us via an 
>>>>>> external bind. People actually use that as a means of suballocating 
>>>>>> nodes, so the test needs to be there. Again, if the user said "bind to 
>>>>>> socket", but none of that socket's cores are assigned for our use, that 
>>>>>> is an error.
>>>>>> 
>>>>>> I haven't looked at your specific fix, but I agree with Terry's 
>>>>>> question. It seems to me that whether or not we were externally bound is 
>>>>>> irrelevant. Even if the overall result is what you want, I think a more 
>>>>>> logically understandable test would help others reading the code.
>>>>>> 
>>>>>> But first we need to resolve the question: should this scenario return 
>>>>>> an error or not?
>>>>>> 
>>>>>> 
>>>>>> On Apr 12, 2010, at 1:43 AM, Nadia Derbey wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
>>>>>>> 
>>>>>>>> Ralph Castain wrote: 
>>>>>>>> 
>>>>>>>>> Okay, just wanted to ensure everyone was working from the same base
>>>>>>>>> code. 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Terry, Brad: you might want to look this proposed change over.
>>>>>>>>> Something doesn't quite look right to me, but I haven't really
>>>>>>>>> walked through the code to check it.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> At first blush I don't really get the usage of orte_odls_globals.bound
>>>>>>>> in your patch.  It would seem to me that the insertion of that
>>>>>>>> conditional would prevent the check it surrounds from being done when
>>>>>>>> the process has not been bound prior to startup, which is a common case.
>>>>>>>> 
>>>>>>> Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path
>>>>>>> (odls_default_fork_local_proc() in odls_default_module.c):
>>>>>>> 
>>>>>>> <set target_socket depending on the desired mapping>
>>>>>>> <set my paffinity mask to 0>       (line 715)
>>>>>>> <for each core in the socket> {
>>>>>>>  <get the associated phys_core>
>>>>>>>  <get the associated phys_cpu>
>>>>>>>  <if we are bound (orte_odls_globals.bound)> {
>>>>>>>      <if phys_cpu does not belong to the cpus I'm bound to>
>>>>>>>          continue
>>>>>>>  }
>>>>>>>  <set phys-cpu bit in my affinity mask>
>>>>>>> }
>>>>>>> <check if something is set in my affinity mask>
>>>>>>> ...
>>>>>>> 
>>>>>>> 
>>>>>>> What I'm saying is that the only way to have nothing set in the affinity
>>>>>>> mask (which would justify the last test) is to have never executed the
>>>>>>> <set phys_cpu in my affinity mask> instruction. This means:
>>>>>>> . the test on orte_odls_globals.bound is true, and
>>>>>>> . we <continue> for every core in the socket.
>>>>>>> 
>>>>>>> In the other path, we are checking whether we have set one or more bits
>>>>>>> in a mask right after having unconditionally set them: don't you think
>>>>>>> that's useless?
>>>>>>> 
>>>>>>> That's why I'm suggesting that we call the last check only if
>>>>>>> orte_odls_globals.bound is true.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Nadia
>>>>>>> 
>>>>>>>> --td
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Nadia Derbey wrote: 
>>>>>>>>>> 
>>>>>>>>>>> On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> Just to check: is this with the latest trunk? Brad and Terry have 
>>>>>>>>>>>> been making changes to this section of code, including modifying 
>>>>>>>>>>>> the PROCESS_IS_BOUND test...
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> Well, it was on the v1.5. But I just checked: looks like
>>>>>>>>>>> 1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
>>>>>>>>>>>   odls_default_fork_local_proc()
>>>>>>>>>>> 2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
>>>>>>>>>>> 
>>>>>>>>>>> But, I'll give it a try with the latest trunk.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Nadia
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> The changes I've done do not touch
>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND at all.  Also, I am only touching
>>>>>>>>>> code related to the "bind-to-core" option, so I really doubt that
>>>>>>>>>> my changes are causing issues here.
>>>>>>>>>> 
>>>>>>>>>> --td
>>>>>>>>>> 
>>>>>>>>>>>> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I am facing a problem with a test that runs fine on some nodes
>>>>>>>>>>>>> and fails on others.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I have a heterogeneous cluster, with 3 types of nodes:
>>>>>>>>>>>>> 1) single socket, 4 cores
>>>>>>>>>>>>> 2) 2 sockets, 4 cores per socket
>>>>>>>>>>>>> 3) 2 sockets, 6 cores per socket
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I am using:
>>>>>>>>>>>>> . salloc to allocate the nodes,
>>>>>>>>>>>>> . mpirun binding/mapping options "-bind-to-socket -bysocket"
>>>>>>>>>>>>> 
>>>>>>>>>>>>> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This command fails if the allocated node is of type #1 (single 
>>>>>>>>>>>>> socket/4
>>>>>>>>>>>>> cpus).
>>>>>>>>>>>>> BTW, in that case orte_show_help is referencing a tag
>>>>>>>>>>>>> ("could-not-bind-to-socket") that does not exist in
>>>>>>>>>>>>> help-odls-default.txt.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It succeeds, however, when run on nodes of type #2 or #3.
>>>>>>>>>>>>> I think a "bind to socket" should not return an error on a
>>>>>>>>>>>>> single-socket machine, but rather be a no-op.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The problem comes from the test
>>>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>>>>>>>>>>>>> called in odls_default_fork_local_proc() after the binding to the
>>>>>>>>>>>>> processor socket has been done:
>>>>>>>>>>>>> ========
>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>> OPAL_PAFFINITY_CPU_ZERO(mask);
>>>>>>>>>>>>> for (n=0; n < orte_default_num_cores_per_socket; n++) {
>>>>>>>>>>>>>     <snip>
>>>>>>>>>>>>>     OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
>>>>>>>>>>>>> }
>>>>>>>>>>>>> /* if we did not bind it anywhere, then that is an error */
>>>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
>>>>>>>>>>>>> if (!bound) {
>>>>>>>>>>>>>     orte_show_help("help-odls-default.txt",
>>>>>>>>>>>>>                    "odls-default:could-not-bind-to-socket", true);
>>>>>>>>>>>>>     ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
>>>>>>>>>>>>> }
>>>>>>>>>>>>> ========
>>>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are
>>>>>>>>>>>>> bits set in the mask *AND* the number of bits set is less than
>>>>>>>>>>>>> the number of cpus on the machine. Thus on a single-socket,
>>>>>>>>>>>>> 4-core machine the test will fail, while on the other kinds of
>>>>>>>>>>>>> machines it will succeed.
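
(Roughly speaking, the check behaves like this simplified rendering - an 
illustration only, not the actual macro:)

========
/* "Bound" here means: at least one bit set, but fewer bits than cpus.
 * Assumes num_cpus fits in the bits of an unsigned long. */
#include <stdbool.h>

static bool process_is_bound(unsigned long mask_bits, int num_cpus)
{
    int set = 0;
    for (int i = 0; i < num_cpus; i++) {
        if (mask_bits & (1UL << i)) {
            set++;
        }
    }
    /* on a single-socket, 4-core node bind-to-socket sets all 4 bits, so
     * set == num_cpus and the process is reported as "not bound" */
    return (set > 0 && set < num_cpus);
}
========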
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Again, I think the problem could be solved by changing the
>>>>>>>>>>>>> algorithm, and assuming that ORTE_BIND_TO_SOCKET on a
>>>>>>>>>>>>> single-socket machine is a no-op.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Another solution could be to call the test
>>>>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if
>>>>>>>>>>>>> we are bound (orte_odls_globals.bound). Actually, that is the
>>>>>>>>>>>>> only case where I see a justification for this test (see attached
>>>>>>>>>>>>> patch).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> And maybe both solutions could be combined.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Nadia
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Nadia Derbey <nadia.der...@bull.net>
>>>>>>>>>>>>> <001_fix_process_binding_test.patch>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> -- 
>>>>>>>>>> Terry D. Dontje | Principal Software Engineer
>>>>>>>>>> Developer Tools Engineering | +1.650.633.7054
>>>>>>>>>> Oracle - Performance Technologies
>>>>>>>>>> 95 Network Drive, Burlington, MA 01803
>>>>>>>>>> Email terry.don...@oracle.com
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> -- 
>>>>>>> Nadia Derbey <nadia.der...@bull.net>
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> -- 
>>> Nadia Derbey <nadia.der...@bull.net>
>>> 
>> 
>> 
>> 
> -- 
> Nadia Derbey <nadia.der...@bull.net>
> <003_bind_to_socket_on_single_socket.patch>

