On Mon, 2010-04-12 at 07:50 -0600, Ralph Castain wrote:
> Guess I'll jump in here as I finally had a few minutes to look at the code
> and think about your original note. In fact, I believe your original
> statement is the source of contention.
>
> If someone tells us -bind-to-socket, but there is only one socket, then we
> really cannot bind them to anything. Any check by their code would reveal
> that they had not, in fact, been bound - raising questions as to whether or
> not OMPI is performing the request. Our operating standard has been to error
> out if the user specifies something we cannot do to avoid that kind of
> confusion. This is what generated the code in the system today.
>
> Now I can see an argument that -bind-to-socket with one socket maybe
> shouldn't generate an error, but that decision then has to get reflected in
> other code areas as well.
Actually, that was my original point: -bind-to-socket on a single socket
should IMHO lead to a noop, but not an error.
>
> As for the test you cite - it actually performs a valuable function and was
> added to catch specific scenarios. In particular, if you follow the code flow
> up just a little, you will see that it is possible to complete the loop
> without ever actually setting a bit in the mask. This happens when none of
> the cpus in that socket have been assigned to us via an external bind.
This is exactly what I said, but may be I didn't express right:
. the test on orte_odls_globals.bound is true
---> means we have an external bind
. call <continue> for all the cores in the socket.
---> none of the cpus on the socket belongs to our binding
Regards,
> People actually use that as a means of suballocating nodes, so the test
> needs to be there. Again, if the user said "bind to socket", but none of that
> socket's cores are assigned for our use, that is an error.
>
> I haven't looked at your specific fix, but I agree with Terry's question. It
> seems to me that whether or not we were externally bound is irrelevant. Even
> if the overall result is what you want, I think a more logically
> understandable test would help others reading the code.
>
> But first we need to resolve the question: should this scenario return an
> error or not?
>
>
> On Apr 12, 2010, at 1:43 AM, Nadia Derbey wrote:
>
> > On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
> >> Ralph Castain wrote:
> >>> Okay, just wanted to ensure everyone was working from the same base
> >>> code.
> >>>
> >>>
> >>> Terry, Brad: you might want to look this proposed change over.
> >>> Something doesn't quite look right to me, but I haven't really
> >>> walked through the code to check it.
> >>>
> >>>
> >> At first blush I don't really get the usage of orte_odls_globals.bound
> >> in you patch. It would seem to me that the insertion of that
> >> conditional would prevent the check it surrounds being done when the
> >> process has not been bounded prior to startup which is a common case.
> >
> > Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path
> > (odls_default_fork_local_proc() in odls_default_module.c):
> >
> > <set target_socket depending on the desired mapping>
> > <set my paffinity mask to 0> (line 715)
> > <for each core in the socket> {
> > <get the associated phys_core>
> > <get the associated phys_cpu>
> > <if we are bound (orte_odls_globals.bound)> {
> > <if phys_cpu does not belong to the cpus I'm bound to>
> > continue
> > }
> > <set phys-cpu bit in my affinity mask>
> > }
> > <check if something is set in my affinity mask>
> > ...
> >
> >
> > What I'm saying is that the only way to have nothing set in the affinity
> > mask (which would justify the last test) is to have never called the
> > <set phys_cpu in my affinity mask> instruction. This means:
> > . the test on orte_odls_globals.bound is true
> > . call <continue> for all the cores in the socket.
> >
> > In the other path, what we are doing is checking if we have set one or
> > more bits in a mask after having actually set them: don't you think it's
> > useless?
> >
> > That's why I'm suggesting to call the last check only if
> > orte_odls_globals.bound is true.
> >
> > Regards,
> > Nadia
> >>
> >> --td
> >>
> >>
> >>>
> >>> On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
> >>>
> >>>> Nadia Derbey wrote:
> >>>>> On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
> >>>>>
> >>>>>> Just to check: is this with the latest trunk? Brad and Terry have been
> >>>>>> making changes to this section of code, including modifying the
> >>>>>> PROCESS_IS_BOUND test...
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> Well, it was on the v1.5. But I just checked: looks like
> >>>>> 1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
> >>>>> odls_default_fork_local_proc()
> >>>>> 2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
> >>>>>
> >>>>> But, I'll give it a try with the latest trunk.
> >>>>>
> >>>>> Regards,
> >>>>> Nadia
> >>>>>
> >>>>>
> >>>> The changes, I've done do not touch
> >>>> OPAL_PAFFINITY_PROCESS_IS_BOUND at all. Also, I am only touching
> >>>> code related to the "bind-to-core" option so I really doubt if my
> >>>> changes are causing issues here.
> >>>>
> >>>> --td
> >>>>>> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I am facing a problem with a test that runs fine on some nodes, and
> >>>>>>> fails on others.
> >>>>>>>
> >>>>>>> I have a heterogenous cluster, with 3 types of nodes:
> >>>>>>> 1) Single socket , 4 cores
> >>>>>>> 2) 2 sockets, 4cores per socket
> >>>>>>> 3) 2 sockets, 6 cores/socket
> >>>>>>>
> >>>>>>> I am using:
> >>>>>>> . salloc to allocate the nodes,
> >>>>>>> . mpirun binding/mapping options "-bind-to-socket -bysocket"
> >>>>>>>
> >>>>>>> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
> >>>>>>>
> >>>>>>> This command fails if the allocated node is of type #1 (single
> >>>>>>> socket/4
> >>>>>>> cpus).
> >>>>>>> BTW, in that case orte_show_help is referencing a tag
> >>>>>>> ("could-not-bind-to-socket") that does not exist in
> >>>>>>> help-odls-default.txt.
> >>>>>>>
> >>>>>>> While it succeeds when run on nodes of type #2 or 3.
> >>>>>>> I think a "bind to socket" should not return an error on a single
> >>>>>>> socket
> >>>>>>> machine, but rather be a noop.
> >>>>>>>
> >>>>>>> The problem comes from the test
> >>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> >>>>>>> called in odls_default_fork_local_proc() after the binding to the
> >>>>>>> processors socket has been done:
> >>>>>>> ========
> >>>>>>> <snip>
> >>>>>>> OPAL_PAFFINITY_CPU_ZERO(mask);
> >>>>>>> for (n=0; n < orte_default_num_cores_per_socket; n++) {
> >>>>>>> <snip>
> >>>>>>> OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
> >>>>>>> }
> >>>>>>> /* if we did not bind it anywhere, then that is an error */
> >>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> >>>>>>> if (!bound) {
> >>>>>>> orte_show_help("help-odls-default.txt",
> >>>>>>> "odls-default:could-not-bind-to-socket", true);
> >>>>>>> ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
> >>>>>>> }
> >>>>>>> ========
> >>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there bits set
> >>>>>>> in
> >>>>>>> the mask *AND* the number of bits set is lesser than the number of
> >>>>>>> cpus
> >>>>>>> on the machine. Thus on a single socket, 4 cores machine the test will
> >>>>>>> fail. While on other the kinds of machines it will succeed.
> >>>>>>>
> >>>>>>> Again, I think the problem could be solved by changing the alogrithm,
> >>>>>>> and assuming that ORTE_BIND_TO_SOCKET, on a single socket machine =
> >>>>>>> noop.
> >>>>>>>
> >>>>>>> Another solution could be to call the test
> >>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we
> >>>>>>> are
> >>>>>>> bound (orte_odls_globals.bound). Actually that is the only case where
> >>>>>>> I
> >>>>>>> see a justification to this test (see attached patch).
> >>>>>>>
> >>>>>>> And may be both solutions could be mixed.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Nadia
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Nadia Derbey <[email protected]>
> >>>>>>> <001_fix_process_binding_test.patch>_______________________________________________
> >>>>>>> devel mailing list
> >>>>>>> [email protected]
> >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>>>>>
> >>>>>> _______________________________________________
> >>>>>> devel mailing list
> >>>>>> [email protected]
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>> --
> >>>> <Mail Attachment.gif>
> >>>> Terry D. Dontje | Principal Software Engineer
> >>>> Developer Tools Engineering | +1.650.633.7054
> >>>> Oracle - Performance Technologies
> >>>> 95 Network Drive, Burlington, MA 01803
> >>>> Email [email protected]
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> [email protected]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>
> >>>
> >>>
> >>> ____________________________________________________________________
> >>>
> >>> _______________________________________________
> >>> devel mailing list
> >>> [email protected]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>
> >>
> >> --
> >> Oracle
> >> Terry D. Dontje | Principal Software Engineer
> >> Developer Tools Engineering | +1.650.633.7054
> >> Oracle - Performance Technologies
> >> 95 Network Drive, Burlington, MA 01803
> >> Email [email protected]
> >>
> >>
> >> _______________________________________________
> >> devel mailing list
> >> [email protected]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > --
> > Nadia Derbey <[email protected]>
> >
> > _______________________________________________
> > devel mailing list
> > [email protected]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> _______________________________________________
> devel mailing list
> [email protected]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
--
Nadia Derbey <[email protected]>