On Mon, 2010-04-12 at 07:50 -0600, Ralph Castain wrote: > Guess I'll jump in here as I finally had a few minutes to look at the code > and think about your original note. In fact, I believe your original > statement is the source of contention. > > If someone tells us -bind-to-socket, but there is only one socket, then we > really cannot bind them to anything. Any check by their code would reveal > that they had not, in fact, been bound - raising questions as to whether or > not OMPI is performing the request. Our operating standard has been to error > out if the user specifies something we cannot do to avoid that kind of > confusion. This is what generated the code in the system today. > > Now I can see an argument that -bind-to-socket with one socket maybe > shouldn't generate an error, but that decision then has to get reflected in > other code areas as well.
Actually, that was my original point: -bind-to-socket on a single socket should IMHO lead to a noop, but not an error. > > As for the test you cite - it actually performs a valuable function and was > added to catch specific scenarios. In particular, if you follow the code flow > up just a little, you will see that it is possible to complete the loop > without ever actually setting a bit in the mask. This happens when none of > the cpus in that socket have been assigned to us via an external bind. This is exactly what I said, but may be I didn't express right: . the test on orte_odls_globals.bound is true ---> means we have an external bind . call <continue> for all the cores in the socket. ---> none of the cpus on the socket belongs to our binding Regards, > People actually use that as a means of suballocating nodes, so the test > needs to be there. Again, if the user said "bind to socket", but none of that > socket's cores are assigned for our use, that is an error. > > I haven't looked at your specific fix, but I agree with Terry's question. It > seems to me that whether or not we were externally bound is irrelevant. Even > if the overall result is what you want, I think a more logically > understandable test would help others reading the code. > > But first we need to resolve the question: should this scenario return an > error or not? > > > On Apr 12, 2010, at 1:43 AM, Nadia Derbey wrote: > > > On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote: > >> Ralph Castain wrote: > >>> Okay, just wanted to ensure everyone was working from the same base > >>> code. > >>> > >>> > >>> Terry, Brad: you might want to look this proposed change over. > >>> Something doesn't quite look right to me, but I haven't really > >>> walked through the code to check it. > >>> > >>> > >> At first blush I don't really get the usage of orte_odls_globals.bound > >> in you patch. It would seem to me that the insertion of that > >> conditional would prevent the check it surrounds being done when the > >> process has not been bounded prior to startup which is a common case. > > > > Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path > > (odls_default_fork_local_proc() in odls_default_module.c): > > > > <set target_socket depending on the desired mapping> > > <set my paffinity mask to 0> (line 715) > > <for each core in the socket> { > > <get the associated phys_core> > > <get the associated phys_cpu> > > <if we are bound (orte_odls_globals.bound)> { > > <if phys_cpu does not belong to the cpus I'm bound to> > > continue > > } > > <set phys-cpu bit in my affinity mask> > > } > > <check if something is set in my affinity mask> > > ... > > > > > > What I'm saying is that the only way to have nothing set in the affinity > > mask (which would justify the last test) is to have never called the > > <set phys_cpu in my affinity mask> instruction. This means: > > . the test on orte_odls_globals.bound is true > > . call <continue> for all the cores in the socket. > > > > In the other path, what we are doing is checking if we have set one or > > more bits in a mask after having actually set them: don't you think it's > > useless? > > > > That's why I'm suggesting to call the last check only if > > orte_odls_globals.bound is true. > > > > Regards, > > Nadia > >> > >> --td > >> > >> > >>> > >>> On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote: > >>> > >>>> Nadia Derbey wrote: > >>>>> On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote: > >>>>> > >>>>>> Just to check: is this with the latest trunk? Brad and Terry have been > >>>>>> making changes to this section of code, including modifying the > >>>>>> PROCESS_IS_BOUND test... > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>>> Well, it was on the v1.5. But I just checked: looks like > >>>>> 1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in > >>>>> odls_default_fork_local_proc() > >>>>> 2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way > >>>>> > >>>>> But, I'll give it a try with the latest trunk. > >>>>> > >>>>> Regards, > >>>>> Nadia > >>>>> > >>>>> > >>>> The changes, I've done do not touch > >>>> OPAL_PAFFINITY_PROCESS_IS_BOUND at all. Also, I am only touching > >>>> code related to the "bind-to-core" option so I really doubt if my > >>>> changes are causing issues here. > >>>> > >>>> --td > >>>>>> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote: > >>>>>> > >>>>>> > >>>>>>> Hi, > >>>>>>> > >>>>>>> I am facing a problem with a test that runs fine on some nodes, and > >>>>>>> fails on others. > >>>>>>> > >>>>>>> I have a heterogenous cluster, with 3 types of nodes: > >>>>>>> 1) Single socket , 4 cores > >>>>>>> 2) 2 sockets, 4cores per socket > >>>>>>> 3) 2 sockets, 6 cores/socket > >>>>>>> > >>>>>>> I am using: > >>>>>>> . salloc to allocate the nodes, > >>>>>>> . mpirun binding/mapping options "-bind-to-socket -bysocket" > >>>>>>> > >>>>>>> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900 > >>>>>>> > >>>>>>> This command fails if the allocated node is of type #1 (single > >>>>>>> socket/4 > >>>>>>> cpus). > >>>>>>> BTW, in that case orte_show_help is referencing a tag > >>>>>>> ("could-not-bind-to-socket") that does not exist in > >>>>>>> help-odls-default.txt. > >>>>>>> > >>>>>>> While it succeeds when run on nodes of type #2 or 3. > >>>>>>> I think a "bind to socket" should not return an error on a single > >>>>>>> socket > >>>>>>> machine, but rather be a noop. > >>>>>>> > >>>>>>> The problem comes from the test > >>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound); > >>>>>>> called in odls_default_fork_local_proc() after the binding to the > >>>>>>> processors socket has been done: > >>>>>>> ======== > >>>>>>> <snip> > >>>>>>> OPAL_PAFFINITY_CPU_ZERO(mask); > >>>>>>> for (n=0; n < orte_default_num_cores_per_socket; n++) { > >>>>>>> <snip> > >>>>>>> OPAL_PAFFINITY_CPU_SET(phys_cpu, mask); > >>>>>>> } > >>>>>>> /* if we did not bind it anywhere, then that is an error */ > >>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound); > >>>>>>> if (!bound) { > >>>>>>> orte_show_help("help-odls-default.txt", > >>>>>>> "odls-default:could-not-bind-to-socket", true); > >>>>>>> ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL); > >>>>>>> } > >>>>>>> ======== > >>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there bits set > >>>>>>> in > >>>>>>> the mask *AND* the number of bits set is lesser than the number of > >>>>>>> cpus > >>>>>>> on the machine. Thus on a single socket, 4 cores machine the test will > >>>>>>> fail. While on other the kinds of machines it will succeed. > >>>>>>> > >>>>>>> Again, I think the problem could be solved by changing the alogrithm, > >>>>>>> and assuming that ORTE_BIND_TO_SOCKET, on a single socket machine = > >>>>>>> noop. > >>>>>>> > >>>>>>> Another solution could be to call the test > >>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if we > >>>>>>> are > >>>>>>> bound (orte_odls_globals.bound). Actually that is the only case where > >>>>>>> I > >>>>>>> see a justification to this test (see attached patch). > >>>>>>> > >>>>>>> And may be both solutions could be mixed. > >>>>>>> > >>>>>>> Regards, > >>>>>>> Nadia > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Nadia Derbey <nadia.der...@bull.net> > >>>>>>> <001_fix_process_binding_test.patch>_______________________________________________ > >>>>>>> devel mailing list > >>>>>>> de...@open-mpi.org > >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel > >>>>>>> > >>>>>> _______________________________________________ > >>>>>> devel mailing list > >>>>>> de...@open-mpi.org > >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel > >>>>>> > >>>>>> > >>>> > >>>> > >>>> -- > >>>> <Mail Attachment.gif> > >>>> Terry D. Dontje | Principal Software Engineer > >>>> Developer Tools Engineering | +1.650.633.7054 > >>>> Oracle - Performance Technologies > >>>> 95 Network Drive, Burlington, MA 01803 > >>>> Email terry.don...@oracle.com > >>>> > >>>> > >>>> _______________________________________________ > >>>> devel mailing list > >>>> de...@open-mpi.org > >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel > >>> > >>> > >>> > >>> ____________________________________________________________________ > >>> > >>> _______________________________________________ > >>> devel mailing list > >>> de...@open-mpi.org > >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel > >> > >> > >> -- > >> Oracle > >> Terry D. Dontje | Principal Software Engineer > >> Developer Tools Engineering | +1.650.633.7054 > >> Oracle - Performance Technologies > >> 95 Network Drive, Burlington, MA 01803 > >> Email terry.don...@oracle.com > >> > >> > >> _______________________________________________ > >> devel mailing list > >> de...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > -- > > Nadia Derbey <nadia.der...@bull.net> > > > > _______________________________________________ > > devel mailing list > > de...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > _______________________________________________ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Nadia Derbey <nadia.der...@bull.net>