On Tue, 2010-04-13 at 01:27 -0600, Ralph Castain wrote:
> On Apr 13, 2010, at 1:02 AM, Nadia Derbey wrote:
>
> > On Mon, 2010-04-12 at 10:07 -0600, Ralph Castain wrote:
> >> By definition, if you bind to all available cpus in the OS, you are
> >> bound to nothing (i.e., "unbound") as your process runs on any
> >> available cpu.
> >>
> >> PLPA doesn't care, and I personally don't care. I was just explaining
> >> why it generates an error in the odls.
> >>
> >> A user app would detect its binding by (a) getting the affinity mask
> >> from the OS, and then (b) seeing if the bits are set to '1' for all
> >> available processors. If they are, then you are not bound - there is no
> >> mechanism available for checking "are the bits set only for the
> >> processors I asked to be bound to". The OS doesn't track what you
> >> asked for, it only tracks where you are bound - and a mask with all
> >> '1's is defined as "unbound".
> >>
> >> So the reason for my question was simple: a user asked us to "bind"
> >> their process. If their process checks to see if it is bound, it will
> >> return "no". The user would therefore be led to believe that OMPI had
> >> failed to execute their request, when in fact we did execute it - but
> >> the result was (as Nadia says) a "no-op".
> >>
> >> After talking with Jeff, I think he has the right answer. It is a
> >> method we have used elsewhere, so it isn't unexpected behavior.
> >> Basically, he proposed that we use an mca param to control this
> >> behavior:
> >>
> >> * default: generate an error message as the "bind" results in a no-op,
> >> and this is our current behavior
> >>
> >> * warn: generate a warning that the binding wound up being a "no-op",
> >> but continue working
> >>
> >> * quiet: just ignore it and keep going
> >
> > Excellent, I completely agree (though I would have put the 2nd star as
> > the default behavior, but never mind, I don't want to restart the
> > discussion ;-) )
>
> I actually went back/forth on that as well - I personally think it might be
> better to just have warn and quiet, with warn being the default. The warning
> could be generated with orte_show_help so the messages would be consolidated
> across nodes. Given that the enhanced paffinity behavior is fairly new, and
> that no-one has previously raised this issue, I don't think the prior
> behavior is relevant.
>
> Would that make sense? If so, we could extend that to the other binding
> options for consistency.

Sure! Patch proposal attached.

Regards,
Nadia

> >
> > Also this is a good opportunity to fix the other issue I talked about in
> > the first message in this thread: the tag
> > "odls-default:could-not-bind-to-socket" does not exist in
> > orte/mca/odls/default/help-odls-default.txt
>
> I'll take that one - my fault for missing it. I'll cross-check the other
> messages as well. Thanks for catching it!
>
> As for your other change: let me think on it. I -think- I understand your
> logic, but honestly haven't had time to really walk through it properly. Got
> an ORCM deadline to meet, but hope to break free towards the end of this week.
>
> >
> > Regards,
> > Nadia
> >>
> >> Fairly trivial to implement, and Bull could set the default mca param
> >> file to "quiet" to get what they want. I'm not sure if that's what the
> >> community wants or not - like I said, it makes no diff to me so long
> >> as the code logic is understandable.
> >>
> >> On Apr 12, 2010, at 8:27 AM, Terry Dontje wrote:
> >>
> >>> Ralph, I guess I am curious why is it that if there is only one
> >>> socket we cannot bind to it? Does plpa actually error on this or is
> >>> this a condition we decided was an error at odls?
> >>>
> >>> I am somewhat torn on whether this makes sense. On the one hand it
> >>> is definitely useless as to the result if you allow it. However if
> >>> you don't allow it and you have a script or are running tests on
> >>> multiple systems it would be nice to have this run because you are
> >>> not really running into a resource starvation issue.
> >>>
> >>> At a minimum I think the error condition/message needs to be spelled
> >>> out (defined). As to whether we allow binding when only one
> >>> socket exists I could go either way, slightly leaning towards allowing
> >>> such a specification to work.
> >>>
> >>> --td
> >>>
> >>> Ralph Castain wrote:
> >>>> Guess I'll jump in here as I finally had a few minutes to look at the
> >>>> code and think about your original note. In fact, I believe your
> >>>> original statement is the source of contention.
> >>>>
> >>>> If someone tells us -bind-to-socket, but there is only one socket, then
> >>>> we really cannot bind them to anything. Any check by their code would
> >>>> reveal that they had not, in fact, been bound - raising questions as to
> >>>> whether or not OMPI is performing the request. Our operating standard
> >>>> has been to error out if the user specifies something we cannot do to
> >>>> avoid that kind of confusion. This is what generated the code in the
> >>>> system today.
> >>>>
> >>>> Now I can see an argument that -bind-to-socket with one socket maybe
> >>>> shouldn't generate an error, but that decision then has to get reflected
> >>>> in other code areas as well.
> >>>>
> >>>> As for the test you cite - it actually performs a valuable function and
> >>>> was added to catch specific scenarios. In particular, if you follow the
> >>>> code flow up just a little, you will see that it is possible to complete
> >>>> the loop without ever actually setting a bit in the mask. This happens
> >>>> when none of the cpus in that socket have been assigned to us via an
> >>>> external bind. People actually use that as a means of suballocating
> >>>> nodes, so the test needs to be there. Again, if the user said "bind to
> >>>> socket", but none of that socket's cores are assigned for our use, that
> >>>> is an error.
> >>>>
> >>>> I haven't looked at your specific fix, but I agree with Terry's
> >>>> question. It seems to me that whether or not we were externally bound is
> >>>> irrelevant. Even if the overall result is what you want, I think a more
> >>>> logically understandable test would help others reading the code.
> >>>>
> >>>> But first we need to resolve the question: should this scenario return
> >>>> an error or not?
> >>>>
> >>>> On Apr 12, 2010, at 1:43 AM, Nadia Derbey wrote:
> >>>>
> >>>>> On Fri, 2010-04-09 at 14:23 -0400, Terry Dontje wrote:
> >>>>>
> >>>>>> Ralph Castain wrote:
> >>>>>>
> >>>>>>> Okay, just wanted to ensure everyone was working from the same base
> >>>>>>> code.
> >>>>>>>
> >>>>>>> Terry, Brad: you might want to look this proposed change over.
> >>>>>>> Something doesn't quite look right to me, but I haven't really
> >>>>>>> walked through the code to check it.
> >>>>>>>
> >>>>>> At first blush I don't really get the usage of orte_odls_globals.bound
> >>>>>> in your patch. It would seem to me that the insertion of that
> >>>>>> conditional would prevent the check it surrounds being done when the
> >>>>>> process has not been bound prior to startup, which is a common case.
> >>>>>>
> >>>>> Well, if you have a look at the algo in the ORTE_BIND_TO_SOCKET path
> >>>>> (odls_default_fork_local_proc() in odls_default_module.c):
> >>>>>
> >>>>> <set target_socket depending on the desired mapping>
> >>>>> <set my paffinity mask to 0> (line 715)
> >>>>> <for each core in the socket> {
> >>>>>     <get the associated phys_core>
> >>>>>     <get the associated phys_cpu>
> >>>>>     <if we are bound (orte_odls_globals.bound)> {
> >>>>>         <if phys_cpu does not belong to the cpus I'm bound to>
> >>>>>             continue
> >>>>>     }
> >>>>>     <set phys_cpu bit in my affinity mask>
> >>>>> }
> >>>>> <check if something is set in my affinity mask>
> >>>>> ...
> >>>>>
> >>>>> What I'm saying is that the only way to have nothing set in the affinity
> >>>>> mask (which would justify the last test) is to have never called the
> >>>>> <set phys_cpu bit in my affinity mask> instruction. This means:
> >>>>> . the test on orte_odls_globals.bound is true
> >>>>> . call <continue> for all the cores in the socket.
> >>>>>
> >>>>> In the other path, what we are doing is checking if we have set one or
> >>>>> more bits in a mask after having actually set them: don't you think it's
> >>>>> useless?
> >>>>>
> >>>>> That's why I'm suggesting to call the last check only if
> >>>>> orte_odls_globals.bound is true.
> >>>>>
> >>>>> Regards,
> >>>>> Nadia
> >>>>>
> >>>>>> --td
> >>>>>>
> >>>>>>> On Apr 9, 2010, at 9:33 AM, Terry Dontje wrote:
> >>>>>>>
> >>>>>>>> Nadia Derbey wrote:
> >>>>>>>>
> >>>>>>>>> On Fri, 2010-04-09 at 08:41 -0600, Ralph Castain wrote:
> >>>>>>>>>
> >>>>>>>>>> Just to check: is this with the latest trunk? Brad and Terry have
> >>>>>>>>>> been making changes to this section of code, including modifying
> >>>>>>>>>> the PROCESS_IS_BOUND test...
> >>>>>>>>>>
> >>>>>>>>> Well, it was on the v1.5. But I just checked: looks like
> >>>>>>>>> 1. the call to OPAL_PAFFINITY_PROCESS_IS_BOUND is still there in
> >>>>>>>>> odls_default_fork_local_proc()
> >>>>>>>>> 2. OPAL_PAFFINITY_PROCESS_IS_BOUND() is defined the same way
> >>>>>>>>>
> >>>>>>>>> But, I'll give it a try with the latest trunk.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Nadia
> >>>>>>>>>
> >>>>>>>> The changes I've done do not touch
> >>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND at all. Also, I am only touching
> >>>>>>>> code related to the "bind-to-core" option so I really doubt if my
> >>>>>>>> changes are causing issues here.
> >>>>>>>>
> >>>>>>>> --td
> >>>>>>>>
> >>>>>>>>>> On Apr 9, 2010, at 3:39 AM, Nadia Derbey wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> I am facing a problem with a test that runs fine on some nodes, and
> >>>>>>>>>>> fails on others.
> >>>>>>>>>>>
> >>>>>>>>>>> I have a heterogeneous cluster, with 3 types of nodes:
> >>>>>>>>>>> 1) Single socket, 4 cores
> >>>>>>>>>>> 2) 2 sockets, 4 cores per socket
> >>>>>>>>>>> 3) 2 sockets, 6 cores per socket
> >>>>>>>>>>>
> >>>>>>>>>>> I am using:
> >>>>>>>>>>> . salloc to allocate the nodes,
> >>>>>>>>>>> . mpirun binding/mapping options "-bind-to-socket -bysocket"
> >>>>>>>>>>>
> >>>>>>>>>>> # salloc -N 1 mpirun -n 4 -bind-to-socket -bysocket sleep 900
> >>>>>>>>>>>
> >>>>>>>>>>> This command fails if the allocated node is of type #1 (single
> >>>>>>>>>>> socket/4 cpus).
> >>>>>>>>>>> BTW, in that case orte_show_help is referencing a tag
> >>>>>>>>>>> ("could-not-bind-to-socket") that does not exist in
> >>>>>>>>>>> help-odls-default.txt.
> >>>>>>>>>>>
> >>>>>>>>>>> While it succeeds when run on nodes of type #2 or 3.
> >>>>>>>>>>> I think a "bind to socket" should not return an error on a single
> >>>>>>>>>>> socket machine, but rather be a noop.
> >>>>>>>>>>>
> >>>>>>>>>>> The problem comes from the test
> >>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> >>>>>>>>>>> called in odls_default_fork_local_proc() after the binding to the
> >>>>>>>>>>> processor's socket has been done:
> >>>>>>>>>>> ========
> >>>>>>>>>>> <snip>
> >>>>>>>>>>> OPAL_PAFFINITY_CPU_ZERO(mask);
> >>>>>>>>>>> for (n=0; n < orte_default_num_cores_per_socket; n++) {
> >>>>>>>>>>>     <snip>
> >>>>>>>>>>>     OPAL_PAFFINITY_CPU_SET(phys_cpu, mask);
> >>>>>>>>>>> }
> >>>>>>>>>>> /* if we did not bind it anywhere, then that is an error */
> >>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
> >>>>>>>>>>> if (!bound) {
> >>>>>>>>>>>     orte_show_help("help-odls-default.txt",
> >>>>>>>>>>>                    "odls-default:could-not-bind-to-socket",
> >>>>>>>>>>>                    true);
> >>>>>>>>>>>     ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
> >>>>>>>>>>> }
> >>>>>>>>>>> ========
> >>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() will return true if there are bits
> >>>>>>>>>>> set in the mask *AND* the number of bits set is less than the number
> >>>>>>>>>>> of cpus on the machine. Thus on a single socket, 4 cores machine the
> >>>>>>>>>>> test will fail. While on the other kinds of machines it will succeed.
> >>>>>>>>>>>
> >>>>>>>>>>> Again, I think the problem could be solved by changing the
> >>>>>>>>>>> algorithm, and assuming that ORTE_BIND_TO_SOCKET, on a single socket
> >>>>>>>>>>> machine = noop.
> >>>>>>>>>>>
> >>>>>>>>>>> Another solution could be to call the test
> >>>>>>>>>>> OPAL_PAFFINITY_PROCESS_IS_BOUND() at the end of the loop only if
> >>>>>>>>>>> we are bound (orte_odls_globals.bound). Actually that is the only
> >>>>>>>>>>> case where I see a justification to this test (see attached patch).
> >>>>>>>>>>>
> >>>>>>>>>>> And maybe both solutions could be mixed.
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Nadia
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Nadia Derbey <nadia.der...@bull.net>
> >>>>>>>>>>> <001_fix_process_binding_test.patch>
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> devel mailing list
> >>>>>>>>>>> de...@open-mpi.org
> >>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>>>>>> --
> >>>>>>>> Terry D. Dontje | Principal Software Engineer
> >>>>>>>> Developer Tools Engineering | +1.650.633.7054
> >>>>>>>> Oracle - Performance Technologies
> >>>>>>>> 95 Network Drive, Burlington, MA 01803
> >>>>>>>> Email terry.don...@oracle.com
--
Nadia Derbey <nadia.der...@bull.net>
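
For readers following the thread, the decision being discussed can be modeled in a few lines of standalone C. This is only a sketch under stated assumptions, not the ORTE code: `cpu_mask_t`, `socket_mask()`, `process_is_bound()`, `bind_to_socket()` and the `bind_result_t` values are hypothetical names, the mask is a plain 64-bit integer instead of a paffinity mask, and the logical-to-physical cpu mapping is assumed to be `socket * cores_per_socket + core`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an affinity mask: bit i set = may run on cpu i. */
typedef uint64_t cpu_mask_t;

/* Hypothetical outcomes of a bind-to-socket request. */
typedef enum { BIND_OK, BIND_NOOP_WARN, BIND_NOOP_QUIET, BIND_ERROR } bind_result_t;

/* Count the bits set in a mask. */
static int count_bits(cpu_mask_t mask)
{
    int n = 0;
    for (; mask != 0; mask >>= 1) {
        n += (int)(mask & 1u);
    }
    return n;
}

/* "Bound" as described in the thread: at least one bit is set, but fewer
 * bits than there are cpus on the machine. */
static bool process_is_bound(cpu_mask_t mask, int num_cpus)
{
    int set = count_bits(mask);
    return set > 0 && set < num_cpus;
}

/* Build the mask for one socket, skipping cpus that an external binding
 * (e.g. from the resource manager) did not give us.
 * Assumption: cpu = socket * cores_per_socket + core, and fewer than 64 cpus. */
static cpu_mask_t socket_mask(int socket, int cores_per_socket,
                              bool externally_bound, cpu_mask_t allowed)
{
    cpu_mask_t mask = 0;
    for (int n = 0; n < cores_per_socket; n++) {
        int cpu = socket * cores_per_socket + n;
        assert(cpu >= 0 && cpu < 64);
        cpu_mask_t bit = (cpu_mask_t)1 << cpu;
        if (externally_bound && (allowed & bit) == 0) {
            continue; /* this cpu was not assigned to us */
        }
        mask |= bit;
    }
    return mask;
}

/* Same branching as the attached patch: an unbound result is fatal only when
 * we were externally bound; otherwise it is a no-op, reported or not
 * depending on warn_if_not_bound. */
static bind_result_t bind_to_socket(int socket, int num_sockets,
                                    int cores_per_socket,
                                    bool externally_bound, cpu_mask_t allowed,
                                    bool warn_if_not_bound)
{
    cpu_mask_t mask = socket_mask(socket, cores_per_socket,
                                  externally_bound, allowed);
    if (process_is_bound(mask, num_sockets * cores_per_socket)) {
        return BIND_OK;
    }
    if (externally_bound) {
        return BIND_ERROR;
    }
    return warn_if_not_bound ? BIND_NOOP_WARN : BIND_NOOP_QUIET;
}
```

With these definitions, `bind_to_socket(0, 1, 4, false, 0, true)` models node type #1 (one socket, 4 cores) and yields the warning case, while the two-socket node types yield `BIND_OK`.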

when bind-to-socket is asked for, do not unconditionally leave if we are
running on a single socket node

diff -r 0b851b2e7934 orte/mca/odls/default/help-odls-default.txt
--- a/orte/mca/odls/default/help-odls-default.txt	Thu Mar 18 16:10:25 2010 +0100
+++ b/orte/mca/odls/default/help-odls-default.txt	Tue Apr 13 13:40:12 2010 +0200
@@ -130,3 +130,13 @@ binding action:
   Application name:  %s
 
 Please revise the request and try again.
+#
+[odls-default:warn-not-bound-to-socket]
+A request to bind the processes to a socket was made, but the local host
+only contains a single socket.
+This will result in the processes being unbound.
+Continuing anyway.
+
+  Local host:        %s
+  Action requested:  %s
+  Application name:  %s
diff -r 0b851b2e7934 orte/mca/odls/default/odls_default.h
--- a/orte/mca/odls/default/odls_default.h	Thu Mar 18 16:10:25 2010 +0100
+++ b/orte/mca/odls/default/odls_default.h	Tue Apr 13 13:40:12 2010 +0200
@@ -36,6 +36,7 @@ BEGIN_C_DECLS
 int orte_odls_default_component_open(void);
 int orte_odls_default_component_close(void);
 int orte_odls_default_component_query(mca_base_module_t **module, int *priority);
+int orte_odls_default_component_register(void);
 
 /*
  * ODLS Default module
@@ -46,6 +47,8 @@ ORTE_MODULE_DECLSPEC extern orte_odls_ba
 /* dedicated debug output flag */
 ORTE_MODULE_DECLSPEC extern bool orte_odls_default_report_bindings;
 
+ORTE_DECLSPEC extern int orte_odls_default_warn_if_not_bound;
+
 END_C_DECLS
 
 #endif /* ORTE_ODLS_H */
diff -r 0b851b2e7934 orte/mca/odls/default/odls_default_component.c
--- a/orte/mca/odls/default/odls_default_component.c	Thu Mar 18 16:10:25 2010 +0100
+++ b/orte/mca/odls/default/odls_default_component.c	Tue Apr 13 13:40:12 2010 +0200
@@ -31,12 +31,17 @@
 #endif
 #include <ctype.h>
 
+#include "opal/mca/mca.h"
+#include "opal/mca/base/base.h"
+#include "opal/mca/base/mca_base_param.h"
+
 #include "orte/mca/odls/odls.h"
 #include "orte/mca/odls/base/odls_private.h"
 #include "orte/mca/odls/default/odls_default.h"
 
 /* instantiate a module-global variable */
 bool orte_odls_default_report_bindings;
+int orte_odls_default_warn_if_not_bound;
 
 /*
  * Instantiate the public struct with all of our public information
@@ -57,7 +62,8 @@ orte_odls_base_component_t mca_odls_defa
         /* Component open and close functions */
         orte_odls_default_component_open,
         orte_odls_default_component_close,
-        orte_odls_default_component_query
+        orte_odls_default_component_query,
+        orte_odls_default_component_register
     },
     {
         /* The component is checkpoint ready */
@@ -72,6 +78,17 @@ int orte_odls_default_component_open(voi
     return ORTE_SUCCESS;
 }
 
+int orte_odls_default_component_register(void)
+{
+    mca_base_param_reg_int(&mca_odls_default_component.version,
+                           "warn_if_not_bound",
+                           "If nonzero, issue a warning if the program asked "
+                           "for a binding that results in a no-op (ex: "
+                           "bind-to-socket on a single socket node)",
+                           false, false, 1,
+                           &orte_odls_default_warn_if_not_bound);
+    return ORTE_SUCCESS;
+}
 
 int orte_odls_default_component_query(mca_base_module_t **module, int *priority)
 {
diff -r 0b851b2e7934 orte/mca/odls/default/odls_default_module.c
--- a/orte/mca/odls/default/odls_default_module.c	Thu Mar 18 16:10:25 2010 +0100
+++ b/orte/mca/odls/default/odls_default_module.c	Tue Apr 13 13:40:12 2010 +0200
@@ -750,9 +750,19 @@ static int odls_default_fork_local_proc(
                 /* if we did not bind it anywhere, then that is an error */
                 OPAL_PAFFINITY_PROCESS_IS_BOUND(mask, &bound);
                 if (!bound) {
-                    orte_show_help("help-odls-default.txt",
-                                   "odls-default:could-not-bind-to-socket", true);
-                    ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
+                    if (orte_odls_globals.bound) {
+                        orte_show_help("help-odls-default.txt",
+                                       "odls-default:could-not-bind-to-socket", true);
+                        ORTE_ODLS_ERROR_OUT(ORTE_ERR_FATAL);
+                    } else {
+                        if (orte_odls_default_warn_if_not_bound) {
+                            orte_show_help("help-odls-default.txt",
+                                           "odls-default:warn-not-bound-to-socket",
+                                           true,
+                                           orte_process_info.nodename,
+                                           "bind-to-socket", context->app);
+                        }
+                    }
                 }
                 if (orte_report_bindings) {
                     opal_output(0, "%s odls:default:fork binding child %s to socket %d cpus %04lx",