I think I am going to agree. Three comments:

* which "binding fails" do you refer to? I assume all cases I listed.
* I was initially against changing the default behavior of hwloc-bind, but it's not like changing the ABI. There are likely very few scripts using hwloc-bind out there. Breaking some of them is not too bad as long as we give a useful error message.
* If we start failing because of invalid inputs in hwloc-bind, we may have to do the same in hwloc-calc. The parsing code is shared anyway.

Brice

On 13/09/2012 17:09, Jeff Squyres wrote:
> These are all good points.
>
> That being said, Brock Palen made another good point on the OMPI list
> recently. It was in regards to OpenFabrics registered memory, but the issue
> is quite analogous.
>
> OMPI used to issue a warning if there wasn't enough registered memory
> available, but allow the job to run anyway (at lower performance). Brock was
> firmly opposed to that (he's an HPC sysadmin): he didn't want jobs to run at
> all if there wasn't enough registered memory.
>
> One of the rationales here is that users won't tend to notice a warning at
> the top of a job's stdout/stderr -- if the job ran, that's good enough (until
> much later when they realize that they're not getting the right performance,
> or, worse, this job is impacting other jobs because its affinity is wrong).
> But if the job doesn't run, that will get noticed immediately, and the
> problem will be fixed by a human.
>
> Hence, it seems safer to fall back on the "if we can't give the user what
> they asked for, fail and let a human figure it out" philosophy. Even if it
> means changing the default. Keep in mind that if they run hwloc-bind,
> they're specifically asking for binding.
>
> I think I'm now 80/20 in the "abort hwloc-bind if it fails to bind" camp.
> :-)
>
> After a little more thought, I'm also thinking that having an "it's ok if
> binding fails" CLI flag is a bad idea. If the user really wants something to
> run without binding, then they can just do that in the shell:
>
> -----
> hwloc-bind ...whatever... my_executable
> if test "$?" != "0"; then
>     # run without binding
>     my_executable
> fi
> -----
>
> My $0.02. :)
>
>
> On Sep 13, 2012, at 4:09 AM, Brice Goglin wrote:
>
>> (resending because the formatting was bad)
>>
>>
>> On 13/09/2012 00:26, Jeff Squyres wrote:
>>> On Sep 12, 2012, at 10:30 AM, Samuel Thibault wrote:
>>>
>>>>> Sidenote: if hwloc-bind fails to bind, should we still launch the child
>>>>> process?
>>>> Well, it's up to you to decide :)
>>> Anyone have an opinion? I'm 60/40 in favor of not letting it run, under
>>> the rationale that the user asked for something that we can't deliver, so
>>> we shouldn't continue.
>>>
>>> Any idea what numactl does if it can't bind?
>> Let me add taskset to the list of tools to compare to, and distinguish
>> several cases:
>>
>> 1) invalid command line
>> * taskset (with invalid list "2,") errors out
>> * numactl (with invalid list "2,") errors out
>> * hwloc-bind (with invalid location followed by "-- executable") errors
>> out (considers the invalid location as the executable name)
>>
>> 2) valid command line containing *only* non-existing objects:
>> * taskset errors out
>> * numactl errors out
>> * hwloc-bind succeeds, binds to nothing
>>
>> 3) valid command line containing some existing objects and some
>> non-existing:
>> * taskset succeeds (ignores non-existing objects, binds to others)
>> * numactl errors out
>> * hwloc-bind succeeds (ignores non-existing objects, binds to others)
>>
>> 4) valid command line with only valid objects but missing OS support
>> * doesn't apply to taskset and numactl afaik
>> * hwloc-bind succeeds (ignores failure to bind)
>>
>>
>> We have a --strict option, which translates into the STRICT binding flag,
>> which is documented as
>> "Request strict binding from the OS. The function will fail if the
>> binding can not be guaranteed / completely enforced."
>> I usually see "non-strict" as "if you can't do what I want, do something
>> similar". It wouldn't be too bad to say that this applies to (3) (bind to
>> smaller than requested).
>>
>> But (2) and (4) are different. Not binding at all or binding to nothing
>> is far from "non-strict". But I wonder if adding a new command-line flag
>> to exit on such errors would be confusing with respect to the existing
>> --strict.
>>
>> We could also change the default to exit on error, and add --force to
>> launch the process even on failure to bind. But changing defaults isn't
>> always a good idea.
>>
>> Brice
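
For reference, the shell fallback quoted above could be wrapped in a small function. This is only a sketch, and it assumes the behavior proposed in this thread (the binding tool exits nonzero when it cannot bind). The function name try_bound and its argument order are hypothetical, and the tool name is a parameter so that the logic can be exercised without hwloc installed.

```shell
#!/bin/sh
# Sketch of the "fall back to an unbound run" idea from the thread.
# try_bound is a hypothetical helper, not part of hwloc.
#
# Usage: try_bound BIND_TOOL LOCATION COMMAND [ARGS...]
#   e.g. try_bound hwloc-bind core:0 ./my_executable
try_bound() {
    tb_tool=$1
    tb_location=$2
    shift 2

    # First attempt: run the command under the binding tool.
    tb_status=0
    "$tb_tool" "$tb_location" -- "$@" || tb_status=$?

    if [ "$tb_status" -ne 0 ]; then
        # Assumption: a nonzero status means the binding step failed.
        echo "warning: $tb_tool failed (status $tb_status), running unbound" >&2
        tb_status=0
        "$@" || tb_status=$?
    fi

    # Report the status of whichever run happened last.
    return "$tb_status"
}
```

One caveat with this approach, which also applies to the quoted snippet: the status seen by the shell may come from the command itself rather than from the binding step, so a command that was bound correctly but exited nonzero would be run a second time.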