Re: [hwloc-users] Solaris and hwloc

Brice Goglin Thu, 13 Sep 2012 11:17:13 -0400

I think I am going to agree. Three comments:
* which "binding fails" do you refer to? I assume all cases I listed.
* I was initially against changing the default behavior of hwloc-bind,
but it's not like changing the ABI. There are likely very few scripts
using hwloc-bind out there. Breaking some of them is not too bad as long
as we give a useful error message.
* If we start failing because of invalid inputs in hwloc-bind, we may
have to do the same in hwloc-calc. The parsing code is shared anyway.
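
On the shell fallback quoted below: as written, it also reruns my_executable unbound whenever the bound run itself exits non-zero, not only when binding fails. Here is a minimal sketch that probes the binding with a no-op command first. It assumes the binder exits non-zero when it cannot bind (the behavior proposed in this thread, not the current default), and the function name run_bound_or_unbound is hypothetical:

```shell
#!/bin/sh
# Sketch only. Assumes the binder (e.g. hwloc-bind) exits non-zero when it
# cannot bind, which is the behavior proposed in this thread.
#
# Usage: run_bound_or_unbound BINDER LOCATION -- COMMAND [ARGS...]
run_bound_or_unbound() {
    binder=$1
    location=$2
    shift 3   # drop the binder, the location, and the "--" separator

    # Probe: bind a no-op command so that a failure here can only come
    # from the binder, never from the real executable.
    if "$binder" "$location" -- true 2>/dev/null; then
        "$binder" "$location" -- "$@"
    else
        echo "warning: cannot bind to $location, running unbound" >&2
        "$@"
    fi
}
```

For example, `run_bound_or_unbound hwloc-bind <location> -- my_executable` runs my_executable only once, and the function returns the exit status of my_executable in both branches.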


Brice



On 13/09/2012 17:09, Jeff Squyres wrote:
> These are all good points.
>
> That being said, Brock Palen made another good point on the OMPI list 
> recently.  It was in regards to OpenFabrics registered memory, but the issue 
> is quite analogous.
>
> OMPI used to issue a warning if there wasn't enough registered memory 
> available, but allow the job to run anyway (at lower performance).  Brock was 
> firmly opposed to that (he's an HPC sysadmin): he didn't want jobs to run at 
> all if there wasn't enough registered memory.  
>
> One rationale here is that users won't tend to notice a warning at the 
> top of a job's stdout/stderr -- if the job ran, that's good enough (until 
> much later when they realize that they're not getting the right performance, 
> or, worse, this job is impacting other jobs because its affinity is wrong).  
> But if the job doesn't run, that will get noticed immediately, and the 
> problem will be fixed by a human.
>
> Hence, it seems safer to fall back on the "if we can't give the user what 
> they asked for, fail and let a human figure it out" philosophy.  Even if it 
> means changing the default.  Keep in mind that if they run hwloc-bind, 
> they're specifically asking for binding.
>
> I think I'm now 80/20 in the "abort hwloc-bind if it fails to bind" camp. 
>  :-)
>
> After a little more thought, I'm also thinking that having an "it's ok if 
> binding fails" CLI flag is a bad idea.  If the user really wants something to 
> run without binding, then you can just do that in the shell:
>
> -----
> hwloc-bind ...whatever... my_executable
> if test "$?" != "0"; then
>       # run without binding
>       my_executable
> fi
> -----
>
> My $0.02.  :)
>
>
> On Sep 13, 2012, at 4:09 AM, Brice Goglin wrote:
>
>> (resending because the formatting was bad)
>>
>>
>> On 13/09/2012 00:26, Jeff Squyres wrote:
>>> On Sep 12, 2012, at 10:30 AM, Samuel Thibault wrote:
>>>
>>>>> Sidenote: if hwloc-bind fails to bind, should we still launch the child 
>>>>> process?
>>>> Well, it's up to you to decide :)
>>> Anyone have an opinion?  I'm 60/40 in favor of not letting it run, under 
>>> the rationale that the user asked for something that we can't deliver, so 
>>> we shouldn't continue.
>>>
>>> Any idea what numactl does if it can't bind?
>> Let me add taskset to the list of tools to compare to, and distinguish
>> several cases:
>>
>> 1) invalid command line
>> * taskset (with invalid list "2,") errors out
>> * numactl (with invalid list "2,") errors out
>> * hwloc-bind (with invalid location followed by "-- executable") errors
>> out (considers the invalid location as the executable name)
>>
>> 2) valid command-line containing *only* non-existing objects:
>> * taskset errors out
>> * numactl errors out
>> * hwloc-bind succeeds, binds to nothing
>>
>> 3) valid command-line containing some existing objects and some
>> non-existing:
>> * taskset succeeds (ignores non-existing objects, binds to the others)
>> * numactl errors out
>> * hwloc-bind succeeds (ignores non-existing objects, binds to the others)
>>
>> 4) valid command-line with only valid objects but missing OS support
>> * doesn't apply to taskset and numactl afaik
>> * hwloc-bind succeeds (ignores failure to bind)
>>
>>
>> We have a --strict option, which translates into the STRICT binding flag,
>> which is documented as
>>  "Request strict binding from the OS.  The function will fail if the
>> binding can not be guaranteed / completely enforced."
>> I usually see "non-strict" as "if you can't do what I want, do something
>> similar". It wouldn't be too bad to say that this applies to (3) (bind to
>> a smaller set than requested).
>>
>> But (2) and (4) are different. Not binding at all or binding to nothing
>> is far from "non-strict". But I wonder if adding a new command-line flag
>> to exit on such errors would be confusing with respect to the existing
>> --strict.
>>
>> We could also change the default to exit on error, and add --force to
>> launch the process even on failure to bind. But changing defaults isn't
>> always a good idea.
>>
>> Brice
>>
>
