Hi!

On 2024-03-07T15:07:32+0100, Tobias Burnus <tbur...@baylibre.com> wrote:
> first, I have the feeling we talk about (more or less) the same code 
> region and use the same words – but we talk about rather different 
> things. Thus, you confuse me (and possibly Andrew) – and my reply 
> confuses you.

That, indeed, is my impression, too.  :-/

And actually the biggest confusion seems to be that both of you would like
'GCN_SUPPRESS_HOST_FALLBACK' to mean something else than
'HSA_SUPPRESS_HOST_FALLBACK' originally meant.

Hopefully the
"GCN: The original meaning of 'GCN_SUPPRESS_HOST_FALLBACK' isn't applicable 
(non-shared memory system)"
does clarify that.


Just to close this out, let's try again for the other discussion items:

> Thomas Schwinge wrote:
>> On 2024-03-07T12:43:07+0100, Tobias Burnus<tbur...@baylibre.com>  wrote:
>>> Thomas Schwinge wrote:
>>>> First, I think most users do not set GCN_SUPPRESS_HOST_FALLBACK – and it
>>>> is also not really desirable.
>> External users probably don't, but certainly all our internal testing is
>> setting it,
>
> First, I doubt it

'git grep --cached GCN_SUPPRESS_HOST_FALLBACK' in our internal scripts is
your friend.

> secondly, if it were true, it was broken for the 
> last 5 years or so as we definitely did not notice fails due to not 
> working offload devices. – Neither for AMD GCN nor ...

You're saying that 'GCN_SUPPRESS_HOST_FALLBACK=1' doesn't report as fatal
certain errors during device probing?  That's not what the code as well
as my experience says.

>> and also implicitly all nvptx offloading testing: simply by
>> means of having ["no" missing here -- sorry!] such knob in the libgomp nvptx 
>> plugin.
>
> I did see it at some places set for AMD but I do not see any 
> nvptx-specific environment variable which permits to do the same.

Right, that was confusing: there was a "no" missing in that sentence --
sorry!

> However:
>>   That is, the
>> libgomp nvptx plugin has an implicit 'suppress_host_fallback = true' for
>> (the original meaning of) that flag
>
> I think that's one of the problems here – you talk about 
> suppress_host_fallback (implicit, original meaning), while I talk about 
> the GCN_SUPPRESS_HOST_FALLBACK environment variable.

The 'suppress_host_fallback' internal variable directly corresponds to
the 'GCN_SUPPRESS_HOST_FALLBACK' environment variable.
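
To illustrate that correspondence, here is a minimal sketch; it is not the
actual libgomp code.  The helper names are hypothetical, and the sketch
assumes that the flag is enabled whenever the variable is present in the
environment, whatever its value:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper, not taken from libgomp.  Assumption: any value,
   even "0" or the empty string, enables the flag; only an unset variable
   (NULL) leaves it disabled.  */
static bool
suppress_flag_from_value (const char *value)
{
  return value != NULL;
}

/* Hypothetical wrapper: derive the internal flag from the environment,
   as a plugin might do once during initialization.  */
static bool
read_suppress_host_fallback (void)
{
  return suppress_flag_from_value (getenv ("GCN_SUPPRESS_HOST_FALLBACK"));
}
```

Under that assumption, 'GCN_SUPPRESS_HOST_FALLBACK=0' would enable the flag
just like 'GCN_SUPPRESS_HOST_FALLBACK=1' does.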

> Besides all the talk about suppress_host_fallback, the
> "'init_hsa_runtime_functions' is not fatal" of the subject line seems to
> be something to be considered (beyond the patches you already suggested).

I'll next submit "GCN, nvptx: Errors during device probing are fatal".

>>> If I run on my Linux system the system compiler with nvptx + gcn support
>>> installed, I get (with an nvptx permission problem):
>>>
>>> $ GCN_SUPPRESS_HOST_FALLBACK=1 ./a.out
>>>
>>> libgomp: GCN host fallback has been suppressed
>>>
>>> And exit code = 1. The same result with '-foffload=disable' or with
>>> '-foffload=nvptx-none'.
>> I can't tell if that's what you expect to see there, or not?
>
> Well, obviously

In this discussion thread here, nothing was obvious to me anymore...  ;-|

> not that I get this error by default – and as your 
> wording indicated that the internal variable will be always true

That always-'true' suggestion was only for the *original* meaning of the
variable: the use in 'GOMP_OFFLOAD_can_run'.

> – and 
> not only when the env var GCN_SUPPRESS_HOST_FALLBACK is explicitly set, I 
> worry that I would get the error any time.

That was exactly the point of my patch in this thread: to get rid of the
*additional*/*new* behavior that the libgomp GCN plugin derives from
'GCN_SUPPRESS_HOST_FALLBACK', different from what
'HSA_SUPPRESS_HOST_FALLBACK' originally meant.

However, I now understand that Andrew would like to keep that *new*
behavior.

>> (For avoidance of doubt: I'm expecting silent host-fallback execution in
>> case that libgomp GCN and/or nvptx plugins are available, but no
>> corresponding devices.  That's what my patch achieves.)
>
> I concur that the silent host fallback should happen by default (unless 
> env vars tell otherwise) - at least when either no code was generated 
> for the device (e.g. -foffload=disable) or when the vendor runtime 
> library is not available or no device (be it no hardware or no permission).
>
> That's the current behavior and if that remains, my main concern evaporates.

ACK, thanks.
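
For the record, the behavior described in this thread can be summarized as
a small decision function.  This is only a simplified model under stated
assumptions, not libgomp code: the enum, the function name, and the three
inputs are hypothetical, and "no usable device" lumps together missing
hardware, a missing vendor runtime library, and missing permissions:

```c
#include <stdbool.h>

/* Hypothetical outcome labels for this sketch.  */
enum outcome
{
  RUN_ON_DEVICE,
  SILENT_HOST_FALLBACK,
  FATAL_ERROR
};

/* Simplified model of the behavior discussed in this thread:
   - a usable device is used;
   - without one, the default is silent host-fallback execution;
   - without one, 'OMP_TARGET_OFFLOAD=mandatory' or the *new* meaning of
     'GCN_SUPPRESS_HOST_FALLBACK' turns that into a run-time error.  */
static enum outcome
decide (int usable_devices, bool offload_mandatory,
        bool suppress_host_fallback)
{
  if (usable_devices > 0)
    return RUN_ON_DEVICE;
  if (offload_mandatory || suppress_host_fallback)
    return FATAL_ERROR;
  return SILENT_HOST_FALLBACK;
}
```

In this model, 'decide (0, false, false)' is the default case that we agree
on, and 'decide (0, false, true)' corresponds to the "GCN host fallback has
been suppressed" run quoted above.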


Grüße
 Thomas


>>> If we want to remove it, we can make it always false - but I am strongly
>>> against making it always true.
>> I'm confused.  So you want the GCN and nvptx plugins to behave
>> differently in that regard?
> No – or at least: not unless GCN_SUPPRESS_HOST_FALLBACK is set.
>>> Use OMP_TARGET_OFFLOAD=mandatory (or that GCN env) if you want to
>>> prevent the host fallback, but don't break somewhat common systems.
>> That's an orthogonal concept?
>
> No – It's the same concept of the main use of the 
> GCN_SUPPRESS_HOST_FALLBACK environment variable: You get a run-time 
> error instead of a silent host fallback.
>
> But I have in the whole thread the feeling that – while talking about 
> the same code region and throwing in the same words – we actually talk 
> about completely different things.
>
> Tobias
