Re: Stabilizing flaky libgomp GCN target/offloading testing (was: libgomp GCN gfx1030/gfx1100 offloading status)

2024-02-21 Thread Richard Biener



> On 21.02.2024 at 13:34, Thomas Schwinge wrote:
> 
> Hi!
> 
> On 2024-02-01T15:49:02+0100, Richard Biener  wrote:
>> On Thu, 1 Feb 2024, Thomas Schwinge wrote:
>>> On 2024-01-26T10:45:10+0100, Richard Biener  wrote:
>>>> On Fri, 26 Jan 2024, Richard Biener wrote:
>>>>> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
>>>>>> [...] is enough to get gfx1100 working for most purposes, on top of the
>>>>>> patch that Tobias committed a week or so ago; there are still some test
>>>>>> failures to investigate, and probably some tuning to do.
>>>>>>
>>>>>> It might also get gfx1030 working too.  @Richi, could you test it,
>>>>>> please?
>>>>>
>>>>> I can report partial success here.  [...]
>>>
>>>>> I'll follow up with a test summary once the (serial) run of libgomp
>>>>> testing has finished.
>>>
>>> (Why serial, by the way?)
>> 
>> Just out of caution ... (I'm using the GPU for the desktop at the
>> same time, and dmesg gets spammed with some not-so-reassuring
>> "errors" during the offloading.)
> 
> Yeah, indeed 'dmesg' is full of "notes"...
> 
> However, note that per my work on
> "libgomp make check time is excessive", all execution testing in libgomp
> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  So,
> in that regard, it makes no difference to run 'check-target-libgomp' in
> parallel.  (... with the caveat that execution tests for
> effective-targets are *not* governed by that, as I found yesterday.
> I have a WIP hack for that, too.)
> 
> 
>>> [...] what I
>>> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX.  [...]
> 
>>> [...] execution test FAILs.  Not all FAILs appear all the time [...]
> 
> What disturbs the testing a lot is that the GPU may get into a bad
> state, upon which any use either fails with an
> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or just hangs, deep in
> 'libhsa-runtime64.so.1'...
> 
> I've now tried to debug the latter case (the hang).  When the GPU gets
> into this bad state (whatever exactly that is),
> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze',
> whereas GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
> There it hangs until killed (for example, until DejaGnu's timeout
> mechanism kills the process -- just that the next GPU-using execution
> test then runs into the same thing again...).
> 
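For illustration, the call sequence in question looks roughly like this
(an editorial sketch in the style of 'gcn-run'; simplified, not the
actual GCC sources):

    #include <hsa.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void
    load_and_freeze (hsa_executable_t executable, hsa_agent_t agent,
                     hsa_code_object_t code_object)
    {
      hsa_status_t status
        = hsa_executable_load_code_object (executable, agent,
                                           code_object, "");
      /* Even in the bad GPU state, this still reports success...  */
      if (status != HSA_STATUS_SUCCESS)
        {
          fprintf (stderr, "hsa_executable_load_code_object failed\n");
          exit (EXIT_FAILURE);
        }
      /* ...but this call then never returns: it hangs deep inside
         'libhsa-runtime64.so.1' until the process is killed.  */
      status = hsa_executable_freeze (executable, "");
      if (status != HSA_STATUS_SUCCESS)
        {
          fprintf (stderr, "hsa_executable_freeze failed\n");
          exit (EXIT_FAILURE);
        }
    }
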
> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
> we're able to recover via:
> 
>$ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
>0
> 
> This is, obviously, a hack: it probably needs a serial lock so as not
> to disturb other things, it has a hard-coded 'dri/0', and, as I said in
> 
> "GCN RDNA2+ vs. GCC SLP vectorizer":
> 
> | I've no idea what
> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.

It ends up terminating your X session… (there's some automatic driver recovery
that's also sometimes triggered, which sounds like the same thing).  I need to
try using the integrated graphics for X11 to see if that avoids the issue.

I guess AMD needs to improve the driver/runtime (or we could - it's open
source, at least up to the firmware).

Richard 

> However, it's very useful in my testing.  :-|
> 
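Serializing that recovery step, as suggested above, might look roughly
like the following (an editorial sketch, not existing libgomp or testsuite
code; the file names are taken from the 'flock' command above):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* Trigger GPU recovery, holding an exclusive lock on a lock file so
       that concurrent test processes don't disturb each other.  */
    static void
    recover_gpu (void)
    {
      int lock_fd = open ("/tmp/gpu.lock", O_CREAT | O_RDWR, 0666);
      if (lock_fd == -1)
        return;
      /* Blocks until no other process is recovering the GPU.  */
      if (flock (lock_fd, LOCK_EX) == 0)
        {
          /* Reading this debugfs file triggers the GPU reset; 'dri/0'
             is hard-coded, as noted above.  */
          system ("sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover");
          flock (lock_fd, LOCK_UN);
        }
      close (lock_fd);
    }
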
> The question is: how do we detect the "hang" state without first running
> into a timeout (and how do we disambiguate such a timeout from a
> user-code timeout)?  Add a watchdog: call 'alarm([a few seconds])' before
> device initialization, and cancel it with 'alarm(0)' right before the
> actual GPU kernel launch?  (..., and add a handler for 'SIGALRM' to print
> a distinct error message that we can then react on, like for
> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
> no-go in libgomp -- instead, use a helper thread to similarly implement a
> watchdog?  ('libgomp/plugin/plugin-gcn.c' is already using pthreads for
> other purposes.)  Any other clever ideas?  What's a suitable value for
> "a few seconds"?
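A helper-thread watchdog along those lines might look roughly as follows
(an editorial sketch; 'watchdog_start', 'watchdog_cancel', and the timeout
value are all hypothetical, not actual 'plugin-gcn.c' interfaces):

    #include <errno.h>
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* "A few seconds"; a placeholder value, to be tuned.  */
    #define WATCHDOG_SECONDS 10

    static pthread_mutex_t watchdog_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t watchdog_cond = PTHREAD_COND_INITIALIZER;
    static pthread_t watchdog_tid;
    static bool watchdog_cancelled;

    /* Helper thread: wait to be cancelled; if the deadline expires
       first, assume the GPU is hung, and fail with a distinct error
       message that the testsuite can then react on, like for
       'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.  */
    static void *
    watchdog_thread (void *arg)
    {
      (void) arg;
      struct timespec deadline;
      clock_gettime (CLOCK_REALTIME, &deadline);
      deadline.tv_sec += WATCHDOG_SECONDS;

      pthread_mutex_lock (&watchdog_mutex);
      int err = 0;
      while (!watchdog_cancelled && err != ETIMEDOUT)
        err = pthread_cond_timedwait (&watchdog_cond, &watchdog_mutex,
                                      &deadline);
      bool hung = !watchdog_cancelled;
      pthread_mutex_unlock (&watchdog_mutex);

      if (hung)
        {
          fprintf (stderr, "GCN device initialization timed out\n");
          exit (EXIT_FAILURE);
        }
      return NULL;
    }

    /* To be called before device initialization, in place of
       'alarm ([a few seconds])'.  */
    static void
    watchdog_start (void)
    {
      watchdog_cancelled = false;
      pthread_create (&watchdog_tid, NULL, watchdog_thread, NULL);
    }

    /* The equivalent of 'alarm (0)': to be called right before the
       actual GPU kernel launch.  */
    static void
    watchdog_cancel (void)
    {
      pthread_mutex_lock (&watchdog_mutex);
      watchdog_cancelled = true;
      pthread_cond_signal (&watchdog_cond);
      pthread_mutex_unlock (&watchdog_mutex);
      pthread_join (watchdog_tid, NULL);
    }
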
> 
> 
> Regards
> Thomas

