Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-16 Thread Joshua Ashton




On 1/16/24 15:05, Christian König wrote:

Am 16.01.24 um 14:48 schrieb Joshua Ashton:


[SNIP]

Going to adjust the implementation accordingly.


Awesome, please CC me when you have something.


Sure, going to keep that in mind.



In the short-term I have changed if (r && r != -ENODATA) to if (r) 
and that seems to work nicely for me.
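
For reference, a minimal sketch of that short-term change in 
amdgpu_job_prepare_job() (the same snippet quoted further down in this 
thread, with the -ENODATA exemption dropped so soft-recovery errors also 
reject later submissions):

    /* No longer ignore soft recovered fences here */
    r = drm_sched_entity_error(s_entity);
    if (r)          /* was: if (r && r != -ENODATA) */
        goto error;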


One problem with solely relying on the CS submission return value 
from userspace is cancelled syncobj waits.


For example, suppose an application has one thread make a submission and 
then kick off a vkWaitSemaphores on another thread to wait on a semaphore, 
and that submission hangs. The syncobj backing the vkWaitSemaphores should 
be signalled, which is fine, but we need to return VK_ERROR_DEVICE_LOST if 
the context loss is what caused the VkSemaphore to signal.
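
To illustrate the waiter side of that scenario, a hedged application-side 
sketch (names like "sem" and "value" are assumptions for the example, not 
driver code):

    #include <stdint.h>
    #include <vulkan/vulkan.h>

    /* Waiter thread: if the submission that signals this timeline point
     * hangs and gets cancelled, this wait still completes; only the driver
     * knowing about the context loss lets it become VK_ERROR_DEVICE_LOST. */
    static VkResult wait_for_timeline_point(VkDevice device, VkSemaphore sem,
                                            uint64_t value)
    {
        const VkSemaphoreWaitInfo wait_info = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
            .semaphoreCount = 1,
            .pSemaphores = &sem,
            .pValues = &value,
        };

        return vkWaitSemaphores(device, &wait_info, UINT64_MAX);
    }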


The way this was previously integrated was with the query thing, 
which as we have established does not provide correct information 
regarding soft recovery at the moment.


Unless you have an alternative for us to get some error out of the 
syncobj (eg. -ENODATA), then right now we still require the query.


I think fixing the -ENODATA reporting back for submit is a good 
step, but I believe we still need the query to report the same 
information as we would have gotten from that here.
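
For context, the query path userspace leans on today goes through libdrm's 
amdgpu_cs_query_reset_state2(), which wraps AMDGPU_CTX_OP_QUERY_STATE2. A 
rough sketch (headers as packaged by libdrm, error handling abbreviated):

    #include <stdbool.h>
    #include <stdint.h>
    #include <amdgpu.h>
    #include <amdgpu_drm.h>

    static bool context_was_reset(amdgpu_context_handle ctx)
    {
        uint64_t flags = 0;

        if (amdgpu_cs_query_reset_state2(ctx, &flags))
            return true;  /* treat a failed query as device loss */

        /* RESET is set when a reset affected this context; as discussed
         * above it is currently not set for soft recovery, which is the
         * gap in question. GUILTY narrows it down further. */
        return flags & AMDGPU_CTX_QUERY2_FLAGS_RESET;
    }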


Yeah, exactly that was one of the reasons why the guilty handling 
approach didn't solve the problem sufficiently.


If I remember correctly, at least for the syncobjs used on Android you can 
actually query the result of the execution after waiting for them to 
signal.


So not only the issuer of a submission can get the result, but also 
everybody waiting for it. In other words, Wayland, X, etc. can 
implement graceful handling should an application send them nonsense.
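
On the sync_file side that query looks roughly like the following sketch; 
the info ioctl reports a negative status when the fence signalled with an 
error:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/sync_file.h>

    static int sync_file_error(int sync_fd)
    {
        struct sync_file_info info;

        memset(&info, 0, sizeof(info));
        if (ioctl(sync_fd, SYNC_IOC_FILE_INFO, &info))
            return -1;

        /* status: 1 = signalled, 0 = active, < 0 = signalled with error */
        return info.status < 0 ? info.status : 0;
    }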


That would be great. Right now the vkQueueSubmit for a hanging 
application ends up succeeding, as the Wayland commit has already been 
made, and the fence gets signalled.


This means we can get an image that is complete garbage from a hung app 
-- no DCC retile occurred for it, which is pretty funny as the scanout 
image looks like garbage, but it displays whatever it last was in the 
composite. xD


Getting the fence error on the compositor side would be great to 
disregard such images.


We could also do more useful device loss handling with that on the 
driver side.


Thanks!
- Joshie 🐸✨



I don't think we ever implemented something similar for drm_syncobj and 
we might need error forwarding in the dma_fence_chain container, but 
stuff like that is trivial to implement should the requirement for this 
ever come up.
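
A minimal sketch of what that forwarding could look like (purely 
illustrative, since as noted this does not exist yet): when the fence 
wrapped by a chain node signals with an error, copy it to the chain fence 
so syncobj waiters can see it.

    #include <linux/dma-fence.h>
    #include <linux/dma-fence-chain.h>

    /* Hypothetical helper, called from the chain node's signalling path */
    static void dma_fence_chain_forward_error(struct dma_fence_chain *chain,
                                              struct dma_fence *fence)
    {
        if (fence->error)
            dma_fence_set_error(&chain->base, fence->error);
    }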




Hmmm, actually the spec states that VK_SUCCESS is valid in this 
situation:


Commands that wait indefinitely for device execution (namely 
vkDeviceWaitIdle, vkQueueWaitIdle, vkWaitForFences with a maximum 
timeout, and vkGetQueryPoolResults with the VK_QUERY_RESULT_WAIT_BIT 
bit set in flags) must return in finite time even in the case of a 
lost device, and return either VK_SUCCESS or VK_ERROR_DEVICE_LOST.


...


Once a device is lost, command execution may fail, and certain 
commands that return a VkResult may return VK_ERROR_DEVICE_LOST.


I guess for now disregard last email regarding us potentially needing 
the query, it does seem that returning SUCCESS is completely valid.
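
In other words, per the spec text quoted above, an app-side wait can be as 
simple as this sketch and still be conformant on a lost device:

    #include <stdint.h>
    #include <vulkan/vulkan.h>

    /* Must return in finite time even on a lost device, as either
     * VK_SUCCESS or VK_ERROR_DEVICE_LOST. */
    static VkResult wait_one_fence(VkDevice device, VkFence fence)
    {
        return vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    }

A robust app would still consult the reset status (or a later submit 
returning VK_ERROR_DEVICE_LOST) rather than relying on this wait to report 
the loss.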


Well, how you interpret the information the kernel gives you in userspace 
is up to you. But we should at least make sure that the general 
approach inside the kernel has a design which can handle those 
requirements.


Christian.



- Joshie 🐸✨



- Joshie 🐸✨



Thanks

- Joshie 🐸✨



- Joshie 🐸✨



Christian.




Marek










Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-16 Thread Joshua Ashton




On 1/16/24 15:48, Michel Dänzer wrote:

On 2024-01-16 14:44, Joshua Ashton wrote:

On 1/16/24 13:41, Joshua Ashton wrote:

On 1/16/24 12:24, Joshua Ashton wrote:

On 1/16/24 07:47, Christian König wrote:

Am 16.01.24 um 01:05 schrieb Marek Olšák:

On Mon, Jan 15, 2024 at 3:06 PM Christian König
 wrote:

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:

On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
mailto:friedrich.v...@gmx.de>> wrote:

  Re-sending as plaintext, sorry about that

  On 15.01.24 18:54, Michel Dänzer wrote:
   > On 2024-01-15 18:26, Friedrich Vock wrote:
   >> [snip]
   >> The fundamental problem here is that not telling
applications that
   >> something went wrong when you just canceled their work
midway is an
   >> out-of-spec hack.
   >> When there is a report of real-world apps breaking
because of
  that hack,
   >> reports of different apps working (even if it's
convenient that they
   >> work) doesn't justify keeping the broken code.
   > If the breaking apps hit multiple soft resets in a row,
I've laid
  out a pragmatic solution which covers both cases.
  Hitting soft reset every time is the lucky path. Once GPU
work is
  interrupted out of nowhere, all bets are off and it might as
well
  trigger a full system hang next time. No hang recovery should
be able to
  cause that under any circumstance.


I think the more insidious situation is no further hangs but
wrong results because we skipped some work. That we skipped work
may e.g. result in some texture not being uploaded or some GPGPU
work not being done and causing further errors downstream (say if
a game is doing AI/physics on the GPU not to say anything of
actual GPGPU work one might be doing like AI)

Even worse if this is compute on eg. OpenCL for something
science/math/whatever related, or training a model.

You could randomly just get invalid/wrong results without even
knowing!

Well on the kernel side we do provide an API to query the result of
a submission. That includes canceling submissions with a soft
recovery.

What we just don't do is prevent further submissions from this
application, e.g. enforcing that the application is punished for
bad behavior.

You do prevent future submissions for regular resets though: Those
increase karma which sets ctx->guilty, and if ctx->guilty then
-ECANCELED is returned for a submission.

ctx->guilty is never true for soft recovery though, as it doesn't
increase karma, which is the problem this patch is trying to solve.
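
The check being referred to is roughly this (a paraphrased sketch of the
CS ioctl path, not a verbatim quote of the kernel source):

    /* At submission time, a context already flagged guilty is refused */
    if (atomic_read(&ctx->guilty))
        return -ECANCELED;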

By the submission result query API, I assume you mean checking
the submission fence error somehow? That doesn't seem very ergonomic
for a Vulkan driver compared to the simple solution which is to just
mark it as guilty with what already exists...

Well as I said the guilty handling is broken for quite a number of
reasons.

What we can do rather trivially is changing this code in
amdgpu_job_prepare_job():

   /* Ignore soft recovered fences here */
   r = drm_sched_entity_error(s_entity);
   if (r && r != -ENODATA)
   goto error;

This will bubble up errors from soft recoveries into the entity as
well and makes sure that further submissions are rejected.

That makes sense to do, but at least for GL_EXT_robustness, that will
not tell the app that it was guilty.

No, it clearly gets that signaled. We should probably replace the guilty
atomic with a call to drm_sched_entity_error().

It's just that this isn't what Marek and I had in mind for this,
basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
AMDGPU_CTX_OP_QUERY_STATE2.

Instead just look at the return value of the CS or query fence result IOCTL.

When you get an -ENODATA you have been guilty of causing a soft
recovery, when you get an -ETIME you are guilty of causing a timeout
which had to be hard recovered. When you get an -ECANCELED you are an
innocent victim of a hard recovery somebody else caused.

What we haven't defined yet is an error code for losing VRAM, but that
should be trivial to do.
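
Folded into userspace terms, that mapping could look like this sketch 
(names are illustrative, not an existing helper):

    #include <errno.h>

    enum reset_status {
        RESET_NONE,
        RESET_GUILTY,
        RESET_INNOCENT,
    };

    static enum reset_status reset_status_from_errno(int err)
    {
        switch (err) {
        case 0:
            return RESET_NONE;
        case -ENODATA:   /* this context caused a soft recovery */
        case -ETIME:     /* this context caused a timeout -> hard reset */
            return RESET_GUILTY;
        case -ECANCELED: /* innocent victim of someone else's reset */
        default:
            return RESET_INNOCENT;
        }
    }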

So far we have implemented the GPU reset and soft reset, but we
haven't done anything to have a robust system recovery. Under the
current system, things can easily keep hanging indefinitely because
nothing prevents that.

The reset status query should stay. Robust apps will use it to tell
when they should recreate their context and resources even if they
don't submit anything. Let's fully trust robust apps here. In the
future we might change our mind about that, but for now, let's just
focus on API conformance, and later we can change it as long as we
stay API conformant.

Non-robust apps must be terminated when they hang or are innocent but
affected. Their existence is a security and usability problem and a
source of frustrations for users.

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-16 Thread Michel Dänzer
On 2024-01-16 14:44, Joshua Ashton wrote:
> On 1/16/24 13:41, Joshua Ashton wrote:
>> On 1/16/24 12:24, Joshua Ashton wrote:
>>> On 1/16/24 07:47, Christian König wrote:
 Am 16.01.24 um 01:05 schrieb Marek Olšák:
> On Mon, Jan 15, 2024 at 3:06 PM Christian König
>  wrote:
>> Am 15.01.24 um 20:30 schrieb Joshua Ashton:
>>> On 1/15/24 19:19, Christian König wrote:
 Am 15.01.24 um 20:13 schrieb Joshua Ashton:
> On 1/15/24 18:53, Christian König wrote:
>> Am 15.01.24 um 19:35 schrieb Joshua Ashton:
>>> On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
 On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
 mailto:friedrich.v...@gmx.de>> wrote:

  Re-sending as plaintext, sorry about that

  On 15.01.24 18:54, Michel Dänzer wrote:
   > On 2024-01-15 18:26, Friedrich Vock wrote:
   >> [snip]
   >> The fundamental problem here is that not telling
 applications that
   >> something went wrong when you just canceled their work
 midway is an
   >> out-of-spec hack.
   >> When there is a report of real-world apps breaking
 because of
  that hack,
   >> reports of different apps working (even if it's
 convenient that they
   >> work) doesn't justify keeping the broken code.
   > If the breaking apps hit multiple soft resets in a row,
 I've laid
  out a pragmatic solution which covers both cases.
  Hitting soft reset every time is the lucky path. Once GPU
 work is
  interrupted out of nowhere, all bets are off and it might as
 well
  trigger a full system hang next time. No hang recovery should
 be able to
  cause that under any circumstance.


 I think the more insidious situation is no further hangs but
 wrong results because we skipped some work. That we skipped work
 may e.g. result in some texture not being uploaded or some GPGPU
 work not being done and causing further errors downstream (say if
 a game is doing AI/physics on the GPU not to say anything of
 actual GPGPU work one might be doing like AI)
>>> Even worse if this is compute on eg. OpenCL for something
>>> science/math/whatever related, or training a model.
>>>
>>> You could randomly just get invalid/wrong results without even
>>> knowing!
>> Well on the kernel side we do provide an API to query the result of
>> a submission. That includes canceling submissions with a soft
>> recovery.
>>
> >> What we just don't do is prevent further submissions from this
> >> application, e.g. enforcing that the application is punished for
> >> bad behavior.
> You do prevent future submissions for regular resets though: Those
> increase karma which sets ctx->guilty, and if ctx->guilty then
> -ECANCELED is returned for a submission.
>
> ctx->guilty is never true for soft recovery though, as it doesn't
> increase karma, which is the problem this patch is trying to solve.
>
> By the submission result query API, I assume you mean checking
> the submission fence error somehow? That doesn't seem very ergonomic
> for a Vulkan driver compared to the simple solution which is to just
> mark it as guilty with what already exists...
 Well as I said the guilty handling is broken for quite a number of
 reasons.

 What we can do rather trivially is changing this code in
 amdgpu_job_prepare_job():

   /* Ignore soft recovered fences here */
   r = drm_sched_entity_error(s_entity);
   if (r && r != -ENODATA)
   goto error;

 This will bubble up errors from soft recoveries into the entity as
 well and makes sure that further submissions are rejected.
>>> That makes sense to do, but at least for GL_EXT_robustness, that will
>>> not tell the app that it was guilty.
>> No, it clearly gets that signaled. We should probably replace the guilty
>> atomic with a call to drm_sched_entity_error().
>>
>> It's just that this isn't what Marek and I had in mind for this,
>> basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
>> AMDGPU_CTX_OP_QUERY_STATE2.
>>
>> Instead just look at the return value of the CS or query fence result 
>> IOCTL.
>>
>> When you get an -ENODATA you have been guilty of causing a soft
>> recovery, when you get an -ETIME you are guilty of causing

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-16 Thread Christian König

Am 16.01.24 um 14:48 schrieb Joshua Ashton:


[SNIP]

Going to adjust the implementation accordingly.


Awesome, please CC me when you have something.


Sure, going to keep that in mind.



In the short-term I have changed if (r && r != -ENODATA) to if (r) 
and that seems to work nicely for me.


One problem with solely relying on the CS submission return value 
from userspace is cancelled syncobj waits.


For example, suppose an application has one thread make a submission and 
then kick off a vkWaitSemaphores on another thread to wait on a semaphore, 
and that submission hangs. The syncobj backing the vkWaitSemaphores should 
be signalled, which is fine, but we need to return VK_ERROR_DEVICE_LOST if 
the context loss is what caused the VkSemaphore to signal.


The way this was previously integrated was with the query thing, 
which as we have established does not provide correct information 
regarding soft recovery at the moment.


Unless you have an alternative for us to get some error out of the 
syncobj (eg. -ENODATA), then right now we still require the query.


I think fixing the -ENODATA reporting back for submit is a good 
step, but I believe we still need the query to report the same 
information as we would have gotten from that here.


Yeah, exactly that was one of the reasons why the guilty handling 
approach didn't solve the problem sufficiently.


If I remember correctly, at least for the syncobjs used on Android you can 
actually query the result of the execution after waiting for them to signal.


So not only the issuer of a submission can get the result, but also 
everybody waiting for it. In other words, Wayland, X, etc. can 
implement graceful handling should an application send them nonsense.


I don't think we ever implemented something similar for drm_syncobj and 
we might need error forwarding in the dma_fence_chain container, but 
stuff like that is trivial to implement should the requirement for this 
ever come up.




Hmmm, actually the spec states that VK_SUCCESS is valid in this 
situation:


Commands that wait indefinitely for device execution (namely 
vkDeviceWaitIdle, vkQueueWaitIdle, vkWaitForFences with a maximum 
timeout, and vkGetQueryPoolResults with the VK_QUERY_RESULT_WAIT_BIT 
bit set in flags) must return in finite time even in the case of a 
lost device, and return either VK_SUCCESS or VK_ERROR_DEVICE_LOST.


...


Once a device is lost, command execution may fail, and certain 
commands that return a VkResult may return VK_ERROR_DEVICE_LOST.


I guess for now disregard last email regarding us potentially needing 
the query, it does seem that returning SUCCESS is completely valid.


Well, how you interpret the information the kernel gives you in userspace 
is up to you. But we should at least make sure that the general 
approach inside the kernel has a design which can handle those requirements.


Christian.



- Joshie 🐸✨



- Joshie 🐸✨



Thanks

- Joshie 🐸✨



- Joshie 🐸✨



Christian.




Marek








Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-16 Thread Joshua Ashton




On 1/16/24 13:44, Joshua Ashton wrote:



On 1/16/24 13:41, Joshua Ashton wrote:



On 1/16/24 12:24, Joshua Ashton wrote:



On 1/16/24 07:47, Christian König wrote:

Am 16.01.24 um 01:05 schrieb Marek Olšák:

On Mon, Jan 15, 2024 at 3:06 PM Christian König
 wrote:

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:

On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
mailto:friedrich.v...@gmx.de>> wrote:

 Re-sending as plaintext, sorry about that

 On 15.01.24 18:54, Michel Dänzer wrote:
  > On 2024-01-15 18:26, Friedrich Vock wrote:
  >> [snip]
  >> The fundamental problem here is that not telling
applications that
  >> something went wrong when you just canceled their work
midway is an
  >> out-of-spec hack.
  >> When there is a report of real-world apps breaking
because of
 that hack,
  >> reports of different apps working (even if it's
convenient that they
  >> work) doesn't justify keeping the broken code.
  > If the breaking apps hit multiple soft resets in a row,
I've laid
 out a pragmatic solution which covers both cases.
 Hitting soft reset every time is the lucky path. Once GPU
work is
 interrupted out of nowhere, all bets are off and it 
might as

well
 trigger a full system hang next time. No hang recovery 
should

be able to
 cause that under any circumstance.


I think the more insidious situation is no further hangs but
wrong results because we skipped some work. That we skipped 
work
may e.g. result in some texture not being uploaded or some 
GPGPU
work not being done and causing further errors downstream 
(say if

a game is doing AI/physics on the GPU not to say anything of
actual GPGPU work one might be doing like AI)

Even worse if this is compute on eg. OpenCL for something
science/math/whatever related, or training a model.

You could randomly just get invalid/wrong results without even
knowing!
Well on the kernel side we do provide an API to query the 
result of

a submission. That includes canceling submissions with a soft
recovery.

What we just don't do is prevent further submissions from this
application, e.g. enforcing that the application is punished for
bad behavior.

You do prevent future submissions for regular resets though: Those
increase karma which sets ctx->guilty, and if ctx->guilty then
-ECANCELED is returned for a submission.

ctx->guilty is never true for soft recovery though, as it doesn't
increase karma, which is the problem this patch is trying to 
solve.


By the submission result query API, I assume you mean checking
the submission fence error somehow? That doesn't seem very 
ergonomic
for a Vulkan driver compared to the simple solution which is to 
just

mark it as guilty with what already exists...

Well as I said the guilty handling is broken for quite a number of
reasons.

What we can do rather trivially is changing this code in
amdgpu_job_prepare_job():

  /* Ignore soft recovered fences here */
  r = drm_sched_entity_error(s_entity);
  if (r && r != -ENODATA)
  goto error;

This will bubble up errors from soft recoveries into the entity as
well and makes sure that further submissions are rejected.
That makes sense to do, but at least for GL_EXT_robustness, that 
will

not tell the app that it was guilty.
No, it clearly gets that signaled. We should probably replace the 
guilty

atomic with a call to drm_sched_entity_error().

It's just that this isn't what Marek and I had in mind for this,
basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
AMDGPU_CTX_OP_QUERY_STATE2.

Instead just look at the return value of the CS or query fence 
result IOCTL.


When you get an -ENODATA you have been guilty of causing a soft
recovery, when you get an -ETIME you are guilty of causing a timeout
which had to be hard recovered. When you get an -ECANCELED you are an
innocent victim of a hard recovery somebody else caused.

What we haven't defined yet is an error code for losing VRAM, but 
that

should be trivial to do.

So far we have implemented the GPU reset and soft reset, but we
haven't done anything to have a robust system recovery. Under the
current system, things can easily keep hanging indefinitely because
nothing prevents that.

The reset status query should stay. Robust apps will use it to tell
when they should recreate their context and resources even if they
don't submit anything. Let's fully trust robust apps here. In the
future we might change our mind about that, but for now, let's just
focus on API conformance, and later we can change it as long as we
stay API conformant.

Non-robust apps must be terminated when they hang or are innocent but
affected. Their existence is a security and usability problem and a
source of frustrations for users.

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-16 Thread Joshua Ashton




On 1/16/24 13:41, Joshua Ashton wrote:



On 1/16/24 12:24, Joshua Ashton wrote:



On 1/16/24 07:47, Christian König wrote:

Am 16.01.24 um 01:05 schrieb Marek Olšák:

On Mon, Jan 15, 2024 at 3:06 PM Christian König
 wrote:

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:

On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
mailto:friedrich.v...@gmx.de>> wrote:

 Re-sending as plaintext, sorry about that

 On 15.01.24 18:54, Michel Dänzer wrote:
  > On 2024-01-15 18:26, Friedrich Vock wrote:
  >> [snip]
  >> The fundamental problem here is that not telling
applications that
  >> something went wrong when you just canceled their work
midway is an
  >> out-of-spec hack.
  >> When there is a report of real-world apps breaking
because of
 that hack,
  >> reports of different apps working (even if it's
convenient that they
  >> work) doesn't justify keeping the broken code.
  > If the breaking apps hit multiple soft resets in a row,
I've laid
 out a pragmatic solution which covers both cases.
 Hitting soft reset every time is the lucky path. Once GPU
work is
 interrupted out of nowhere, all bets are off and it 
might as

well
 trigger a full system hang next time. No hang recovery 
should

be able to
 cause that under any circumstance.


I think the more insidious situation is no further hangs but
wrong results because we skipped some work. That we skipped work
may e.g. result in some texture not being uploaded or some GPGPU
work not being done and causing further errors downstream 
(say if

a game is doing AI/physics on the GPU not to say anything of
actual GPGPU work one might be doing like AI)

Even worse if this is compute on eg. OpenCL for something
science/math/whatever related, or training a model.

You could randomly just get invalid/wrong results without even
knowing!
Well on the kernel side we do provide an API to query the 
result of

a submission. That includes canceling submissions with a soft
recovery.

What we just don't do is prevent further submissions from this
application, e.g. enforcing that the application is punished for
bad behavior.

You do prevent future submissions for regular resets though: Those
increase karma which sets ctx->guilty, and if ctx->guilty then
-ECANCELED is returned for a submission.

ctx->guilty is never true for soft recovery though, as it doesn't
increase karma, which is the problem this patch is trying to solve.

By the submission result query API, I assume you mean checking
the submission fence error somehow? That doesn't seem very 
ergonomic
for a Vulkan driver compared to the simple solution which is to 
just

mark it as guilty with what already exists...

Well as I said the guilty handling is broken for quite a number of
reasons.

What we can do rather trivially is changing this code in
amdgpu_job_prepare_job():

  /* Ignore soft recovered fences here */
  r = drm_sched_entity_error(s_entity);
  if (r && r != -ENODATA)
  goto error;

This will bubble up errors from soft recoveries into the entity as
well and makes sure that further submissions are rejected.

That makes sense to do, but at least for GL_EXT_robustness, that will
not tell the app that it was guilty.
No, it clearly gets that signaled. We should probably replace the 
guilty

atomic with a call to drm_sched_entity_error().

It's just that this isn't what Marek and I had in mind for this,
basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
AMDGPU_CTX_OP_QUERY_STATE2.

Instead just look at the return value of the CS or query fence 
result IOCTL.


When you get an -ENODATA you have been guilty of causing a soft
recovery, when you get an -ETIME you are guilty of causing a timeout
which had to be hard recovered. When you get an -ECANCELED you are an
innocent victim of a hard recovery somebody else caused.

What we haven't defined yet is an error code for losing VRAM, but 
that

should be trivial to do.

So far we have implemented the GPU reset and soft reset, but we
haven't done anything to have a robust system recovery. Under the
current system, things can easily keep hanging indefinitely because
nothing prevents that.

The reset status query should stay. Robust apps will use it to tell
when they should recreate their context and resources even if they
don't submit anything. Let's fully trust robust apps here. In the
future we might change our mind about that, but for now, let's just
focus on API conformance, and later we can change it as long as we
stay API conformant.

Non-robust apps must be terminated when they hang or are innocent but
affected. Their existence is a security and usability problem and a
source of frustrations for users. 100% guaranteed system recovery is
impossible if they continue to live.

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-16 Thread Joshua Ashton




On 1/16/24 12:24, Joshua Ashton wrote:



On 1/16/24 07:47, Christian König wrote:

Am 16.01.24 um 01:05 schrieb Marek Olšák:

On Mon, Jan 15, 2024 at 3:06 PM Christian König
 wrote:

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:

On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
mailto:friedrich.v...@gmx.de>> wrote:

 Re-sending as plaintext, sorry about that

 On 15.01.24 18:54, Michel Dänzer wrote:
  > On 2024-01-15 18:26, Friedrich Vock wrote:
  >> [snip]
  >> The fundamental problem here is that not telling
applications that
  >> something went wrong when you just canceled their work
midway is an
  >> out-of-spec hack.
  >> When there is a report of real-world apps breaking
because of
 that hack,
  >> reports of different apps working (even if it's
convenient that they
  >> work) doesn't justify keeping the broken code.
  > If the breaking apps hit multiple soft resets in a row,
I've laid
 out a pragmatic solution which covers both cases.
 Hitting soft reset every time is the lucky path. Once GPU
work is
 interrupted out of nowhere, all bets are off and it might as
well
 trigger a full system hang next time. No hang recovery 
should

be able to
 cause that under any circumstance.


I think the more insidious situation is no further hangs but
wrong results because we skipped some work. That we skipped work
may e.g. result in some texture not being uploaded or some GPGPU
work not being done and causing further errors downstream (say if
a game is doing AI/physics on the GPU not to say anything of
actual GPGPU work one might be doing like AI)

Even worse if this is compute on eg. OpenCL for something
science/math/whatever related, or training a model.

You could randomly just get invalid/wrong results without even
knowing!

Well on the kernel side we do provide an API to query the result of
a submission. That includes canceling submissions with a soft
recovery.

What we just don't do is prevent further submissions from this
application, e.g. enforcing that the application is punished for
bad behavior.

You do prevent future submissions for regular resets though: Those
increase karma which sets ctx->guilty, and if ctx->guilty then
-ECANCELED is returned for a submission.

ctx->guilty is never true for soft recovery though, as it doesn't
increase karma, which is the problem this patch is trying to solve.

By the submission result query API, I assume you mean checking
the submission fence error somehow? That doesn't seem very ergonomic
for a Vulkan driver compared to the simple solution which is to just
mark it as guilty with what already exists...

Well as I said the guilty handling is broken for quite a number of
reasons.

What we can do rather trivially is changing this code in
amdgpu_job_prepare_job():

  /* Ignore soft recovered fences here */
  r = drm_sched_entity_error(s_entity);
  if (r && r != -ENODATA)
  goto error;

This will bubble up errors from soft recoveries into the entity as
well and makes sure that further submissions are rejected.

That makes sense to do, but at least for GL_EXT_robustness, that will
not tell the app that it was guilty.
No, it clearly gets that signaled. We should probably replace the 
guilty

atomic with a call to drm_sched_entity_error().

It's just that this isn't what Marek and I had in mind for this,
basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
AMDGPU_CTX_OP_QUERY_STATE2.

Instead just look at the return value of the CS or query fence 
result IOCTL.


When you get an -ENODATA you have been guilty of causing a soft
recovery, when you get an -ETIME you are guilty of causing a timeout
which had to be hard recovered. When you get an -ECANCELED you are an
innocent victim of a hard recovery somebody else caused.

What we haven't defined yet is an error code for losing VRAM, but that
should be trivial to do.

So far we have implemented the GPU reset and soft reset, but we
haven't done anything to have a robust system recovery. Under the
current system, things can easily keep hanging indefinitely because
nothing prevents that.

The reset status query should stay. Robust apps will use it to tell
when they should recreate their context and resources even if they
don't submit anything. Let's fully trust robust apps here. In the
future we might change our mind about that, but for now, let's just
focus on API conformance, and later we can change it as long as we
stay API conformant.

Non-robust apps must be terminated when they hang or are innocent but
affected. Their existence is a security and usability problem and a
source of frustrations for users. 100% guaranteed system recovery is
impossible if they continue to live.

IBs sho

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-16 Thread Joshua Ashton




On 1/16/24 07:47, Christian König wrote:

Am 16.01.24 um 01:05 schrieb Marek Olšák:

On Mon, Jan 15, 2024 at 3:06 PM Christian König
 wrote:

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:

On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
mailto:friedrich.v...@gmx.de>> wrote:

 Re-sending as plaintext, sorry about that

 On 15.01.24 18:54, Michel Dänzer wrote:
  > On 2024-01-15 18:26, Friedrich Vock wrote:
  >> [snip]
  >> The fundamental problem here is that not telling
applications that
  >> something went wrong when you just canceled their work
midway is an
  >> out-of-spec hack.
  >> When there is a report of real-world apps breaking
because of
 that hack,
  >> reports of different apps working (even if it's
convenient that they
  >> work) doesn't justify keeping the broken code.
  > If the breaking apps hit multiple soft resets in a row,
I've laid
 out a pragmatic solution which covers both cases.
 Hitting soft reset every time is the lucky path. Once GPU
work is
 interrupted out of nowhere, all bets are off and it might as
well
 trigger a full system hang next time. No hang recovery should
be able to
 cause that under any circumstance.


I think the more insidious situation is no further hangs but
wrong results because we skipped some work. That we skipped work
may e.g. result in some texture not being uploaded or some GPGPU
work not being done and causing further errors downstream (say if
a game is doing AI/physics on the GPU not to say anything of
actual GPGPU work one might be doing like AI)

Even worse if this is compute on eg. OpenCL for something
science/math/whatever related, or training a model.

You could randomly just get invalid/wrong results without even
knowing!

Well on the kernel side we do provide an API to query the result of
a submission. That includes canceling submissions with a soft
recovery.

What we just don't do is prevent further submissions from this
application, e.g. enforcing that the application is punished for
bad behavior.

You do prevent future submissions for regular resets though: Those
increase karma which sets ctx->guilty, and if ctx->guilty then
-ECANCELED is returned for a submission.

ctx->guilty is never true for soft recovery though, as it doesn't
increase karma, which is the problem this patch is trying to solve.

By the submission result query API, I assume you mean checking
the submission fence error somehow? That doesn't seem very ergonomic
for a Vulkan driver compared to the simple solution which is to just
mark it as guilty with what already exists...

Well as I said the guilty handling is broken for quite a number of
reasons.

What we can do rather trivially is changing this code in
amdgpu_job_prepare_job():

  /* Ignore soft recovered fences here */
  r = drm_sched_entity_error(s_entity);
  if (r && r != -ENODATA)
  goto error;

This will bubble up errors from soft recoveries into the entity as
well and makes sure that further submissions are rejected.

That makes sense to do, but at least for GL_EXT_robustness, that will
not tell the app that it was guilty.

No, it clearly gets that signaled. We should probably replace the guilty
atomic with a call to drm_sched_entity_error().

It's just that this isn't what Marek and I had in mind for this,
basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
AMDGPU_CTX_OP_QUERY_STATE2.

Instead just look at the return value of the CS or query fence result 
IOCTL.


When you get an -ENODATA you have been guilty of causing a soft
recovery, when you get an -ETIME you are guilty of causing a timeout
which had to be hard recovered. When you get an -ECANCELED you are an
innocent victim of a hard recovery somebody else caused.

What we haven't defined yet is an error code for losing VRAM, but that
should be trivial to do.

So far we have implemented the GPU reset and soft reset, but we
haven't done anything to have a robust system recovery. Under the
current system, things can easily keep hanging indefinitely because
nothing prevents that.

The reset status query should stay. Robust apps will use it to tell
when they should recreate their context and resources even if they
don't submit anything. Let's fully trust robust apps here. In the
future we might change our mind about that, but for now, let's just
focus on API conformance, and later we can change it as long as we
stay API conformant.

Non-robust apps must be terminated when they hang or are innocent but
affected. Their existence is a security and usability problem and a
source of frustrations for users. 100% guaranteed system recovery is
impossible if they continue to live.

IBs should be rejected for all guilty and affected i

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 16.01.24 um 01:05 schrieb Marek Olšák:

On Mon, Jan 15, 2024 at 3:06 PM Christian König
 wrote:

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:

On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
mailto:friedrich.v...@gmx.de>> wrote:

 Re-sending as plaintext, sorry about that

 On 15.01.24 18:54, Michel Dänzer wrote:
  > On 2024-01-15 18:26, Friedrich Vock wrote:
  >> [snip]
  >> The fundamental problem here is that not telling
applications that
  >> something went wrong when you just canceled their work
midway is an
  >> out-of-spec hack.
  >> When there is a report of real-world apps breaking
because of
 that hack,
  >> reports of different apps working (even if it's
convenient that they
  >> work) doesn't justify keeping the broken code.
  > If the breaking apps hit multiple soft resets in a row,
I've laid
 out a pragmatic solution which covers both cases.
 Hitting soft reset every time is the lucky path. Once GPU
work is
 interrupted out of nowhere, all bets are off and it might as
well
 trigger a full system hang next time. No hang recovery should
be able to
 cause that under any circumstance.


I think the more insidious situation is no further hangs but
wrong results because we skipped some work. That we skipped work
may e.g. result in some texture not being uploaded or some GPGPU
work not being done and causing further errors downstream (say if
a game is doing AI/physics on the GPU not to say anything of
actual GPGPU work one might be doing like AI)

Even worse if this is compute on eg. OpenCL for something
science/math/whatever related, or training a model.

You could randomly just get invalid/wrong results without even
knowing!

Well on the kernel side we do provide an API to query the result of
a submission. That includes canceling submissions with a soft
recovery.

What we just don't do is prevent further submissions from this
application, e.g. enforcing that the application is punished for
bad behavior.

You do prevent future submissions for regular resets though: Those
increase karma which sets ctx->guilty, and if ctx->guilty then
-ECANCELED is returned for a submission.

ctx->guilty is never true for soft recovery though, as it doesn't
increase karma, which is the problem this patch is trying to solve.

By the submission result query API, I assume you mean checking
the submission fence error somehow? That doesn't seem very ergonomic
for a Vulkan driver compared to the simple solution which is to just
mark it as guilty with what already exists...

Well as I said the guilty handling is broken for quite a number of
reasons.

What we can do rather trivially is changing this code in
amdgpu_job_prepare_job():

  /* Ignore soft recovered fences here */
  r = drm_sched_entity_error(s_entity);
  if (r && r != -ENODATA)
  goto error;

This will bubble up errors from soft recoveries into the entity as
well and makes sure that further submissions are rejected.

That makes sense to do, but at least for GL_EXT_robustness, that will
not tell the app that it was guilty.

No, it clearly gets that signaled. We should probably replace the guilty
atomic with a call to drm_sched_entity_error().

It's just that this isn't what Marek and I had in mind for this,
basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
AMDGPU_CTX_OP_QUERY_STATE2.

Instead just look at the return value of the CS or query fence result IOCTL.

When you get an -ENODATA you have been guilty of causing a soft
recovery, when you get an -ETIME you are guilty of causing a timeout
which had to be hard recovered. When you get an -ECANCELED you are an
innocent victim of a hard recovery somebody else caused.

What we haven't defined yet is an error code for losing VRAM, but that
should be trivial to do.

So far we have implemented the GPU reset and soft reset, but we
haven't done anything to have a robust system recovery. Under the
current system, things can easily keep hanging indefinitely because
nothing prevents that.

The reset status query should stay. Robust apps will use it to tell
when they should recreate their context and resources even if they
don't submit anything. Let's fully trust robust apps here. In the
future we might change our mind about that, but for now, let's just
focus on API conformance, and later we can change it as long as we
stay API conformant.

Non-robust apps must be terminated when they hang or are innocent but
affected. Their existence is a security and usability problem and a
source of frustrations for users. 100% guaranteed system recovery is
impossible if they continue to live.

IBs should be rejected for all guilty and affected innocent contexts
unconditionally, both robust a

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Marek Olšák
On Mon, Jan 15, 2024 at 3:06 PM Christian König
 wrote:
>
> Am 15.01.24 um 20:30 schrieb Joshua Ashton:
> > On 1/15/24 19:19, Christian König wrote:
> >> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
> >>> On 1/15/24 18:53, Christian König wrote:
>  Am 15.01.24 um 19:35 schrieb Joshua Ashton:
> > On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
> >> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
> >> mailto:friedrich.v...@gmx.de>> wrote:
> >>
> >> Re-sending as plaintext, sorry about that
> >>
> >> On 15.01.24 18:54, Michel Dänzer wrote:
> >>  > On 2024-01-15 18:26, Friedrich Vock wrote:
> >>  >> [snip]
> >>  >> The fundamental problem here is that not telling
> >> applications that
> >>  >> something went wrong when you just canceled their work
> >> midway is an
> >>  >> out-of-spec hack.
> >>  >> When there is a report of real-world apps breaking
> >> because of
> >> that hack,
> >>  >> reports of different apps working (even if it's
> >> convenient that they
> >>  >> work) doesn't justify keeping the broken code.
> >>  > If the breaking apps hit multiple soft resets in a row,
> >> I've laid
> >> out a pragmatic solution which covers both cases.
> >> Hitting soft reset every time is the lucky path. Once GPU
> >> work is
> >> interrupted out of nowhere, all bets are off and it might as
> >> well
> >> trigger a full system hang next time. No hang recovery should
> >> be able to
> >> cause that under any circumstance.
> >>
> >>
> >> I think the more insidious situation is no further hangs but
> >> wrong results because we skipped some work. That we skipped work
> >> may e.g. result in some texture not being uploaded or some GPGPU
> >> work not being done and causing further errors downstream (say if
> >> a game is doing AI/physics on the GPU not to say anything of
> >> actual GPGPU work one might be doing like AI)
> >
> > Even worse if this is compute on eg. OpenCL for something
> > science/math/whatever related, or training a model.
> >
> > You could randomly just get invalid/wrong results without even
> > knowing!
> 
>  Well on the kernel side we do provide an API to query the result of
>  a submission. That includes canceling submissions with a soft
>  recovery.
> 
> >> What we just don't do is prevent further submissions from this
> >> application, e.g. enforcing that the application is punished for
> >> bad behavior.
> >>>
> >>> You do prevent future submissions for regular resets though: Those
> >>> increase karma which sets ctx->guilty, and if ctx->guilty then
> >>> -ECANCELED is returned for a submission.
> >>>
> >>> ctx->guilty is never true for soft recovery though, as it doesn't
> >>> increase karma, which is the problem this patch is trying to solve.
> >>>
> >>> By the submission result query API, I assume you mean checking
> >>> the submission fence error somehow? That doesn't seem very ergonomic
> >>> for a Vulkan driver compared to the simple solution which is to just
> >>> mark it as guilty with what already exists...
> >>
> >> Well as I said the guilty handling is broken for quite a number of
> >> reasons.
> >>
> >> What we can do rather trivially is changing this code in
> >> amdgpu_job_prepare_job():
> >>
> >>  /* Ignore soft recovered fences here */
> >>  r = drm_sched_entity_error(s_entity);
> >>  if (r && r != -ENODATA)
> >>  goto error;
> >>
> >> This will bubble up errors from soft recoveries into the entity as
> >> well and makes sure that further submissions are rejected.
> >
> > That makes sense to do, but at least for GL_EXT_robustness, that will
> > not tell the app that it was guilty.
>
> No, it clearly gets that signaled. We should probably replace the guilty
> atomic with a call to drm_sched_entity_error().
>
> It's just that this isn't what Marek and I had in mind for this,
> basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
> AMDGPU_CTX_OP_QUERY_STATE2.
>
> Instead just look at the return value of the CS or query fence result IOCTL.
>
> When you get an -ENODATA you have been guilty of causing a soft
> recovery, when you get an -ETIME you are guilty of causing a timeout
> which had to be hard recovered. When you get an -ECANCELED you are an
> innocent victim of a hard recovery somebody else caused.
>
> What we haven't defined yet is an error code for losing VRAM, but that
> should be trivial to do.

So far we have implemented the GPU reset and soft reset, but we
haven't done anything to have a robust system recovery. Under the
current system, things can easily keep hanging indefinitely because
nothing prevents that.

The reset status query should stay. Robust apps will use it to tell
when they should recreate their context and resources even if they
don't submit anything.

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Marek Olšák
On Mon, Jan 15, 2024 at 11:41 AM Michel Dänzer  wrote:
>
> On 2024-01-15 17:19, Friedrich Vock wrote:
> > On 15.01.24 16:43, Joshua Ashton wrote:
> >> On 1/15/24 15:25, Michel Dänzer wrote:
> >>> On 2024-01-15 14:17, Christian König wrote:
>  Am 15.01.24 um 12:37 schrieb Joshua Ashton:
> > On 1/15/24 09:40, Christian König wrote:
> >> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
> >>
> >>> Without this feedback, the application may keep pushing through
> >>> the soft
> >>> recoveries, continually hanging the system with jobs that timeout.
> >>
> >> Well, that is intentional behavior. Marek is voting for making
> >> soft recovered errors fatal as well, while Michel is voting for
> >> ignoring them instead.
> >>
> >> I'm not really sure what to do. If you guys think that soft
> >> recovered hangs should be fatal as well then we can certainly do
> >> this.
> >>>
> >>> A possible compromise might be making soft resets fatal if they
> >>> happen repeatedly (within a certain period of time?).
> >>
> >> No, no and no. Aside from introducing issues by side effects not
> >> surfacing and all of the stuff I mentioned about descriptor buffers,
> >> bda, draw indirect and stuff just resulting in more faults and hangs...
> >>
> >> You are proposing we throw out every promise we made to an application
> >> on the API contract level because it "might work". That's just wrong!
> >>
> >> Let me put this in explicit terms: What you are proposing is in direct
> >> violation of the GL and Vulkan specification.
> >>
> >> You can't just chose to break these contracts because you think it
> >> 'might' be a better user experience.
> >
> > Is the original issue that motivated soft resets to be non-fatal even an
> > issue anymore?
> >
> > If I read that old thread correctly, the rationale for that was that
> > assigning guilt to a context was more broken than not doing it, because
> > the compositor/Xwayland process would also crash despite being unrelated
> > to the hang.
> > With Joshua's Mesa fixes, this is not the case anymore, so I don't think
> > keeping soft resets non-fatal provides any benefit to the user experience.
> > The potential detriments to user experience have been outlined multiple
> > times in this thread already.
> >
> > (I suppose if the compositor itself faults it might still bring down a
> > session, but I've literally never seen that, and it's not like a
> > compositor triggering segfaults on CPU stays alive either.)
>
> That's indeed what happened for me, multiple times. And each time the session 
> continued running fine for days after the soft reset.
>
> But apparently my experience isn't valid somehow, and I should have been 
> forced to log in again to please the GL gods...
>
>
> Conversely, I can't remember hitting a case where an app kept running into 
> soft resets. It's almost as if different people may have different 
> experiences! ;)
>
> Note that I'm not saying that case can't happen. Making soft resets fatal 
> only if they happen repeatedly could address both issues, rather than only 
> one or the other. Seems like a win-win.

This is exactly the comment that shouldn't have been sent, and you are
not the only one.

Nobody should ever care about subjective experiences. We can only do
this properly by looking at the whole system and its rules and trying to
find a solution that works for everything on paper first. DrawIndirect
is one case where the current system fails. "Works for me because I
don't use DrawIndirect" is a horrible way to do this.

Marek


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 19:57, Christian König wrote:

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
mailto:friedrich.v...@gmx.de>> wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling 
applications that
 >> something went wrong when you just canceled their work 
midway is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking 
because of

    that hack,
 >> reports of different apps working (even if it's 
convenient that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, 
I've laid

    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU 
work is
    interrupted out of nowhere, all bets are off and it might as 
well
    trigger a full system hang next time. No hang recovery should 
be able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but 
wrong results because we skipped some work. That we skipped work 
may e.g. result in some texture not being uploaded or some GPGPU 
work not being done and causing further errors downstream (say if 
a game is doing AI/physics on the GPU not to say anything of 
actual GPGPU work one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even 
knowing!


Well on the kernel side we do provide an API to query the result of 
a submission. That includes canceling submissions with a soft 
recovery.


What we just don't do is prevent further submissions from this 
application, e.g. enforcing that the application is punished for 
bad behavior.


You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.


ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.


By the submission result query API, I assume you mean checking 
the submission fence error somehow? That doesn't seem very ergonomic 
for a Vulkan driver compared to the simple solution which is to just 
mark it as guilty with what already exists...


Well as I said the guilty handling is broken for quite a number of 
reasons.


What we can do rather trivially is changing this code in 
amdgpu_job_prepare_job():


 /* Ignore soft recovered fences here */
 r = drm_sched_entity_error(s_entity);
 if (r && r != -ENODATA)
 goto error;

This will bubble up errors from soft recoveries into the entity as 
well and makes sure that further submissions are rejected.


That makes sense to do, but at least for GL_EXT_robustness, that will 
not tell the app that it was guilty.


No, it clearly gets that signaled. We should probably replace the guilty 
atomic with a call to drm_sched_entity_error().


It's just that this isn't what Marek and I had in mind for this, 
basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or 
AMDGPU_CTX_OP_QUERY_STATE2.


Instead just look at the return value of the CS or query fence result 
IOCTL.


When you get an -ENODATA you have been guilty of causing a soft 
recovery, when you get an -ETIME you are guilty of causing a timeout 
which had to be hard recovered. When you get an -ECANCELED you are an 
innocent victim of a hard recovery somebody else caused.


What we haven't defined yet is an error code for losing VRAM, but that 
should be trivial to do.


Thanks for the info, I will test things out here and likely send a patch 
to change if (r && r != -ENODATA) -> if (r) if things work out.


- Joshie 🐸✨



Regards,
Christian.



We could always return UNKNOWN_CONTEXT_RESET_EXT in that case, I guess.
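
A sketch of how that could map onto the GL_EXT_robustness values 
(illustrative only, mirroring the error codes Christian lists above):

    #include <errno.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>

    static GLenum gl_reset_status_from_errno(int err)
    {
        switch (err) {
        case 0:
            return GL_NO_ERROR;
        case -ENODATA:   /* guilty of a soft recovery */
        case -ETIME:     /* guilty of a hard recovery */
            return GL_GUILTY_CONTEXT_RESET_EXT;
        case -ECANCELED: /* innocent victim of someone else's reset */
            return GL_INNOCENT_CONTEXT_RESET_EXT;
        default:
            return GL_UNKNOWN_CONTEXT_RESET_EXT;
        }
    }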

I am not super sure what is wrong/egregious about the ctx->guilty 
handling, is there a previous email thread explaining that?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨





Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + 
AMDGPU, but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API 
contracts like this, it's flatout wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
mailto:friedrich.v...@gmx.de>> wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling 
applications that
 >> something went wrong when you just canceled their work 
midway is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking 
because of

    that hack,
 >> reports of different apps working (even if it's 
convenient that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, 
I've laid

    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU 
work is
    interrupted out of nowhere, all bets are off and it might as 
well
    trigger a full system hang next time. No hang recovery should 
be able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but 
wrong results because we skipped some work. That we skipped work 
may e.g. result in some texture not being uploaded or some GPGPU 
work not being done and causing further errors downstream (say if 
a game is doing AI/physics on the GPU not to say anything of 
actual GPGPU work one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even 
knowing!


Well on the kernel side we do provide an API to query the result of 
a submission. That includes canceling submissions with a soft 
recovery.


What we just don't do is prevent further submissions from this 
application, e.g. enforcing that the application is punished for 
bad behavior.


You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.


ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.


By the submission result query API, I assume you mean checking
the submission fence error somehow? That doesn't seem very ergonomic 
for a Vulkan driver compared to the simple solution which is to just 
mark it as guilty with what already exists...


Well as I said the guilty handling is broken for quite a number of 
reasons.


What we can do rather trivially is changing this code in 
amdgpu_job_prepare_job():


 /* Ignore soft recovered fences here */
 r = drm_sched_entity_error(s_entity);
 if (r && r != -ENODATA)
 goto error;

This will bubble up errors from soft recoveries into the entity as
well and make sure that further submissions are rejected.
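
For clarity, the change being suggested would look roughly like this (a
sketch of the idea, not a tested patch):

    /* Soft-recovery errors (-ENODATA) are no longer exempted: any prior
     * error on the entity now rejects further submissions from it. */
    r = drm_sched_entity_error(s_entity);
    if (r)
        goto error;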


That makes sense to do, but at least for GL_EXT_robustness, that will 
not tell the app that it was guilty.


No, it clearly gets that signaled. We should probably replace the guilty 
atomic with calls to drm_sched_entity_error().


It's just that this isn't what Marek and I had in mind for this;
basically, completely forget about AMDGPU_CTX_OP_QUERY_STATE or
AMDGPU_CTX_OP_QUERY_STATE2.


Instead just look at the return value of the CS or query fence result IOCTL.

When you get an -ENODATA, you have been guilty of causing a soft
recovery; when you get an -ETIME, you are guilty of causing a timeout
which had to be hard recovered. When you get an -ECANCELED, you are an
innocent victim of a hard recovery somebody else caused.


What we haven't defined yet is an error code for losing VRAM, but that
should be trivial to do.
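
To illustrate, a userspace driver could then classify the errno it gets
back roughly like this (purely a sketch with made-up names, not actual
RADV code):

    #include <errno.h>

    enum reset_kind {
        RESET_NONE,          /* submission went through fine */
        RESET_GUILTY_SOFT,   /* -ENODATA: we caused a soft recovery */
        RESET_GUILTY_HARD,   /* -ETIME: we caused a timeout / hard reset */
        RESET_INNOCENT,      /* -ECANCELED: victim of someone else's reset */
        RESET_OTHER_ERROR,   /* any other submission failure */
    };

    static enum reset_kind classify_cs_error(int err)
    {
        switch (err) {
        case 0:          return RESET_NONE;
        case -ENODATA:   return RESET_GUILTY_SOFT;
        case -ETIME:     return RESET_GUILTY_HARD;
        case -ECANCELED: return RESET_INNOCENT;
        default:         return RESET_OTHER_ERROR;
        }
    }

The guilty cases would then map to VK_ERROR_DEVICE_LOST (and the
corresponding GL robustness reset status) on the Vulkan/GL side.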


Regards,
Christian.



We could always return UNKNOWN_CONTEXT_RESET_EXT in that case, I guess.

I am not super sure what is wrong/egregious about the ctx->guilty
handling; is there a previous email thread explaining that?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨





Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + 
AMDGPU, but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API 
contracts like this, it's flat-out wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused 
itself, it

    should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
 >
 > Anyway, th

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
mailto:friedrich.v...@gmx.de>> wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling 
applications that
 >> something went wrong when you just canceled their work 
midway is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking because of
    that hack,
 >> reports of different apps working (even if it's convenient 
that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, I've 
laid

    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU work is
    interrupted out of nowhere, all bets are off and it might as well
    trigger a full system hang next time. No hang recovery should 
be able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong 
results because we skipped some work. That we skipped work may e.g. 
result in some texture not being uploaded or some GPGPU work not 
being done and causing further errors downstream (say if a game is 
doing AI/physics on the GPU not to say anything of actual GPGPU 
work one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even knowing!


Well on the kernel side we do provide an API to query the result of a 
submission. That includes canceling submissions with a soft recovery.


What we just don't do is prevent further submissions from this
application, e.g. enforce that the application is punished for bad
behavior.


You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.


ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.


By the submission result query API, I assume you mean checking the
submission fence error somehow? That doesn't seem very ergonomic for a 
Vulkan driver compared to the simple solution which is to just mark it 
as guilty with what already exists...


Well as I said the guilty handling is broken for quite a number of reasons.

What we can do rather trivially is changing this code in 
amdgpu_job_prepare_job():


     /* Ignore soft recovered fences here */
     r = drm_sched_entity_error(s_entity);
     if (r && r != -ENODATA)
     goto error;

This will bubble up errors from soft recoveries into the entity as well 
and makes sure that further submissions are rejected.


That makes sense to do, but at least for GL_EXT_robustness, that will 
not tell the app that it was guilty.


We could always return UNKNOWN_CONTEXT_RESET_EXT in that case, I guess.

I am not super sure what is wrong/egregious about the ctx->guilty
handling; is there a previous email thread explaining that?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨





Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + AMDGPU, 
but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API contracts 
like this, it's flat-out wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused 
itself, it

    should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
 >
 > Anyway, this will happen at some point, the reality is it 
hasn't

    yet though.
 >
 >





- Joshie 🐸✨




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
mailto:friedrich.v...@gmx.de>> wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling 
applications that
 >> something went wrong when you just canceled their work 
midway is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking because of
    that hack,
 >> reports of different apps working (even if it's convenient 
that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, I've 
laid

    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU work is
    interrupted out of nowhere, all bets are off and it might as well
    trigger a full system hang next time. No hang recovery should 
be able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong 
results because we skipped some work. That we skipped work may e.g. 
result in some texture not being uploaded or some GPGPU work not 
being done and causing further errors downstream (say if a game is 
doing AI/physics on the GPU not to say anything of actual GPGPU 
work one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even knowing!


Well on the kernel side we do provide an API to query the result of a 
submission. That includes canceling submissions with a soft recovery.


What we just don't do is prevent further submissions from this
application, e.g. enforce that the application is punished for bad
behavior.


You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.


ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.


By the submission result query API, I assume you mean checking the
submission fence error somehow? That doesn't seem very ergonomic for a 
Vulkan driver compared to the simple solution which is to just mark it 
as guilty with what already exists...


Well as I said the guilty handling is broken for quite a number of reasons.

What we can do rather trivially is changing this code in 
amdgpu_job_prepare_job():


    /* Ignore soft recovered fences here */
    r = drm_sched_entity_error(s_entity);
    if (r && r != -ENODATA)
    goto error;

This will bubble up errors from soft recoveries into the entity as well 
and makes sure that further submissions are rejected.


Regards,
Christian.



- Joshie 🐸✨





Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + AMDGPU, 
but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API contracts 
like this, it's flat-out wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused 
itself, it

    should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
 >
 > Anyway, this will happen at some point, the reality is it 
hasn't

    yet though.
 >
 >





- Joshie 🐸✨




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock > wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling applications 
that
 >> something went wrong when you just canceled their work midway 
is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking because of
    that hack,
 >> reports of different apps working (even if it's convenient 
that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, I've laid
    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU work is
    interrupted out of nowhere, all bets are off and it might as well
    trigger a full system hang next time. No hang recovery should be 
able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong 
results because we skipped some work. That we skipped work may e.g. 
result in some texture not being uploaded or some GPGPU work not 
being done and causing further errors downstream (say if a game is 
doing AI/physics on the GPU not to say anything of actual GPGPU work 
one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even knowing!


Well on the kernel side we do provide an API to query the result of a 
submission. That includes canceling submissions with a soft recovery.


What we just don't do is prevent further submissions from this
application, e.g. enforce that the application is punished for bad
behavior.


You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.


ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.


By the submission result query API, I assume you mean checking the
submission fence error somehow? That doesn't seem very ergonomic for a 
Vulkan driver compared to the simple solution which is to just mark it 
as guilty with what already exists...


- Joshie 🐸✨





Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + AMDGPU, 
but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API contracts 
like this, it's flat-out wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused itself, it
    should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
 >
 > Anyway, this will happen at some point, the reality is it hasn't
    yet though.
 >
 >





- Joshie 🐸✨


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock > wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling applications 
that
 >> something went wrong when you just canceled their work midway 
is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking because of
    that hack,
 >> reports of different apps working (even if it's convenient 
that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, I've laid
    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU work is
    interrupted out of nowhere, all bets are off and it might as well
    trigger a full system hang next time. No hang recovery should be 
able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong 
results because we skipped some work. That we skipped work may e.g. 
result in some texture not being uploaded or some GPGPU work not 
being done and causing further errors downstream (say if a game is 
doing AI/physics on the GPU not to say anything of actual GPGPU work 
one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even knowing!


Well on the kernel side we do provide an API to query the result of a 
submission. That includes canceling submissions with a soft recovery.


What we just don't do is prevent further submissions from this
application, e.g. enforce that the application is punished for bad
behavior.




Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + AMDGPU, 
but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API contracts 
like this, it's flat-out wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused itself, it
    should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
 >
 > Anyway, this will happen at some point, the reality is it hasn't
    yet though.
 >
 >





Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 18:30, Bas Nieuwenhuizen wrote:



On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock > wrote:


Re-sending as plaintext, sorry about that

On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling applications that
 >> something went wrong when you just canceled their work midway is an
 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking because of
that hack,
 >> reports of different apps working (even if it's convenient that they
 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, I've laid
out a pragmatic solution which covers both cases.
Hitting soft reset every time is the lucky path. Once GPU work is
interrupted out of nowhere, all bets are off and it might as well
trigger a full system hang next time. No hang recovery should be able to
cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong 
results because we skipped some work. That we skipped work may e.g. 
result in some texture not being uploaded or some GPGPU work not being 
done and causing further errors downstream (say if a game is doing 
AI/physics on the GPU not to say anything of actual GPGPU work one might 
be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even knowing!

Now imagine this is VulkanSC displaying something in the car dashboard, 
or some medical device doing some compute work to show something on a 
graph...


I am not saying you should be doing any of that with RADV + AMDGPU, but 
it's just food for thought... :-)


As I have been saying, you simply cannot just violate API contracts like 
this, it's flat-out wrong.


- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused itself, it
should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
itself, more likely Mesa or amdgpu.
 >
 > Anyway, this will happen at some point, the reality is it hasn't
yet though.
 >
 >



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Bas Nieuwenhuizen
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
wrote:

> Re-sending as plaintext, sorry about that
>
> On 15.01.24 18:54, Michel Dänzer wrote:
> > On 2024-01-15 18:26, Friedrich Vock wrote:
> >> [snip]
> >> The fundamental problem here is that not telling applications that
> >> something went wrong when you just canceled their work midway is an
> >> out-of-spec hack.
> >> When there is a report of real-world apps breaking because of that hack,
> >> reports of different apps working (even if it's convenient that they
> >> work) doesn't justify keeping the broken code.
> > If the breaking apps hit multiple soft resets in a row, I've laid out a
> pragmatic solution which covers both cases.
> Hitting soft reset every time is the lucky path. Once GPU work is
> interrupted out of nowhere, all bets are off and it might as well
> trigger a full system hang next time. No hang recovery should be able to
> cause that under any circumstance.
>

I think the more insidious situation is no further hangs but wrong results
because we skipped some work. That we skipped work may e.g. result in some
texture not being uploaded or some GPGPU work not being done and causing
further errors downstream (say if a game is doing AI/physics on the GPU not
to say anything of actual GPGPU work one might be doing like AI)


> >
> >
> >> If mutter needs to be robust against faults it caused itself, it should
> be robust
> >> against GPU resets.
> > It's unlikely that the hangs I've seen were caused by mutter itself,
> more likely Mesa or amdgpu.
> >
> > Anyway, this will happen at some point, the reality is it hasn't yet
> though.
> >
> >
>


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Friedrich Vock

Re-sending as plaintext, sorry about that

On 15.01.24 18:54, Michel Dänzer wrote:

On 2024-01-15 18:26, Friedrich Vock wrote:

[snip]
The fundamental problem here is that not telling applications that
something went wrong when you just canceled their work midway is an
out-of-spec hack.
When there is a report of real-world apps breaking because of that hack,
reports of different apps working (even if it's convenient that they
work) doesn't justify keeping the broken code.

If the breaking apps hit multiple soft resets in a row, I've laid out a 
pragmatic solution which covers both cases.

Hitting soft reset every time is the lucky path. Once GPU work is
interrupted out of nowhere, all bets are off and it might as well
trigger a full system hang next time. No hang recovery should be able to
cause that under any circumstance.




If mutter needs to be robust against faults it caused itself, it should be 
robust
against GPU resets.

It's unlikely that the hangs I've seen were caused by mutter itself, more 
likely Mesa or amdgpu.

Anyway, this will happen at some point, the reality is it hasn't yet though.




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Friedrich Vock

On 15.01.24 18:54, Michel Dänzer wrote:

On 2024-01-15 18:26, Friedrich Vock wrote:
[snip]

The fundamental problem here is that not telling applications that
something went wrong when you just canceled their work midway is an
out-of-spec hack.
When there is a report of real-world apps breaking because of that hack,
reports of different apps working (even if it's convenient that they
work) doesn't justify keeping the broken code.

If the breaking apps hit multiple soft resets in a row, I've laid out a 
pragmatic solution which covers both cases.

Hitting soft reset every time is the lucky path. Once GPU work is
interrupted out of nowhere, all bets are off and it might as well
trigger a full system hang next time. No hang recovery should be able to
cause that under any circumstance.



If mutter needs to be robust against faults it caused itself, it should be 
robust
against GPU resets.

It's unlikely that the hangs I've seen were caused by mutter itself, more 
likely Mesa or amdgpu.

Anyway, this will happen at some point, the reality is it hasn't yet though.




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 17:09, Michel Dänzer wrote:

On 2024-01-15 17:46, Joshua Ashton wrote:

On 1/15/24 16:34, Michel Dänzer wrote:

On 2024-01-15 17:19, Friedrich Vock wrote:

On 15.01.24 16:43, Joshua Ashton wrote:

On 1/15/24 15:25, Michel Dänzer wrote:

On 2024-01-15 14:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:


Without this feedback, the application may keep pushing through
the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making
soft recovered errors fatal as well while Michel is voting for
better ignoring them.

I'm not really sure what to do. If you guys think that soft
recovered hangs should be fatal as well then we can certainly do
this.


A possible compromise might be making soft resets fatal if they
happen repeatedly (within a certain period of time?).


No, no and no. Aside from introducing issues by side effects not
surfacing and all of the stuff I mentioned about descriptor buffers,
bda, draw indirect and stuff just resulting in more faults and hangs...

You are proposing we throw out every promise we made to an application
on the API contract level because it "might work". That's just wrong!

Let me put this in explicit terms: What you are proposing is in direct
violation of the GL and Vulkan specification.

You can't just choose to break these contracts because you think it
'might' be a better user experience.


Is the original issue that motivated soft resets to be non-fatal even an
issue anymore?

If I read that old thread correctly, the rationale for that was that
assigning guilt to a context was more broken than not doing it, because
the compositor/Xwayland process would also crash despite being unrelated
to the hang.
With Joshua's Mesa fixes, this is not the case anymore, so I don't think
keeping soft resets non-fatal provides any benefit to the user experience.
The potential detriments to user experience have been outlined multiple
times in this thread already.

(I suppose if the compositor itself faults it might still bring down a
session, but I've literally never seen that, and it's not like a
compositor triggering segfaults on CPU stays alive either.)


That's indeed what happened for me, multiple times. And each time the session 
continued running fine for days after the soft reset.

But apparently my experience isn't valid somehow, and I should have been forced 
to log in again to please the GL gods...


Conversely, I can't remember hitting a case where an app kept running into soft 
resets. It's almost as if different people may have different experiences! ;)


Your anecdote of whatever application coincidentally managing to soldier 
through being hung is really not relevant.


But yours is, got it.


Yes, as what I am stating is backed by the specification for the APIs we 
are using.
You previously said things are not black and white, but it very 
explicitly is -- we have specifications for a reason!


Your app just-so happening to survive a command buffer being ignored is 
not a useful or valid addition to this discussion at all.


- Joshie 🐸✨





It looks like Mutter already has some handling for GL robustness anyway...


That's just for suspend/resume with the nvidia driver. It can't recover from 
GPU hangs yet.






Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Michel Dänzer
On 2024-01-15 18:26, Friedrich Vock wrote:
> On 15.01.24 18:09, Michel Dänzer wrote:
>> On 2024-01-15 17:46, Joshua Ashton wrote:
>>> On 1/15/24 16:34, Michel Dänzer wrote:
 On 2024-01-15 17:19, Friedrich Vock wrote:
> On 15.01.24 16:43, Joshua Ashton wrote:
>> On 1/15/24 15:25, Michel Dänzer wrote:
>>> On 2024-01-15 14:17, Christian König wrote:
 Am 15.01.24 um 12:37 schrieb Joshua Ashton:
> On 1/15/24 09:40, Christian König wrote:
>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>>
>>> Without this feedback, the application may keep pushing through
>>> the soft
>>> recoveries, continually hanging the system with jobs that timeout.
>> Well, that is intentional behavior. Marek is voting for making
>> soft recovered errors fatal as well while Michel is voting for
>> better ignoring them.
>>
>> I'm not really sure what to do. If you guys think that soft
>> recovered hangs should be fatal as well then we can certainly do
>> this.
>>> A possible compromise might be making soft resets fatal if they
>>> happen repeatedly (within a certain period of time?).
>> No, no and no. Aside from introducing issues by side effects not
>> surfacing and all of the stuff I mentioned about descriptor buffers,
>> bda, draw indirect and stuff just resulting in more faults and hangs...
>>
>> You are proposing we throw out every promise we made to an application
>> on the API contract level because it "might work". That's just wrong!
>>
>> Let me put this in explicit terms: What you are proposing is in direct
>> violation of the GL and Vulkan specification.
>>
>> You can't just choose to break these contracts because you think it
>> 'might' be a better user experience.
> Is the original issue that motivated soft resets to be non-fatal even an
> issue anymore?
>
> If I read that old thread correctly, the rationale for that was that
> assigning guilt to a context was more broken than not doing it, because
> the compositor/Xwayland process would also crash despite being unrelated
> to the hang.
> With Joshua's Mesa fixes, this is not the case anymore, so I don't think
> keeping soft resets non-fatal provides any benefit to the user experience.
> The potential detriments to user experience have been outlined multiple
> times in this thread already.
>
> (I suppose if the compositor itself faults it might still bring down a
> session, but I've literally never seen that, and it's not like a
> compositor triggering segfaults on CPU stays alive either.)
 That's indeed what happened for me, multiple times. And each time the 
 session continued running fine for days after the soft reset.

 But apparently my experience isn't valid somehow, and I should have been 
 forced to log in again to please the GL gods...


 Conversely, I can't remember hitting a case where an app kept running into 
 soft resets. It's almost as if different people may have different 
 experiences! ;)
>>> Your anecdote of whatever application coincidentally managing to soldier 
>>> through being hung is really not relevant.
>> But yours is, got it.
> The fundamental problem here is that not telling applications that
> something went wrong when you just canceled their work midway is an
> out-of-spec hack.
> When there is a report of real-world apps breaking because of that hack,
> reports of different apps working (even if it's convenient that they
> work) doesn't justify keeping the broken code.

If the breaking apps hit multiple soft resets in a row, I've laid out a 
pragmatic solution which covers both cases.


> If mutter needs to be robust against faults it caused itself, it should be 
> robust
> against GPU resets.

It's unlikely that the hangs I've seen were caused by mutter itself, more 
likely Mesa or amdgpu.

Anyway, this will happen at some point, the reality is it hasn't yet though.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Friedrich Vock

On 15.01.24 18:09, Michel Dänzer wrote:

On 2024-01-15 17:46, Joshua Ashton wrote:

On 1/15/24 16:34, Michel Dänzer wrote:

On 2024-01-15 17:19, Friedrich Vock wrote:

On 15.01.24 16:43, Joshua Ashton wrote:

On 1/15/24 15:25, Michel Dänzer wrote:

On 2024-01-15 14:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:


Without this feedback, the application may keep pushing through
the soft
recoveries, continually hanging the system with jobs that timeout.

Well, that is intentional behavior. Marek is voting for making
soft recovered errors fatal as well while Michel is voting for
better ignoring them.

I'm not really sure what to do. If you guys think that soft
recovered hangs should be fatal as well then we can certainly do
this.

A possible compromise might be making soft resets fatal if they
happen repeatedly (within a certain period of time?).

No, no and no. Aside from introducing issues by side effects not
surfacing and all of the stuff I mentioned about descriptor buffers,
bda, draw indirect and stuff just resulting in more faults and hangs...

You are proposing we throw out every promise we made to an application
on the API contract level because it "might work". That's just wrong!

Let me put this in explicit terms: What you are proposing is in direct
violation of the GL and Vulkan specification.

You can't just choose to break these contracts because you think it
'might' be a better user experience.

Is the original issue that motivated soft resets to be non-fatal even an
issue anymore?

If I read that old thread correctly, the rationale for that was that
assigning guilt to a context was more broken than not doing it, because
the compositor/Xwayland process would also crash despite being unrelated
to the hang.
With Joshua's Mesa fixes, this is not the case anymore, so I don't think
keeping soft resets non-fatal provides any benefit to the user experience.
The potential detriments to user experience have been outlined multiple
times in this thread already.

(I suppose if the compositor itself faults it might still bring down a
session, but I've literally never seen that, and it's not like a
compositor triggering segfaults on CPU stays alive either.)

That's indeed what happened for me, multiple times. And each time the session 
continued running fine for days after the soft reset.

But apparently my experience isn't valid somehow, and I should have been forced 
to log in again to please the GL gods...


Conversely, I can't remember hitting a case where an app kept running into soft 
resets. It's almost as if different people may have different experiences! ;)

Your anecdote of whatever application coincidentally managing to soldier 
through being hung is really not relevant.

But yours is, got it.

The fundamental problem here is that not telling applications that
something went wrong when you just canceled their work midway is an
out-of-spec hack.
When there is a report of real-world apps breaking because of that hack,
reports of different apps working (even if it's convenient that they
work) doesn't justify keeping the broken code.

I don't think it makes sense to argue against fixing broken code just
because it doesn't affect all apps (and is convenient for some of them).

This isn't anything personal. I don't think your experience is invalid
(I think I just misunderstood the original thread. Sorry if I came
across as condescending!), all I'm arguing is that this should be fixed
somewhere else than the kernel, because what the kernel does right now
makes it impossible to implement modern APIs fully correctly. If mutter
needs to be robust against faults it caused itself, it should be robust
against GPU resets.





It looks like Mutter already has some handling for GL robustness anyway...

That's just for suspend/resume with the nvidia driver. It can't recover from 
GPU hangs yet.




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Michel Dänzer
On 2024-01-15 17:46, Joshua Ashton wrote:
> On 1/15/24 16:34, Michel Dänzer wrote:
>> On 2024-01-15 17:19, Friedrich Vock wrote:
>>> On 15.01.24 16:43, Joshua Ashton wrote:
 On 1/15/24 15:25, Michel Dänzer wrote:
> On 2024-01-15 14:17, Christian König wrote:
>> Am 15.01.24 um 12:37 schrieb Joshua Ashton:
>>> On 1/15/24 09:40, Christian König wrote:
 Am 13.01.24 um 15:02 schrieb Joshua Ashton:

> Without this feedback, the application may keep pushing through
> the soft
> recoveries, continually hanging the system with jobs that timeout.

 Well, that is intentional behavior. Marek is voting for making
 soft recovered errors fatal as well while Michel is voting for
 better ignoring them.

 I'm not really sure what to do. If you guys think that soft
 recovered hangs should be fatal as well then we can certainly do
 this.
>
> A possible compromise might be making soft resets fatal if they
> happen repeatedly (within a certain period of time?).

 No, no and no. Aside from introducing issues by side effects not
 surfacing and all of the stuff I mentioned about descriptor buffers,
 bda, draw indirect and stuff just resulting in more faults and hangs...

 You are proposing we throw out every promise we made to an application
 on the API contract level because it "might work". That's just wrong!

 Let me put this in explicit terms: What you are proposing is in direct
 violation of the GL and Vulkan specification.

 You can't just choose to break these contracts because you think it
 'might' be a better user experience.
>>>
>>> Is the original issue that motivated soft resets to be non-fatal even an
>>> issue anymore?
>>>
>>> If I read that old thread correctly, the rationale for that was that
>>> assigning guilt to a context was more broken than not doing it, because
>>> the compositor/Xwayland process would also crash despite being unrelated
>>> to the hang.
>>> With Joshua's Mesa fixes, this is not the case anymore, so I don't think
>>> keeping soft resets non-fatal provides any benefit to the user experience.
>>> The potential detriments to user experience have been outlined multiple
>>> times in this thread already.
>>>
>>> (I suppose if the compositor itself faults it might still bring down a
>>> session, but I've literally never seen that, and it's not like a
>>> compositor triggering segfaults on CPU stays alive either.)
>>
>> That's indeed what happened for me, multiple times. And each time the 
>> session continued running fine for days after the soft reset.
>>
>> But apparently my experience isn't valid somehow, and I should have been 
>> forced to log in again to please the GL gods...
>>
>>
>> Conversely, I can't remember hitting a case where an app kept running into 
>> soft resets. It's almost as if different people may have different 
>> experiences! ;)
> 
> Your anecdote of whatever application coincidentally managing to soldier 
> through being hung is really not relevant.

But yours is, got it.


> It looks like Mutter already has some handling for GL robustness anyway...

That's just for suspend/resume with the nvidia driver. It can't recover from 
GPU hangs yet.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 16:34, Michel Dänzer wrote:

On 2024-01-15 17:19, Friedrich Vock wrote:

On 15.01.24 16:43, Joshua Ashton wrote:

On 1/15/24 15:25, Michel Dänzer wrote:

On 2024-01-15 14:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:


Without this feedback, the application may keep pushing through
the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making
soft recovered errors fatal as well while Michel is voting for
better ignoring them.

I'm not really sure what to do. If you guys think that soft
recovered hangs should be fatal as well then we can certainly do
this.


A possible compromise might be making soft resets fatal if they
happen repeatedly (within a certain period of time?).


No, no and no. Aside from introducing issues by side effects not
surfacing and all of the stuff I mentioned about descriptor buffers,
bda, draw indirect and stuff just resulting in more faults and hangs...

You are proposing we throw out every promise we made to an application
on the API contract level because it "might work". That's just wrong!

Let me put this in explicit terms: What you are proposing is in direct
violation of the GL and Vulkan specification.

You can't just choose to break these contracts because you think it
'might' be a better user experience.


Is the original issue that motivated soft resets to be non-fatal even an
issue anymore?

If I read that old thread correctly, the rationale for that was that
assigning guilt to a context was more broken than not doing it, because
the compositor/Xwayland process would also crash despite being unrelated
to the hang.
With Joshua's Mesa fixes, this is not the case anymore, so I don't think
keeping soft resets non-fatal provides any benefit to the user experience.
The potential detriments to user experience have been outlined multiple
times in this thread already.

(I suppose if the compositor itself faults it might still bring down a
session, but I've literally never seen that, and it's not like a
compositor triggering segfaults on CPU stays alive either.)


That's indeed what happened for me, multiple times. And each time the session 
continued running fine for days after the soft reset.

But apparently my experience isn't valid somehow, and I should have been forced 
to log in again to please the GL gods...


Conversely, I can't remember hitting a case where an app kept running into soft 
resets. It's almost as if different people may have different experiences! ;)


Your anecdote of whatever application coincidentally managing to soldier 
through being hung is really not relevant.


The right thing to do is to not break all of our api contracts with the 
application and not egregiously violate the VK and GL specs. :-)


No-one is saying your session should go down.
It looks like Mutter already has some handling for GL robustness anyway...
So this should already just work with these patches if it is not broken?

- Joshie 🐸✨



Note that I'm not saying that case can't happen. Making soft resets fatal only 
if they happen repeatedly could address both issues, rather than only one or 
the other. Seems like a win-win.




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Michel Dänzer
On 2024-01-15 17:19, Friedrich Vock wrote:
> On 15.01.24 16:43, Joshua Ashton wrote:
>> On 1/15/24 15:25, Michel Dänzer wrote:
>>> On 2024-01-15 14:17, Christian König wrote:
 Am 15.01.24 um 12:37 schrieb Joshua Ashton:
> On 1/15/24 09:40, Christian König wrote:
>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>>
>>> Without this feedback, the application may keep pushing through
>>> the soft
>>> recoveries, continually hanging the system with jobs that timeout.
>>
>> Well, that is intentional behavior. Marek is voting for making
>> soft recovered errors fatal as well while Michel is voting for
>> better ignoring them.
>>
>> I'm not really sure what to do. If you guys think that soft
>> recovered hangs should be fatal as well then we can certainly do
>> this.
>>>
>>> A possible compromise might be making soft resets fatal if they
>>> happen repeatedly (within a certain period of time?).
>>
>> No, no and no. Aside from introducing issues by side effects not
>> surfacing and all of the stuff I mentioned about descriptor buffers,
>> bda, draw indirect and stuff just resulting in more faults and hangs...
>>
>> You are proposing we throw out every promise we made to an application
>> on the API contract level because it "might work". That's just wrong!
>>
>> Let me put this in explicit terms: What you are proposing is in direct
>> violation of the GL and Vulkan specification.
>>
>> You can't just choose to break these contracts because you think it
>> 'might' be a better user experience.
> 
> Is the original issue that motivated soft resets to be non-fatal even an
> issue anymore?
> 
> If I read that old thread correctly, the rationale for that was that
> assigning guilt to a context was more broken than not doing it, because
> the compositor/Xwayland process would also crash despite being unrelated
> to the hang.
> With Joshua's Mesa fixes, this is not the case anymore, so I don't think
> keeping soft resets non-fatal provides any benefit to the user experience.
> The potential detriments to user experience have been outlined multiple
> times in this thread already.
> 
> (I suppose if the compositor itself faults it might still bring down a
> session, but I've literally never seen that, and it's not like a
> compositor triggering segfaults on CPU stays alive either.)

That's indeed what happened for me, multiple times. And each time the session 
continued running fine for days after the soft reset.

But apparently my experience isn't valid somehow, and I should have been forced 
to log in again to please the GL gods...


Conversely, I can't remember hitting a case where an app kept running into soft 
resets. It's almost as if different people may have different experiences! ;)

Note that I'm not saying that case can't happen. Making soft resets fatal only 
if they happen repeatedly could address both issues, rather than only one or 
the other. Seems like a win-win.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Friedrich Vock



On 15.01.24 16:43, Joshua Ashton wrote:



On 1/15/24 15:25, Michel Dänzer wrote:

On 2024-01-15 14:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:


Without this feedback, the application may keep pushing through
the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making
soft recovered errors fatal as well while Michel is voting for
better ignoring them.

I'm not really sure what to do. If you guys think that soft
recovered hangs should be fatal as well then we can certainly do
this.


A possible compromise might be making soft resets fatal if they
happen repeatedly (within a certain period of time?).



No, no and no. Aside from introducing issues by side effects not
surfacing and all of the stuff I mentioned about descriptor buffers,
bda, draw indirect and stuff just resulting in more faults and hangs...

You are proposing we throw out every promise we made to an application
on the API contract level because it "might work". That's just wrong!

Let me put this in explicit terms: What you are proposing is in direct
violation of the GL and Vulkan specification.

You can't just choose to break these contracts because you think it
'might' be a better user experience.


Is the original issue that motivated soft resets to be non-fatal even an
issue anymore?

If I read that old thread correctly, the rationale for that was that
assigning guilt to a context was more broken than not doing it, because
the compositor/Xwayland process would also crash despite being unrelated
to the hang.
With Joshua's Mesa fixes, this is not the case anymore, so I don't think
keeping soft resets non-fatal provides any benefit to the user experience.
The potential detriments to user experience have been outlined multiple
times in this thread already.

(I suppose if the compositor itself faults it might still bring down a
session, but I've literally never seen that, and it's not like a
compositor triggering segfaults on CPU stays alive either.)






They have to be!

As Marek and I have pointed out, applications that hang or fault
will just hang or fault again, especially when they use things like
draw indirect, buffer device address, descriptor buffers, etc.


Ok, well then I now have two people (Marek and you) saying that soft
recovery should be fatal while Michel is saying that soft recovery
being non fatal improves stability for him :)


That's not quite what I wrote before.

I pointed out that my GNOME session has survived a soft reset without
issues[0] on multiple occasions, whereas Marek's proposal at the time
would have kicked me back to the login screen every time. > 0 vs
effectively 0 chance of survival.


The correct thing for GNOME/Mutter to do is to simply re-create its
context, reimport its DMABUFs, etc.

The fact that it survives and keeps soldiering on with whatever side
effects it missed is purely coincidental and not valid API usage.

If you want such behaviour for hangs for Mutter, you should propose a
GL/VK extension for it, but I really doubt that will get anywhere.

- Joshie 🐸✨



[0] Except for Firefox unnecessarily falling back to software
rendering, which is a side note, not the main point.






Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 15:25, Michel Dänzer wrote:

On 2024-01-15 14:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:


Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making soft recovered 
errors fatal as well while Michel is voting for better ignoring them.

I'm not really sure what to do. If you guys think that soft recovered hangs 
should be fatal as well then we can certainly do this.


A possible compromise might be making soft resets fatal if they happen 
repeatedly (within a certain period of time?).



No, no and no. Aside from introducing issues by side effects not 
surfacing and all of the stuff I mentioned about descriptor buffers, 
bda, draw indirect and stuff just resulting in more faults and hangs...


You are proposing we throw out every promise we made to an application 
on the API contract level because it "might work". That's just wrong!


Let me put this in explicit terms: What you are proposing is in direct 
violation of the GL and Vulkan specification.


You can't just choose to break these contracts because you think it
'might' be a better user experience.





They have to be!

As Marek and I have pointed out, applications that hang or fault will just hang 
or fault again, especially when they use things like draw indirect, buffer 
device address, descriptor buffers, etc.


Ok, well then I now have two people (Marek and you) saying that soft recovery 
should be fatal while Michel is saying that soft recovery being non fatal 
improves stability for him :)


That's not quite what I wrote before.

I pointed out that my GNOME session has survived a soft reset without issues[0] on 
multiple occasions, whereas Marek's proposal at the time would have kicked me back 
to the login screen every time. > 0 vs effectively 0 chance of survival.


The correct thing for GNOME/Mutter to do is to simply re-create its
context, reimport its DMABUFs, etc.
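
Roughly what that looks like with ARB_robustness (a sketch only,
assuming a context created with the LOSE_CONTEXT_ON_RESET notification
strategy; recreate_context_and_resources() is a hypothetical helper):

    #include <epoxy/gl.h>
    #include <stdbool.h>

    void recreate_context_and_resources(void);   /* hypothetical helper */

    /* Call once per frame, or after a failed draw/swap. */
    static bool handle_gpu_reset(void)
    {
        GLenum status = glGetGraphicsResetStatusARB();

        if (status == GL_NO_ERROR)
            return false;   /* no reset observed, keep going */

        /* GUILTY/INNOCENT/UNKNOWN_CONTEXT_RESET: the context is lost either
         * way. Throw it away, create a new one, re-import DMA-BUFs,
         * re-upload textures, and redraw from scratch. */
        recreate_context_and_resources();
        return true;
    }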


The fact that it survives and keeps soldiering on with whatever side
effects it missed is purely coincidental and not valid API usage.


If you want such behaviour for hangs for Mutter, you should propose a 
GL/VK extension for it, but I really doubt that will get anywhere.


- Joshie 🐸✨



[0] Except for Firefox unnecessarily falling back to software rendering, which 
is a side note, not the main point.






Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Michel Dänzer
On 2024-01-15 14:17, Christian König wrote:
> Am 15.01.24 um 12:37 schrieb Joshua Ashton:
>> On 1/15/24 09:40, Christian König wrote:
>>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>>>
 Without this feedback, the application may keep pushing through the soft
 recoveries, continually hanging the system with jobs that timeout.
>>>
>>> Well, that is intentional behavior. Marek is voting for making soft 
>>> recovered errors fatal as well while Michel is voting for better ignoring 
>>> them.
>>>
>>> I'm not really sure what to do. If you guys think that soft recovered hangs 
>>> should be fatal as well then we can certainly do this.

A possible compromise might be making soft resets fatal if they happen 
repeatedly (within a certain period of time?).


>> They have to be!
>>
>> As Marek and I have pointed out, applications that hang or fault will just 
>> hang or fault again, especially when they use things like draw indirect, 
>> buffer device address, descriptor buffers, etc.
> 
> Ok, well then I now have two people (Marek and you) saying that soft recovery 
> should be fatal while Michel is saying that soft recovery being non fatal 
> improves stability for him :)

That's not quite what I wrote before.

I pointed out that my GNOME session has survived a soft reset without issues[0] 
on multiple occasions, whereas Marek's proposal at the time would have kicked 
me back to the login screen every time. > 0 vs effectively 0 chance of survival.

[0] Except for Firefox unnecessarily falling back to software rendering, which 
is a side note, not the main point.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 13:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.


Big NAK to that approach, the karma handling is completely deprecated.

When you want to signal execution errors please use the fence error 
code.


The fence error code does not result in ctxs being marked as guilty;
only the karma path does.


See drm_sched_increase_karma.

Are you proposing that we instead mark contexts as guilty with the 
fence error ourselves here?


No, I'm proposing to completely abandon the concept of guilty contexts. 
Basically what we should do is to return an error from the CS IOCTL 
whenever a previous submission resulted in a fatal error as suggested by 
Marek.


Oh, I agree that is broken by design, but this is already implemented 
with the current guilt system!


The ioctls will already return -ECANCELED if you are guilty of a hang:

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c#L64

The query merely exists to give more feedback as to the situation, which 
is fine.
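
For reference, the rejection there is essentially (paraphrased, not
copied verbatim from amdgpu_cs.c):

    /* In the CS ioctl: a context previously marked guilty by a reset has
     * its submission rejected up front. */
    if (atomic_read(&p->ctx->guilty))
        return -ECANCELED;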




That we query the context for guilty was just a design decision we
copied over from our closed source drivers, which turned out to
absolutely not solve anything.


Marek can probably comment as well why the whole idea of querying the 
kernel if something fatal happens instead of just rejecting submissions 
is broken by design.




Without this feedback, the application may keep pushing through the 
soft

recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making soft 
recovered errors fatal as well while Michel is voting for better 
ignoring them.


I'm not really sure what to do. If you guys think that soft recovered 
hangs should be fatal as well then we can certainly do this.


They have to be!

As Marek and I have pointed out, applications that hang or fault will 
just hang or fault again, especially when they use things like draw 
indirect, buffer device address, descriptor buffers, etc.


Ok, well then I now have two people (Marek and you) saying that soft 
recovery should be fatal while Michel is saying that soft recovery being 
non fatal improves stability for him :)


Should we somehow make that configurable, or make it depend on whether
it's the display server or a user application?


I could probably get every RADV developer, and all of the Proton 
graphics team to come in and preach the same thing also. :P


If a compositor/display server is guilty of a hang, it can just listen
for DEVICE_LOST from vkQueueSubmit and re-create its context, re-import
the dmabufs, etc. (or restart itself).
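
Something along these lines, assuming the usual device/queue objects are
in scope (a sketch with hypothetical helpers, not code from any real
compositor):

    VkResult res = vkQueueSubmit(queue, 1, &submit_info, frame_fence);
    if (res == VK_ERROR_DEVICE_LOST) {
        /* Our own submission hung: rebuild instead of exiting. */
        vkDeviceWaitIdle(device);                  /* returns in finite time even when lost */
        destroy_frame_resources();                 /* hypothetical */
        vkDestroyDevice(device, NULL);
        device = create_logical_device(phys_dev);  /* hypothetical */
        reimport_client_dmabufs(device);           /* hypothetical: client wl_buffer dmabufs */
        schedule_full_redraw();                    /* hypothetical */
    }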


FWIU, in the email chain, the thing Dänzer was saying was that Firefox
was falling back to software rendering even when it was innocent, but 
that seems incredibly broken by design.


- Joshie 🐸✨



Regards,
Christian.



The majority of apps these days have a lot of side effects and 
persistence between frames and submissions.


- Joshie 🐸✨



Regards,
Christian.



There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an 
application

which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
amdgpu_ring *ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
  atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)









Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 13:19, Christian König wrote:

Am 15.01.24 um 12:54 schrieb Joshua Ashton:

[SNIP]


The question here is really if we should handle soft recovered 
errors as fatal or not. Marek is in favor of that, Michel is against it.


Figure out what you want in userspace and I'm happy to implement it :)



(That being said, without my patches, RADV treats *any* reset from 
the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost 
and it was not guilty, so any faulting/hanging application causes 
every Vulkan app to die e_e. This is fixed in 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )


That is actually intended behavior. When something disrupts GPU 
execution and the application is affected, it is mandatory to forward 
this error to the application.


No. If said context was entirely unaffected, then it should be 
completely transparent to the application.


Consider the following:

 - I have Counter-Strike 2 running
 - I have Gamescope running

I then go ahead and start HangApp that hangs the GPU.

Soft recovery happens and that clears out all the work for the 
specific VMID for HangApp's submissions and signals the submission fence.


In this instance, the Gamescope and Counter-Strike 2 ctxs are 
completely unaffected and don't need to report VK_ERROR_DEVICE_LOST as 
there was no impact to their work.


Ok, that is something I totally agree on. But why would the Gamescope 
and Counter-Strike 2 app report VK_ERROR_DEVICE_LOST for a soft recovery 
in the first place?


IIRC a soft recovery doesn't increment the reset counter in any way. So 
they should never be affected.


It does, and without assigning any guilt, amdgpu_ring_soft_recovery 
calls atomic_inc(&ring->adev->gpu_reset_counter).
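
That counter is also what the existing query path compares against. As a
rough sketch of how userspace consumes it via libdrm_amdgpu (assuming the
device and context were created elsewhere; this mirrors what RADV does,
not a verbatim copy of it):

#include <amdgpu.h>
#include <amdgpu_drm.h>

/* Returns non-zero if this context should treat the last reset as fatal. */
static int context_lost(amdgpu_context_handle ctx)
{
	uint64_t flags = 0;

	/* The kernel side compares the context's saved reset counter against
	 * adev->gpu_reset_counter and reports guilt/VRAM-loss flags. */
	if (amdgpu_cs_query_reset_state2(ctx, &flags))
		return 0;

	if (!(flags & AMDGPU_CTX_QUERY2_FLAGS_RESET))
		return 0; /* no reset observed for this context */

	/* Only report device loss if we were guilty or VRAM was lost. */
	return (flags & AMDGPU_CTX_QUERY2_FLAGS_GUILTY) ||
	       (flags & AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST);
}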





Regards,
Christian.



Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem 
with HangApp, FWIU the clear-out being VMID-specific means that they 
would be unaffected, right?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨




Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
amdgpu_ring *ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)










- Joshie 🐸✨


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 15.01.24 um 12:54 schrieb Joshua Ashton:

[SNIP]


The question here is really if we should handle soft recovered 
errors as fatal or not. Marek is in favor of that, Michel is against it.


Figure out what you want in userspace and I'm happy to implement it :)



(That being said, without my patches, RADV treats *any* reset from 
the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost 
and it was not guilty, so any faulting/hanging application causes 
every Vulkan app to die e_e. This is fixed in 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )


That is actually intended behavior. When something disrupts GPU 
execution and the application is affected, it is mandatory to forward 
this error to the application.


No. If said context was entirely unaffected, then it should be 
completely transparent to the application.


Consider the following:

 - I have Counter-Strike 2 running
 - I have Gamescope running

I then go ahead and start HangApp that hangs the GPU.

Soft recovery happens and that clears out all the work for the 
specific VMID for HangApp's submissions and signals the submission fence.


In this instance, the Gamescope and Counter-Strike 2 ctxs are 
completely unaffected and don't need to report VK_ERROR_DEVICE_LOST as 
there was no impact to their work.


Ok, that is something I totally agree on. But why would the Gamescope 
and Counter-Strike 2 app report VK_ERROR_DEVICE_LOST for a soft recovery 
in the first place?


IIRC a soft recovery doesn't increment the reset counter in any way. So 
they should never be affected.


Regards,
Christian.



Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem 
with HangApp, FWIU the clear-out being VMID-specific means that they 
would be unaffected, right?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨




Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
amdgpu_ring *ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)










Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.


Big NAK to that approach, the karma handling is completely deprecated.

When you want to signal execution errors please use the fence error 
code.


The fence error code does not result in ctxs being marked as guilty;
only the karma path does.


See drm_sched_increase_karma.

Are you proposing that we instead mark contexts as guilty with the 
fence error ourselves here?


No, I'm proposing to completely abandon the concept of guilty contexts. 
Basically what we should do is to return an error from the CS IOCTL 
whenever a previous submission resulted in a fatal error as suggested by 
Marek.


That we query the context for guilt was just a design decision we 
copied over from our closed source drivers, which turned out to 
absolutely not solve anything.


Marek can probably comment as well why the whole idea of querying the 
kernel if something fatal happens instead of just rejecting submissions 
is broken by design.




Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making soft 
recovered errors fatal as well, while Michel is voting for ignoring 
them instead.


I'm not really sure what to do. If you guys think that soft recovered 
hangs should be fatal as well then we can certainly do this.


They have to be!

As Marek and I have pointed out, applications that hang or fault will 
just hang or fault again, especially when they use things like draw 
indirect, buffer device address, descriptor buffers, etc.


Ok, well then I now have two people (Marek and you) saying that soft 
recovery should be fatal, while Michel is saying that soft recovery being 
non-fatal improves stability for him :)


Should we somehow make that configurable, or make it depend on whether 
it's the display server or a user application?


Regards,
Christian.



The majority of apps these days have a lot of side effects and 
persistence between frames and submissions.


- Joshie 🐸✨



Regards,
Christian.



There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
amdgpu_ring *ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
  atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)









Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 09:47, Christian König wrote:

Am 13.01.24 um 23:55 schrieb Joshua Ashton:

+Marek

On 1/13/24 21:35, André Almeida wrote:

Hi Joshua,

Em 13/01/2024 11:02, Joshua Ashton escreveu:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.

Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.

There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.



I sent a similar patch in the past, maybe you find the discussion 
interesting:


https://lore.kernel.org/lkml/20230424014324.218531-1-andrealm...@igalia.com/


Thanks, I had a peruse through that old thread.

Marek definitely had the right idea here, given he mentions:
"That supposedly depends on the compositor. There may be compositors for
very specific cases (e.g. Steam Deck)"

Given that is what I work on, and I also wrote this patch, which does 
basically the same thing as was proposed. :-)


For context though, I am less interested in Gamescope (the Steam Deck 
compositor) hanging (we don't have code that hangs; if we go down, it's 
likely Steam/CEF died with us anyway atm, can solve that battle some 
other day) and more interested in the applications run under it.


Marek is very right when he says applications that fault/hang will 
submit one IB after another that also fault/hang -- especially if they 
write to descriptors from the GPU (descriptor buffers), or use draw 
indirect or anything bindless or...
That's basically functionally equivalent to DOSing a system if it is 
not prevented.


And that's exactly what I see even in a simple test app doing a fault 
-> hang every frame.


Right now, given that soft recovery never marks a context as guilty, 
it means that every app I tested is never stopped from submitting 
garbage. That holds true for basically any app that GPU hangs, and it 
makes soft recovery totally useless in my testing without this.


Yeah, the problem is that your patch wouldn't help with that. A testing 
app can still re-create the context for each submission and so crash the 
system over and over again.


It is still definitely possible for an application to re-create its 
context and hang yet again -- however, that is not the problem I am 
trying to solve here.


The problem I am trying to solve is that applications do not even get 
marked guilty when triggering soft recovery right now.


The patch does help with that on SteamOS, as the type of applications we 
deal with that hang just abort on VK_ERROR_DEVICE_LOST.


If a UI toolkit that handles DEVICE_LOST keeps doing this, then yes it 
would not help, but this patch is also a necessary step towards fixing 
that someday. (Eg. some policy where processes are killed for hanging 
too many times, etc)




The question here is really if we should handle soft recovered errors 
as fatal or not. Marek is in favor of that, Michel is against it.


Figure out what you want in userspace and I'm happy to implement it :)



(That being said, without my patches, RADV treats *any* reset from the 
query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it 
was not guilty, so any faulting/hanging application causes every 
Vulkan app to die e_e. This is fixed in 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )


That is actually intended behavior. When something disrupts GPU 
execution and the application is affected, it is mandatory to forward 
this error to the application.


No. If said context was entirely unaffected, then it should be 
completely transparent to the application.


Consider the following:

 - I have Counter-Strike 2 running
 - I have Gamescope running

I then go ahead and start HangApp that hangs the GPU.

Soft recovery happens and that clears out all the work for the specific 
VMID for HangApp's submissions and signals the submission fence.


In this instance, the Gamescope and Counter-Strike 2 ctxs are completely 
unaffected and don't need to report VK_ERROR_DEVICE_LOST as there was no 
impact to their work.


Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem with 
HangApp, FWIU the clear-out being VMID-specific means that they would 
be unaffected, right?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨




Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.


Big NAK to that approach, the karma handling is completely deprecated.

When you want to signal execution errors please use the fence error code.


The fence error code does not result in ctxs being marked as guilty;
only the karma path does.


See drm_sched_increase_karma.

Are you proposing that we instead mark contexts as guilty with the fence 
error ourselves here?





Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making soft 
recovered errors fatal as well, while Michel is voting for ignoring 
them instead.


I'm not really sure what to do. If you guys think that soft recovered 
hangs should be fatal as well then we can certainly do this.


They have to be!

As Marek and I have pointed out, applications that hang or fault will 
just hang or fault again, especially when they use things like draw 
indirect, buffer device address, descriptor buffers, etc.


The majority of apps these days have a lot of side effects and 
persistence between frames and submissions.


- Joshie 🐸✨



Regards,
Christian.



There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring 
*ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
  atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)







Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 13.01.24 um 23:55 schrieb Joshua Ashton:

+Marek

On 1/13/24 21:35, André Almeida wrote:

Hi Joshua,

Em 13/01/2024 11:02, Joshua Ashton escreveu:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.

Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.

There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.



I sent a similar patch in the past, maybe you find the discussion 
interesting:


https://lore.kernel.org/lkml/20230424014324.218531-1-andrealm...@igalia.com/ 



Thanks, I had a peruse through that old thread.

Marek definitely had the right idea here, given he mentions:
"That supposedly depends on the compositor. There may be compositors for
very specific cases (e.g. Steam Deck)"

Given that is what I work on, and I also wrote this patch, which does 
basically the same thing as was proposed. :-)


For context though, I am less interested in Gamescope (the Steam Deck 
compositor) hanging (we don't have code that hangs; if we go down, it's 
likely Steam/CEF died with us anyway atm, can solve that battle some 
other day) and more interested in the applications run under it.


Marek is very right when he says applications that fault/hang will 
submit one IB after another that also fault/hang -- especially if they 
write to descriptors from the GPU (descriptor buffers), or use draw 
indirect or anything bindless or...
That's basically functionally equivalent to DOSing a system if it is 
not prevented.


And that's exactly what I see even in a simple test app doing a fault 
-> hang every frame.


Right now, given that soft recovery never marks a context as guilty, 
it means that every app I tested is never stopped from submitting 
garbage. That holds true for basically any app that GPU hangs, and it 
makes soft recovery totally useless in my testing without this.


Yeah, the problem is that your patch wouldn't help with that. A testing 
app can still re-create the context for each submission and so crash the 
system over and over again.


The question here is really if we should handle soft recovered errors 
as fatal or not. Marek is in favor of that, Michel is against it.


Figure out what you want in userspace and I'm happy to implement it :)



(That being said, without my patches, RADV treats *any* reset from the 
query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it 
was not guilty, so any faulting/hanging application causes every 
Vulkan app to die e_e. This is fixed in 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )


That is actually intended behavior. When something disrupts GPU 
execution and the application is affected, it is mandatory to forward 
this error to the application.


Regards,
Christian.



- Joshie 🐸✨




Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
amdgpu_ring *ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
  atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)






Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 13.01.24 um 15:02 schrieb Joshua Ashton:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.


Big NAK to that approach, the karma handling is completely deprecated.

When you want to signal execution errors please use the fence error code.


Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making soft 
recovered errors fatal as well, while Michel is voting for ignoring 
them instead.


I'm not really sure what to do. If you guys think that soft recovered 
hangs should be fatal as well then we can certainly do this.


Regards,
Christian.



There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, 
struct amdgpu_job *job)
dma_fence_set_error(fence, -ENODATA);
spin_unlock_irqrestore(fence->lock, flags);
  
+	if (job->vm)

+   drm_sched_increase_karma(&job->base);
atomic_inc(&ring->adev->gpu_reset_counter);
while (!dma_fence_is_signaled(fence) &&
   ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-13 Thread Joshua Ashton

+Marek

On 1/13/24 21:35, André Almeida wrote:

Hi Joshua,

Em 13/01/2024 11:02, Joshua Ashton escreveu:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.

Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.

There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.



I sent a similar patch in the past, maybe you find the discussion 
interesting:


https://lore.kernel.org/lkml/20230424014324.218531-1-andrealm...@igalia.com/


Thanks, I had a peruse through that old thread.

Marek definitely had the right idea here, given he mentions:
"That supposedly depends on the compositor. There may be compositors for
very specific cases (e.g. Steam Deck)"

Given that is what I work on, and I also wrote this patch, which does 
basically the same thing as was proposed. :-)


For context though, I am less interested in Gamescope (the Steam Deck 
compositor) hanging (we don't have code that hangs; if we go down, it's 
likely Steam/CEF died with us anyway atm, can solve that battle some 
other day) and more interested in the applications run under it.


Marek is very right when he says applications that fault/hang will 
submit one IB after another that also fault/hang -- especially if they 
write to descriptors from the GPU (descriptor buffers), or use draw 
indirect or anything bindless or...
That's basically functionally equivalent to DOSing a system if it is not 
prevented.


And that's exactly what I see even in a simple test app doing a fault -> 
hang every frame.


Right now, given that soft recovery never marks a context as guilty, 
it means that every app I tested is never stopped from submitting 
garbage. That holds true for basically any app that GPU hangs, and it 
makes soft recovery totally useless in my testing without this.


(That being said, without my patches, RADV treats *any* reset from the 
query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it was 
not guilty, so any faulting/hanging application causes every Vulkan app 
to die e_e. This is fixed in 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )


- Joshie 🐸✨




Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring 
*ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
  atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-13 Thread André Almeida

Hi Joshua,

Em 13/01/2024 11:02, Joshua Ashton escreveu:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.

Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.

There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.



I sent a similar patch in the past, maybe you find the discussion 
interesting:


https://lore.kernel.org/lkml/20230424014324.218531-1-andrealm...@igalia.com/


Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, 
struct amdgpu_job *job)
dma_fence_set_error(fence, -ENODATA);
spin_unlock_irqrestore(fence->lock, flags);
  
+	if (job->vm)

+   drm_sched_increase_karma(&job->base);
atomic_inc(&ring->adev->gpu_reset_counter);
while (!dma_fence_is_signaled(fence) &&
   ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-13 Thread Friedrich Vock

On 13.01.24 15:02, Joshua Ashton wrote:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.

Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.

There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton 

Tested-by: Friedrich Vock 


Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, 
struct amdgpu_job *job)
dma_fence_set_error(fence, -ENODATA);
spin_unlock_irqrestore(fence->lock, flags);

+   if (job->vm)
+   drm_sched_increase_karma(&job->base);
atomic_inc(&ring->adev->gpu_reset_counter);
while (!dma_fence_is_signaled(fence) &&
   ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)


[PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-13 Thread Joshua Ashton
We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.

Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.

There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
-- 
2.43.0