On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
<[email protected]> wrote:
On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
<[email protected]> wrote:
From: Mario Limonciello <[email protected]>
The PM core should be notified that thaw was skipped for the device,
so that if a resume is attempted later (such as after an aborted
hibernate), the device gets another chance to resume.
Cc: Muhammad Usama Anjum <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 61268aa82df4d..d40af069f24dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device *dev)
 	/* do not resume device if it's normal hibernation */
 	if (!pm_hibernate_is_recovering() &&
 	    !pm_hibernation_mode_is_suspend())
-		return 0;
+		return -EBUSY;
So that's why you need the special handling of -EBUSY in the previous patch.
Yup.
I think that you need to save some state in this driver and then use
it in subsequent callbacks instead of hacking the core to do what you
want.
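A rough userspace sketch of the "save some state in the driver" idea follows. Everything in it is hypothetical: struct fake_gpu, fake_freeze(), fake_thaw() and fake_late_callback() are stand-ins invented for illustration, not amdgpu or PM core code. It only shows a flag recorded in thaw and consumed by a later callback, and it assumes some later callback is actually invoked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for driver private data (not struct amdgpu_device). */
struct fake_gpu {
	bool hw_running;	/* true while the hardware is up */
	bool thaw_skipped;	/* set when thaw chose not to resume the hardware */
};

/* Quiesce the hardware and clear any stale bookkeeping. */
static int fake_freeze(struct fake_gpu *gpu)
{
	assert(gpu != NULL);
	gpu->hw_running = false;
	gpu->thaw_skipped = false;
	return 0;
}

/*
 * Skip the resume for a normal hibernation, but remember that it was
 * skipped. "recovering" stands in for the checks done in the real thaw
 * callback.
 */
static int fake_thaw(struct fake_gpu *gpu, bool recovering)
{
	assert(gpu != NULL);
	if (!recovering) {
		gpu->thaw_skipped = true;
		return 0;
	}
	gpu->hw_running = true;
	return 0;
}

/* A later callback (which one is an open question) finishes the job. */
static void fake_late_callback(struct fake_gpu *gpu)
{
	assert(gpu != NULL);
	if (gpu->thaw_skipped) {
		gpu->hw_running = true;
		gpu->thaw_skipped = false;
	}
}
```

The flag keeps the decision local to the driver, at the cost of depending on a later callback running at the right time.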
The problem is the core decides "what" to call and more importantly
"when" to call it.
I.e., if the core thinks that something is thawed, it will never call
resume. That's why you end up in a bad place with Muhammad's
cancellation series, and why I proposed this one to discuss.
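To make that concrete, here is a toy userspace model of the bookkeeping. It is a sketch under stated assumptions, not the PM core: struct toy_dev, toy_core_thaw() and toy_core_abort_recovery() are made-up names, and the rule "a thaw return of -EBUSY leaves the device eligible for a later resume" is my reading of the intent of the patch, not a description of existing core behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device bookkeeping as seen by a toy "core". */
struct toy_dev {
	int thaw_ret;		/* value the driver's thaw callback returns */
	bool needs_resume;	/* toy core's view: device still needs a resume */
	int resume_calls;	/* how many times resume was invoked */
};

/* Toy thaw pass: record whether the driver reported a skipped thaw. */
static void toy_core_thaw(struct toy_dev *d)
{
	assert(d != NULL);
	/* Assumption: 0 means "thawed", -EBUSY means "thaw was skipped". */
	d->needs_resume = (d->thaw_ret == -EBUSY);
}

/* Toy abort path: only devices the core believes are not thawed get resumed. */
static void toy_core_abort_recovery(struct toy_dev *d)
{
	assert(d != NULL);
	if (d->needs_resume) {
		d->resume_calls++;
		d->needs_resume = false;
	}
}
```

In this model a driver that returns 0 from a skipped thaw is never resumed on abort, while one that returns -EBUSY gets exactly one more chance.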
We could obviously go back to dropping this case entirely:
if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
But then the display turns on at thaw(), you do an unnecessary resource
eviction, and it takes a lot longer if you have a ton of VRAM, etc.
The cancellation series is at odds with this code path AFAICS because
what if hibernation is canceled after the entire thaw transition?
Muhammad - did you test that specific timing of cancelling the hibernate?
Some cleanup would need to be done before thawing user space I suppose.
I agree; I think that series would need changes for it.
But if you put that series aside, I think this one still has some merit
on its own. If another driver aborted the hibernate, I think the same
thing could happen if it happened to run before amdgpu's device thaw().
That series just exposed a very "easy" way to reproduce this issue.