amd: Return -EBUSY for amdgpu_pmops_thaw() on success

Rafael J. Wysocki Mon, 20 Oct 2025 12:56:13 -0700

On Mon, Oct 20, 2025 at 9:34 PM Mario Limonciello (AMD) (kernel.org)
<[email protected]> wrote:
>
>
>
> On 10/20/2025 2:18 PM, Rafael J. Wysocki wrote:
> > On Mon, Oct 20, 2025 at 9:14 PM Mario Limonciello (AMD) (kernel.org)
> > <[email protected]> wrote:
> >>
> >>
> >>
> >> On 10/20/2025 1:50 PM, Rafael J. Wysocki wrote:
> >>> On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org)
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote:
> >>>>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org)
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote:
> >>>>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD)
> >>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> From: Mario Limonciello <[email protected]>
> >>>>>>>>
> >>>>>>>> The PM core should be notified that thaw was skipped for the device
> >>>>>>>> so that if it's tried to be resumed (such as an aborted hibernate)
> >>>>>>>> that it gets another chance to resume.
> >>>>>>>>
> >>>>>>>> Cc: Muhammad Usama Anjum <[email protected]>
> >>>>>>>> Signed-off-by: Mario Limonciello <[email protected]>
> >>>>>>>> ---
> >>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
> >>>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>> index 61268aa82df4d..d40af069f24dd 100644
> >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device 
> >>>>>>>> *dev)
> >>>>>>>>
> >>>>>>>>             /* do not resume device if it's normal hibernation */
> >>>>>>>>             if (!pm_hibernate_is_recovering() && 
> >>>>>>>> !pm_hibernation_mode_is_suspend())
> >>>>>>>> -               return 0;
> >>>>>>>> +               return -EBUSY;
> >>>>>>>
> >>>>>>> So that's why you need the special handling of -EBUSY in the previous 
> >>>>>>> patch.
> >>>>>>
> >>>>>> Yup.
> >>>>>>
> >>>>>>>
> >>>>>>> I think that you need to save some state in this driver and then use
> >>>>>>> it in subsequent callbacks instead of hacking the core to do what you
> >>>>>>> want.
> >>>>>>>
> >>>>>>
> >>>>>> The problem is the core decides "what" to call and more importantly
> >>>>>> "when" to call it.
> >>>>>>
> >>>>>> IE if the core thinks that something is thawed it will never call
> >>>>>> resume, and that's why you end up in a bad place with Muhammad's
> >>>>>> cancellation series and why I proposed this one to discuss.
> >>>>>>
> >>>>>> We could obviously go back to dropping this case entirely:
> >>>>>>
> >>>>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend())
> >>>>>>
> >>>>>> But then the display turns on at thaw(), you do an unnecessary resource
> >>>>>> eviction, it takes a lot longer if you have a ton of VRAM etc.
> >>>>>
> >>>>> The cancellation series is at odds with this code path AFAICS because
> >>>>> what if hibernation is canceled after the entire thaw transition?
> >>>>
> >>>> Muhammad - did you test that specific timing of cancelling the hibernate?
> >>>>>
> >>>>> Some cleanup would need to be done before thawing user space I suppose.
> >>>>
> >>>> I agree; I think that series would need changes for it.
> >>>>
> >>>> But if you put that series aside, I think this one still has some merit
> >>>> on it's own.  If another driver aborted the hibernate, I think the same
> >>>> thing could happen if it happened to run before amdgpu's device thaw().
> >>>>
> >>>> That series just exposed a very "easy" way to reproduce this issue.
> >>>
> >>> Device thaw errors don't abort anything AFAICS.
> >>>
> >>
> >> You're right; it doesn't abort, it just is saved to the logs.
> >> The state is also not maintained.
> >>> What can happen though is that another device may abort the final
> >>> "power off" transition, which is one of the reasons why I think that
> >>> rolling it back is generally hard.
> >>
> >> That's exactly the reason for the first patch in this series.  The state
> >> of whether it succeeded isn't recorded.  So if thaw non-fatally fails
> >> and you've saved state to indicate this then any of the other calls that
> >> run can try again.
> >
> > So long as they are called.
> >
> > But as I said before, I would save the state in the driver thaw
> > callback and then clear it in the driver poweroff callback and look at
> > it in the driver restore callback.  If it is there at that point,
> > poweroff has not run and hibernation is rolling back, so you need to
> > do a "thaw".
>
> Are you suggesting that the device driver should directly manipulate
> dev->power.is_suspended?


No, it needs to have its own state for that.  power.is_suspended
should not be manipulated by drivers (or anything other than the core
for that matter).

> I'll do some testing but; I suppose that would work as well without
> needing to make core changes if you don't see a need for other devices
> to do this.

So long as they don't try to skip the "thaw" actions, I don't.

If there are more drivers wanting to do it, I guess it would be good
to have a common approach although at this point I'm not sure how much
in common there would be.

Re: [RFC 3/3] drm/amd: Return -EBUSY for amdgpu_pmops_thaw() on success

Reply via email to