On Mon, Oct 20, 2025 at 9:34 PM Mario Limonciello (AMD) (kernel.org) <[email protected]> wrote: > > > > On 10/20/2025 2:18 PM, Rafael J. Wysocki wrote: > > On Mon, Oct 20, 2025 at 9:14 PM Mario Limonciello (AMD) (kernel.org) > > <[email protected]> wrote: > >> > >> > >> > >> On 10/20/2025 1:50 PM, Rafael J. Wysocki wrote: > >>> On Mon, Oct 20, 2025 at 8:32 PM Mario Limonciello (AMD) (kernel.org) > >>> <[email protected]> wrote: > >>>> > >>>> > >>>> > >>>> On 10/20/2025 12:39 PM, Rafael J. Wysocki wrote: > >>>>> On Mon, Oct 20, 2025 at 7:28 PM Mario Limonciello (AMD) (kernel.org) > >>>>> <[email protected]> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 10/20/2025 12:21 PM, Rafael J. Wysocki wrote: > >>>>>>> On Mon, Oct 20, 2025 at 6:53 PM Mario Limonciello (AMD) > >>>>>>> <[email protected]> wrote: > >>>>>>>> > >>>>>>>> From: Mario Limonciello <[email protected]> > >>>>>>>> > >>>>>>>> The PM core should be notified that thaw was skipped for the device > >>>>>>>> so that if it's tried to be resumed (such as an aborted hibernate) > >>>>>>>> that it gets another chance to resume. > >>>>>>>> > >>>>>>>> Cc: Muhammad Usama Anjum <[email protected]> > >>>>>>>> Signed-off-by: Mario Limonciello <[email protected]> > >>>>>>>> --- > >>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +- > >>>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-) > >>>>>>>> > >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>>>>>> index 61268aa82df4d..d40af069f24dd 100644 > >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >>>>>>>> @@ -2681,7 +2681,7 @@ static int amdgpu_pmops_thaw(struct device > >>>>>>>> *dev) > >>>>>>>> > >>>>>>>> /* do not resume device if it's normal hibernation */ > >>>>>>>> if (!pm_hibernate_is_recovering() && > >>>>>>>> !pm_hibernation_mode_is_suspend()) > >>>>>>>> - return 0; > >>>>>>>> + return -EBUSY; > >>>>>>> > >>>>>>> So that's why you need the special handling of -EBUSY in the previous > >>>>>>> patch. > >>>>>> > >>>>>> Yup. > >>>>>> > >>>>>>> > >>>>>>> I think that you need to save some state in this driver and then use > >>>>>>> it in subsequent callbacks instead of hacking the core to do what you > >>>>>>> want. > >>>>>>> > >>>>>> > >>>>>> The problem is the core decides "what" to call and more importantly > >>>>>> "when" to call it. > >>>>>> > >>>>>> IE if the core thinks that something is thawed it will never call > >>>>>> resume, and that's why you end up in a bad place with Muhammad's > >>>>>> cancellation series and why I proposed this one to discuss. > >>>>>> > >>>>>> We could obviously go back to dropping this case entirely: > >>>>>> > >>>>>> if (!pm_hibernate_is_recovering() && !pm_hibernation_mode_is_suspend()) > >>>>>> > >>>>>> But then the display turns on at thaw(), you do an unnecessary resource > >>>>>> eviction, it takes a lot longer if you have a ton of VRAM etc. > >>>>> > >>>>> The cancellation series is at odds with this code path AFAICS because > >>>>> what if hibernation is canceled after the entire thaw transition? > >>>> > >>>> Muhammad - did you test that specific timing of cancelling the hibernate? > >>>>> > >>>>> Some cleanup would need to be done before thawing user space I suppose. > >>>> > >>>> I agree; I think that series would need changes for it. > >>>> > >>>> But if you put that series aside, I think this one still has some merit > >>>> on it's own. If another driver aborted the hibernate, I think the same > >>>> thing could happen if it happened to run before amdgpu's device thaw(). > >>>> > >>>> That series just exposed a very "easy" way to reproduce this issue. > >>> > >>> Device thaw errors don't abort anything AFAICS. > >>> > >> > >> You're right; it doesn't abort, it just is saved to the logs. > >> The state is also not maintained. > >>> What can happen though is that another device may abort the final > >>> "power off" transition, which is one of the reasons why I think that > >>> rolling it back is generally hard. > >> > >> That's exactly the reason for the first patch in this series. The state > >> of whether it succeeded isn't recorded. So if thaw non-fatally fails > >> and you've saved state to indicate this then any of the other calls that > >> run can try again. > > > > So long as they are called. > > > > But as I said before, I would save the state in the driver thaw > > callback and then clear it in the driver poweroff callback and look at > > it in the driver restore callback. If it is there at that point, > > poweroff has not run and hibernation is rolling back, so you need to > > do a "thaw". > > Are you suggesting that the device driver should directly manipulate > dev->power.is_suspended?
No, it needs to have its own state for that. power.is_suspended should not be manipulated by drivers (or anything other than the core for that matter). > I'll do some testing but; I suppose that would work as well without > needing to make core changes if you don't see a need for other devices > to do this. So long as they don't try to skip the "thaw" actions, I don't. If there are more drivers wanting to do it, I guess it would be good to have a common approach although at this point I'm not sure how much in common there would be.
