On 2021-12-21 16:18, Alex Deucher wrote:
> On Tue, Dec 21, 2021 at 1:47 PM Deucher, Alexander
> <[email protected]> wrote:
>>
>> [Public]
>>
>>> -----Original Message-----
>>> From: Deucher, Alexander
>>> Sent: Tuesday, December 21, 2021 12:01 PM
>>> To: Linus Torvalds <[email protected]>; Imre Deak
>>> <[email protected]>; [email protected]
>>> Cc: Daniel Vetter <[email protected]>; Kai-Heng Feng
>>> <[email protected]>
>>> Subject: RE: Expecting to revert commit 55285e21f045 "fbdev/efifb: Release
>>> PCI device ..."
>>>
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Linus Torvalds <[email protected]>
>>>> Sent: Monday, December 20, 2021 5:05 PM
>>>> To: Imre Deak <[email protected]>
>>>> Cc: Daniel Vetter <[email protected]>; Deucher, Alexander
>>>> <[email protected]>; Kai-Heng Feng
>>>> <[email protected]>
>>>> Subject: Re: Expecting to revert commit 55285e21f045 "fbdev/efifb:
>>>> Release PCI device ..."
>>>>
>>>> On Mon, Dec 20, 2021 at 1:33 PM Imre Deak <[email protected]>
>>> wrote:
>>>>>
>>>>> amdgpu.runpm=0
>>>>
>>>> Hmmm.
>>>>
>>>> This does seem to "work", but not very well.
>>>>
>>>> With this, what seems to happen is odd: I lock the screen, wait, it
>>>> goes "No signal, shutting down", but then doesn't actually shut down
>>>> but stays black (with the backlight on). After _another_ five seconds
>>>> or so, the monitor goes "No signal, shutting down" _again_, and at that
>>> point it actually does it.
>>>>
>>>> So it solves my immediate problem - in that yes, the backlight finally
>>>> does turn off in the end - but it does seem to be still broken.
>>>>
>>>> I'm very surprised if no AMD drm developers can see this exact same thing.
>>>> This is a very simple setup. The only possibly slightly less common
>>>> thing is that I have two monitors, but while that is not necessarily
>>>> the _most_ common setup in an absolute sense, I'd expect it to be very
>>>> common among DRM developers..
>>>>
>>>> I guess I can just change the revert to just a
>>>>
>>>>     -int amdgpu_runtime_pm = -1;
>>>>     +int amdgpu_runtime_pm = 0;
>>>>
>>>> instead. The auto-detect is apparently broken. Maybe it should only
>>>> kick in for LVDS screens on actual laptops?
>>>>
>>>> Note: on my machine, I get that
>>>>
>>>>    amdgpu 0000:49:00.0: amdgpu: Using BACO for runtime pm
>>>>
>>>> so maybe the other possible runtime pm models (ATPX and BOCO) are ok,
>>>> and it's only that BACO case that is broken.
>>>>
>>>> I have no idea what any of those three things are - I'm just looking
>>>> at the uses of that amdgpu_runtime_pm variable.
>>>>
>>>> amdgpu people: if you don't want that amdgpu_runtime_pm turned off by
>>>> default, tell me something else to try.
>>>
>>> For a little background, runtime PM support was added about 10 years ago,
>>> originally to support laptops with multiple GPUs (integrated and discrete).
>>> It's not specific to the display hardware.  When the GPU is idle, it can be
>>> powered down completely.  In the case of these laptops, it's D3 cold
>>> (managed by ACPI, we call this BOCO in AMD parlance - Bus Off, Chip Off)
>>> which powers off the dGPU completely (i.e., it disappears from the bus).  A
>>> few years ago we extended this to support desktop dGPUs as well which
>>> support their own version of runtime D3 (called BACO in AMD parlance - Bus
>>> Active, Chip Off).  The driver can put the chip into a low power state where
>>> everything except the bus interface is powered down (to avoid the device
>>> disappearing from the bus).  So this has worked for almost 2 years now on
>>> BACO capable parts and for a decade or more on BOCO systems.
>>> Unfortunately, changing the default runpm parameter setting would cause a
>>> flood of bug reports about runtime power management breaking and systems
>>> suddenly using more power.
>>>
>>> Imre's commit (55285e21f045) fixes another commit (a6c0fd3d5a8b).
>>> Runtime pm was working on amdgpu prior to that commit.  Is it possible
>>> there is still some race around the point where amdgpu takes over from efifb?
>>> Does it work properly when all pm_runtime calls in efifb are removed, or if
>>> efifb is not enabled?  Runtime pm for Polaris boards has been enabled by default
>>> since 4fdda2e66de0b which predates both of those patches.
>>
>> Thinking about this more, I wonder if there was some change in some 
>> userspace component which was hidden by the changes in 55285e21f045 and 
>> a6c0fd3d5a8b.  E.g., some desktop component started polling for display 
>> changes or GPU temperature or something like that and when a6c0fd3d5a8b was 
>> in place the GPU never entered runtime suspend.  Then when 55285e21f045 was 
>> applied, it unmasked the new behavior in the userspace component.
>>
>> What should happen is that when all of the displays blank, assuming the GPU 
>> is otherwise idle, the GPU will runtime suspend after  seconds.  When you 
>> move the mouse or hit the keyboard, that should trigger the GPU to runtime 
>> resume, and then the displays will be re-enabled.
>>
>> In the behavior you are seeing, when the displays come back on after they 
>> blank, are you seeing the device resume from runtime suspend?  On resume from 
>> suspend (runtime or system) we issue a hotplug notification to userspace in 
>> case any displays changed during suspend when the GPU was powered down (and 
>> hence could not detect a hotplug event).  Perhaps the device runtime resumes 
>> due to some other event, and that notification then triggers userspace to 
>> reprobe and re-enable the displays shortly afterwards.  Does something like 
>> this help by any chance?
>> https://bugzilla.kernel.org/attachment.cgi?id=300103&action=diff&collapsed=&headers=1&format=raw
> 
> I'm not seeing this on my system, and another user tried the patch and
> it fixed his system, so it indeed seems to be something along the
> lines of what I described above.  Something in userspace seems to be
> accessing the GPU causing it to runtime resume.  On resume the driver
> then sends a hotplug event to userspace (in case anything display
> related changed while the GPU was suspended).  The desktop manager
> then sees the hotplug event and reprobes and lights up the displays
> again.  @Wentland, Harry, I guess the display code may already handle
> this (seeing if anything has changed on resume since suspend), so we
> can probably drop the call from the generic amdgpu resume code?  Or if
> not, it should be pretty straightforward to fix that in
> dm_suspend()/dm_resume().
> 

We handle re-detection in dm_resume but only seem to call
drm_kms_helper_hotplug_event for MST. We could call it in
dm_resume instead, but that would just move it out of
amdgpu_device_resume and would probably cause the same issue.

What I think we really want here is to make sure
drm_kms_helper_hotplug_event is only called when the display
config actually changes. Unfortunately, we're not being very
smart in our detection and just recreate everything (even
though dc_link_detect_helper seems to have code that wants to
be smart enough to detect whether the sink has changed).
That makes this change non-trivial; at least I can't see an
easy fix that doesn't run the risk of breaking a bunch of
stuff.
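
Just to illustrate the direction, a rough sketch of what I have in
mind is below. dm_sink_changed() is a made-up placeholder for whatever
comparison we would build on top of dc_link_detect_helper; the rest
uses the stock drm connector iterator helpers.

#include <drm/drm_connector.h>
#include <drm/drm_probe_helper.h>

/*
 * Sketch only: fire the hotplug event on resume only if some
 * connector's sink actually changed while we were suspended.
 * dm_sink_changed() is hypothetical and stands in for a real
 * comparison of the re-detected sink against the pre-suspend one.
 */
static void dm_resume_notify_hotplug(struct drm_device *ddev)
{
        struct drm_connector_list_iter iter;
        struct drm_connector *connector;
        bool changed = false;

        drm_connector_list_iter_begin(ddev, &iter);
        drm_for_each_connector_iter(connector, &iter) {
                if (dm_sink_changed(connector))
                        changed = true;
        }
        drm_connector_list_iter_end(&iter);

        /* Only poke userspace if the display config really changed. */
        if (changed)
                drm_kms_helper_hotplug_event(ddev);
}

Whether our current detection can give us a reliable "did the sink
change" answer is exactly the non-trivial part, though.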

Or maybe someone can try to see whether (non-MST) detection during
RPM still works with your patch. If it does (for some reason), then
we should be good.

I can't seem to repro the problem on my Navi 1x card with KDE Plasma.
I wonder if it's a Polaris issue.

I also don't see my Navi 1x card going into RPM after
'xset dpms force off' with the Manjaro 5.15 kernel. Might need
to try the latest mainline or amd-stg.

Harry


> Alex
