Hi

On 15.01.26 at 15:39, Christian König wrote:
Sorry for being late, but I only now realized what you are doing here.

On 1/15/26 12:02, Thomas Zimmermann wrote:
Hi,

apologies for the delay. I wanted to reply and then forgot about it.

On 10.01.26 at 05:52, Zack Rusin wrote:
On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <[email protected]> wrote:
Hi

On 29.12.25 at 22:58, Zack Rusin wrote:
Almost a rite of passage for every DRM developer and most Linux users
is upgrading your DRM driver/updating boot flags/changing some config
and having the DRM driver fail at probe, resulting in a blank screen.

Currently there's no way to recover from DRM driver probe failure. PCI
DRM drivers explicitly throw out the existing sysfb to get exclusive
access to PCI resources, so if the probe fails the system is left without
a functioning display driver.

Add code to sysfb to recover the system framebuffer when a DRM driver's probe
fails. This means that a DRM driver that fails to load reloads the system
framebuffer driver.

This works best with simpledrm. Without it Xorg won't recover because
it still tries to load the vendor specific driver which ends up usually
not working at all. With simpledrm the system recovers really nicely
ending up with a working console and not a blank screen.

There's a caveat in that some hardware might require some special magic
register write to recover EFI display. I'd appreciate it a lot if
maintainers could introduce a temporary failure in their driver's
probe to validate that the sysfb recovers and they get a working console.
The easiest way to double check it is by adding:
    /* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
    dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
    ret = -EINVAL;
    goto out_error;
or such right after the devm_aperture_remove_conflicting_pci_devices call.
Recovering the display like that is guesswork and will at best work
with simple discrete devices where the framebuffer is always located in
a confined graphics aperture.

But the problem you're trying to solve is a real one.

What we'd want to do instead is to take the initial hardware state into
account when we do the initial mode-setting operation.

The first step is to move each driver's remove_conflicting_devices call
to the latest possible location in the probe function. We usually do it
first, because that's easy. But on most hardware, it could happen much
later.
Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
they request pci regions, which is going to fail otherwise. Because
grabbing the pci resources is in general the very first thing that
those drivers need to do to set up anything, we call
remove_conflicting_devices first or at least very early.
To my knowledge, requesting resources is more about correctness than a hard 
requirement to use an I/O or memory range. Has this changed?
Nope that is not correct.

At least for AMD GPUs, calling remove_conflicting_devices() really early is
necessary because otherwise some operations just result in a spontaneous system reboot.

Here I was only talking about avoiding calls to request_resource() and similar interfaces.


For example resizing the PCIe BAR giving access to VRAM or disabling VGA 
emulation (which AFAIK is used for EFI as well) is only possible when the VGA 
or EFI framebuffer driver is kicked out first.

Yeah, that's what I expected.


And disabling VGA emulation is among the absolutely first steps you do to take 
over the scanout config.

Assuming the driver (or driver author) is careful, is it possible to only read state from AMD hardware at such an early time?

We usually do remove_conflicting_devices() as the first thing in most drivers' probe functions. As a first step, it would be helpful to postpone it to a later point.


So I absolutely clearly have to reject the amdgpu patch in this series, that 
will break tons of use cases.

Don't worry, we're still in the early ideation phase.

Best regards
Thomas


Regards,
Christian.

I also don't think it's possible or even desirable for some drivers to
reuse the initial state. A good example here is vmwgfx, where by default
some people will set up their VMs with e.g. 8mb ram. When vmwgfx
loads we allow scanning out from system memory, so you can set your VM
up with 8mb of vram but still use 4k resolutions when the driver
loads. This way the suspend size of the VM is very predictable (tiny
vram plus whatever ram was set up) while still allowing a lot of
flexibility.
If there's no initial state to switch from, the first modeset can fail while 
leaving the display unusable. There's no way around that. Going back to the old 
state is not an option unless the driver has been written to support this.

The case of vmwgfx is special, but does not affect the overall problem. For 
vmwgfx, it would be best to import that initial state and support a transparent 
modeset from vram to system memory (and back) at least during this initial 
state.


In general I think however this is planned it's two or three separate series:
1) infrastructure to reload the sysfb driver (what this series is)
2) making sure that drivers that do want to recover cleanly actually
clean out all the state on exit properly,
3) abstracting at least some of that cleanup in some driver independent way
That's really not going to work. For example, in the current series, you invoke
devm_aperture_remove_conflicting_pci_devices_done() after drm_mode_reset(),
drm_dev_register() and drm_client_setup(). Each of these calls can modify
hardware state. In the case of _register() and _setup(), the DRM clients can
perform a modeset, which destroys the initial hardware state.

Patch 1 of this series removes the sysfb device/driver entirely. That should be
a no-go as it significantly complicates recovery. For example, if the native
driver failed from an allocation failure, the sysfb device/driver is not likely
to come back either.

As the very first thing, the series should state which failures it is going to
resolve,

- failed hardware init,
- invalid initial modesetting,
- runtime errors (such as ENOMEM, failed firmware loading),
- others?

And then specify how a recovery to sysfb could look in each supported scenario.

In terms of implementation, make any transition between drivers gradual. The
native driver needs to acquire the hardware resources (framebuffer and I/O
apertures) without unloading the sysfb driver. Luckily there's struct
drm_device.unplug, which does that. [1] Flipping this field disables hardware
access for DRM drivers. All sysfb drivers support this.

To get the sysfb drivers ready, I suggest dedicated helpers for each driver's
aperture. The aperture helpers can use these callbacks to flip the DRM driver
off and on again. For example, efidrm could do this as a minimum:

    int efidrm_aperture_suspend()
    {
        dev->unplug = true;
        remove_resource(/*framebuffer aperture*/);
        return 0;
    }

    int efidrm_aperture_resume()
    {
        insert_resource(/*framebuffer aperture*/);
        dev->unplug = false;
        return 0;
    }

    struct aperture_funcs efidrm_aperture_funcs = {
        .suspend = efidrm_aperture_suspend,
        .resume = efidrm_aperture_resume,
    };

Pass this struct when efidrm acquires the framebuffer aperture, so that the
aperture helpers can control the behavior of efidrm.

With this, a multi-step takeover from sysfb to native driver can be tried. It's
still a massive effort that requires an audit of each driver's probing logic.
There's no copy-paste pattern AFAICT. I suggest picking one simple driver first
and making a prototype.

Let me also say that I DO like the general idea you're proposing. But if it was
easy, we would likely have done it already.

Best regards
Thomas
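
For illustration only, the suspend/resume handshake described above can be modelled in plain user-space C. This is a minimal sketch, not kernel code: `struct mock_sysfb`, `struct mock_aperture_funcs`, `mock_takeover()` and the two probe stubs are hypothetical names invented here, and the two boolean fields merely stand in for the unplug flag and the framebuffer resource. It shows the intended ordering: the sysfb driver is suspended rather than removed, and it is resumed if the native probe fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a sysfb driver's device state. */
struct mock_sysfb {
    bool unplug;        /* true: driver must not touch the hardware */
    bool owns_aperture; /* true: framebuffer resource is registered */
};

/* Hypothetical callback table, modelled on the aperture_funcs idea. */
struct mock_aperture_funcs {
    int (*suspend)(struct mock_sysfb *fb);
    int (*resume)(struct mock_sysfb *fb);
};

static int mock_sysfb_suspend(struct mock_sysfb *fb)
{
    fb->unplug = true;         /* stop hardware access first */
    fb->owns_aperture = false; /* then release the aperture */
    return 0;
}

static int mock_sysfb_resume(struct mock_sysfb *fb)
{
    fb->owns_aperture = true;  /* re-claim the aperture */
    fb->unplug = false;        /* then allow hardware access again */
    return 0;
}

static const struct mock_aperture_funcs mock_sysfb_funcs = {
    .suspend = mock_sysfb_suspend,
    .resume = mock_sysfb_resume,
};

/* Stubs for a native driver probe that succeeds or fails. */
static int probe_ok(void) { return 0; }
static int probe_fail(void) { return -EINVAL; }

/*
 * Multi-step takeover: suspend sysfb, run the native probe, and hand the
 * display back to sysfb if the probe returns an error.
 */
static int mock_takeover(struct mock_sysfb *fb,
                         const struct mock_aperture_funcs *funcs,
                         int (*native_probe)(void))
{
    int ret;

    assert(fb && funcs && native_probe);

    ret = funcs->suspend(fb);
    if (ret)
        return ret;

    ret = native_probe();
    if (ret)
        funcs->resume(fb); /* probe failed: sysfb takes over again */

    return ret;
}
```

Calling `mock_takeover(&fb, &mock_sysfb_funcs, probe_fail)` leaves `fb` in its resumed state, while `probe_ok` leaves it suspended. A real implementation would also have to deal with hardware state that the failed probe already modified, which this model ignores.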

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)

