Hi,
apologies for the delay. I wanted to reply and then forgot about it.
On 10.01.26 at 05:52, Zack Rusin wrote:
On Fri, Jan 9, 2026 at 5:34 AM Thomas Zimmermann <[email protected]> wrote:
Hi
On 29.12.25 at 22:58, Zack Rusin wrote:
Almost a rite of passage for every DRM developer and most Linux users
is upgrading your DRM driver/updating boot flags/changing some config
and having DRM driver fail at probe resulting in a blank screen.
Currently there's no way to recover from a DRM driver probe failure. PCI
DRM drivers explicitly throw out the existing sysfb to get exclusive
access to PCI resources, so if the probe fails the system is left without
a functioning display driver.
Add code to sysfb to recover the system framebuffer when a DRM driver's
probe fails. This means that a DRM driver that fails to load reloads the
system framebuffer driver.
This works best with simpledrm. Without it Xorg won't recover because
it still tries to load the vendor specific driver which ends up usually
not working at all. With simpledrm the system recovers really nicely
ending up with a working console and not a blank screen.
There's a caveat in that some hardware might require some special magic
register write to recover the EFI display. I'd appreciate it a lot if
maintainers could introduce a temporary failure in their driver's
probe to validate that the sysfb recovers and they get a working console.
The easiest way to double check it is by adding:
/* XXX: Temporary failure to test sysfb restore - REMOVE BEFORE COMMIT */
dev_info(&pdev->dev, "Testing sysfb restore: forcing probe failure\n");
ret = -EINVAL;
goto out_error;
or such right after the devm_aperture_remove_conflicting_pci_devices call.
Recovering the display like that is guesswork and will at best work
with simple discrete devices where the framebuffer is always located in
a confined graphics aperture.
But the problem you're trying to solve is a real one.
What we'd want to do instead is to take the initial hardware state into
account when we do the initial mode-setting operation.
The first step is to move each driver's remove_conflicting_devices call
to the latest possible location in the probe function. We usually do it
first, because that's easy. But on most hardware, it could happen much
later.
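To illustrate why the position of the call matters, here is a minimal
user-space model, not kernel code. All model_* names are hypothetical and
exist only for this sketch; hw_init_ok stands in for whatever fallible
step a real probe function performs (firmware load, hardware init,
allocations).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct model_display {
	bool sysfb_active;   /* firmware framebuffer driver is still bound */
	bool native_active;  /* native DRM driver finished probing */
};

/* Stand-in for the driver's remove_conflicting_devices call. */
static void model_remove_sysfb(struct model_display *d)
{
	d->sysfb_active = false;
}

/*
 * Model of a probe function. remove_early selects whether the sysfb
 * removal happens before or after the fallible part of probing.
 * Returns 0 on success and -1 on failure.
 */
static int model_probe(struct model_display *d, bool hw_init_ok,
		       bool remove_early)
{
	assert(d != NULL);

	if (remove_early)
		model_remove_sysfb(d);

	if (!hw_init_ok)
		return -1;  /* sysfb survives only if it was not removed yet */

	if (!remove_early)
		model_remove_sysfb(d);

	d->native_active = true;
	return 0;
}

/* True if some driver is still able to drive the display. */
static bool model_has_display(const struct model_display *d)
{
	return d->sysfb_active || d->native_active;
}
```

In this model, a failed probe with late removal leaves the sysfb driver
bound, while a failed probe with early removal leaves no display driver
at all. The model deliberately ignores hardware state that a native
driver may already have modified before it fails.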
Well, some drivers (vbox, vmwgfx, bochs and cirrus-qemu) do it because
they request PCI regions, which is going to fail otherwise. Because
grabbing the PCI resources is in general the very first thing that
those drivers need to do to set up anything, we call
remove_conflicting_devices first or at least very early.
To my knowledge, requesting resources is more about correctness than a
hard requirement to use an I/O or memory range. Has this changed?
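For reference, the conflict described above can be modelled as an
exclusive claim on an address range. The following is a user-space
sketch under that assumption, not the kernel's resource tree; the
model_* names, the owner strings and the addresses in the usage note are
made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_CLAIMS 8

/* One exclusively claimed address range (hypothetical model type). */
struct model_claim {
	uint64_t start;
	uint64_t end;        /* inclusive */
	const char *owner;
};

struct model_resources {
	struct model_claim claims[MODEL_MAX_CLAIMS];
	size_t count;
};

static bool model_overlaps(const struct model_claim *c, uint64_t start,
			   uint64_t end)
{
	return start <= c->end && c->start <= end;
}

/* Returns 0 on success, -1 if the range is busy or the table is full. */
static int model_request(struct model_resources *r, uint64_t start,
			 uint64_t end, const char *owner)
{
	assert(r != NULL && owner != NULL && start <= end);

	for (size_t i = 0; i < r->count; i++)
		if (model_overlaps(&r->claims[i], start, end))
			return -1;

	if (r->count == MODEL_MAX_CLAIMS)
		return -1;

	r->claims[r->count++] = (struct model_claim){ start, end, owner };
	return 0;
}

/* Drop every claim held by the given owner. */
static void model_release_owner(struct model_resources *r, const char *owner)
{
	size_t kept = 0;

	for (size_t i = 0; i < r->count; i++)
		if (strcmp(r->claims[i].owner, owner) != 0)
			r->claims[kept++] = r->claims[i];
	r->count = kept;
}
```

With a "sysfb" claim on 0xe0000000-0xe07fffff in place, a "native"
request for 0xe0000000-0xefffffff is rejected in this model until
model_release_owner() drops the sysfb claim. Whether a real driver
strictly needs the claim to touch the range is exactly the open question
in this thread, and the model does not answer it.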
I also don't think it's possible or even desirable for some drivers to
reuse the initial state. A good example here is vmwgfx: by default
some people will set up their VMs with e.g. 8 MB of VRAM. When vmwgfx
loads, we allow scanning out from system memory, so you can set your VM
up with 8 MB of VRAM but still use 4K resolutions once the driver
loads. This way the suspend size of the VM is very predictable (tiny
VRAM plus whatever RAM was set up) while still allowing a lot of
flexibility.
If there's no initial state to switch from, the first modeset can fail
while leaving the display unusable. There's no way around that. Going
back to the old state is not an option unless the driver has been
written to support this.
The case of vmwgfx is special, but does not affect the overall problem.
For vmwgfx, it would be best to import that initial state and support a
transparent modeset from vram to system memory (and back) at least
during this initial state.
In general I think that, however this is planned, it's two or three separate series:
1) infrastructure to reload the sysfb driver (what this series is)
2) making sure that drivers that do want to recover cleanly actually
clean out all the state on exit properly,
3) abstracting at least some of that cleanup in some driver independent way
That's really not going to work. For example, in the current series, you
invoke devm_aperture_remove_conflicting_pci_devices_done() after
drm_mode_reset(), drm_dev_register() and drm_client_setup(). Each of
these calls can modify hardware state. In the case of _register() and
_setup(), the DRM clients can perform a modeset, which destroys the
initial hardware state.

Patch 1 of this series removes the sysfb device/driver entirely. That
should be a no-go as it significantly complicates recovery. For example,
if the native driver failed from an allocation failure, the sysfb
device/driver is not likely to come back either.

As the very first thing, the series should state which failures it is
going to resolve:

 - failed hardware init,
 - invalid initial modesetting,
 - runtime errors (such as ENOMEM, failed firmware loading),
 - others?

And then specify how a recovery to sysfb could look in each supported
scenario.

In terms of implementation, make any transition between drivers
gradual. The native driver needs to acquire the hardware resource
(framebuffer and I/O apertures) without unloading the sysfb driver.
Luckily there's struct drm_device.unplug, which does that. [1] Flipping
this field disables hardware access for DRM drivers. All sysfb drivers
support this.

To get the sysfb drivers ready, I suggest dedicated helpers for each
driver's aperture. The aperture helpers can use these callbacks to flip
the DRM driver off and on again. For example, efidrm could do this as a
minimum:

    int efidrm_aperture_suspend()
    {
        dev->unplug = true;
        remove_resource(/*framebuffer aperture*/);
        return 0;
    }

    int efidrm_aperture_resume()
    {
        insert_resource(/*framebuffer aperture*/);
        dev->unplug = false;
        return 0;
    }

    struct aperture_funcs efidrm_aperture_funcs = {
        .suspend = efidrm_aperture_suspend,
        .resume = efidrm_aperture_resume,
    };

Pass this struct when efidrm acquires the framebuffer aperture, so that
the aperture helpers can control the behavior of efidrm.

With this, a multi-step takeover from sysfb to native driver can be
tried. It's still a massive effort that requires an audit of each
driver's probing logic. There's no copy-paste pattern AFAICT. I suggest
picking one simple driver first and making a prototype.
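
To make the control flow of such a multi-step takeover concrete, here is
a small user-space model of the suspend/resume callbacks suggested
above. It is only a sketch: every model_* name is hypothetical, the two
booleans merely stand in for the unplug flag and the framebuffer
resource, and no hardware state is restored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-driver aperture callbacks. */
struct model_aperture_funcs {
	int (*suspend)(void *ctx);
	int (*resume)(void *ctx);
};

/* Hypothetical model of a sysfb driver instance. */
struct model_sysfb {
	bool unplugged;      /* models the unplug flag being flipped */
	bool owns_aperture;  /* models the framebuffer resource being claimed */
};

static int model_sysfb_suspend(void *ctx)
{
	struct model_sysfb *s = ctx;

	s->unplugged = true;        /* stop touching the hardware first */
	s->owns_aperture = false;   /* then give up the aperture */
	return 0;
}

static int model_sysfb_resume(void *ctx)
{
	struct model_sysfb *s = ctx;

	s->owns_aperture = true;    /* reclaim the aperture first */
	s->unplugged = false;       /* then allow hardware access again */
	return 0;
}

static const struct model_aperture_funcs model_sysfb_funcs = {
	.suspend = model_sysfb_suspend,
	.resume = model_sysfb_resume,
};

/*
 * Model of the takeover: suspend sysfb, run the native driver's init,
 * and resume sysfb if that init fails. native_init returns 0 on success.
 */
static int model_takeover(const struct model_aperture_funcs *funcs, void *ctx,
			  int (*native_init)(void))
{
	int ret;

	assert(funcs != NULL && native_init != NULL);

	ret = funcs->suspend(ctx);
	if (ret)
		return ret;

	ret = native_init();
	if (ret) {
		/* Native driver failed: hand the aperture back to sysfb. */
		funcs->resume(ctx);
		return ret;
	}

	return 0;
}

/* Trivial stand-ins for a native driver's init path. */
static int model_native_init_ok(void) { return 0; }
static int model_native_init_fail(void) { return -1; }
```

Calling model_takeover() with model_native_init_fail leaves the modelled
sysfb driver resumed and owning its aperture again; with
model_native_init_ok it stays suspended. The hard part discussed in this
thread, restoring hardware state that the native driver already
modified, is outside what this model covers.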

Let me also say that I DO like the general idea you're proposing. But if
it was easy, we would likely have done it already.

Best regards
Thomas
z
--
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)