Hi @bugs,
Currently there is an issue on certain GPUs where, during boot, they
complain about a process interrupt error; I've included the error message
below.
As a result, DRM fails to start and the GPU is unusable. The issue went
away slightly before the 6.8 release, but came back after a message
printing diagnostic info was inserted into the kernel. After some time
spent inserting print statements throughout the driver, I've determined
that it fails in the same fashion when any message is printed between the
calls to amdgpu_device_init and wsdisplay_cnattach in amdgpu_attachhook in
amdgpu_kms.c, with different behavior if the message is placed very
shortly before the point at which amdgpu_stdscreen is initialized.
Tracing to the bottom of the calls, it seems that amdgpu_stdscreen somehow
gets corrupted if there is any attempt at printing from the kernel after a
certain point. Specifically, it fails if anything is printed after line
751 of psp_v11_0.c ("WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_67, value);")
and before line 144 of wsemul_vt100.c ("edp->emulcookie = cookie", where
cookie is a void pointer that at this point points to amdgpu_stdscreen).
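Since adding a printf is what changes the behavior, one print-free way to test the corruption theory might be to snapshot the object behind the cookie and compare it later, keeping the result in a variable that can be inspected afterwards. The following is only a userland sketch of that idea: struct fake_screen, struct fake_emul, snapshot_take and snapshot_changed are hypothetical names and are not part of wscons or amdgpu.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the screen object the cookie points at. */
struct fake_screen {
	int nrows;
	int ncols;
	unsigned int capabilities;
};

/* Hypothetical stand-in for the emulation state holding the opaque cookie. */
struct fake_emul {
	void *emulcookie;
};

/* Saved copy of the bytes the cookie pointed at when the snapshot was taken. */
struct snapshot {
	unsigned char bytes[sizeof(struct fake_screen)];
};

/* Record the current contents of the object behind the cookie. */
static void
snapshot_take(struct snapshot *snap, const struct fake_emul *edp)
{
	memcpy(snap->bytes, edp->emulcookie, sizeof(snap->bytes));
}

/* Return 1 if the object behind the cookie differs from the snapshot, else 0. */
static int
snapshot_changed(const struct snapshot *snap, const struct fake_emul *edp)
{
	return memcmp(snap->bytes, edp->emulcookie, sizeof(snap->bytes)) != 0;
}
```

In a real kernel the comparison result would have to be stored somewhere readable after boot (for example a global examined from ddb), since printing it immediately would defeat the purpose.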
The latter of these two is in a function that is called multiple times,
but it is only called once in the problem area. Any help would be greatly
appreciated, as I am currently unable to use OpenBSD on my desktop. Even
just a pointer on how to go about fixing it myself would be useful; I
program daily in C so I might be able to, although I'm still new to
driver-related matters. I've included the patch, using the printf_flags
that @jcs recommended, that allows me to boot without the issue, though
there are some other graphical issues that cause crashes and it's only a
temporary hack.
Best regards,
Charlie Burnett
Workaround hack, against code checked out this morning:
Index: sys/dev/pci/drm/amd/amdgpu/psp_v11_0.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/drm/amd/amdgpu/psp_v11_0.c,v
retrieving revision 1.2
diff -u -p -r1.2 psp_v11_0.c
--- sys/dev/pci/drm/amd/amdgpu/psp_v11_0.c 7 Jul 2021 02:38:23 -0000 1.2
+++ sys/dev/pci/drm/amd/amdgpu/psp_v11_0.c 19 Aug 2021 15:24:20 -0000
@@ -741,6 +741,7 @@ static uint32_t psp_v11_0_ring_get_wptr(
static void psp_v11_0_ring_set_wptr(struct psp_context *psp, uint32_t value)
{
+ extern int printf_flags;
struct amdgpu_device *adev = psp->adev;
if (amdgpu_sriov_vf(adev)) {
@@ -749,6 +750,7 @@ static void psp_v11_0_ring_set_wptr(stru
psp->km_ring.ring_wptr = value;
} else
WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_67, value);
+ printf_flags &= ~0x1;
}
static int psp_v11_0_load_usbc_pd_fw(struct psp_context *psp, dma_addr_t dma_addr)
Index: sys/dev/wscons/wsemul_vt100.c
===================================================================
RCS file: /cvs/src/sys/dev/wscons/wsemul_vt100.c,v
retrieving revision 1.39
diff -u -p -r1.39 wsemul_vt100.c
--- sys/dev/wscons/wsemul_vt100.c 25 May 2020 09:55:49 -0000 1.39
+++ sys/dev/wscons/wsemul_vt100.c 19 Aug 2021 15:24:20 -0000
@@ -139,8 +139,12 @@ wsemul_vt100_init(struct wsemul_vt100_em
const struct wsscreen_descr *type, void *cookie, int ccol, int crow,
uint32_t defattr)
{
+ static int vt_num;
+ extern int printf_flags;
edp->emulops = type->textops;
edp->emulcookie = cookie;
+ if (vt_num++ == 1) /* the instance for me with amdgpu_stdscreen */
+ printf_flags |= 0x1;
edp->scrcapabilities = type->capabilities;
edp->nrows = type->nrows;
edp->ncols = type->ncols;
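To make the intent of the two hunks easier to follow, here is a minimal userland sketch of the gating they rely on: one bit of a flags word is cleared to silence output at the start of the problem window and set again at the end. It assumes that bit 0x1 of printf_flags selects console output; how the real printf_flags is consumed inside the kernel is not shown in the patch, and every sketch_* name below is hypothetical.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define SKETCH_TOCONS 0x1	/* assumption: bit 0x1 means "print to console" */

static int sketch_printf_flags = SKETCH_TOCONS;	/* stand-in for printf_flags */
static int sketch_console_writes;	/* number of messages actually printed */

/* Print only while the console bit is set; otherwise drop the message. */
static int
sketch_printf(const char *fmt, ...)
{
	va_list ap;
	int n;

	if ((sketch_printf_flags & SKETCH_TOCONS) == 0)
		return 0;
	va_start(ap, fmt);
	n = vprintf(fmt, ap);
	va_end(ap);
	sketch_console_writes++;
	return n;
}

/* Mirrors the psp_v11_0_ring_set_wptr hunk: silence output from here on. */
static void
sketch_window_enter(void)
{
	sketch_printf_flags &= ~SKETCH_TOCONS;
}

/* Mirrors the wsemul_vt100_init hunk: allow output again. */
static void
sketch_window_leave(void)
{
	sketch_printf_flags |= SKETCH_TOCONS;
}
```

Any message issued between sketch_window_enter() and sketch_window_leave() is dropped rather than deferred, which matches the patch being a way to avoid the trigger and not a fix for the underlying cause.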
And here's the error message; this is running the latest snapshot on the
mirrors:
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* [gfxhub0] no-retry page fault (src_id:0 ring:222 vmid:0 pasid:0, for process pid 0 thread pid 0)
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* in page starting at address 0x000000000086e000 from client 27
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* VM_L2_PROTECTION_FAULT_STATUS:0x000009BC
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* Faulty UTCL2 client ID: CPF (0x4)
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* MORE_FAULTS: 0x0
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* WALKER_ERROR: 0x6
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* PERMISSION_FAULTS: 0xb
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* MAPPING_ERROR: 0x1
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* RW: 0x0
Aug 19 08:23:00 localhost /bsd: [drm] *ERROR* IB test failed on gfx (-60).
Aug 19 08:23:00 localhost /bsd: [drm] *ERROR* ib ring test failed (-60).
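For anyone reading the log, the per-field values printed above can be reproduced from the raw VM_L2_PROTECTION_FAULT_STATUS word. The sketch below uses a bit layout inferred from the logged values themselves (0x000009BC decoding to the fields shown); the authoritative layout is in the driver's register headers, so treat the shifts and masks here as an assumption, and the struct and function names as hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields the driver prints. */
struct fault_status {
	unsigned int more_faults;
	unsigned int walker_error;
	unsigned int permission_faults;
	unsigned int mapping_error;
	unsigned int client_id;
	unsigned int rw;
};

/* Split the raw status word into fields (assumed bit positions). */
static struct fault_status
decode_fault_status(uint32_t status)
{
	struct fault_status f;

	f.more_faults = status & 0x1;			/* bit 0 */
	f.walker_error = (status >> 1) & 0x7;		/* bits 1-3 */
	f.permission_faults = (status >> 4) & 0xf;	/* bits 4-7 */
	f.mapping_error = (status >> 8) & 0x1;		/* bit 8 */
	f.client_id = (status >> 9) & 0x1ff;		/* bits 9-17 */
	f.rw = (status >> 18) & 0x1;			/* bit 18 */
	return f;
}
```

Calling decode_fault_status(0x000009BC) gives walker_error 0x6, permission_faults 0xb, mapping_error 0x1 and client_id 0x4, matching the lines in the log.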
Let me know if there's anything I missed!