wsemul init amdgpu crash

Charlie Burnett Thu, 19 Aug 2021 14:17:26 -0700

Hi @bugs,

    On certain GPUs, the kernel reports a process interrupt error during
boot (the error message is included below). DRM then fails to start and the
GPU is unusable. The problem went away shortly before the 6.8 release, but
came back after a message printing diagnostic info was added to the kernel.
After some time spent inserting print statements throughout the driver, I've
determined that it fails in the same fashion when any message is printed
between the calls to amdgpu_device_init and wsdisplay_cnattach in
amdgpu_attachhook in amdgpu_kms.c, with different behavior if the message is
printed very shortly before the point at which amdgpu_stdscreen is
initialized.


    Tracing the calls down, it seems that amdgpu_stdscreen somehow gets
corrupted if the kernel prints anything within a certain window.
Specifically, it fails if a message is printed after line 751 in
psp_v11_0.c ("WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_67, value);") and before
line 144 in wsemul_vt100.c ("edp->emulcookie = cookie", where cookie is a
void pointer that at this point points to amdgpu_stdscreen). The latter is
in a function that is called multiple times, but it is only called once in
the problem area.  Any help would be greatly appreciated, as I am currently
unable to use OpenBSD on my desktop.  Even just a pointer on how to go about
fixing it myself would be useful. I program daily in C, so I might be able
to, although I'm still new to driver-related matters. I've included the
patch, using the printf_flags that @jcs recommended, that allows me to boot
without the issue, though there are some other graphical issues that cause
crashes and it's only a temporary hack.

Best regards,
Charlie Burnett

Workaround hack, against code checked out this morning:

Index: sys/dev/pci/drm/amd/amdgpu/psp_v11_0.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/drm/amd/amdgpu/psp_v11_0.c,v
retrieving revision 1.2
diff -u -p -r1.2 psp_v11_0.c
--- sys/dev/pci/drm/amd/amdgpu/psp_v11_0.c      7 Jul 2021 02:38:23 -0000      1.2
+++ sys/dev/pci/drm/amd/amdgpu/psp_v11_0.c      19 Aug 2021 15:24:20 -0000
@@ -741,6 +741,7 @@ static uint32_t psp_v11_0_ring_get_wptr(

 static void psp_v11_0_ring_set_wptr(struct psp_context *psp, uint32_t value)
 {
+       extern int printf_flags;
        struct amdgpu_device *adev = psp->adev;

        if (amdgpu_sriov_vf(adev)) {
@@ -749,6 +750,7 @@ static void psp_v11_0_ring_set_wptr(stru
                psp->km_ring.ring_wptr = value;
        } else
                WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_67, value);
+       printf_flags &= ~0x1;
 }

 static int psp_v11_0_load_usbc_pd_fw(struct psp_context *psp, dma_addr_t dma_addr)
Index: sys/dev/wscons/wsemul_vt100.c
===================================================================
RCS file: /cvs/src/sys/dev/wscons/wsemul_vt100.c,v
retrieving revision 1.39
diff -u -p -r1.39 wsemul_vt100.c
--- sys/dev/wscons/wsemul_vt100.c       25 May 2020 09:55:49 -0000      1.39
+++ sys/dev/wscons/wsemul_vt100.c       19 Aug 2021 15:24:20 -0000
@@ -139,8 +139,12 @@ wsemul_vt100_init(struct wsemul_vt100_em
     const struct wsscreen_descr *type, void *cookie, int ccol, int crow,
     uint32_t defattr)
 {
+       static int vt_num;
+       extern int printf_flags;
        edp->emulops = type->textops;
        edp->emulcookie = cookie;
+       if (vt_num++ == 1) /* the instance for me with amdgpu_stdscreen */
+               printf_flags |= 0x1;
        edp->scrcapabilities = type->capabilities;
        edp->nrows = type->nrows;
        edp->ncols = type->ncols;


And here's the error message; this is running the latest snapshot on the
mirrors:


Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* [gfxhub0] no-retry page fault (src_id:0 ring:222 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR*   in page starting at address 0x000000000086e000 from client 27
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR* VM_L2_PROTECTION_FAULT_STATUS:0x000009BC
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR*   Faulty UTCL2 client ID: CPF (0x4)
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR*   MORE_FAULTS: 0x0
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR*   WALKER_ERROR: 0x6
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR*   PERMISSION_FAULTS: 0xb
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR*   MAPPING_ERROR: 0x1
Aug 19 08:23:00 localhost /bsd: drm:pid3866:gmc_v9_0_process_interrupt *ERROR*   RW: 0x0
Aug 19 08:23:00 localhost /bsd: [drm] *ERROR* IB test failed on gfx (-60).
Aug 19 08:23:00 localhost /bsd: [drm] *ERROR* ib ring test failed (-60).

Let me know if there's anything I missed!
