On 2022-11-10 18:00, Michel Dänzer wrote:
> On 2022-11-08 09:01, Zhu, Jiadong wrote:
>>
>> I reproduced the glxgears 400fps scenario locally. The issue is caused by
>> the patch5 "drm/amdgpu: Improve the software rings priority scheduler" which
>> slows down the low priority scheduler thread if high priority ib is under
>> executing. I'll drop this patch as we cannot identify gpu bound according to
>> the unsignaled fence, etc.
>
> Okay, I'm testing with patches 1-4 only now.
>
> So far I haven't noticed any negative effects, no slowdowns or intermittent
> freezes.

I'm afraid I may have run into another issue. I just hit a GPU hang, see the
journalctl excerpt below.

(I tried rebooting the machine via SSH after this, but it never seemed to
complete, so I had to hard-power-off the machine by holding the power button
for a few seconds)

I can't be sure that the GPU hang is directly related to this series, but it
seems plausible, and I hadn't hit a GPU hang in months if not over a year
before. If this series results in potentially hitting a GPU hang every few
days, it definitely doesn't provide enough benefit to justify that.

Nov 14 17:21:22 thor kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=1166051, emitted seq=1166052
Nov 14 17:21:22 thor kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2828 thread gnome-shel:cs0 pid 2860
Nov 14 17:21:22 thor kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
Nov 14 17:21:22 thor kernel: amdgpu 0000:05:00.0: amdgpu: free PSP TMR buffer
Nov 14 17:21:22 thor kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
Nov 14 17:21:22 thor kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
Nov 14 17:21:22 thor kernel: [drm] PCIE GART of 1024M enabled.
Nov 14 17:21:22 thor kernel: [drm] PTB located at 0x000000F400A00000
Nov 14 17:21:22 thor kernel: [drm] VRAM is lost due to GPU reset!
Nov 14 17:21:22 thor kernel: [drm] PSP is resuming...
Nov 14 17:21:22 thor kernel: [drm] reserve 0x400000 from 0xf431c00000 for PSP TMR
Nov 14 17:21:23 thor kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Nov 14 17:21:23 thor kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Nov 14 17:21:23 thor gnome-shell[3639]: amdgpu: The CS has been rejected (-125), but the context isn't robust.
Nov 14 17:21:23 thor gnome-shell[3639]: amdgpu: The process will be terminated.
Nov 14 17:21:23 thor kernel: [drm] kiq ring mec 2 pipe 1 q 0
Nov 14 17:21:23 thor kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Nov 14 17:21:23 thor kernel: [drm:amdgpu_gfx_enable_kcq.cold [amdgpu]] *ERROR* KCQ enable failed
Nov 14 17:21:23 thor kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
Nov 14 17:21:23 thor kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(2) failed
Nov 14 17:21:23 thor kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110
Nov 14 17:21:23 thor kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
[...]
Nov 14 17:21:33 thor kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=1166052, emitted seq=1166052
Nov 14 17:21:33 thor kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2828 thread gnome-shel:cs0 pid 2860
Nov 14 17:21:33 thor kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer
