Bug#1002978: Another GPU hang
OK, now we know that the i915 driver problem in my environment is NOT fixed by the microcode alone: I get GPU hangs whether or not the microcode is up to date.

This boot had NO kernel command line options (besides the Debian defaults):

GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force"

I don't remember whether I added pcie_aspm=force or Debian did; I've had it forever. I will try removing it next, but I probably need it for my Realtek Ethernet, which uses the 8168 driver. Currently, I have:

$ cat /sys/module/pcie_aspm/parameters/policy
[default] performance powersave powersupersave

Here is the output of /sys/class/drm/card0/error:

GPU HANG: ecode 6:2:bb86, in systemd-logind [662] Kernel: 5.10.0-10-amd64 x86_64 Driver: 20200917 Time: 1642463910 s 885852 us Boottime: 7195 s 975866 us Uptime: 7189 s 236627 us Capture: 4296691264 jiffies; 543240 ms ago Active process (on ring bcs0): systemd-logind [662] Reset count: 0 Suspend count: 0 Platform: SANDYBRIDGE Subplatform: 0x0 PCI ID: 0x0102 PCI Revision: 0x09 PCI Subsystem: 1565:110d IOMMU enabled?: 0 RPM wakelock: yes PM suspended: no GT awake: yes EIR: 0x IER: 0x82bc8585 GTIER[0]: 0x00401001 PGTBL_ER: 0x FORCEWAKE: 0x DERRMR: 0x fence[0] = fde03b007f6001 fence[1] = 122600701127003 fence[2] = fence[3] = fence[4] = fence[5] = fence[6] = fence[7] = fence[8] = fence[9] = fence[10] = fence[11] = fence[12] = fence[13] = fence[14] = fence[15] = ERROR: 0x DONE_REG: 0x bcs0 command stream: CCID: 0x START: 0x5000 HEAD: 0x0fe034d0 [0x3458] TAIL: 0x36f8 [0x34d0, 0x34e0] CTL: 0x3001 MODE: 0x HWS: 0x7fffb000 ACTHD: 0x 7fff3ee8 IPEIR: 0x0008 IPEHR: 0x4477 ESR: 0x INSTDONE: 0xfff1 batch: [0x_7fff3000, 0x_7fff7000] BBADDR: 0x_7fff3ee3 BB_STATE: 0x INSTPS: 0x INSTPM: 0x FADDR: 0x 7fff3f80 RC PSMI: 0x0010 FAULT_REG: 0x GFX_MODE: 0x PP_DIR_BASE: 0x engine reset count: 0 Active context: systemd-logind[662] prio 0, guilty 0 active 0, runtime total 0ns, avg 0ns bcs0 --- HW Status = 0x 7fffb000 :cL%-H!!!Tc0;&FRS,4aK&eK-$Fk6JP^IKP"K!4)Z. 
bcs0 --- batch = 0x 7fff3000 :=m_`(+@[H9*1>F@9kb_ZL,tUu]nJ#WN6g$ROq$#UUsrA'--e$d_R^0)luEdKiJD?%4+2TScP/afS7MKb-!')6*QmkM"Blk<@4s1<;EjkJ#1.f"Jfpej!qVZa!XN-KA[pR`*rg0HQ5T%GpPsj+--CS"%=$jQ6kk;s1f^q.Lo'!^l]kiSj8M4Fpl!m3?ri7jfXZH[J]1gY4)\@FT%>S]p_nQEd7.q"nNW"9#LIO!8f2Y>V`Jhe43XiCr++i\Q2=(BsA+ZA/pcS]Z=Ym?gVZd/E`ZT&_2+3=(L'/[WV<0?.F3NSo2Oj9Cg([C?<5#tprrmU]Zng5G_:V_--7.X#hjfBhpDhh/bZ]+CY_C'XBK6%TZe2aK!L\UF%0cWmV_Fo48LFmI8%_9+8])9De%WAP:P4Etm-\s!sB7#I,r0L5G?uhuUQ1(uT"5Vg6^mt.DBZb?qq+mUQhiIBR$%i:B(*7-N4s<^:N!dt0cQ0s5s+k0m$jtCU4,i)BB41=kiWo?S!jlZkY92ou0D;o2B[Qju>l!LcIsr@W&,T%9=u7&TPZe*=HVitpV_`;n'2/baJ85fK^o\Lm<[=F;%./##ZtX0k1t5G*nI%E`?ri7fg%PrgQW.=8TNbZO07Bbl9n`NK0q=S1U%Y;lIt9ZX9tOb`e^&oNkdoaEl4%N(2--N-[ISbgeH9pu:cDKWggP-@,8@f"=jhqQW,C/LBgIZZp]`dKMO3QiYh(dsMIgI==2lY6)lu$H#3R`]N#9.[!KO-CR/!RC[1CM\'Y1J_6+Y3C$t)7T!";cr0l`'F2P-:3]"o[p[CE\'(:/^WTT/o&m<:#Zq.%`RH[RiaNd@g)P\C0=^al!DC.Tegj)\_2A5Q2>o#$MBn1WqElTkF:]'JXff4gP1@1Rqn/%ab**p1GI$cQBcCm.^G<:MT5?jpBSX-i4pr>C=C`c+;r@Gf%a0Ni$0"(k-nHI#(>;*/+ikG<4+CA?5HoRCs2gHkbUs9YN?IYr:r_1qS^AOT_#[We7Gck_IH?BItbqE?EGdslc-UZ?!b'QhsmeU^0mIp\1d1Pf=lSh.;`C,H'B8O3fC_35KF&uLtQbNS.3Hn;7"[MU?QUqa\BP3fDM^cJ,)RZ?h8iTN.:D`2\Z4Y+Pm#^s'd'&^ZP\m?JjigR=ECTngA4o7D+F;+qia#f9%M#6f$flnfY'qh0BW6OUkbdXlN.,'%ZUshF)d;6Z4/peY]C_rF*AR+s%n:F[h"t(NW&Df5hJE%Zjm,.anGi74SYLL99tc"!j?pWAC#l>.12;=cB!o;^S(9!eslWhor%+O,@u2Mhl#AEp\C./^\d!up5K%&MjC@?n2CK.hNi?'FjCEpV$$BlWg]$97gC]_!VMK700nM4@Bk02k50MGt:;(0Zj=hC@OU-KN/4*M[2a05D<0NV:u0Xc^Wkd>4@EYr\)ti8,hH$KiZPPn")qI/UMkf0Uk.g*XU*saY>akni+,Nq:?^us*f`5O7u4VE=B\->WQCt9WJY1d8Ra!\r\?]BcKI8`&^C"fPpqSLeNgOaPmRP+EBGr,VFE'o>s+U%S1u-'ORXS*J^dgLdI/#'QF/k&B,VTH/o9%;<3E_6^2QpUPMYbKW*l3NZb>]"iH.2*'i"n*Cg[KE=YpbN>t2?`"PWPZuHN%&?Qp;IUjBfiT,aY%aj?gOoqK$8!4[D"jt79*(+gf01QR:E:`!!',7 available engines: 7 slice total: 0, mask= subslice total: 0 EU total: 0 EU per subslice: 0 has slice power gating: no has subslice power gating: no has EU power gating: no Unavailable Num Pipes: 2 Pipe [0]: Power: on SRC: 077f0437 STAT: Plane [0]: CNTR: d8004400 STRIDE: 1e00 
ADDR: SURF: 007f6000 TILEOFF: Cursor [0]: CNTR: 04004027 POS: 0277047e BASE: 00fdf000 Pipe [1]: Power: on SRC: STAT: Plane [1]: CNTR: 4000 STRIDE: ADDR: SURF: TILEOFF: Cursor [1]: CNTR: POS: BASE: CPU transcoder: A Power: on CONF: c000 HTOTAL: 0897077f HBLANK: 0897077f HSYNC: 080307d7 VTOTAL: 04640437 VBLANK: 04640437 VSYNC: 0440043b CPU transcoder: B Power: on CONF: H
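Since the reports in this thread are not in chronological order, it can help to pull the hang code, wall-clock time and uptime out of each error state and sort the dumps by time. The following is only a minimal sketch: the function name summarize_error_state is hypothetical, and the field layout is assumed from the header lines quoted in this thread, not from any i915 documentation.

```python
import re
from datetime import datetime, timezone

# Field layout assumed from the dumps quoted in this thread, e.g.
#   "GPU HANG: ecode 6:2:bb86, in systemd-logind [662]"
#   "Time: 1642463910 s 885852 us"  and  "Uptime: 7189 s 236627 us"
ECODE_RE = re.compile(r"GPU HANG: ecode (\d+:[0-9a-fA-F]*:[0-9a-fA-F]*)")
TIME_RE = re.compile(r"\bTime: (\d+) s (\d+) us")      # capital T, so "Boottime:" is skipped
UPTIME_RE = re.compile(r"\bUptime: (\d+) s (\d+) us")


def summarize_error_state(text):
    """Return hang code, capture time (UTC) and uptime in seconds for one dump."""
    ecode = ECODE_RE.search(text)
    when = TIME_RE.search(text)
    up = UPTIME_RE.search(text)
    if not (ecode and when and up):
        raise ValueError("text does not look like an i915 error state header")
    return {
        "ecode": ecode.group(1),
        "time_utc": datetime.fromtimestamp(int(when.group(1)), tz=timezone.utc),
        "uptime_s": int(up.group(1)),
    }


# Header fragments copied from the two dumps in this thread.
first = ("GPU HANG: ecode 6:2:bb86, in systemd-logind [662] Kernel: 5.10.0-10-amd64 "
         "x86_64 Driver: 20200917 Time: 1642463910 s 885852 us "
         "Boottime: 7195 s 975866 us Uptime: 7189 s 236627 us")
second = ("GPU HANG: ecode 6:0: Kernel: 5.10.0-10-amd64 x86_64 Driver: 20200917 "
          "Time: 1642455365 s 522631 us Boottime: 40 s 7750 us Uptime: 33 s 557210 us")

# Print the dumps oldest first.
for info in sorted(map(summarize_error_state, (first, second)),
                   key=lambda d: d["time_utc"]):
    print(f"{info['time_utc']:%Y-%m-%d %H:%M:%S} UTC  "
          f"ecode={info['ecode']}  uptime={info['uptime_s']}s")
```

Sorting by the Time field rather than by message order makes it easier to match each dump to the boot configuration that produced it.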
Bug#1002978: Another GPU hang
I rebooted with these GRUB config options:

GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=force intel_idle.max_cstate=1"

In addition to an almost immediate GPU hang, my Ethernet card (r8169 driver) failed to load its firmware, causing a full network outage. I rebooted to see if it was a fluke, but no: evidently, with the two boot options removed (i915.disable_power_well=1 i915.enable_dc=0) and the new microcode installed, intel_idle.max_cstate=1 is incompatible with my r8169 Ethernet card. I know that because removing intel_idle.max_cstate=1 in GRUB fixed it.

My strategy is to explore whether the microcode alone is enough to fix the problem. Because intel_idle.max_cstate=1 doesn't cause a tainted-kernel warning, I wrongly thought it was an innocuous option. Hence the learning experience described above.

So far this configuration, without intel_idle.max_cstate=1, is stable. The Ethernet is working and there have been no GPU hangs, though I have experienced a few small hiccups that might be related to GPU performance. These were not too bad and nothing has appeared in the error logs. But I only have 45 minutes of uptime. I'll report back when I have more confidence.
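With several boot configurations in play, it is easy to lose track of which options changed between two boots. As a rough sketch (the helper names parse_cmdline and diff_cmdlines are made up for illustration, and quoted values containing spaces are assumed not to occur), the option strings from this thread can be compared like this:

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {option: value}; bare flags map to None."""
    options = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


def diff_cmdlines(old, new):
    """Report which options were added, removed or changed between two boots."""
    old_opts, new_opts = parse_cmdline(old), parse_cmdline(new)
    common = set(old_opts) & set(new_opts)
    return {
        "added": sorted(set(new_opts) - set(old_opts)),
        "removed": sorted(set(old_opts) - set(new_opts)),
        "changed": sorted(k for k in common if old_opts[k] != new_opts[k]),
    }


# Option strings taken from the reports in this thread.
earlier = ("quiet pcie_aspm=force intel_idle.max_cstate=1 "
           "i915.disable_power_well=1 i915.enable_dc=0")
failing = "quiet pcie_aspm=force intel_idle.max_cstate=1"

print(diff_cmdlines(earlier, failing))
```

Keeping such a diff next to each error state makes it clearer which single change a given hang should be attributed to.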
Here is the output of /sys/class/drm/card0/error after that GPU hang (the one with no network):

GPU HANG: ecode 6:0: Kernel: 5.10.0-10-amd64 x86_64 Driver: 20200917 Time: 1642455365 s 522631 us Boottime: 40 s 7750 us Uptime: 33 s 557210 us Capture: 4294902272 jiffies; 149072 ms ago Reset count: 0 Suspend count: 0 Platform: SANDYBRIDGE Subplatform: 0x0 PCI ID: 0x0102 PCI Revision: 0x09 PCI Subsystem: 1565:110d IOMMU enabled?: 0 RPM wakelock: yes PM suspended: no GT awake: yes EIR: 0x IER: 0x82bc8585 GTIER[0]: 0x00401001 PGTBL_ER: 0x FORCEWAKE: 0x DERRMR: 0x fence[0] = fde03b007f6001 fence[1] = 11f8007010f9003 fence[2] = fence[3] = fence[4] = fence[5] = fence[6] = fence[7] = fence[8] = fence[9] = fence[10] = fence[11] = fence[12] = fence[13] = fence[14] = fence[15] = ERROR: 0x DONE_REG: 0x available engines: 7 slice total: 0, mask= subslice total: 0 EU total: 0 EU per subslice: 0 has slice power gating: no has subslice power gating: no has EU power gating: no Unavailable Num Pipes: 2 Pipe [0]: Power: on SRC: 077f0437 STAT: Plane [0]: CNTR: d8004400 STRIDE: 1e00 ADDR: SURF: 007f6000 TILEOFF: Cursor [0]: CNTR: 04004027 POS: 02db0452 BASE: 00fdf000 Pipe [1]: Power: on SRC: STAT: Plane [1]: CNTR: 4000 STRIDE: ADDR: SURF: TILEOFF: Cursor [1]: CNTR: POS: BASE: CPU transcoder: A Power: on CONF: c000 HTOTAL: 0897077f HBLANK: 0897077f HSYNC: 080307d7 VTOTAL: 04640437 VBLANK: 04640437 VSYNC: 0440043b CPU transcoder: B Power: on CONF: HTOTAL: HBLANK: HSYNC: VTOTAL: VBLANK: VSYNC: gen: 6 gt: 1 iommu: disabled memory-regions: 5 page-sizes: 1000 platform: SANDYBRIDGE ppgtt-size: 31 ppgtt-type: 1 dma_mask_size: 40 is_mobile: no is_lp: no require_force_probe: no is_dgfx: no has_64bit_reloc: no gpu_reset_clobbers_display: no has_reset_engine: no has_fpga_dbg: no has_global_mocs: no has_gt_uc: no has_l3_dpf: no has_llc: yes has_logical_ring_contexts: no has_logical_ring_elsq: no has_logical_ring_preemption: no has_master_unit_irq: no has_pooled_eu: no has_rc6: yes has_rc6p: yes has_rps: yes 
has_runtime_pm: no has_snoop: no has_coherent_ggtt: yes unfenced_needs_alignment: no hws_needs_physical: no cursor_needs_physical: no has_csr: no has_ddi: no has_dp_mst: no has_dsb: no has_dsc: no has_fbc: yes has_gmch: no has_hdcp: no has_hotplug: yes has_hti: no has_ipc: no has_modular_fia: no has_overlay: no has_psr: no has_psr_hw_tracking: no overlay_needs_physical: no supports_tv: no rawclk rate: 125000 kHz CS timestamp frequency: 1250 Hz Has logical contexts? yes scheduler: 0 i915.vbt_firmware=(null) i915.modeset=-1 i915.lvds_channel_mode=0 i915.panel_use_ssc=-1 i915.vbt_sdvo_panel_type=-1 i915.enable_dc=-1 i915.enable_fbc=0 i915.enable_psr=-1 i915.psr_safest_params=no i915.enable_psr2_sel_fetch=no i915.disable_power_well=1 i915.enable_ips=1 i915.invert_brightness=0 i915.enable_guc=0 i915.guc_log_level=-1 i915.guc_firmware_path=(null) i915.huc_firmware_path=(null) i915.dmc_firmware_path=(null) i915.mmio_debug=0 i915.edp_vswing=0 i915.reset=3 i915.inject_probe_failure=0 i915.fastboot=-1 i915.enable_dpcd_backlight=-1 i915.force_probe= i915.fake_lmem_start=0 i915.enable_hangcheck=yes i915.load_detect_test=no i915.force_reset_modeset_test=no i915.error_capture=yes i915.disable_display=no i915.verbose_state_checks=yes i915.nuclear_pageflip=no i915.enable_dp_mst=yes i915.enable_gvt=no Here are the full logs from dmesg: [0.00] microcode: microcode updated early to revision 0x2f,
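The error state also records the i915 module parameters that were in effect at the time of the hang, which gives a way to confirm that a boot really ran with the intended options. The sketch below is only an illustration: i915_params and mismatches are hypothetical helper names, and the "i915.name=value" layout is assumed from the dump quoted above.

```python
import re

# Matches entries such as "i915.enable_dc=-1" or "i915.force_probe=" (empty value).
PARAM_RE = re.compile(r"\bi915\.(\w+)=(\S*)")


def i915_params(error_state_text):
    """Collect the i915.* parameters listed in an error state as strings."""
    return dict(PARAM_RE.findall(error_state_text))


def mismatches(params, expected):
    """Return {name: (expected, actual)} for every parameter that differs."""
    return {
        name: (want, params.get(name))
        for name, want in expected.items()
        if params.get(name) != want
    }


# Fragment copied from the dump above.
dump = ("i915.enable_dc=-1 i915.enable_fbc=0 i915.enable_psr=-1 "
        "i915.disable_power_well=1 i915.force_probe= i915.fake_lmem_start=0")

params = i915_params(dump)
# The workaround options tried elsewhere in this thread.
print(mismatches(params, {"disable_power_well": "1", "enable_dc": "0"}))
```

For the dump above, this reports enable_dc as differing from the workaround value, which matches the statement that i915.enable_dc=0 had been removed from the command line for that boot.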
Bug#1002978: Another GPU hang
I rebooted with the GRUB options intel_idle.max_cstate=1 i915.disable_power_well=1 i915.enable_dc=0, which my research shows sometimes help with these kinds of GPU hangs. For quite some time performance was acceptable, but less than 20 hours later I triggered a GPU hang by switching desktops rapidly in fvwm. I was cycling repeatedly through my keyboard equivalents for fvwm's 'Desk 0 9', 'Scroll +0 +100', 'Scroll +0 -100' and 'Desk 0 11' when the system hung.

Before the hang, the dmesg output reports three lines with 'perf: interrupt took too long'. These were minor hiccups/freezes that failed to trigger a crash/CPU hang error in the logs, but I consider them related to the problem. I was again changing desktops and moving around them with the keyboard shortcuts mentioned above, and the i915 driver could not cope. Back in Jessie this kind of work was routine for me, with no fear of triggering kernel stack dumps. The i915 driver appears to have developed major bugs, at least with my chipset.

I have attached dmesg output capturing the boot messages and the crash/hang. The other file contains the output of 'cat /sys/class/drm/card0/error'.

I am planning to reboot soon to try another test, since daily GPU hangs will slow my productivity unacceptably. I need to keep debugging in the hope that I can find a solution.

-- CJ Fearnley | LinuxForce Inc. 
c...@linuxforce.net | Hosting and Linux Consulting https://www.LinuxForce.net | https://blog.LinuxForce.net [0.00] Linux version 5.10.0-10-amd64 (debian-ker...@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.84-1 (2021-12-08) [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-10-amd64 root=/dev/mapper/precession-root ro quiet pcie_aspm=force intel_idle.max_cstate=1 i915.disable_power_well=1 i915.enable_dc=0 [0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' [0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' [0.00] x86/fpu: Enabled xstate features 0x3, context size is 576 bytes, using 'standard' format. [0.00] BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009d7ff] usable [0.00] BIOS-e820: [mem 0x0009d800-0x0009] reserved [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0x1fff] usable [0.00] BIOS-e820: [mem 0x2000-0x201f] reserved [0.00] BIOS-e820: [mem 0x2020-0x3fff] usable [0.00] BIOS-e820: [mem 0x4000-0x401f] reserved [0.00] BIOS-e820: [mem 0x4020-0xbad92fff] usable [0.00] BIOS-e820: [mem 0xbad93000-0xbadd9fff] ACPI NVS [0.00] BIOS-e820: [mem 0xbadda000-0xbade0fff] ACPI data [0.00] BIOS-e820: [mem 0xbade1000-0xbade1fff] ACPI NVS [0.00] BIOS-e820: [mem 0xbade2000-0xbae04fff] reserved [0.00] BIOS-e820: [mem 0xbae05000-0xbae05fff] usable [0.00] BIOS-e820: [mem 0xbae06000-0xbae17fff] reserved [0.00] BIOS-e820: [mem 0xbae18000-0xbae24fff] ACPI NVS [0.00] BIOS-e820: [mem 0xbae25000-0xbae48fff] reserved [0.00] BIOS-e820: [mem 0xbae49000-0xbae8bfff] ACPI NVS [0.00] BIOS-e820: [mem 0xbae8c000-0xbaff] usable [0.00] BIOS-e820: [mem 0xbb80-0xbf9f] reserved [0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved [0.00] BIOS-e820: [mem 0xff00-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x00023fdf] usable [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.7 present. 
[0.00] DMI: BIOSTAR Group H61MU3/H61MU3, BIOS 4.6.4 04/07/2011 [0.00] tsc: Fast TSC calibration using PIT [0.00] tsc: Detected 2394.344 MHz processor [0.000831] e820: update [mem 0x-0x0fff] usable ==> reserved [0.000834] e820: remove [mem 0x000a-0x000f] usable [0.000843] last_pfn = 0x23fe00 max_arch_pfn = 0x4 [0.000847] MTRR default type: uncachable [0.000848] MTRR fixed ranges enabled: [0.000850] 0-9 write-back [0.000851] A-B uncachable [0.000852] C-C write-protect [0.000853] D-E7FFF uncachable [0.000854] E8000-F write-protect [0.000854] MTRR variable ranges enabled: [0.000856] 0 base 0 mask E write-back [0.000857] 1 base 2 mask FC000 write-back [0.000859] 2 base 0BB80 mask FFF80 uncachable [0.000860] 3 base 0BC0