I am also affected by this, running Arch Linux on my Intel NUC8i3BEH. I have
seen these same random MCE broadcast kernel panics (only capturable
via netconsole) ever since upgrading from the 5.15.x LTS kernel series to
the 6.1.x series. The latest I have tried is 6.1.45, and I am currently back
on the 5.15.x branch for stability.

I update my Arch Linux installation weekly, so all packages, including
intel-microcode, are up to date. As others have experienced, the problem
seems to occur more often (though not exclusively) when the machine is
idle.

>> Maybe lowering "check_interval" or "monarch_timeout" in machinecheck will
>> cause the bug to strike more often, so a git bisect could be possible!? Or
>> raising those values may workaround the problem!?

I had similar thoughts and stumbled upon

/sys/kernel/debug/mce/fake_panic

Writing 1 to this file enables a fake panic: the MCE event is still
logged to dmesg, but the panic and reboot do not occur.
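
For anyone who wants to try the same thing, here is a minimal Python sketch
for reading the machinecheck tunables mentioned above and toggling
fake_panic. The helper names are my own (hypothetical), the machinecheck
path assumes the usual /sys/devices/system/machinecheck/machinecheckN
layout, and fake_panic is only visible when debugfs is mounted. Treat it as
a sketch under those assumptions, not a tested tool.

```python
from pathlib import Path

# Assumed default locations; debugfs must be mounted for the second one.
MACHINECHECK_ROOT = Path("/sys/devices/system/machinecheck")
FAKE_PANIC = Path("/sys/kernel/debug/mce/fake_panic")


def read_tunable(name, root=MACHINECHECK_ROOT):
    """Return {per-CPU directory name: integer value} for one tunable,
    e.g. 'check_interval' or 'monarch_timeout'."""
    values = {}
    for cpu_dir in sorted(root.glob("machinecheck[0-9]*")):
        attr = cpu_dir / name
        if attr.is_file():
            values[cpu_dir.name] = int(attr.read_text().strip())
    return values


def set_fake_panic(enabled, path=FAKE_PANIC):
    """Write 1 or 0 to fake_panic and return the value read back."""
    if not path.exists():
        raise FileNotFoundError(f"{path} not found (is debugfs mounted?)")
    path.write_text("1\n" if enabled else "0\n")
    return int(path.read_text().strip())
```

On a real system both calls need root, for example
read_tunable("monarch_timeout") followed by set_fake_panic(True).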

Interestingly, we then get a couple more messages which suggest that
the core lockup may somehow be related to i915, as others suspect:

[77775.848032] mce: CPUs not responding to MCE broadcast (may include false positives): 1,3
[77775.848032] mce: CPUs not responding to MCE broadcast (may include false positives): 1,3
[77775.848035] mce: [Hardware Error]: Fake kernel panic: Timeout: Not all CPUs entered broadcast exception handler
[77775.848039] Disabling lock debugging due to kernel taint
[77775.885355] mce: [Hardware Error]: Machine check events logged
[77775.888283] mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 4: ba00000011000402
[77775.892145] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffc071678d> {fwtable_read32+0x7d/0x220 [i915]}
[77775.897167] mce: [Hardware Error]: TSC d44e32bae41d

It might be interesting to see whether the

RIP !INEXACT! 10:<ffffffffc071678d> {fwtable_read32+0x7d/0x220 [i915]}

message also occurs for others with fake_panic enabled.
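
To make that check easier, here is a rough Python sketch that scans a saved
dmesg or netconsole capture for MCE RIP lines and reports the symbol and
module. The function name and regular expression are my own and are based
only on the log format shown above, so other kernels may print the line
differently. It first rejoins lines that a mail client or terminal has
wrapped.

```python
import re

# A dmesg timestamp such as "[77775.892145]" marks the start of a real log line.
_TIMESTAMP = r"\[\s*\d+\.\d+\]"

# Matches e.g.
#   mce: [Hardware Error]: RIP !INEXACT! 10:<addr> {symbol+0x7d/0x220 [module]}
_RIP_RE = re.compile(
    r"mce: \[Hardware Error\]: RIP(?P<inexact> !INEXACT!)? "
    r"[0-9a-f]+:<(?P<address>[0-9a-f]+)>"
    r"(?: \{(?P<symbol>[^\s+}]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>[^\]]+)\])?\})?"
)


def find_mce_rips(log_text):
    """Return one dict per MCE RIP line found in the log text."""
    # Join continuation lines: any newline not followed by a timestamp.
    text = re.sub(rf"\n(?!{_TIMESTAMP})", " ", log_text)
    return [
        {
            "inexact": m.group("inexact") is not None,
            "address": m.group("address"),
            "symbol": m.group("symbol"),   # None if no symbol was printed
            "module": m.group("module"),   # None for built-in code
        }
        for m in _RIP_RE.finditer(text)
    ]
```

Feeding it the output of dmesg and checking whether any entry has module
"i915" would show whether the same RIP turns up on other machines.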

Unfortunately, fake_panic does not appear to be a workaround in my
experience: the cores reported in the MCE event remain locked up
afterwards, so any task scheduled onto those cores hangs as well. For
example, I ran the sensors command, which hung and eventually produced:

[77798.629123] watchdog: BUG: soft lockup - CPU#2 stuck for 21s!
[sensors:1229265]
[77798.631037] Modules linked in: coretemp drivetemp netconsole
xt_conntrack ipt_REJECT nf_reject_ipv4 xt_connmark xt_mark iptable_mangle
xt_comment xt_addrtype iptable_raw wireguard curve25519_x86_64
libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic
libchacha ip6_udp_tunnel udp_tunnel rfcomm uinput xt_nat xt_tcpudp
iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
libcrc32c iptable_filter veth ts2020 snd_sof_pci_intel_cnl
snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation
soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof
snd_sof_utils soundwire_bus snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core
intel_rapl_msr intel_rapl_common snd_soc_sst_ipc intel_tcc_cooling
snd_soc_sst_dsp x86_pkg_temp_thermal intel_powerclamp
snd_soc_acpi_intel_match snd_soc_acpi kvm_intel snd_soc_core
snd_hda_codec_hdmi snd_compress kvm si2157 ac97_bus snd_hda_codec_realtek
snd_pcm_dmaengine snd_hda_codec_generic ledtrig_audio
[77798.631090]  irqbypass si2168 crct10dif_pclmul snd_hda_intel
crc32_pclmul polyval_clmulni snd_intel_dspcfg polyval_generic gf128mul
ghash_clmulni_intel snd_intel_sdw_acpi snd_hda_codec sha512_ssse3
dvb_usb_dvbsky dvb_usb_v2 btusb m88ds3103 snd_hda_core dvb_core btrtl btbcm
iTCO_wdt videobuf2_vmalloc snd_hwdep videobuf2_memops videobuf2_common
aesni_intel btintel crypto_simd snd_pcm intel_pmc_bxt btmtk cryptd
snd_timer rapl intel_cstate mei_pxp mei_hdcp iTCO_vendor_support ee1004 snd
videodev intel_uncore bluetooth mei_me e1000e intel_wmi_thunderbolt
i2c_i801 wmi_bmof soundcore pcspkr i2c_smbus mei mc i2c_mux ecdh_generic
intel_pch_thermal ir_rc6_decoder rc_rc6_mce ite_cir acpi_pad acpi_tad
mac_hid cfg80211 rfkill crypto_user loop fuse dm_mod bpf_preload ip_tables
x_tables ext4 crc32c_generic crc16 mbcache jbd2 mmc_block i915 drm_buddy
intel_gtt nvme rtsx_pci_sdmmc drm_display_helper mmc_core nvme_core
crc32c_intel cec xhci_pci rtsx_pci nvme_common video xhci_pci_renesas ttm
wmi
[77798.641974]  [last unloaded: i2c_dev]
[77798.656901] CPU: 2 PID: 1229265 Comm: sensors Tainted: G   M
   6.1.39-1-lts-custom-015e51c #1 0c9d39d05dfd27e4ed0b0da78692e6ddc0d0b631
[77798.659471] Hardware name: Intel(R) Client Systems NUC8i3BEH/NUC8BEB,
BIOS BECFL357.86A.0089.2021.0621.1343 06/21/2021
[77798.662012] RIP: 0010:smp_call_function_single+0xfe/0x140
[77798.664509] Code: 25 28 00 00 00 75 51 c9 c3 cc cc cc cc 48 89 e6 48 89
54 24 18 4c 89 44 24 10 e8 4d fe ff ff 8b 54 24 08 83 e2 01 74 0b f3 90
<8b> 54 24 08 83 e2 01 75 f5 eb b9 8b 05 89 b4 5d 02 85 c0 0f 85 65
[77798.667074] RSP: 0018:ffffad160582fcc0 EFLAGS: 00000202
[77798.669635] RAX: 0000000000000000 RBX: ffffad160582fd6c RCX:
ffff8be27b8dc238
[77798.672205] RDX: 0000000000000001 RSI: ffffad160582fcc0 RDI:
ffffad160582fcc0
[77798.674773] RBP: ffffad160582fd18 R08: ffffffff855f4fb0 R09:
ffff8be3366090c0
[77798.677349] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff8be2457889b0
[77798.679928] R13: ffffad160582fe30 R14: 0000000000000001 R15:
ffffad160582fec8
[77798.682464] FS:  00007f19d4d3e740(0000) GS:ffff8be5add00000(0000)
knlGS:0000000000000000
[77798.684997] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[77798.687489] CR2: 00007f7ef27cfc80 CR3: 000000032f4c2001 CR4:
00000000003706e0
[77798.689968] Call Trace:
[77798.692428]  <IRQ>
[77798.694886]  ? watchdog_timer_fn+0x1a8/0x200
[77798.697382]  ? lockup_detector_update_enable+0x50/0x50
[77798.699858]  ? __hrtimer_run_queues+0x10f/0x2b0
[77798.702340]  ? hrtimer_interrupt+0xf8/0x210
[77798.704812]  ? __sysvec_apic_timer_interrupt+0x5e/0x110
[77798.707300]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[77798.709803]  </IRQ>
[77798.712312]  <TASK>
[77798.714787]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[77798.717276]  ? pldmfw_flash_image+0xce0/0xce0
[77798.719755]  ? smp_call_function_single+0xfe/0x140
[77798.722241]  ? pldmfw_flash_image+0xce0/0xce0
[77798.724738]  rdmsr_on_cpu+0x5f/0x90
[77798.727216]  show_temp+0xc1/0xf0 [coretemp
2a9b54610668d110c724a01af9913aabfe08a40c]
[77798.729759]  dev_attr_show+0x19/0x40
[77798.732245]  sysfs_kf_seq_show+0xa8/0xf0
[77798.734672]  seq_read_iter+0x120/0x460
[77798.737042]  vfs_read+0x23d/0x310
[77798.739356]  ksys_read+0x6f/0xf0
[77798.741630]  do_syscall_64+0x5d/0x90
[77798.743906]  ? do_syscall_64+0x6c/0x90
[77798.746182]  ? do_syscall_64+0x6c/0x90
[77798.748428]  ? syscall_exit_to_user_mode+0x1b/0x40
[77798.750681]  ? do_syscall_64+0x6c/0x90
[77798.752923]  ? do_syscall_64+0x6c/0x90
[77798.755110]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[77798.757292] RIP: 0033:0x7f19d4f27b21
[77798.759440] Code: c5 fe ff ff 50 48 8d 3d 45 7d 0a 00 e8 e8 11 02 00 0f
1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d dd 99 0e 00 00 74 13 31 c0 0f 05
<48> 3d 00 f0 ff ff 77 57 c3 66 0f 1f 44 00 00 48 83 ec 28 48 89 54
[77798.761685] RSP: 002b:00007ffc3a71ea08 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[77798.763902] RAX: ffffffffffffffda RBX: 000055c53cb34340 RCX:
00007f19d4f27b21
[77798.766135] RDX: 0000000000001000 RSI: 000055c53cb4af30 RDI:
0000000000000003
[77798.768378] RBP: 00007f19d50065a0 R08: 0000000000000000 R09:
0000000000000001
[77798.770607] R10: 0000000000000003 R11: 0000000000000246 R12:
000055c53cb34340
[77798.772852] R13: 0000000000000a68 R14: 00007f19d5005ca0 R15:
0000000000000a68
[77798.775079]  </TASK>
[77798.777610] systemd-journald[236]: Compressed data object 996 -> 502
using ZSTD
[77798.780534] systemd-journald[236]: Compressed data object 988 -> 559
using ZSTD

So the machine still requires a reboot to return to normal operation.
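
In case it helps others compare their logs, here is a rough Python sketch
that extracts the CPUs listed as not responding to the MCE broadcast and
the CPUs that later report soft lockups. The function name and patterns are
my own, derived only from the messages quoted above, so this is an
illustration and not a complete parser.

```python
import re

# A dmesg timestamp such as "[77798.629123]" marks the start of a real log line.
_TIMESTAMP = r"\[\s*\d+\.\d+\]"

_UNRESPONSIVE_RE = re.compile(
    r"CPUs not responding to MCE broadcast "
    r"\(may include false positives\): ([0-9,]+)"
)
_LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)


def summarize_lockups(log_text):
    """Collect unresponsive CPUs and soft-lockup reports from log text."""
    # Join wrapped continuation lines: any newline not followed by a timestamp.
    text = re.sub(rf"\n(?!{_TIMESTAMP})", " ", log_text)

    unresponsive = set()
    for m in _UNRESPONSIVE_RE.finditer(text):
        unresponsive.update(int(c) for c in m.group(1).split(",") if c)

    lockups = [
        {
            "cpu": int(m.group(1)),
            "seconds": int(m.group(2)),
            "comm": m.group(3),
            "pid": int(m.group(4)),
        }
        for m in _LOCKUP_RE.finditer(text)
    ]
    return {"unresponsive_cpus": sorted(unresponsive), "soft_lockups": lockups}
```

Passing in a saved netconsole capture returns both lists side by side, which
makes it easier to see which cores and tasks are involved.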

Regards
James
