Apologies, I had meant to report back on this thread, as I found a workaround
(for my setup at least).

I found the solution here:
https://wiki.archlinux.org/title/Intel_graphics#Crash/freeze_on_low_power_Intel_CPUs

After setting i915.enable_dc=0 in my kernel boot parameters, I have no longer
experienced the issue on the LTS 6.1.x kernel on my Intel NUC 8i3BEH running
Arch Linux.
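
A quick way to confirm that the option really reached the kernel is to look
for it in /proc/cmdline. The Python sketch below is only an illustration. It
assumes the option is spelled with the module prefix (i915.enable_dc=0), since
as far as I know a bare enable_dc=0 is not handed to the i915 driver. It also
assumes that i915 exposes the value under /sys/module/i915/parameters/, which
may need root to read. The helper names are made up.

```python
from pathlib import Path


def cmdline_has_param(cmdline: str, name: str, value: str) -> bool:
    """True if `name=value` appears as a whole token on the kernel command line."""
    return f"{name}={value}" in cmdline.split()


def report() -> None:
    """Print what the running kernel was booted with (Linux only)."""
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError as exc:
        print("cannot read /proc/cmdline:", exc)
        return
    print("i915.enable_dc=0 on cmdline:",
          cmdline_has_param(cmdline, "i915.enable_dc", "0"))
    # Assumption: i915 exposes the parameter here; reading it may need root.
    try:
        value = Path("/sys/module/i915/parameters/enable_dc").read_text().strip()
        print("i915 reports enable_dc =", value)
    except OSError as exc:
        print("could not read the module parameter:", exc)


if __name__ == "__main__":
    report()
```

If the first line prints False after a reboot, the boot loader configuration
was probably not regenerated or the parameter was misspelled.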

I also tried upgrading to kernel 6.4.x, where the problem seems to be
resolved as well.

Regards,

James.

On Fri, 18 Aug 2023 20:08:09 +0100 James Hutchinson 
<jahutchinso...@googlemail.com> wrote:
> I am also affected by this, running Arch Linux on my Intel NUC 8i3BEH. I've
> seen these same random MCE broadcast error kernel panics (only capturable
> via netconsole) ever since upgrading from the 5.15.x LTS kernel series to
> the 6.1.x series. The latest I've tried is 6.1.45, and I am currently back
> on the 5.15.x branch for stability.
>
> I update my Arch Linux installation on a rolling weekly basis, so I am
> up to date for all packages, including intel-microcode. As others have
> experienced, the problem seems more prominent (though not exclusively) when
> the machine is idle.
>
> > Maybe lowering "check_interval" or "monarch_timeout" in machinecheck will
> > cause the bug to strike more often, so a git bisect could be possible!? Or
> > raising those values may workaround the problem!?
>
> I had similar thoughts and stumbled upon
>
> /sys/kernel/debug/mce/fake_panic
>
> Writing 1 to this file causes a fake panic: the MCE event is logged to
> dmesg, but the panic and reboot do not occur.
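
For anyone who wants to script the fake_panic switch quoted above, a minimal
Python sketch follows. It assumes debugfs is mounted at /sys/kernel/debug and
that the script runs as root. The helper name set_fake_panic is made up, and
writing 0 to turn the setting off again is an assumption on my part.

```python
from pathlib import Path

# Path quoted in the message above; needs debugfs mounted and root privileges.
FAKE_PANIC = Path("/sys/kernel/debug/mce/fake_panic")


def set_fake_panic(enabled: bool, path: Path = FAKE_PANIC) -> str:
    """Write 1 (or 0) to the fake_panic file and return the value written."""
    value = "1" if enabled else "0"
    path.write_text(value + "\n")
    return value
```

Calling set_fake_panic(True) as root should have the same effect as echoing 1
into the file from a root shell. The setting is not persistent across reboots.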
>
> Interestingly, we then get a couple more messages that possibly suggest
> the core lockup is somehow related to i915, as others suspect:
>
> [77775.848032] mce: CPUs not responding to MCE broadcast (may include false
> positives): 1,3
> [77775.848032] mce: CPUs not responding to MCE broadcast (may include false
> positives): 1,3
> [77775.848035] mce: [Hardware Error]: Fake kernel panic: Timeout: Not all
> CPUs entered broadcast exception handler
> [77775.848039] Disabling lock debugging due to kernel taint
> [77775.885355] mce: [Hardware Error]: Machine check events logged
> [77775.888283] mce: [Hardware Error]: CPU 2: Machine Check Exception: 5
> Bank 4: ba00000011000402
> [77775.892145] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffc071678d>
> {fwtable_read32+0x7d/0x220 [i915]}
> [77775.897167] mce: [Hardware Error]: TSC d44e32bae41d
>
> It might be interesting to see whether the
> RIP !INEXACT! 10:<ffffffffc071678d> {fwtable_read32+0x7d/0x220 [i915]}
> message occurs for others with fake_panic enabled.
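
To compare the "Bank 4: ba00000011000402" value with other reports, the status
word can be split into its architectural fields. The Python sketch below uses
the IA32_MCi_STATUS bit layout as I understand it from the Intel SDM (flags in
the top bits, MCA error code in bits 15:0, model-specific code in bits 31:16).
The function name decode_mci_status is made up, and the sketch does not try to
interpret the codes.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM, machine-check chapter).
FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
         "MISCV": 59, "ADDRV": 58, "PCC": 57}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flag names and error-code fields."""
    return {
        "flags": [name for name, bit in FLAGS.items() if (status >> bit) & 1],
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


decoded = decode_mci_status(0xBA00000011000402)
print(decoded["flags"])                      # ['VAL', 'UC', 'EN', 'MISCV', 'PCC']
print(hex(decoded["mca_error_code"]))        # 0x402
print(hex(decoded["model_specific_code"]))   # 0x1100
```

For the value in the log above, UC and PCC are set, meaning an uncorrected
error with the processor context flagged as corrupt. ADDRV is clear, which
fits with no ADDR field appearing in the log.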
>
> Unfortunately, fake_panic does not appear to be a workaround in my
> experience: the cores reported in the MCE event stay locked up
> afterwards, so any task scheduled onto those cores hangs as well.
> For example, I ran the sensors command, which hung and eventually.....
>
> [77798.629123] watchdog: BUG: soft lockup - CPU#2 stuck for 21s!
> [sensors:1229265]
> [77798.631037] Modules linked in: coretemp drivetemp netconsole
> xt_conntrack ipt_REJECT nf_reject_ipv4 xt_connmark xt_mark iptable_mangle
> xt_comment xt_addrtype iptable_raw wireguard curve25519_x86_64
> libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic
> libchacha ip6_udp_tunnel udp_tunnel rfcomm uinput xt_nat xt_tcpudp
> iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> libcrc32c iptable_filter veth ts2020 snd_sof_pci_intel_cnl
> snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation
> soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof
> snd_sof_utils soundwire_bus snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core
> intel_rapl_msr intel_rapl_common snd_soc_sst_ipc intel_tcc_cooling
