Hi there,

Let's go through the different topics:
- Setup: It is an AMD 5600G on an ASRock B550M-ITX/ac, powered by a
BeQuiet SP7 300W.

- Power: According to the specifications, the PSU should be sufficient.
Since it takes 5-20 minutes for the error to occur, I take that as an
indication that the power supply is OK; otherwise I would expect it to
fail right away. Is there a way to measure or test whether there is an
issue with it? I also tried limiting PPT to 45W, which makes no
difference either.

- Memory: Yes, it was tested right after the build, with no errors.

- Thermal: I am watching the temperatures during the stress test. If I
am reading Smbusmaster0 correctly, temperatures have not been above
71°C, and the error also occurs earlier, well below 70°C.

- OS: I was running Debian stable in quite a minimal configuration
(fresh install, as most services are dockerized) when I first observed
the error. I have now moved to Debian 12/Bookworm to see whether the
newer kernel makes any difference (it does not). I also swapped the
r8169 driver for r8168. That changes the error messages, but the
system instability remains.
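
For the thermal point, one way to get a continuous record instead of
watching a monitor is to sample the kernel's hwmon sysfs files during
the stress run. The following is only a minimal sketch, assuming the
sensors are exposed under /sys/class/hwmon (the k10temp and nct6775
modules appear in the module list in the log). The function names
read_hwmon_temps and log_temps are made up for illustration, and the
sensor labels depend on the driver.

```python
import time
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees Celsius} for every temp*_input file."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name if the chip has no "name" file.
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            label_file = sensor.with_name(sensor.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else sensor.name
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors are not readable; skip them
            temps[f"{chip_name}/{label}"] = millideg / 1000.0
    return temps


def log_temps(samples=3, interval_s=5.0, root="/sys/class/hwmon"):
    """Print one timestamped line of readings per sample (hypothetical helper)."""
    for _ in range(samples):
        readings = read_hwmon_temps(root)
        line = " ".join(f"{k}={v:.1f}C" for k, v in sorted(readings.items()))
        # flush so the last line reaches the terminal or file before a hang
        print(time.strftime("%H:%M:%S"), line, flush=True)
        time.sleep(interval_s)
```

Calling log_temps(samples=600, interval_s=2.0) in a second terminal
while the stress tool runs, with output redirected to a file on the
system disk, should leave the last readings before the hang on disk,
provided the file system gets a chance to write them out.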

I could disconnect the disks and see whether it makes any difference.
However, when I reproduced this error, all disks other than the system
disk were unmounted. So I guess this would be a test to see whether it
is a power issue?

-------- Original Message --------
From: David Christensen <dpchr...@holgerdanske.com>
To: debian-user@lists.debian.org
Subject: Re: Weird behaviour on System under high load
Date: Sat, 20 May 2023 18:00:48 -0700

On 5/20/23 14:46, Christian wrote:
> Hi there,
> 
> I am having trouble with a newly built system. It runs normally and
> stably until I put extreme stress on it, e.g. using all 12 logical
> cores with the stress tool.
> 
> The system will suddenly lose its network connection and become
> unresponsive. Only a reset works. I am not sure what is going on, but
> it is reproducible: put stress on the system and it fails. It seems
> that something is getting out of step.
> 
> The output below is what I found in the logs. I have tried quite a
> bit, and even upgraded to bookworm to see if the newer kernel works.
> 
> If anyone knows how to analyze this issue, it would be very helpful.
> 
> Kind regards
>    Christian
> 
> 
> 2023-05-20T20:12:17.054224+02:00 diskstation kernel: [ 1303.236428] ------------[ cut here ]------------
> 2023-05-20T20:12:17.054234+02:00 diskstation kernel: [ 1303.236430] NETDEV WATCHDOG: enp3s0 (r8169): transmit queue 0 timed out
> 2023-05-20T20:12:17.054235+02:00 diskstation kernel: [ 1303.236437] WARNING: CPU: 5 PID: 2411 at net/sched/sch_generic.c:525 dev_watchdog+0x207/0x210
> 2023-05-20T20:12:17.054236+02:00 diskstation kernel: [ 1303.236442]
> Modules linked in: eq3_char_loop(OE) rpi_rf_mod_led(OE) ledtrig_timer
> ledtrig_default_on xt_MASQUERADE nf_conntrack_netlink xfrm_user
> xfrm_algo xt_addrtype br_netfilter bridge stp llc overlay ip6t_rt
> nft_chain_nat nf_nat xt_set xt_tcpmss xt_tcpudp xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
> ip_set_hash_ip ip_set binfmt_misc nfnetlink nls_ascii nls_cp437 vfat
> fat amdgpu iwlmvm btusb intel_rapl_msr btrtl intel_rapl_common btbcm
> btintel edac_mce_amd btmtk mac80211 snd_hda_codec_realtek bluetooth
> snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi gpu_sched
> kvm_amd drm_buddy libarc4 snd_hda_intel drm_display_helper
> snd_intel_dspcfg snd_intel_sdw_acpi iwlwifi kvm cec snd_hda_codec
> jitterentropy_rng irqbypass rc_core snd_hda_core cfg80211 snd_hwdep
> drm_ttm_helper snd_pcm ttm drbg wmi_bmof rapl ccp snd_timer
> ansi_cprng
> drm_kms_helper sp5100_tco snd pcspkr ecdh_generic rng_core
> i2c_algo_bit
> watchdog soundcore k10temp rfkill hb_rf_usb_2(OE) ecc
> 2023-05-20T20:12:17.054240+02:00 diskstation kernel: [ 1303.236494]
> generic_raw_uart(OE) acpi_cpufreq button joydev evdev sg nct6775
> nct6775_core drm hwmon_vid fuse loop efi_pstore configfs efivarfs
> ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs
> blake2b_generic xor raid6_pq zstd_compress libcrc32c crc32c_generic
> dm_crypt dm_mod hid_generic usbhid hid sd_mod crc32_pclmul
> crc32c_intel
> ahci ghash_clmulni_intel sha512_ssse3 libahci xhci_pci sha512_generic
> xhci_hcd r8169 nvme realtek libata aesni_intel nvme_core t10_pi
> crypto_simd mdio_devres usbcore scsi_mod crc64_rocksoft_generic
> cryptd
> libphy crc64_rocksoft crc_t10dif i2c_piix4 crct10dif_generic
> crct10dif_pclmul crc64 crct10dif_common usb_common scsi_common video
> wmi gpio_amdpt gpio_generic
> 2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236534]
> CPU: 5 PID: 2411 Comm: stress Tainted: G           OE      6.1.0-9-
> amd64 #1  Debian 6.1.27-1
> 2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236536]
> Hardware name: To Be Filled By O.E.M. B550M-ITX/ac/B550M-ITX/ac, BIOS
> L2.62 01/31/2023
> 2023-05-20T20:12:17.054242+02:00 diskstation kernel: [ 1303.236537]
> RIP: 0010:dev_watchdog+0x207/0x210
> 2023-05-20T20:12:17.054242+02:00 diskstation kernel: [ 1303.236540]
> Code: 00 e9 40 ff ff ff 48 89 df c6 05 ff 5f 3d 01 01 e8 be 79 f9 ff
> 44
> 89 e9 48 89 de 48 c7 c7 c8 16 9b a8 48 89 c2 e8 09 d2 86 ff <0f> 0b
> e9
> 22 ff ff ff 66 90 0f 1f 44 00 00 55 53 48 89 fb 48 8b 6f
> 2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236541]
> RSP: 0000:ffffa831c345fdc8 EFLAGS: 00010286
> 2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236543]
> RAX: 0000000000000000 RBX: ffff91a3c1410000 RCX: 0000000000000000
> 2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236544]
> RDX: 0000000000000103 RSI: ffffffffa893fa66 RDI: 00000000ffffffff
> 2023-05-20T20:12:17.054244+02:00 diskstation kernel: [ 1303.236545]
> RBP: ffff91a3c1410488 R08: 0000000000000000 R09: ffffa831c345fc38
> 2023-05-20T20:12:17.054244+02:00 diskstation kernel: [ 1303.236546]
> R10: 0000000000000003 R11: ffff91aafe27afe8 R12: ffff91a3c14103dc
> 2023-05-20T20:12:17.054245+02:00 diskstation kernel: [ 1303.236547]
> R13: 0000000000000000 R14: ffffffffa7e2e7a0 R15: ffff91a3c1410488
> 2023-05-20T20:12:17.054245+02:00 diskstation kernel: [ 1303.236548]
> FS:
> 00007f169849d740(0000) GS:ffff91aade340000(0000)
> knlGS:0000000000000000
> 2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236550]
> CS:
> 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236551]
> CR2: 000055d05c3f4000 CR3: 0000000103cf2000 CR4: 0000000000750ee0
> 2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236552]
> PKRU: 55555554
> 2023-05-20T20:12:17.054247+02:00 diskstation kernel: [ 1303.236553]
> Call Trace:
> 2023-05-20T20:12:17.054247+02:00 diskstation kernel: [ 1303.236554]
> <TASK>
> 2023-05-20T20:12:17.054248+02:00 diskstation kernel: [ 1303.236557] 
> ?
> pfifo_fast_reset+0x140/0x140
> 2023-05-20T20:12:17.054248+02:00 diskstation kernel: [ 1303.236559]
> call_timer_fn+0x27/0x130
> 2023-05-20T20:12:17.054248+02:00 diskstation kernel: [ 1303.236562]
> __run_timers+0x21c/0x2a0
> 2023-05-20T20:12:17.054249+02:00 diskstation kernel: [ 1303.236565]
> run_timer_softirq+0x2b/0x50
> 2023-05-20T20:12:17.054249+02:00 diskstation kernel: [ 1303.236567]
> __do_softirq+0xf0/0x2fe
> 2023-05-20T20:12:17.054249+02:00 diskstation kernel: [ 1303.236570]
> __irq_exit_rcu+0xc7/0x130
> 2023-05-20T20:12:17.054250+02:00 diskstation kernel: [ 1303.236573]
> sysvec_apic_timer_interrupt+0x52/0xc0
> 2023-05-20T20:12:17.054250+02:00 diskstation kernel: [ 1303.236576]
> asm_sysvec_apic_timer_interrupt+0x16/0x20
> 2023-05-20T20:12:17.054251+02:00 diskstation kernel: [ 1303.236578]
> RIP: 0033:0x7f16984e085c
> 2023-05-20T20:12:17.054251+02:00 diskstation kernel: [ 1303.236579]
> Code: 48 89 44 24 08 31 c0 f0 0f b1 15 fb 3e 19 00 75 3d 48 8d 74 24
> 04
> 48 8d 3d f1 1f 19 00 e8 1c 04 00 00 31 c0 87 05 e0 3e 19 00 <83> f8
> 01
> 7f 2f 48 63 44 24 04 48 8b 54 24 08 64 48 2b 14 25 28 00
> 2023-05-20T20:12:17.054252+02:00 diskstation kernel: [ 1303.236581]
> RSP: 002b:00007fffb2c4cca0 EFLAGS: 00000246
> 2023-05-20T20:12:17.054252+02:00 diskstation kernel: [ 1303.236582]
> RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00007f169867221c
> 2023-05-20T20:12:17.054253+02:00 diskstation kernel: [ 1303.236583]
> RDX: 00007f1698672228 RSI: 00007fffb2c4cca4 RDI: 00007f1698672840
> 2023-05-20T20:12:17.054253+02:00 diskstation kernel: [ 1303.236584]
> RBP: 00000000000080e8 R08: 00007f1698672228 R09: 00007f1698672260
> 2023-05-20T20:12:17.054253+02:00 diskstation kernel: [ 1303.236585]
> R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> 2023-05-20T20:12:17.054254+02:00 diskstation kernel: [ 1303.236586]
> R13: 0000565167761004 R14: 0000565167761a78 R15: 000000000000000b
> 2023-05-20T20:12:17.054254+02:00 diskstation kernel: [ 1303.236588]
> </TASK>
> 2023-05-20T20:12:17.054255+02:00 diskstation kernel: [ 1303.236589] ---[ end trace 0000000000000000 ]---
> 2023-05-20T20:12:17.086199+02:00 diskstation kernel: [ 1303.270878] r8169 0000:03:00.0 enp3s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
> 
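
To compare runs (for example r8169 versus r8168, or with and without
the PPT limit), it can help to pull just the watchdog events out of
the log and note the uptime at which each one fired. Below is a rough
sketch, not a definitive tool: it assumes the message format shown in
the log above, tolerates the "> " mail quoting and line wrapping, and
the function name find_watchdog_events is made up for illustration.

```python
import re

# Matches e.g. "[ 1303.236430] NETDEV WATCHDOG: enp3s0 (r8169): transmit queue 0 timed out"
EVENT_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_events(text):
    """Return one dict per NETDEV WATCHDOG timeout found in the log text."""
    # Drop mail quote markers and rejoin wrapped lines into a single string.
    unquoted = " ".join(
        re.sub(r"^(>\s?)+", "", line).strip() for line in text.splitlines()
    )
    return [
        {
            "uptime_s": float(m["uptime"]),  # seconds since boot
            "iface": m["iface"],
            "driver": m["driver"],
            "queue": int(m["queue"]),
        }
        for m in EVENT_RE.finditer(unquoted)
    ]
```

Feeding it the saved kernel log from each test run gives a short list
of (uptime, interface, driver, queue) entries that is easier to compare
than the full traces.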


Have you verified that your PSU has sufficient capacity for the load on
each and every rail?


Have you cleaned the system interior, filters, fans, heatsinks, ducts, 
etc., recently?


Have you tested the thermal solution(s) recently?


Have you tested the power supply recently?


Have you tested the memory recently?


Are you running Debian stable?


Are you running Debian stable packages only?  Were they all installed 
with the same package manager?


If all of the above are okay and the system is still locking up, I
would disable or remove all disks in the system, install a zeroed SSD,
install Debian stable choosing only "SSH server" and "standard system
utilities", install only the stable packages required for your
workload, put the workload on it, and see what happens.


David

