Re: Weird behaviour on System under high load
On 5/28/23 03:09, Christian wrote:
> > Original Message
> > From: David Christensen
> > To: debian-user@lists.debian.org
> > Subject: Re: Weird behaviour on System under high load
> > Date: Sat, 27 May 2023 16:30:05 -0700
> >
> > On 5/27/23 15:28, Christian wrote:
> > > New day, new tests. Got a crash again, however with the message
> > > "AHCI controller unavailable".
> > > Figured that is the SATA drives not being plugged in the right
> > > order.
> > > Corrected that and a 3:30h stress test went so far without any
> > > issues besides this old bug
> > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=947685
> > >
> > > Seems that I am just jumping from one error to the next...
> >
> > 3 hours and 30 minutes? Yikes! Please stop before you fry your
> > computer. 10 seconds should be enough to see a problem; 1 minute is
> > more than enough.
>
> Sadly not always. My crashes before would occur between a few minutes
> and 1 hour load. Now I hope everything is stable. Crashes are gone,
> only the network error seems to be unresolved (even though there is
> some workaround).

Repeatable crashes from a reported issue indicate your hardware is okay.

> With the undervolting / overclocking on 12 core stress test, the
> system stays below 65°C (on Smbusmaster0) so should be no risk of
> damage.

It is your computer and your decision.

At this point, I would start adding the software stack, one piece at a time, testing between each piece. The challenge is devising or finding tests. Spot testing by hand can reveal bugs, but that gets tiresome. The best approach is an automated/scripted test suite. If you are using Debian packages, you might want to look for test suites in the corresponding source packages. And/or, you can use building from source as a stress test. Compiling the Linux kernel should provide your processor, memory, and storage with a good workout.

> Thanks for the help!

YW. :-)

David
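The "scripted test, one piece at a time" idea above could be sketched roughly as follows. This is only an illustration in Python, not the Perl script mentioned elsewhere in the thread; the function names `run_workload` and `all_passed` are made up here, and the demo workload is a placeholder for a real build such as a kernel compile.

```python
import subprocess
import sys
import time


def run_workload(cmd, repeats=3, timeout_s=3600):
    """Run cmd `repeats` times and return a list of (exit_code, seconds).

    exit_code is None when a run exceeds timeout_s.
    """
    results = []
    for _ in range(repeats):
        start = time.monotonic()
        try:
            code = subprocess.run(cmd, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            code = None  # the run hung or took longer than allowed
        results.append((code, time.monotonic() - start))
    return results


def all_passed(results):
    """True only if every run exited with status 0."""
    return all(code == 0 for code, _ in results)


if __name__ == "__main__":
    # Placeholder workload: a short CPU-bound Python one-liner.
    # For a real test, substitute something like ["make", "-j12"] run inside
    # a prepared source tree (hypothetical; adjust to your own setup).
    demo = [sys.executable, "-c", "sum(i * i for i in range(10**6))"]
    runs = run_workload(demo, repeats=2, timeout_s=60)
    for code, seconds in runs:
        print(f"exit={code} time={seconds:.2f}s")
    print("all passed:", all_passed(runs))
```

Running the same command after each added piece of the software stack gives a simple pass/fail record to compare between steps.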
Re: Weird behaviour on System under high load
> Original Message
> From: David Christensen
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sat, 27 May 2023 16:30:05 -0700
>
> On 5/27/23 15:28, Christian wrote:
> > New day, new tests. Got a crash again, however with the message
> > "AHCI controller unavailable".
> > Figured that is the SATA drives not being plugged in the right
> > order.
> > Corrected that and a 3:30h stress test went so far without any
> > issues besides this old bug
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=947685
> >
> > Seems that I am just jumping from one error to the next...
>
> 3 hours and 30 minutes? Yikes! Please stop before you fry your
> computer. 10 seconds should be enough to see a problem; 1 minute is
> more than enough.

Sadly not always. My crashes before would occur between a few minutes and 1 hour load.
Now I hope everything is stable. Crashes are gone, only the network error seems to be unresolved (even though there is some workaround).

With the undervolting / overclocking on 12 core stress test, the system stays below 65°C (on Smbusmaster0) so should be no risk of damage.

Thanks for the help!

> David
Re: Weird behaviour on System under high load
On 5/27/23 15:28, Christian wrote:
> New day, new tests. Got a crash again, however with the message
> "AHCI controller unavailable".
> Figured that is the SATA drives not being plugged in the right order.
> Corrected that and a 3:30h stress test went so far without any issues
> besides this old bug
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=947685
>
> Seems that I am just jumping from one error to the next...

3 hours and 30 minutes? Yikes! Please stop before you fry your computer. 10 seconds should be enough to see a problem; 1 minute is more than enough.

David
Re: Weird behaviour on System under high load
> Original Message
> From: David Christensen
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Fri, 26 May 2023 18:22:17 -0700
>
> On 5/26/23 16:08, Christian wrote:
> > Good and bad things:
> > I started to test different setups (always with full 12 core stress
> > test). Boot from USB liveCD (only stress and s-tui installed):
> >
> > - All disks disconnected, other than M2. Standard BIOS
> > - All disks disconnected, other than M2. Proper Memory profile for
> >   timing
> > - All disks disconnected, other than M2. Memory profile, undervolted
> >   and overclocked with limited burst to 4 GHz
> > - All disks connected. Memory profile, undervolted and overclocked
> >   with limited burst to 4 GHz
> >
> > All settings so far are stable. :-/
> > Will see tomorrow any differences in non-free firmware and kernel
> > modules and test again.
> >
> > Very strange...
>
> If everything is stable, including undervoltage and overclocking, I
> would consider that good. I think your hardware is good.
>
> When you say "USB liveCD", is that a USB optical drive with a live CD,
> a USB flash drive with a bootable OS on it, or something else? If it
> is something that can change, I suggest taking an image of the raw
> blocks with dd(1) so that you can easily get back to this point as you
> continue testing.

New day, new tests. Got a crash again, however with the message "AHCI controller unavailable".
Figured that is the SATA drives not being plugged in the right order.
Corrected that and a 3:30h stress test went so far without any issues besides this old bug
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=947685

Seems that I am just jumping from one error to the next...

> AIUI Debian can include microcode patches (depending upon processor).
> If you are using such, I suggest adding that to your test agenda
> first.
>
> Firmware and kernel modules seem like the right next steps.
>
> David
Re: Weird behaviour on System under high load
On 5/26/23 16:08, Christian wrote:
> Good and bad things:
> I started to test different setups (always with full 12 core stress
> test). Boot from USB liveCD (only stress and s-tui installed):
>
> - All disks disconnected, other than M2. Standard BIOS
> - All disks disconnected, other than M2. Proper Memory profile for
>   timing
> - All disks disconnected, other than M2. Memory profile, undervolted
>   and overclocked with limited burst to 4 GHz
> - All disks connected. Memory profile, undervolted and overclocked
>   with limited burst to 4 GHz
>
> All settings so far are stable. :-/
> Will see tomorrow any differences in non-free firmware and kernel
> modules and test again.
>
> Very strange...

If everything is stable, including undervoltage and overclocking, I would consider that good. I think your hardware is good.

When you say "USB liveCD", is that a USB optical drive with a live CD, a USB flash drive with a bootable OS on it, or something else? If it is something that can change, I suggest taking an image of the raw blocks with dd(1) so that you can easily get back to this point as you continue testing.

AIUI Debian can include microcode patches (depending upon processor). If you are using such, I suggest adding that to your test agenda first.

Firmware and kernel modules seem like the right next steps.

David
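For anyone who wants to script the raw-block image suggested above, here is a minimal sketch in Python that does roughly what a plain dd(1) copy would do and also records a SHA-256 checksum, so a later restore can be verified. The function name `image_device` and the example paths are assumptions for illustration; reading a real device node normally needs root, and the device must not be in use.

```python
import hashlib
import sys


def image_device(src_path, dst_path, block_size=4 * 1024 * 1024):
    """Copy src_path to dst_path block by block.

    Returns (bytes_copied, sha256_hex) of the data that was read.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break  # end of device or file
            dst.write(block)
            digest.update(block)
            copied += len(block)
    return copied, digest.hexdigest()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example (hypothetical paths): python3 image.py /dev/sdX liveusb.img
    size, sha = image_device(sys.argv[1], sys.argv[2])
    print(f"copied {size} bytes, sha256 {sha}")
```

Keeping the checksum next to the image makes it easy to confirm that the stick really is back at the known state after writing the image back.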
Re: Weird behaviour on System under high load
> Original Message
> From: David Christensen
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sun, 21 May 2023 15:04:44 -0700
>
> > > > > What stresstest are you using?
> >
> > ... the package and command "s-tui" and "stress"
> > s-tui gives you an overview on power usage, fan control, temps, core
> > frequencies and core utilization on the console
> >
> > stress is just producing load on selected # of cpus, it can be
> > integrated in s-tui.
>
> Thanks -- I like tools and will play with it:
>
> https://packages.debian.org/bullseye/s-tui
>
> > > Okay. Put my Perl script on your liveUSB. Also put some tool for
> > > monitoring CPU temperature, such as sensors(1).
> >
> > Will have time again in a few days and check.
>
> Please let us know what you find.

Good and bad things:
I started to test different setups (always with full 12 core stress test). Boot from USB liveCD (only stress and s-tui installed):

- All disks disconnected, other than M2. Standard BIOS
- All disks disconnected, other than M2. Proper Memory profile for timing
- All disks disconnected, other than M2. Memory profile, undervolted and overclocked with limited burst to 4 GHz
- All disks connected. Memory profile, undervolted and overclocked with limited burst to 4 GHz

All settings so far are stable. :-/
Will see tomorrow any differences in non-free firmware and kernel modules and test again.

Very strange...
Re: Weird behaviour on System under high load
On 5/21/23 14:46, Christian wrote:
> > David Christensen Sun, 21 May 2023 14:22:22 -0700
> > On 5/21/23 06:31, Christian wrote:
> > > David Christensen Sun, 21 May 2023 03:11:43 -0700
> > > > David Christensen Sat, 20 May 2023 18:00:48 -0700

> > Heat sinks, heat pipes, water blocks, radiators, fans, ducts, etc..
>
> It is quite simple
> - Noctua NH-L9a-AM4 for CPU
> - Chassis 12cm fan
> - PSU Integrated fans

I like the Noctua. :-)

> > > > What stresstest are you using?
>
> ... the package and command "s-tui" and "stress"
> s-tui gives you an overview on power usage, fan control, temps, core
> frequencies and core utilization on the console
>
> stress is just producing load on selected # of cpus, it can be
> integrated in s-tui.

Thanks -- I like tools and will play with it:

https://packages.debian.org/bullseye/s-tui

> > Okay. Put my Perl script on your liveUSB. Also put some tool for
> > monitoring CPU temperature, such as sensors(1).
>
> Will have time again in a few days and check.

Please let us know what you find.

David
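As an alternative to watching sensors(1) or s-tui by eye, temperatures can also be logged from a script. The sketch below is a Python illustration that reads the kernel's hwmon sysfs files (`temp*_input`, in millidegrees Celsius); the function names and the 70 °C limit in the demo are assumptions, and which chip and label correspond to the CPU depends on the board and loaded drivers.

```python
import glob
import os


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees_C} for every readable temp*_input file."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)  # fall back to hwmonN
        try:
            with open(path) as f:
                milli_c = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor not readable right now
        sensor = os.path.basename(path)[: -len("_input")]
        temps[f"{chip}/{sensor}"] = milli_c / 1000.0
    return temps


def over_limit(temps, limit_c):
    """Return only the sensors reading above limit_c."""
    return {name: t for name, t in temps.items() if t > limit_c}


if __name__ == "__main__":
    current = read_hwmon_temps()
    for name, value in current.items():
        print(f"{name}: {value:.1f} C")
    print("above 70 C:", over_limit(current, 70.0))  # 70 is an example limit
```

Calling this once a second during a stress run and appending the output to a file gives a temperature record up to the moment of a lock-up.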
Re: Weird behaviour on System under high load
> Original Message
> From: David Christensen
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sun, 21 May 2023 14:22:22 -0700
>
> On 5/21/23 06:31, Christian wrote:
> > David Christensen Sun, 21 May 2023 03:11:43 -0700
> > >>> David Christensen Sat, 20 May 2023 18:00:48 -0700
>
> > > Please use inline posting style and proper indentation.
> >
> > Phew... will be quite hard to read. But here you go.
>
> It is not hard when you delete the portions that you are not
> responding to.
>
> > > > > Have you cleaned the system interior, filters, fans, heatsinks,
> > > > > ducts, etc., recently?
> >
> > As written in OP, the system is new. Only PSU is used. So it is clean
>
> Okay.
>
> > What is a thermal solution?
>
> Heat sinks, heat pipes, water blocks, radiators, fans, ducts, etc..

It is quite simple
- Noctua NH-L9a-AM4 for CPU
- Chassis 12cm fan
- PSU Integrated fans

> > > What stresstest are you using?
> >
> > stress running in s-tui
>
> Do you mean "in situ"?
>
> https://www.merriam-webster.com/dictionary/in%20situ

No, it is the package and command "s-tui" and "stress"
s-tui gives you an overview on power usage, fan control, temps, core frequencies and core utilization on the console

stress is just producing load on selected # of cpus, it can be integrated in s-tui.

> I prefer a tool that I can control. That is why I wrote the
> previously attached Perl script. It is public domain; you and everyone
> are free to use, modify, distribute, etc., as you see fit.
>
> > > > > Have you tested the power supply recently?
> >
> > It was working before without issues, so not explicitly tested.
> >
> > I am not building regularly, so would need to borrow such equipment
> > somewhere
>
> Understand that an ATX PSU has multiple stages that produce +12 VDC,
> +5 VDC, +5 VDC standby, +3.3 VDC, and -12 VDC ("rails"). It is common
> for one or more rails to fail and the others to continue working.
> Computers exhibit "weird behaviour" when this happens.
>
> Just spend the US$20.
>
> > > > > Have you tested the memory recently?
> > > Did you do multi-threaded/ stress tests?
> >
> > Yes, stress is running multiple threads. Only on 2 threads it was
> > stable so far. However it takes longer for the errors to come up
> > when using fewer threads. might be that I did not test long enough.
>
> I use Memtest86+ 5.01 on a bootable USB stick. In the "Configuration"
> menu, I can choose "Core Selection". It appears the default is
> "Parallel (all)". Other choices include "Round Robin" and
> "Sequential". Memtest 5.01 also displays the CPU temperature. Running
> it on an Intel Core i7-2600S with matching factory heat sink and fan
> for 30+ minutes, the current CPU temperature is 50 C. This leads me to
> believe that the memory is loaded to 100%, but the CPU is less
> (perhaps 60%?).
>
> https://memtest.org/
>
> I recommend that you run Memtest86+ in parallel mode for at least one
> pass. I have seen computers go for 20+ hours before encountering a
> memory error.
>
> > > Did you see the problems when running Debian stable OOTB, before
> > > adding anything?
> >
> > I would need to do this with a liveUSB, to have it run OOTB
>
> Okay. Put my Perl script on your liveUSB. Also put some tool for
> monitoring CPU temperature, such as sensors(1).

Will have time again in a few days and check.

> David
Re: Weird behaviour on System under high load
On 5/21/23 06:31, Christian wrote:
> David Christensen Sun, 21 May 2023 03:11:43 -0700
> >>> David Christensen Sat, 20 May 2023 18:00:48 -0700

> > Please use inline posting style and proper indentation.
>
> Phew... will be quite hard to read. But here you go.

It is not hard when you delete the portions that you are not responding to.

> > > > Have you cleaned the system interior, filters, fans, heatsinks,
> > > > ducts, etc., recently?
>
> As written in OP, the system is new. Only PSU is used. So it is clean

Okay.

> What is a thermal solution?

Heat sinks, heat pipes, water blocks, radiators, fans, ducts, etc..

> > What stresstest are you using?
>
> stress running in s-tui

Do you mean "in situ"?

https://www.merriam-webster.com/dictionary/in%20situ

I prefer a tool that I can control. That is why I wrote the previously attached Perl script. It is public domain; you and everyone are free to use, modify, distribute, etc., as you see fit.

> > > > Have you tested the power supply recently?
>
> It was working before without issues, so not explicitly tested.
>
> I am not building regularly, so would need to borrow such equipment
> somewhere

Understand that an ATX PSU has multiple stages that produce +12 VDC, +5 VDC, +5 VDC standby, +3.3 VDC, and -12 VDC ("rails"). It is common for one or more rails to fail and the others to continue working. Computers exhibit "weird behaviour" when this happens.

Just spend the US$20.

> > > > Have you tested the memory recently?
> > Did you do multi-threaded/ stress tests?
>
> Yes, stress is running multiple threads. Only on 2 threads it was
> stable so far. However it takes longer for the errors to come up when
> using fewer threads. might be that I did not test long enough.

I use Memtest86+ 5.01 on a bootable USB stick. In the "Configuration" menu, I can choose "Core Selection". It appears the default is "Parallel (all)". Other choices include "Round Robin" and "Sequential". Memtest 5.01 also displays the CPU temperature. Running it on an Intel Core i7-2600S with matching factory heat sink and fan for 30+ minutes, the current CPU temperature is 50 C. This leads me to believe that the memory is loaded to 100%, but the CPU is less (perhaps 60%?).

https://memtest.org/

I recommend that you run Memtest86+ in parallel mode for at least one pass. I have seen computers go for 20+ hours before encountering a memory error.

> > Did you see the problems when running Debian stable OOTB, before
> > adding anything?
>
> I would need to do this with a liveUSB, to have it run OOTB

Okay. Put my Perl script on your liveUSB. Also put some tool for monitoring CPU temperature, such as sensors(1).

David
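To make the point about individual rails concrete, here is a small Python sketch that compares measured rail voltages against a tolerance band. The tolerances used (plus or minus 5 % for the positive rails, plus or minus 10 % for -12 V) are the commonly cited ATX12V figures and should be treated as an assumption to check against the specification for a given PSU; the function name `check_rails` and the sample readings are invented for illustration.

```python
# Nominal voltage and allowed relative deviation per rail.
# Assumed values: +/-5 % for positive rails, +/-10 % for -12 V.
RAILS = {
    "+12V": (12.0, 0.05),
    "+5V": (5.0, 0.05),
    "+5VSB": (5.0, 0.05),
    "+3.3V": (3.3, 0.05),
    "-12V": (-12.0, 0.10),
}


def check_rails(readings):
    """Return {rail: (measured, low, high)} for rails that are missing
    from `readings` (measured is None) or outside their tolerance band."""
    failures = {}
    for rail, (nominal, tol) in RAILS.items():
        # sorted() keeps low <= high for the negative rail as well
        low, high = sorted((nominal * (1 - tol), nominal * (1 + tol)))
        measured = readings.get(rail)
        if measured is None or not (low <= measured <= high):
            failures[rail] = (measured, low, high)
    return failures


if __name__ == "__main__":
    # Made-up multimeter or tester readings, for illustration only.
    sample = {"+12V": 11.2, "+5V": 5.02, "+5VSB": 5.0, "+3.3V": 3.31, "-12V": -11.8}
    for rail, (measured, low, high) in check_rails(sample).items():
        print(f"{rail}: measured {measured}, expected {low:.2f} to {high:.2f}")
```

A tester or multimeter supplies the readings; the script only flags which rail is out of range, which is the failure mode described above where one rail sags while the others look fine.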
Re: Weird behaviour on System under high load
On 5/21/23 06:26, songbird wrote:
> David Christensen wrote:
> ...
> > Measuring actual power supply output and system usage would involve
> > building or buying suitable test equipment. The cost would be
> > non-trivial.
> ...
>
> it depends upon how accurate you want to be and how much power.
>
> for my system it was a simple matter of buying a reasonably sized
> battery backup unit which includes in its display the amount of power
> being drawn in watts.
>
> on sale the backup unit cost about $150 USD.
>
> if i want to see what something draws i have a power cord set up to
> use for that and just plug it in and watch the display as it operates.
>
> if the device is a computer part i can plug it in to my motherboard or
> via usb or ... as long as it gets done with a grounding strip and i do
> the power turn off and turn back on as is appropriate for the device
> (and within ratings of my power supply).
>
> also use this setup to figure out how much power the various wall
> warts are eating. :( switches on all of them are worth the expense.
>
> songbird

Yes, there are a variety of price/performance options for measuring current and voltages between the AC power outlet and an AC load (such as a computer).

But, I was talking about measuring currents and voltages between a computer power supply output and the various components inside the computer.

David
Re: Weird behaviour on System under high load
David Christensen wrote:
...
> Measuring actual power supply output and system usage would involve
> building or buying suitable test equipment. The cost would be non-trivial.
...

it depends upon how accurate you want to be and how much power.

for my system it was a simple matter of buying a reasonably sized battery backup unit which includes in its display the amount of power being drawn in watts.

on sale the backup unit cost about $150 USD.

if i want to see what something draws i have a power cord set up to use for that and just plug it in and watch the display as it operates.

if the device is a computer part i can plug it in to my motherboard or via usb or ... as long as it gets done with a grounding strip and i do the power turn off and turn back on as is appropriate for the device (and within ratings of my power supply).

also use this setup to figure out how much power the various wall warts are eating. :( switches on all of them are worth the expense.

songbird
Re: Weird behaviour on System under high load
> Original Message
> From: David Christensen
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sun, 21 May 2023 03:11:43 -0700
>
> On 5/21/23 01:14, Christian wrote:
> > > Original Message
> > > From: David Christensen
> > > To: debian-user@lists.debian.org
> > > Subject: Re: Weird behaviour on System under high load
> > > Date: Sat, 20 May 2023 18:00:48 -0700
> > >
> > > On 5/20/23 14:46, Christian wrote:
> > > > Hi there,
> > > >
> > > > I am having trouble with a new build system. It works normal and
> > > > stable until I put extreme stress on it, e.g. using all 12 cores
> > > > with stress tool.
> > > >
> > > > System will suddenly lose network connection and become
> > > > unresponsive. Only a reset works. I am not sure what is going on,
> > > > but it is reproducible: Put stress on the system and it fails. It
> > > > seems, that something is getting out of step.
> > > >
> > > > Stuff below I found in the logs. I tried quite a bit, even
> > > > upgraded to bookworm, to see if the newer kernel works.
> > > >
> > > > If anyone knows how to analyze this issue, it would be very
> > > > helpful.
>
> Please use inline posting style and proper indentation.

Phew... will be quite hard to read. But here you go.

> > > Have you verified that your PSU has sufficient capacity for the
> > > load on each and every rail?
>
> > Hi there,
> >
> > Let's go through the different topics:
> > - Setup: It is an AMD 5600G
>
> https://www.amd.com/en/products/apu/amd-ryzen-5-5600g
>
> 65 W
>
> > on an ASRock B550M-ITX/ac,
>
> https://www.asrock.com/mb/AMD/B550M-ITXac/index.asp
>
> > powered by a BeQuiet SP7 300W
> >
> > - Power: From the specifications it should fit. As it takes 5-20
> > minutes for the error to occur, I would take that as an indication,
> > that the power supply is ok. Otherwise would expect that to fail
> > right away? Is there a way to measure/test if there is any issue
> > with it?
> > I also tested to limit PPT to 45W which also makes no difference.
>
> If all you have is a motherboard, a 65W CPU, and an SSD, that looks
> like a good quality 300W PSU and I would think it should support
> long-term full loading of the CPU. But, there is no substitute for
> doing the engineering.
>
> I do PSU calculations using a spreadsheet. This requires finding power
> specifications (or making estimates) for everything in the system,
> which can be tough.
>
> BeQuiet has a PSU calculator. I suggest using it:
>
> https://www.bequiet.com/en/psucalculator
>
> Measuring actual power supply output and system usage would involve
> building or buying suitable test equipment. The cost would be
> non-trivial.
>
> An easy A/B test would be to connect a known-good, high-quality PSU
> with a higher power rating (say, 500-1000W). I use:
>
> https://www.fractal-design.com/products/power-supplies/ion/ion-2-platinum-660w/black/

Used the calculator, however it might be that the onboard graphics is not properly accounted for. Will see that I get a 500W PSU for testing.

> > > Have you cleaned the system interior, filters, fans, heatsinks,
> > > ducts, etc., recently?
>
> ?

As written in OP, the system is new. Only PSU is used. So it is clean

> > > Have you tested the thermal solution(s) recently?
>
> > - Thermal: I am observing the temperatures on the stresstest. If I am
> > correct in reading Smbusmaster0, Temps haven't been above 71°C, but
> > error also occurs earlier, way below 70.
>
> Okay.
>
> What is your CPU thermal solution?

What is a thermal solution?

> What stresstest are you using?

stress running in s-tui

> > > Have you tested the power supply recently?

It was working before without issues, so not explicitly tested.

> I suffered a rash of bad PSU's recently. I was able to figure it out
> because I bought an inexpensive PSU tester years ago. It has saved my
> sanity more than once. I suggest that you buy something like it:
>
> https://www.ebay.com/sch/i.html?_from=R40&_t
Re: Weird behaviour on System under high load
On 5/21/23 01:14, Christian wrote:
> > Original Message
> > From: David Christensen
> > To: debian-user@lists.debian.org
> > Subject: Re: Weird behaviour on System under high load
> > Date: Sat, 20 May 2023 18:00:48 -0700
> >
> > On 5/20/23 14:46, Christian wrote:
> > > Hi there,
> > >
> > > I am having trouble with a new build system. It works normal and
> > > stable until I put extreme stress on it, e.g. using all 12 cores
> > > with stress tool.
> > >
> > > System will suddenly lose network connection and become
> > > unresponsive. Only a reset works. I am not sure what is going on,
> > > but it is reproducible: Put stress on the system and it fails. It
> > > seems, that something is getting out of step.
> > >
> > > Stuff below I found in the logs. I tried quite a bit, even upgraded
> > > to bookworm, to see if the newer kernel works.
> > >
> > > If anyone knows how to analyze this issue, it would be very
> > > helpful.

Please use inline posting style and proper indentation.

> > Have you verified that your PSU has sufficient capacity for the load
> > on each and every rail?

> Hi there,
>
> Let's go through the different topics:
> - Setup: It is an AMD 5600G

https://www.amd.com/en/products/apu/amd-ryzen-5-5600g

65 W

> on an ASRock B550M-ITX/ac,

https://www.asrock.com/mb/AMD/B550M-ITXac/index.asp

> powered by a BeQuiet SP7 300W
>
> - Power: From the specifications it should fit. As it takes 5-20
> minutes for the error to occur, I would take that as an indication,
> that the power supply is ok. Otherwise would expect that to fail right
> away? Is there a way to measure/test if there is any issue with it?
> I also tested to limit PPT to 45W which also makes no difference.

If all you have is a motherboard, a 65W CPU, and an SSD, that looks like a good quality 300W PSU and I would think it should support long-term full loading of the CPU. But, there is no substitute for doing the engineering.

I do PSU calculations using a spreadsheet. This requires finding power specifications (or making estimates) for everything in the system, which can be tough.

BeQuiet has a PSU calculator. I suggest using it:

https://www.bequiet.com/en/psucalculator

Measuring actual power supply output and system usage would involve building or buying suitable test equipment. The cost would be non-trivial.

An easy A/B test would be to connect a known-good, high-quality PSU with a higher power rating (say, 500-1000W). I use:

https://www.fractal-design.com/products/power-supplies/ion/ion-2-platinum-660w/black/

> > Have you cleaned the system interior, filters, fans, heatsinks,
> > ducts, etc., recently?

?

> > Have you tested the thermal solution(s) recently?

> - Thermal: I am observing the temperatures on the stresstest. If I am
> correct in reading Smbusmaster0, Temps haven't been above 71°C, but
> error also occurs earlier, way below 70.

Okay.

What is your CPU thermal solution?

What stresstest are you using?

> > Have you tested the power supply recently?

I suffered a rash of bad PSU's recently. I was able to figure it out because I bought an inexpensive PSU tester years ago. It has saved my sanity more than once. I suggest that you buy something like it:

https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=antec+atx12+tester&_sacat=0

> > Have you tested the memory recently?

> - Memory: Yes was tested right after the build with no errors

Okay.

Did you do multi-threaded/ stress tests?

> > Are you running Debian stable?
> >
> > Are you running Debian stable packages only? Were they all installed
> > with the same package manager?

> - OS: I was running Debian stable in quite a minimal configuration
> (fresh install as most services are dockerized) when first observed the
> error. Now moved to Debian 12/Bookworm to see if it makes any
> difference with higher kernel (it does not). Also exchanged r8169 for
> the r8168. It changes the error messages, however system instability
> stays.

Did you see the problems when running Debian stable OOTB, before adding anything?

Did you stress test the system before adding anything (other than the stress test)?

> > If all of the above are okay and the system is still locking up, I
> > would disable or remove all disks in the system, install a zeroed
> > SSD, install Debian stable choosing only "SSH server" and "standard
> > system utilities", install only the stable packages required for your
> > workload, put the workload on it, and see what happens.

> I could disconnect the disks and see if it makes any difference.
> However when reproducing this error, disks other than system were
> unmounted. So would guess this would be a test to see if it is about
> power?

Stripping the system down to minimum hardware and software is a good starting point.

You will need a tool to load the system and some means to watch what happens.

Assuming the base configuration passes all tests, then add something, test, and repeat until testing fails.
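The spreadsheet-style PSU calculation mentioned above can be approximated in a few lines of Python. This is a rough sketch under stated assumptions: only the 65 W CPU figure and the 300 W PSU rating come from the thread, every other wattage is a placeholder, the 80 % derating factor is an arbitrary example, and the function name `psu_budget` is made up. It also ignores per-rail limits, which the PSU label or datasheet would have to supply.

```python
def psu_budget(components_w, psu_rating_w, max_load_fraction=0.8):
    """Sum estimated component draw and compare it with a derated PSU rating."""
    total_w = sum(components_w.values())
    limit_w = psu_rating_w * max_load_fraction
    return {"total_w": total_w, "limit_w": limit_w, "ok": total_w <= limit_w}


if __name__ == "__main__":
    # Only "cpu" (65 W) and the 300 W rating are taken from the thread;
    # the remaining numbers are placeholders to replace with datasheet values.
    estimate = {
        "cpu": 65,
        "motherboard": 30,  # placeholder
        "nvme_ssd": 8,      # placeholder
        "sata_disks": 40,   # placeholder
        "fans": 6,          # placeholder
    }
    print(psu_budget(estimate, psu_rating_w=300))
```

A total wattage check like this is only a first pass; a system can stay under the overall rating and still overload a single rail.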
Re: Weird behaviour on System under high load
Hi there,

Let's go through the different topics:
- Setup: It is an AMD 5600G on an ASRock B550M-ITX/ac, powered by a BeQuiet SP7 300W

- Power: From the specifications it should fit. As it takes 5-20 minutes for the error to occur, I would take that as an indication, that the power supply is ok. Otherwise would expect that to fail right away? Is there a way to measure/test if there is any issue with it?
I also tested to limit PPT to 45W which also makes no difference.

- Memory: Yes was tested right after the build with no errors

- Thermal: I am observing the temperatures on the stresstest. If I am correct in reading Smbusmaster0, Temps haven't been above 71°C, but error also occurs earlier, way below 70.

- OS: I was running Debian stable in quite a minimal configuration (fresh install as most services are dockerized) when first observed the error. Now moved to Debian 12/Bookworm to see if it makes any difference with higher kernel (it does not). Also exchanged r8169 for the r8168. It changes the error messages, however system instability stays.

I could disconnect the disks and see if it makes any difference. However when reproducing this error, disks other than system were unmounted. So would guess this would be a test to see if it is about power?

Original Message
From: David Christensen
To: debian-user@lists.debian.org
Subject: Re: Weird behaviour on System under high load
Date: Sat, 20 May 2023 18:00:48 -0700

On 5/20/23 14:46, Christian wrote:
> Hi there,
>
> I am having trouble with a new build system. It works normal and
> stable until I put extreme stress on it, e.g. using all 12 cores with
> stress tool.
>
> System will suddenly lose network connection and become unresponsive.
> Only a reset works. I am not sure what is going on, but it is
> reproducible: Put stress on the system and it fails. It seems, that
> something is getting out of step.
>
> Stuff below I found in the logs. I tried quite a bit, even upgraded to
> bookworm, to see if the newer kernel works.
>
> If anyone knows how to analyze this issue, it would be very helpful.
>
> Kind regards
> Christian
>
> 2023-05-20T20:12:17.054224+02:00 diskstation kernel: [ 1303.236428] ------------[ cut here ]------------
> 2023-05-20T20:12:17.054234+02:00 diskstation kernel: [ 1303.236430] NETDEV WATCHDOG: enp3s0 (r8169): transmit queue 0 timed out
> 2023-05-20T20:12:17.054235+02:00 diskstation kernel: [ 1303.236437] WARNING: CPU: 5 PID: 2411 at net/sched/sch_generic.c:525 dev_watchdog+0x207/0x210
> 2023-05-20T20:12:17.054236+02:00 diskstation kernel: [ 1303.236442] Modules linked in: eq3_char_loop(OE) rpi_rf_mod_led(OE) ledtrig_timer ledtrig_default_on xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc overlay ip6t_rt nft_chain_nat nf_nat xt_set xt_tcpmss xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables ip_set_hash_ip ip_set binfmt_misc nfnetlink nls_ascii nls_cp437 vfat fat amdgpu iwlmvm btusb intel_rapl_msr btrtl intel_rapl_common btbcm btintel edac_mce_amd btmtk mac80211 snd_hda_codec_realtek bluetooth snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi gpu_sched kvm_amd drm_buddy libarc4 snd_hda_intel drm_display_helper snd_intel_dspcfg snd_intel_sdw_acpi iwlwifi kvm cec snd_hda_codec jitterentropy_rng irqbypass rc_core snd_hda_core cfg80211 snd_hwdep drm_ttm_helper snd_pcm ttm drbg wmi_bmof rapl ccp snd_timer ansi_cprng drm_kms_helper sp5100_tco snd pcspkr ecdh_generic rng_core i2c_algo_bit watchdog soundcore k10temp rfkill hb_rf_usb_2(OE) ecc
> 2023-05-20T20:12:17.054240+02:00 diskstation kernel: [ 1303.236494] generic_raw_uart(OE) acpi_cpufreq button joydev evdev sg nct6775 nct6775_core drm hwmon_vid fuse loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c crc32c_generic dm_crypt dm_mod hid_generic usbhid hid sd_mod crc32_pclmul crc32c_intel ahci ghash_clmulni_intel sha512_ssse3 libahci xhci_pci sha512_generic xhci_hcd r8169 nvme realtek libata aesni_intel nvme_core t10_pi crypto_simd mdio_devres usbcore scsi_mod crc64_rocksoft_generic cryptd libphy crc64_rocksoft crc_t10dif i2c_piix4 crct10dif_generic crct10dif_pclmul crc64 crct10dif_common usb_common scsi_common video wmi gpio_amdpt gpio_generic
> 2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236534] CPU: 5 PID: 2411 Comm: stress Tainted: G OE 6.1.0-9-amd64 #1 Debian 6.1.27-1
> 2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236536] Hardware name: To Be Filled By O.E.M. B550M-ITX/ac/B550M-ITX/ac, BIOS L2.62 01/31/2023
> 2023-05-20T20:12:17.
Re: Weird behaviour on System under high load
On 5/20/23 14:46, Christian wrote:
> Hi there,
>
> I am having trouble with a new build system. It works normal and
> stable until I put extreme stress on it, e.g. using all 12 cores with
> stress tool.
>
> System will suddenly lose network connection and become unresponsive.
> Only a reset works. I am not sure what is going on, but it is
> reproducible: Put stress on the system and it fails. It seems, that
> something is getting out of step.
>
> Stuff below I found in the logs. I tried quite a bit, even upgraded to
> bookworm, to see if the newer kernel works.
>
> If anyone knows how to analyze this issue, it would be very helpful.
>
> Kind regards
> Christian
>
> 2023-05-20T20:12:17.054224+02:00 diskstation kernel: [ 1303.236428] ------------[ cut here ]------------
> 2023-05-20T20:12:17.054234+02:00 diskstation kernel: [ 1303.236430] NETDEV WATCHDOG: enp3s0 (r8169): transmit queue 0 timed out
> 2023-05-20T20:12:17.054235+02:00 diskstation kernel: [ 1303.236437] WARNING: CPU: 5 PID: 2411 at net/sched/sch_generic.c:525 dev_watchdog+0x207/0x210
> 2023-05-20T20:12:17.054236+02:00 diskstation kernel: [ 1303.236442] Modules linked in: eq3_char_loop(OE) rpi_rf_mod_led(OE) ledtrig_timer ledtrig_default_on xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc overlay ip6t_rt nft_chain_nat nf_nat xt_set xt_tcpmss xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables ip_set_hash_ip ip_set binfmt_misc nfnetlink nls_ascii nls_cp437 vfat fat amdgpu iwlmvm btusb intel_rapl_msr btrtl intel_rapl_common btbcm btintel edac_mce_amd btmtk mac80211 snd_hda_codec_realtek bluetooth snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi gpu_sched kvm_amd drm_buddy libarc4 snd_hda_intel drm_display_helper snd_intel_dspcfg snd_intel_sdw_acpi iwlwifi kvm cec snd_hda_codec jitterentropy_rng irqbypass rc_core snd_hda_core cfg80211 snd_hwdep drm_ttm_helper snd_pcm ttm drbg wmi_bmof rapl ccp snd_timer ansi_cprng drm_kms_helper sp5100_tco snd pcspkr ecdh_generic rng_core i2c_algo_bit watchdog soundcore k10temp rfkill hb_rf_usb_2(OE) ecc
> 2023-05-20T20:12:17.054240+02:00 diskstation kernel: [ 1303.236494] generic_raw_uart(OE) acpi_cpufreq button joydev evdev sg nct6775 nct6775_core drm hwmon_vid fuse loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c crc32c_generic dm_crypt dm_mod hid_generic usbhid hid sd_mod crc32_pclmul crc32c_intel ahci ghash_clmulni_intel sha512_ssse3 libahci xhci_pci sha512_generic xhci_hcd r8169 nvme realtek libata aesni_intel nvme_core t10_pi crypto_simd mdio_devres usbcore scsi_mod crc64_rocksoft_generic cryptd libphy crc64_rocksoft crc_t10dif i2c_piix4 crct10dif_generic crct10dif_pclmul crc64 crct10dif_common usb_common scsi_common video wmi gpio_amdpt gpio_generic
> 2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236534] CPU: 5 PID: 2411 Comm: stress Tainted: G OE 6.1.0-9-amd64 #1 Debian 6.1.27-1
> 2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236536] Hardware name: To Be Filled By O.E.M. B550M-ITX/ac/B550M-ITX/ac, BIOS L2.62 01/31/2023
> 2023-05-20T20:12:17.054242+02:00 diskstation kernel: [ 1303.236537] RIP: 0010:dev_watchdog+0x207/0x210
> 2023-05-20T20:12:17.054242+02:00 diskstation kernel: [ 1303.236540] Code: 00 e9 40 ff ff ff 48 89 df c6 05 ff 5f 3d 01 01 e8 be 79 f9 ff 44 89 e9 48 89 de 48 c7 c7 c8 16 9b a8 48 89 c2 e8 09 d2 86 ff <0f> 0b e9 22 ff ff ff 66 90 0f 1f 44 00 00 55 53 48 89 fb 48 8b 6f
> 2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236541] RSP: :a831c345fdc8 EFLAGS: 00010286
> 2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236543] RAX: RBX: 91a3c141 RCX:
> 2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236544] RDX: 0103 RSI: a893fa66 RDI:
> 2023-05-20T20:12:17.054244+02:00 diskstation kernel: [ 1303.236545] RBP: 91a3c1410488 R08: R09: a831c345fc38
> 2023-05-20T20:12:17.054244+02:00 diskstation kernel: [ 1303.236546] R10: 0003 R11: 91aafe27afe8 R12: 91a3c14103dc
> 2023-05-20T20:12:17.054245+02:00 diskstation kernel: [ 1303.236547] R13: R14: a7e2e7a0 R15: 91a3c1410488
> 2023-05-20T20:12:17.054245+02:00 diskstation kernel: [ 1303.236548] FS: 7f169849d740() GS:91aade34() knlGS:
> 2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236550] CS: 0010 DS: ES: CR0: 80050033
> 2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236551] CR2: 55d05c3f4000 CR3: 000103cf2000 CR4: 00750ee0
> 2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236552] PKRU: 5554
> 2023-05-20T20:12:17.054247+02:00 diskstation kernel: [ 1303.236553] Call Trace:
> 2023-05-20T20:12:17.054247+02:00 diskstation kernel: [ 1303.236554]
> 2023-05-20T20:12:17.054248+02:00 diskstation kernel: [