Re: Weird behaviour on System under high load

2023-05-28 Thread David Christensen

On 5/28/23 03:09, Christian wrote:

> >  Original Message 
> > From: David Christensen 
> > To: debian-user@lists.debian.org
> > Subject: Re: Weird behaviour on System under high load
> > Date: Sat, 27 May 2023 16:30:05 -0700
> > 
> > On 5/27/23 15:28, Christian wrote:
> > 
> > > New day, new tests. Got a crash again, however with the message "AHCI
> > > controller unavailable".
> > > Figured that is the SATA drives not being plugged in the right order.
> > > Corrected that, and a 3:30h stress test has so far run without issues
> > > besides this old bug
> > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=947685
> > > 
> > > Seems that I am just jumping from one error to the next...
> > 
> > 
> > 3 hours and 30 minutes?  Yikes!  Please stop before you fry your
> > computer.  10 seconds should be enough to see a problem; 1 minute is
> > more than enough.
> 
> Sadly not always. My crashes before would occur between a few minutes
> and 1 hour of load. Now I hope everything is stable. Crashes are gone;
> only the network error seems to be unresolved (even though there is
> some workaround).



Crashes that are repeatable and that match a reported software issue indicate your hardware is okay.



> With the undervolting/overclocking during the 12-core stress test, the
> system stays below 65°C (on Smbusmaster0), so there should be no risk
> of damage.



It is your computer and your decision.


At this point, I would start adding the software stack, one piece at a 
time, testing between each piece.  The challenge is devising or finding 
tests.  Spot testing by hand can reveal bugs, but that gets tiresome. 
The best approach is an automated/scripted test suite.  If you are 
using Debian packages, you might want to look for test suites in the 
corresponding source packages.  And/or, you can use building from source 
as a stress test.  Compiling the Linux kernel should provide your 
processor, memory, and storage with a good workout.
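
For example, a minimal kernel-build burn-in loop might look like the
following (a sketch, not a vetted script; it assumes the kernel source
is already unpacked in ~/src/linux, that the usual build dependencies
-- build-essential, flex, bison, libssl-dev, libelf-dev -- are
installed, and that lm-sensors is installed for the temperature
read-out):

  #!/bin/sh
  # Repeatedly build the kernel to exercise CPU, RAM, and storage.
  set -e
  cd ~/src/linux
  make defconfig                # generate a default .config
  for pass in 1 2 3; do
      make clean
      make -j"$(nproc)"         # build with all cores
      sensors                   # record temperatures after each pass
  done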




> Thanks for the help!


YW.  :-)


David



Re: Weird behaviour on System under high load

2023-05-28 Thread Christian
>  Original Message 
> From: David Christensen 
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sat, 27 May 2023 16:30:05 -0700
> 
> On 5/27/23 15:28, Christian wrote:
> 
> > New day, new tests. Got a crash again, however with the message "AHCI
> > controller unavailable".
> > Figured that is the SATA drives not being plugged in the right order.
> > Corrected that, and a 3:30h stress test has so far run without issues
> > besides this old bug
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=947685
> > 
> > Seems that I am just jumping from one error to the next...
> 
> 
> 3 hours and 30 minutes?  Yikes!  Please stop before you fry your 
> computer.  10 seconds should be enough to see a problem; 1 minute is 
> more than enough.
> 
Sadly not always. My crashes before would occur between a few minutes
and 1 hour of load. Now I hope everything is stable. Crashes are gone;
only the network error seems to be unresolved (even though there is
some workaround).

With the undervolting/overclocking during the 12-core stress test, the
system stays below 65°C (on Smbusmaster0), so there should be no risk
of damage.

Thanks for the help!
> 
> David
> 
> 
> 



Re: Weird behaviour on System under high load

2023-05-27 Thread David Christensen

On 5/27/23 15:28, Christian wrote:


> New day, new tests. Got a crash again, however with the message "AHCI
> controller unavailable".
> Figured that is the SATA drives not being plugged in the right order.
> Corrected that, and a 3:30h stress test has so far run without issues
> besides this old bug
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=947685
> 
> Seems that I am just jumping from one error to the next...



3 hours and 30 minutes?  Yikes!  Please stop before you fry your 
computer.  10 seconds should be enough to see a problem; 1 minute is 
more than enough.



David




Re: Weird behaviour on System under high load

2023-05-27 Thread Christian
>  Original Message 
> From: David Christensen 
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sat, 27 May 2023 16:30:05 -0700
> 
> On 5/26/23 16:08, Christian wrote:
> 
> > Good and bad things:
> > I started to test different setups (always with the full 12-core
> > stress test). Boot from USB liveCD (only stress and s-tui installed):
> > 
> > - All disks disconnected, other than the M.2. Standard BIOS
> > - All disks disconnected, other than the M.2. Proper memory profile
> > for timing
> > - All disks disconnected, other than the M.2. Memory profile,
> > undervolted and overclocked with limited burst to 4 GHz
> > - All disks connected. Memory profile, undervolted and overclocked
> > with limited burst to 4 GHz
> > 
> > All settings so far are stable. :-/
> > Tomorrow I will check for differences in non-free firmware and kernel
> > modules and test again.
> > 
> > Very strange...
> 
> 
> If everything is stable, including undervolting and overclocking, I
> would consider that good.  I think your hardware is good.
> 
> 
> When you say "USB liveCD", is that a USB optical drive with a live CD,
> a USB flash drive with a bootable OS on it, or something else?  If it
> is something that can change, I suggest taking an image of the raw
> blocks with dd(1) so that you can easily get back to this point as you
> continue testing.
> 

New day, new tests. Got a crash again, however with the message "AHCI
controller unavailable".
Figured that is the SATA drives not being plugged in the right order.
Corrected that, and a 3:30h stress test has so far run without issues
besides this old bug
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=947685

Seems that I am just jumping from one error to the next...

> 
> AIUI Debian can include microcode patches (depending upon processor).
> If you are using such, I suggest adding that to your test agenda first.
> 
> 
> Firmware and kernel modules seem like the right next steps.
> 
> 
> David
> 
> 



Re: Weird behaviour on System under high load

2023-05-26 Thread David Christensen

On 5/26/23 16:08, Christian wrote:


> Good and bad things:
> I started to test different setups (always with the full 12-core
> stress test). Boot from USB liveCD (only stress and s-tui installed):
> 
> - All disks disconnected, other than the M.2. Standard BIOS
> - All disks disconnected, other than the M.2. Proper memory profile
> for timing
> - All disks disconnected, other than the M.2. Memory profile,
> undervolted and overclocked with limited burst to 4 GHz
> - All disks connected. Memory profile, undervolted and overclocked
> with limited burst to 4 GHz
> 
> All settings so far are stable. :-/
> Tomorrow I will check for differences in non-free firmware and kernel
> modules and test again.
> 
> Very strange...



If everything is stable, including undervolting and overclocking, I 
would consider that good.  I think your hardware is good.



When you say "USB liveCD", is that a USB optical drive with a live CD, a 
USB flash drive with a bootable OS on it, or something else?  If it is 
something that can change, I suggest taking an image of the raw blocks 
with dd(1) so that you can easily get back to this point as you continue 
testing.
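
For example (a sketch; /dev/sdX below is a placeholder for the stick's
device node, so double-check it with lsblk before running anything):

  # back up the raw blocks of the live USB stick
  dd if=/dev/sdX of=liveusb-backup.img bs=4M status=progress
  # later, restore it to get back to this exact state
  dd if=liveusb-backup.img of=/dev/sdX bs=4M status=progress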



AIUI Debian can include microcode patches (depending upon processor). 
If you are using such, I suggest adding that to your test agenda first.



Firmware and kernel modules seem like the right next steps.


David



Re: Weird behaviour on System under high load

2023-05-26 Thread Christian
>  Original Message 
> From: David Christensen 
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sun, 21 May 2023 15:04:44 -0700
> 
> 
> > > > > What stress test are you using?
> 
> > ... the packages and commands "s-tui" and "stress"
> > s-tui gives you an overview of power usage, fan control, temps, core
> > frequencies, and core utilization on the console
> > 
> > stress just produces load on a selected number of CPUs; it can be
> > integrated into s-tui.
> 
> 
> Thanks -- I like tools and will play with it:
> 
> https://packages.debian.org/bullseye/s-tui
> 
> 
> > > Okay.  Put my Perl script on your liveUSB.  Also put some tool for
> > > monitoring CPU temperature, such as sensors(1).
> > 
> > Will have time again in a few days and check.
> 
> 
> Please let us know what you find.
> 
> 
Good and bad things:
I started to test different setups (always with the full 12-core stress
test). Boot from USB liveCD (only stress and s-tui installed):

- All disks disconnected, other than the M.2. Standard BIOS
- All disks disconnected, other than the M.2. Proper memory profile for
timing
- All disks disconnected, other than the M.2. Memory profile, undervolted
and overclocked with limited burst to 4 GHz
- All disks connected. Memory profile, undervolted and overclocked with
limited burst to 4 GHz

All settings so far are stable. :-/
Tomorrow I will check for differences in non-free firmware and kernel
modules and test again.

Very strange...



Re: Weird behaviour on System under high load

2023-05-21 Thread David Christensen

On 5/21/23 14:46, Christian wrote:

> David Christensen Sun, 21 May 2023 14:22:22 -0700
> 
> > On 5/21/23 06:31, Christian wrote:
> > 
> > > David Christensen Sun, 21 May 2023 03:11:43 -0700
> > > 
> > > > David Christensen Sat, 20 May 2023 18:00:48 -0700
> > 
> > 
> > Heat sinks, heat pipes, water blocks, radiators, fans, ducts, etc..
> 
> It is quite simple:
> - Noctua NH-L9a-AM4 for CPU
> - Chassis 12cm fan
> - PSU integrated fans


I like the Noctua.  :-)


> > > What stress test are you using?
> 
> ... the packages and commands "s-tui" and "stress"
> s-tui gives you an overview of power usage, fan control, temps, core
> frequencies, and core utilization on the console
> 
> stress just produces load on a selected number of CPUs; it can be
> integrated into s-tui.


Thanks -- I like tools and will play with it:

https://packages.debian.org/bullseye/s-tui


> > Okay.  Put my Perl script on your liveUSB.  Also put some tool for
> > monitoring CPU temperature, such as sensors(1).
> 
> Will have time again in a few days and check.


Please let us know what you find.


David



Re: Weird behaviour on System under high load

2023-05-21 Thread Christian
>  Original Message 
> From: David Christensen 
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sun, 21 May 2023 14:22:22 -0700
> 
> On 5/21/23 06:31, Christian wrote:
> > David Christensen Sun, 21 May 2023 03:11:43 -0700
> 
>  >>> David Christensen Sat, 20 May 2023 18:00:48 -0700
> 
> > > Please use inline posting style and proper indentation.
> > 
> > Phew... will be quite hard to read. But here you go.
> 
> 
> It is not hard when you delete the portions that you are not
> responding to.
> 
> 
> > > > > Have you cleaned the system interior, filters, fans, heatsinks,
> > > > > ducts, etc., recently?
> 
> > As written in the OP, the system is new. Only the PSU is reused. So
> > it is clean
> 
> 
> Okay.
> 
> 
> > What is a thermal solution?
> 
> 
> Heat sinks, heat pipes, water blocks, radiators, fans, ducts, etc..
> 
It is quite simple:
- Noctua NH-L9a-AM4 for CPU
- Chassis 12cm fan
- PSU integrated fans
> 
> > What stress test are you using?
> > 
> > stress running in s-tui
> 
> 
> Do you mean "in situ"?
> 
> https://www.merriam-webster.com/dictionary/in%20situ
> 
No, it is the packages and commands "s-tui" and "stress"
s-tui gives you an overview of power usage, fan control, temps, core
frequencies, and core utilization on the console

stress just produces load on a selected number of CPUs; it can be
integrated into s-tui.
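
For reference, a typical invocation looks something like this (a
sketch; the core count and duration are examples, not the exact values
used in the tests above):

  # load all 12 hardware threads for 10 minutes
  stress --cpu 12 --timeout 600
  # or start the monitor and launch stress from inside it
  s-tui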

> I prefer a tool that I can control.  That is why I wrote the previously
> attached Perl script.  It is public domain; you and everyone are free
> to use, modify, distribute, etc., as you see fit.
> 
> 
> > > > > Have you tested the power supply recently?
> 
> > It was working before without issues, so not explicitly tested.
> 
> > I am not building regularly, so would need to borrow such equipment
> > somewhere
> 
> 
> Understand that an ATX PSU has multiple stages that produce +12 VDC,
> +5 VDC, +5 VDC standby, +3.3 VDC, and -12 VDC ("rails").  It is common
> for one or more rails to fail and the others to continue working.
> Computers exhibit "weird behaviour" when this happens.
> 
> 
> Just spend the US$20.
> 
> 
> > > > > Have you tested the memory recently?
> 
> > > Did you do multi-threaded/stress tests?
> > > 
> > Yes, stress is running multiple threads. Only on 2 threads was it
> > stable so far. However, it takes longer for the errors to come up
> > when using fewer threads. Might be that I did not test long enough.
> 
> 
> I use Memtest86+ 5.01 on a bootable USB stick.  In the "Configuration"
> menu, I can choose "Core Selection".  It appears the default is
> "Parallel (all)".  Other choices include "Round Robin" and
> "Sequential".  Memtest86+ 5.01 also displays the CPU temperature.
> Running it on an Intel Core i7-2600S with matching factory heat sink
> and fan for 30+ minutes, the current CPU temperature is 50°C.  This
> leads me to believe that the memory is loaded to 100%, but the CPU is
> less (perhaps 60%?).
> 
> https://memtest.org/
> 
> 
> I recommend that you run Memtest86+ in parallel mode for at least one
> pass.  I have seen computers go for 20+ hours before encountering a
> memory error.
> 
> 
> > > Did you see the problems when running Debian stable OOTB, before
> > > adding anything?
> 
> > I would need to do this with a liveUSB, to have it run OOTB
> 
> 
> Okay.  Put my Perl script on your liveUSB.  Also put some tool for
> monitoring CPU temperature, such as sensors(1).

Will have time again in a few days and check.

> 
> 
> David
> 
> 



Re: Weird behaviour on System under high load

2023-05-21 Thread David Christensen

On 5/21/23 06:31, Christian wrote:

> David Christensen Sun, 21 May 2023 03:11:43 -0700
> 
> 
> > > David Christensen Sat, 20 May 2023 18:00:48 -0700


> > Please use inline posting style and proper indentation.


> Phew... will be quite hard to read. But here you go.



It is not hard when you delete the portions that you are not responding to.



> > > > Have you cleaned the system interior, filters, fans, heatsinks,
> > > > ducts, etc., recently?



> As written in the OP, the system is new. Only the PSU is reused. So it
> is clean



Okay.



> What is a thermal solution?



Heat sinks, heat pipes, water blocks, radiators, fans, ducts, etc..



> > What stress test are you using?

> stress running in s-tui



Do you mean "in situ"?

https://www.merriam-webster.com/dictionary/in%20situ


I prefer a tool that I can control.  That is why I wrote the previously 
attached Perl script.  It is public domain; you and everyone are free to 
use, modify, distribute, etc., as you see fit.




> > > > Have you tested the power supply recently?



> It was working before without issues, so not explicitly tested.



> I am not building regularly, so would need to borrow such equipment
> somewhere



Understand that an ATX PSU has multiple stages that produce +12 VDC, +5 
VDC, +5 VDC standby, +3.3 VDC, and -12 VDC ("rails").  It is common for 
one or more rails to fail and the others to continue working.  Computers 
exhibit "weird behaviour" when this happens.



Just spend the US$20.



> > > > Have you tested the memory recently?



> > Did you do multi-threaded/stress tests?


> Yes, stress is running multiple threads. Only on 2 threads was it
> stable so far. However, it takes longer for the errors to come up when
> using fewer threads. Might be that I did not test long enough.



I use Memtest86+ 5.01 on a bootable USB stick.  In the "Configuration" 
menu, I can choose "Core Selection".  It appears the default is 
"Parallel (all)".  Other choices include "Round Robin" and "Sequential". 
Memtest86+ 5.01 also displays the CPU temperature.  Running it on an 
Intel Core i7-2600S with matching factory heat sink and fan for 30+ 
minutes, the current CPU temperature is 50°C.  This leads me to believe 
that the memory is loaded to 100%, but the CPU is less (perhaps 60%?).


https://memtest.org/


I recommend that you run Memtest86+ in parallel mode for at least one 
pass.  I have seen computers go for 20+ hours before encountering a 
memory error.




> > Did you see the problems when running Debian stable OOTB, before
> > adding anything?



> I would need to do this with a liveUSB, to have it run OOTB



Okay.  Put my Perl script on your liveUSB.  Also put some tool for 
monitoring CPU temperature, such as sensors(1).
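
If you want a record rather than watching by hand, something like this
(a sketch; it assumes the lm-sensors package is installed and
sensors-detect has been run) logs temperatures once per second while
the stress test runs:

  # log timestamped CPU temperatures alongside the stress test
  while true; do
      date '+%F %T'
      sensors | grep -Ei 'tctl|temp'
      sleep 1
  done >> temps.log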



David



Re: Weird behaviour on System under high load

2023-05-21 Thread David Christensen

On 5/21/23 06:26, songbird wrote:

> David Christensen wrote:
> ...
> 
> > Measuring actual power supply output and system usage would involve
> > building or buying suitable test equipment.  The cost would be
> > non-trivial.
> 
> ...
> 
>    it depends upon how accurate you want to be and
> how much power.
> 
>    for my system it was a simple matter of buying a
> reasonably sized battery backup unit which includes
> in its display the amount of power being drawn in
> watts.
> 
>    on sale the backup unit cost about $150 USD.  if
> i want to see what something draws i have a power
> cord set up to use for that and just plug it in
> and watch the display as it operates.  if the
> device is a computer part i can plug it in to my
> motherboard or via usb or ...  as long as it gets
> done with a grounding strip and i do the power
> turn off and turn back on as is appropriate for
> the device (and within ratings of my power supply).
> 
>    also use this setup to figure out how much power
> the various wall warts are eating.  :(  switches on
> all of them are worth the expense.
> 
> 
>    songbird



Yes, there are a variety of price/performance options for measuring 
current and voltages between the AC power outlet and an AC load (such as 
a computer).



But, I was talking about measuring currents and voltages between a 
computer power supply output and the various components inside the 
computer.



David



Re: Weird behaviour on System under high load

2023-05-21 Thread songbird
David Christensen wrote:
...
> Measuring actual power supply output and system usage would involve 
> building or buying suitable test equipment.  The cost would be non-trivial.

...

  it depends upon how accurate you want to be and
how much power.

  for my system it was a simple matter of buying a
reasonably sized battery backup unit which includes
in its display the amount of power being drawn in
watts.

  on sale the backup unit cost about $150 USD.  if
i want to see what something draws i have a power 
cord set up to use for that and just plug it in
and watch the display as it operates.  if the 
device is a computer part i can plug it in to my
motherboard or via usb or ...  as long as it gets
done with a grounding strip and i do the power 
turn off and turn back on as is appropriate for
the device (and within ratings of my power supply).

  also use this setup to figure out how much power
the various wall warts are eating.  :(  switches on
all of them are worth the expense.


  songbird



Re: Weird behaviour on System under high load

2023-05-21 Thread Christian
>  Original Message 
> From: David Christensen 
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sun, 21 May 2023 03:11:43 -0700
> 
> On 5/21/23 01:14, Christian wrote:
> 
> > >  Original Message 
> > > From: David Christensen 
> > > To: debian-user@lists.debian.org
> > > Subject: Re: Weird behaviour on System under high load
> > > Date: Sat, 20 May 2023 18:00:48 -0700
> > > 
> > > On 5/20/23 14:46, Christian wrote:
> > > > Hi there,
> > > > 
> > > > I am having trouble with a newly built system. It works normally
> > > > and is stable until I put extreme stress on it, e.g. using all 12
> > > > cores with the stress tool.
> > > > 
> > > > The system will suddenly lose the network connection and become
> > > > unresponsive. Only a reset works. I am not sure what is going on,
> > > > but it is reproducible: Put stress on the system and it fails. It
> > > > seems that something is getting out of step.
> > > > 
> > > > Stuff below I found in the logs. I tried quite a bit, even
> > > > upgraded to bookworm, to see if the newer kernel works.
> > > > 
> > > > If anyone knows how to analyze this issue, it would be very
> > > > helpful.
> 
> 
> Please use inline posting style and proper indentation.

Phew... will be quite hard to read. But here you go.

> 
> 
> > > Have you verified that your PSU has sufficient capacity for the
> > > load on each and every rail?
> 
> > Hi there,
> >
> > Let's go through the different topics:
> > - Setup: It is an AMD 5600G
> 
> https://www.amd.com/en/products/apu/amd-ryzen-5-5600g
> 
> 65 W
> 
> 
> > on an ASRock B550M-ITX/ac,
> 
> 
> https://www.asrock.com/mb/AMD/B550M-ITXac/index.asp
> 
> 
> > powered by a BeQuiet SP7 300W
> >
> > - Power: From the specifications it should fit. As it takes 5-20
> > minutes for the error to occur, I would take that as an indication
> > that the power supply is ok. Otherwise I would expect it to fail
> > right away? Is there a way to measure/test if there is any issue
> > with it?
> > I also tested limiting PPT to 45W, which also makes no difference.
> 
> 
> If all you have is a motherboard, a 65W CPU, and an SSD, that looks
> like a good-quality 300W PSU, and I would think it should support
> long-term full loading of the CPU.  But, there is no substitute for
> doing the engineering.
> 
> 
> I do PSU calculations using a spreadsheet.  This requires finding
> power specifications (or making estimates) for everything in the
> system, which can be tough.
> 
> 
> BeQuiet has a PSU calculator.  I suggest using it:
> 
> https://www.bequiet.com/en/psucalculator
> 
> 
> Measuring actual power supply output and system usage would involve
> building or buying suitable test equipment.  The cost would be
> non-trivial.
> 
> 
> An easy A/B test would be to connect a known-good, high-quality PSU
> with a higher power rating (say, 500-1000W).  I use:
> 
> https://www.fractal-design.com/products/power-supplies/ion/ion-2-platinum-660w/black/
> 
Used the calculator; however, it might be that the onboard graphics is
not properly accounted for. Will see that I get a 500W PSU for testing.
> 
> > > Have you cleaned the system interior, filters, fans, heatsinks,
> > > ducts, etc., recently?
> 
> 
> ?
As written in the OP, the system is new. Only the PSU is reused. So it
is clean
> 
> 
> > > Have you tested the thermal solution(s) recently?
> 
> > - Thermal: I am observing the temperatures during the stress test.
> > If I am correct in reading Smbusmaster0, temps haven't been above
> > 71°C, but the error also occurs earlier, way below 70.
> 
> 
> Okay.
> 
> 
> What is your CPU thermal solution?
> 
What is a thermal solution?
> 
> What stress test are you using?
> 
stress running in s-tui
> 
> > > Have you tested the power supply recently?
> 
It was working before without issues, so not explicitly tested.
> 
> I suffered a rash of bad PSUs recently.  I was able to figure it out
> because I bought an inexpensive PSU tester years ago.  It has saved
> my sanity more than once.  I suggest that you buy something like it:
> 
> https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=antec+atx12+tester&_sacat=0

Re: Weird behaviour on System under high load

2023-05-21 Thread David Christensen

On 5/21/23 01:14, Christian wrote:


>  Original Message 
> From: David Christensen 
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sat, 20 May 2023 18:00:48 -0700
> 
> > On 5/20/23 14:46, Christian wrote:
> > 
> > > Hi there,
> > > 
> > > I am having trouble with a newly built system. It works normally
> > > and is stable until I put extreme stress on it, e.g. using all 12
> > > cores with the stress tool.
> > > 
> > > The system will suddenly lose the network connection and become
> > > unresponsive. Only a reset works. I am not sure what is going on,
> > > but it is reproducible: Put stress on the system and it fails. It
> > > seems that something is getting out of step.
> > > 
> > > Stuff below I found in the logs. I tried quite a bit, even
> > > upgraded to bookworm, to see if the newer kernel works.
> > > 
> > > If anyone knows how to analyze this issue, it would be very
> > > helpful.



Please use inline posting style and proper indentation.



> > Have you verified that your PSU has sufficient capacity for the load
> > on each and every rail?


> Hi there,
>
> Let's go through the different topics:
> - Setup: It is an AMD 5600G

https://www.amd.com/en/products/apu/amd-ryzen-5-5600g

65 W


> on an ASRock B550M-ITX/ac,


https://www.asrock.com/mb/AMD/B550M-ITXac/index.asp


> powered by a BeQuiet SP7 300W
>
> - Power: From the specifications it should fit. As it takes 5-20
> minutes for the error to occur, I would take that as an indication
> that the power supply is ok. Otherwise I would expect it to fail right
> away? Is there a way to measure/test if there is any issue with it?
> I also tested limiting PPT to 45W, which also makes no difference.


If all you have is a motherboard, a 65W CPU, and an SSD, that looks like 
a good-quality 300W PSU, and I would think it should support long-term 
full loading of the CPU.  But, there is no substitute for doing the 
engineering.



I do PSU calculations using a spreadsheet.  This requires finding power 
specifications (or making estimates) for everything in the system, which 
can be tough.



BeQuiet has a PSU calculator.  I suggest using it:

https://www.bequiet.com/en/psucalculator
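
As a rough illustration of the spreadsheet approach (the wattages below
are placeholder estimates, not measured values for this particular
build):

  # back-of-envelope power budget, estimates only
  awk 'BEGIN {
      cpu   = 88   # 65W-TDP Ryzen package power (PPT) estimate, W
      board = 25   # motherboard + RAM estimate, W
      nvme  = 8    # M.2 SSD estimate, W
      fans  = 5    # CPU + case fans estimate, W
      total = cpu + board + nvme + fans
      printf "estimated load: %d W of 300 W (%.0f%%)\n", total, total*100/300
  }'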


Measuring actual power supply output and system usage would involve 
building or buying suitable test equipment.  The cost would be non-trivial.



An easy A/B test would be to connect a known-good, high-quality PSU with 
a higher power rating (say, 500-1000W).  I use:


https://www.fractal-design.com/products/power-supplies/ion/ion-2-platinum-660w/black/



> > Have you cleaned the system interior, filters, fans, heatsinks,
> > ducts, etc., recently?



?



> > Have you tested the thermal solution(s) recently?


> - Thermal: I am observing the temperatures during the stress test. If
> I am correct in reading Smbusmaster0, temps haven't been above 71°C,
> but the error also occurs earlier, way below 70.


Okay.


What is your CPU thermal solution?


What stress test are you using?



> > Have you tested the power supply recently?



I suffered a rash of bad PSUs recently.  I was able to figure it out 
because I bought an inexpensive PSU tester years ago.  It has saved my 
sanity more than once.  I suggest that you buy something like it:


https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=antec+atx12+tester&_sacat=0



> > Have you tested the memory recently?


> - Memory: Yes, it was tested right after the build with no errors


Okay.


Did you do multi-threaded/stress tests?



> > Are you running Debian stable?
> 
> 
> > Are you running Debian stable packages only?  Were they all
> > installed with the same package manager?


> - OS: I was running Debian stable in quite a minimal configuration
> (fresh install, as most services are dockerized) when I first observed
> the error. Now moved to Debian 12/Bookworm to see if it makes any
> difference with a newer kernel (it does not). Also exchanged r8169 for
> r8168. It changes the error messages; however, the system instability
> stays.


Did you see the problems when running Debian stable OOTB, before adding 
anything?



Did you stress test the system before adding anything (other than the 
stress test)?




> > If all of the above are okay and the system is still locking up, I
> > would disable or remove all disks in the system, install a zeroed
> > SSD, install Debian stable choosing only "SSH server" and "standard
> > system utilities", install only the stable packages required for
> > your workload, put the workload on it, and see what happens.


> I could disconnect the disks and see if it makes any difference.
> However, when reproducing this error, disks other than the system
> disk were unmounted. So I would guess this would be a test to see if
> it is about power?


Stripping the system down to minimum hardware and software is a good 
starting point.  You will need a tool to load the system and some means 
to watch what happens.  Assuming the base configuration passes all 
tests, then add something, test, and repeat until testing fails.
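
A sketch of one such test cycle (it assumes the stress package is
installed and is run as root for dmesg; adjust the core count and
duration to taste):

  # one iteration: clear the kernel log, load the CPU, check for errors
  dmesg --clear                    # start from a clean kernel ring buffer
  stress --cpu 12 --timeout 1800   # 30 minutes of full load
  dmesg --level=err,warn           # anything new here is a finding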




Re: Weird behaviour on System under high load

2023-05-21 Thread Christian
Hi there,

Let's go through the different topics:
- Setup: It is an AMD 5600G on an ASRock B550M-ITX/ac, powered by a
BeQuiet SP7 300W

- Power: From the specifications it should fit. As it takes 5-20
minutes for the error to occur, I would take that as an indication
that the power supply is ok. Otherwise I would expect it to fail right
away? Is there a way to measure/test if there is any issue with it?
I also tested limiting PPT to 45W, which also makes no difference.

- Memory: Yes, it was tested right after the build with no errors

- Thermal: I am observing the temperatures during the stress test. If I
am correct in reading Smbusmaster0, temps haven't been above 71°C, but
the error also occurs earlier, way below 70.

- OS: I was running Debian stable in quite a minimal configuration
(fresh install, as most services are dockerized) when I first observed
the error. Now moved to Debian 12/Bookworm to see if it makes any
difference with a newer kernel (it does not). Also exchanged r8169 for
r8168 (see the sketch below). It changes the error messages; however,
the system instability stays.
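
(The usual Debian route for that driver swap, as a sketch -- it assumes
the r8168-dkms package is available from the configured non-free apt
sources, and is run as root:)

  # install the vendor driver and keep the in-kernel one from loading
  apt install r8168-dkms
  echo "blacklist r8169" > /etc/modprobe.d/blacklist-r8169.conf
  update-initramfs -u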

I could disconnect the disks and see if it makes any difference.
However, when reproducing this error, disks other than the system disk
were unmounted. So I would guess this would be a test to see if it is
about power?

 Original Message 
From: David Christensen 
To: debian-user@lists.debian.org
Subject: Re: Weird behaviour on System under high load
Date: Sat, 20 May 2023 18:00:48 -0700

On 5/20/23 14:46, Christian wrote:
> Hi there,
> 
> I am having trouble with a newly built system. It works normally and
> is stable until I put extreme stress on it, e.g. using all 12 cores
> with the stress tool.
> 
> The system will suddenly lose the network connection and become
> unresponsive. Only a reset works. I am not sure what is going on, but
> it is reproducible: Put stress on the system and it fails. It seems
> that something is getting out of step.
> 
> Stuff below I found in the logs. I tried quite a bit, even upgraded
> to bookworm, to see if the newer kernel works.
> 
> If anyone knows how to analyze this issue, it would be very helpful.
> 
> Kind regards
>    Christian
> 
> 
> 2023-05-20T20:12:17.054224+02:00 diskstation kernel: [ 1303.236428] ----[ cut here ]
> 2023-05-20T20:12:17.054234+02:00 diskstation kernel: [ 1303.236430]
> NETDEV WATCHDOG: enp3s0 (r8169): transmit queue 0 timed out
> 2023-05-20T20:12:17.054235+02:00 diskstation kernel: [ 1303.236437]
> WARNING: CPU: 5 PID: 2411 at net/sched/sch_generic.c:525
> dev_watchdog+0x207/0x210
> 2023-05-20T20:12:17.054236+02:00 diskstation kernel: [ 1303.236442]
> Modules linked in: eq3_char_loop(OE) rpi_rf_mod_led(OE) ledtrig_timer
> ledtrig_default_on xt_MASQUERADE nf_conntrack_netlink xfrm_user
> xfrm_algo xt_addrtype br_netfilter bridge stp llc overlay ip6t_rt
> nft_chain_nat nf_nat xt_set xt_tcpmss xt_tcpudp xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
> ip_set_hash_ip ip_set binfmt_misc nfnetlink nls_ascii nls_cp437 vfat
> fat amdgpu iwlmvm btusb intel_rapl_msr btrtl intel_rapl_common btbcm
> btintel edac_mce_amd btmtk mac80211 snd_hda_codec_realtek bluetooth
> snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi gpu_sched
> kvm_amd drm_buddy libarc4 snd_hda_intel drm_display_helper
> snd_intel_dspcfg snd_intel_sdw_acpi iwlwifi kvm cec snd_hda_codec
> jitterentropy_rng irqbypass rc_core snd_hda_core cfg80211 snd_hwdep
> drm_ttm_helper snd_pcm ttm drbg wmi_bmof rapl ccp snd_timer
> ansi_cprng
> drm_kms_helper sp5100_tco snd pcspkr ecdh_generic rng_core
> i2c_algo_bit
> watchdog soundcore k10temp rfkill hb_rf_usb_2(OE) ecc
> 2023-05-20T20:12:17.054240+02:00 diskstation kernel: [ 1303.236494]
> generic_raw_uart(OE) acpi_cpufreq button joydev evdev sg nct6775
> nct6775_core drm hwmon_vid fuse loop efi_pstore configfs efivarfs
> ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs
> blake2b_generic xor raid6_pq zstd_compress libcrc32c crc32c_generic
> dm_crypt dm_mod hid_generic usbhid hid sd_mod crc32_pclmul
> crc32c_intel
> ahci ghash_clmulni_intel sha512_ssse3 libahci xhci_pci sha512_generic
> xhci_hcd r8169 nvme realtek libata aesni_intel nvme_core t10_pi
> crypto_simd mdio_devres usbcore scsi_mod crc64_rocksoft_generic
> cryptd
> libphy crc64_rocksoft crc_t10dif i2c_piix4 crct10dif_generic
> crct10dif_pclmul crc64 crct10dif_common usb_common scsi_common video
> wmi gpio_amdpt gpio_generic
> 2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236534]
> CPU: 5 PID: 2411 Comm: stress Tainted: G   OE  6.1.0-9-
> amd64 #1  Debian 6.1.27-1
> 2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236536]
> Hardware name: To Be Filled By O.E.M. B550M-ITX/ac/B550M-ITX/ac, BIOS
> L2.62 01/31/2023
> 2023-05-20T20:12:17.

Re: Weird behaviour on System under high load

2023-05-20 Thread David Christensen

On 5/20/23 14:46, Christian wrote:

Hi there,

I am having trouble with a newly built system. It works normally and
is stable until I put extreme stress on it, e.g. using all 12 cores
with the stress tool.

The system will suddenly lose the network connection and become
unresponsive. Only a reset works. I am not sure what is going on, but
it is reproducible: Put stress on the system and it fails. It seems
that something is getting out of step.

Stuff below I found in the logs. I tried quite a bit, even upgraded to
bookworm, to see if the newer kernel works.

If anyone knows how to analyze this issue, it would be very helpful.

Kind regards
   Christian


2023-05-20T20:12:17.054224+02:00 diskstation kernel: [ 1303.236428] ----[ cut here ]
2023-05-20T20:12:17.054234+02:00 diskstation kernel: [ 1303.236430]
NETDEV WATCHDOG: enp3s0 (r8169): transmit queue 0 timed out
2023-05-20T20:12:17.054235+02:00 diskstation kernel: [ 1303.236437]
WARNING: CPU: 5 PID: 2411 at net/sched/sch_generic.c:525
dev_watchdog+0x207/0x210
2023-05-20T20:12:17.054236+02:00 diskstation kernel: [ 1303.236442]
Modules linked in: eq3_char_loop(OE) rpi_rf_mod_led(OE) ledtrig_timer
ledtrig_default_on xt_MASQUERADE nf_conntrack_netlink xfrm_user
xfrm_algo xt_addrtype br_netfilter bridge stp llc overlay ip6t_rt
nft_chain_nat nf_nat xt_set xt_tcpmss xt_tcpudp xt_conntrack
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
ip_set_hash_ip ip_set binfmt_misc nfnetlink nls_ascii nls_cp437 vfat
fat amdgpu iwlmvm btusb intel_rapl_msr btrtl intel_rapl_common btbcm
btintel edac_mce_amd btmtk mac80211 snd_hda_codec_realtek bluetooth
snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi gpu_sched
kvm_amd drm_buddy libarc4 snd_hda_intel drm_display_helper
snd_intel_dspcfg snd_intel_sdw_acpi iwlwifi kvm cec snd_hda_codec
jitterentropy_rng irqbypass rc_core snd_hda_core cfg80211 snd_hwdep
drm_ttm_helper snd_pcm ttm drbg wmi_bmof rapl ccp snd_timer ansi_cprng
drm_kms_helper sp5100_tco snd pcspkr ecdh_generic rng_core i2c_algo_bit
watchdog soundcore k10temp rfkill hb_rf_usb_2(OE) ecc
2023-05-20T20:12:17.054240+02:00 diskstation kernel: [ 1303.236494]
generic_raw_uart(OE) acpi_cpufreq button joydev evdev sg nct6775
nct6775_core drm hwmon_vid fuse loop efi_pstore configfs efivarfs
ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs
blake2b_generic xor raid6_pq zstd_compress libcrc32c crc32c_generic
dm_crypt dm_mod hid_generic usbhid hid sd_mod crc32_pclmul crc32c_intel
ahci ghash_clmulni_intel sha512_ssse3 libahci xhci_pci sha512_generic
xhci_hcd r8169 nvme realtek libata aesni_intel nvme_core t10_pi
crypto_simd mdio_devres usbcore scsi_mod crc64_rocksoft_generic cryptd
libphy crc64_rocksoft crc_t10dif i2c_piix4 crct10dif_generic
crct10dif_pclmul crc64 crct10dif_common usb_common scsi_common video
wmi gpio_amdpt gpio_generic
2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236534]
CPU: 5 PID: 2411 Comm: stress Tainted: G   OE  6.1.0-9-
amd64 #1  Debian 6.1.27-1
2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236536]
Hardware name: To Be Filled By O.E.M. B550M-ITX/ac/B550M-ITX/ac, BIOS
L2.62 01/31/2023
2023-05-20T20:12:17.054242+02:00 diskstation kernel: [ 1303.236537]
RIP: 0010:dev_watchdog+0x207/0x210
2023-05-20T20:12:17.054242+02:00 diskstation kernel: [ 1303.236540]
Code: 00 e9 40 ff ff ff 48 89 df c6 05 ff 5f 3d 01 01 e8 be 79 f9 ff 44
89 e9 48 89 de 48 c7 c7 c8 16 9b a8 48 89 c2 e8 09 d2 86 ff <0f> 0b e9
22 ff ff ff 66 90 0f 1f 44 00 00 55 53 48 89 fb 48 8b 6f
2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236541]
RSP: :a831c345fdc8 EFLAGS: 00010286
2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236543]
RAX:  RBX: 91a3c141 RCX: 
2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236544]
RDX: 0103 RSI: a893fa66 RDI: 
2023-05-20T20:12:17.054244+02:00 diskstation kernel: [ 1303.236545]
RBP: 91a3c1410488 R08:  R09: a831c345fc38
2023-05-20T20:12:17.054244+02:00 diskstation kernel: [ 1303.236546]
R10: 0003 R11: 91aafe27afe8 R12: 91a3c14103dc
2023-05-20T20:12:17.054245+02:00 diskstation kernel: [ 1303.236547]
R13:  R14: a7e2e7a0 R15: 91a3c1410488
2023-05-20T20:12:17.054245+02:00 diskstation kernel: [ 1303.236548] FS:
7f169849d740() GS:91aade34() knlGS:
2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236550] CS:
0010 DS:  ES:  CR0: 80050033
2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236551]
CR2: 55d05c3f4000 CR3: 000103cf2000 CR4: 00750ee0
2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236552]
PKRU: 5554
2023-05-20T20:12:17.054247+02:00 diskstation kernel: [ 1303.236553]
Call Trace:
2023-05-20T20:12:17.054247+02:00 diskstation kernel: [ 1303.236554]

2023-05-20T20:12:17.054248+02:00 diskstation kernel: [