Re: heartbeat panic by heavy traffic

2023-09-15 Thread Manuel Bouyer
On Fri, Sep 15, 2023 at 02:00:31PM -, Michael van Elst wrote:
> bou...@antioche.eu.org (Manuel Bouyer) writes:
> 
> >But the clock softint shouldn't be locked out for 16s, ever.
> 
> Then the clock softint must have a higher priority than
> everything else including hard interrupts.
> 
> Obviously that's not how the system is designed: there
> are no limits on how long specific events may take and
> thus no guarantee that lower priority tasks actually
> execute within a certain time. That would be some kind
> of real-time system.

But obviously such events are not expected to take a long time, or
they would have been deferred to lower priority, preemptible tasks.
Letting such events run for a long time wedges the system.

I still maintain that the bug here is the network soft interrupt running
for such a long time without giving other tasks a chance to run.

> 
> Such systems also rarely panic if they detect a violation
> of their rules.
> 
> In any case, locking out lower priority tasks by an
> overwhelmed network layer probably isn't the bug that
> we are looking for.

I disagree. And the heartbeat panic is here to help locate such bugs.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: heartbeat panic by heavy traffic

2023-09-15 Thread Michael van Elst
bou...@antioche.eu.org (Manuel Bouyer) writes:

>But the clock softint shouldn't be locked out for 16s, ever.

Then the clock softint must have a higher priority than
everything else including hard interrupts.

Obviously that's not how the system is designed: there
are no limits on how long specific events may take and
thus no guarantee that lower priority tasks actually
execute within a certain time. That would be some kind
of real-time system.

Such systems also rarely panic if they detect a violation
of their rules.

In any case, locking out lower priority tasks by an
overwhelmed network layer probably isn't the bug that
we are looking for.



Re: heartbeat panic by heavy traffic

2023-09-15 Thread Manuel Bouyer
On Fri, Sep 15, 2023 at 09:19:04AM -, Michael van Elst wrote:
> mar...@duskware.de (Martin Husemann) writes:
> 
> >On Fri, Sep 15, 2023 at 12:17:58PM +0900, Masanobu SAITOH wrote:
> >> I think it would be good to change the default behavior from
> >> panic to something else because the GENERIC kernel enables HEARTBEAT
> >> by default. One idea is to print a warning message at sufficient
> >> intervals.
> 
> >I disagree. It is very important that we fix the underlying problem
> >instead. Without heartbeat, this behaviour is still visible (but
> >undiagnosable).
> 
> The crash here comes from how the network stack operates. Running at
> a higher priority, it locks out the lower priority clock softint
> and heartbeat detects that and crashes the system intentionally.

But the clock softint shouldn't be locked out for 16s, ever.
It means that userland processes are stuck too, as well as kernel threads.

This is a real bug: the network stack should be fixed to yield at
regular intervals.
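
To make the idea of a network path that periodically gives way concrete,
here is a minimal user-space sketch in Python. It is a model only, not
NetBSD code: the names drain_with_budget and run, and the notion of a
per-pass "budget", are assumptions introduced for illustration. The point
it shows is that a bounded amount of work per pass leaves regular gaps in
which lower priority work, such as the clock softint, could run.

```python
from collections import deque


def drain_with_budget(queue, budget, handle):
    """Handle at most `budget` queued items, then return.

    Returns True if items are still queued, so the caller knows it has to
    come back later instead of looping until the queue is empty.
    (Hypothetical helper, not a NetBSD interface.)
    """
    processed = 0
    while queue and processed < budget:
        handle(queue.popleft())
        processed += 1
    return bool(queue)


def run(packets, budget):
    """Drain all packets in bounded passes and count the gaps between them."""
    queue = deque(packets)
    handled = []
    yields = 0
    while drain_with_budget(queue, budget, handled.append):
        # In a real system this is where lower priority work would get
        # a turn before packet processing resumes.
        yields += 1
    return handled, yields


if __name__ == "__main__":
    handled, yields = run(range(10), budget=4)
    print(len(handled), "packets handled,", yields, "yield points")
```

All packets are still handled in order; the only change is that no single
pass can run longer than the budget allows.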

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: heartbeat panic by heavy traffic

2023-09-15 Thread Michael van Elst
mar...@duskware.de (Martin Husemann) writes:

>On Fri, Sep 15, 2023 at 12:17:58PM +0900, Masanobu SAITOH wrote:
>> I think it would be good to change the default behavior from
>> panic to something else because the GENERIC kernel enables HEARTBEAT
>> by default. One idea is to print a warning message at sufficient intervals.

>I disagree. It is very important that we fix the underlying problem
>instead. Without heartbeat, this behaviour is still visible (but undiagnosable).

The crash here comes from how the network stack operates. Running at
a higher priority, it locks out the lower priority clock softint
and heartbeat detects that and crashes the system intentionally.

I don't consider that useful even in a test environment.
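
For readers following the thread, the detection mechanism described in this
message can be modelled in a few lines of Python. This is a simplified
sketch, not the NetBSD implementation: the class and function names are
hypothetical, time is counted in whole seconds, and the 15 second threshold
used in the example is an assumption chosen only to be in the range of the
16 s lockout mentioned in the thread. A low priority context records a
"beat"; a check that keeps running during the lockout reports when the last
beat is too old.

```python
class HeartbeatStalled(Exception):
    """Raised by the model where a real kernel would panic."""


class Heartbeat:
    def __init__(self, max_stall_secs):
        # Assumed threshold; not taken from the NetBSD sources.
        self.max_stall_secs = max_stall_secs
        self.last_beat = 0

    def beat(self, now):
        # Stands in for the low priority clock softint making progress.
        self.last_beat = now

    def check(self, now):
        # Stands in for a check that still runs while the softint is
        # locked out.
        stalled = now - self.last_beat
        if stalled > self.max_stall_secs:
            raise HeartbeatStalled(f"no beat for {stalled}s")
        return stalled


def simulate(busy_from, busy_until, total_secs, max_stall_secs):
    """Return the second at which the stall is detected, or None.

    Higher priority work locks out the beat during the half-open
    interval [busy_from, busy_until).
    """
    hb = Heartbeat(max_stall_secs)
    for now in range(1, total_secs + 1):
        locked_out = busy_from <= now < busy_until
        if not locked_out:
            hb.beat(now)
        try:
            hb.check(now)
        except HeartbeatStalled:
            return now
    return None


if __name__ == "__main__":
    print("long lockout detected at:", simulate(5, 30, 60, 15))
    print("short lockout detected at:", simulate(5, 10, 60, 15))
```

In this model a short burst of higher priority work goes unnoticed, while
a lockout longer than the threshold is reported at the first check after
the threshold is exceeded.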



Re: heartbeat panic by heavy traffic

2023-09-15 Thread Martin Husemann
On Fri, Sep 15, 2023 at 12:17:58PM +0900, Masanobu SAITOH wrote:
> I think it would be good to change the default behavior from
> panic to something else because the GENERIC kernel enables HEARTBEAT
> by default. One idea is to print a warning message at sufficient intervals.

I disagree. It is very important that we fix the underlying problem
instead. Without heartbeat, this behaviour is still visible (but undiagnosable).

Martin