Re: Call for testing: New kernel heartbeat(9) checks

2023-07-07 Thread PHO

On 7/7/23 22:11, Taylor R Campbell wrote:

FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
heartbeat(9) that will make the system crash rather than hang when
CPUs are stuck in certain ways that hardware watchdog timers can't
detect (or on systems without hardware watchdog timers). [...]


This is a NetBSD/amd64 guest with 2 virtual CPUs, running on VMware:


1.  cpuctl offline 0
sleep 20
cpuctl online 0


No panics.


2.  cpuctl offline 1
sleep 20
cpuctl online 1


No panics.


3.  cpuctl offline 0
sysctl -w kern.heartbeat.max_period=5
sleep 10
sysctl -w kern.heartbeat.max_period=0
sleep 10
sysctl -w kern.heartbeat.max_period=15
sleep 20
cpuctl online 0


No panics.


4.  sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=2   # IPL_SOFTCLOCK
# verify system panics after 15sec


Changing spl_spinout hangs sysctl. The kernel panics after 15 seconds:

Jul  8 22:16:13 netbsd-current /netbsd: [ 231.3581695] crashme_sysctl_forwarder:208: invoking "spl_spinout" (infinite loop at raised spl)
Jul  8 22:16:13 netbsd-current /netbsd: [ 231.3581695] crashme_spl_spinout: raising ipl to 2
Jul  8 22:16:13 netbsd-current /netbsd: [ 231.3581695] crashme_spl_spinout: raised ipl to 2, s=0
Jul  8 22:16:13 netbsd-current /netbsd: [ 247.0084882] cpu0: found cpu1 heart stopped beating after 16 seconds
Jul  8 22:16:13 netbsd-current /netbsd: [ 247.0084882] panic: cpu1[1743 sysctl]: heart stopped beating



5.  sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
# verify system panics after 15sec


Like 4 but it panics with a different message:

Jul  8 22:23:24 netbsd-current /netbsd: [ 411.0078445] panic: cpu0: softints stuck for 16 seconds



6.  cpuctl offline 0
sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=2   # IPL_SOFTCLOCK
# verify system panics after 15sec


It panics after 15 seconds:

Jul  8 22:27:04 netbsd-current /netbsd: [ 200.0060379] panic: cpu1: softints stuck for 16 seconds



7.  cpuctl offline 0
sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
# verify system panics after 15sec


It panics after 15 seconds:

Jul  8 22:29:45 netbsd-current /netbsd: [ 142.0029650] panic: cpu1: softints stuck for 16 seconds


daily CVS update output

2023-07-07 Thread NetBSD source update


Updating src tree:
P src/crypto/external/bsd/openssh/dist/LICENCE
P src/crypto/external/bsd/openssh/dist/auth-passwd.c
P src/doc/HACKS
P src/external/gpl3/binutils/dist/bfd/elf64-alpha.c
U src/share/man/man9/heartbeat.9
P src/sys/arch/amd64/conf/ALL
P src/sys/arch/xen/xen/xen_clock.c
P src/sys/dev/ic/dm9000.c
P src/sys/dev/pci/virtio_pci.c
P src/sys/dev/virtio/virtio_mmio.c
P src/sys/kern/files.kern
P src/sys/kern/init_main.c
P src/sys/kern/kern_clock.c
P src/sys/kern/kern_cpu.c
P src/sys/kern/kern_crashme.c
U src/sys/kern/kern_heartbeat.c
P src/sys/kern/kern_lock.c
P src/sys/kern/subr_xcall.c
P src/sys/sys/cpu_data.h
U src/sys/sys/heartbeat.h
P src/sys/sys/param.h
P src/tests/usr.bin/xlint/lint1/c11_atomic.c
P src/tests/usr.bin/xlint/lint1/c11_generic_expression.c
P src/tests/usr.bin/xlint/lint1/c99_bool_strict_suppressed.c
P src/tests/usr.bin/xlint/lint1/d_alignof.c
P src/tests/usr.bin/xlint/lint1/d_c99_compound_literal_comma.c
P src/tests/usr.bin/xlint/lint1/d_c99_decls_after_stmt.c
P src/tests/usr.bin/xlint/lint1/d_c99_union_cast.c
P src/tests/usr.bin/xlint/lint1/d_cast_fun_array_param.c
P src/tests/usr.bin/xlint/lint1/d_compound_literals1.c
P src/tests/usr.bin/xlint/lint1/d_ellipsis_in_switch.c
P src/tests/usr.bin/xlint/lint1/d_fold_test.c
P src/tests/usr.bin/xlint/lint1/d_gcc_compound_statements2.c
P src/tests/usr.bin/xlint/lint1/d_gcc_func.c
P src/tests/usr.bin/xlint/lint1/d_gcc_variable_array_init.c
P src/tests/usr.bin/xlint/lint1/d_init_array_using_string.c
P src/tests/usr.bin/xlint/lint1/d_init_pop_member.c
P src/tests/usr.bin/xlint/lint1/d_long_double_int.c
P src/tests/usr.bin/xlint/lint1/d_pr_22119.c
P src/tests/usr.bin/xlint/lint1/d_return_type.c
P src/tests/usr.bin/xlint/lint1/decl.c
P src/tests/usr.bin/xlint/lint1/expr_cast.c
P src/tests/usr.bin/xlint/lint1/gcc_attribute_aligned.c
P src/tests/usr.bin/xlint/lint1/gcc_attribute_var.c
P src/tests/usr.bin/xlint/lint1/gcc_builtin_alloca.c
P src/tests/usr.bin/xlint/lint1/gcc_builtin_overflow.c
P src/tests/usr.bin/xlint/lint1/gcc_cast_union.c
P src/tests/usr.bin/xlint/lint1/gcc_stmt_asm.c
P src/tests/usr.bin/xlint/lint1/gcc_typeof_after_statement.c
P src/tests/usr.bin/xlint/lint1/init_braces.c
P src/tests/usr.bin/xlint/lint1/msg_003.c
P src/tests/usr.bin/xlint/lint1/msg_011.c
P src/tests/usr.bin/xlint/lint1/msg_012.c
P src/tests/usr.bin/xlint/lint1/msg_021.c
P src/tests/usr.bin/xlint/lint1/msg_023.c
P src/tests/usr.bin/xlint/lint1/msg_028.c
P src/tests/usr.bin/xlint/lint1/msg_030.c
P src/tests/usr.bin/xlint/lint1/msg_032.c
P src/tests/usr.bin/xlint/lint1/msg_043.c
P src/tests/usr.bin/xlint/lint1/msg_050.c
P src/tests/usr.bin/xlint/lint1/msg_052.c
P src/tests/usr.bin/xlint/lint1/msg_053.c
P src/tests/usr.bin/xlint/lint1/msg_057.c
P src/tests/usr.bin/xlint/lint1/msg_059.c
P src/tests/usr.bin/xlint/lint1/msg_062.c
P src/tests/usr.bin/xlint/lint1/msg_063.c
P src/tests/usr.bin/xlint/lint1/msg_072.c
P src/tests/usr.bin/xlint/lint1/msg_083.c
P src/tests/usr.bin/xlint/lint1/msg_084.c
P src/tests/usr.bin/xlint/lint1/msg_090.c
P src/tests/usr.bin/xlint/lint1/msg_092.c
P src/tests/usr.bin/xlint/lint1/msg_093.c
P src/tests/usr.bin/xlint/lint1/msg_094.c
P src/tests/usr.bin/xlint/lint1/msg_095.c
P src/tests/usr.bin/xlint/lint1/msg_096.c
P src/tests/usr.bin/xlint/lint1/msg_097.c
P src/tests/usr.bin/xlint/lint1/msg_099.c
P src/tests/usr.bin/xlint/lint1/msg_103.c
P src/tests/usr.bin/xlint/lint1/msg_104.c
P src/tests/usr.bin/xlint/lint1/msg_106.c
P src/tests/usr.bin/xlint/lint1/msg_107.c
P src/tests/usr.bin/xlint/lint1/msg_108.c
P src/tests/usr.bin/xlint/lint1/msg_109.c
P src/tests/usr.bin/xlint/lint1/msg_110.c
P src/tests/usr.bin/xlint/lint1/msg_113.c
P src/tests/usr.bin/xlint/lint1/msg_114.c
P src/tests/usr.bin/xlint/lint1/msg_116.c
P src/tests/usr.bin/xlint/lint1/msg_117.c
P src/tests/usr.bin/xlint/lint1/msg_118.c
P src/tests/usr.bin/xlint/lint1/msg_119.c
P src/tests/usr.bin/xlint/lint1/msg_120.c
P src/tests/usr.bin/xlint/lint1/msg_121.c
P src/tests/usr.bin/xlint/lint1/msg_122.c
P src/tests/usr.bin/xlint/lint1/msg_124.c
P src/tests/usr.bin/xlint/lint1/msg_125.c
P src/tests/usr.bin/xlint/lint1/msg_126.c
P src/tests/usr.bin/xlint/lint1/msg_128.c
P src/tests/usr.bin/xlint/lint1/msg_132_lp64.c
P src/tests/usr.bin/xlint/lint1/msg_133.c
P src/tests/usr.bin/xlint/lint1/msg_136.c
P src/tests/usr.bin/xlint/lint1/msg_138.c
P src/tests/usr.bin/xlint/lint1/msg_143.c
P src/tests/usr.bin/xlint/lint1/msg_144.c
P src/tests/usr.bin/xlint/lint1/msg_145.c
P src/tests/usr.bin/xlint/lint1/msg_146.c
P src/tests/usr.bin/xlint/lint1/msg_149.c
P src/tests/usr.bin/xlint/lint1/msg_159.c
P src/tests/usr.bin/xlint/lint1/msg_163.c
P src/tests/usr.bin/xlint/lint1/msg_164.c
P src/tests/usr.bin/xlint/lint1/msg_165.c
P src/tests/usr.bin/xlint/lint1/msg_166.c
P src/tests/usr.bin/xlint/lint1/msg_167.c
P src/tests/usr.bin/xlint/lint1/msg_169.c
P src/tests/usr.bin/xlint/lint1/msg_170.c
P src/tests/usr.bin/xlint/lint1/msg_171.c
P src/tests/usr.bin/xlint/lint1/msg_174.c
P 

Re: modesetting vs intel in 10.0

2023-07-07 Thread nia
On Fri, Jul 07, 2023 at 08:18:18PM +0100, David Brownlee wrote:
> On Fri, 7 Jul 2023 at 19:43, nia  wrote:
> >
> > After some testing on a Skylake machine, I've concluded
> > that xf86-video-modesetting is far superior to xf86-video-intel
> > on that generation of Intel hardware - the most obvious thing
> > is that modesetting has functional VSync and superior 3D performance
> > with less tearing.
> >
> > The only problem is that we default to intel; modesetting has to be
> > chosen explicitly through xorg.conf.
> >
> > I also found similar problems in "radeon", but found that
> > modesetting would somehow pick a display mode that the monitor
> > didn't support. Maybe this is actually a drmkms bug - I'm not
> > sure.
> >
> > But maybe modesetting is mature enough (and intel bad enough)
> > to warrant being the default for Intel GPUs.
> 
> Could we start with some form of whitelist to pick modesetting over intel?
> 
> David

Maybe GPUs released after intel became abandonware in 2014 or so...


Automated report: NetBSD-current/i386 build success

2023-07-07 Thread NetBSD Test Fixture
The NetBSD-current/i386 build is working again.

The following commits were made between the last failed build and the
first successful build:

2023.07.07.18.02.52 riastradh src/sys/kern/kern_lock.c 1.186

Logs can be found at:


http://releng.NetBSD.org/b5reports/i386/commits-2023.07.html#2023.07.07.18.02.52


Re: modesetting vs intel in 10.0

2023-07-07 Thread David Brownlee
On Fri, 7 Jul 2023 at 19:43, nia  wrote:
>
> After some testing on a Skylake machine, I've concluded
> that xf86-video-modesetting is far superior to xf86-video-intel
> on that generation of Intel hardware - the most obvious thing
> is that modesetting has functional VSync and superior 3D performance
> with less tearing.
>
> The only problem is that we default to intel; modesetting has to be
> chosen explicitly through xorg.conf.
>
> I also found similar problems in "radeon", but found that
> modesetting would somehow pick a display mode that the monitor
> didn't support. Maybe this is actually a drmkms bug - I'm not
> sure.
>
> But maybe modesetting is mature enough (and intel bad enough)
> to warrant being the default for Intel GPUs.

Could we start with some form of whitelist to pick modesetting over intel?

David


modesetting vs intel in 10.0

2023-07-07 Thread nia
After some testing on a Skylake machine, I've concluded
that xf86-video-modesetting is far superior to xf86-video-intel
on that generation of Intel hardware - the most obvious thing
is that modesetting has functional VSync and superior 3D performance
with less tearing.

The only problem is that we default to intel; modesetting has to be
chosen explicitly through xorg.conf.

I also found similar problems in "radeon", but found that
modesetting would somehow pick a display mode that the monitor
didn't support. Maybe this is actually a drmkms bug - I'm not
sure.

But maybe modesetting is mature enough (and intel bad enough)
to warrant being the default for Intel GPUs.
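For anyone who wants to try it, the explicit selection looks like the
following xorg.conf fragment (the Identifier string is arbitrary):

```
Section "Device"
    Identifier "Card0"
    Driver "modesetting"
EndSection
```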


Re: Call for testing: New kernel heartbeat(9) checks

2023-07-07 Thread Manuel Bouyer
On Fri, Jul 07, 2023 at 05:10:33PM +, Taylor R Campbell wrote:
> > Date: Fri, 7 Jul 2023 17:56:42 +0200
> > From: Manuel Bouyer 
> > 
> > On Fri, Jul 07, 2023 at 01:11:54PM +, Taylor R Campbell wrote:
> > > - The magic numbers for debug.crashme.spl_spinout are for evbarm.
> > >   On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
> 
> Correction: IPL_SOFTCLOCK=2.
> 
> > > 1.cpuctl offline 0
> > >   sleep 20
> > >   cpuctl online 0
> > 
> > With this I get a panic on Xen:
> > [ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
> > [...]
> > [  53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
> 
> This was a mistake that arose because I was testing on aarch64 where
> kpreempt_disabled() is always true.  Update and try again, please!
> 
> sys/kern/kern_heartbeat.c 1.2
> sys/kern/subr_xcall.c 1.36

Yes, with these (and using 2 for IPL_SOFTCLOCK) every test passes now.
Thanks!  This already allowed me to fix a small bug in Xen's clock
initialisation :)

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Automated report: NetBSD-current/i386 build failure

2023-07-07 Thread NetBSD Test Fixture
This is an automatically generated notice of a NetBSD-current/i386
build failure.

The failure occurred on babylon5.netbsd.org, a NetBSD/amd64 host,
using sources from CVS date 2023.07.07.17.05.13.

An extract from the build.sh output follows:

--- kern_reboot.d ---
mv -f kern_reboot.d.tmp kern_reboot.d
--- kern-INSTALL_XEN3PAE_DOMU ---
At top level:
/tmp/build/2023.07.07.17.05.13-i386/src/sys/kern/kern_lock.c:199:26: error: 'kernel_lock_holder' defined but not used [-Werror=unused-variable]
/tmp/build/2023.07.07.17.05.13-i386/src/sys/kern/kern_lock.c:146:1: error: 'kernel_lock_trace_ipi' defined but not used [-Werror=unused-function]
  146 | kernel_lock_trace_ipi(void *cookie)
      | ^
--- kern-MONOLITHIC ---
--- isapnpdebug.d ---
#create  MONOLITHIC/isapnpdebug.d

The following commits were made between the last successful build and
the first failed build:

2023.07.07.17.04.49 riastradh src/sys/kern/subr_xcall.c 1.36
2023.07.07.17.05.13 riastradh src/sys/kern/kern_heartbeat.c 1.2
2023.07.07.17.05.13 riastradh src/sys/kern/kern_lock.c 1.185

Logs can be found at:


http://releng.NetBSD.org/b5reports/i386/commits-2023.07.html#2023.07.07.17.05.13


Re: Call for testing: New kernel heartbeat(9) checks

2023-07-07 Thread Taylor R Campbell
> Date: Fri, 7 Jul 2023 17:56:42 +0200
> From: Manuel Bouyer 
> 
> On Fri, Jul 07, 2023 at 01:11:54PM +, Taylor R Campbell wrote:
> > - The magic numbers for debug.crashme.spl_spinout are for evbarm.
> >   On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.

Correction: IPL_SOFTCLOCK=2.

> > 1.  cpuctl offline 0
> > sleep 20
> > cpuctl online 0
> 
> With this I get a panic on Xen:
> [ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
> [...]
> [  53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158

This was a mistake that arose because I was testing on aarch64 where
kpreempt_disabled() is always true.  Update and try again, please!

sys/kern/kern_heartbeat.c 1.2
sys/kern/subr_xcall.c 1.36

> > 4.  sysctl -w debug.crashme_enable=1
> > sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
> > # verify system panics after 15sec
> 
> my sysctl command did hang, but the system didn't panic

Right -- I made a mistake in my call for testing.  On x86,
IPL_SOFTCLOCK is 2, not 1, which is IPL_PREEMPT, a special ipl that
doesn't apply here.  So use this instead on x86:

sysctl -w debug.crashme.spl_spinout=2

(Not sure if it's different on Xen -- if it is, use whatever
IPL_SOFTCLOCK is there.)

> > 5.  sysctl -w debug.crashme_enable=1
> > sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
> > # verify system panics after 15sec
> 
> This one did panic

Great!

> > 6.  cpuctl offline 0
> > sysctl -w debug.crashme_enable=1
> > sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
> > # verify system panics after 15sec
> 
> my sysctl command did hang, but the system didn't panic

Same as with (4), use 2 instead of 1 here (or whatever is the right
value on Xen).

> > 7.  cpuctl offline 0
> > sysctl -w debug.crashme_enable=1
> > sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
> > # verify system panics after 15sec
> 
> and this one did panic

Great, thanks!


Re: Call for testing: New kernel heartbeat(9) checks

2023-07-07 Thread Manuel Bouyer
On Fri, Jul 07, 2023 at 01:11:54PM +, Taylor R Campbell wrote:
> FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
> heartbeat(9) that will make the system crash rather than hang when
> CPUs are stuck in certain ways that hardware watchdog timers can't
> detect (or on systems without hardware watchdog timers).
> 
> It's optional for now, but it's small and I'd like to make it
> mandatory in the future.  If you'd like to try it out, add the
> following two lines to your kernel config:
> 
> options   HEARTBEAT
> options   HEARTBEAT_MAX_PERIOD_DEFAULT=15
> 
> You can disable it with `sysctl -w kern.heartbeat.max_period=0' at
> runtime, or use that knob to change the maximum period before the
> system will crash if not all (online) CPUs have made progress.
> 
> 
> Here are some manual tests that you can use to exercise it -- these
> are manual tests, not automatic tests, because some will deliberately
> crash the kernel to make sure the diagnostic works, and the others, if
> broken, will also crash the kernel.
> 
> Notes:
> - The magic numbers for debug.crashme.spl_spinout are for evbarm.
>   On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
>   For other architectures, consult the source for the numbers to use.
> - If you're on a single-CPU system, skip the cpuctl offline/online
>   tests and just do (4) and (5).
> - If you're on a >2-CPU system, then for the cpuctl offline/online
>   tests, try offlining all CPUs but one at a time.
> 
> 1.cpuctl offline 0
>   sleep 20
>   cpuctl online 0

With this I get a panic on Xen:
[ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
[ 225.4605386] cpu0: Begin traceback...
[ 225.4605386] vpanic() at netbsd:vpanic+0x163
[ 225.4605386] kern_assert() at netbsd:kern_assert+0x4b
[ 225.4705333] heartbeat_resume() at netbsd:heartbeat_resume+0x82
[ 225.4705333] cpu_xc_online() at netbsd:cpu_xc_online+0x11
[ 225.4705333] xc_thread() at netbsd:xc_thread+0xc8
[ 225.4705333] cpu0: End traceback...
[ 225.4705333] fatal breakpoint trap in supervisor mode
[ 225.4705333] trap type 1 code 0 rip 0x8022e96d cs 0xe030 rflags 0x202 cr2 0x9b8030d32000 ilevel 0 rsp 0x9b8030985dd0
[ 225.4705333] curlwp 0x9b80007c6900 pid 0.7 lowest kstack 0x9b80309812c0
Stopped in pid 0.7 (system) at  netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x163
kern_assert() at netbsd:kern_assert+0x4b
heartbeat_resume() at netbsd:heartbeat_resume+0x82
cpu_xc_online() at netbsd:cpu_xc_online+0x11
xc_thread() at netbsd:xc_thread+0xc8

Is this expected?  Nothing looks Xen-specific here.


> 
> 2.cpuctl offline 1
>   sleep 20
>   cpuctl online 1

same panic

> 
> 3.cpuctl offline 0
>   sysctl -w kern.heartbeat.max_period=5
>   sleep 10
>   sysctl -w kern.heartbeat.max_period=0
>   sleep 10
>   sysctl -w kern.heartbeat.max_period=15
>   sleep 20
>   cpuctl online 0

Here we have:
#sysctl -w kern.heartbeat.max_period=15
[  53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
[  53.5704682] cpu0: Begin traceback...
[  53.5704682] vpanic() at netbsd:vpanic+0x163
[  53.5704682] kern_assert() at netbsd:kern_assert+0x4b
[  53.5704682] heartbeat_resume() at netbsd:heartbeat_resume+0x82
[  53.5704682] xc_thread() at netbsd:xc_thread+0xc8
[  53.5704682] cpu0: End traceback...


> 
> 4.sysctl -w debug.crashme_enable=1
>   sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
>   # verify system panics after 15sec

my sysctl command did hang, but the system didn't panic

> 
> 5.sysctl -w debug.crashme_enable=1
>   sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
>   # verify system panics after 15sec

This one did panic
> 
> 6.cpuctl offline 0
>   sysctl -w debug.crashme_enable=1
>   sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
>   # verify system panics after 15sec

my sysctl command did hang, but the system didn't panic

> 
> 7.cpuctl offline 0
>   sysctl -w debug.crashme_enable=1
>   sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
>   # verify system panics after 15sec

and this one did panic

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Call for testing: New kernel heartbeat(9) checks

2023-07-07 Thread Taylor R Campbell
FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
heartbeat(9) that will make the system crash rather than hang when
CPUs are stuck in certain ways that hardware watchdog timers can't
detect (or on systems without hardware watchdog timers).

It's optional for now, but it's small and I'd like to make it
mandatory in the future.  If you'd like to try it out, add the
following two lines to your kernel config:

options HEARTBEAT
options HEARTBEAT_MAX_PERIOD_DEFAULT=15

You can disable it with `sysctl -w kern.heartbeat.max_period=0' at
runtime, or use that knob to change the maximum period before the
system will crash if not all (online) CPUs have made progress.


Here are some manual tests that you can use to exercise it -- these
are manual tests, not automatic tests, because some will deliberately
crash the kernel to make sure the diagnostic works, and the others, if
broken, will also crash the kernel.

Notes:
- The magic numbers for debug.crashme.spl_spinout are for evbarm.
  On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
  For other architectures, consult the source for the numbers to use.
- If you're on a single-CPU system, skip the cpuctl offline/online
  tests and just do (4) and (5).
- If you're on a >2-CPU system, then for the cpuctl offline/online
  tests, try offlining all CPUs but one at a time.

1.  cpuctl offline 0
sleep 20
cpuctl online 0

2.  cpuctl offline 1
sleep 20
cpuctl online 1

3.  cpuctl offline 0
sysctl -w kern.heartbeat.max_period=5
sleep 10
sysctl -w kern.heartbeat.max_period=0
sleep 10
sysctl -w kern.heartbeat.max_period=15
sleep 20
cpuctl online 0

4.  sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
# verify system panics after 15sec

5.  sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
# verify system panics after 15sec

6.  cpuctl offline 0
sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
# verify system panics after 15sec

7.  cpuctl offline 0
sysctl -w debug.crashme_enable=1
sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
# verify system panics after 15sec


Re: cpu temperature readings

2023-07-07 Thread Robert Elz
Sorry, no, I can't test: the system looks to have died and certainly
needs repairs.  It looks as if the cooler might be dead (not sure about
the CPU at the minute; it won't even boot to the stage where the BIOS
enables the display).

However, much of what your patch does (according to your description,
at the minute my method of e-mail access doesn't rise to looking at
attachments) is what I actually put in the kernel I was running.
I didn't change the lower limit on the range, for me that was clearly
not a problem, and I made the upper limit 130 instead of 120 (though
as it turned out, either would do).   My system sets Tjmax to 115.
That seems to be a constant, every read of the register, on all cores,
produces 115 (I added a diagnostic to tell me if it ever changed).

I can't comment on what should be done in the case of the value being
outside the expected range - I don't know enough about PC hardware to
know whether or not there are systems which might return garbage in
that register - if there are, then settling on a default to use sounds
like the right way, but if nothing is known to do that, then just believing
what the CPU says (as we do with any other register) - with or without
a diagnostic to the console - would probably be better.

This time (Wednesday evening) when the system shut down, I had just
finished a build and run (some) ATF tests to check some changes I was
making to sh quoting in some of the test scripts (there is some horribly
bogus nonsense around...  though as long as the data being used doesn't
change, there are no adverse effects in the tests I looked at, so the
change I was working on should be made, but won't actually change
any results - the ATF test runs I did verified that).

During the build the core temps were fluctuating about Tjmax (115)
which I didn't consider all that abnormal (the previous build I did,
before the Tjmax adjustment, did much the same thing).  The difference
this time was that things never cooled down after the build finished.

Further, before I could get to commit the changes, the "critical temp"
bit started being set (all cores) and powerd shut down the system.
I had a diagnostic to print the register that has the bit in it,
and also (if it managed to read properly) the temp that had been
read (in micro-kelvins, as the value has been converted by this time).
I took a photo of the data on the screen while (some of) that data
was visible, if it is likely to be useful to anyone.
(That's on my phone, so no problem accessing it).

Note that until my system gets repaired (and I won't even start looking
for a local reputable repair place until Monday at least) I am going
to be fairly sluggish accessing e-mail (I won't be looking very
frequently, and might easily miss messages when I do look, as I
get to see incoming messages without any spam filtering yet when I
access it this way).

kre



Re: cpu temperature readings

2023-07-07 Thread Masanobu SAITOH
Hi, all.

Could you test the following diff?

http://www.netbsd.org/~msaitoh/coretemp-20230707-0.dif

In the draft of the commit message:
--
coretemp(4): Change limits of Tjmax.

 - Change the lower limit from 70 to 60.  At least some BIOSes can change
   the value down to 62.
 - Change the upper limit from 110 to 120.  At least some BIOSes can change
   the value up to 115.
 - Print error message when rdmsr(TEMPERATURE_TARGET) failed.
#if 1
 - Print error message when Tjmax exceeded the limit.
#else
 - When Tjmax exceeded the limit, print warning message and use the value
   as it is.
#endif
--

In the "#if 1" part, the default value (100) is used for Tjmax if it
exceeds the limits; this is the same behaviour as before, apart from the
wider range.  In the "#else" part, the value read from the MSR is used
as-is even if it exceeds the limits.

Which one do you think is better?

-- 
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)


Re: Issues with X in NetBSD-current

2023-07-07 Thread RVP

On Fri, 7 Jul 2023, Brett Lymn wrote:


http://ftp.netbsd.org/pub/NetBSD/misc/blymn/dmesg.capture-errors

Has anyone any hints I can try?



Looks like PR kern/57440.

I get those same endless "heartbeat" messages if I

a) use any resolution _other than_ 1024x768 in the bootloader.
b) use any font _other than_ the built-in ones (8x16, 16x32) when
   compiling the kernel.

If I leave things at the default, then the modesetting driver works
reasonably OK (it has a tendency to hang if a lot of text is being spewed
on the screen--like with a `find /'--but, that's a problem even on
Linux...).

-RVP


Re: Remove extra unlock in dm9000 driver

2023-07-07 Thread Martin Husemann
On Thu, Jul 06, 2023 at 10:10:37PM -0400, Lwazi Dube wrote:
>  ec->ec_flags |= ETHER_F_ALLMULTI;
> -ETHER_UNLOCK(ec);

I fixed it slightly different, thanks for catching it!

Martin


Issues with X in NetBSD-current

2023-07-07 Thread Brett Lymn


Folks,

I updated to -current a couple of weeks ago and I am now having issues with
the built-in X.

I have a Fujitsu laptop with an i7-4500U CPU; prior to the update I was
using i915 acceleration and it was working fine (the version of -current
I was using was positively geriatric, though).

Now, after updating, when I try to start X I just get a blank screen and
I cannot switch back to a text console.  The machine is still running,
since it will shut down cleanly if I tap the power button.  I normally
can't ssh into the laptop, simply because I use it on the train and so
lack another client.

One thing I have done is capture the dmesg once the screen has hung, by
adding a dmesg call, redirected to a file, to the power button powerd script.
I have had a look at the Xorg.log but cannot see anything amiss there.
The full dmesg can be found at:

http://ftp.netbsd.org/pub/NetBSD/misc/blymn/dmesg.capture-errors

Has anyone any hints I can try?

-- 
Brett Lymn
--
Sent from my NetBSD device.

"We are were wolves",
"You mean werewolves?",
"No we were wolves, now we are something else entirely",
"Oh"