On 01/12/25(Mon) 20:04, Alexander Bluhm wrote:
> On Mon, Dec 01, 2025 at 03:23:18PM +0100, Martin Pieuchot wrote:
> > Thanks a lot for this report.  It helps me a lot to understand the
> > existing limitation of OpenBSD's pdaemon.
> 
> It passes regress on i386 with the machine that panicked before.
> 
> I tried make release with this diff.
> 
> After some time I lost the ssh connection.  Note that the SSH timeout
> is configured rather short in my setup.  This happened quite often
> before.  It is a short-lived hang; the machine reacts normally after
> a while.

I observe the same when the machine is swapping. 

> ===> gnu/usr.bin/clang/include/llvm/X86
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/obj/../../../llvm-tblgen/llvm-tblgen
>  -gen-subtarget  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/include
>  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86
>   -o X86GenSubtargetInfo.inc 
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86/X86.td
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/obj/../../../llvm-tblgen/llvm-tblgen
>  -gen-register-info  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/include
>  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86
>   -o X86GenRegisterInfo.inc 
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86/X86.td
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/obj/../../../llvm-tblgen/llvm-tblgen
>  -gen-register-bank  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/include
>  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86
>   -o X86GenRegisterBank.inc 
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86/X86.td
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/obj/../../../llvm-tblgen/llvm-tblgen
>  -gen-x86-mnemonic-tables -asmwriternum=1  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/include
>  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86
>   -o X86GenMnemonicTables.inc 
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86/X86.td
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/obj/../../../llvm-tblgen/llvm-tblgen
>  -gen-instr-info  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/include
>  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86
>   -o X86GenInstrInfo.inc 
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86/X86.td
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/obj/../../../llvm-tblgen/llvm-tblgen
>  -gen-global-isel  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/include
>  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86
>   -o X86GenGlobalISel.inc 
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86/X86.td
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/obj/../../../llvm-tblgen/llvm-tblgen
>  -gen-fast-isel  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/include
>  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86
>   -o X86GenFastISel.inc 
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86/X86.td
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/obj/../../../llvm-tblgen/llvm-tblgen
>  -gen-exegesis  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/include
>  
> -I/usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86
>   -o X86GenExegesis.inc 
> /usr/src/gnu/usr.bin/clang/include/llvm/X86/../../../../../llvm/llvm/lib/Target/X86/X86.td
> Timeout, server ot2 not responding.
> 
> So I tried to rebuild the directory src/gnu/usr.bin/clang/include/llvm/X86
> and then a top(1) output got stuck for a while and later continued.
> 
> load averages:  1.74,  0.75,  1.13               ot2.obsd-lab.genua.de 
> 17:59:55
> 53 processes: 52 idle, 1 on processor                        up 0 days 
> 01:06:29
> CPU0:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  0.0% intr,  100% idle
> CPU1:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  0.0% intr,  100% idle
> CPU2:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  0.0% intr,  100% idle
> CPU3:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  0.0% intr,  100% idle
> CPU4:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  0.0% intr,  100% idle
> CPU5:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  0.0% intr,  100% idle
> CPU6:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  0.0% intr,  100% idle
> CPU7:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  0.0% intr,  100% idle
> Memory: Real: 2235M/3223M act/tot Free: 32K Cache: 161M Swap: 399M/3556M
> 
>   PID USERNAME PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
>  2595 root     -18    0  432M  123M sleep/6   flt_nor   0:16 28.32% 
> llvm-tblgen
> 56241 root     -18    0  431M  112M sleep/6   flt_nor   0:15 26.90% 
> llvm-tblgen
> 40502 root     -18    0  431M  105M sleep/6   flt_nor   0:15 26.61% 
> llvm-tblgen
> 61936 root     -18    0  459M   54M sleep/6   flt_nor   0:15 25.78% 
> llvm-tblgen
> 54450 root     -18    0  431M  111M sleep/7   flt_nor   0:15 25.63% 
> llvm-tblgen
> 33131 root     -18    0  431M   70M sleep/6   flt_nor   0:15 25.59% 
> llvm-tblgen
> 32823 root     -18    0  602M  208M sleep/6   flt_nor   0:14 22.75% 
> llvm-tblgen
> 14235 root       2    0 1628K 1212K sleep/1   kqread    0:03  5.62% 
> sshd-sessio
> 91133 root       2    0 2280K 2072K sleep/3   kqread    0:02  3.27% tmux
> 
> At least no crash, but it is hard to tell if the situation improved.
> I have seen such hangs before; this is not a regression.

Yes, I'm well aware of the problem, and the diff I sent you is a step
towards fixing it.

The problem lies in the current design of the page daemon, which takes
multiple iterations before freeing any pages.  Once the diff I sent you
is in, I'll send the next one to address this.
 
> On the 12 CPU machine where I saw the crash before I am nearly hitting
> end of swap.  Maybe that is another can of worms.

It is indeed a different can of worms.  That said, I'm *deeply*
interested in the bugs you hit with this machine.  I'm grateful for the
awesome bug reports you're sending; they allow me to improve the
swapper step by step.

Would you please try this diff on that machine and let me know if you
trigger another panic?

> load averages:  0.17,  0.26,  1.04               ot4.obsd-lab.genua.de 
> 20:02:31
> 86 processes: 85 idle, 1 on processor                        up 0 days 
> 03:45:33
> 12  CPUs:  0.0% user,  0.0% nice,  0.8% sys,  0.0% spin,  0.0% intr, 99.2% 
> idle
> Memory: Real: 1653M/2858M act/tot Free: 133M Cache: 149M Swap: 3272M/3319M
> 
>   PID USERNAME PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> 93949 build     -5    0  458M  235M sleep/4   biowait   0:19  0.68% 
> llvm-tblgen
> 65515 build     -5    0  463M  105M sleep/3   biowait   0:21  0.24% 
> llvm-tblgen
> 36109 build     -5    0  767M  324M sleep/9   biowait   0:20  0.05% 
> llvm-tblgen
> 89976 build     -5    0  465M  111M sleep/2   biowait   0:20  0.05% 
> llvm-tblgen
> 77887 build     -5    0  458M  206M sleep/6   biowait   0:19  0.05% 
> llvm-tblgen
> 43170 build     -5    0  132M 8652K sleep/7   biowait   0:19  0.05% 
> llvm-tblgen
> 98370 build     -5    0  463M   91M sleep/2   biowait   0:19  0.05% 
> llvm-tblgen
> 13101 root       2    0 1684K  808K sleep/2   kqread    0:06  0.05% 
> sshd-sessio
> 54645 root      29    0 1360K 2184K onproc/1  -         0:01  0.05% top
> 47918 build     -5    0  427M  244M sleep/11  biowait   0:21  0.00% 
> llvm-tblgen
>  6953 build     -5    0  465M  100M sleep/3   biowait   0:19  0.00% 
> llvm-tblgen
> 54170 build     -5    0   95M 6096K sleep/5   biowait   0:19  0.00% 
> llvm-tblgen
> 78266 build     -5    0  154M 8956K sleep/2   biowait   0:18  0.00% 
> llvm-tblgen
> 24448 build     -5    0  127M 7016K sleep/2   biowait   0:18  0.00% 
> llvm-tblgen
> 77594 root       2    0 3532K 2348K sleep/1   kqread    0:11  0.00% tmux
> 31856 _snmpd     2    0 4652K  908K sleep/1   kqread    0:01  0.00% snmpd
> 75039 _syslogd   2    0 1352K  568K sleep/2   kqread    0:01  0.00% syslogd
> 11862 root       2    0  980K 1464K idle      kqread    0:00  0.00% sshd
> 
> When looking at the syzkaller mailing list or my testing, I have
> the impression that we have more stability problems in the kernel
> than usual.  But they occur randomly; it is hard to find the moment
> when they started or what caused them.

I hear you.  That said, I doubt the syzkaller bugs are related to
swapping and the page daemon.

> I cannot test a single diff in this area and say whether it fixes anything.
> I might find regressions.  The only way to move forward is to fix
> bugs, commit them, and look at the stability of all test machines.
> This includes syzkaller, my setup, anton's machines, and of course
> all the snapshot users who run current.

I agree.  Does that mean you're OK with the diff?

