[email protected] (Stuart Henderson), 2020.11.27 (Fri) 19:34 (CET):
> On 2020/11/27 18:50, Mark Kettenis wrote:
> > > Date: Fri, 27 Nov 2020 18:43:47 +0100
> > > From: Marcus MERIGHI <[email protected]>
> > >
> > > [email protected] (Stuart Henderson), 2020.11.27 (Fri) 17:54 (CET):
> > > > On 2020/11/27 16:21, Marcus MERIGHI wrote:
> > > > > It happened again; anything I should do when "syncing disks..." is
> > > > > done?
> > >
> > > This time around it doesn't seem to finish "syncing disks..." and drop
> > > into ddb>. So it can't be rebooted via "boot reboot". Is there a way to
> > > reboot via the serial console? Sending a BREAK (~#) doesn't seem to do
> > > anything...
> > >
> > > > Can you try dowgrading the bios to 4.11.0.4?
> > > > https://pcengines.github.io/#mr-33
> > >
> > > Will do, as soon as the machine is rebooted. Thanks for the pointer!
> > > (You mention 4.11.0.4, but your link goes to 4.11.0.5?)

Stuart, after your "scratch that" statement I will skip this.

> > Frankly I think this issue is a kernel bug, where somehow the sysctl
> > code that reports on open files is racing against code that closes
> > those files or otherwise messes with the associated data structures.
> > I bet that if you stop the process that is doing those sysctl calls,
> > things will run stable again.

Mark, after you wrote that, I went to the machine's CARP sibling and
chased down anything that would do filesystem access.

> fstat was running on Marcus' machine. There was a cron job every minute
> that used fstat to look for unbound's DNS-over-TLS sockets and report
> them to syslog. (There were reliability problems with DoT on this
> machine in the past)

> > Given what you wrote about the configuration of the machine I'd say
> > this is related to sockets and missing locking in/against the network
> > stack. Unfortunately the traces you showed so far don't really give
> > me any clues.

Is there anything I can do once it happens again? Any ddb> commands I
should run?

The other things that used to run and were disabled on the sibling are
listed below. After these measures, disk access, as shown in
"systat vmstat", dropped to 1 to 4 xfers every 2 to 3 seconds. The
machine is still up and running, though now it's the weekend and the
load is lower anyway (200-300 interrupts/s instead of 2500-4000).

Logging was set to a memory buffer and another machine, but I had
overlooked the first line in the default syslog.conf(5), so there was
still file system access by syslogd(8). Disabled now.

Disabled newsyslog in root's crontab, too; nothing to rotate, anyway.
All other default cron jobs (daily, weekly, monthly, spamd-setup) are
still active.

Another cron job (*/5) was "/sbin/atactl /dev/sd0c smartstatus".

"syspatch -c" and "pkg_add -un | grep -v 'signed on'" were set to run
once during the night.

There was a cron job (every minute) shell script that looked for
pppx(4)s with ifconfig(8) and added some groups to the interface for the
user. This shouldn't have caused disk i/o, though.

Then there was a job, every five minutes, that used ftp(1) to fetch a
web page and dump it to /dev/null. (don't ask for the reasons!)

Every four hours a DNS block list was fetched and written to disk.

Every two days a script fetched geolocation IPs and dumped them to disk.

Every ten minutes "date; npppctl -n session brief" was dumped to a file.

After that I stopped non-essential services:

- one ran "iostat 60 | logger -t iostat"
- one ran "netstat -b -n -w 60 -I em0 | logger -t nete"
- one ran "netstat -b -n -w 60 -I em1 | logger -t netl"
- one ran "netstat -b -n -w 60 -I em2 | logger -t netw"
- one ran some connectivity checks and logged them with logger(1)
  I could not find any reason for filesystem access in this script.
- I disabled pflogd(8)
- I disabled relayd(8), which didn't do anything anyways, though.

Marcus
