[email protected] (Stuart Henderson), 2020.11.27 (Fri) 19:34 (CET):
> On 2020/11/27 18:50, Mark Kettenis wrote:
> > > Date: Fri, 27 Nov 2020 18:43:47 +0100
> > > From: Marcus MERIGHI <[email protected]>
> > > 
> > > [email protected] (Stuart Henderson), 2020.11.27 (Fri) 17:54 (CET):
> > > > On 2020/11/27 16:21, Marcus MERIGHI wrote:
> > > > > It happened again; anything I should do when "syncing disks..." is 
> > > > > done?
> > > 
> > > This time around it doesn't seem to finish "syncing disks..." and drop
> > > into ddb>. So it can't be rebooted via "boot reboot". Is there a way to
> > > reboot via the serial console? Sending a BREAK (~#) doesn't seem to do
> > > anything...
> > > 
> > > > Can you try downgrading the BIOS to 4.11.0.4?
> > > > https://pcengines.github.io/#mr-33
> > > 
> > > Will do, as soon as the machine is rebooted. Thanks for the pointer!
> > > (You mention 4.11.0.4, but your link goes to 4.11.0.5?)

Stuart, after your "scratch that" statement I will skip this.

> > Frankly I think this issue is a kernel bug, where somehow the sysctl
> > code that reports on open files is racing against code that closes
> > those files or otherwise messes with the associated data structures.
> > I bet that if you stop the process that is doing those sysctl calls,
> > things will run stable again.

Mark, after you wrote that, I went for the machine's CARP sibling and
chased down anything that would do filesystem access.
 
> fstat was running on Marcus' machine.

There was a cron job, run every minute, that used fstat to look for
unbound's DNS-over-TLS sockets and report them to syslog. (There were
reliability problems with DoT on this machine in the past.)
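
A job of this kind might have looked roughly like the sketch below. The
actual script was not posted, so the _unbound user name, the DoT port 853
and the logger tag are all assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical reconstruction of an "fstat for DoT sockets" cron job.
# Assumptions (not from the original mail): unbound runs as the
# _unbound user, DoT uses the standard port 853, and the syslog tag
# is "unbound-dot".

# Keep only lines whose last field ends in port 853, i.e. the remote
# end of an outgoing DNS-over-TLS connection in numeric fstat output.
filter_dot() {
    grep -E ':853[[:space:]]*$'
}

# List files opened by _unbound in numeric form, keep the DoT sockets
# and hand them to syslog. Each fstat run asks the kernel for the
# open-file table, which is the code path Mark suspects.
report_dot() {
    fstat -n -u _unbound | filter_dot | logger -t unbound-dot
}

# From cron, the script would simply call:
# report_dot
```

Keeping the filter separate from the fstat call makes it possible to
check the pattern against saved output without touching the live kernel
tables.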

> > Given what you wrote about the configuration of the machine I'd say
> > this is related to sockets and missing locking in/against the network
> > stack.  Unfortunately the traces you showed so far don't really give
> > me any clues.

Is there anything I can do once it happens again? 
Any ddb> commands I should run?

Other things that used to run and were disabled on the sibling follow.
After these measures disk access, as shown in systat vmstat, dropped to
1 to 4 xfers every 2 to 3 seconds. The machine is still up and running,
though it's the weekend now and the load is lower anyway (200-300
interrupts/s instead of 2500-4000).

Logging was set to a memory buffer and another machine, but I had
overlooked the first line in the default syslog.conf(5), thus there was
still file system access by syslogd(8). Disabled now.

Disabled newsyslog in root's crontab, too; nothing to rotate, anyways.

All other default cron jobs (daily, weekly, monthly, spamd-setup) are
still active.

Another cron job (*/5) was "/sbin/atactl /dev/sd0c smartstatus".

"syspatch -c" and "pkg_add -un | grep -v 'signed on'" were set to run
once during the night.

There was a shell script, run from cron every minute, that looked for
pppx(4) interfaces with ifconfig(8) and added some groups to the
interface for the user. This shouldn't have caused disk i/o, though.

Then there was a job, every five minutes, that used ftp(1) to fetch a
web page and dump it to /dev/null. (don't ask for the reasons!)

Every four hours a DNS block list was fetched and written to disk.

Every two days a script fetched geolocation IPs and dumped them to disk.

Every ten minutes "date; npppctl -n session brief" was dumped to a file.

After that I stopped non-essential services: 
- one ran "iostat 60 | logger -t iostat"
- one ran "netstat -b -n -w 60 -I em0 | logger -t nete"
- one ran "netstat -b -n -w 60 -I em1 | logger -t netl"
- one ran "netstat -b -n -w 60 -I em2 | logger -t netw"
- one ran some connectivity checks and logged them with logger(1)
  I could not find any reason for filesystem access in this script.
- I disabled pflogd(8)
- I disabled relayd(8), which wasn't doing anything anyway.

Marcus
