On Sat, 8 May 2021, Robert Elz wrote:

> | I just ran a full forced 'fsck -yf' on it just prior to these events.
> | That was prompted by CVS failing to clean up a directory.
>
> That seems like an unusual response, using fsck to fix things (I assume
> on an unmounted filesystem, otherwise it is definitely wrong) isn't typically
> needed - that is required after the system has crashed, possibly
> leaving unsaved updates, which need to be repaired (made consistent
> at least). But as long as the system is still running, nothing is
> lost, and the filesystems should all be fine (if not there are far more
> serious problems - booting after an unclean shutdown without having done
> a fsck can get you into that kind of situation).

In this case, there is a directory, but when CVS tries to delete it,
it reports "Could not delete <some directory>: no such file or
directory" and aborts the update. Re-running the update fails the
same way. Trying to delete the directory manually produces the same
result.

The filesystem always reports being clean, but 'fsck -yf' always finds
problems with the file or directory in question, usually missing "."
and/or ".." for directories, sometimes an impossibly large block number.

I wait until the system is quiescent and/or clients have finished or
reached a convenient stopping point, reboot single user, manually
bring up the RAID, check parity and then run 'fsck -yf' on everything,
just to be sure, then reboot again.

> | I get those
> | from time to time after the near-catastrophic events that prompted
> | kern/55115. I used to get them frequently. Now they are less common.
> | The carnage might still have caught the build this time.
>
> First, that PR is apparently fixed now right? It is still waiting feedback
> from you to confirm that.

I'm waiting for my clients' tasks to finish so I can reboot the machine,
test with a -current kernel containing the fix and, if successful,
request pullup to netbsd-9.

> If the disk controller is still not working properly, then almost anything
> is possible. If it is, then provided everything looks clean to fsck, there
> should be nothing which would trigger a kernel locking problem - those tend
> to be more caused by internal race conditions (sometimes by little used error
> paths forgetting to release a semaphore).

It's not that the controller is malfunctioning, per se, but that when
I rebooted the machine with a kernel after MSI was enabled for
siisata(4), this controller couldn't cope with that and my
then-autoconfigured RAID got hosed. I recovered by using 'raidctl -C'
to force configuration, then rebuilding parity and fixing the
filesystem, but there has been lingering damage from that event that
I've been cleaning up ever since.

As I said, these problems used to happen more frequently, but as more
and more blocks get allocated, new allocations occasionally stray into
areas that still have problems.

Before I reboot I'll see about getting a backtrace on the stuck
processes as suggested by Greg Woods.

--
|/"\ John D. Baker, KN5UKS               NetBSD     Darwin/MacOS X
|\ / jdbaker[snail]consolidated[flyspeck]net  OpenBSD            FreeBSD
| X  No HTML/proprietary data in email.   BSD just sits there and works!
|/ \ GPGkeyID:  D703 4A7E 479F 63F8 D3F4  BD99 9572 8F23 E4AD 1645
