Re: building netbsd-9 2 'sync' processes stuck in 'tstile'

John D. Baker Sat, 08 May 2021 11:02:25 -0700

On Sat, 8 May 2021, Robert Elz wrote:

>   | I just ran a full forced 'fsck -yf' on it just prior to these events.
>   | That was prompted by CVS failing to clean up a directory.
> 
> That seems like an unusual response, using fsck to fix things (I assume
> on an unmounted filesystem, otherwise it is definitely wrong) isn't typically
> needed - that is required after the system has crashed,  possibly
> leaving unsaved updates, which need to be repaired (made consistent
> at least).   But as long as the system is still running, nothing is
> lost, and the filesystems should all be fine (if not there are far more
> serious problems - booting after an unclean shutdown without having done
> a fsck can get you into that kind of situation).


In this case, there is a directory, but when CVS tries to delete it, it
reports "Could not delete <some directory>: no such file or directory"
and aborts the update.  Re-running the update fails the same way.  Trying
to do so manually produces the same result.  The filesystem always
reports being clean, but 'fsck -yf' always finds problems with the file
or directory in question, usually missing "." and/or ".." for directories,
sometimes an impossibly large block number.

I wait until the system is quiescent and/or clients have finished or
reached a convenient stopping point, reboot single user, manually bring
up the RAID, check parity and then run 'fsck -yf' on everything, just
to be sure, then reboot again.
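
Roughly, that sequence looks like the sketch below.  The device name
"raid0" and the config path "/etc/raid0.conf" are placeholders assumed
purely for illustration, not the actual setup here, and the script only
prints the commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the single-user check sequence.
# RAID_DEV and RAID_CONF are hypothetical placeholders, not the real setup.
RAID_DEV="${RAID_DEV:-raid0}"
RAID_CONF="${RAID_CONF:-/etc/raid0.conf}"

print_check_sequence() {
    # Bring up the RAID set manually from its configuration file.
    echo "raidctl -c $RAID_CONF $RAID_DEV"
    # Check parity and rewrite it if it is not known to be good.
    echo "raidctl -P $RAID_DEV"
    # Forced filesystem check, answering yes to every repair prompt.
    echo "fsck -yf"
    # Reboot again once the checks have finished.
    echo "reboot"
}

print_check_sequence
```

Printing instead of executing is deliberate: these commands only make
sense from single-user mode with the filesystems unmounted.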

>   | I get those
>   | from time to time after the near-catastrophic events that prompted
>   | kern/55115.  I used to get them frequently.  Now they are less common.
>   | The carnage might still have caught the build this time.
> 
> First, that PR is apparently fixed now right?   It is still waiting feedback
> from you to confirm that.

I'm waiting for my clients' tasks to finish so I can reboot the machine,
test with a -current kernel containing the fix and, if successful, request
pullup to netbsd-9.

> If the disk controller is still not working properly, then almost anything
> is possible.  If it is, then provided everything looks clean to fsck, there
> should be nothing which would trigger a kernel locking problem - those tend
> to be more caused by internal race conditions (sometimes by little used error
> paths forgetting to release a semaphore).

It's not that the controller is malfunctioning, per se, but that when
I rebooted the machine with a kernel after MSI was enabled for siisata(4),
this controller couldn't cope with that and my then-autoconfigured RAID
got hosed.  I recovered using 'raidctl -C' to force configuration, then
rebuilt parity and fixed the filesystem, but there has been lingering
damage from that event that I've been cleaning up ever since.  As I
said, these problems used to happen more frequently, but as more and
more blocks get allocated, new allocations occasionally stray into areas
that still have problems.

Before I reboot I'll see about getting a backtrace on the stuck processes
as suggested by Greg Woods.

-- 
|/"\ John D. Baker, KN5UKS               NetBSD     Darwin/MacOS X
|\ / jdbaker[snail]consolidated[flyspeck]net  OpenBSD            FreeBSD
| X  No HTML/proprietary data in email.   BSD just sits there and works!
|/ \ GPGkeyID:  D703 4A7E 479F 63F8 D3F4  BD99 9572 8F23 E4AD 1645
