On Fri, Jan 15, 1999 at 09:12:07PM -0800, Archie Cobbs wrote:
> I was thinking about the DIAGNOSTICS replacement macros and
> had a random thought...
> Suppose you're sitting in front of a ddb (or better yet gdb) prompt
> because your kernel has just crashed due to who knows what reason.
> What do you do to debug this? You start looking at variables,
> memory, etc for anything funny going on.
> For example, several times we've spent hours going through a crash
> dump to find, for example, that a process was on two queues, or
> some mbuf was mangled, etc.
> The thought is that it would be really easy to help automate this
> process, by doing the following:
>  1. Define a new kernel option INCLUDE_SANITY_CHECKS (or whatever)


Hey, I just happen to remember that somebody added this a couple of
days ago - hmm, could it have been me?  :-)

>  2. When this is defined, all the various FreeBSD kernel
>     submodules (VM, networking, device drivers, etc) would
>     include a function that exhaustively runs sanity checks --
>     ie, validations that all the assumptions in the code are true --
>     for that particular submodule. This means checking all queues,
>     flags, whatever.

Ie, invariants.

>  4. The function is linked into a linker set SANITY_SET(...) or whatever

I've not thought of that - that may be a good idea.
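
To make the linker set idea concrete, here is a rough userland sketch,
not kernel code.  It assumes an ELF toolchain (GCC or Clang with GNU ld
or lld), which provides __start_/__stop_ symbols for any section whose
name is a valid C identifier.  The names SANITY_SET, sanity_fn,
run_sanity_checks and the two example checks are made up for
illustration; the real kernel linker set macros may look different.

```c
#include <stddef.h>

/* Hypothetical signature: each check returns the number of problems found. */
typedef int (*sanity_fn)(void);

/*
 * Store a pointer to the check in a dedicated ELF section.  The section
 * name and this macro are invented for the sketch.
 */
#define SANITY_SET(fn)                                          \
    static sanity_fn const sanity_entry_##fn                    \
        __attribute__((section("sanity_set"), used)) = fn

/* Provided by the linker: bounds of the "sanity_set" section. */
extern sanity_fn const __start_sanity_set[];
extern sanity_fn const __stop_sanity_set[];

/* Two stand-in checks; a real one would walk queues, mbufs and so on. */
static int check_queues(void) { return 0; }  /* pretend all queues are fine */
static int check_mbufs(void)  { return 2; }  /* pretend two mbufs are mangled */

SANITY_SET(check_queues);
SANITY_SET(check_mbufs);

/* Run every registered check; this is what you would call from the debugger. */
int run_sanity_checks(void)
{
    int problems = 0;

    for (sanity_fn const *p = __start_sanity_set; p < __stop_sanity_set; p++)
        problems += (*p)();
    return problems;
}
```

The point of the set is that each submodule registers its own check
next to its own code, and the caller never needs a central list.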

> Then by simply calling this function from the debugger you can
> much more quickly narrow down on the problem (and hopefully fix
> it before you get tired and go to sleep :-)
> Moreover, since the function is running post-mortem, it can do
> very detailed checks that would otherwise take way too long.
> E.g., check every mbuf, every queue entry, check the filesystem,
> etc. Basically a "fsck" for the kernel memory.

You do not want to call this only post-mortem.  You often want to
use it selectively while the kernel is running.

Example: At one point (a year and a half or so ago), I was debugging the
tty driver in bisdn.  For some reason, it was crashing in various ways
at various times, with no sane reason - just garbage data.  I spent
quite a bit of time looking at this, finding no reason for the faults
- they "just happened", taking on average perhaps 4 hours under
load to trigger.

As I was getting more and more frustrated with attempting to shotgun
debug this, I went back to my normal mode of development - I wrote
invariants for all data structures in the vicinity.  When I added an
invariant for the clist structures (and checks of it all over the
place), I found that my "crash" (now an invariant incorrect panic)
time went down to two minutes - and that it was always the same way,
with the same stack backtrace, instead of crashing at various random
places.
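
For readers who have not written one, an invariant of this kind might
look roughly like the following.  This is a userland sketch on a
made-up doubly linked queue, not the clist code I actually checked;
struct queue, queue_invariant, QUEUE_CHECK and the INVARIANT_CHECKS
switch are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    struct node *prev;
};

struct queue {
    struct node *head;
    struct node *tail;
    size_t count;
};

/* Returns 1 if the queue is internally consistent, 0 otherwise. */
int queue_invariant(const struct queue *q)
{
    const struct node *prev = NULL;
    size_t seen = 0;

    for (const struct node *n = q->head; n != NULL; n = n->next) {
        if (n->prev != prev)
            return 0;           /* back pointer does not match the walk */
        if (++seen > q->count)
            return 0;           /* too many nodes; also stops on a cycle */
        prev = n;
    }
    return q->tail == prev && seen == q->count;
}

/* Compiled in only when asked for, so normal builds pay nothing. */
#ifdef INVARIANT_CHECKS
#define QUEUE_CHECK(q) assert(queue_invariant(q))
#else
#define QUEUE_CHECK(q) ((void)0)
#endif

/* Append n to the tail, checking the invariant on entry and exit. */
void queue_push(struct queue *q, struct node *n)
{
    QUEUE_CHECK(q);
    n->next = NULL;
    n->prev = q->tail;
    if (q->tail != NULL)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    q->count++;
    QUEUE_CHECK(q);
}
```

Checking at the entry and exit of every operation is what turns a
random crash hours later into a failure close to the code that broke
the structure.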

The reason for the bug turned out to be that both I and the
implementor of the driver had missed the change of spls from levels in
BSD4.4 to masks in FreeBSD.  After I had seen the invariant failure, I
could see that something was being interrupted between two spls - and
after 3 minutes of reading the FreeBSD manpage and three lines of
changes I had something that worked.

That driver had been non-functional for at least three releases of
bisdn (and the userland code to handle it was not even there, which I
expect was due to this).  I further expect that somebody had tried
pretty hard to debug it, as they had spent the time to actually write
it.  The fact that I (who at that point had little experience with
the FreeBSD kernel) was able to debug that in a couple of hours
where others had used more time and failed before me shows some of the
power of invariants for finding obscure bugs.

I would like to have invariants available for all significant data
structures, and I'm planning to write them up as I get time for it.

> Is this something that people would be motivated enough to make
> as "official" FreeBSD kernel good housekeeping policy?

I suspect a large number of us will use it, making it likely it will
sort of maintain itself.

