RE: make -j 128 world hang....

John Baldwin Tue, 23 Jan 2001 10:08:55 -0800

On 23-Jan-01 Poul-Henning Kamp wrote:
> 
> I can still stall my 2xPII/350 machine with a make -j 128 world,
> but it is slightly different now I think:  I can break into ddb.
> 
> This machine is a ahc/scsi machine, so I don't know if this is
> really SMP or Justins recent changes...
> 
> Poul-Henning

Ok, the problem children here are the 'allproc' waiters, which are waiting on
the allproc lock.  I have had this lockup occur once so far.  It appears to be
a deadlock in the lockmgr (surprise, surprise).  When I had this, the allproc
lockmgr lock had 1 pending exclusive lock and 4 pending shared locks, and 1
existing shared lock.  However, the actual lockmgr struct itself thought it had
1 existing shared lock and 5 pending shared locks.  *sigh*  I tried to dig
around and determined that my smp_hlt.patch had made it worse because cpu0 had
basically HLT'd and never been woken up, and cpu1 was spinning forever in
lockmgr in the atkbd0 ithread.  I got no farther than that, however:


> 55818 cdc1ba80 cdc6c000    0 55775 16280 004006  3  allproc c02e08a0 cc

Exclusive lock for wait4() or exit1().  I think exit1().

>    13 cbc73400 cc4c1000    0     0     0 00020c  3  allproc c02e08a0 swi6: clock
>    12 cbc73620 cc4bf000    0     0     0 000204  3  allproc c02e08a0 swi1: net

These are both shared lock waiters.  The fact that softclock() is blocked is
why the machine "locks up".  It is blocked in schedcpu(), and no timeouts are
being called.

I have a vmcore and kernel.debug and have futzed around in gdb with them for a
while but don't know why it is locked up.  IIRC, the atkbd thread was stuck
here:

/*
 * This is the waitloop optimization, and note for this to work
 * simple_lock and simple_unlock should be subroutines to avoid
 * optimization troubles.
 */
static int
apause(struct lock *lkp, int flags)
{
#ifdef SMP
        int i, lock_wait;
#endif

        if ((lkp->lk_flags & flags) == 0)
                return 0;
#ifdef SMP
        for (lock_wait = LOCK_WAIT_TIME; lock_wait > 0; lock_wait--) {
                mtx_exit(lkp->lk_interlock, MTX_DEF);
                for (i = LOCK_SAMPLE_WAIT; i > 0; i--)
                        if ((lkp->lk_flags & flags) == 0)
                                break;
                mtx_enter(lkp->lk_interlock, MTX_DEF);
                if ((lkp->lk_flags & flags) == 0)
                        return 0;
        }
#endif
        return 1;
}
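
For anyone who wants to poke at this loop outside the kernel, here is a rough
userland sketch of the same spin-then-give-up pattern.  It is not the kernel
code: pthread mutexes and C11 atomics stand in for the interlock and lk_flags,
and the names demo_lock, demo_apause, SPIN_ROUNDS and SPIN_SAMPLES are made up
for illustration (the constants are arbitrary stand-ins for LOCK_WAIT_TIME and
LOCK_SAMPLE_WAIT).

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for LOCK_WAIT_TIME and LOCK_SAMPLE_WAIT. */
#define SPIN_ROUNDS     100
#define SPIN_SAMPLES    50

/* Hypothetical userland analogue of the fields apause() looks at. */
struct demo_lock {
        pthread_mutex_t interlock;      /* protects the lock state */
        atomic_uint     flags;          /* "busy" bits a waiter watches */
};

/*
 * Call with lk->interlock held; it is held again on return.
 * Returns 0 if the watched flag bits cleared while spinning,
 * 1 if the spin budget ran out and the caller should sleep instead.
 */
static int
demo_apause(struct demo_lock *lk, unsigned int flags)
{
        int i, round;

        if ((atomic_load(&lk->flags) & flags) == 0)
                return 0;
        for (round = SPIN_ROUNDS; round > 0; round--) {
                /* Drop the interlock so the holder can make progress. */
                pthread_mutex_unlock(&lk->interlock);
                for (i = SPIN_SAMPLES; i > 0; i--)
                        if ((atomic_load(&lk->flags) & flags) == 0)
                                break;
                /* Re-check under the interlock before trusting the sample. */
                pthread_mutex_lock(&lk->interlock);
                if ((atomic_load(&lk->flags) & flags) == 0)
                        return 0;
        }
        return 1;
}
```

Each call samples the flags at most SPIN_ROUNDS * SPIN_SAMPLES times without
the interlock, so a single call cannot spin without bound on its own.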

If you want some fun, stick KTR and KTR_EXTEND in your kernel.  Then, before
you start your world, do:

sysctl -w debug.ktr.mask=0x1008

This logs process switches (0x1000) and mutex ops (0x8).  When you break into ddb,
you can use 'tbuf' to display the first entry in the log buffer, and 'tnext' to
display the next entry.  Then tnext again, etc.  If you can get a core dump (it
worked for me on my dual 200 at least), then I have gdb macros that allow you
to dump the KTR logs in gdb easily.  As for a fix, Jason Evans has implemented
and tested and will hopefully soon commit some simpler and lighter weight
shared/exclusive locks that allproc and proctree will switch to using. 
However, lockmgr is used in lots of places, so it is still in our best interest
to get it fixed.  Also, for the preemptive kernel (which, last I heard, is very
close to running stably on UP and SMP x86 and UP alpha, with just some problems
with FPU state), all these #ifdef SMP's will have to go away and we will use
mutexes in UP as well.
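
In case the debug.ktr.mask value above looks like magic: it is just the two
trace class bits OR'd together.  A small sketch of how such a mask is built
and tested follows.  The macro and function names here are made up for
illustration; only the values 0x1000 and 0x8 come from the sysctl command.

```c
/* Hypothetical names; only the bit values are taken from the message. */
#define TRACE_PROC_SWITCH       0x1000u /* process switches */
#define TRACE_MUTEX_OPS         0x0008u /* mutex operations */

/* Build the mask that enables both trace classes. */
static unsigned int
trace_mask(void)
{
        return TRACE_PROC_SWITCH | TRACE_MUTEX_OPS;
}

/* Nonzero if event_class is enabled in mask. */
static int
trace_enabled(unsigned int mask, unsigned int event_class)
{
        return (mask & event_class) != 0;
}
```

To log only one class, pass just that bit as the mask.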

-- 

John Baldwin <[EMAIL PROTECTED]> -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.Baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/

