Re: Hard lockups using 5.3-RELEASE..

2005-02-19 Thread Robert Watson
On Sat, 19 Feb 2005, Peter Losher wrote:

> We have a Celestica dual-Opteron system w/ 4GB RAM running
> 5.3-RELEASE/i386 (32-bit), and a SMP-aware kernel, which is experiencing
> hard lockups.  Debugging results below. 

Hmm.  So just to summarize:

- The system appears to wedge
- Serial break can get into the debugger

Have you tried updating to the latest RELENG_5_3 patch level?  That
includes at least one significant SMP stability fix.  You can rebuild
along the RELENG_5_3 branch, or just use freebsd-update to pull it in.

> It looks like it's trying to lock Giant while it already has Giant.  In
> any case, we have rebuilt a uniprocessor kernel for now.  If this is
> already fixed in 5-STABLE, then let me know. ;) 

Generally speaking, recursing Giant is fine, as Giant is a recursible
mutex; however, an ithread shouldn't already hold Giant at that point.
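
For illustration only, here is a user-space sketch of what "recursible"
means, using POSIX threads rather than the kernel mtx(9) API.  The names
giant_init, inner_layer and outer_layer are hypothetical and do not
correspond to any kernel function; the point is only that a second lock
by the thread that already owns the mutex succeeds instead of
deadlocking.

```c
#define _XOPEN_SOURCE 700
#include <pthread.h>

/* User-space stand-in for Giant; this is not the kernel's struct mtx. */
static pthread_mutex_t giant;

/* Initialise the stand-in as a recursive mutex.  Returns 0 on success. */
int giant_init(void)
{
    pthread_mutexattr_t attr;
    int error;

    error = pthread_mutexattr_init(&attr);
    if (error != 0)
        return error;
    error = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (error == 0)
        error = pthread_mutex_init(&giant, &attr);
    pthread_mutexattr_destroy(&attr);
    return error;
}

/* Inner layer: locks and unlocks, whether or not the caller holds the lock. */
int inner_layer(void)
{
    int error = pthread_mutex_lock(&giant);

    if (error != 0)
        return error;
    return pthread_mutex_unlock(&giant);
}

/* Outer layer: holds the lock across the call, so the inner lock recurses. */
int outer_layer(void)
{
    int error = pthread_mutex_lock(&giant);

    if (error != 0)
        return error;
    error = inner_layer();
    pthread_mutex_unlock(&giant);
    return error;
}
```

After giant_init() returns 0, outer_layer() should also return 0.  With
a non-recursive mutex type the inner lock would deadlock or return an
error, depending on the type.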

This may be fixed in 5-STABLE, but it's hard to say.  I think the order of
operations here is:

- First, slide to RELENG_5_3 head (p5?) to make sure you have the IPI
  stability fix.  See if the problem goes away.

- Generate the following information: when the box is wedged, does it...

  (1) Respond to pings?
  (2) Toggle the num lock light when the num lock key is hit?
  (3) If it responds to pings, what happens when you open a new TCP
  connection to an open TCP port (a) once (b) twice (c) the 100th
  (or so) time?

- Generate the following DDB output using your serial console:

  show pcpu
  show pcpu 0
  show pcpu 1
  ps
  show lockedvnods

  I may then ask you to generate stack traces of the processes that appear
  "interesting".  The definition of interesting is a little bit
  context-specific, so it's hard to say what it is just now.  If there are a
  lot of processes wedged in VM and VFS, then I'll ask you to trace each
  process that appears in the lockedvnods output. 

- Next, recompile with INVARIANTS and see if the problem triggers an
  assertion failure when it occurs.

- Next, recompile with WITNESS and see if WITNESS creates a warning or
  assertion failure when it occurs.

  Break to the debugger and generate the above DDB output, but also "show
  alllocks" (5-STABLE only), or "show locks" for interesting processes if
  5-RELEASE-*.
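
The TCP check in step (3) above could be run from another machine with
something like the following.  This is a minimal sketch, not a polished
tool: probe_tcp and probe_repeat are hypothetical names, and the host
and port are whatever open service the wedged box normally offers.

```c
#define _POSIX_C_SOURCE 200112L
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try one TCP connection to host:port.  Returns 0 on success, -1 on failure. */
int probe_tcp(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd, result = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    /* Walk the resolved addresses until one of them accepts the connection. */
    for (ai = res; ai != NULL && result != 0; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            result = 0;
        close(fd);
    }
    freeaddrinfo(res);
    return result;
}

/* Open and close `count` connections in a row; return how many succeeded. */
int probe_repeat(const char *host, const char *port, int count)
{
    int i, ok = 0;

    for (i = 1; i <= count; i++) {
        if (probe_tcp(host, port) == 0)
            ok++;
        else
            fprintf(stderr, "attempt %d failed\n", i);
    }
    return ok;
}
```

Calling probe_repeat(host, port, 100) from a small main() covers cases
(a) through (c) in one run.  The connect() call is blocking, so an
attempt against a box that has stopped answering can sit until the
local TCP timeout expires; note how long each attempt takes as well as
whether it succeeds.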

Also, I don't think you mentioned what sort of workload is present on the
box.

Thanks!

Robert N M Watson

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable


Hard lockups using 5.3-RELEASE..

2005-02-19 Thread Peter Losher
We have a Celestica dual-Opteron system w/ 4GB RAM running
5.3-RELEASE/i386 (32-bit), and a SMP-aware kernel, which is experiencing
hard lockups.  Debugging results below.
-=-
[BREAK]
KDB: enter: Line break on console
[thread 100104]
Stopped at  kdb_enter+0x2b: nop
db> where
kdb_enter(c084e4c6) at kdb_enter+0x2b
siointr1(c507d800,c0946700,0,c084e28e,6ad) at siointr1+0xce
siointr(c507d800) at siointr+0x21
intr_execute_handlers(c4f5d490,e9826b80,4,e9826bd0,c07b2ae3) at intr_execute_handlers+0x89
lapic_handle_intr(34) at lapic_handle_intr+0x2e
Xapic_isr1() at Xapic_isr1+0x33
--- interrupt, eip = 0xc0604456, esp = 0xe9826bc4, ebp = 0xe9826bd0 ---
_mtx_lock_sleep(c08f67c0,c5698640,0,c084a0b3,126) at _mtx_lock_sleep+0xc6
_mtx_lock_flags(c08f67c0,0,c084a0b3,126,c6a82738) at _mtx_lock_flags+0x48
vm_fault(c5bbd5dc,81ae000,2,8,c5698640) at vm_fault+0x1fe
trap_pfault(e9826d48,1,81ae000,81ae000,0) at trap_pfault+0xf2
trap(2f,2f,2f,2000,81ae000) at trap+0x1df
calltrap() at calltrap+0x5
--- trap 0xc, eip = 0x2809bd8d, esp = 0xbfbfb7b0, ebp = 0xbfbfb7e8 ---
db> panic
panic: from debugger
cpuid = 3
boot() called on cpu#3
Uptime: 2h50m29s
-=-
(then resetting the system causes a panic, and the system locks up for
good, and a power reset is required)
We were able to get a coredump, and the resulting kgdb output is below:
-=-
(kgdb) up
#45 0xc05f9bda in fork_exit (callout=0xc05fa5dc ,
arg=0xc4fe7a00, frame=0xe8daed48) at ../../../kern/kern_fork.c:811
811 callout(arg, frame);
(kgdb) l
806  * cpu_set_fork_handler intercepts this function call to
807  * have this call a non-return function to stay in kernel mode.
808  * initproc has its own fork handler, but it does return.
809  */
810 KASSERT(callout != NULL, ("NULL callout in fork_exit"));
811 callout(arg, frame);
812
813 /*
814  * Check if a kernel thread misbehaved and returned from its main
815  * function.
(kgdb) down
#44 0xc05fa6e8 in ithread_loop (arg=0xc4fe7a00)
at ../../../kern/kern_intr.c:547
547 ih->ih_handler(ih->ih_argument);
(kgdb) l
542 mtx_unlock(&ithd->it_lock);
543 goto restart;
544 }
545 if ((ih->ih_flags & IH_MPSAFE) == 0)
546 mtx_lock(&Giant);
547 ih->ih_handler(ih->ih_argument);
548 if ((ih->ih_flags & IH_MPSAFE) == 0)
549 mtx_unlock(&Giant);
550 }
551 if (ithd->it_enable != NULL) {
(kgdb) down
#43 0xc0615dfa in softclock (dummy=0x0) at ../../../kern/kern_timeout.c:247
247 mtx_lock(&Giant);
(kgdb) l
242 (c->c_flags & ~CALLOUT_PENDING);
243 }
244 curr_callout = c;
245 mtx_unlock_spin(&callout_lock);
246 if (!(c_flags & CALLOUT_MPSAFE)) {
247 mtx_lock(&Giant);
248 gcalls++;
249 CTR1(KTR_CALLOUT, "callout %p", c_func);
250 } else {
251 mpcalls++;
-=-
It looks like it's trying to lock Giant while it already has Giant.  In
any case, we have rebuilt a uniprocessor kernel for now.  If this is
already fixed in 5-STABLE, then let me know. ;)
Best Wishes - Peter
--
[EMAIL PROTECTED] | ISC | OpenPGP 0xE8048D08 | "The bits must flow"

