Jack:
Here's the NMI output I get now that I very frequently hang while dumping the
1st block of the dump:
-----------------------------------------------------------------------------
258 monica 22:50 ~> sync
259 monica 22:50 ~> panic
Kernel panic: sys_setpriority
Entering kdb (current=0xe000000004578000, pid 860) on processor 1 due to panic
[1]kdb>
dump: Dumping to device 0x802 [sd(8,2)] on CPU 1 ...
dump: Compression value is 0x0, Writing dump header
escaping to L2 system controller
monica-001-L2>nmi
001i19:
INFO: command not support on this brick type
re-entering system console mode (001c28 console), <CTRL_T> to escape to L2
INIT - NASID 1, cpu 2 (monarch)
MinState area at 0x8000000200009600
Control registers
IIP 0x00000000ffe62760 IPSR 0x0000100001002018 IFS: 0x8000000000000409
IIP: FIRMWARE
XIP 0x00000000ffe5fa10 XPSR 0x0000100001002018 XFS: 0x0000000000000000
B0 0x00000000ffe625f0 PRED 0x0000000000008495 RSC: 0x0000000000000000
ISR 0x0000020000000004 IIPA 0x00000000ffe62750 ITIR: 0x0000000000000030
IFA 0x80000000ffffffe0 NaT:0x0000000000000000
BSR:0x00000000001e0e00 LIBC 0x00000000ffe9c6a0 ELSC 0x0000000200005600
PAL:0x0000000000048010
General registers GR0 .. GR31 (bank 1)
GR0 0x0000000000000000 0x00000000ffed28e0 0x00000000001e0e00
0x0000000000000000
GR4 0x0000000000000012 0x0000000000000040 0x0000000000000040
0x00000000000000c0
GR8 0x0000000000000000 0x0000000000040000 0x0000000000000000
0x0000000000000000
GR12 0x0000000203fdfe20 0x0000000000000333 0x80000a0001608018
0x80000a0001790010
GR16 0x000000000000002e 0x0000000203fdfd31 0x0000000203fdfdf0
0x0000000000005038
GR20 0x0000000000000000 0x80000000ffd28020 0x0000000001002018
0x0000000000000000
GR24 0x0000001008020000 0x0000000000000000 0x0000000000000000
0x0000000000000000
GR28 0x0000000000000100 0x0000000000000000 0x0000000000000000
0x0000000000000000
General registers GR16 .. GR31 (bank 0)
GR16 0x0000000000000060 0x0000000000000000 0x00000000ffd3ac80
0x0000000000000004
GR20 0x0000000000000000 0x80000000ffffff80 0x0000140000002030
0x0000000000000003
GR24 0x18e002017ffff801 0x08012b0e2029837c 0x0000100300002038
0x0000000000195631
GR28 0x0000000000195871 0x0000000000195872 0x0000000000000010
0xfffffffffffe8aa1
Rotating Registers GR32 .. GR40
GR32 0x00000000ffe6c938 0x00000000ffe6a668 0x0000000000000010
0x80000a0001400150
GR36 0x00000000ffdb1a80 0x000000000000058e 0x00000000ffe236c0
0x0000000000000306
GR40 0x0000000203fdfdf8INFO: partition 0 system console changed: 001c28 CPU0
HARDWARE ERROR STATE: (Forced error dump)
INIT - NASID 0, cpu 0
END Hardware Error State (Forced error dump)
MinState area at 0x8000000000007600
Dump Spool for PI Errors - nasid 1, err stack A
Control registers
Entry 0: (0x130d34f01000002)
IIP 0xe0020000004b2bd0 IPSR 0x0000121008022018 IFS: 0x800000000000048c
IIP: schedule+0xf0
Cmd 0x02(Request:READ), RRB stat: --------E0 XIP 0xe0020000007552f0 XPSR
0x0000141008026018 XFS: 0x0000000000000812 XIP: rt_check_expire__thr+0x410
CRB #0, T5 req #0, supp 0
B0 0xe00200000040ab40 PRED 0x0000000000016069 RSC: 0x0000000000000003
Error 2 Directory Error, Cache line address 0x130d34f00
ISR 0x0000040000000000 IIPA 0xe0020000004b2bd0 ITIR: 0x0000000000000538
Dump Spool complete.
IFA 0xbfffff0000000028 NaT:0x0000000000000000
BSR:0x00000000001e0e00 LIBC 0x00000000ffe9c6a0 ELSC 0x0000000000005600
PAL:0x0000000000048010
General registers GR0 .. GR31 (bank 1)
GR0 0x0000000000000000 0xe002100000bbc000 0xe00000000277ff00
0xe000000002778000
GR4 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000
GR8 0x0000000000000001 0x0000000000000309 0x0000000000000000
0x0000000000000206
GR12 0xe00000000277fe50 0xe000000002778000 0x0000000000000001
0xe002100000b552d0
GR16 0xe002100000b552d0 0x0000000000000001 0x0000000000000000
0x0000000000000000
GR20 0xe00000004f36b480 0x0000000000000002 0xffffffffffffffff
0xe000000002778000
GR24 0x0000000000000000 0xe000000001218058 0xe000000001218040
0xe000000001218050
GR28 0xe000000001218168 0xe000000001218060 0x0000000000000001
0x0000000000000000
General registers GR16 .. GR31 (bank 0)
GR16 0xe000000002779438 0x0000000000000308 0x0000000000000000
0x0000000000000000
GR20 0xe002100000bbc000 0xe002000000754ee0 0xbfffff0000000028
0x0000000000000000
GR24 0x0000000000000000 0x0000000000000000 0x000000000000048c
0x0000000000000003
GR28 0xe0020000007552f0 0x0000141008026018 0x8000000000000812
0x00000000000161a9
Rotating Registers GR32 .. GR41
GR32 0xe002100000a94d00 0x0000001008022018 0xe0020000004c7e50
0x0000000000000388
GR36 0xe000000001218048 0xe00200000040ab40 0x000000000000050a
0x0000000000000000
GR40 0x0000000000000000 0xe002100000a94e00
Dump Spool for PI Errors - nasid 1, err stack A
Entry 0: (0x130d34f01000002)
Cmd 0x02(Request:READ), RRB stat: --------E0 CRB #0, T5 req #0, supp 0
Error 2 Directory Error, Cache line address 0x130d34f00
Dump Spool complete.
INFO: partition 0 system console changed: 001c28 CPU2
INIT - NASID 0, cpu 2
MinState area at 0x8000000000009600
Control registers
IIP 0xe0020000005521a0 IPSR 0x0000101008026018 IFS: 0x8000000000000287
IIP: kiobuf_wait_for_io+0xc0 [MIB] cmp4.eq p6,p7=0,r14
XIP 0xe0020000005521a0 XPSR 0x0000101008026018 XFS: 0x0000000000000287
B0 0xe0020000005521f0 PRED 0x0000000000061599 RSC: 0x0000000000000003
ISR 0x0000000000000000 IIPA 0xe002000000552190 ITIR: 0x0000000000000538
IFA 0xbfffff0000000038 NaT:0x0000000000000000
BSR:0x00000000001e0e00 LIBC 0x00000000ffe9c6a0 ELSC 0x0000000000005600
PAL:0x0000000000048010
General registers GR0 .. GR31 (bank 1)
GR0 0x0000000000000000 0xe002100000bbc000 0x0000000000000000
0xe000000004578000
GR4 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000
GR8 0x0000000000000000 0x0000000000000c1c 0x0000000000000000
0x0000000000061499
GR12 0xe00000000457fde0 0xe000000004578000 0x0000000000000004
0x0000000000000001
GR16 0xe002100000c0d400 0xe0000000017980aa 0xc000000000000000
0xe002100000b57000
GR20 0xe0000000017386d0 0xe0000000017386c9 0xe0000000016c8578
0xe0000000016c8548
GR24 0x0000000000000001 0xe0000000016c8584 0xe0000000016c8560
0xe0000000016c8510
GR28 0xe0000000016c8507 0x0000000000000001 0xe0000000016c8558
0xe0000000016c8506
General registers GR16 .. GR31 (bank 0)
GR16 0xe000000004579540 0x0000000000000308 0x0000000000000000
0x0000000000000000
GR20 0xe002100000bbc000 0xe0020000006e2da0 0xbfffff0000000038
0xe0020000006e2da0
GR24 0x0000000000061559 0x0000000000000000 0x0000000000000287
0x0000000000000003
GR28 0xe0020000005521a0 0x0000101008026018 0x8000000000000287
0x0000000000061599
Rotating Registers GR32 .. GR38
GR32 0xe000000003ed41b8 0xe000000003ed41a8 0xe000000004578000
0xe00200000051f650
GR36
0x0000000000000f22INIT - NASID 1, cpu 0
0xe002100000b7ca20MinState area at 0x8000000200007600
0xe002100000bbc000 Control registers
IIP 0x00000000ffe62750 IPSR 0x0000100001002018 IFS: 0x8000000000000409
Dump Spool for PI Errors - nasid 1, err stack A
XIP 0x00000000ffe5fa10 XPSR 0x0000100001002018 XFS: 0x0000000000000000
Entry 0: (0x130d34f01000002)
B0 0x00000000ffe625f0 PRED 0x0000000000018495 RSC: 0x0000000000000000
Cmd 0x02(Request:READ), RRB stat: --------E0 ISR 0x0000020000000004 IIPA
0x00000000ffe62740 ITIR: 0x0000000000000030
CRB #0, T5 req #0, supp 0
IFA 0x80000000ffffffe0 NaT:0x0000000000000000
Error 2 Directory Error, Cache line address 0x130d34f00
BSR:0x00000000001e0e00 LIBC 0x00000000ffe9c6a0 ELSC 0x0000000200005600
Dump Spool complete.
PAL:0x0000000000048010
General registers GR0 .. GR31 (bank 1)
GR0 0x0000000000000000 0x00000000ffed28e0 0x00000000001e0e00
0x0000000000000000
GR4 0x0000000000000012 0x0000000000000040 0x0000000000000040
0x00000000000000c0
GR8 0x0000000000000000 0x0000000000000001 0x0000000000000000
0x0000000000000000
GR12 0x0000000203fffe20 0x0000000000000333 0x80000a0001608018
0x80000a0001790010
GR16 0x0000000000000030 0x0000000203fffd32 0x0000000000000000
0x0000000000000010
GR20 0x0000000009fff800 0xffffffff00000000 0x00000000ffffffff
0xffff0000ffff0000
GR24 0x80000b01c0000408 0x000000fe00000000 0x0000000000ffffff
0xffffffff01000000
GR28 0x0000000000000100 0x0000000000000000 0x0000000000000000
0x0000000000000000
General registers GR16 .. GR31 (bank 0)
GR16 0x0000000000000060 0x0000000000000000 0x00000000ffd3ac80
0x0000000000000004
GR20 0x0000000000000000 0x80000000ffffff80 0x0000140000002030
0x0000000000000003
GR24 0x18e002017ffff801 0x08012b0e2029837c 0x0000100300002038
0x0000000000195631
GR28 0x0000000000195871 0x0000000000195872 0x0000000000000010
0xfffffffffffe8aa1
Rotating Registers GR32 .. GR40
GR32 0x00000000ffe6c938 0x00000000ffe6a660 0x0000000000000010
0x80000a0001400150
GR36 0x00000000ffdb1a80 0x000000000000058e 0x00000000ffe236c0
0x0000000000000306
GR40 0x0000000203fffdf8
Dump Spool for PI Errors - nasid 1, err stack A
Entry 0: (0x130d34f01000002)
Cmd 0x02(Request:READ), RRB stat: --------E0 CRB #0, T5 req #0, supp 0
Dump Spool complete.
C 001 001c31:
C 001 001c31: *** NTLB Interruption on node 1
C 001 001c31: *** EPC: 0x0 ([Symbol Table not available])
C 001 001c31: *** IIP: 0xffd9ce80, IPSR: 0x1000000
C 001 001c31: *** Press ENTER to continue.
-------------------------------------------------------------------------------------------------------
I think we are likely waiting somewhere in the add_wait_queue() inline code, with
kiobuf_wait_for_io+0xc0
being so close to the kiobuf_wait_for_io() entry point:
--------------------------------------------------------------------------
/**
 * kiobuf_wait_for_io - wait for completion of a kiobuf request
 * @kiobuf: kiobuf request to wait for
 *
 * Adds a completion event for the kiobuf in question and wakes up
 * when the I/O has completed.
 */
void
kiobuf_wait_for_io(struct kiobuf *kiobuf)
{
        struct task_struct *tsk = current;
        DECLARE_WAITQUEUE(wait, tsk);

        if (atomic_read(&kiobuf->io_count) == 0)
                return;

        add_wait_queue(&kiobuf->wait_queue, &wait);
        <<---- Hanging Here
repeat:
--------------------------------------------------------------------
So I suppose it's more likely that I'm in the inline code for add_wait_queue():
-------------------------------------------------------------------------------------
extern inline void add_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
{
        unsigned long flags;

        write_lock_irqsave(&waitqueue_lock, flags);
        __add_wait_queue(p, wait);
        write_unlock_irqrestore(&waitqueue_lock, flags);
}
--------------------------------------------------------------------------------------------------------------
#define write_lock_irqsave(lock, flags) \
        do { local_irq_save(flags); write_lock(lock); } while (0)
--------------------------------------------------------------------------------------------------------------
# define local_irq_save(x) \
do { \
        unsigned long ip, psr; \
 \
        __asm__ __volatile__ ("mov %0=psr;; rsm psr.i;;" : "=r" (psr) :: "memory"); \
        if (psr & (1UL << 14)) { \
                __asm__ ("mov %0=ip" : "=r"(ip)); \
                last_cli_ip = ip; \
        } \
        (x) = psr; \
} while (0)
--------------------------------------------------------------------------------------------------------------
#define write_lock(rw) \
do { \
__asm__ __volatile__ ( \
"mov ar.ccv = r0\n" \
"dep r29 = -1, r0, 31, 1\n" \
";;\n" \
"1:\n" \
"ld4 r2 = [%0]\n" \
";;\n" \
"cmp4.eq p0,p7 = r0,r2\n" \
"(p7) br.cond.spnt.few 1b \n" \
"cmpxchg4.acq r2 = [%0], r29, ar.ccv\n" \
";;\n" \
"cmp4.eq p0,p7 = r0, r2\n" \
"(p7) br.cond.spnt.few 1b\n" \
";;\n" \
:: "r"(rw) : "ar.ccv", "p7", "r2", "r29", "memory"); \
} while(0)
--------------------------------------------------------------------------------------------------------------
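For anyone else rusty on ia64 asm, the write_lock() loop above reads roughly as
follows in C11 atomics. This is only a sketch of one reading of the macro, not
kernel code: the sketch_* names and the unlock routine are hypothetical, and the
lock word is assumed to be a plain 32-bit integer where 0 means free.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's rwlock word (ld4/cmpxchg4 are 32-bit). */
typedef struct { _Atomic uint32_t word; } sketch_rwlock_t;

#define SKETCH_WRITE_BIT (UINT32_C(1) << 31)    /* "dep r29 = -1, r0, 31, 1" */

/* One pass of the loop body: returns 1 if the write lock was taken, else 0. */
static int sketch_write_trylock(sketch_rwlock_t *rw)
{
    uint32_t expected = 0;                      /* "mov ar.ccv = r0" */

    /* "ld4 r2 = [%0]" + "cmp4.eq": skip the exchange while the word is nonzero */
    if (atomic_load_explicit(&rw->word, memory_order_relaxed) != 0)
        return 0;

    /* "cmpxchg4.acq": install the write bit only if the word is still 0 */
    return atomic_compare_exchange_strong_explicit(
        &rw->word, &expected, SKETCH_WRITE_BIT,
        memory_order_acquire, memory_order_relaxed);
}

/* The macro itself: branch back to label 1 until the exchange succeeds. */
static void sketch_write_lock(sketch_rwlock_t *rw)
{
    while (!sketch_write_trylock(rw))
        ;   /* spins forever if the holder never releases */
}

/* Assumed release path (not quoted above): clear the word with release order. */
static void sketch_write_unlock(sketch_rwlock_t *rw)
{
    atomic_store_explicit(&rw->word, 0, memory_order_release);
}
```

Nothing in this loop touches the interrupt state; whether interrupts are blocked
while spinning depends entirely on what ran before it.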
On most systems where I've done much debugging of spinlock and mutex hangs, I've added
or used audit information in the locks saying who owned them (the CPU, and the PCs on
the stack of the caller to write_lock()). IA64 Linux doesn't seem to have any audit
information, even for DEBUG or BRINGUP kernels.
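The kind of audit information meant here can be sketched as below. This is an
illustration under assumptions, not anything that exists in the IA64 tree: the
audited_* names are hypothetical, the CPU number is passed in by the caller rather
than read from the kernel, only one return address is recorded instead of a stack
of PCs, and __builtin_return_address() and the noinline attribute are GCC/Clang
extensions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical audited lock: the lock word plus a record of the last acquirer. */
typedef struct {
    _Atomic uint32_t word;      /* 0 = free, bit 31 = write-locked */
    int   owner_cpu;            /* CPU that took the lock, -1 when free */
    void *owner_pc;             /* return address of the acquiring call */
} audited_lock_t;

#define AUDIT_WRITE_BIT   (UINT32_C(1) << 31)
#define AUDITED_LOCK_INIT { 0, -1, NULL }

/* noinline so that the return address points into the real caller. */
static void __attribute__((noinline))
audited_write_lock(audited_lock_t *l, int cpu)
{
    uint32_t expected = 0;

    while (!atomic_compare_exchange_weak_explicit(
               &l->word, &expected, AUDIT_WRITE_BIT,
               memory_order_acquire, memory_order_relaxed))
        expected = 0;   /* a failed exchange overwrites expected; reset it */

    /* Written only while the lock is held, so a debugger or NMI handler
     * can read off the owner when another CPU spins on this lock forever. */
    l->owner_cpu = cpu;
    l->owner_pc  = __builtin_return_address(0);
}

static void audited_write_unlock(audited_lock_t *l)
{
    assert(atomic_load(&l->word) == AUDIT_WRITE_BIT);   /* must be held */
    l->owner_cpu = -1;
    l->owner_pc  = NULL;
    atomic_store_explicit(&l->word, 0, memory_order_release);
}
```

With something like this in place, a hang in the spin loop would leave owner_cpu
and owner_pc in memory for kdb or the NMI output to report.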
I think there is likely something very wrong with kdb's handling of IPIs to
processors.
Perhaps in older 2.4.16 kernels it wasn't worrying about checking the state of the
processors it was stopping to make sure they weren't processing interrupt code.
On the Sequent our SPLs to block interrupts were right in the spinlock code.
I don't see that in our write_lock() #define.
Is that being done by local_irq_save()? I don't see it, but I'm still rusty on ia64
asm....
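As a quick cross-check, the same (1UL << 14) test that local_irq_save() applies to
the PSR can be applied to the IPSR values in the NMI output. A minimal sketch,
assuming (as the "rsm psr.i" next to that test suggests) that bit 14 is the
interrupt-enable bit; the helper and macro names are hypothetical.

```c
#include <stdint.h>

/* Same mask that local_irq_save() tests: bit 14 of the PSR. */
#define SKETCH_PSR_I (UINT64_C(1) << 14)

/* Hypothetical helper: returns 1 if the saved PSR had bit 14 set, else 0. */
static int sketch_psr_irqs_enabled(uint64_t ipsr)
{
    return (ipsr & SKETCH_PSR_I) != 0;
}
```

Fed the values from the output above, this returns 1 for the
kiobuf_wait_for_io+0xc0 frame (IPSR 0x0000101008026018) and 0 for both the
schedule+0xf0 frame (IPSR 0x0000121008022018) and the firmware frames
(IPSR 0x0000100001002018).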
When I go through panic() to dump and involve kdb, I hit this bug between 20% and 90%
of the time.
When I go directly to dump(), we successfully dumped the ia64 SN1 system 250 times
before hanging.
On a PC mono-processor without kdb we never hang; it ran all night (2500 dumps).
-piet