Jack:

Here's the NMI output I'm getting now that I very frequently hang while dumping
the 1st block of the dump:
-----------------------------------------------------------------------------
 258 monica 22:50 ~> sync
 259 monica 22:50 ~> panic
Kernel panic: sys_setpriority
 
Entering kdb (current=0xe000000004578000, pid 860) on processor 1 due to panic
[1]kdb> 

dump: Dumping to device 0x802 [sd(8,2)] on CPU 1 ...
dump: Compression value is 0x0, Writing dump header 
escaping to L2 system controller
monica-001-L2>nmi
001i19:
INFO: command not support on this brick type

re-entering system console mode (001c28 console), <CTRL_T> to escape to L2

INIT - NASID 1, cpu 2 (monarch)
MinState area at 0x8000000200009600
  Control registers
    IIP 0x00000000ffe62760    IPSR 0x0000100001002018    IFS: 0x8000000000000409       
                 IIP: FIRMWARE
    XIP 0x00000000ffe5fa10    XPSR 0x0000100001002018    XFS: 0x0000000000000000
    B0  0x00000000ffe625f0    PRED 0x0000000000008495    RSC: 0x0000000000000000
    ISR 0x0000020000000004    IIPA 0x00000000ffe62750   ITIR: 0x0000000000000030
    IFA 0x80000000ffffffe0     NaT:0x0000000000000000
    BSR:0x00000000001e0e00    LIBC 0x00000000ffe9c6a0    ELSC 0x0000000200005600
    PAL:0x0000000000048010
  General registers GR0 .. GR31 (bank 1)
    GR0    0x0000000000000000   0x00000000ffed28e0   0x00000000001e0e00   
0x0000000000000000
    GR4    0x0000000000000012   0x0000000000000040   0x0000000000000040   
0x00000000000000c0
    GR8    0x0000000000000000   0x0000000000040000   0x0000000000000000   
0x0000000000000000
    GR12   0x0000000203fdfe20   0x0000000000000333   0x80000a0001608018   
0x80000a0001790010
    GR16   0x000000000000002e   0x0000000203fdfd31   0x0000000203fdfdf0   
0x0000000000005038
    GR20   0x0000000000000000   0x80000000ffd28020   0x0000000001002018   
0x0000000000000000
    GR24   0x0000001008020000   0x0000000000000000   0x0000000000000000   
0x0000000000000000
    GR28   0x0000000000000100   0x0000000000000000   0x0000000000000000   
0x0000000000000000
  General registers  GR16 .. GR31 (bank 0)
    GR16   0x0000000000000060   0x0000000000000000   0x00000000ffd3ac80   
0x0000000000000004
    GR20   0x0000000000000000   0x80000000ffffff80   0x0000140000002030   
0x0000000000000003
    GR24   0x18e002017ffff801   0x08012b0e2029837c   0x0000100300002038   
0x0000000000195631
    GR28   0x0000000000195871   0x0000000000195872   0x0000000000000010   
0xfffffffffffe8aa1
  Rotating Registers GR32 .. GR40
    GR32   0x00000000ffe6c938   0x00000000ffe6a668   0x0000000000000010   
0x80000a0001400150
    GR36   0x00000000ffdb1a80   0x000000000000058e   0x00000000ffe236c0   
0x0000000000000306
    GR40   0x0000000203fdfdf8INFO: partition 0 system console changed: 001c28 CPU0


HARDWARE ERROR STATE: (Forced error dump)
INIT - NASID 0, cpu 0 
END Hardware Error State (Forced error dump)
MinState area at 0x8000000000007600
Dump Spool for PI Errors - nasid 1, err stack A
  Control registers
 Entry 0:    (0x130d34f01000002)
    IIP 0xe0020000004b2bd0    IPSR 0x0000121008022018    IFS: 0x800000000000048c       
                                                 IIP: schedule+0xf0
  Cmd 0x02(Request:READ), RRB stat: --------E0    XIP 0xe0020000007552f0    XPSR 
0x0000141008026018    XFS: 0x0000000000000812          XIP: rt_check_expire__thr+0x410
 CRB #0, T5 req #0, supp 0
    B0  0xe00200000040ab40    PRED 0x0000000000016069    RSC: 0x0000000000000003
  Error 2 Directory Error, Cache line address 0x130d34f00
    ISR 0x0000040000000000    IIPA 0xe0020000004b2bd0   ITIR: 0x0000000000000538
Dump Spool complete.
    IFA 0xbfffff0000000028     NaT:0x0000000000000000
    BSR:0x00000000001e0e00    LIBC 0x00000000ffe9c6a0    ELSC 0x0000000000005600
    PAL:0x0000000000048010
  General registers GR0 .. GR31 (bank 1)
    GR0    0x0000000000000000   0xe002100000bbc000   0xe00000000277ff00   
0xe000000002778000
    GR4    0x0000000000000000   0x0000000000000000   0x0000000000000000   
0x0000000000000000
    GR8    0x0000000000000001   0x0000000000000309   0x0000000000000000   
0x0000000000000206
    GR12   0xe00000000277fe50   0xe000000002778000   0x0000000000000001   
0xe002100000b552d0
    GR16   0xe002100000b552d0   0x0000000000000001   0x0000000000000000   
0x0000000000000000
    GR20   0xe00000004f36b480   0x0000000000000002   0xffffffffffffffff   
0xe000000002778000
    GR24   0x0000000000000000   0xe000000001218058   0xe000000001218040   
0xe000000001218050
    GR28   0xe000000001218168   0xe000000001218060   0x0000000000000001   
0x0000000000000000
  General registers  GR16 .. GR31 (bank 0)
    GR16   0xe000000002779438   0x0000000000000308   0x0000000000000000   
0x0000000000000000
    GR20   0xe002100000bbc000   0xe002000000754ee0   0xbfffff0000000028   
0x0000000000000000
    GR24   0x0000000000000000   0x0000000000000000   0x000000000000048c   
0x0000000000000003
    GR28   0xe0020000007552f0   0x0000141008026018   0x8000000000000812   
0x00000000000161a9
  Rotating Registers GR32 .. GR41
    GR32   0xe002100000a94d00   0x0000001008022018   0xe0020000004c7e50   
0x0000000000000388
    GR36   0xe000000001218048   0xe00200000040ab40   0x000000000000050a   
0x0000000000000000
    GR40   0x0000000000000000   0xe002100000a94e00
Dump Spool for PI Errors - nasid 1, err stack A
 Entry 0:    (0x130d34f01000002)
  Cmd 0x02(Request:READ), RRB stat: --------E0 CRB #0, T5 req #0, supp 0
  Error 2 Directory Error, Cache line address 0x130d34f00
Dump Spool complete.
INFO: partition 0 system console changed: 001c28 CPU2

INIT - NASID 0, cpu 2 
MinState area at 0x8000000000009600
  Control registers
    IIP 0xe0020000005521a0    IPSR 0x0000101008026018    IFS: 0x8000000000000287       
                         IIP: kiobuf_wait_for_io+0xc0 [MIB]       cmp4.eq p6,p7=0,r14
    XIP 0xe0020000005521a0    XPSR 0x0000101008026018    XFS: 0x0000000000000287
    B0  0xe0020000005521f0    PRED 0x0000000000061599    RSC: 0x0000000000000003
    ISR 0x0000000000000000    IIPA 0xe002000000552190   ITIR: 0x0000000000000538
    IFA 0xbfffff0000000038     NaT:0x0000000000000000
    BSR:0x00000000001e0e00    LIBC 0x00000000ffe9c6a0    ELSC 0x0000000000005600
    PAL:0x0000000000048010
  General registers GR0 .. GR31 (bank 1)
    GR0    0x0000000000000000   0xe002100000bbc000   0x0000000000000000   
0xe000000004578000
    GR4    0x0000000000000000   0x0000000000000000   0x0000000000000000   
0x0000000000000000
    GR8    0x0000000000000000   0x0000000000000c1c   0x0000000000000000   
0x0000000000061499
    GR12   0xe00000000457fde0   0xe000000004578000   0x0000000000000004   
0x0000000000000001
    GR16   0xe002100000c0d400   0xe0000000017980aa   0xc000000000000000   
0xe002100000b57000
    GR20   0xe0000000017386d0   0xe0000000017386c9   0xe0000000016c8578   
0xe0000000016c8548
    GR24   0x0000000000000001   0xe0000000016c8584   0xe0000000016c8560   
0xe0000000016c8510
    GR28   0xe0000000016c8507   0x0000000000000001   0xe0000000016c8558   
0xe0000000016c8506
  General registers  GR16 .. GR31 (bank 0)
    GR16   0xe000000004579540   0x0000000000000308   0x0000000000000000   
0x0000000000000000
    GR20   0xe002100000bbc000   0xe0020000006e2da0   0xbfffff0000000038   
0xe0020000006e2da0
    GR24   0x0000000000061559   0x0000000000000000   0x0000000000000287   
0x0000000000000003
    GR28   0xe0020000005521a0   0x0000101008026018   0x8000000000000287   
0x0000000000061599
  Rotating Registers GR32 .. GR38
    GR32   0xe000000003ed41b8   0xe000000003ed41a8   0xe000000004578000   
0xe00200000051f650
    GR36
   0x0000000000000f22INIT - NASID 1, cpu 0 
   0xe002100000b7ca20MinState area at 0x8000000200007600
   0xe002100000bbc000  Control registers

    IIP 0x00000000ffe62750    IPSR 0x0000100001002018    IFS: 0x8000000000000409
Dump Spool for PI Errors - nasid 1, err stack A
    XIP 0x00000000ffe5fa10    XPSR 0x0000100001002018    XFS: 0x0000000000000000
 Entry 0:    (0x130d34f01000002)
    B0  0x00000000ffe625f0    PRED 0x0000000000018495    RSC: 0x0000000000000000
  Cmd 0x02(Request:READ), RRB stat: --------E0    ISR 0x0000020000000004    IIPA 
0x00000000ffe62740   ITIR: 0x0000000000000030
 CRB #0, T5 req #0, supp 0
    IFA 0x80000000ffffffe0     NaT:0x0000000000000000
  Error 2 Directory Error, Cache line address 0x130d34f00
    BSR:0x00000000001e0e00    LIBC 0x00000000ffe9c6a0    ELSC 0x0000000200005600
Dump Spool complete.
    PAL:0x0000000000048010
  General registers GR0 .. GR31 (bank 1)
    GR0    0x0000000000000000   0x00000000ffed28e0   0x00000000001e0e00   
0x0000000000000000
    GR4    0x0000000000000012   0x0000000000000040   0x0000000000000040   
0x00000000000000c0
    GR8    0x0000000000000000   0x0000000000000001   0x0000000000000000   
0x0000000000000000
    GR12   0x0000000203fffe20   0x0000000000000333   0x80000a0001608018   
0x80000a0001790010
    GR16   0x0000000000000030   0x0000000203fffd32   0x0000000000000000   
0x0000000000000010
    GR20   0x0000000009fff800   0xffffffff00000000   0x00000000ffffffff   
0xffff0000ffff0000
    GR24   0x80000b01c0000408   0x000000fe00000000   0x0000000000ffffff   
0xffffffff01000000
    GR28   0x0000000000000100   0x0000000000000000   0x0000000000000000   
0x0000000000000000
  General registers  GR16 .. GR31 (bank 0)
    GR16   0x0000000000000060   0x0000000000000000   0x00000000ffd3ac80   
0x0000000000000004
    GR20   0x0000000000000000   0x80000000ffffff80   0x0000140000002030   
0x0000000000000003
    GR24   0x18e002017ffff801   0x08012b0e2029837c   0x0000100300002038   
0x0000000000195631
    GR28   0x0000000000195871   0x0000000000195872   0x0000000000000010   
0xfffffffffffe8aa1
  Rotating Registers GR32 .. GR40
    GR32   0x00000000ffe6c938   0x00000000ffe6a660   0x0000000000000010   
0x80000a0001400150
    GR36   0x00000000ffdb1a80   0x000000000000058e   0x00000000ffe236c0   
0x0000000000000306
    GR40   0x0000000203fffdf8
Dump Spool for PI Errors - nasid 1, err stack A
 Entry 0:    (0x130d34f01000002)
  Cmd 0x02(Request:READ), RRB stat: --------E0 CRB #0, T5 req #0, supp 0
Dump Spool complete.
C 001 001c31: 
C 001 001c31: *** NTLB Interruption on node 1
C 001 001c31: *** EPC: 0x0 ([Symbol Table not available])
C 001 001c31: *** IIP: 0xffd9ce80, IPSR: 0x1000000
C 001 001c31: *** Press ENTER to continue.
-------------------------------------------------------------------------------------------------------

I think we are likely waiting somewhere in the inlined add_wait_queue() code,
since kiobuf_wait_for_io+0xc0 is so close to the kiobuf_wait_for_io() entry
point:
        --------------------------------------------------------------------------
        /**
         * kiobuf_wait_for_io - wait for completion of a kiobuf request
         * @kiobuf: kiobuf request to wait for
         *
         * Adds a completion event for the kiobuf in question and wakes up
         * when the I/O has completed.
         */
        void
        kiobuf_wait_for_io(struct kiobuf *kiobuf)
        {
                struct task_struct *tsk = current;
                DECLARE_WAITQUEUE(wait, tsk);
        
                if (atomic_read(&kiobuf->io_count) == 0)
                        return;
        
                add_wait_queue(&kiobuf->wait_queue, &wait);   <<---- Hanging Here
         repeat:
        --------------------------------------------------------------------    

So I suppose it's more likely that I'm in the inline code for add_wait_queue():
-------------------------------------------------------------------------------------
extern inline void add_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
{
        unsigned long flags;

        write_lock_irqsave(&waitqueue_lock, flags);
        __add_wait_queue(p, wait);
        write_unlock_irqrestore(&waitqueue_lock, flags);
}
--------------------------------------------------------------------------------------------------------------
#define write_lock_irqsave(lock, flags)         do { local_irq_save(flags); write_lock(lock); } while (0)
--------------------------------------------------------------------------------------------------------------
# define local_irq_save(x)                                                              \
do {                                                                                    \
        unsigned long ip, psr;                                                          \
                                                                                        \
        __asm__ __volatile__ ("mov %0=psr;; rsm psr.i;;" : "=r" (psr) :: "memory");     \
        if (psr & (1UL << 14)) {                                                        \
                __asm__ ("mov %0=ip" : "=r"(ip));                                       \
                last_cli_ip = ip;                                                       \
        }                                                                               \
        (x) = psr;                                                                      \
} while (0)
--------------------------------------------------------------------------------------------------------------
#define write_lock(rw)                                                          \
do {                                                                            \
        __asm__ __volatile__ (                                                  \
                "mov ar.ccv = r0\n"                                             \
                "dep r29 = -1, r0, 31, 1\n"                                     \
                ";;\n"                                                          \
                "1:\n"                                                          \
                "ld4 r2 = [%0]\n"                                               \
                ";;\n"                                                          \
                "cmp4.eq p0,p7 = r0,r2\n"                                       \
                "(p7) br.cond.spnt.few 1b \n"                                   \
                "cmpxchg4.acq r2 = [%0], r29, ar.ccv\n"                         \
                ";;\n"                                                          \
                "cmp4.eq p0,p7 = r0, r2\n"                                      \
                "(p7) br.cond.spnt.few 1b\n"                                    \
                ";;\n"                                                          \
                :: "r"(rw) : "ar.ccv", "p7", "r2", "r29", "memory");            \
} while(0)
--------------------------------------------------------------------------------------------------------------

On most systems where I've done much debugging of spinlock and mutex hangs, I've
added or used audit information in the locks saying who owned them (the CPU, and
the PCs on the stack of the caller to write_lock()). IA64 Linux doesn't seem to
have any audit information, even for DEBUG or BRINGUP kernels.
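
To illustrate the kind of owner audit I mean, here is a minimal sketch in
portable C11 rather than ia64 asm. The struct, macro and function names
(audit_lock, AUDIT_LOCK_INIT, audit_lock_acquire, audit_lock_release) are
hypothetical and do not exist in the IA64 tree; a real version would record
the caller's PC instead of a string tag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical audited lock; not an existing IA64 Linux type. */
struct audit_lock {
        atomic_int  locked;      /* 0 = free, 1 = held */
        int         owner_cpu;   /* CPU currently holding the lock, -1 if none */
        const char *owner_site;  /* tag naming the caller that took the lock */
};

#define AUDIT_LOCK_INIT { 0, -1, NULL }

/* Spin until the lock is ours, then record who took it and from where. */
static inline void audit_lock_acquire(struct audit_lock *l, int cpu,
                                      const char *site)
{
        int expected = 0;

        /* A failed compare-exchange overwrites 'expected', so reset it. */
        while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
                expected = 0;

        /* Only the holder writes the audit fields. */
        l->owner_cpu = cpu;
        l->owner_site = site;
}

/* Clear the audit fields before making the lock available again. */
static inline void audit_lock_release(struct audit_lock *l, int cpu)
{
        assert(l->owner_cpu == cpu);    /* catch unlock by a non-owner */
        l->owner_cpu = -1;
        l->owner_site = NULL;
        atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

With something like this, a hung CPU spinning in the acquire path could be
diagnosed from kdb or an NMI dump by reading owner_cpu and owner_site out of
the lock.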

I think there is likely something very wrong with kdb's handling of IPIs to
processors. Perhaps in older 2.4.16 kernels it wasn't worrying about checking
the state of the processors it was stopping to make sure they weren't processing
interrupt code. On the Sequent, our SPLs to block interrupts were right in the
spinlock code. I don't see that in our write_lock() #define. Is it being done
by local_irq_save()? I don't see it, but I'm still rusty on ia64 asm....
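
As a reading aid for the local_irq_save() macro quoted above, here is a small
simulation in plain C. It assumes that "rsm psr.i" clears the interrupt-enable
bit of the PSR and that bit 14, the one the macro tests, is that same psr.i
bit. The sim_* names are hypothetical and the "PSR" is just a variable, so this
only models the control flow, not real hardware behaviour.

```c
#include <assert.h>

#define PSR_I_BIT 14UL  /* assumed position of psr.i, as tested by the macro */

static unsigned long sim_psr = 1UL << PSR_I_BIT;  /* simulated PSR, interrupts on */
static unsigned long sim_last_cli_ip;             /* stands in for last_cli_ip */

/* Model of local_irq_save(): read PSR, clear psr.i, return the old PSR. */
static inline unsigned long sim_local_irq_save(unsigned long caller_ip)
{
        unsigned long psr = sim_psr;            /* mov %0=psr */

        sim_psr &= ~(1UL << PSR_I_BIT);         /* rsm psr.i  */
        if (psr & (1UL << PSR_I_BIT))           /* were interrupts enabled? */
                sim_last_cli_ip = caller_ip;    /* remember who disabled them */
        return psr;
}

/* Model of the matching restore: put back only the saved interrupt bit. */
static inline void sim_local_irq_restore(unsigned long flags)
{
        sim_psr = (sim_psr & ~(1UL << PSR_I_BIT)) |
                  (flags & (1UL << PSR_I_BIT));
}
```

If that reading is right, the interrupt blocking happens in local_irq_save(),
before write_lock() starts spinning, rather than inside the lock code itself
as it did with the Sequent SPLs.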

When I go through panic() to dump and involve kdb, I hit this bug between 20%
and 90% of the time. When I go directly to dump(), we successfully dumped the
ia64 SN1 system 250 times before hanging. On a mono-processor PC without kdb we
never hang; it ran all night (2500 dumps).

-piet
