Raymond LI - Sun Microsystems - Beijing China wrote:
> Typically this is a dead loop in kernel/module code. In this case it is
> bge's long loop in its ISR, which prevents lbolt from being advanced by
> the clock routine.

I'm not sure about that. The network interrupt's PIL is 6, while the clock
interrupt's PIL is 10. The clock routine shouldn't be blocked by an
interrupt thread running at a lower PIL.

>
> Oliver Yang wrote:
>> Hi Guys,
>>
>> I have a question about debugging a hang issue with mdb. Can anybody
>> answer the questions in the mail below?
>>
>> -------- Original Message --------
>> Subject: About debug hang issues with deadman
>> Date: Mon, 27 Nov 2006 16:40:00 +0800
>> From: Oliver Yang <Oliver.Yang at Sun.COM>
>> To: opensolaris-code at opensolaris.org
>>
>> Hi All,
>>
>> Recently I ran into a hang issue, so I enabled the deadman timer by
>> setting "set snooping=1" in the /etc/system file. Eventually we got a
>> crash dump file.
>>
>> Can anybody give me a hint on how to debug a hang issue with deadman
>> enabled?
>>
>> Here are the steps I have tried on one of the crash dump files:
>>
>>> ::msgbuf
>>
>> panic[cpu0]/thread=a13f7de0:
>> deadman: timed out after 50 seconds of clock inactivity
>>
>> a13f4f24 genunix:deadman+159 (0)
>> a13f4f5c genunix:cyclic_expire+280 (a14fe000, 3, aa2183)
>> a13f4fb8 genunix:cyclic_fire+17d (fec21d30)
>> a13f4fd4 unix:cbe_fire+4a (0, 0)
>> a13f75b8 unix:_interrupt+2a1 (1b0, abb10000, ab36)
>> a13f7638 unix:bcopy+39 (abb11780)
>> a13f765c genunix:copymsg+2e (aa228340)
>> a13f7688 genunix:copymsgchain+23 (aa228340)
>> a13f76c0 dls:i_dls_link_rx_func+9d (a3041d78, 0, a13f77)
>> a13f7704 dls:i_dls_link_rx_common_promisc+3b (a3041d78, 0, a13f77)
>> a13f77a8 dls:i_dls_link_rx_common+11c (a3041d78, 0, aa2283)
>> a13f77c4 dls:i_dls_link_txloop+18 (a3041d78, aa228340)
>> a13f77f0 mac:mac_txloop+92 (a3042ee8, aa228340)
>> a13f780c dls:dls_tx+16 (a3026f80, abb08e80)
>> a13f7828 dld:dld_tx_single+1e (a2bfdea8, abb08e80)
>> a13f7840 dld:str_mdata_fastpath_put+60 (a2bfdea8, abb08e80)
>> a13f78b0 ip:tcp_send_data+85c (aafda500, aafdfc68,)
>> a13f791c ip:tcp_send+6f1 (aafdfc68, aafda500,)
>> a13f79bc ip:tcp_wput_data+663 (aafda500, 0, 0)
>> a13f7aa0 ip:tcp_rput_data+29b2 (aafda100, a70a13c0,)
>> a13f7af8 ip:squeue_drain+1c3 (a1df0d00, 4, 4bb30e)
>> a13f7b48 ip:squeue_enter_chain+5be (a1df0d00, abb16f40,)
>> a13f7bdc ip:ip_input+731 (a2581754, a2afc020,)
>> a13f7c78 dls:i_dls_link_rx_common+26e (a3041d78, a2afc020,)
>> a13f7c90 dls:i_dls_link_rx_promisc+19 (a3041d78, a2afc020,)
>> a13f7ccc mac:mac_rx+53 (a3042ee8, a2afc020,)
>> a13f7d60 bge:bge_receive+598 (a2c78000, a2f26800)
>> a13f7dac bge:bge_intr+30f (a2c78000, 0)
>>
>> syncing file systems...
>> done
>> dumping to /dev/dsk/c2d0s1, offset 1719074816, content: kernel
>>
>> Then I examined the crash dump file with mdb, and we found that most of
>> the CPUs were idle when the system panicked:
>>
>>> ::cpuinfo -v
>>  ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
>>   0 fec254f8  1b    5   10 105   no    no t-5676 a13f7de0 sched
>>       |          |     |
>>  RUNNING <--+     |     +--> PIL THREAD
>>    READY          |           10 a13fade0
>>   EXISTS          |            6 a13f7de0
>>   ENABLE          |            - a11ccde0 (idle)
>>                   |
>>                   +--> PRI THREAD   PROC
>>                        109 a13fade0 sched
>>                         60 a1b7ede0 sched
>>                         60 a1587de0 sched
>>                         60 a27a1de0 sched
>>                         59 a727f000 snoop
>>
>>  ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
>>   1 a1d6e200  1f    1    0  -1   no    no t-0    a22a9de0 (idle)
>>       |          |
>>  RUNNING <--+     +--> PRI THREAD   PROC
>>    READY               60 a24a3de0 sched
>> QUIESCED
>>   EXISTS
>>   ENABLE
>>
>>  ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
>>   2 a1d6d180  1f    0    0  -1   no    no t-0    a23f6de0 (idle)
>>       |
>>  RUNNING <--+
>>    READY
>> QUIESCED
>>   EXISTS
>>   ENABLE
>>
>>  ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
>>   3 a1d6c100  1f    0    0  -1   no    no t-0    a242fde0 (idle)
>>       |
>>  RUNNING <--+
>>    READY
>> QUIESCED
>>   EXISTS
>>   ENABLE
>>
>> CPUs 1, 2, and 3 are in the QUIESCED state; I think they must have been
>> disabled by deadman via a cross-call.
>>
>> On CPU0, a network interrupt thread, a13f7de0, was interrupted by a
>> clock (cyclic) interrupt:
>>
>>> a13f7de0::findstack -v
>> stack pointer for thread a13f7de0: a13f75a8
>>   a13f75b8 _interrupt+0xe7()
>>   a13f7638 bcopy+0x39(abb11780)
>>   a13f765c copymsg+0x2e(aa228340)
>>   a13f7688 copymsgchain+0x23(aa228340)
>>   a13f76c0 i_dls_link_rx_func+0x9d(a3041d78, 0, a13f773c, aa228340, 10000, 0)
>>   a13f7704 i_dls_link_rx_common_promisc+0x3b(a3041d78, 0, a13f773c, aa228340, 0, e669fcdc)
>>   a13f77a8 i_dls_link_rx_common+0x11c(a3041d78, 0, aa228340, e669fcdc)
>>   a13f77c4 i_dls_link_txloop+0x18(a3041d78, aa228340)
>>   a13f77f0 mac_txloop+0x92(a3042ee8, aa228340)
>>   a13f780c dls_tx+0x16(a3026f80, abb08e80)
>>   a13f7828 dld_tx_single+0x1e(a2bfdea8, abb08e80)
>>   a13f7840 str_mdata_fastpath_put+0x60(a2bfdea8, abb08e80)
>>   a13f78b0 tcp_send_data+0x85c(aafda500, aafdfc68, abb08e80)
>>   a13f791c tcp_send+0x6f1(aafdfc68, aafda500, 5a8, 34, 20, 0)
>>   a13f79bc tcp_wput_data+0x663(aafda500, 0, 0)
>>   a13f7aa0 tcp_rput_data+0x29b2(aafda100, a70a13c0, a1df0d00)
>>   a13f7af8 squeue_drain+0x1c3(a1df0d00, 4, 4bb30e5e, bd)
>>   a13f7b48 squeue_enter_chain+0x5be(a1df0d00, abb16f40, a70a13c0, 3, 1)
>>   a13f7bdc ip_input+0x731(a2581754, a2afc020, a1359540, a13f7c0c)
>>   a13f7c78 i_dls_link_rx_common+0x26e(a3041d78, a2afc020, a1359540, e669fc0c)
>>   a13f7c90 i_dls_link_rx_promisc+0x19(a3041d78, a2afc020, a1359540)
>>   a13f7ccc mac_rx+0x53(a3042ee8, a2afc020, a1359540)
>>   a13f7d60 bge_receive+0x598(a2c78000, a2f26800)
>>   a13f7dac bge_intr+0x30f(a2c78000, 0)
>>   a13f7ddc intr_thread+0x152()
>>
>> I checked the stack of the clock (cyclic) interrupt; it seems to be a
>> soft interrupt, and it was blocked while trying to process the callout
>> table in the clock routine.
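
A couple of things that might be worth checking from this dump, if I
remember the dcmds correctly -- treat these as rough ideas rather than a
recipe. The PIL of each interrupt thread can be read straight out of the
thread structure, and a brute-force search can show whether anything else
still references the callout table mutex (a1279000) that shows up in the
blocked thread's stack below:

  a13f7de0::print kthread_t t_pil    <- PIL of the bge interrupt thread (expect 6)
  a13fade0::print kthread_t t_pil    <- PIL of the clock soft interrupt thread (expect 10)
  a1279000::kgrep | ::whatis         <- other references to the mutex (can be slow on a 16GB dump)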
>>
>> I don't know why it was blocked, and I didn't see any other thread
>> holding the mutex:
>>
>>> a13fade0::findstack -v
>> stack pointer for thread a13fade0: a13fabb8
>>   a13fabec swtch+0xc8()
>>   a13fac24 turnstile_block+0x775(aab00900, 0, a1279000, fec04bc8, 0, 0)
>>   a13fac80 mutex_vector_enter+0x34e(a1279000)
>>   a13faca8 callout_schedule_1+0x13(a1279000)
>>   a13facc8 callout_schedule+0x31()
>>   a13fad14 clock+0x488(0)
>>   a13fad80 cyclic_softint+0x29e(fec21d30, 1)
>>   a13fad94 cbe_softclock+0x14(0, 0)
>>   a13fadcc av_dispatch_softvect+0x66(a)
>>   a13faddc dosoftint+0x109()
>>
>>> a13fade0::thread
>>     ADDR    STATE  FLG PFLG SFLG  PRI EPRI PIL INTR DISPTIME BOUND PR
>> a13fade0      run    9    0    3  109    0  10  n/a        0    -1  1
>>
>>> a13fade0::thread -b
>>     ADDR     WCHAN       TS     PITS SOBJ OPS
>> a13fade0         0 aab00900        0        0
>>
>>> a1279000::mutex
>>     ADDR  TYPE  HELD MINSPL OLDSPL WAITERS
>> a1279000 adapt    no      -      -      no
>>
>> The state of thread a13fade0 is run, and I think that when the system
>> panicked, deadman should have been running on CPU0.
>> How can I find the stack of deadman? I know it should be a high-level
>> interrupt, but I can't get any information from cpu_t cpu_intr_stack:
>>
>>> fec254f8::print cpu_t cpu_intr_stack
>> cpu_intr_stack = 0xa13f5000
>>> 0xa13f5000,20/nap
>> 0xa13f5000:
>> mdb: failed to read data from target: no mapping for address 0xa13f5000:
>> mdb: failed to read data from target: no mapping for address
>>
>> We can see that most of physical memory is free:
>>
>>> ::memstat
>> Page Summary                Pages                MB  %Tot
>> ------------     ----------------  ----------------  ----
>> Kernel                     127163               496    3%
>> Anon                        14338                56    0%
>> Exec and libs                1775                 6    0%
>> Page cache                    509                 1    0%
>> Free (cachelist)              642                 2    0%
>> Free (freelist)           4046691             15807   97%
>>
>> Total                     4191118             16371
>> Physical                  4191117             16371
>>
>> We can see that the clock cyclic has 5112 pending firings, which is why
>> deadman was triggered.
>> We can also see a lot of pending firings for apic_redistribute_compute;
>> does that mean there is an APIC issue?
>>
>>> ::cycinfo -v
>> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
>>   0  a14fe000  online      5 aa2182c0      bd4a938200 deadman
>>
>>                       3
>>                       |
>>             +---------+---------+
>>             0                   4
>>             |                   |
>>        +----+----+         +----+----+
>>        1         2
>>        |         |
>>   +----+----+ +----+----+
>>
>>      ADDR NDX HEAP LEVL  PEND            FIRE  USECINT HANDLER
>>  aa2182c0   0    1 high     0      bd4a938200    10000 cbe_hres_tick
>>  aa2182e0   1    3  low 10787      bd4b2c1880    10000 apic_redistribute_compute
>>  aa218300   2    4 lock  5112      bd4b2c1880    10000 clock
>>  aa218320   3    0 high     0      bd4a938200  1000000 deadman
>>  aa218340   4    2  low    11      beebcf0800 10000000 ao_mca_poll_cyclic
>>
>> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
>>   1  a2742000  online      2 a133d000      f226b17650 deadman
>>
>>                       1
>>                       |
>>             +---------+---------+
>>             0
>>             |
>>        +----+----+
>>
>>      ADDR NDX HEAP LEVL  PEND            FIRE  USECINT HANDLER
>>  a133d000   0    1  low     1      f224d4a000 10000000 ao_mca_poll_cyclic
>>  a133d020   1    0 high     0      f226b17650  1000000 deadman
>>
>> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
>>   2  a274d000  online      4 a26c2a80      f200000000 bge_chip_cyclic
>>
>>                       2
>>                       |
>>             +---------+---------+
>>             0                   3
>>             |                   |
>>        +----+----+         +----+----+
>>        1
>>        |
>>   +----+----+
>>
>>      ADDR NDX HEAP LEVL  PEND            FIRE  USECINT HANDLER
>>  a26c2a80   0    1  low     1      f224d4a000 10000000 ao_mca_poll_cyclic
>>  a26c2aa0   1    3 high     0      f1ecf382a0  1000000 deadman
>>  a26c2ac0   2    0 lock    18      f200000000   536870 bge_chip_cyclic
>>  a26c2ae0   3    2 lock     9      f200000000  1073741 nge_chip_cyclic
>>
>> CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
>>   3  a2758000  online      2 a133d880      f22a6b22f0 deadman
>>
>>                       1
>>                       |
>>             +---------+---------+
>>             0
>>             |
>>        +----+----+
>>
>>      ADDR NDX HEAP LEVL  PEND            FIRE  USECINT HANDLER
>>  a133d880   0    1  low     1      f224d4a000 10000000 ao_mca_poll_cyclic
>>  a133d8a0   1    0 high     0      f22a6b22f0  1000000 deadman
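
One sanity check on the pending counts above: the clock cyclic has a USECINT
of 10000 (10 ms), so 5112 pending firings works out to roughly
5112 * 10 ms = ~51 seconds during which clock() never ran to completion,
which matches the "timed out after 50 seconds of clock inactivity" panic
message. apic_redistribute_compute is also a 10 ms cyclic, so its 10787
pending firings mean its low-level handler hasn't run for roughly 108
seconds. To me that looks like CPU 0 simply hasn't been able to process its
low- and lock-level cyclic handlers for a long time, rather than an APIC
problem, but I'm not certain.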

--
Cheers,
----------------------------------------------------------------------
Oliver Yang | OPG Engineering Operation | Oliver.Yang at Sun.COM | x82229