The problem is that the level6 interrupt handler (network interrupt) is 
blocked (base spl of CPU 0 is 6) which blocked level1 as well. That's 
the reason softcall state is still SOFT_PEND (2).

This problem is being worked as part of bug :-

6292092 callout should not be blocked by interrupts from executing 
realtime timeouts
6540436 kpreempt() needs a more reliable way to generate level1 intr

We have seen this problem happening very often with e1000g.

Thanks,
--
Saurabh Mishra,
Solaris Filesystems & Kernel | NAS OS
http://blogs.sun.com/saurabh_mishra

Eric Saxe wrote:
> FYI, question posed on opensolaris forums, I figured you probably have 
> an answer. :)
> 
> -Eric
> 
> ------------------------------------------------------------------------
> 
> Subject:
> [osol-code] Questions about overdue callout_execute
> From:
> Oliver Yang <Oliver.Yang at sun.com>
> Date:
> Mon, 27 Aug 2007 16:48:36 +0800
> To:
> mdb-discuss at opensolaris.org, opensolaris-code at opensolaris.org
> 
> To:
> mdb-discuss at opensolaris.org, opensolaris-code at opensolaris.org
> 
> 
> Hi All,
> 
> I have several questions about overdue callout_execute
> 
> 1. How does the system handle a overdue callout_execute?
> 
> 2. As far as I know, setrun entry in callout table will be handled as a 
> softcall, which is a software interrupt(softlevel1) with PIL 1, right?
>  
> 3. If we found there are so many overdue setrun entries in callout 
> table, what we should check by using mdb?
> 
> Here are background information with my other questions:
> 
> I had encountered a system hang issue in my test env and I forced a 
> crash dump file successfully.
> 
> In this crash dump file, I found a e1000g driver thread blocked on 
> cv_timedwait, but we supposed it should return after 1 tick, but it had 
> never returned.
> 
>  > ::callout ! grep fffffe80044e5c80
> setrun                   fffffe80044e5c80 3ffffffffffe1a80            
> 1036d (T-33730)  ---> it was overdue about 33730 ticks.
> 
> stack pointer for thread fffffe80044e5c80: fffffe80044e5880
> [ fffffe80044e5880 _resume_from_idle+0xf8() ]
>  fffffe80044e58c0 swtch+0x167()
>  fffffe80044e5930 cv_timedwait+0xcf(ffffffff82f76390, ffffffff82f76388, 
> 1036d)
>  fffffe80044e59c0 cv_timedwait_sig+0x2cc(ffffffff82f76390, 
> ffffffff82f76388, 1036d)
>  fffffe80044e5a70 e1000g_send+0x136(ffffffff82f76370, ffffffffac2fce40)
>  fffffe80044e5ab0 e1000g_m_tx+0x6f(ffffffff82f76000, ffffffffa21f8180)
>  fffffe80044e5ad0 dls_tx+0x1d(ffffffff82f2ec80, ffffffffa21f8180)
>  fffffe80044e5b20 dld_wsrv+0xcc(ffffffff894acb70)
>  fffffe80044e5b50 runservice+0x42(ffffffff894acb70)
>  fffffe80044e5b80 queue_service+0x42(ffffffff894acb70)
>  fffffe80044e5bc0 stream_service+0x73(ffffffff83905740)
>  fffffe80044e5c60 taskq_d_thread+0xbb(ffffffff833af820)
>  fffffe80044e5c70 thread_start+8()
> 
> After I check the callout table, I found the relevant callout entry for 
> this thread, and it was overdue for some reason:
> 
>  > ::callout ! grep fffffe80044e5c80
> setrun                   fffffe80044e5c80 3ffffffffffe1a80            
> 1036d (T-33730)       --------> This indicate it is overdue.
> 
> I also find about 2573 overdue entries in callout table, it was really a 
> big number:
> 
>  > ::callout
> FUNCTION                 ARGUMENT         ID               TIME           
> sigalarm2proc            ffffffff9569aae0 7fffffffffffc010            
> 144a1 (T-17038) -----> This indicated the it was overdue.
> sigalarm2proc            ffffffff91bb7510 7fffffffffffe010            
> 14484 (T-17067)
> sigalarm2proc            ffffffff9569c380 7fffffffffffc020            
> 144a1 (T-17038)
> sigalarm2proc            ffffffff95428d48 7fffffffffffc030            
> 144a1 (T-17038)
> sigalarm2proc            ffffffff91bb8db0 7fffffffffffe030            
> 14483 (T-17068)
> sigalarm2proc            ffffffff9542b238 7fffffffffffc040            
> 144a1 (T-17038)
> .....................[snipped]....................................................................
> 
>  > ::callout ! grep "T-" | wc -l
>     2573
> 
> Why we could find so many overdue entries in callout table?
> 
> Since the setrun entry will be processed by softcall, I tried to print 
> the global softcall list by softhead.
> And we did find there are two callout_execute was pending to handled by 
> softint:
> 
>  > *softhead::list softcall_t sc_next|::print softcall_t
> {
>     sc_func = callout_execute
>     sc_arg = 0xffffffff80219000
>     sc_next = softcalls+0x1290
> }
> {
>     sc_func = callout_execute
>     sc_arg = 0xffffffff80216000
>     sc_next = 0
> }
> 
>  > softcall_state/J
> softcall_state:
> softcall_state: 2              
> 
> #define SOFT_PEND               0x02    /* softcall list needs processing */
> 
> But when I checked the softint by following command, I found no pending 
> count in softlevel1 line, why?
> 
> I think the softlevel1 should call into softint(), then, the 
> callout_execute would be executed.
> 
> But it seemed it had never happened, who can give me some ideas about it?
> 
> 
>  > ::softint
> ADDR             PEND PIL ARG1             ARG2            ISR(s)
> ffffffff8277a5c0 0    1   ffffffff8048da80 0               errorq_intr
> fffffffffbc05ae8 0    1   0                0               
> softlevel1            ------------> no pending count.
> ffffffff8277a4c0 0    2   ffffffff8048dd00 0               errorq_intr
> fffffffffbc00070 0    2   0                0               cbe_low_level
> ffffffff83a706c0 0    4   ffffffff90004d18 0               ghd_doneq_process
> ffffffff8ed31880 0    4   ffffffff90004d18 0               
> ghd_timeout_softintr
> ffffffff83946e00 0    4   ffffffff8f046c40 0               power_soft_intr
> ffffffff83a70c00 0    4   ffffffff83b1b000 0               bge_chip_factotum
> ffffffff83a70cc0 0    4   ffffffff83b1b000 0               bge_reschedule
> ffffffff8277a2c0 0    4   0                0               asysoftintr
> ffffffff82f4d100 0    9   ffffffff82f76370 0               
> e1000g_tx_softint_worker
> ffffffff82f4df00 0    9   ffffffff82f86370 0               
> e1000g_tx_softint_worker
> ffffffff833b3e80 0    9   ffffffff801af7e8 0               hcdi_soft_intr
> ffffffff8277a000 0    9   ffffffff801afb68 0               hcdi_soft_intr
> fffffffffbc00030 0    10  0                0               cbe_softclock
> 
> 
> Actually, at that time, system still had 1 IDLE CPU, I think the system 
> shouldn't have so many overdue callout entries.
> 
>  > ::cpuinfo -v
>  ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH 
> THREAD           PROC
>   0 fffffffffbc27730  1f    1    6 169   no    no t-3    
> fffffe80000bfc80 sched
>                        |    |    | 
>             RUNNING <--+    |    +--> PIL THREAD
>               READY         |          10 fffffe80000bfc80
>            QUIESCED         |           6 fffffe80000b9c80
>              EXISTS         |
>              ENABLE         +-->  PRI THREAD           PROC
>                                    99 fffffe80000d1c80 sched
> 
>  ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH 
> THREAD           PROC
>   1 fffffffffbc2f260  1b    1    0  -1   no    no t-17   
> fffffe8000401c80 (idle)
>                        |    |
>             RUNNING <--+    +-->  PRI THREAD           PROC
>               READY                60 fffffe80044d9c80 sched
>              EXISTS
>              ENABLE
> 
> What else should I check?

Reply via email to