The problem is that the level6 interrupt handler (network interrupt) is blocked (base spl of CPU 0 is 6) which blocked level1 as well. That's the reason softcall state is still SOFT_PEND (2).
This problem is being worked as part of bug :- 6292092 callout should not be blocked by interrupts from executing realtime timeouts 6540436 kpreempt() needs a more reliable way to generate level1 intr We have seen this problem happening very often with e1000g. Thanks, -- Saurabh Mishra, Solaris Filesystems & Kernel | NAS OS http://blogs.sun.com/saurabh_mishra Eric Saxe wrote: > FYI, question posed on opensolaris forums, I figured you probably have > an answer. :) > > -Eric > > ------------------------------------------------------------------------ > > Subject: > [osol-code] Questions about overdue callout_execute > From: > Oliver Yang <Oliver.Yang at sun.com> > Date: > Mon, 27 Aug 2007 16:48:36 +0800 > To: > mdb-discuss at opensolaris.org, opensolaris-code at opensolaris.org > > To: > mdb-discuss at opensolaris.org, opensolaris-code at opensolaris.org > > > Hi All, > > I have several questions about overdue callout_execute > > 1. How does the system handle a overdue callout_execute? > > 2. As far as I know, setrun entry in callout table will be handled as a > softcall, which is a software interrupt(softlevel1) with PIL 1, right? > > 3. If we found there are so many overdue setrun entries in callout > table, what we should check by using mdb? > > Here are background information with my other questions: > > I had encountered a system hang issue in my test env and I forced a > crash dump file successfully. > > In this crash dump file, I found a e1000g driver thread blocked on > cv_timedwait, but we supposed it should return after 1 tick, but it had > never returned. > > > ::callout ! grep fffffe80044e5c80 > setrun fffffe80044e5c80 3ffffffffffe1a80 > 1036d (T-33730) ---> it was overdue about 33730 ticks. > > stack pointer for thread fffffe80044e5c80: fffffe80044e5880 > [ fffffe80044e5880 _resume_from_idle+0xf8() ] > fffffe80044e58c0 swtch+0x167() > fffffe80044e5930 cv_timedwait+0xcf(ffffffff82f76390, ffffffff82f76388, > 1036d) > fffffe80044e59c0 cv_timedwait_sig+0x2cc(ffffffff82f76390, > ffffffff82f76388, 1036d) > fffffe80044e5a70 e1000g_send+0x136(ffffffff82f76370, ffffffffac2fce40) > fffffe80044e5ab0 e1000g_m_tx+0x6f(ffffffff82f76000, ffffffffa21f8180) > fffffe80044e5ad0 dls_tx+0x1d(ffffffff82f2ec80, ffffffffa21f8180) > fffffe80044e5b20 dld_wsrv+0xcc(ffffffff894acb70) > fffffe80044e5b50 runservice+0x42(ffffffff894acb70) > fffffe80044e5b80 queue_service+0x42(ffffffff894acb70) > fffffe80044e5bc0 stream_service+0x73(ffffffff83905740) > fffffe80044e5c60 taskq_d_thread+0xbb(ffffffff833af820) > fffffe80044e5c70 thread_start+8() > > After I check the callout table, I found the relevant callout entry for > this thread, and it was overdue for some reason: > > > ::callout ! grep fffffe80044e5c80 > setrun fffffe80044e5c80 3ffffffffffe1a80 > 1036d (T-33730) --------> This indicate it is overdue. > > I also find about 2573 overdue entries in callout table, it was really a > big number: > > > ::callout > FUNCTION ARGUMENT ID TIME > sigalarm2proc ffffffff9569aae0 7fffffffffffc010 > 144a1 (T-17038) -----> This indicated the it was overdue. > sigalarm2proc ffffffff91bb7510 7fffffffffffe010 > 14484 (T-17067) > sigalarm2proc ffffffff9569c380 7fffffffffffc020 > 144a1 (T-17038) > sigalarm2proc ffffffff95428d48 7fffffffffffc030 > 144a1 (T-17038) > sigalarm2proc ffffffff91bb8db0 7fffffffffffe030 > 14483 (T-17068) > sigalarm2proc ffffffff9542b238 7fffffffffffc040 > 144a1 (T-17038) > .....................[snipped].................................................................... > > > ::callout ! grep "T-" | wc -l > 2573 > > Why we could find so many overdue entries in callout table? > > Since the setrun entry will be processed by softcall, I tried to print > the global softcall list by softhead. > And we did find there are two callout_execute was pending to handled by > softint: > > > *softhead::list softcall_t sc_next|::print softcall_t > { > sc_func = callout_execute > sc_arg = 0xffffffff80219000 > sc_next = softcalls+0x1290 > } > { > sc_func = callout_execute > sc_arg = 0xffffffff80216000 > sc_next = 0 > } > > > softcall_state/J > softcall_state: > softcall_state: 2 > > #define SOFT_PEND 0x02 /* softcall list needs processing */ > > But when I checked the softint by following command, I found no pending > count in softlevel1 line, why? > > I think the softlevel1 should call into softint(), then, the > callout_execute would be executed. > > But it seemed it had never happened, who can give me some ideas about it? > > > > ::softint > ADDR PEND PIL ARG1 ARG2 ISR(s) > ffffffff8277a5c0 0 1 ffffffff8048da80 0 errorq_intr > fffffffffbc05ae8 0 1 0 0 > softlevel1 ------------> no pending count. > ffffffff8277a4c0 0 2 ffffffff8048dd00 0 errorq_intr > fffffffffbc00070 0 2 0 0 cbe_low_level > ffffffff83a706c0 0 4 ffffffff90004d18 0 ghd_doneq_process > ffffffff8ed31880 0 4 ffffffff90004d18 0 > ghd_timeout_softintr > ffffffff83946e00 0 4 ffffffff8f046c40 0 power_soft_intr > ffffffff83a70c00 0 4 ffffffff83b1b000 0 bge_chip_factotum > ffffffff83a70cc0 0 4 ffffffff83b1b000 0 bge_reschedule > ffffffff8277a2c0 0 4 0 0 asysoftintr > ffffffff82f4d100 0 9 ffffffff82f76370 0 > e1000g_tx_softint_worker > ffffffff82f4df00 0 9 ffffffff82f86370 0 > e1000g_tx_softint_worker > ffffffff833b3e80 0 9 ffffffff801af7e8 0 hcdi_soft_intr > ffffffff8277a000 0 9 ffffffff801afb68 0 hcdi_soft_intr > fffffffffbc00030 0 10 0 0 cbe_softclock > > > Actually, at that time, system still had 1 IDLE CPU, I think the system > shouldn't have so many overdue callout entries. > > > ::cpuinfo -v > ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH > THREAD PROC > 0 fffffffffbc27730 1f 1 6 169 no no t-3 > fffffe80000bfc80 sched > | | | > RUNNING <--+ | +--> PIL THREAD > READY | 10 fffffe80000bfc80 > QUIESCED | 6 fffffe80000b9c80 > EXISTS | > ENABLE +--> PRI THREAD PROC > 99 fffffe80000d1c80 sched > > ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH > THREAD PROC > 1 fffffffffbc2f260 1b 1 0 -1 no no t-17 > fffffe8000401c80 (idle) > | | > RUNNING <--+ +--> PRI THREAD PROC > READY 60 fffffe80044d9c80 sched > EXISTS > ENABLE > > What else should I check?