Re: [osol-code] [mdb-discuss] [Fwd: Questions about overdue callout_execute]

Oliver Yang Mon, 27 Aug 2007 18:54:02 -0700

Saurabh Misra wrote:
> The problem is that the level6 interrupt handler (network interrupt) is 
> blocked (base spl of CPU 0 is 6) which blocked level1 as well. That's 
> the reason softcall state is still SOFT_PEND (2).
>
> This problem is being worked as part of bug :-
>
> 6292092 callout should not be blocked by interrupts from executing 
> realtime timeouts
> 6540436 kpreempt() needs a more reliable way to generate level1 intr
>
> We have seen this problem happening very often with e1000g.
>   
Hi Saurabh,


What I concerned is, although CPU0 BSPL is 6 as you said. But CPU1 is 
IDLE at time,
why those pending softcall can't be dispatched to CPU1?


> Eric Saxe wrote:
>   
>> FYI, question posed on opensolaris forums, I figured you probably have 
>> an answer. :)
>>
>> -Eric
>>
>> ------------------------------------------------------------------------
>>
>> Subject:
>> [osol-code] Questions about overdue callout_execute
>> From:
>> Oliver Yang <[EMAIL PROTECTED]>
>> Date:
>> Mon, 27 Aug 2007 16:48:36 +0800
>> To:
>> [EMAIL PROTECTED], [email protected]
>>
>> To:
>> [EMAIL PROTECTED], [email protected]
>>
>>
>> Hi All,
>>
>> I have several questions about overdue callout_execute
>>
>> 1. How does the system handle a overdue callout_execute?
>>
>> 2. As far as I know, setrun entry in callout table will be handled as a 
>> softcall, which is a software interrupt(softlevel1) with PIL 1, right?
>>  
>> 3. If we found there are so many overdue setrun entries in callout 
>> table, what we should check by using mdb?
>>
>> Here are background information with my other questions:
>>
>> I had encountered a system hang issue in my test env and I forced a 
>> crash dump file successfully.
>>
>> In this crash dump file, I found a e1000g driver thread blocked on 
>> cv_timedwait, but we supposed it should return after 1 tick, but it had 
>> never returned.
>>
>>  > ::callout ! grep fffffe80044e5c80
>> setrun                   fffffe80044e5c80 3ffffffffffe1a80            
>> 1036d (T-33730)  ---> it was overdue about 33730 ticks.
>>
>> stack pointer for thread fffffe80044e5c80: fffffe80044e5880
>> [ fffffe80044e5880 _resume_from_idle+0xf8() ]
>>  fffffe80044e58c0 swtch+0x167()
>>  fffffe80044e5930 cv_timedwait+0xcf(ffffffff82f76390, ffffffff82f76388, 
>> 1036d)
>>  fffffe80044e59c0 cv_timedwait_sig+0x2cc(ffffffff82f76390, 
>> ffffffff82f76388, 1036d)
>>  fffffe80044e5a70 e1000g_send+0x136(ffffffff82f76370, ffffffffac2fce40)
>>  fffffe80044e5ab0 e1000g_m_tx+0x6f(ffffffff82f76000, ffffffffa21f8180)
>>  fffffe80044e5ad0 dls_tx+0x1d(ffffffff82f2ec80, ffffffffa21f8180)
>>  fffffe80044e5b20 dld_wsrv+0xcc(ffffffff894acb70)
>>  fffffe80044e5b50 runservice+0x42(ffffffff894acb70)
>>  fffffe80044e5b80 queue_service+0x42(ffffffff894acb70)
>>  fffffe80044e5bc0 stream_service+0x73(ffffffff83905740)
>>  fffffe80044e5c60 taskq_d_thread+0xbb(ffffffff833af820)
>>  fffffe80044e5c70 thread_start+8()
>>
>> After I check the callout table, I found the relevant callout entry for 
>> this thread, and it was overdue for some reason:
>>
>>  > ::callout ! grep fffffe80044e5c80
>> setrun                   fffffe80044e5c80 3ffffffffffe1a80            
>> 1036d (T-33730)       --------> This indicate it is overdue.
>>
>> I also find about 2573 overdue entries in callout table, it was really a 
>> big number:
>>
>>  > ::callout
>> FUNCTION                 ARGUMENT         ID               TIME           
>> sigalarm2proc            ffffffff9569aae0 7fffffffffffc010            
>> 144a1 (T-17038) -----> This indicated the it was overdue.
>> sigalarm2proc            ffffffff91bb7510 7fffffffffffe010            
>> 14484 (T-17067)
>> sigalarm2proc            ffffffff9569c380 7fffffffffffc020            
>> 144a1 (T-17038)
>> sigalarm2proc            ffffffff95428d48 7fffffffffffc030            
>> 144a1 (T-17038)
>> sigalarm2proc            ffffffff91bb8db0 7fffffffffffe030            
>> 14483 (T-17068)
>> sigalarm2proc            ffffffff9542b238 7fffffffffffc040            
>> 144a1 (T-17038)
>> .....................[snipped]....................................................................
>>
>>  > ::callout ! grep "T-" | wc -l
>>     2573
>>
>> Why we could find so many overdue entries in callout table?
>>
>> Since the setrun entry will be processed by softcall, I tried to print 
>> the global softcall list by softhead.
>> And we did find there are two callout_execute was pending to handled by 
>> softint:
>>
>>  > *softhead::list softcall_t sc_next|::print softcall_t
>> {
>>     sc_func = callout_execute
>>     sc_arg = 0xffffffff80219000
>>     sc_next = softcalls+0x1290
>> }
>> {
>>     sc_func = callout_execute
>>     sc_arg = 0xffffffff80216000
>>     sc_next = 0
>> }
>>
>>  > softcall_state/J
>> softcall_state:
>> softcall_state: 2              
>>
>> #define SOFT_PEND               0x02    /* softcall list needs processing */
>>
>> But when I checked the softint by following command, I found no pending 
>> count in softlevel1 line, why?
>>
>> I think the softlevel1 should call into softint(), then, the 
>> callout_execute would be executed.
>>
>> But it seemed it had never happened, who can give me some ideas about it?
>>
>>
>>  > ::softint
>> ADDR             PEND PIL ARG1             ARG2            ISR(s)
>> ffffffff8277a5c0 0    1   ffffffff8048da80 0               errorq_intr
>> fffffffffbc05ae8 0    1   0                0               
>> softlevel1            ------------> no pending count.
>> ffffffff8277a4c0 0    2   ffffffff8048dd00 0               errorq_intr
>> fffffffffbc00070 0    2   0                0               cbe_low_level
>> ffffffff83a706c0 0    4   ffffffff90004d18 0               ghd_doneq_process
>> ffffffff8ed31880 0    4   ffffffff90004d18 0               
>> ghd_timeout_softintr
>> ffffffff83946e00 0    4   ffffffff8f046c40 0               power_soft_intr
>> ffffffff83a70c00 0    4   ffffffff83b1b000 0               bge_chip_factotum
>> ffffffff83a70cc0 0    4   ffffffff83b1b000 0               bge_reschedule
>> ffffffff8277a2c0 0    4   0                0               asysoftintr
>> ffffffff82f4d100 0    9   ffffffff82f76370 0               
>> e1000g_tx_softint_worker
>> ffffffff82f4df00 0    9   ffffffff82f86370 0               
>> e1000g_tx_softint_worker
>> ffffffff833b3e80 0    9   ffffffff801af7e8 0               hcdi_soft_intr
>> ffffffff8277a000 0    9   ffffffff801afb68 0               hcdi_soft_intr
>> fffffffffbc00030 0    10  0                0               cbe_softclock
>>
>>
>> Actually, at that time, system still had 1 IDLE CPU, I think the system 
>> shouldn't have so many overdue callout entries.
>>
>>  > ::cpuinfo -v
>>  ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH 
>> THREAD           PROC
>>   0 fffffffffbc27730  1f    1    6 169   no    no t-3    
>> fffffe80000bfc80 sched
>>                        |    |    | 
>>             RUNNING <--+    |    +--> PIL THREAD
>>               READY         |          10 fffffe80000bfc80
>>            QUIESCED         |           6 fffffe80000b9c80
>>              EXISTS         |
>>              ENABLE         +-->  PRI THREAD           PROC
>>                                    99 fffffe80000d1c80 sched
>>
>>  ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH 
>> THREAD           PROC
>>   1 fffffffffbc2f260  1b    1    0  -1   no    no t-17   
>> fffffe8000401c80 (idle)
>>                        |    |
>>             RUNNING <--+    +-->  PRI THREAD           PROC
>>               READY                60 fffffe80044d9c80 sched
>>              EXISTS
>>              ENABLE
>>
>> What else should I check?
>>     
> _______________________________________________
> mdb-discuss mailing list
> [EMAIL PROTECTED]
>   


-- 
Cheers,

------------------------------------------------------------
Oliver Yang | Work from office | http://blog.csdn.net/yayong

_______________________________________________
opensolaris-code mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opensolaris-code

Re: [osol-code] [mdb-discuss] [Fwd: Questions about overdue callout_execute]

Reply via email to