Re: [driver-discuss] Am I understanding this correctly? -- potential e1000g bug

Jason King Thu, 17 Sep 2009 08:27:46 -0700

On Thu, Sep 17, 2009 at 10:16 AM, Garrett D'Amore <[email protected]> wrote:
> Look closely at the stack.  You'll notice that a PIL9 interrupt
> *interrupted* e1000g while it was servicing an interrupt.  I don't think
> e1000g is at fault here.  Something else is doing it.


This is probably my lack of knowledge about how Solaris handles
interrupts, but after a little digging:

>  0xffffff0007c49c60::findstack -v
stack pointer for thread ffffff0007c49c60: ffffff0007c49b30
  ffffff0007c49bb0 rm_isr+0xaa()
  ffffff0007c49c00 av_dispatch_autovect+0x7c(10)
  ffffff0007c49c40 dispatch_hardint+0x33(10, 6)
  ffffff0007c4f450 switch_sp_and_call+0x13()
  ffffff0007c4f4a0 do_interrupt+0x9e(ffffff0007c4f4b0, b)
  ffffff0007c4f4b0 _interrupt+0xba()

I'm assuming this portion of the stack dump is what you're talking
about. Looking at the function signature for dispatch_hardint, the
new vector is 10 and the old IPL is 6.

> ::interrupts -d
IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# Driver Name(s)
3    0xb1 12  ISA    Edg Fixed  0   1     0x0/0x3   asy#1
4    0xb0 12  ISA    Edg Fixed  0   1     0x0/0x4   asy#0
6    0x41 5   ISA    Edg Fixed  0   1     0x0/0x6   fdc#0
7    0x42 5   ISA    Edg Fixed  1   1     0x0/0x7   ecpp#0
9    0x81 9   PCI    Lvl Fixed  1   1     0x0/0x9   acpi_wrapper_isr
15   0x43 5   ISA    Edg Fixed  0   1     0x0/0xf   ata#1
16   0x83 9   PCI    Lvl Fixed  1   4     0x0/0x10  hci1394#0, uhci#3, uhci#0, nvidia#0
17   0x87 8   PCI    Lvl Fixed  0   1     0x0/0x11  audio810#0
18   0x86 9   PCI    Lvl Fixed  1   1     0x0/0x12  pci-ide#1
19   0x85 9   PCI    Lvl Fixed  0   1     0x0/0x13  uhci#1
23   0x84 9   PCI    Lvl Fixed  1   1     0x0/0x17  ehci#0
26   0x40 5   PCI    Lvl Fixed  1   1     0x1/0x2   aac#0
48   0x60 6   PCI    Lvl Fixed  1   1     0x2/0x0   e1000g#0
72   0x82 7   PCI    Edg MSI    0   1     -         pcie_pci#0
73   0x30 4   PCI    Edg MSI    0   1     -         pcie_pci#2
74   0x44 5   PCI    Edg MSI    0   1     -         adpu320#0
160  0xa0 0          Edg IPI    all 0     -         poke_cpu
192  0xc0 13         Edg IPI    all 1     -         xc_serv
208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
209  0xd1 14         Edg IPI    all 1     -         cbe_fire
210  0xd3 14         Edg IPI    all 1     -         cbe_fire
240  0xe0 15         Edg IPI    all 1     -         xc_serv
241  0xe1 15         Edg IPI    all 1     -         apic_error_intr

That makes sense -- e1000g#0 is at IPL 6. However, shouldn't there then
be an entry somewhere in there with a Vect value of 0x0a and an IPL of 9?
Or do I still have more learning to do?
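
One way to cross-check the lookup is sketched below in Python. It rests on two assumptions that are not confirmed anywhere in this thread: that mdb prints the ::findstack -v arguments in hexadecimal without a 0x prefix (so "10" would be 0x10, i.e. 16 decimal, not 0x0a), and that the first argument to dispatch_hardint may index the IRQ column rather than the Vect column. The names INTERRUPTS and rows_matching are made up for illustration; the rows are the Fixed and MSI entries copied from the ::interrupts -d output above.

```python
# (IRQ, Vect, IPL, driver names) copied from the ::interrupts -d output above.
INTERRUPTS = [
    (3, 0xB1, 12, "asy#1"),
    (4, 0xB0, 12, "asy#0"),
    (6, 0x41, 5, "fdc#0"),
    (7, 0x42, 5, "ecpp#0"),
    (9, 0x81, 9, "acpi_wrapper_isr"),
    (15, 0x43, 5, "ata#1"),
    (16, 0x83, 9, "hci1394#0, uhci#3, uhci#0, nvidia#0"),
    (17, 0x87, 8, "audio810#0"),
    (18, 0x86, 9, "pci-ide#1"),
    (19, 0x85, 9, "uhci#1"),
    (23, 0x84, 9, "ehci#0"),
    (26, 0x40, 5, "aac#0"),
    (48, 0x60, 6, "e1000g#0"),
    (72, 0x82, 7, "pcie_pci#0"),
    (73, 0x30, 4, "pcie_pci#2"),
    (74, 0x44, 5, "adpu320#0"),
]

IRQ, VECT = 0, 1  # column indexes into each row


def rows_matching(arg_text, column, base=16):
    """Return rows whose `column` equals arg_text parsed in the given base.

    base=16 encodes the ASSUMPTION that mdb shows arguments in hex.
    """
    value = int(arg_text, base)
    return [row for row in INTERRUPTS if row[column] == value]


if __name__ == "__main__":
    # First argument of dispatch_hardint as shown in the stack: "10".
    print("hex arg vs IRQ  column:", rows_matching("10", IRQ))
    print("hex arg vs Vect column:", rows_matching("10", VECT))
    print("dec arg vs Vect column:", rows_matching("10", VECT, base=10))
```

Under those assumptions, only the IRQ-column lookup finds a row: IRQ 16, the shared IPL 9 line. Reading the argument as decimal 10 (0x0a) matches nothing in either column, which would be consistent with not finding a Vect 0x0a entry in the table.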


>
>   - Garrett
>
> Jason King wrote:
>>
>> I have a desktop that keeps freezing.. after some work, I managed to
>> force a crashdump via kmdb.. It's a dual-core xeon desktop running OS
>> 2009.06 -- in this case I'm running virtualbox on it with a bridged
>> ethernet connection.
>>
>> My very rudimentary analysis is this:
>>
>> zsh 3 % scat 0
>>
>>  Solaris[TM] CAT 5.2 for Solaris 11 64-bit x64
>>    SV4990M, Aug 26 2009
>>
>>  Copyright © 2009 Sun Microsystems, Inc. All rights reserved.
>>  Use is subject to license terms.
>>
>>  Feedback regarding the tool should be sent to [email protected]
>>  Visit the Solaris CAT blog at http://blogs.sun.com/SolarisCAT
>>
>> opening unix.0 vmcore.0 ...dumphdr...symtab...core...done
>> loading core data: modules...symbols...CTF...done
>>
>> core file:      /var/crash/homer/vmcore.0
>> user:           Jason King (jking:101)
>> release:        5.11 (64-bit)
>> version:        snv_111b
>> machine:        i86pc
>> node name:      homer
>> system type:    i86pc
>> hostid:         4eda84
>> dump_conflags:  0x10000 (DUMP_KERNEL) on /dev/zvol/dsk/rpool/dump(1.96G)
>> snooping:       0x1
>> boothowto:      0x22040 (DEBUG|VERBOSE|KMDB)
>> time of crash:  Thu Sep 17 09:39:15 CDT 2009
>> age of system:  15 hours 32 minutes 10.05 seconds
>> panic CPU:      0 (2 CPUs, 3.93G memory)
>> panic string:   BAD TRAP: type=e (#pf Page fault) rp=ffffff00078d1da0
>> addr=0 occurred in module "<unknown>" due to a NULL pointer
>> dereference
>>
>> sanity checks: settings...vmem...
>> WARNING: CPU0 has cpu_intr_actv for 2
>> WARNING: CPU1 has cpu_intr_actv for 6 9
>> WARNING: last_swtch[1]: 0x553c75 (1 minutes 9.68 seconds earlier)
>> WARNING: PIL9 interrupt thread 0xffffff0007c49c60 on CPU1 pinning PIL6
>> interrupt thread 0xffffff0007c4fc60 pinning IA thread
>> 0xffffff01d6d55740
>> sysent...clock...misc...
>> WARNING: 54 expired realtime (max -1m41.272660310s) and 27 expired
>> normal (max -11.562660310s) callouts
>> done
>>
>> Does this mean that interrupt thread 0xffffff0007c4fc60 is taking too
>> long?  It would explain why the box seems to hang.
>> That thread is:
>>
>>  % mdb -k unix.0 vmcore.0
>> Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc
>> pcplusmp scsi_vhci zfs sd sockfs ip hook neti sctp arp usba uhci s1394
>> fctl md lofs audiosup fcip fcp cpc random crypto logindmux ptm ufs
>> nsmb sppp ipc ]
>>
>> > 0xffffff0007c49c60::findstack
>> stack pointer for thread ffffff0007c49c60: ffffff0007c49b30
>>  ffffff0007c49bb0 rm_isr+0xaa()
>>  ffffff0007c49c00 av_dispatch_autovect+0x7c()
>>  ffffff0007c49c40 dispatch_hardint+0x33()
>>  ffffff0007c4f450 switch_sp_and_call+0x13()
>>  ffffff0007c4f4a0 do_interrupt+0x9e()
>>  ffffff0007c4f4b0 _interrupt+0xba()
>>  ffffff0007c4f5c0 default_lock_delay+0x8c()
>>  ffffff0007c4f630 lock_set_spl_spin+0xc2()
>>  ffffff0007c4f690 mutex_vector_enter+0x45e()
>>  ffffff0007c4f6c0 RTSemEventSignal+0x6a()
>>  ffffff0007c4f740 0xfffffffff836c57b()
>>  ffffff0007c4f770 0xfffffffff836d73a()
>>  ffffff0007c4f830 vboxNetFltSolarisRecv+0x331()
>>  ffffff0007c4f880 VBoxNetFltSolarisModReadPut+0x107()
>>  ffffff0007c4f8f0 putnext+0x21e()
>>  ffffff0007c4f950 dld_str_rx_raw+0xb3()
>>  ffffff0007c4fa10 dls_rx_promisc+0x179()
>>  ffffff0007c4fa50 mac_promisc_dispatch_one+0x5f()
>>  ffffff0007c4fac0 mac_promisc_dispatch+0x105()
>>  ffffff0007c4fb10 mac_rx+0x3e()
>>  ffffff0007c4fb50 mac_rx_ring+0x4c()
>>  ffffff0007c4fbb0 e1000g_intr+0x17e()
>>
>> Do I appear to be on the right track, and can anyone offer any
>> additional suggestions where to go from here (or even recognize the
>> problem)?
>> _______________________________________________
>> driver-discuss mailing list
>> [email protected]
>> http://mail.opensolaris.org/mailman/listinfo/driver-discuss
>>
>
>
