Re: [driver-discuss] Am I understanding this correctly? -- potential e1000g bug

Garrett D'Amore Thu, 17 Sep 2009 08:38:27 -0700

I believe PIL 10 is used by the system timer. More investigation is probably required.


   - Garrett


Jason King wrote:

On Thu, Sep 17, 2009 at 10:16 AM, Garrett D'Amore <[email protected]> wrote:

Look closely at the stack.  You'll notice that a PIL9 interrupt
*interrupted* e1000g while it was servicing an interrupt.  I don't think
e1000g is at fault here.  Something else is doing it.


This is probably my lack of knowledge about how Solaris handles
interrupts, but after doing a little digging:

 0xffffff0007c49c60::findstack -v

stack pointer for thread ffffff0007c49c60: ffffff0007c49b30
  ffffff0007c49bb0 rm_isr+0xaa()
  ffffff0007c49c00 av_dispatch_autovect+0x7c(10)
  ffffff0007c49c40 dispatch_hardint+0x33(10, 6)
  ffffff0007c4f450 switch_sp_and_call+0x13()
  ffffff0007c4f4a0 do_interrupt+0x9e(ffffff0007c4f4b0, b)
  ffffff0007c4f4b0 _interrupt+0xba()

I'm assuming this portion of the stack dump is what you're talking
about... looking at the function signature for dispatch_hardint -- the
new vector is 10, and the old ipl is 6.

::interrupts -d

IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# Driver Name(s)
3    0xb1 12  ISA    Edg Fixed  0   1     0x0/0x3   asy#1
4    0xb0 12  ISA    Edg Fixed  0   1     0x0/0x4   asy#0
6    0x41 5   ISA    Edg Fixed  0   1     0x0/0x6   fdc#0
7    0x42 5   ISA    Edg Fixed  1   1     0x0/0x7   ecpp#0
9    0x81 9   PCI    Lvl Fixed  1   1     0x0/0x9   acpi_wrapper_isr
15   0x43 5   ISA    Edg Fixed  0   1     0x0/0xf   ata#1
16   0x83 9   PCI    Lvl Fixed  1   4     0x0/0x10  hci1394#0, uhci#3, uhci#0, nvidia#0
17   0x87 8   PCI    Lvl Fixed  0   1     0x0/0x11  audio810#0
18   0x86 9   PCI    Lvl Fixed  1   1     0x0/0x12  pci-ide#1
19   0x85 9   PCI    Lvl Fixed  0   1     0x0/0x13  uhci#1
23   0x84 9   PCI    Lvl Fixed  1   1     0x0/0x17  ehci#0
26   0x40 5   PCI    Lvl Fixed  1   1     0x1/0x2   aac#0
48   0x60 6   PCI    Lvl Fixed  1   1     0x2/0x0   e1000g#0
72   0x82 7   PCI    Edg MSI    0   1     -         pcie_pci#0
73   0x30 4   PCI    Edg MSI    0   1     -         pcie_pci#2
74   0x44 5   PCI    Edg MSI    0   1     -         adpu320#0
160  0xa0 0          Edg IPI    all 0     -         poke_cpu
192  0xc0 13         Edg IPI    all 1     -         xc_serv
208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
209  0xd1 14         Edg IPI    all 1     -         cbe_fire
210  0xd3 14         Edg IPI    all 1     -         cbe_fire
240  0xe0 15         Edg IPI    all 1     -         xc_serv
241  0xe1 15         Edg IPI    all 1     -         apic_error_intr

That makes sense -- e1000g#0 is IPL 6. However, shouldn't there then be
an entry somewhere in there with a Vect value of 0x0a and an IPL of 9?
Or do I still have more learning to do?
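
One way to cross-check the question above is to look up the first dispatch_hardint argument in the ::interrupts listing under both readings of "10". This is only a minimal sketch, not part of the original analysis. It assumes that mdb may have printed the stack arguments in hexadecimal (so 10 could mean 0x10 = 16), and it compares the value against both the IRQ and the Vect columns, since which column the argument indexes is also an assumption here. The rows are copied from the listing above; the helper name `lookup` is hypothetical.

```python
# Rows copied from the ::interrupts -d output above: (IRQ, Vect, IPL, drivers).
ROWS = [
    (3, 0xb1, 12, "asy#1"),
    (4, 0xb0, 12, "asy#0"),
    (6, 0x41, 5, "fdc#0"),
    (7, 0x42, 5, "ecpp#0"),
    (9, 0x81, 9, "acpi_wrapper_isr"),
    (15, 0x43, 5, "ata#1"),
    (16, 0x83, 9, "hci1394#0, uhci#3, uhci#0, nvidia#0"),
    (17, 0x87, 8, "audio810#0"),
    (18, 0x86, 9, "pci-ide#1"),
    (19, 0x85, 9, "uhci#1"),
    (23, 0x84, 9, "ehci#0"),
    (26, 0x40, 5, "aac#0"),
    (48, 0x60, 6, "e1000g#0"),
    (72, 0x82, 7, "pcie_pci#0"),
    (73, 0x30, 4, "pcie_pci#2"),
    (74, 0x44, 5, "adpu320#0"),
    (160, 0xa0, 0, "poke_cpu"),
    (192, 0xc0, 13, "xc_serv"),
    (208, 0xd0, 14, "kcpc_hw_overflow_intr"),
    (209, 0xd1, 14, "cbe_fire"),
    (210, 0xd3, 14, "cbe_fire"),
    (240, 0xe0, 15, "xc_serv"),
    (241, 0xe1, 15, "apic_error_intr"),
]


def lookup(rows, value):
    """Return rows whose IRQ or Vect column equals value (hypothetical helper)."""
    return [row for row in rows if value in (row[0], row[1])]


arg = "10"  # first argument of dispatch_hardint in the stack above

# Assumption: the debugger printed the argument in hexadecimal.
print("read as hex:", lookup(ROWS, int(arg, 16)))

# Alternative reading: the argument is decimal.
print("read as decimal:", lookup(ROWS, int(arg, 10)))
```

Under the hexadecimal reading the only match is IRQ 16 at IPL 9, which would be consistent with the PIL9 thread in the scat warning; under the decimal reading nothing in the listing matches.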

  - Garrett

Jason King wrote:

I have a desktop that keeps freezing. After some work, I managed to
force a crash dump via kmdb. It's a dual-core Xeon desktop running
OpenSolaris 2009.06; in this case I'm running VirtualBox on it with a
bridged Ethernet connection.

My very rudimentary analysis is this:

zsh 3 % scat 0

 Solaris[TM] CAT 5.2 for Solaris 11 64-bit x64
   SV4990M, Aug 26 2009

 Copyright © 2009 Sun Microsystems, Inc. All rights reserved.
 Use is subject to license terms.

 Feedback regarding the tool should be sent to [email protected]
 Visit the Solaris CAT blog at http://blogs.sun.com/SolarisCAT

opening unix.0 vmcore.0 ...dumphdr...symtab...core...done
loading core data: modules...symbols...CTF...done

core file:      /var/crash/homer/vmcore.0
user:           Jason King (jking:101)
release:        5.11 (64-bit)
version:        snv_111b
machine:        i86pc
node name:      homer
system type:    i86pc
hostid:         4eda84
dump_conflags:  0x10000 (DUMP_KERNEL) on /dev/zvol/dsk/rpool/dump(1.96G)
snooping:       0x1
boothowto:      0x22040 (DEBUG|VERBOSE|KMDB)
time of crash:  Thu Sep 17 09:39:15 CDT 2009
age of system:  15 hours 32 minutes 10.05 seconds
panic CPU:      0 (2 CPUs, 3.93G memory)
panic string:   BAD TRAP: type=e (#pf Page fault) rp=ffffff00078d1da0
addr=0 occurred in module "<unknown>" due to a NULL pointer
dereference

sanity checks: settings...vmem...
WARNING: CPU0 has cpu_intr_actv for 2
WARNING: CPU1 has cpu_intr_actv for 6 9
WARNING: last_swtch[1]: 0x553c75 (1 minutes 9.68 seconds earlier)
WARNING: PIL9 interrupt thread 0xffffff0007c49c60 on CPU1 pinning PIL6
interrupt thread 0xffffff0007c4fc60 pinning IA thread
0xffffff01d6d55740
sysent...clock...misc...
WARNING: 54 expired realtime (max -1m41.272660310s) and 27 expired
normal (max -11.562660310s) callouts
done
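
The cpu_intr_actv warnings above list the interrupt levels that were active on each CPU. As a minimal sketch, assuming cpu_intr_actv is a bitmask with one bit per PIL (bit n set means PIL n is active), the lists can be reproduced as follows. The mask values are reconstructed from the warning text, not read from the dump, and the helper name `active_pils` is hypothetical.

```python
def active_pils(mask, max_pil=15):
    """Return the PILs whose bit is set in mask (assumed layout: bit n = PIL n)."""
    return [pil for pil in range(max_pil + 1) if mask & (1 << pil)]


# Masks reconstructed from the scat warnings, not taken from the dump itself.
cpu0_mask = 1 << 2               # "CPU0 has cpu_intr_actv for 2"
cpu1_mask = (1 << 6) | (1 << 9)  # "CPU1 has cpu_intr_actv for 6 9"

print(hex(cpu0_mask), active_pils(cpu0_mask))  # 0x4 [2]
print(hex(cpu1_mask), active_pils(cpu1_mask))  # 0x240 [6, 9]
```

On this reading, CPU1 had a PIL 6 interrupt and a PIL 9 interrupt active at the same time, which matches the pinning chain reported in the next warning.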

Does this mean that interrupt thread 0xffffff0007c4fc60 is taking too
long?  It would explain why the box seems to hang.
That thread is:

 % mdb -k unix.0 vmcore.0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc
pcplusmp scsi_vhci zfs sd sockfs ip hook neti sctp arp usba uhci s1394
fctl md lofs audiosup fcip fcp cpc random crypto logindmux ptm ufs
nsmb sppp ipc ]

 0xffffff0007c49c60::findstack

stack pointer for thread ffffff0007c49c60: ffffff0007c49b30
 ffffff0007c49bb0 rm_isr+0xaa()
 ffffff0007c49c00 av_dispatch_autovect+0x7c()
 ffffff0007c49c40 dispatch_hardint+0x33()
 ffffff0007c4f450 switch_sp_and_call+0x13()
 ffffff0007c4f4a0 do_interrupt+0x9e()
 ffffff0007c4f4b0 _interrupt+0xba()
 ffffff0007c4f5c0 default_lock_delay+0x8c()
 ffffff0007c4f630 lock_set_spl_spin+0xc2()
 ffffff0007c4f690 mutex_vector_enter+0x45e()
 ffffff0007c4f6c0 RTSemEventSignal+0x6a()
 ffffff0007c4f740 0xfffffffff836c57b()
 ffffff0007c4f770 0xfffffffff836d73a()
 ffffff0007c4f830 vboxNetFltSolarisRecv+0x331()
 ffffff0007c4f880 VBoxNetFltSolarisModReadPut+0x107()
 ffffff0007c4f8f0 putnext+0x21e()
 ffffff0007c4f950 dld_str_rx_raw+0xb3()
 ffffff0007c4fa10 dls_rx_promisc+0x179()
 ffffff0007c4fa50 mac_promisc_dispatch_one+0x5f()
 ffffff0007c4fac0 mac_promisc_dispatch+0x105()
 ffffff0007c4fb10 mac_rx+0x3e()
 ffffff0007c4fb50 mac_rx_ring+0x4c()
 ffffff0007c4fbb0 e1000g_intr+0x17e()

Do I appear to be on the right track, and can anyone offer any
additional suggestions where to go from here (or even recognize the
problem)?
_______________________________________________
driver-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/driver-discuss


