[OmniOS-discuss] Ang: fmdump help?

2014-05-12 Thread Johan Kragsterman
Hi again!


I got some more information about the problem I wrote about last time. Is this a hardware problem?


I ran mdb on the crash dump and got this:



root@omni:/var/crash/unknown# savecore -f /var/crash/unknown/vmdump.1
savecore: System dump time: Sat May 10 21:47:04 2014

savecore: saving system crash dump in /var/crash/unknown/{unix,vmcore}.1
Constructing namelist /var/crash/unknown/unix.1
Constructing corefile /var/crash/unknown/vmcore.1
 0:41 100% done: 607251 of 607251 pages saved
root@omni:/var/crash/unknown# mdb -k unix.1 vmcore.1
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp 
scsi_vhci zfs sata sd ip hook neti sockfs arp usba uhci stmf stmf_sbd md lofs 
mpt_sas random idm nfs crypto ptm kvm cpc smbsrv ufs logindmux nsmb ]

> ::status
debugging crash dump vmcore.1 (64-bit) from omni
operating system: 5.11 omnios-8c08411 (i86pc)
image uuid: e43a2059-c9b8-e592-b307-f05eafbbe15b
panic message: pcieb-0: PCI(-X) Express Fatal Error. (0x145)
dump content: kernel pages only


> ::stack
vpanic()
pcieb_intr_handler+0x1c9(ff0a1da39830, 0)
av_dispatch_autovect+0x95(49)
dispatch_hardint+0x36(49, 0)
switch_sp_and_call+0x13()
do_interrupt+0xa8(ff0047e9d110, fe03e383e000)
_interrupt+0xba()
htable_lookup+0x73(ff0a08ecce78, fe03e383e000, 1)
htable_getpte+0x58(ff0a08ecce78, fe03e383e000, ff0047e9d2ec, ff0047e9d2e0, 1)
htable_getpage+0x30(ff0a08ecce78, fe03e383e000, ff0047e9d34c)
hat_getpfnum+0x71(ff0a08ecce78, fe03e383e000)
kvm_va2pa+0x1b()
mmu_alloc_roots+0xaa()
kvm_mmu_load+0x40()
kvm_mmu_reload+0x18()
vcpu_enter_guest+0x68()
__vcpu_run+0x8b()
kvm_arch_vcpu_ioctl_run+0x112()
kvm_ioctl+0x466()
cdev_ioctl+0x39(1080005, 2000ae80, 0, 202003, ff0a2c4995e8, ff0047e9dea8)
spec_ioctl+0x60(ff0a2c875380, 2000ae80, 0, 202003, ff0a2c4995e8, ff0047e9dea8)
fop_ioctl+0x55(ff0a2c875380, 2000ae80, 0, 202003, ff0a2c4995e8, ff0047e9dea8)
ioctl+0x9b(d, 2000ae80, 0)
sys_syscall+0x17a()
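
A note on reading the stack above: mdb prints the newest frame first, so the frames above _interrupt are the interrupt handler that called vpanic(), while the frames below it are whatever the CPU happened to be running when the interrupt arrived (here, the KVM ioctl path). The Python sketch below only illustrates that split, using an abbreviated copy of the frames with the arguments removed. The helper names frame_name and split_at_interrupt are made up for this example and are not part of mdb.

```python
# Abbreviated copy of the ::stack output above (arguments removed).
STACK = """\
vpanic()
pcieb_intr_handler+0x1c9
av_dispatch_autovect+0x95
dispatch_hardint+0x36
switch_sp_and_call+0x13
do_interrupt+0xa8
_interrupt+0xba
htable_lookup+0x73
hat_getpfnum+0x71
kvm_va2pa+0x1b
vcpu_enter_guest+0x68
kvm_ioctl+0x466
ioctl+0x9b
sys_syscall+0x17a
"""


def frame_name(line):
    """Strip the +offset and any argument list, leaving the function name."""
    return line.split("+")[0].split("(")[0].strip()


def split_at_interrupt(stack_text):
    """Return (handler_frames, interrupted_frames), newest frame first."""
    names = [frame_name(l) for l in stack_text.splitlines() if l.strip()]
    idx = names.index("_interrupt")
    return names[:idx + 1], names[idx + 1:]


handler, interrupted = split_at_interrupt(STACK)
print("panic raised from:", handler[1])
print("code that was interrupted:", interrupted[0])
```

Read this way, the KVM frames say which thread was interrupted, not necessarily what caused the panic.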



> ::msgbuf
MESSAGE   
vcpu 7 received sipi with vector # 10
vcpu 6 received sipi with vector # 10
kvm_lapic_reset: vcpu=ff0a38b5a000, id=2, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38b52000, id=3, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38b4a000, id=4, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38ba2000, id=5, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38b92000, id=7, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38b9a000, id=6, base_msr= fee00800 PRIx64 base_address=fee0
unhandled wrmsr: 0x0 data 0
vcpu 1 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b62000, id=1, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 2 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b5a000, id=2, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 3 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b52000, id=3, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 4 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b4a000, id=4, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 5 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38ba2000, id=5, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 6 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b9a000, id=6, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 7 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b92000, id=7, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38ba2000, id=0, base_msr= fee00100 PRIx64 base_address=fee0
vmcs revision_id = e
kvm_lapic_reset: vcpu=ff0a38b4a000, id=1, base_msr= fee0 PRIx64 base_address=fee0
vmcs revision_id = e
unhandled wrmsr: 0x1010101 data fd7fffdfe870
unhandled wrmsr: 0x1010101 data fd7fffdfe870
unhandled wrmsr: 0xff318d0c data fd7fffdfe840
unhandled wrmsr: 0xff318d0c data fd7fffdfe840
unhandled wrmsr: 0xffdfef38 data 301a4
unhandled wrmsr: 0xffdfef38 data 301a4
vcpu 1 received sipi with vector # 10
kvm_lapic_reset: vcpu=ff0a38b4a000, id=1, base_msr= fee00800 PRIx64 base_address=fee0
unhandled rdmsr: 0x756e6547
unhandled wrmsr: 0x0 data 6c65746e756e6547
vcpu 1 received sipi with vector # 9f
kvm_lapic_reset: vcpu=ff0a38b4a000, id=1, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38b52000, id=0, base_msr= fee00100 PRIx64 base_address=fee0
vmcs revision_id = e
kvm_lapic_reset: vcpu=ff0a38b5a000, id=1, base_msr= fee0 PRIx64 base_address=fee0
vmcs revision_id = e
kvm_lapic_reset: vcpu=ff0a38b62000, id=2, base_msr= fee0 PRIx64 base_address=fee0
vmcs revision_id = e
kvm_lapic_reset: vcpu=ff0a384e9000, id=3, base_msr= fee0 

Re: [OmniOS-discuss] Ang: fmdump help?

2014-05-12 Thread Johan Kragsterman
Thanks again, Dan!


Some more questions further down...


-Dan McDonald dan...@omniti.com wrote: -
To: Johan Kragsterman johan.kragster...@capvert.se
From: Dan McDonald dan...@omniti.com
Date: 2014-05-12 15:46
Cc: OmniOS-discuss@lists.omniti.com omnios-discuss@lists.omniti.com
Subject: Re: [OmniOS-discuss] Ang: fmdump help?


On May 12, 2014, at 8:46 AM, Johan Kragsterman johan.kragster...@capvert.se 
wrote:



 panic message: pcieb-0: PCI(-X) Express Fatal Error. (0x145)






Does this mean it is the PCI-X bus? And/or a device on that bus? It makes sense 
if so, because the e1000g3 is on an Intel quad port PCI-X adapter on the only 
PCI-X bus on the system. And I had severe issues with a client connected to 
that port. But could a port issue really crash the system? Wouldn't it be more 
likely that it is the bus?

As a first step, I'll move the connections on that port to another port on the 
same NIC and see whether anything changes.

If I still have problems, I'll swap the NIC for a similar one, and if that 
doesn't help, I'll put another NIC on a PCIe bus instead.







The 0x145 is a combination of flags defined in pcie_impl.h (viewable in the 
source; it's not an installed system header file):

#define PF_ERR_NO_ERROR         (1 << 0) /* No error seen */
#define PF_ERR_NO_PANIC         (1 << 2) /* Error should not panic sys */
#define PF_ERR_PANIC            (1 << 6) /* Error should panic system */
#define PF_ERR_MATCH_DOM        (1 << 9) /* Error Handled By IO domain */

That's a lot of flags set, and all of this flag-setting happens during a fault 
scan of the PCIe bus (see pcie_fault.c, especially starting with 
pf_scan_fabric() and its descendants).
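
To check which bits are set in the 0x145 from the panic message, it can be decoded by hand. The Python sketch below is only an illustration: QUOTED_FLAGS holds just the four bit positions quoted above, not the full contents of pcie_impl.h, and the helper name set_bits is made up here.

```python
# Flag bit positions exactly as quoted above from pcie_impl.h.
# This is NOT the full header, only the four flags listed in this message.
QUOTED_FLAGS = {
    0: "PF_ERR_NO_ERROR",
    2: "PF_ERR_NO_PANIC",
    6: "PF_ERR_PANIC",
    9: "PF_ERR_MATCH_DOM",
}


def set_bits(value):
    """Return the positions of all bits set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if value & (1 << bit)]


code = 0x145  # from "PCI(-X) Express Fatal Error. (0x145)"
for bit in set_bits(code):
    name = QUOTED_FLAGS.get(bit, "(not among the flags quoted above)")
    print(f"bit {bit}: 0x{1 << bit:03x} {name}")
```

By this arithmetic 0x145 has bits 0, 2, 6 and 8 set. Bit 8 is not one of the four flags quoted above, and bit 9 (0x200) is not set in 0x145, so it is worth checking the value against the full header in the source.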

I'd be inclined to say this is a HW error, especially given your e1000g3 device 
complained, per here:

NOTICE: e1000g3 link down
NOTICE: vnic1000 link down
NOTICE: e1000g3 link up, 100 Mbps, full duplex
NOTICE: vnic1000 link up, 100 Mbps, unknown duplex
NOTICE: SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
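
The NOTICE lines above show the link dropping and then coming back at 100 Mbps. Whether 100 Mbps is expected depends on what is connected at the other end. A small Python sketch for pulling such link events out of a saved message log follows; the regular expression is written against only the lines quoted above, and the function name link_events is made up for illustration.

```python
import re

# The link NOTICE lines quoted above.
LOG = """\
NOTICE: e1000g3 link down
NOTICE: vnic1000 link down
NOTICE: e1000g3 link up, 100 Mbps, full duplex
NOTICE: vnic1000 link up, 100 Mbps, unknown duplex
"""

# Interface name, link state, and (for "up" lines) the negotiated speed.
LINK_RE = re.compile(r"NOTICE: (\S+) link (up|down)(?:, (\d+) Mbps)?")


def link_events(text):
    """Yield (interface, state, speed_mbps or None) per link NOTICE line."""
    for line in text.splitlines():
        m = LINK_RE.match(line)
        if m:
            speed = int(m.group(3)) if m.group(3) else None
            yield m.group(1), m.group(2), speed


events = list(link_events(LOG))
went_down = [name for name, state, _ in events if state == "down"]
for name, state, speed in events:
    print(name, state, speed if speed is not None else "")
```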

Dan


Rgrds Johan



___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss


Re: [OmniOS-discuss] Ang: fmdump help?

2014-05-12 Thread Dan McDonald

On May 12, 2014, at 11:06 AM, Johan Kragsterman johan.kragster...@capvert.se 
wrote:

 Thanks again, Dan!
 
 
 Some more questions further down...
 
 
 
 
 Does this mean it is the PCI-X bus? And/or a device on that bus? It makes 
 sense if so, because the e1000g3 is on an Intel quad port PCI-X adapter on 
 the only PCI-X bus on the system. And I had severe issues with a client 
 connected to that port. But could a port issue really crash the system? 
 Wouldn't it be more likely that it is the bus?

The error message originates from the pcieb (PCI-E bus controller):

161 f8077000   4440 228   1  pcieb (PCIe bridge/switch driver)

and yes it's likely the bus, as that message/panic happens after a bus scan.  I 
indicated e1000g3 so you could maybe see if the slot it was in was bad.

 As a first step, I'll move the connections on that port to another port on 
 the same NIC and see whether anything changes.
 
 If I still have problems, I'll swap the NIC for a similar one, and if that 
 doesn't help, I'll put another NIC on a PCIe bus instead.
 

That's what I'd do.

Dan



Re: [OmniOS-discuss] Ang: fmdump help?

2014-05-12 Thread Johan Kragsterman


-Dan McDonald dan...@omniti.com wrote: -
To: Johan Kragsterman johan.kragster...@capvert.se
From: Dan McDonald dan...@omniti.com
Date: 2014-05-12 17:15
Cc: OmniOS-discuss@lists.omniti.com omnios-discuss@lists.omniti.com
Subject: Re: [OmniOS-discuss] Ang: fmdump help?

On May 12, 2014, at 11:06 AM, Johan Kragsterman johan.kragster...@capvert.se 
wrote:

 Thanks again, Dan!
 
 
 Some more questions further down...
 
 
 
 
 Does this mean it is the PCI-X bus? And/or a device on that bus? It makes 
 sense if so, because the e1000g3 is on an Intel quad port PCI-X adapter on 
 the only PCI-X bus on the system. And I had severe issues with a client 
 connected to that port. But could a port issue really crash the system? 
 Wouldn't it be more likely that it is the bus?

The error message originates from the pcieb (PCI-E bus controller):

161 f8077000   4440 228   1  pcieb (PCIe bridge/switch driver)

and yes it's likely the bus, as that message/panic happens after a bus scan.  I 
indicated e1000g3 so you could maybe see if the slot it was in was bad.

 As a first step, I'll move the connections on that port to another port on 
 the same NIC and see whether anything changes.
 
 If I still have problems, I'll swap the NIC for a similar one, and if that 
 doesn't help, I'll put another NIC on a PCIe bus instead.
 

That's what I'd do.

Dan




The NIC is on a PCI-X bus, not a PCIe bus. All NIC ports on the system are on 
that PCI-X NIC; there is no NIC on PCIe. Does that mean the e1000g3 had nothing 
to do with the problem, and that the problem must be on a PCIe bus or device?

If so, I can rule out the NIC and concentrate on other devices and buses.

The only adapters in PCIe slots are the SAS controller and the graphics 
adapter. Perhaps the integrated SATA controller is on a PCIe bus as well...

I actually have two more of these T5500s, so I could easily switch to another 
one if needed.







Re: [OmniOS-discuss] Ang: fmdump help?

2014-05-12 Thread Dan McDonald
I'm not sure if that code is common to PCI-X as well.  After all, the printf 
message mentions PCI-X (but maybe as a typo)?

And interrupts from PCI-X may still sabotage PCIe.  I'd continue to focus on 
that NIC for starters (and save the dumps if you have the disk space).

Dan
