[OmniOS-discuss] Ang: fmdump help?
Hi again! I've got some more information about what I wrote last time. Is this a hardware problem? I extracted the crash dump with savecore, looked at it in mdb, and got this:

root@omni:/var/crash/unknown# savecore -f /var/crash/unknown/vmdump.1
savecore: System dump time: Sat May 10 21:47:04 2014
savecore: saving system crash dump in /var/crash/unknown/{unix,vmcore}.1
Constructing namelist /var/crash/unknown/unix.1
Constructing corefile /var/crash/unknown/vmcore.1
 0:41 100% done: 607251 of 607251 pages saved
root@omni:/var/crash/unknown# mdb -k unix.1 vmcore.1
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci zfs sata sd ip hook neti sockfs arp usba uhci stmf stmf_sbd md lofs mpt_sas random idm nfs crypto ptm kvm cpc smbsrv ufs logindmux nsmb ]

::status
debugging crash dump vmcore.1 (64-bit) from omni
operating system: 5.11 omnios-8c08411 (i86pc)
image uuid: e43a2059-c9b8-e592-b307-f05eafbbe15b
panic message: pcieb-0: PCI(-X) Express Fatal Error. (0x145)
dump content: kernel pages only

::stack
vpanic()
pcieb_intr_handler+0x1c9(ff0a1da39830, 0)
av_dispatch_autovect+0x95(49)
dispatch_hardint+0x36(49, 0)
switch_sp_and_call+0x13()
do_interrupt+0xa8(ff0047e9d110, fe03e383e000)
_interrupt+0xba()
htable_lookup+0x73(ff0a08ecce78, fe03e383e000, 1)
htable_getpte+0x58(ff0a08ecce78, fe03e383e000, ff0047e9d2ec, ff0047e9d2e0, 1)
htable_getpage+0x30(ff0a08ecce78, fe03e383e000, ff0047e9d34c)
hat_getpfnum+0x71(ff0a08ecce78, fe03e383e000)
kvm_va2pa+0x1b()
mmu_alloc_roots+0xaa()
kvm_mmu_load+0x40()
kvm_mmu_reload+0x18()
vcpu_enter_guest+0x68()
__vcpu_run+0x8b()
kvm_arch_vcpu_ioctl_run+0x112()
kvm_ioctl+0x466()
cdev_ioctl+0x39(1080005, 2000ae80, 0, 202003, ff0a2c4995e8, ff0047e9dea8)
spec_ioctl+0x60(ff0a2c875380, 2000ae80, 0, 202003, ff0a2c4995e8, ff0047e9dea8)
fop_ioctl+0x55(ff0a2c875380, 2000ae80, 0, 202003, ff0a2c4995e8, ff0047e9dea8)
ioctl+0x9b(d, 2000ae80, 0)
sys_syscall+0x17a()

::msgbuf
MESSAGE
vcpu 7 received sipi with vector # 10
vcpu 6 received sipi with vector # 10
kvm_lapic_reset: vcpu=ff0a38b5a000, id=2, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38b52000, id=3, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38b4a000, id=4, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38ba2000, id=5, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38b92000, id=7, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38b9a000, id=6, base_msr= fee00800 PRIx64 base_address=fee0
unhandled wrmsr: 0x0 data 0
vcpu 1 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b62000, id=1, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 2 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b5a000, id=2, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 3 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b52000, id=3, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 4 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b4a000, id=4, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 5 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38ba2000, id=5, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 6 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b9a000, id=6, base_msr= fee00800 PRIx64 base_address=fee0
vcpu 7 received sipi with vector # 98
kvm_lapic_reset: vcpu=ff0a38b92000, id=7, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38ba2000, id=0, base_msr= fee00100 PRIx64 base_address=fee0
vmcs revision_id = e
kvm_lapic_reset: vcpu=ff0a38b4a000, id=1, base_msr= fee0 PRIx64 base_address=fee0
vmcs revision_id = e
unhandled wrmsr: 0x1010101 data fd7fffdfe870
unhandled wrmsr: 0x1010101 data fd7fffdfe870
unhandled wrmsr: 0xff318d0c data fd7fffdfe840
unhandled wrmsr: 0xff318d0c data fd7fffdfe840
unhandled wrmsr: 0xffdfef38 data 301a4
unhandled wrmsr: 0xffdfef38 data 301a4
vcpu 1 received sipi with vector # 10
kvm_lapic_reset: vcpu=ff0a38b4a000, id=1, base_msr= fee00800 PRIx64 base_address=fee0
unhandled rdmsr: 0x756e6547
unhandled wrmsr: 0x0 data 6c65746e756e6547
vcpu 1 received sipi with vector # 9f
kvm_lapic_reset: vcpu=ff0a38b4a000, id=1, base_msr= fee00800 PRIx64 base_address=fee0
kvm_lapic_reset: vcpu=ff0a38b52000, id=0, base_msr= fee00100 PRIx64 base_address=fee0
vmcs revision_id = e
kvm_lapic_reset: vcpu=ff0a38b5a000, id=1, base_msr= fee0 PRIx64 base_address=fee0
vmcs revision_id = e
kvm_lapic_reset: vcpu=ff0a38b62000, id=2, base_msr= fee0 PRIx64 base_address=fee0
vmcs revision_id = e
kvm_lapic_reset: vcpu=ff0a384e9000, id=3, base_msr= fee0
Re: [OmniOS-discuss] Ang: fmdump help?
Thanks again, Dan! Some more questions further down...

--- Dan McDonald <dan...@omniti.com> wrote: ---
To: Johan Kragsterman <johan.kragster...@capvert.se>
From: Dan McDonald <dan...@omniti.com>
Date: 2014-05-12 15:46
Cc: OmniOS-discuss@lists.omniti.com <omnios-discuss@lists.omniti.com>
Subject: Re: [OmniOS-discuss] Ang: fmdump help?

> On May 12, 2014, at 8:46 AM, Johan Kragsterman <johan.kragster...@capvert.se> wrote:
>> panic message: pcieb-0: PCI(-X) Express Fatal Error. (0x145)

Does this mean it is the PCI-X bus? And/or a device on that bus? That would make sense, because e1000g3 is on an Intel quad-port PCI-X adapter, on the only PCI-X bus in the system, and I had severe issues with a client connected to that port. But could a port issue really crash the system? Wouldn't the bus be more likely?

My first step will be to move the connections on that port to another port on the same NIC and see whether anything changes. If I still have problems, I'll replace the NIC with a similar one, and if that doesn't help, I'll put another NIC on a PCIe bus instead.

> That corresponds to these flags from pcie_impl.h (viewable in the source; it's not an installed system header file):
>
> #define PF_ERR_NO_ERROR   (1 << 0) /* No error seen */
> #define PF_ERR_NO_PANIC   (1 << 2) /* Error should not panic sys */
> #define PF_ERR_PANIC      (1 << 6) /* Error should panic system */
> #define PF_ERR_MATCH_DOM  (1 << 9) /* Error Handled By IO domain */
>
> That's a lot of flags set, and all of this flag-setting happens during a fault scan of the PCIe bus (see pcie_fault.c, especially starting with pf_scan_fabric() and its descendants).
> I'd be inclined to say this is a HW error, especially given that your e1000g3 device complained, per here:
>
> NOTICE: e1000g3 link down
> NOTICE: vnic1000 link down
> NOTICE: e1000g3 link up, 100 Mbps, full duplex
> NOTICE: vnic1000 link up, 100 Mbps, unknown duplex
> NOTICE: SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
>
> Dan

Rgrds
Johan

___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss
Re: [OmniOS-discuss] Ang: fmdump help?
On May 12, 2014, at 11:06 AM, Johan Kragsterman <johan.kragster...@capvert.se> wrote:

> Thanks again, Dan! Some more questions further down...
>
> Does this mean it is the PCI-X bus? And/or a device on that bus? That would make sense, because e1000g3 is on an Intel quad-port PCI-X adapter, on the only PCI-X bus in the system, and I had severe issues with a client connected to that port. But could a port issue really crash the system? Wouldn't the bus be more likely?

The error message originates from pcieb (the PCIe bus controller):

161 f8077000 4440 228 1 pcieb (PCIe bridge/switch driver)

and yes, it's likely the bus, as that message/panic happens after a bus scan. I pointed out e1000g3 so you could check whether the slot it's in might be bad.

> My first step will be to move the connections on that port to another port on the same NIC and see whether anything changes. If I still have problems, I'll replace the NIC with a similar one, and if that doesn't help, I'll put another NIC on a PCIe bus instead.

That's what I'd do.

Dan
Re: [OmniOS-discuss] Ang: fmdump help?
--- Dan McDonald <dan...@omniti.com> wrote: ---
To: Johan Kragsterman <johan.kragster...@capvert.se>
From: Dan McDonald <dan...@omniti.com>
Date: 2014-05-12 17:15
Cc: OmniOS-discuss@lists.omniti.com <omnios-discuss@lists.omniti.com>
Subject: Re: [OmniOS-discuss] Ang: fmdump help?

> The error message originates from pcieb (the PCIe bus controller):
>
> 161 f8077000 4440 228 1 pcieb (PCIe bridge/switch driver)
>
> and yes, it's likely the bus, as that message/panic happens after a bus scan.

The NIC is on a PCI-X bus, not a PCIe bus. All NIC ports in the system are on that PCI-X NIC; there is no NIC on PCIe. Does that mean e1000g3 had nothing to do with the problem, and that the problem must be on a PCIe bus or device? If so, I can rule out the NIC and concentrate on other devices and buses. The only adapters in PCIe slots are the SAS controller and the graphics adapter, though the integrated SATA controller may be on a PCIe bus as well.

I actually have two more of these T5500s, so I could easily switch to another one if I needed to.
Re: [OmniOS-discuss] Ang: fmdump help?
I'm not sure whether that code is also used for PCI-X. After all, the printf message mentions PCI-X (though maybe as a typo). And interrupts from PCI-X may still sabotage PCIe. I'd keep focusing on that NIC for starters (and save the dumps if you have the disk space).

Dan