[SOLVED]: HW defect (was: Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT)
On 17.11.2011 17:33, John Baldwin wrote:
> On Thursday, November 17, 2011 3:59:43 am Stefan Esser wrote:
>> On 16.11.2011 17:16, John Baldwin wrote:
>> [...]
>>> That isn't unusual. Those are the addresses of the metadata provided by the loader, not the base address of the kernel or zfs.ko object themselves. The unexpected relocation type is interesting however. That value in hex is 0x400000b. 0xb is the R_X86_64_32S relocation type which is normal for the kernel. I think you just have a single-bit memory error due to a failing DIMM.
>>
>> Thanks for the information about the load address semantics. The other unexpected relocation type I observed was 268435457 == 0x10000001, which also hints at a single bit error.
>>
>> But today the system failed with a different error:
>>
>> ath0: ...
>> ioapic0: routing interrupt 18 to ...
>> panic: vm_page_insert: page already inserted
>>
>> This could of course also be caused by a single bit error ...
>
> Yes, very likely.
>
>> Hmmm, perhaps there is a problem with components at room temperature and the system is still significantly warmer after 3 hours?
>
> Yes, I strongly suspect it is a thermal effect that the RAM works once it is warmed up. If you have data you care about on the machine, I would just go ahead and replace the RAM now before waiting for the RAM's failure to become worse.

Thanks a lot, John!

I should have checked the hardware earlier, but since the system was perfectly stable once it was up and running, I had suspected an initialization bug rather than defective RAM.

In fact, one of the 4GB DIMMs in the system returns bogus data (0x1000 or 0x0400 instead of 0) for some 40 to 50 seconds after power-on. Once warmed up, memtest86+ runs for days without a single additional data error (I wanted to estimate how likely it is that the defect has damaged data in files on disk).

When I was still doing hardware work, I always had a freezer aerosol on my desk, which allowed me to quickly cool down a DUT by a few tens of degrees. Without such a tool I had to wait for the components to cool down overnight between tests.

Best regards, STefan

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT
On 16.11.2011 17:16, John Baldwin wrote:
> On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
>> ...
>> WARNING: WITNESS option enabled, expect reduced performance.
>> Table 'FACP' at 0xba918a58
>> Table 'APIC' at 0xba918b50
>> Table 'SSDT' at 0xba918be8
>> Table 'MCFG' at 0xba918dc0
>> Table 'HPET' at 0xba918e00
>> ACPI: No SRAT table found
>> Preloaded elf kernel /boot/kernel/kernel at 0x81109000
>> Preloaded elf obj module /boot/kernel/zfs.ko at 0x81109370
>> --
>> kldload: unexpected relocation type 67108875
>> kernel trap 12 with interrupts disabled
>>
>> The irritating detail is the load address of zfs.ko, which is just 0x370 bytes above the kernel load address ...
>
> That isn't unusual. Those are the addresses of the metadata provided by the loader, not the base address of the kernel or zfs.ko object themselves. The unexpected relocation type is interesting however. That value in hex is 0x400000b. 0xb is the R_X86_64_32S relocation type which is normal for the kernel. I think you just have a single-bit memory error due to a failing DIMM.

Thanks for the information about the load address semantics. The other unexpected relocation type I observed was 268435457 == 0x10000001, which also hints at a single bit error.

But today the system failed with a different error:

ath0: ...
ioapic0: routing interrupt 18 to ...
panic: vm_page_insert: page already inserted

This could of course also be caused by a single bit error ...

But the strange thing is that the system runs perfectly stable under load (e.g. make -j8 world) and that the ZFS ARC grows to some 6GB (of 8GB RAM installed). I'd expect checksum errors to occur if there is a bad DIMM.

Anyway, I'll check with memtest86+ (or whatever best supports my system with 8GB RAM) overnight.

The system boots reliably when switched off for less than a few hours (I haven't determined the exact limit, but 3 hours are not sufficient to reproduce the boot failure, while 10 hours cause the first boot attempt to fail with 90% likelihood; the second one always succeeds).

I'm wondering whether the system RAM is not correctly initialized after being powered off for 10 hours (but I do not understand why 3 hours should not lead to the exact same initial state).

BTW: It suffices to have the system at power state S5 for 10 hours to cause the boot failure, while less than 3 hours (without any power or at S5) let the boot succeed on the first attempt.

>> I had already assumed that memory was corrupted during early start-up, but now I think that gptzfsboot writes the zfs kernel module over the start of the loaded kernel. I'll try some more tests later today.
>
> Nah, if zfs.ko were loaded over the beginning of the kernel you wouldn't even get to the point of the first kernel printf.

Yes, I see that the failure would be less random (3 different kinds of panic and different warning messages before the panic occurs).

But I still do not understand how the symptoms can be interpreted:

1) The system booted reliably for many months
2) It boots reliably when powered off for only a few hours
3) It fails on the first boot attempt after 10 hours or more
4) It never shows signs of instability after a successful boot

Hmmm, perhaps there is a problem with components at room temperature and the system is still significantly warmer after 3 hours? I'll have to check for such a thermal effect too ...

Best regards, STefan
Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT
On Thursday, November 17, 2011 3:59:43 am Stefan Esser wrote:
> On 16.11.2011 17:16, John Baldwin wrote:
>> On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
>>> ...
>>> WARNING: WITNESS option enabled, expect reduced performance.
>>> Table 'FACP' at 0xba918a58
>>> Table 'APIC' at 0xba918b50
>>> Table 'SSDT' at 0xba918be8
>>> Table 'MCFG' at 0xba918dc0
>>> Table 'HPET' at 0xba918e00
>>> ACPI: No SRAT table found
>>> Preloaded elf kernel /boot/kernel/kernel at 0x81109000
>>> Preloaded elf obj module /boot/kernel/zfs.ko at 0x81109370
>>> --
>>> kldload: unexpected relocation type 67108875
>>> kernel trap 12 with interrupts disabled
>>>
>>> The irritating detail is the load address of zfs.ko, which is just 0x370 bytes above the kernel load address ...
>>
>> That isn't unusual. Those are the addresses of the metadata provided by the loader, not the base address of the kernel or zfs.ko object themselves. The unexpected relocation type is interesting however. That value in hex is 0x400000b. 0xb is the R_X86_64_32S relocation type which is normal for the kernel. I think you just have a single-bit memory error due to a failing DIMM.
>
> Thanks for the information about the load address semantics. The other unexpected relocation type I observed was 268435457 == 0x10000001, which also hints at a single bit error.
>
> But today the system failed with a different error:
>
> ath0: ...
> ioapic0: routing interrupt 18 to ...
> panic: vm_page_insert: page already inserted
>
> This could of course also be caused by a single bit error ...

Yes, very likely.

> Hmmm, perhaps there is a problem with components at room temperature and the system is still significantly warmer after 3 hours?

Yes, I strongly suspect it is a thermal effect that the RAM works once it is warmed up. If you have data you care about on the machine, I would just go ahead and replace the RAM now before waiting for the RAM's failure to become worse.

--
John Baldwin
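The single-bit reading of the "unexpected relocation type" values can be checked with a few lines of Python. This is only an illustrative sketch, not something posted in the thread: the helper name single_bit_candidates is made up, and pairing the other two reported values with R_X86_64_64 (type 1) is an assumption; the thread itself only names R_X86_64_32S (0xb). The three decimal values are the ones reported in this thread.

```python
# Relocation type numbers from the x86-64 ELF psABI.
R_X86_64_64 = 1    # assumption: the thread does not name this type
R_X86_64_32S = 11  # 0xb, the type named in the thread

KNOWN_TYPES = {R_X86_64_64: "R_X86_64_64", R_X86_64_32S: "R_X86_64_32S"}


def single_bit_candidates(observed):
    """Return (type_name, bit_index) for each known type that differs
    from the observed value in exactly one bit position."""
    hits = []
    for value, name in KNOWN_TYPES.items():
        diff = observed ^ value
        # A non-zero value with exactly one bit set is a power of two.
        if diff and diff & (diff - 1) == 0:
            hits.append((name, diff.bit_length() - 1))
    return hits


# Values printed by "kldload: unexpected relocation type ..." in this thread.
for observed in (67108875, 268435457, 67108865):
    print(f"{observed} = {observed:#x}: {single_bit_candidates(observed)}")
```

Each of the three values comes out as a valid type number with one extra bit set (bit 26 or bit 28), which fits the failing-DIMM explanation better than an overwritten kernel image would.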
Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT
On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
> On 11.11.2011 13:15, Attilio Rao wrote:
>> Can you try rebuilding your kernel and modules from scratch and see if it fixes your problem?
>
> Sorry for the delay, but my system seems to need being turned off (S5) for many hours (whole night) to reproduce the problem ...
>
> I had already rebuilt my kernel multiple times in the last weeks. But just to be sure, I removed the build directories for kernel and world and built a new kernel after building and installing world from scratch. The next reboot (with boot blocks from the freshly built world) failed again ...
>
> But the first lines of boot messages look strange:
>
> ...
> WARNING: WITNESS option enabled, expect reduced performance.
> Table 'FACP' at 0xba918a58
> Table 'APIC' at 0xba918b50
> Table 'SSDT' at 0xba918be8
> Table 'MCFG' at 0xba918dc0
> Table 'HPET' at 0xba918e00
> ACPI: No SRAT table found
> Preloaded elf kernel /boot/kernel/kernel at 0x81109000
> Preloaded elf obj module /boot/kernel/zfs.ko at 0x81109370
> --
> kldload: unexpected relocation type 67108875
> kernel trap 12 with interrupts disabled
>
> The irritating detail is the load address of zfs.ko, which is just 0x370 bytes above the kernel load address ...

That isn't unusual. Those are the addresses of the metadata provided by the loader, not the base address of the kernel or zfs.ko object themselves. The unexpected relocation type is interesting however. That value in hex is 0x400000b. 0xb is the R_X86_64_32S relocation type which is normal for the kernel. I think you just have a single-bit memory error due to a failing DIMM.

> A verbose boot scrolls these lines off the screen too fast (and is too long to be preserved in dmesg.boot from the start), so I do not have any idea whether other values are reported in case of a successful boot.
>
> I had already assumed that memory was corrupted during early start-up, but now I think that gptzfsboot writes the zfs kernel module over the start of the loaded kernel. I'll try some more tests later today.

Nah, if zfs.ko were loaded over the beginning of the kernel you wouldn't even get to the point of the first kernel printf.

--
John Baldwin
Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT
On 11.11.2011 13:15, Attilio Rao wrote:
> Can you try rebuilding your kernel and modules from scratch and see if it fixes your problem?

Sorry for the delay, but my system seems to need being turned off (S5) for many hours (whole night) to reproduce the problem ...

I had already rebuilt my kernel multiple times in the last weeks. But just to be sure, I removed the build directories for kernel and world and built a new kernel after building and installing world from scratch. The next reboot (with boot blocks from the freshly built world) failed again ...

But the first lines of boot messages look strange:

...
WARNING: WITNESS option enabled, expect reduced performance.
Table 'FACP' at 0xba918a58
Table 'APIC' at 0xba918b50
Table 'SSDT' at 0xba918be8
Table 'MCFG' at 0xba918dc0
Table 'HPET' at 0xba918e00
ACPI: No SRAT table found
Preloaded elf kernel /boot/kernel/kernel at 0x81109000
Preloaded elf obj module /boot/kernel/zfs.ko at 0x81109370
--
kldload: unexpected relocation type 67108875
kernel trap 12 with interrupts disabled

The irritating detail is the load address of zfs.ko, which is just 0x370 bytes above the kernel load address ...

A verbose boot scrolls these lines off the screen too fast (and is too long to be preserved in dmesg.boot from the start), so I do not have any idea whether other values are reported in case of a successful boot.

I had already assumed that memory was corrupted during early start-up, but now I think that gptzfsboot writes the zfs kernel module over the start of the loaded kernel. I'll try some more tests later today.

Regards, STefan
Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT
On 10.11.2011 11:32, Attilio Rao wrote:
> 2011/11/10 Stefan Esser s...@freebsd.org:
>> I can produce further debug output on demand, but I do not have a serial or firewire console setup for debugging.
>>
>> Is anybody else affected by this boot problem?
>
> Can you set up a video camera or a simple serial console?
> Did you try to boot with both -s and -v on?
>
> Attilio

I should be able to attach a serial console.

Booting with -s should make no difference (since booting fails during a very early initialization stage).

I tried -v, but found that I could not reproduce the cold boot problem without the system being at least in S5 for hours (just switching off power and waiting a few minutes did not suffice, but this morning the system again booted only on the second attempt). This behavior obviously limits the rate of tests possible ...

It looks as if the memory holding the loaded kernel and/or modules is corrupted before the kernel is reloaded and started, as indicated by this morning's boot failure:

kldload: unexpected relocation type 268435457
kldload: unexpected relocation type 67108865
Fatal trap 12: ...

The rest of the panic message and back trace is identical to the trap 12 panic details in my previous message.

It really looks as if the loaded kernel image is corrupted at random positions, leading to random panics (but often of the type trap 12 or page fault in kernel) when execution reaches damaged code or data.

A reboot succeeded without any problem as in all prior cases ...

Any ideas?

Regards, STefan
Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT
Can you try rebuilding your kernel and modules from scratch and see if it fixes your problem?

Attilio

2011/11/11 Stefan Esser s...@freebsd.org:
> On 10.11.2011 11:32, Attilio Rao wrote:
>> 2011/11/10 Stefan Esser s...@freebsd.org:
>>> I can produce further debug output on demand, but I do not have a serial or firewire console setup for debugging.
>>>
>>> Is anybody else affected by this boot problem?
>>
>> Can you set up a video camera or a simple serial console?
>> Did you try to boot with both -s and -v on?
>>
>> Attilio
>
> I should be able to attach a serial console.
>
> Booting with -s should make no difference (since booting fails during a very early initialization stage).
>
> I tried -v, but found that I could not reproduce the cold boot problem without the system being at least in S5 for hours (just switching off power and waiting a few minutes did not suffice, but this morning the system again booted only on the second attempt). This behavior obviously limits the rate of tests possible ...
>
> It looks as if the memory holding the loaded kernel and/or modules is corrupted before the kernel is reloaded and started, as indicated by this morning's boot failure:
>
> kldload: unexpected relocation type 268435457
> kldload: unexpected relocation type 67108865
> Fatal trap 12: ...
>
> The rest of the panic message and back trace is identical to the trap 12 panic details in my previous message.
>
> It really looks as if the loaded kernel image is corrupted at random positions, leading to random panics (but often of the type trap 12 or page fault in kernel) when execution reaches damaged code or data.
>
> A reboot succeeded without any problem as in all prior cases ...
>
> Any ideas?
>
> Regards, STefan

--
Peace can only be achieved by understanding - A. Einstein
Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT
2011/11/10 Stefan Esser s...@freebsd.org:
> For a few weeks I have been suffering from a problem that requires manual intervention to get my home workstation to boot -CURRENT.
>
> The kernel panics at varying places and with different panic messages, e.g. (hand transcribed since kernel dumps don't work at that stage):
>
> privileged instruction fault while in kernel mode
> kmem_alloc_nofault +0x37
> kmem_init +0x9e
> vm_kmem_init +0x39
> mi_startup +0x77
> btext +0x2c
>
> On another cold boot attempt:
>
> kernel trap 12 with interrupts disabled
> Fatal trap 12: page fault while in kernel mode
> elf_relocinternal +0xa8
> link_elf_reloc_local +0x2fe
> link_elf_link_preload +0x69d
> linker_preload +0x101
> mi_startup +0x77
> btext +0x2c
>
> In all the cases observed, the system starts without any problems on the second attempt (pressing RESET or the reboot command in the debugger). The system is working reliably, once booted.
>
> This started a few weeks back (after the switch-over to 10-CURRENT, IIRC), and I did not bother to report it at the time, since I thought it was caused by a temporary instability in the code base.
>
> The system is an i2600K on ASUS P8H67-M EVO with 8GB of RAM and an amd64 kernel booting from ZFS (gptzfsboot). The kernel is a stripped down GENERIC plus IPFW and ath (but I doubt that the configuration is causing this, since the failure happens before any devices are probed and the identically configured kernels used to cold boot just fine for half a year).
>
> Any hint how to further diagnose this case is welcome (but my spare time is very limited and I cannot easily bisect to find a revision that boots, for example).
>
> I can produce further debug output on demand, but I do not have a serial or firewire console setup for debugging.
>
> Is anybody else affected by this boot problem?

Can you set up a video camera or a simple serial console?
Did you try to boot with both -s and -v on?

Attilio

--
Peace can only be achieved by understanding - A. Einstein