[SOLVED]: HW defect (was: Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT)

2011-11-30 Thread Stefan Esser
On 17.11.2011 17:33, John Baldwin wrote:
 On Thursday, November 17, 2011 3:59:43 am Stefan Esser wrote:
 On 16.11.2011 17:16, John Baldwin wrote:
[...]
 That isn't unusual.  Those are the addresses of the metadata provided by
 the loader, not the base addresses of the kernel or zfs.ko objects
 themselves.  The unexpected relocation type is interesting, however.  That
 value in hex is 0x400000b.  0xb is the R_X86_64_32S relocation type, which
 is normal for the kernel.  I think you just have a single-bit memory error
 due to a failing DIMM.

 Thanks for the information about the load address semantics. The other
 unexpected relocation type I observed was 268435457 == 0x10000001, which
 also hints at a single bit error. But today the system failed with a
 different error:

 ath0: ...
 ioapic0: routing interrupt 18 to ...
 panic: vm_page_insert: page already inserted

 This could of course also be caused by a single bit error ...
 
 Yes, very likely.
 
 Hmmm, perhaps there is a problem with components at room temperature
 and the system is still significantly warmer after 3 hours?
 
 Yes, I strongly suspect it is a thermal effect that the RAM works once it
 is warmed up.  If you have data you care about on the machine, I would just
 go ahead and replace the RAM now before waiting for the RAM's failure to
 become worse.

Thanks a lot, John!

I should have checked the hardware earlier, but since the system
was perfectly stable once it was up and running, I had suspected
an initialization bug rather than defective RAM.

In fact, one of the 4GB DIMMs in the system returns bogus data
(0x1000 or 0x0400 instead of 0) for some 40 to 50 seconds
after power-on. Once warmed up, memtest86+ runs for days without a
single further data error (I wanted an estimate of how likely it is
that the defect has damaged data in files on disk).
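
As a cross-check, the bit positions in the memtest86+ values can be compared with the bits that were wrong in the relocation types reported earlier in this thread. The sketch below is only an illustration: it assumes the memtest86+ values are 16-bit views of the faulty data lines (the message does not say so), and it assumes the intended relocation types were 1 and 11, of which only 11 (R_X86_64_32S) is named in the thread.

```python
def set_bits(word):
    """Return the positions of all bits that are set in word."""
    return [i for i in range(word.bit_length()) if (word >> i) & 1]

# Values memtest86+ read back instead of 0, as quoted in the message.
memtest_errors = [0x1000, 0x0400]

# (observed relocation type, presumed intended type); the intended
# types are an assumption made for this illustration.
reloc_errors = [(268435457, 1), (67108875, 11), (67108865, 1)]

memtest_bits = sorted({b for w in memtest_errors for b in set_bits(w)})

# Fold the 32-bit positions into a 16-bit halfword (assumption, see above).
reloc_bits = sorted({b % 16 for obs, exp in reloc_errors
                     for b in set_bits(obs ^ exp)})

print(memtest_bits)
print(reloc_bits)
```

Under those assumptions both lists contain the same two bit positions, which would fit a single DIMM with two weak data lines.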

When I was still doing hardware work, I always had a freezer aerosol
on my desk, which allowed me to quickly cool down a DUT by a few tens
of degrees, but without such a tool I had to wait for the components
to cool down overnight between tests.

Best regards, STefan
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT

2011-11-17 Thread Stefan Esser
On 16.11.2011 17:16, John Baldwin wrote:
 On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
 ...
 WARNING: WITNESS option enabled, expect reduced performance.
 Table 'FACP' at 0xba918a58
 Table 'APIC' at 0xba918b50
 Table 'SSDT' at 0xba918be8
 Table 'MCFG' at 0xba918dc0
 Table 'HPET' at 0xba918e00
 ACPI: No SRAT table found
 Preloaded elf kernel /boot/kernel/kernel at 0x81109000
 Preloaded elf obj module /boot/kernel/zfs.ko at 0x81109370 --
 kldload: unexpected relocation type 67108875
 kernel trap 12 with interrupts disabled

 The puzzling detail is the load address of zfs.ko, which is just
 0x370 bytes above the kernel load address ...
 
 That isn't unusual.  Those are the addresses of the metadata provided by
 the loader, not the base addresses of the kernel or zfs.ko objects
 themselves.  The unexpected relocation type is interesting, however.  That
 value in hex is 0x400000b.  0xb is the R_X86_64_32S relocation type, which
 is normal for the kernel.  I think you just have a single-bit memory error
 due to a failing DIMM.

Thanks for the information about the load address semantics. The other
unexpected relocation type I observed was 268435457 == 0x10000001, which
also hints at a single bit error. But today the system failed with a
different error:

ath0: ...
ioapic0: routing interrupt 18 to ...
panic: vm_page_insert: page already inserted

This could of course also be caused by a single bit error ...

But the strange thing is that the system runs perfectly stable under
load (e.g. make -j8 world) and that the ZFS ARC grows to some 6GB
(of 8GB RAM installed) and I'd expect checksum errors to occur, if
there is a bad DIMM.

Anyway, I'll check with memtest86+ (or whatever best supports my
system with 8GB RAM) overnight.

The system boots reliably when switched off for less than a few hours
(I haven't determined the exact limit, but 3 hours are not sufficient
to reproduce the boot failure, while 10 hours cause the first boot
attempt to fail with 90% likelihood; the second one always succeeds).

I'm wondering whether the system RAM is not correctly initialized
after being powered off for 10 hours (but I do not understand why
3 hours should not lead to the exact same initial state). BTW: It
suffices to have the system at power state S5 for 10 hours to cause
the boot failure, while less than 3 hours (without any power or at
S5) let the boot succeed on the first attempt.

 I had already assumed that memory was corrupted during early start-up,
 but now I think that gptzfsboot writes the zfs kernel module over the
 start of the loaded kernel. I'll try some more tests later today.
 
 Nah, if zfs.ko were loaded over the beginning of the kernel you wouldn't even 
 get to the point of the first kernel printf.

Yes, I see that the failure would be less random (3 different kinds
of panic and different warning messages before the panic occurs).

But I still do not understand how the symptoms can be interpreted:

1) The system booted reliably for many months
2) It boots reliably when powered off for only a few hours
3) It fails on the first boot attempt after 10 hours or more
4) It never shows signs of instability after a successful boot

Hmmm, perhaps there is a problem with components at room temperature
and the system is still significantly warmer after 3 hours?

I'll have to check for such a thermal effect too ...

Best regards, STefan


Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT

2011-11-17 Thread John Baldwin
On Thursday, November 17, 2011 3:59:43 am Stefan Esser wrote:
 On 16.11.2011 17:16, John Baldwin wrote:
  On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
  ...
  WARNING: WITNESS option enabled, expect reduced performance.
  Table 'FACP' at 0xba918a58
  Table 'APIC' at 0xba918b50
  Table 'SSDT' at 0xba918be8
  Table 'MCFG' at 0xba918dc0
  Table 'HPET' at 0xba918e00
  ACPI: No SRAT table found
  Preloaded elf kernel /boot/kernel/kernel at 0x81109000
  Preloaded elf obj module /boot/kernel/zfs.ko at 0x81109370 --
  kldload: unexpected relocation type 67108875
  kernel trap 12 with interrupts disabled
 
  The puzzling detail is the load address of zfs.ko, which is just
  0x370 bytes above the kernel load address ...
  
  That isn't unusual.  Those are the addresses of the metadata provided by
  the loader, not the base addresses of the kernel or zfs.ko objects
  themselves.  The unexpected relocation type is interesting, however.  That
  value in hex is 0x400000b.  0xb is the R_X86_64_32S relocation type, which
  is normal for the kernel.  I think you just have a single-bit memory error
  due to a failing DIMM.
 
 Thanks for the information about the load address semantics. The other
 unexpected relocation type I observed was 268435457 == 0x10000001, which
 also hints at a single bit error. But today the system failed with a
 different error:
 
 ath0: ...
 ioapic0: routing interrupt 18 to ...
 panic: vm_page_insert: page already inserted
 
 This could of course also be caused by a single bit error ...

Yes, very likely.

 Hmmm, perhaps there is a problem with components at room temperature
 and the system is still significantly warmer after 3 hours?

Yes, I strongly suspect it is a thermal effect that the RAM works once it
is warmed up.  If you have data you care about on the machine, I would just
go ahead and replace the RAM now before waiting for the RAM's failure to
become worse.

-- 
John Baldwin


Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT

2011-11-16 Thread John Baldwin
On Sunday, November 13, 2011 12:56:12 pm Stefan Esser wrote:
 On 11.11.2011 13:15, Attilio Rao wrote:
  Can you try rebuilding your kernel and modules from scratch and see if
  it fixes your problem?
 
 Sorry for the delay, but my system seems to need to be turned off (S5)
 for many hours (a whole night) to reproduce the problem ...
 
 I had already rebuilt my kernel multiple times in the last weeks. But
 just to be sure, I removed the build directories for kernel and world
 and built a new kernel after building and installing world from scratch.
 The next reboot (with boot blocks from the freshly built world) failed
 again ...
 
 But the first lines of boot messages look strange:
 
 ...
 WARNING: WITNESS option enabled, expect reduced performance.
 Table 'FACP' at 0xba918a58
 Table 'APIC' at 0xba918b50
 Table 'SSDT' at 0xba918be8
 Table 'MCFG' at 0xba918dc0
 Table 'HPET' at 0xba918e00
 ACPI: No SRAT table found
 Preloaded elf kernel /boot/kernel/kernel at 0x81109000
 Preloaded elf obj module /boot/kernel/zfs.ko at 0x81109370 --
 kldload: unexpected relocation type 67108875
 kernel trap 12 with interrupts disabled
 
 The puzzling detail is the load address of zfs.ko, which is just
 0x370 bytes above the kernel load address ...

That isn't unusual.  Those are the addresses of the metadata provided by
the loader, not the base addresses of the kernel or zfs.ko objects
themselves.  The unexpected relocation type is interesting, however.  That
value in hex is 0x400000b.  0xb is the R_X86_64_32S relocation type, which
is normal for the kernel.  I think you just have a single-bit memory error
due to a failing DIMM.
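
The single-bit reading can be checked mechanically. Written out in full, 67108875 is 0x400000b, that is, 0xb with bit 26 set. The following is a minimal sketch, not code from the kernel: the helper name is made up for illustration, and the table lists only a few low-numbered relocation types from the x86-64 psABI.

```python
# A few relocation type numbers from the x86-64 psABI (not the full list).
KNOWN_TYPES = {
    0: "R_X86_64_NONE",
    1: "R_X86_64_64",
    2: "R_X86_64_PC32",
    8: "R_X86_64_RELATIVE",
    10: "R_X86_64_32",
    11: "R_X86_64_32S",
}

def single_bit_flip(value):
    """Hypothetical helper: if value is a known type with exactly one
    bit flipped, return (type_name, bit_position); otherwise None."""
    if value in KNOWN_TYPES:
        return None  # already a valid type, nothing to explain
    for rtype, name in KNOWN_TYPES.items():
        diff = value ^ rtype
        # A nonzero number with exactly one bit set is a power of two.
        if diff != 0 and diff & (diff - 1) == 0:
            return name, diff.bit_length() - 1
    return None

print(hex(67108875))              # 0x400000b
print(single_bit_flip(67108875))  # ('R_X86_64_32S', 26)
```

The helper is only meant for values the kernel has already rejected as unknown; it says nothing about whether a valid-looking type is itself the result of a flip.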

 A verbose boot scrolls these lines off the screen too fast (and is too
 long to be preserved in dmesg.boot from the start), so I do not have any
 idea whether other values are reported in case of a successful boot.
 
 I had already assumed that memory was corrupted during early start-up,
 but now I think that gptzfsboot writes the zfs kernel module over the
 start of the loaded kernel. I'll try some more tests later today.

Nah, if zfs.ko were loaded over the beginning of the kernel you wouldn't even 
get to the point of the first kernel printf.

-- 
John Baldwin


Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT

2011-11-13 Thread Stefan Esser
On 11.11.2011 13:15, Attilio Rao wrote:
 Can you try rebuilding your kernel and modules from scratch and see if
 it fixes your problem?

Sorry for the delay, but my system seems to need to be turned off (S5)
for many hours (a whole night) to reproduce the problem ...

I had already rebuilt my kernel multiple times in the last weeks. But
just to be sure, I removed the build directories for kernel and world
and built a new kernel after building and installing world from scratch.
The next reboot (with boot blocks from the freshly built world) failed
again ...

But the first lines of boot messages look strange:

...
WARNING: WITNESS option enabled, expect reduced performance.
Table 'FACP' at 0xba918a58
Table 'APIC' at 0xba918b50
Table 'SSDT' at 0xba918be8
Table 'MCFG' at 0xba918dc0
Table 'HPET' at 0xba918e00
ACPI: No SRAT table found
Preloaded elf kernel /boot/kernel/kernel at 0x81109000
Preloaded elf obj module /boot/kernel/zfs.ko at 0x81109370 --
kldload: unexpected relocation type 67108875
kernel trap 12 with interrupts disabled

The puzzling detail is the load address of zfs.ko, which is just
0x370 bytes above the kernel load address ...

A verbose boot scrolls these lines off the screen too fast (and is too
long to be preserved in dmesg.boot from the start), so I do not have any
idea whether other values are reported in case of a successful boot.

I had already assumed that memory was corrupted during early start-up,
but now I think that gptzfsboot writes the zfs kernel module over the
start of the loaded kernel. I'll try some more tests later today.

Regards, STefan


Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT

2011-11-11 Thread Stefan Esser

On 10.11.2011 11:32, Attilio Rao wrote:

2011/11/10 Stefan Esser s...@freebsd.org:

I can produce further debug output on demand, but I do not have a serial or
firewire console setup for debugging.

Is anybody else affected by this boot problem?


Can you set up a video camera or a simple serial console?
Did you try to boot with both -s and -v on?

Attilio


I should be able to attach a serial console.

Booting with -s should make no difference (since booting fails during a 
very early initialization stage).


I tried -v, but found that I could not reproduce the cold boot problem
without the system being in S5 for at least several hours (just switching
off power and waiting a few minutes did not suffice, but this morning the
system again booted only on the second attempt). This behavior obviously
limits the rate of tests possible ...


It looks as if the memory holding the loaded kernel and/or modules is
corrupted before the kernel is reloaded and started, as indicated by
this morning's boot failure:

kldload: unexpected relocation type 268435457
kldload: unexpected relocation type 67108865
Fatal trap 12: ...

The rest of the panic message and back trace is identical to the trap 12 
panic details in my previous message.
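
For context, the relocation type lives in the low 32 bits of the ELF64 `r_info` field and the symbol index in the high 32 bits. The sketch below mirrors the standard `ELF64_R_SYM` / `ELF64_R_TYPE` split in Python and shows that flipping bit 28 or bit 26 of an entry of type 1 (R_X86_64_64) gives exactly the two values printed above. That these entries were originally of type 1 is an inference, not something the log states, and the symbol index used here is hypothetical.

```python
def elf64_r_sym(info):
    """Symbol index: upper 32 bits of r_info."""
    return info >> 32

def elf64_r_type(info):
    """Relocation type: lower 32 bits of r_info."""
    return info & 0xFFFFFFFF

def elf64_r_info(sym, rtype):
    """Pack a symbol index and a relocation type into r_info."""
    return (sym << 32) | rtype

R_X86_64_64 = 1
SYM_INDEX = 5  # hypothetical symbol index, for illustration only

good = elf64_r_info(SYM_INDEX, R_X86_64_64)

for bit in (28, 26):
    bad = good ^ (1 << bit)  # simulate a single flipped bit in memory
    print(bit, elf64_r_sym(bad), elf64_r_type(bad))
```

The symbol index is unaffected in both cases; only the type field changes to a value the kernel linker does not recognize.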



It really looks as if the loaded kernel image is corrupted at random
positions, leading to random panics (but often of the type trap 12 or
page fault in kernel) when execution reaches damaged code or data.

A reboot succeeded without any problem as in all prior cases ...

Any ideas?

Regards, STefan


Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT

2011-11-11 Thread Attilio Rao
Can you try rebuilding your kernel and modules from scratch and see if
it fixes your problem?

Attilio

2011/11/11 Stefan Esser s...@freebsd.org:
 On 10.11.2011 11:32, Attilio Rao wrote:

 2011/11/10 Stefan Esser s...@freebsd.org:

 I can produce further debug output on demand, but I do not have a serial
 or firewire console setup for debugging.

 Is anybody else affected by this boot problem?

 Can you set up a video camera or a simple serial console?
 Did you try to boot with both -s and -v on?

 Attilio

 I should be able to attach a serial console.

 Booting with -s should make no difference (since booting fails during a very
 early initialization stage).

 I tried -v, but found that I could not reproduce the cold boot problem
 without the system being in S5 for at least several hours (just switching
 off power and waiting a few minutes did not suffice, but this morning the
 system again booted only on the second attempt). This behavior obviously
 limits the rate of tests possible ...


 It looks as if the memory holding the loaded kernel and/or modules is
 corrupted before the kernel is reloaded and started, as indicated by
 this morning's boot failure:

 kldload: unexpected relocation type 268435457
 kldload: unexpected relocation type 67108865
 Fatal trap 12: ...

 The rest of the panic message and back trace is identical to the trap 12
 panic details in my previous message.


 It really looks as if the loaded kernel image is corrupted at random
 positions, leading to random panics (but often of the type trap 12 or
 page fault in kernel) when execution reaches damaged code or data.

 A reboot succeeded without any problem as in all prior cases ...

 Any ideas?

 Regards, STefan




-- 
Peace can only be achieved by understanding - A. Einstein


Re: [amd64] Reproducible cold boot failure (reboot succeeds) in -CURRENT

2011-11-10 Thread Attilio Rao
2011/11/10 Stefan Esser s...@freebsd.org:
 For a few weeks I have been suffering from a problem that requires manual
 intervention to get my home workstation to boot -CURRENT.

 The kernel panics at varying places and with different panic messages, e.g.
 (hand transcribed since kernel dumps don't work at that stage):

 privileged instruction fault while in kernel mode

 kmem_alloc_nofault +0x37
 kmem_init +0x9e
 vm_kmem_init +0x39
 mi_startup +0x77
 btext +0x2c


 On another cold boot attempt:

 kernel trap 12 with interrupts disabled
 Fatal trap 12: page fault while in kernel mode

 elf_relocinternal +0xa8
 link_elf_reloc_local +0x2fe
 link_elf_link_preload +0x69d
 linker_preload +0x101
 mi_startup +0x77
 btext +0x2c


 In all the cases observed, the system starts without any problems on the
 second attempt (after pressing RESET or using the reboot command in the
 debugger). The system works reliably once booted.

 This started a few weeks back (after the switch-over to 10-CURRENT,
 IIRC), and I did not bother to report it at the time, since I thought it was
 caused by a temporary instability in the code base.

 The system is an i7-2600K on an ASUS P8H67-M EVO with 8GB of RAM and an amd64
 kernel booting from ZFS (gptzfsboot). The kernel is a stripped down GENERIC
 plus IPFW and ath (but I doubt that the configuration is causing this, since
 the failure happens before any devices are probed and the identically
 configured kernels used to cold boot just fine for half a year).

 Any hint on how to diagnose this case further is welcome (but my spare
 time is very limited and I cannot easily bisect to find a revision
 that boots, for example).

 I can produce further debug output on demand, but I do not have a serial or
 firewire console setup for debugging.

 Is anybody else affected by this boot problem?

Can you set up a video camera or a simple serial console?
Did you try to boot with both -s and -v on?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein