I've recently been using grub.pxe (from debian's version 1.99-17), following the instructions at [0], to boot memtest86+ [1] (from debian's version 4.20-1.1) over the network on x86 machines. Due to the problems described below, i'm using a serial console.
The grub configuration is very simple:

--------------------------
serial --speed=115200
terminal_input console serial
terminal_output console serial

menuentry 'memtest86+ serial console' {
  set root='(pxe)'
  echo 'loading memory tester...'
  linux16 /memtest86+.bin console=ttyS0,115200n8
}
--------------------------

On some machines i've done this with, memtest86+ reports memory failures very early in the run, and the failures seem to happen even on brand new sticks of RAM, placed in any combination and order in the hardware.

The errors were transient -- sometimes i'd get as many as ~300 32-bit words of RAM failing, other times memtest could complete a full pass with no errors. The failures came during an early test where memtest86+ writes each address's value to its own memory location, and then re-reads the memory to verify.

Using the serial line, i was able to record the memory failures from a run that had 24 words fail. I was able to transcribe them and convert them to a hexdump format. These are the 24 words that failed (the memory addresses are in the left-hand column):

*
00095d30 9c c8 e3 71 dc ff 00 65 32 4b a0 29 08 06 00 01
00095d40 08 00 06 04 00 01 00 65 32 4b a0 29 c0 a8 17 54
00095d50 00 00 00 00 00 00 c0 a8 17 86 55 55 55 55 55 55
00095d60 55 55 55 55 55 55 55 55 55 55 55 55 9a a2 8c 53
*
00097590 00 00 00 00 30 5d 09 00 40 00 00 00 04 00 00 00
000975a0 4d 4d 00 00 00 00 00 00 00 00 00 00 10 38 6a 94
*

The first block (of 16 words) appears to be an ARP request packet from the local network's DHCP server to the failing machine (the MAC addresses have been obfuscated here, and i didn't bother updating the checksum to match).

The second block (of 8 words) appears to contain a pointer to the first block, a size indicator, and some other stuff i don't recognize.

So i think what's happening is something like what Matthew Garrett describes in his recent work with UEFI [2], although i'm using BIOS and not UEFI.
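As a cross-check on the reading of the two blocks above, here is a minimal python sketch that decodes them. The byte strings are copied from the hexdump (so they carry the obfuscated MAC addresses); the function and field names are made up for illustration, and treating the second block as little-endian 32-bit words is an assumption based on the machines being x86.

```python
import struct

# Bytes copied verbatim from the hexdump above (MAC addresses are obfuscated).
BLOCK1_ADDR = 0x00095D30
BLOCK1 = bytes.fromhex(
    "9c c8 e3 71 dc ff 00 65 32 4b a0 29 08 06 00 01 "
    "08 00 06 04 00 01 00 65 32 4b a0 29 c0 a8 17 54 "
    "00 00 00 00 00 00 c0 a8 17 86 55 55 55 55 55 55 "
    "55 55 55 55 55 55 55 55 55 55 55 55 9a a2 8c 53"
)
BLOCK2 = bytes.fromhex(
    "00 00 00 00 30 5d 09 00 40 00 00 00 04 00 00 00 "
    "4d 4d 00 00 00 00 00 00 00 00 00 00 10 38 6a 94"
)


def mac(raw):
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def ipv4(raw):
    """Format 4 raw bytes as a dotted-quad IPv4 address."""
    return ".".join(str(b) for b in raw)


def decode_arp_frame(frame):
    """Decode an Ethernet header followed by an IPv4-over-Ethernet ARP body."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    (htype, ptype, hlen, plen, oper,
     sha, spa, tha, tpa) = struct.unpack("!HHBBH6s4s6s4s", frame[14:42])
    return {
        "eth_dst": mac(dst),
        "eth_src": mac(src),
        "ethertype": ethertype,   # 0x0806 means ARP
        "htype": htype,           # 1 means Ethernet
        "ptype": ptype,           # 0x0800 means IPv4
        "hlen": hlen,
        "plen": plen,
        "oper": oper,             # 1 means request, 2 means reply
        "sender_mac": mac(sha),
        "sender_ip": ipv4(spa),
        "target_mac": mac(tha),
        "target_ip": ipv4(tpa),
        "trailing": frame[42:],   # bytes after the ARP body, left undecoded
    }


def decode_words(block):
    """Split a block into little-endian 32-bit words (assumption: x86 layout)."""
    return struct.unpack(f"<{len(block) // 4}I", block)


if __name__ == "__main__":
    for key, value in decode_arp_frame(BLOCK1).items():
        if isinstance(value, bytes):
            value = value.hex(" ")
        elif key in ("ethertype", "ptype"):
            value = f"0x{value:04x}"
        print(f"{key:>10}: {value}")
    print("block 2 words:", [f"0x{w:08x}" for w in decode_words(BLOCK2)])
```

Read this way, the first block has ethertype 0x0806 and operation 1 (an ARP request), and the second and third words of the second block are 0x00095d30 and 0x40, which match the address and the 64-byte length of the first block. The remaining words are left as they are, since nothing here identifies them.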
In particular, i suspect that *after* the bootloader has turned over control to the kernel (memtest in this case), the PXE-driven NIC is continuing to DMA received packets into active RAM. This seems pretty dangerous!

Would using pxe_unload before the close of the stanza prevent this situation from happening (i regret i haven't been able to test it myself because i haven't had access to the failing hardware since i completed this diagnosis)? If so, it seems like that should be clearly documented and strongly recommended in grub.texi.

Or, should grub be marking certain sections of memory as unavailable somehow before handoff to the kernel? Or is there some other way to avoid this sort of corruption?

I've seen similar failures now on pretty different hardware (a fairly old Dell Optiplex GX260 SFF and a new Lenovo ThinkCentre M77).

Any ideas?

        --dkg

[0] https://www.gnu.org/software/grub/manual/grub.html#Network
[1] http://www.memtest.org/
[2] http://mjg59.dreamwidth.org/11235.html
_______________________________________________ Grub-devel mailing list Grub-devel@gnu.org https://lists.gnu.org/mailman/listinfo/grub-devel