I'm working on an embedded system with many units already installed in the
field, and multiple different boot architectures. I'm in the process of
updating the firmware to use GRUB 2.02 instead of 1.95. We boot a custom
Linux kernel, and have a custom BIOS.

I have encountered a problem on one kind of hardware; the same configuration
works correctly on all the other systems. The nature of the problem is that
while GRUB is reading the initrd file, it hits a general protection fault
(GPF). The BIOS prints out a register dump and returns to GRUB, which
promptly hits another GPF. This repeats forever until I power-cycle the
system.

We're using official Debian GRUB packages. I see the problem with Debian
2.02~beta2-22+deb8u1, with Debian 2.02~beta3-5, and with GRUB built from the
top of the GRUB source tree about two weeks ago. It never fails with GRUB
1.95.

The unique thing about the systems that fail is that they do not use EFI,
and they boot from mdraid partitions. All the systems with EFI work fine,
with or without mdraid. All the non-EFI systems with only a single boot
drive (no RAID) work fine.

When it hits the GPF, EIP is pointing into the middle of a block of zeroes,
which seems unlikely to be a real code area.

Here are some further things I've found out:


1.  If I enter the GRUB command prompt and execute the commands manually,
it works. The same commands, read from the grub.cfg file, hit the error.



2.  If I edit grub.cfg and insert a 3-second sleep after it reads the
kernel, but before it reads the initrd, then it works.


3.  If I hack the function grub_cmd_initrd() by adding a call to invalidate
the GRUB disk cache at the beginning of the function, that makes it work
(sketched below, after this list).


4.  I booted a failing system using a rescue boot method, and deleted all
the RAID partitions. Since we use RAID metadata format 1.0, the system
should still be bootable as a raw disk instead of as a RAID. I rebooted, and
was dropped into grub-rescue because it couldn't find the original RAID
volume that was there when I ran grub-install. However, I pointed it at the
/boot on the raw disk device (hd0,2) and voila, it was able to boot. This
makes me think the problem is in GRUB's low-level RAID code for the i386-pc
case. I.e., GRUB was reading the exact same blocks that it reads in the
failure case, but it was reading them without using the mdraid driver. With
the mdraid driver, it fails.



5.  The initrd is about 5.7 MB in size. I copied a smaller one, only
3.6 MB, from another system, and that made it work. I suspect this is
because the smaller initrd finished loading before the 2-second cache
timeout (see below) elapsed.



6.  Changing the mdraid metadata format doesn't help. I've tried 0.9, 1.0,
and 1.2; all behave the same way.
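
To make item 3 concrete, the hack looks roughly like this (from memory, not
a verbatim diff; I'm assuming the invalidation call is
grub_disk_cache_invalidate_all(), which kern/disk.c exports, and that the
relevant loader is grub-core/loader/i386/linux.c for our i386-pc case):

    /* At the top of grub_cmd_initrd ().  */
    static grub_err_t
    grub_cmd_initrd (grub_command_t cmd __attribute__ ((unused)),
                     int argc, char *argv[])
    {
      /* HACK: flush the whole disk block cache before the initrd is
         read, so every sector goes through the mdraid driver instead of
         being served from a possibly stale cache entry.  */
      grub_disk_cache_invalidate_all ();

      /* ... rest of the function unchanged ... */
    }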

I began suspecting some interaction with the cache when I noticed that the
GRUB disk cache timeout is 2 seconds, and I had separately found that a
delay of more than 2 seconds fixes the problem. So there is some kind of
interaction between the RAID driver and the GRUB cache that triggers this.
I tried changing the cache algorithm in various ways so that blocks would
hash to different locations, but this didn't help. I also tried locating
/boot in a different partition, and that didn't help either.
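
For reference, my reading of grub-core/kern/disk.c is that the 2-second
figure comes from grub_disk_open(), roughly:

    #define GRUB_CACHE_TIMEOUT  2

    /* In grub_disk_open (): the whole block cache is thrown away if more
       than GRUB_CACHE_TIMEOUT seconds have passed since a disk was last
       used; grub_last_time is also refreshed in grub_disk_close ().  */
    current_time = grub_get_time_ms ();
    if (current_time > (grub_last_time + GRUB_CACHE_TIMEOUT * 1000))
      grub_disk_cache_invalidate_all ();
    grub_last_time = current_time;

If I'm reading that right, it would explain why a sleep of more than 2
seconds between loading the kernel and the initrd helps: the next
grub_disk_open() flushes the cache before the initrd is read, which is
effectively the same thing as my hack in item 3.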

So for the moment, I can proceed with the workaround of adding a sleep in
the grub.cfg file. However, it would be much better to get clarity on
whether this is indeed a GRUB bug. I hesitate to file a bug report yet
because it would be very difficult to provide a way to reproduce it; I'm
working to see if I can reproduce it on generic hardware.

If anyone can provide hints on how to debug this, they would be most
welcome. At the moment I'm modifying the source, adding printouts, etc.,
then recompiling GRUB and reinstalling it on the target. Also, if anyone
knows a quick hack to disable the GRUB disk cache, I'd like to try it.
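
My first (untested) thought for disabling the cache is to stub out the two
static cache helpers in grub-core/kern/disk.c (grub_disk_cache_fetch() and
grub_disk_cache_store(), if I have the names right) so that every lookup
misses and nothing is ever stored:

    /* Untested sketch: make the block cache inert so that all reads go
       straight to the underlying disk/raid driver.  */

    static char *
    grub_disk_cache_fetch (unsigned long dev_id __attribute__ ((unused)),
                           unsigned long disk_id __attribute__ ((unused)),
                           grub_disk_addr_t sector __attribute__ ((unused)))
    {
      return 0;   /* always report a cache miss */
    }

    static grub_err_t
    grub_disk_cache_store (unsigned long dev_id __attribute__ ((unused)),
                           unsigned long disk_id __attribute__ ((unused)),
                           grub_disk_addr_t sector __attribute__ ((unused)),
                           const char *data __attribute__ ((unused)))
    {
      return GRUB_ERR_NONE;   /* drop the data, never cache it */
    }

If someone knows a cleaner switch for this, that would save me a rebuild
cycle.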

Thanks,

Neil Baylis

