Re: Xen boot strangeness (Was: Re: [SOLVED] Re: Xen 4.18.5_20250521nb0 not ELF binary (Was: Re: EFI and Xen))

Chuck Zmudzinski Fri, 30 May 2025 03:30:29 -0700

On 5/29/2025 4:02 PM, Greg A. Woods wrote:
> At Thu, 29 May 2025 15:01:50 -0400, Chuck Zmudzinski <frchu...@gmail.com> wrote:
> ... 
> 
>> When I pass bootdev=dk12 in boot.cfg, the bootloader strangely tries dk1 as root
>> (which is wrong) and correctly detects dk11 as the dump device. But it never
>> gives me the chance to enter the correct root device and instead tries to load
>> init which of course it cannot find the NetBSD init on dk1 because dk1 is not
>> the correct NetBSD root device. In fact on this box a Linux distro is installed
>> on dk1, as evidenced by the filesystem type detected on dk1: ext2fs.
> 
> Ah, I think that's a bug related to some bizarre/old hacks to find the
> "booted_partion" for non-GPT disks:
> 
>               if (strncmp(xcp.xcp_bootdev, devname, strlen(devname)))
>                       continue;
> 
>               if (is_disk && strlen(xcp.xcp_bootdev) > strlen(devname)) {
>                       /* XXX check device_cfdata as in x86_autoconf.c? */
>                       booted_partition = toupper(
>                               xcp.xcp_bootdev[strlen(devname)]) - 'A';
>                       DPRINTF(("%s: booted_partition: %d\n", __func__,
>                               booted_partition));
>               }
> 
> It looks like if the first "devname" that's tested is "dk1", then you
> get a mess when you're looking for "dk12".
>


I see this code is from arch/xen/xen/xen_machdep.c rev 1.27.4.1 for netbsd-10.

I am also not sure I understand this code. Is it supposed to detect, for
example, that the found devname is wd1 when we pass wd1a as the bootdev
in boot.cfg?

If so, it obviously breaks in this case: we passed dk12 as bootdev in
boot.cfg, and the code took dk1 as the found devname just because the
first three characters of "dk1" and "dk12" match and the string "dk12"
is longer than the string "dk1". We need more sanity checks than that to
always get it right on modern systems with more than 10 dk* wedge
devices.
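
To make the false positive concrete, here is a small userland sketch of
the prefix test as I read it from the quoted snippet. The function name
old_prefix_match is hypothetical, the is_disk check is left out, and this
is a simplified stand-in rather than the kernel code itself:

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Simplified stand-in for the quoted test (hypothetical name, not the
 * kernel function): accept devname when it is a prefix of bootdev, and
 * treat the next character of bootdev as a partition letter.
 */
static bool
old_prefix_match(const char *bootdev, const char *devname, int *partition)
{
	size_t len = strlen(devname);

	/* Only the first strlen(devname) characters are compared. */
	if (strncmp(bootdev, devname, len) != 0)
		return false;

	/* Any trailing character, even a digit, is taken as a partition. */
	if (strlen(bootdev) > len)
		*partition = toupper((unsigned char)bootdev[len]) - 'A';

	return true;
}
```

With bootdev "dk12" and devname "dk1" this returns true, and the
"partition" is computed from the character '2', which in ASCII gives
'2' - 'A' = -15. With "wd1a" and "wd1" it returns true with partition 0,
which is presumably the case the code was written for.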

A very simple sanity check that might fix this case would be to reject
the match if the extra character in the bootdev string is a numerical
digit instead of a letter between a and p, because only such a letter
would indicate that the two names are related as a full disk device and
a partition on that disk.
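
A minimal sketch of that check, written as a standalone userland function
so it can be tested outside the kernel. The name bootdev_matches and the
convention of returning -1 for "no partition letter" are my own
assumptions, not anything in the NetBSD tree, and the letter range a..p
assumes ASCII:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: decide whether bootdev names devname itself or
 * one of its partitions.  On a match, *partition is 0..15 for a single
 * trailing letter a..p (either case), or -1 when bootdev has no
 * partition letter at all.
 */
static bool
bootdev_matches(const char *bootdev, const char *devname, int *partition)
{
	size_t len;
	char c;

	assert(bootdev != NULL && devname != NULL && partition != NULL);

	len = strlen(devname);
	if (strncmp(bootdev, devname, len) != 0)
		return false;

	c = bootdev[len];
	if (c == '\0') {
		/* Exact match, e.g. bootdev "dk12" and devname "dk12". */
		*partition = -1;
		return true;
	}

	/* Fold upper case to lower case (ASCII assumed). */
	if (c >= 'A' && c <= 'P')
		c = (char)(c - 'A' + 'a');

	/* Reject digits ("dk12" vs "dk1") and anything after the letter. */
	if (c < 'a' || c > 'p' || bootdev[len + 1] != '\0')
		return false;

	*partition = c - 'a';
	return true;
}
```

With this version "dk12" no longer matches "dk1", while "wd1a" still
matches "wd1" with partition 0. Whether the real fix belongs here or in a
check like the one in x86_autoconf.c mentioned in the XXX comment is a
separate question.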

Essentially, it looks like we are getting a false positive when searching
for the device on which the root partition resides. I think that if we
add the extra sanity check of making sure the extra character is a letter
between a and p instead of a numerical digit like 2, we would correctly
detect dk12 as the root device on my system instead of getting dk1 as a
false positive.

I am not sure I understand why this search for the boot device is even
necessary on modern UEFI/GPT partitioned systems. Why not just allow the
user to specify the root device and the dump device directly in the
boot.cfg file and skip the search? Am I missing something? When I
interactively told the bootloader/NetBSD kernel what the root device is
and what the dump device is, the system booted fine without needing to
search for the boot device or even to know on which device the root
filesystem resides.

Maybe such simplicity is not possible, or not so easy, with legacy BIOS
booting and MBR partitioning, and on such systems we do need a
complicated search for the root device from a given bootdev in boot.cfg.
But it seems possible to avoid that complexity on modern UEFI/GPT
systems. Can we move in this direction going forward and slowly deprecate
the complicated search for boot devices on legacy systems?

> The code for all this root-finding stuff is spaghetti at best!  Tons of
> old assumptions held together by magic.

Yes, I agree. This code does not cope with double-digit indices on the
dk* wedge devices. I think that is the main reason it blew up in this
case, when we passed dk12 as the bootdev in boot.cfg and also had a
device named dk1 on the system. Today this may be a corner case, but
probably only because we are slow to adapt to the more modern UEFI/GPT
booting scheme, which is designed to scale to booting many different
partitions, certainly including partitions on dk* wedges with an index of
10 or more. So in the future, I think a system able to boot a dk* wedge
with an index of 10 or more should *not* be considered a corner case.

How to fix it? Or should we do more tests to verify the problem is correctly
understood here? Do we really want the Xen boot code to be unable to reliably
work as expected on systems that have more than 10 dk* wedge devices, which
is what we appear to have now?

Kind regards,

Chuck Zmudzinski
