On 5/29/2025 4:02 PM, Greg A. Woods wrote:
> At Thu, 29 May 2025 15:01:50 -0400, Chuck Zmudzinski <frchu...@gmail.com> wrote:
> ...
>
>> When I pass bootdev=dk12 in boot.cfg, the bootloader strangely tries dk1
>> as root (which is wrong) and correctly detects dk11 as the dump device.
>> But it never gives me the chance to enter the correct root device and
>> instead tries to load init which of course it cannot find the NetBSD
>> init on dk1 because dk1 is not the correct NetBSD root device. In fact
>> on this box a Linux distro is installed on dk1, as evidenced by the
>> filesystem type detected on dk1: ext2fs.
>
> Ah, I think that's a bug related to some bizarre/old hacks to find the
> "booted_partition" for non-GPT disks:
>
> 	if (strncmp(xcp.xcp_bootdev, devname, strlen(devname)))
> 		continue;
>
> 	if (is_disk && strlen(xcp.xcp_bootdev) > strlen(devname)) {
> 		/* XXX check device_cfdata as in x86_autoconf.c? */
> 		booted_partition = toupper(
> 			xcp.xcp_bootdev[strlen(devname)]) - 'A';
> 		DPRINTF(("%s: booted_partition: %d\n", __func__,
> 		    booted_partition));
> 	}
>
> It looks like if the first "devname" that's tested is "dk1", then you
> get a mess when you're looking for "dk12".
>

I see this code is from arch/xen/xen/xen_machdep.c rev 1.27.4.1 for
netbsd-10, and I am not sure I understand it. Is it supposed to detect,
for example, that the found devname is wd1 when we passed wd1a as the
bootdev in boot.cfg? If so, it obviously blows up in this case: we
passed dk12 as the bootdev in boot.cfg, and the code took dk1 as the
found devname just because the first three characters of "dk1" and
"dk12" match and "dk12" is longer than "dk1". With that match, the
quoted code also computes booted_partition as toupper('2') - 'A', which
is not a valid partition index. We need more sanity checks than that to
always get it right on modern systems with more than 10 dk* wedge
devices.

A very simple sanity check that might fix this case would be to reject
the match if the extra character in the bootdev string is a numerical
digit instead of a letter between a and p. Only such a letter indicates
that the two names are related as a full disk device and a partition on
that disk. Essentially, we are getting a false positive when searching
for the device on which the root partition resides. With the extra
check, I think we would correctly detect dk12 as the root device on my
system instead of getting dk1 as a false positive.

I am also not sure why it is even necessary to do this search for the
boot device on modern UEFI/GPT partitioned systems. Why not just allow
the user to specify the root device and the dump device directly in
boot.cfg on such systems and skip all this complex searching? Am I
missing something?

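A minimal sketch of the sanity check described above, written as
standalone userland C rather than as a patch to xen_machdep.c. The
helper name bootdev_matches and its signature are hypothetical. The
range a to p is the one suggested above, and rejecting any character
after the partition letter is an extra assumption on my part:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the NetBSD tree.
 *
 * Returns true if bootdev names devname itself or a partition on it.
 * *part is set to the partition index (0 for 'a' ... 15 for 'p'),
 * or to -1 when bootdev is exactly devname or there is no match.
 */
bool
bootdev_matches(const char *bootdev, const char *devname, int *part)
{
	size_t len;
	int c;

	assert(bootdev != NULL && devname != NULL && part != NULL);

	len = strlen(devname);
	*part = -1;

	/* Same prefix test as the quoted code. */
	if (strncmp(bootdev, devname, len) != 0)
		return false;

	/* Exact match: the device itself, no partition suffix. */
	if (bootdev[len] == '\0')
		return true;

	/*
	 * Extra sanity check: the suffix must be a single partition
	 * letter. A digit here means a different unit number, so
	 * "dk12" is not treated as a partition of "dk1".
	 */
	c = tolower((unsigned char)bootdev[len]);
	if (c < 'a' || c > 'p' || bootdev[len + 1] != '\0')
		return false;

	*part = c - 'a';
	return true;
}
```

With this check, "dk12" no longer matches "dk1", while "wd1a" still
matches "wd1" with partition 0.
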
It seems that when I interactively told the bootloader/NetBSD kernel
what the root device and the dump device are, the system booted fine
without needing to search for the boot device or even to know on which
device the root filesystem resides. Why not just make it possible for
the user to give boot.cfg the correct wedge devices for root and dump
and be done with it? Maybe such simplicity is not possible, or not so
easy, with legacy BIOS booting and MBR partitioning, and on such systems
we do need a complicated search for the root device from a given bootdev
in boot.cfg. But it seems possible to avoid that complexity on modern
UEFI/GPT systems. Can we move in this direction going forward and slowly
deprecate the complicated searching for boot devices on legacy systems?

> The code for all this root-finding stuff is spaghetti at best! Tons of
> old assumptions held together by magic.

Yes, I agree. This code does not cope with dk* wedge devices that have
double-digit indices. I think that is the main reason it blew up in
this case, when we passed dk12 as the bootdev in boot.cfg and also had a
device named dk1 on the system.

Today this may be a corner case, but probably only because we are slow
to adapt to the more modern UEFI/GPT booting scheme, which is designed
to scale to booting many different partitions, certainly including
partitions on dk* wedges with an index of 10 or more. So in the future,
I think a system that boots from a dk* wedge with an index of 10 or more
should *not* be considered a corner case.

How to fix it? Or should we do more tests to verify that the problem is
correctly understood here? Do we really want the Xen boot code to be
unable to work reliably on systems that have more than 10 dk* wedge
devices, which is what we appear to have now?

Kind regards,

Chuck Zmudzinski