> On 24 May 2026, at 13:58, Sergey Bugaev <[email protected]> wrote:
>
> On Sun, May 24, 2026 at 4:43 AM Paulo Duarte <[email protected]> wrote:
>>
>> The imported boot.S places the boot stack inside the .bss segment:
>>
>>     .bss
>>     .boot_stack:
>>         .space 4096
>>     .boot_stack_end:
>>
>> c_boot_entry() is the first C function called from _start, with sp
>> already pointing at .boot_stack_end. Its first action is to call
>> zero_out_bss(), which memsets [__bss_start, __bss_end) — the whole
>> .bss range, including the very boot stack the kernel is *currently
>> running on*. That wipes the saved x29/x30 and any locals the
>> compiler spilled on entry, so the next return / function call
>> branches to 0 and the kernel hangs in EL1.
>>
>> Move the boot stack into its own `.boot_stack` nobits section and
>> place that section after `__bss_end` in the linker script so
>> zero_out_bss() leaves it alone:
>>
>>     .section .boot_stack, "aw", %nobits
>>     boot_stack:
>>         .space 4096
>>     .boot_stack_end:
>>
>> Brought up under qemu-system-aarch64 -M virt the bug fires
>> immediately; wip-aarch64 likely never exercised the
>> zero_out_bss-from-_start path because its testing was on a
>> different boot route.
>
> Could you expand? What different boot route?
>
> The patch makes sense, but it is really interesting that this was not
> causing issues for us at the time.
>

This one really eluded me, to be honest. The "different boot route" was an
assumption on my part, based on the fact that the wip-aarch64 branch and
upstream master have diverged massively in the bootstrap code. That
assumption was totally incorrect.

Upon further investigation, I found that the actual root cause was that my
cross-compiling gcc had -fno-omit-frame-pointer ON, and that is what
tripped the bug. With the flag OFF, the bug doesn't trigger.

I will amend the commit message in v2, as I still think this is a genuine
bug fix: it gives the kernel consistent behaviour with or without the flag.

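For reference, a minimal sketch of the linker-script side of the fix. The
symbol names __bss_start and __bss_end and the .boot_stack section come from
the patch; the surrounding layout, the input-section patterns, the NOLOAD
type and the ALIGN(16) are assumptions for illustration and may not match
the actual script in the tree:

```
/* Sketch only: output-section layout is assumed, not copied from the tree. */
SECTIONS
{
    .bss : {
        __bss_start = .;
        *(.bss .bss.*)
        *(COMMON)
        __bss_end = .;          /* zero_out_bss() stops here */
    }

    /* Placed after __bss_end, so the memset never touches the live stack.
       ALIGN(16) is an assumption, chosen to keep sp 16-byte aligned. */
    .boot_stack (NOLOAD) : ALIGN(16) {
        *(.boot_stack)
    }
}
```

The point is only the ordering: as long as .boot_stack is emitted outside
[__bss_start, __bss_end), zero_out_bss() cannot clobber whatever the
compiler saved on entry to c_boot_entry(), whichever frame-pointer setting
is in use.
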
Paulo
