On 2025-03-29 09:44 +0800, Baoquan He wrote:
> On 03/29/25 at 01:14am, Roberto Ricci wrote:
> [snip]
> > Anyway, I performed yet another bisection, this time with just plain
> > defconfig plus CONFIG_KEXEC_FILE=y, and I got different results.
> >
> > Updated steps to reproduce:
> > 1. Boot kernel >= v6.8 in a virtual machine created with this command:
> > `qemu-system-x86_64 -enable-kvm -smp 1 -m 4.0G -hda disk.qcow2`
> > 2. Load the same kernel with:
> > `kexec --kexec-file-syscall -l /boot/vmlinuz-6.14.0 --initrd
> > /boot/initramfs-6.14.0.img --reuse-cmdline`
> > 3. Reboot (or call `kexec -e` directly)
> > 4. Hibernate and reboot: `printf reboot >/sys/power/disk && printf disk
> > >/sys/power/state`
> > 5. Upon resuming, three things could happen, depending on luck:
>
> OK, this is a little complicated. wondering why you need to do the
> hibernation and reboot. Just for curiosity.

The reason I do hibernation and reboot, instead of hibernating and then
manually booting again, is just convenience during tests. The issue occurs
with manual reboot too.

The reason I want kexec + hibernation to work is to fix a hibernation
issue on a system using ZFSBootMenu, a bootloader based on Linux which
uses kexec to boot the final OS. Other software using the same mechanism
includes Petitboot and LinuxBoot. They might be affected as well, but I
didn't try them.

> > 5a. A kernel oops:
> > ```
> > [ 42.574201] BUG: kernel NULL pointer dereference, address:
> > 0000000000000000
> ...snip...
> > I will send config and dmesg in replies to this email.
> >
> > The bisection pointed to
> > b3ba234171cd kexec_file: load kernel at top of system RAM if required
> [snip]
>
> I doubt how this caused the failure. I have several questions, could you
> help answer:
>
> 1) Can this problem be stably reproduced with kexec_file_load?

Every kernel build I tested which contains that commit is affected.
However, a given build will not always lead to the same one of the three
possible outcomes I described. E.g. first you get an oops (case 5a), then
you repeat the same steps with the same kernel image and the system may
get stuck at a black screen instead (case 5b). But it never fully works.

> 2) if answer to 1) is yes, can reverting b3ba234171cd fix it stably?

Yes. None of the cases 5{a,b,c} I previously described occur. Seems to
work fine.

> 3) If answer to 1) and 2) is yes, does kexec_load works for you? Asking
> this because kexec_load interface defaults to put kexec kernel on top of
> system RAM which is equivalent to applying commit b3ba234171cd.

No, it doesn't. While hibernation alone works, kexec + hibernation results
in the system just rebooting without resuming the hibernation image, but
no crash or other weird behaviour occurs.
Initially I decided to focus on kexec_file_load in order to narrow things
down, but that was before noticing that the bug could manifest itself in
different forms.
It is possible, indeed, that both syscalls are affected by the same
problem, which is not caused by commit b3ba234171cd.
I tried to test kexec_load with some older kernels, but I got build
errors, so I tested longterm releases where such errors have been fixed.
With v4.9.337, kexec (via kexec_load) + hibernation works. With v5.4.291
it doesn't. I'm not sure how bisection could be done in this case.

> 4) Can you add '-d' to 'kexec -l' to print more debugging message?

When using kexec_file_load, just these two lines get printed:
```
Try gzip decompression.
Try LZMA decompression.
```

When using kexec_load on kernel v5.4.291 (which doesn't work):
[the output is in a reply to this email]

When using kexec_load on kernel v4.9.337 (which works):
Identical to above, except for the exact hex value of some addresses.

> 5) Can normal kexec trigger the failure? I mean operating kexec w/o
> the hibernation/resumption.

No, kexec without hibernation seems to work fine, regardless of kernel
version and kexec syscall used.
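
For anyone who wants to repeat the test with either syscall, here is a
minimal sketch of the quoted steps 2-4 as shell helpers. The function
names and the DRY_RUN variable are made up for illustration and are not
part of kexec-tools. The kernel and initramfs paths are the ones from
the steps to reproduce. By default the helpers only print the commands;
actually running them needs root and should be done in a disposable VM.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
KERNEL="${KERNEL:-/boot/vmlinuz-6.14.0}"
INITRD="${INITRD:-/boot/initramfs-6.14.0.img}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 2. $1 selects the syscall:
#   file -> kexec_file_load (kexec --kexec-file-syscall)
#   load -> kexec_load      (kexec --kexec-syscall)
load_kernel() {
    case "$1" in
        file) flag=--kexec-file-syscall ;;
        load) flag=--kexec-syscall ;;
        *) echo "usage: load_kernel file|load" >&2; return 2 ;;
    esac
    run kexec "$flag" -l "$KERNEL" --initrd "$INITRD" --reuse-cmdline
}

# Step 3: jump straight into the loaded kernel.
boot_loaded_kernel() {
    run kexec -e
}

# Step 4: hibernate with the "reboot" mode. The redirections need a shell,
# so the whole command line is passed to sh -c.
hibernate_and_reboot() {
    run sh -c 'printf reboot >/sys/power/disk && printf disk >/sys/power/state'
}
```

Typical use: source the file, run `load_kernel file` (or `load_kernel load`)
followed by `boot_loaded_kernel`, then, once the kexec'd kernel is up,
source it again and run `hibernate_and_reboot`.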