Re: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n

Jani Nikula Fri, 04 Jul 2025 01:29:13 -0700

On Thu, 03 Jul 2025, Askar Safin <safinas...@zohomail.com> wrote:
> TL;DR: I found a bug in a strange interaction between kexec_file_load (but
> not kexec_load) and i915
> TL;DR#2: The second (sometimes third or fourth) kexec (using kexec_file_load)
> fails on my particular hardware
> TL;DR#3: I did 55 experiments, each of which required a lot of boots; in
> total I did 1908 boots


Thanks for the detailed debug info. I'm afraid all I can say at this
point is, please file all of this in a bug report as described in
[1]. Please add the drm.debug related options, and attach the dmesgs and
configs in the bug instead of pointing at external sites.

BR,
Jani.


[1] https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html


>
> Okay, so I found a bug. Steps to reproduce:
> - I have a Dell Precision 7780
> - I have a recent Debian x86_64 sid installed (the bug is reproducible with
> both Debian kernels and mainline ones)
> - The bug is reproducible on many kernels, including very recent ones, for
> example 6.15.4
> - Boot the system, then kexec into the same system using kexec_file_load,
> i.e. pass --kexec-file-syscall to the "kexec" command
> - Then kexec from this kexec'ed system again (i.e. you should do two kexecs
> in a row)
> - Then do a 3rd kexec, etc.
> - Repeat until you have done 100 kexecs or your system starts to misbehave
>
> On my computer the system starts to misbehave after some number of kexecs.
> This never happens before the 2nd kexec attempt, i.e. the first kexec is
> always successful, but the second sometimes is not.
> I was never able to perform 100 kexecs in a row.
> After some kexec attempt the system starts to misbehave: oopses, panics,
> a locked-up system, etc.
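
The reproduction steps above could be sketched roughly as follows. This is a
minimal sketch, not the reporter's actual script: the helper name
build_kexec_commands and the kernel and initramfs paths are hypothetical, and
the command line is left as a placeholder. It only builds the two kexec
invocations (load, then execute) so they can be inspected before being run as
root. Repeating them on every boot needs some boot-time hook, which is not
shown here.

```python
import shlex


def build_kexec_commands(kernel, initrd, cmdline, use_file_syscall=True):
    """Return two argv lists: load the next kernel, then jump into it."""
    # --kexec-file-syscall selects kexec_file_load,
    # --kexec-syscall selects the older kexec_load.
    syscall_flag = "--kexec-file-syscall" if use_file_syscall else "--kexec-syscall"
    load = [
        "kexec", syscall_flag, "--load", kernel,
        f"--initrd={initrd}",
        f"--command-line={cmdline}",
    ]
    # Note: "kexec --exec" jumps to the loaded kernel without a clean shutdown.
    execute = ["kexec", "--exec"]
    return load, execute


if __name__ == "__main__":
    load, execute = build_kexec_commands(
        kernel="/boot/vmlinuz",       # hypothetical path
        initrd="/boot/initrd.img",    # hypothetical path
        cmdline="root=UUID=... ro",   # placeholder, fill in your own
    )
    print(shlex.join(load))
    print(shlex.join(execute))
```

Passing use_file_syscall=False gives the kexec_load variant, which per the
report does not trigger the problem.
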
>
> Notes:
>
> - I tried to bisect the "kexec-tools" package, but the bisect merely gave me
> the commit that switched to kexec_file_load as the default.
> The bug is reproducible if we use kexec_file_load, but doesn't reproduce if
> we use kexec_load
>
> - The bug is reproducible even if we boot via init=/bin/bash (note: this
> means that the initramfs is still part of the boot process). (If we boot to
> the normal GUI, the bug is reproducible, too)
>
> - To reproduce, I use this command line: "root=UUID=...
> rootflags=subvol=... ro init=..."
>
> - The Debian package "plymouth" is required for reproducing. (The bug
> reproduces with plymouth, but doesn't reproduce without it.) But note that I
> never see the actual plymouth screen! I.e. the presence of "plymouth" on the
> system somehow affects bug reproducibility even though the plymouth animation
> is never actually shown. I don't know why this happens. I suspect I don't see
> the plymouth screen because I don't pass "splash" on the kernel command line,
> but plymouth is probably still included in the initramfs and from there
> somehow affects the boot process
>
> - The bug reproduces in Debian, but doesn't reproduce in Ubuntu. After a lot
> of experimenting I finally understood why: the Ubuntu kernel has
> CONFIG_INTEL_IOMMU_DEFAULT_ON=y, and the Debian kernel does not. Additional
> experiments confirmed that this is the culprit, i.e. the bug is reproducible
> with CONFIG_INTEL_IOMMU_DEFAULT_ON=n and not reproducible with
> CONFIG_INTEL_IOMMU_DEFAULT_ON=y. (So advice for distributions: do what
> Ubuntu does, i.e. set CONFIG_INTEL_IOMMU_DEFAULT_ON=y to hide this bug)
>
> - The bug is not reproducible in old enough kernels, so I bisected Linux.
> The bisect gave me this range: d4a2393049..4a75f32fc7, i.e. the bug is
> reproducible in 4a75f32fc7, but doesn't reproduce in d4a2393049. Between them
> there is a middle commit, 52407c220c44c8dcc6a, which is not testable. Here
> are these commits:
>
> commit 4a75f32fc783128d0c42ef73fa62a20379a66828
> Author: Anusha Srivatsa <anusha.sriva...@intel.com>
>
>    drm/i915/rpl-s: Add PCH Support for Raptor Lake S
>
> commit 52407c220c44c8dcc6aa8aa35ffc8a2db3c849a9
> Author: Anusha Srivatsa <anusha.sriva...@intel.com>
>
>    drm/i915/rpl-s: Add PCI IDS for Raptor Lake S
>
> It seems these commits merely added support for my Intel GPU model, so this
> is a fake regression. I'm not sure whether this should be treated as a proper
> regression and whether regzbot should be notified. (What do you think?)
>
> Still, formally this is a regression: my experiments show that the bug is
> present in 4a75f32fc783128d0c42 and not present before. (Side note: in the
> latest kernels both wayland and x11 work; in d4a2393049 x11 works and wayland
> doesn't.)
>
> I tried to reproduce the bug in QEMU, but was unable to do so. It seems an
> Intel GPU is required, maybe even my particular model.
>
> Here is "lspci -vnn -d :*:0300" for my GPU:
>
> 00:02.0 VGA compatible controller [0300]: Intel Corporation Raptor Lake-S UHD 
> Graphics [8086:a788] (rev 04) (prog-if 00 [VGA controller])
>         Subsystem: Dell Raptor Lake-S UHD Graphics [1028:0c42]
>         Flags: bus master, fast devsel, latency 0, IRQ 202, IOMMU group 0
>         Memory at 604b000000 (64-bit, non-prefetchable) [size=16M]
>         Memory at 4000000000 (64-bit, prefetchable) [size=256M]
>         I/O ports at 3000 [size=64]
>         Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
>         Capabilities: [40] Vendor Specific Information: Len=0c <?>
>         Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
>         Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
>         Capabilities: [d0] Power Management version 2
>         Capabilities: [100] Process Address Space ID (PASID)
>         Capabilities: [200] Address Translation Service (ATS)
>         Capabilities: [300] Page Request Interface (PRI)
>         Capabilities: [320] Single Root I/O Virtualization (SR-IOV)
>         Kernel driver in use: i915
>         Kernel modules: i915
>
> dmidecode:
> https://zerobin.net/?aebea072b93d8122#z4W9URnV+k9ZZErhP4etQkxlfpyRKf++uKMNoO5PGjs=
>
> - I use "root=UUID=... rootflags=subvol=... ro init=..." as the command line
> for reproducing. If I add "recovery nomodeset dis_ucode_ldr" (these are the
> options Ubuntu uses in recovery mode), the bug stops reproducing
>
> Again, in short, the full list of things required to reproduce the bug:
> - Intel GPU, possibly my particular model
> - Kernel with support for my model (4a75f32fc783128d0c42 and later, up to
> 6.15.4)
> - Kexec at least two times. (One kexec never fails, 100 kexecs in a row
> never succeed)
> - kexec_file_load as opposed to kexec_load
> - Initramfs
> - Absence of the parameters "recovery nomodeset dis_ucode_ldr" (i.e. one of
> them stops the bug from reproducing)
> - plymouth
> - CONFIG_INTEL_IOMMU_DEFAULT_ON=n
>
> Removing ANY of them stops the bug, and I proved this with lots of
> experiments.
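
The CONFIG_INTEL_IOMMU_DEFAULT_ON condition in the list above can be checked
against a kernel config. A minimal sketch, assuming the config is available as
plain text (for example /boot/config-<version> on Debian-style systems); the
helper name config_value is hypothetical:

```python
import re


def config_value(config_text, option):
    """Return the option's value ('y', 'm', ...) or 'n' if it is unset."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as a comment in kernel config files.
        if line == f"# {option} is not set":
            return "n"
        match = re.fullmatch(rf"{re.escape(option)}=(.*)", line)
        if match:
            return match.group(1)
    return "n"  # assumption: an absent option is treated as disabled


if __name__ == "__main__":
    sample = (
        "CONFIG_INTEL_IOMMU=y\n"
        "# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set\n"
    )
    print(config_value(sample, "CONFIG_INTEL_IOMMU_DEFAULT_ON"))
```

On a kernel built with =n, the intel_iommu=on kernel parameter enables the
IOMMU at boot. Whether that also hides the bug is not something the report
tested.
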
>
> In total I did 55+ experiments, each of which required up to 100 boots. In
> total I did 1908 (!!!!!!) boots on my physical laptop (I mean kexec boots
> here). No, I'm not faking this number, here are my actual directories with
> results:
>
> user@subvolume:~$ ls /rbt/kx-results/
> @rec-2025-06-29T201723Z-bad-4
> @rec-2025-06-29T203429Z-good-60
> @rec-2025-06-29T205626Z-good-60
> @rec-2025-06-29T211612Z-bad-6
> @rec-2025-06-29T212932Z-good-60
> @rec-2025-06-29T214650Z-good-60
> @rec-2025-06-29T215558Z-bad-8
> @rec-2025-07-01T042949Z-bad-12
> @rec-2025-07-02T120101Z-good-60
> @rec-2025-07-03T031038Z-good-60
> @rec-2025-07-03T050626Z-bad-41
> @rec-2025-07-03T060107Z-good-100
> @rec-2025-07-03T074810Z-good-100
> @rec-2025-07-03T082914Z-good-100
> @rec-2025-07-03T100615Z-good-100
> @rec-2025-07-03T104125Z-bad-28
> @rec-2025-07-03T111727Z-bad-13
> @rec-2025-07-03T122242Z-good-100
> @rec-2025-07-03T123958Z-bad-12
> @rec-2025-07-03T132116Z-good-100
> @rec-2025-07-03T133705Z-bad-3
> @rec-2025-07-03T141647Z-good-100
> @rec-2025-07-03T145705Z-good-100
> @rec-2025-07-03T152406Z-bad-50
> @rec-2025-07-03T154204Z-bad-15
> user@subvolume:~$ ls /rbt/kx-manual-testing/
> 2025-07-01-03-19-good-6
> 2025-07-01-03-44-good-3
> 2025-07-01-03-56-good-4
> 2025-07-01-04-47-good-3
> 2025-07-01-05-28-bad-3
> 2025-07-01-06-19-bad-2
> 2025-07-01-06-35-bad-2
> 2025-07-01-09-21-bad-2
> 2025-07-01-09-46-good-8
> 2025-07-02-13-09-good
> user@subvolume:~$ ls /rbt/kx-vanilla-results/
> 2025-06-30T005219Z_5.16.0-kx-df0cc57e057f18e4-3e17eec5ff024b63_1626_good_60
> 2025-06-30T012313Z_5.17.0-kx-f443e374ae131c16-91b07dce12a83fab_1674_bad_1
> 2025-06-30T013555Z_5.16.0-kx-22ef12195e13c5ec-9aaf880b25942f2a_1668_bad_7
> 2025-06-30T014106Z_5.16.0-kx-9bcbf894b6872216-b828905f3cf12050_1664_bad_2
> 2025-06-30T014634Z_5.16.0-rc5-kx-cb6846fbb83b574c-83e7c6cf2ede57b4_1663_bad_6
> 2025-06-30T015713Z_5.16.0-rc2-kx-15bb79910fe734ad-9f30253daecd39e5_1663_good_60
> 2025-06-30T020235Z_5.16.0-rc5-kx-b06103b5325364e0-26176b9b704a5c24_1664_bad_6
> 2025-06-30T020717Z_5.16.0-rc5-kx-eacef9fd61dcf5ea-26176b9b704a5c24_1664_bad_1
> 2025-06-30T021738Z_5.16.0-rc2-kx-67b858dd89932086-8d2f1d17f1e1933c_1662_good_60
> 2025-06-30T022759Z_5.16.0-rc2-kx-17815f624a90579a-854a1f40ce042801_1662_good_60
> 2025-06-30T023542Z_5.16.0-rc2-kx-87bb2a410dcfb617-9f30253daecd39e5_1663_bad_4
> 2025-06-30T032312Z_5.16.0-rc2-kx-c9ee950a2ca55ea0-854a1f40ce042801_1662_bad_6
> 2025-06-30T033528Z_5.16.0-rc2-kx-ba884a411700dc56-854a1f40ce042801_1662_good_60
> 2025-06-30T034645Z_5.16.0-rc2-kx-d4a23930490df39f-854a1f40ce042801_1662_good_60
> 2025-06-30T035232Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_5
> 2025-06-30T042058Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_1
> 2025-06-30T050000Z_6.15.4-kx-e60eb441596d1c70-2378f4efc5e956e5_2366_bad_2
> 2025-06-30T053011Z_6.15.4-kx-e60eb441596d1c70-2378f4efc5e956e5_2366_good_60
> 2025-06-30T060619Z_5.16.0-rc2-kx-d4a23930490df39f-854a1f40ce042801_1662_good_60
> 2025-06-30T061448Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_1
>
> The number at the end of each file/directory name is the number of boots. In
> total we have 1908 boots. Testing was mostly automated, using my script.
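
The boot total can be recomputed from the result names in the listings above.
A minimal sketch, assuming each name ends in "good" or "bad" followed by an
optional -N or _N boot count, and assuming a name without a number counts as a
single boot (that last point is a guess, not something the report states);
count_boots is a hypothetical helper:

```python
import re

# Trailing "good"/"bad" marker with an optional "-N" or "_N" boot count.
_SUFFIX = re.compile(r"(good|bad)(?:[-_](\d+))?$")


def count_boots(names):
    """Sum the trailing boot counts; return (total_boots, experiments)."""
    total = 0
    experiments = 0
    for name in names:
        match = _SUFFIX.search(name)
        if match is None:
            continue  # not a result entry
        experiments += 1
        # Assumption: a result without a number was a single boot.
        total += int(match.group(2)) if match.group(2) else 1
    return total, experiments


if __name__ == "__main__":
    sample = [
        "@rec-2025-06-29T201723Z-bad-4",
        "2025-07-02-13-09-good",
        "2025-06-30T050000Z_6.15.4-kx-e60eb441596d1c70-2378f4efc5e956e5_2366_bad_2",
    ]
    print(count_boots(sample))
```

Feeding it the names from all three result directories (for example via
os.listdir) gives the overall experiment and boot counts.
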
>
> Here is one example dmesg from mainline commit e60eb441596d1c70 (somewhere 
> around 6.15.4):
>
> https://zerobin.net/?119ff118fd47b363#BpziYs6dNz5PaT7H8w2hlveoEYa4DDtITGkyd9o57LE=
>
> This was the dmesg from the 2nd (and at the same time last) boot. The next
> boot (i.e. kexec) was unsuccessful. Corresponding config:
>
> https://zerobin.net/?009c807e1df41af8#gnmrswlbaFbdPTuzNq6NFkQd/Jhb3Ds0ZlLiwNanXnc=
>
> If you want the results from all experiments, here is a link:
> https://filebin.net/45g2757b2iwaeen7 (1 Mb, expires after 7 days). The
> experiments usually come with the full reproducer script.
>
> But what I described above should already be enough, so I think this link is
> not needed.
>
> I will be available for testing in the coming days; after that I will switch
> to other things and will no longer be available for testing.
> If you want more time, then please ask for it, i.e. tell me something like
> "Please be available for testing for 10 more days".
>
> --
> Askar Safin
> https://types.pl/@safinaskar
>

-- 
Jani Nikula, Intel
