Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008
Hi Ma Jun,

Greetings again. So, I just tested the recommended patch and the issue with the graphical login screen was successfully resolved. Thank you very much for your prompt reviews and recommended patches. God bless.

Best regards,
Mirsad Todorovac

On 1/25/24 10:29, Mirsad Todorovac wrote:

Hi Ma Jun,

Copy that. This appears to be the exact problem, and thank you for reviewing the bug report at such short notice. I apologise for the wrong assertion. The patch you sent then just triggered another bug, which does not manifest without the patch (a NULL pointer dereference occurs instead). Of course, it would not help to remove your patch and bring back the NULL pointer dereference; a proper fix is required. Thanks again.

Best regards,
Mirsad Todorovac

On 1/25/2024 8:38 AM, Ma, Jun wrote:

Hi Mirsad,

On 1/25/2024 1:48 AM, Mirsad Todorovac wrote:

Hi, Ma Jun,

Normally, I would reply under the quoted text, but I will adjust to your convention. I have just discovered that your patch causes the Ubuntu 22.04 LTS GNOME XWayland session to block after typing the password and pressing ENTER in the graphical logon screen (tested several times).

This problem is not caused by my patch. Based on your syslog, it looks more like a schedule issue. I just saw a similar problem, please refer to the link below
https://gitlab.freedesktop.org/drm/amd/-/issues/3124

Regards,
Ma Jun

After that, I was not able to even log in from another box with ssh, or the session would block (tested one time, second time too, third time it passed after I connected before attempting to log in on the XWayland console).
You might find useful syslog and dmesg of the freeze on this link (they were +100K):

https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/

The exact applied patch was this:

marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 73f6d7e72c73..6ef333df9adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct amdgpu_device *adev)
 
 	if (!amdgpu_sriov_vf(adev)) {
 		snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", ucode_prefix);
-		err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
-		/* don't check this. There are apparently firmwares in the wild with
-		 * incorrect size in the header
-		 */
-		if (err == -ENODEV)
-			goto out;
+		err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
 		if (err)
-			dev_dbg(adev->dev,
-				"gfx10: amdgpu_ucode_request() failed \"%s\"\n",
-				fw_name);
+			goto out;
+
+		/* don't validate this firmware. There are apparently firmwares
+		 * in the wild with incorrect size in the header
+		 */
 		rlc_hdr = (const struct rlc_firmware_header_v2_0 *)adev->gfx.rlc_fw->data;
 		version_major = le16_to_cpu(rlc_hdr->header.header_version_major);
 		version_minor = le16_to_cpu(rlc_hdr->header.header_version_minor);
marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
Linux 6.7.0-xway-09721-g61da593f4458 x86_64
marvin@defiant:~/linux/kernel/linux_torvalds$

So, there seems to be a problem with the way the patch affects XWayland. Checked multiple times the exact commit with and without the diff.

Hope this helps, because I am not familiar with the amdgpu driver.
Best regards,
Mirsad Todorovac

On 1/22/24 09:34, Ma, Jun wrote:

Perhaps similar to the problem I encountered earlier, you can try the following patch
https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html

Regards,
Ma Jun

On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:

Hi,

The last email did not reach most of the recipients due to the banned .xz attachment.

As the .config is too big to send inline or uncompressed either, I will omit it in this attempt. In the meantime, I had some success in decoding the stack trace, but sadly it is not complete.

I don't think this Oops is deterministic, but I am working on a reproducer.

The platform is Ubuntu 22.04 LTS.

Complete list of hardware and .config is available here:
https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/

Best regards,
Mirsad

---
kernel: [ 5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
kernel: [ 5.576707] #PF: supervisor read access in kernel mode
kernel: [ 5.576710] #PF: error_code(0x) - not-present page
kernel: [ 5.576712] PGD 0 P4D 0
kernel: [
Re: BUG [RESEND]: kernel NULL pointer dereference, address: 0000000000000008
On 22. 01. 2024. 09:34, Ma, Jun wrote:
> Perhaps similar to the problem I encountered earlier, you can
> try the following patch
>
> https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html

Apparently, this patch prevented the NULL dereference; it was no longer in the log.

However, there is another hang in the XWayland password entry dialog, and I do not think that I have figured out what is wrong.

Best regards,
Mirsad

> Regards,
> Ma Jun
>
> On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:
>> Hi,
>>
>> The last email did not pass to the most of the recipients due to banned .xz
>> attachment.
>>
>> As the .config is too big to send inline or uncompressed either, I will omit
>> it in this attempt. In the meantime, I had some success in decoding the stack
>> trace, but sadly not complete.
>>
>> I don't think this Oops is deterministic, but I am working on a reproducer.
>>
>> The platform is Ubuntu 22.04 LTS.
>>
>> Complete list of hardware and .config is available here:
>>
>> https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/
>>
>> Best regards,
>> Mirsad
>>
>> ---
>> kernel: [5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
>> kernel: [5.576707] #PF: supervisor read access in kernel mode
>> kernel: [5.576710] #PF: error_code(0x) - not-present page
>> kernel: [5.576712] PGD 0 P4D 0
>> kernel: [5.576715] Oops: [#1] PREEMPT SMP NOPTI
>> kernel: [5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
>> kernel: [5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>> kernel: [5.576726] RIP: 0010:gfx_v10_0_early_init (drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
>> kernel: [ 5.576872] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 ed
>> 0f 84 b3 fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87 01 00 4c 89 ff
>> <48> 8b 40 08 0f b7 50 0a 0f
>> b7 70 08 e8 e4 42 fb ff 41 89 c2 85 c0
>> All code
>>
>>    0:  8d 55 a8                lea    -0x58(%rbp),%edx
>>    3:  4c 89 ff                mov    %r15,%rdi
>>    6:  e8 e4 83 ec ff          call   0xffec83ef
>>    b:  41 89 c2                mov    %eax,%r10d
>>    e:  83 f8 ed                cmp    $0xffed,%eax
>>   11:  0f 84 b3 fd ff ff       je     0xfdca
>>   17:  85 c0                   test   %eax,%eax
>>   19:  74 05                   je     0x20
>>   1b:  0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>   20:  49 8b 87 08 87 01 00    mov    0x18708(%r15),%rax
>>   27:  4c 89 ff                mov    %r15,%rdi
>>   2a:* 48 8b 40 08             mov    0x8(%rax),%rax    <-- trapping instruction
>>   2e:  0f b7 50 0a             movzwl 0xa(%rax),%edx
>>   32:  0f b7 70 08             movzwl 0x8(%rax),%esi
>>   36:  e8 e4 42 fb ff          call   0xfffb431f
>>   3b:  41 89 c2                mov    %eax,%r10d
>>   3e:  85 c0                   test   %eax,%eax
>>
>> Code starting with the faulting instruction
>> ===
>>    0:  48 8b 40 08             mov    0x8(%rax),%rax
>>    4:  0f b7 50 0a             movzwl 0xa(%rax),%edx
>>    8:  0f b7 70 08             movzwl 0x8(%rax),%esi
>>    c:  e8 e4 42 fb ff          call   0xfffb42f5
>>   11:  41 89 c2                mov    %eax,%r10d
>>   14:  85 c0                   test   %eax,%eax
>> kernel: [5.576878] RSP: 0018:a5b3c103f720 EFLAGS: 00010282
>> kernel: [5.576881] RAX: RBX: c1d73489 RCX:
>> kernel: [5.576884] RDX: RSI: RDI: 91ae4fa8
>> kernel: [5.576886] RBP: a5b3c103f7b0 R08: R09:
>> kernel: [5.576889] R10: ffea R11: R12: 91ae4fa986e8
>> kernel: [5.576892] R13: 91ae4fa986d8 R14: 91ae4fa986f8 R15: 91ae4fa8
>> kernel: [5.576895] FS: 7fdaa343c8c0() GS:91bd5844() knlGS:
>> kernel: [5.576898] CS: 0010 DS: ES: 00
Re: [BUG][BISECTED] Freeze at loading init ramdisk
On 1/22/24 11:20, Uwe Kleine-König wrote:
On Thu, Jan 18, 2024 at 09:04:05PM +0100, Mirsad Todorovac wrote:
On 1/18/24 08:45, Uwe Kleine-König wrote:

Hello Mirsad,

On Wed, Jan 17, 2024 at 07:47:49PM +0100, Mirsad Todorovac wrote:
On 1/16/24 01:32, Mirsad Todorovac wrote:

On the Ubuntu 22.04 LTS Jammy platform, on a mainline vanilla torvalds tree kernel, the boot freezes upon first two lines and before any systemd messages.

(Please find the config attached.)

Bisecting the bug led to this result:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
d97a78423c33f68ca6543de510a409167baed6f5 is the first bad commit
commit d97a78423c33f68ca6543de510a409167baed6f5
Merge: 61da593f4458 689237ab37c5
Author: Linus Torvalds
Date: Fri Jan 12 14:38:08 2024 -0800

[...]

Hope this helps.

P.S. As I see that this is a larger merge commit, with 5K+ lines changed, I don't think I can bisect further to determine the culprit.

Actually it's not that hard. If a merge commit is the first bad commit for a bisection, either the merge wasn't done correctly (less likely, looking at d97a78423c33f68ca6543de510a409167baed6f5 I'd bet this isn't the problem); or changes on different sides conflict or you did something wrong during bisection.

To rule out the third option, you can just retest d97a78423c33, 61da593f4458 and 689237ab37c5. If d97a78423c33 is the only bad one, you did it right.

This was confirmed.

Then to further debug the second option you can find out the offending commit on each side with a bisection as follows, here for the RHS (i.e. 689237ab37c5):

	git bisect start 689237ab37c5 $(git merge-base 61da593f4458 689237ab37c5)

and then in each bisection step do:

	git merge --no-commit 61da593f4458
	test if the problem is present
	git reset --hard
	git bisect good/bad

In this case you get merge conflicts in drivers/video/fbdev/amba-clcd.c and drivers/video/fbdev/vermilion/vermilion.c. In the assumption that you don't have these enabled in your .config, you can just ignore these.

Side note: A problem during bisection can be that the .config changes along the process. You should put your config into (say) arch/x86/configs/lala_defconfig and do make lala_defconfig before building each step to prevent this.

I must have done something wrong:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect log
# bad: [689237ab37c59b9909bc9371d7fece3081683fba] fbdev/intelfb: Remove driver
# good: [de927f6c0b07d9e698416c5b287c521b07694cac] Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect start '689237ab37c5' 'de927f6c0b07d9e698416c5b287c521b07694cac'
# good: [d9f25b59ed85ae45801cf45fe17eb269b0ef3038] fbdev: Remove support for Carillo Ranch driver
git bisect good d9f25b59ed85ae45801cf45fe17eb269b0ef3038
# good: [e2e0b838a1849f92612a8305c09aaf31bf824350] video/sticore: Remove info field from STI struct
git bisect good e2e0b838a1849f92612a8305c09aaf31bf824350
# good: [778e73d2411abc8f3a2d60dbf038acaec218792e] drm/hyperv: Remove firmware framebuffers with aperture helper
git bisect good 778e73d2411abc8f3a2d60dbf038acaec218792e
# good: [df67699c9cb0ceb70f6cc60630ca938c06773eda] firmware/sysfb: Clear screen_info state after consuming it
git bisect good df67699c9cb0ceb70f6cc60630ca938c06773eda

FTR: Now that you identified df67699c9cb0ce as the culprit, calling git bisect good on it was wrong, so something was fishy in your testing and it's no surprise the bisection found a wrong result.

Copy that. But it is my first attempt on a bisect of a merge commit, so I will simply ask to be forgiven. Maybe I forgot "git reset --hard" in some step.

I have to do a thorough homework on the merge commit magic.

Best regards,
Mirsad

Best regards
Uwe
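The per-step procedure Uwe describes can be wrapped in a small shell function. This is only a sketch: merge_bisect_step and the idea of passing the test as a command are assumptions made for illustration and do not come from the thread; the git subcommands are the ones quoted above. It assumes the bisection was already started with the "git bisect start ... $(git merge-base ...)" command shown above.

```shell
#!/bin/sh
# Sketch only: one step of bisecting one side of a merge while testing each
# candidate together with the other side. merge_bisect_step is a hypothetical
# helper; the git subcommands are the ones quoted in the thread.

# usage: merge_bisect_step <other-parent> <test-command> [args...]
# <test-command> must exit 0 when the problem is absent (good) and
# non-zero when it is present (bad).
merge_bisect_step() {
    other=$1
    shift

    # Combine the current candidate with the other side of the merge.
    # Conflicts are tolerated only on the assumption (stated in the thread)
    # that the conflicting drivers are not enabled in the .config.
    git merge --no-commit "$other" ||
        echo "merge reported conflicts, continuing" >&2

    if "$@"; then
        verdict=good
    else
        verdict=bad
    fi

    # Drop the temporary merge before recording the verdict; an unresolved
    # index would otherwise stop the next bisect step.
    git reset --hard
    git bisect "$verdict"
}
```

With a hypothetical ./boot-test.sh that exits 0 when the kernel boots with a working console, one step would be "merge_bisect_step 61da593f4458 ./boot-test.sh". For a manual boot test, the command can simply prompt for the verdict and exit accordingly.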
BUG [RESEND]: kernel NULL pointer dereference, address: 0000000000000008
%rax
   6:  73 01                   jae    0x9
   8:  c3                      ret
   9:  48 8b 0d 73 b5 0f 00    mov    0xfb573(%rip),%rcx    # 0xfb583
  10:  f7 d8                   neg    %eax
  12:  64 89 01                mov    %eax,%fs:(%rcx)
  15:  48                      rex.W
kernel: [5.577729] RSP: 002b:7ffeb4f87d28 EFLAGS: 0246 ORIG_RAX: 0139
kernel: [5.577733] RAX: ffda RBX: 55aedf3eeeb0 RCX: 7fdaa331e88d
kernel: [5.577736] RDX: RSI: 55aedf3efb80 RDI: 001a
kernel: [5.577738] RBP: 0002 R08: R09: 0002
kernel: [5.577741] R10: 001a R11: 0246 R12: 55aedf3efb80
kernel: [5.577744] R13: 55aedf3f2060 R14: R15: 55aedf2b1220
kernel: [5.577748]
kernel: [5.577750] Modules linked in: intel_rapl_msr intel_rapl_common amdgpu(+) edac_mce_amd kvm_amd kvm snd_hda_codec_realtek snd_hda_codec_generic irqbypass ledtrig_audio crct10dif_pclmul polyval_clmulni polyval_generic snd_hda_codec_hdmi ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 amdxcp snd_hda_intel aesni_intel drm_exec snd_intel_dspcfg crypto_simd gpu_sched snd_intel_sdw_acpi cryptd nls_iso8859_1 drm_buddy snd_hda_codec snd_seq_midi drm_suballoc_helper snd_seq_midi_event drm_ttm_helper joydev snd_hda_core input_leds ttm rapl snd_rawmidi snd_hwdep drm_display_helper snd_seq snd_pcm wmi_bmof cec k10temp snd_seq_device ccp rc_core snd_timer snd drm_kms_helper i2c_algo_bit soundcore mac_hid tcp_bbr sch_fq msr parport_pc ppdev lp drm parport efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbhid hid crc32_pclmul nvme r8169 ahci nvme_core i2c_piix4 xhci_pci libahci xhci_pci_renesas realtek video wmi gpio_amdpt
kernel: [5.577817] CR2: 0008
kernel: [5.577820] ---[ end trace ]---
kernel: [5.914230] RIP: 0010:gfx_v10_0_early_init (drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
kernel: [ 5.914388] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 ed 0f 84 b3 fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87 01 00 4c 89 ff <48> 8b 40 08 0f b7 50 0a 0f b7 70 08 e8 e4 42 fb ff 41 89 c2 85 c0
All code
   0:  8d 55 a8                lea    -0x58(%rbp),%edx
   3:  4c 89 ff                mov    %r15,%rdi
   6:  e8 e4 83 ec ff          call   0xffec83ef
   b:  41 89 c2                mov    %eax,%r10d
   e:  83 f8 ed                cmp    $0xffed,%eax
  11:  0f 84 b3 fd ff ff       je     0xfdca
  17:  85 c0                   test   %eax,%eax
  19:  74 05                   je     0x20
  1b:  0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  20:  49 8b 87 08 87 01 00    mov    0x18708(%r15),%rax
  27:  4c 89 ff                mov    %r15,%rdi
  2a:* 48 8b 40 08             mov    0x8(%rax),%rax    <-- trapping instruction
  2e:  0f b7 50 0a             movzwl 0xa(%rax),%edx
  32:  0f b7 70 08             movzwl 0x8(%rax),%esi
  36:  e8 e4 42 fb ff          call   0xfffb431f
  3b:  41 89 c2                mov    %eax,%r10d
  3e:  85 c0                   test   %eax,%eax

Code starting with the faulting instruction
===
   0:  48 8b 40 08             mov    0x8(%rax),%rax
   4:  0f b7 50 0a             movzwl 0xa(%rax),%edx
   8:  0f b7 70 08             movzwl 0x8(%rax),%esi
   c:  e8 e4 42 fb ff          call   0xfffb42f5
  11:  41 89 c2                mov    %eax,%r10d
  14:  85 c0                   test   %eax,%eax
rsyslogd: rsyslogd's groupid changed to 111
kernel: [5.914394] RSP: 0018:a5b3c103f720 EFLAGS: 00010282
kernel: [5.914397] RAX: RBX: c1d73489 RCX:
kernel: [5.914399] RDX: RSI: RDI: 91ae4fa8
kernel: [5.914402] RBP: a5b3c103f7b0 R08: R09:
kernel: [5.914405] R10: ffea R11: R12: 91ae4fa986e8
kernel: [5.914408] R13: 91ae4fa986d8 R14: 91ae4fa986f8 R15: 91ae4fa8
kernel: [5.914410] FS: 7fdaa343c8c0() GS:91bd5844() knlGS:
kernel: [5.914414] CS: 0010 DS: ES: CR0: 80050033
kernel: [5.914416] CR2: 0008 CR3: 0001222d CR4: 00750ef0
kernel: [5.914419] PKRU: 5554

Best regards,
Mirsad

On 1/18/24 18:23, Mirsad Todorovac wrote:

Hi,

Unfortunately, I was not able to reboot in this kernel again to do the stack decode, but I thought that any information about the NULL pointer dereference is better than no info.

The system is Ubuntu 23.10 Mantic with AMD product: Navi 23 [Radeon RX 6600/6600 XT/6600M] graphic card.

Please find the config and the hw listing attached.

Best regards,
Mirsad
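For the stack decode mentioned above, the kernel tree ships scripts/decode_stacktrace.sh, which reads the log on standard input and resolves addresses against a vmlinux built with debug info. A minimal sketch, assuming it is run from the top of the kernel source tree that produced the crashing kernel; decode_oops and the DECODER override are hypothetical names used only in this note, and symbols inside modules such as amdgpu may need the script's additional path arguments, which are left out here.

```shell
#!/bin/sh
# Sketch only: feed a saved Oops to the in-tree stack trace decoder.
# decode_oops and DECODER are hypothetical names for this example.

# usage: decode_oops <vmlinux> <log-file>
decode_oops() {
    vmlinux=$1
    logfile=$2
    # Default to the in-tree script; overridable for testing.
    decoder=${DECODER:-./scripts/decode_stacktrace.sh}

    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi

    # The decoder takes the vmlinux path as argument and the log on stdin.
    "$decoder" "$vmlinux" < "$logfile"
}
```

A typical call would be "decode_oops vmlinux oops.txt", where oops.txt is a hypothetical file holding the kernel lines from syslog or dmesg.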
Re: [BUG][BISECTED] Freeze at loading init ramdisk
On 1/20/24 12:25, Bagas Sanjaya wrote:
On Wed, Jan 17, 2024 at 07:47:49PM +0100, Mirsad Todorovac wrote:
On 1/16/24 01:32, Mirsad Todorovac wrote:

Hi,

On the Ubuntu 22.04 LTS Jammy platform, on a mainline vanilla torvalds tree kernel, the boot freezes upon first two lines and before any systemd messages.

(Please find the config attached.)

Bisecting the bug led to this result:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
d97a78423c33f68ca6543de510a409167baed6f5 is the first bad commit
commit d97a78423c33f68ca6543de510a409167baed6f5
Merge: 61da593f4458 689237ab37c5
Author: Linus Torvalds
Date: Fri Jan 12 14:38:08 2024 -0800

    Merge tag 'fbdev-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev

    Pull fbdev updates from Helge Deller:
     "Three fbdev drivers (~8500 lines of code) removed. The Carillo Ranch
      fbdev driver is for an Intel product which was never shipped, and for
      the intelfb and the amba-clcd drivers the drm drivers can be used
      instead.

      The other code changes are minor: some fb_deferred_io flushing fixes,
      imxfb margin fixes and stifb cleanups.

      Summary:
       - Remove intelfb fbdev driver (Thomas Zimmermann)
       - Remove amba-clcd fbdev driver (Linus Walleij)
       - Remove vmlfb Carillo Ranch fbdev driver (Matthew Wilcox)
       - fb_deferred_io flushing fixes (Nam Cao)
       - imxfb code fixes and cleanups (Dario Binacchi)
       - stifb primary screen detection cleanups (Thomas Zimmermann)"

    * tag 'fbdev-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: (28 commits)
      fbdev/intelfb: Remove driver
      fbdev/hyperv_fb: Do not clear global screen_info
      firmware/sysfb: Clear screen_info state after consuming it
      fbdev/hyperv_fb: Remove firmware framebuffers with aperture helpers
      drm/hyperv: Remove firmware framebuffers with aperture helper
      fbdev/sis: Remove dependency on screen_info
      video/logo: use %u format specifier for unsigned int values
      video/sticore: Remove info field from STI struct
      arch/parisc: Detect primary video device from device instance
      fbdev/stifb: Allocate fb_info instance with framebuffer_alloc()
      video/sticore: Store ROM device in STI struct
      fbdev: flush deferred IO before closing
      fbdev: flush deferred work in fb_deferred_io_fsync()
      fbdev: amba-clcd: Delete the old CLCD driver
      fbdev: Remove support for Carillo Ranch driver
      fbdev: hgafb: fix kernel-doc comments
      fbdev: mmp: Fix typo and wording in code comment
      fbdev: fsl-diu-fb: Fix sparse warning due to virt_to_phys() prototype change
      fbdev: imxfb: add '*/' on a separate line in block comment
      fbdev: imxfb: use __func__ for function name
      ...

 Documentation/fb/index.rst | 1 -
 Documentation/fb/intelfb.rst | 155 --
 Documentation/userspace-api/ioctl/ioctl-number.rst | 1 -
 MAINTAINERS | 12 -
 arch/parisc/video/fbdev.c | 2 +-
 drivers/Makefile | 3 +-
 drivers/firmware/sysfb.c | 14 +-
 drivers/gpu/drm/hyperv/hyperv_drm_drv.c | 8 +-
 drivers/video/backlight/Kconfig | 7 -
 drivers/video/backlight/Makefile | 1 -
 drivers/video/backlight/cr_bllcd.c | 264 ---
 drivers/video/fbdev/Kconfig | 72 -
 drivers/video/fbdev/Makefile | 2 -
 drivers/video/fbdev/amba-clcd.c | 986 -
 drivers/video/fbdev/core/fb_defio.c | 8 +-
 drivers/video/fbdev/fsl-diu-fb.c | 2 +-
 drivers/video/fbdev/hgafb.c | 13 +-
 drivers/video/fbdev/hyperv_fb.c | 20 +-
 drivers/video/fbdev/imxfb.c | 179 +-
 drivers/video/fbdev/intelfb/Makefile | 8 -
 drivers/video/fbdev/intelfb/intelfb.h | 382
 drivers/video/fbdev/intelfb/intelfb_i2c.c | 209 --
 drivers/video/fbdev/intelfb/intelfbdrv.c | 1680
 drivers/video/fbdev/intelfb/intelfbhw.c | 2115
 drivers/video/fbdev/intelfb/intelfbhw.h | 609 --
 drivers/video/fbdev/mmp/hw/mmp_spi.c | 2 +-
 drivers/video/fbdev/sis/sis_main.c | 37 -
 drivers/video/fbdev/stifb.c | 109 +-
 drivers/video/fbdev/vermilion/Makefile | 6 -
 drivers/video/fbdev/vermilion/cr_pll.c | 195 --
 drivers/video/fbdev/vermilion/vermilion.c | 1175 ---
 drivers/video/fbdev/vermilion/vermilion.h
Re: REGRESSION: no console on current -git
On 1/20/24 01:32, Jens Axboe wrote:
On 1/19/24 5:27 PM, Helge Deller wrote:
On 1/19/24 22:22, Jens Axboe wrote:
On 1/19/24 2:14 PM, Helge Deller wrote:
On 1/19/24 22:01, Jens Axboe wrote:
On 1/19/24 1:55 PM, Helge Deller wrote:

Adding Mirsad Todorovac (who reported a similar issue).

On 1/19/24 19:39, Jens Axboe wrote:

My trusty R7525 test box is failing to show a console, or in fact anything, on current -git. There's no output after:

Loading Linux 6.7.0+ ...
Loading initial ramdisk ...

and I don't get a console up. I went through the bisection pain and found this was the culprit:

commit df67699c9cb0ceb70f6cc60630ca938c06773eda
Author: Thomas Zimmermann
Date: Wed Jan 3 11:15:11 2024 +0100

    firmware/sysfb: Clear screen_info state after consuming it

Reverting this commit, and everything is fine. Looking at dmesg with a buggy kernel, I get no frame or fb messages. On a good kernel, it looks like this:

[1.416486] efifb: probing for efifb
[1.416602] efifb: framebuffer at 0xde00, using 3072k, total 3072k
[1.416605] efifb: mode is 1024x768x32, linelength=4096, pages=1
[1.416607] efifb: scrolling: redraw
[1.416608] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[1.449746] fb0: EFI VGA frame buffer device

Happy to test a fix, or barring that, can someone just revert this commit please?

I've temporarily added a revert patch into the fbdev for-next tree for now, so people should not face the issue in the for-next series:
https://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev.git/commit/?h=for-next
I'd like to wait for Thomas to return on monday to check the issue as there are some other upcoming patches in this area from him.

Given the issue (and that I'm not the only one reporting it), can we please just get that pushed so it'll make -rc1? It can always get re-introduced in a fixed fashion. I don't run -next here, I rely on mainline working for my testing.

I agree, it would be good to get it fixed for -rc1. So, it's ok for me, but I won't be able to test the revert short time right now. If you can assure that the revert fixes it, and builds in git-head, I can now prepare the pull request for Linus now (or he just reverts commit df67699c9cb0 manually).

I already tested a revert on top of the current tree, and it builds just fine and boots with a working console. So reverting it does work and solves the issue.

I sent a pull request with the revert.

Thanks! You forgot the Reported-by, but no big deal.

Hi,

I confirm that this revert df67699c9cb0ce also solved the original initrd boot problem here:

git checkout d97a78423c33
git revert df67699c9cb0ce
make clean; make olddefconfig
time nice make -j 36 bindeb-pkg |& tee ../err-6.8-mrg-1.log; date
sudo apt-get -s install ../linux-image-6.7.0-bagas-vanilla-rvt-09751-g6b082430adc8_6.7.0-09751-g6b082430adc8-26_amd64.deb
sudo apt-get -y install ../linux-image-6.7.0-bagas-vanilla-rvt-09751-g6b082430adc8_6.7.0-09751-g6b082430adc8-26_amd64.deb

You might add:

Tested-by: Mirsad Goran Todorovac

at your convenience.

Best regards,
Mirsad
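The verification steps above can be chained so that a failure at any stage stops the run instead of building a half-prepared tree. A sketch only: build_with_revert is a hypothetical helper, and --no-edit is added to git revert here so that no editor opens; the remaining commands are the ones from the list above.

```shell
#!/bin/sh
# Sketch only: check out a known-bad commit, revert the suspected culprit and
# build Debian packages. build_with_revert is a hypothetical helper name.

# usage: build_with_revert <bad-commit> <culprit-commit> [jobs]
build_with_revert() {
    base=$1
    culprit=$2
    jobs=${3:-4}    # parallel make jobs; 4 is an arbitrary default

    # Each step runs only if the previous one succeeded.
    git checkout "$base" &&
    git revert --no-edit "$culprit" &&
    make clean &&
    make olddefconfig &&
    make -j "$jobs" bindeb-pkg
}
```

With the commits from the thread this would be called as "build_with_revert d97a78423c33 df67699c9cb0ce 36"; installing the resulting package is left as the separate manual step shown above.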
Re: [BUG][BISECTED] Freeze at loading init ramdisk
On 1/18/24 22:14, Uwe Kleine-König wrote: On Thu, Jan 18, 2024 at 09:04:05PM +0100, Mirsad Todorovac wrote: On 1/18/24 08:45, Uwe Kleine-König wrote: Hello Mirsad, On Wed, Jan 17, 2024 at 07:47:49PM +0100, Mirsad Todorovac wrote: On 1/16/24 01:32, Mirsad Todorovac wrote: On the Ubuntu 22.04 LTS Jammy platform, on a mainline vanilla torvalds tree kernel, the boot freezes upon first two lines and before any systemd messages. (Please find the config attached.) Bisecting the bug led to this result: marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good d97a78423c33f68ca6543de510a409167baed6f5 is the first bad commit commit d97a78423c33f68ca6543de510a409167baed6f5 Merge: 61da593f4458 689237ab37c5 Author: Linus Torvalds Date: Fri Jan 12 14:38:08 2024 -0800 [...] Hope this helps. P.S. As I see that this is a larger merge commit, with 5K+ lines changed, I don't think I can bisect further to determine the culprit. Actually it's not that hard. If a merge commit is the first bad commit for a bisection, either the merge wasn't done correctly (less likely, looking at d97a78423c33f68ca6543de510a409167baed6f5 I'd bet this isn't the problem); or changes on different sides conflict or you did something wrong during bisection. To rule out the third option, you can just retest d97a78423c33, 61da593f4458 and 689237ab37c5. If d97a78423c33 is the only bad one, you did it right. This was confirmed. Then to further debug the second option you can find out the offending commit on each side with a bisection as follows, here for the RHS (i.e. 689237ab37c5): git bisect start 689237ab37c5 $(git merge-base 61da593f4458 689237ab37c5) and then in each bisection step do: git merge --no-commit 61da593f4458 test if the problem is present git reset --hard git bisect good/bad In this case you get merge conflicts in drivers/video/fbdev/amba-clcd.c and drivers/video/fbdev/vermilion/vermilion.c. In the assumption that you don't have these enabled in your .config, you can just ignore these. 
Side note: A problem during bisection can be that the .config changes along the process. You should put your config into (say) arch/x86/configs/lala_defconfig and do make lala_defconfig before building each step to prevent this.

I must have done something wrong:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect log
# bad: [689237ab37c59b9909bc9371d7fece3081683fba] fbdev/intelfb: Remove driver
# good: [de927f6c0b07d9e698416c5b287c521b07694cac] Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect start '689237ab37c5' 'de927f6c0b07d9e698416c5b287c521b07694cac'
# good: [d9f25b59ed85ae45801cf45fe17eb269b0ef3038] fbdev: Remove support for Carillo Ranch driver
git bisect good d9f25b59ed85ae45801cf45fe17eb269b0ef3038
# good: [e2e0b838a1849f92612a8305c09aaf31bf824350] video/sticore: Remove info field from STI struct
git bisect good e2e0b838a1849f92612a8305c09aaf31bf824350
# good: [778e73d2411abc8f3a2d60dbf038acaec218792e] drm/hyperv: Remove firmware framebuffers with aperture helper
git bisect good 778e73d2411abc8f3a2d60dbf038acaec218792e
# good: [df67699c9cb0ceb70f6cc60630ca938c06773eda] firmware/sysfb: Clear screen_info state after consuming it
git bisect good df67699c9cb0ceb70f6cc60630ca938c06773eda
marvin@defiant:~/linux/kernel/linux_torvalds$

with the error:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 0 steps)
drivers/video/fbdev/amba-clcd.c: needs merge
drivers/video/fbdev/vermilion/vermilion.c: needs merge
error: you need to resolve your current index first

It seems you forgot the "git reset --hard" step. Doing it in this state should still be possible.
Well, it was possible, but I obviously got the wrong result:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect log
# bad: [689237ab37c59b9909bc9371d7fece3081683fba] fbdev/intelfb: Remove driver
# good: [de927f6c0b07d9e698416c5b287c521b07694cac] Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect start '689237ab37c5' 'de927f6c0b07d9e698416c5b287c521b07694cac'
# good: [d9f25b59ed85ae45801cf45fe17eb269b0ef3038] fbdev: Remove support for Carillo Ranch driver
git bisect good d9f25b59ed85ae45801cf45fe17eb269b0ef3038
# good: [e2e0b838a1849f92612a8305c09aaf31bf824350] video/sticore: Remove info field from STI struct
git bisect good e2e0b838a1849f92612a8305c09aaf31bf824350
# good: [778e73d2411abc8f3a2d60dbf038acaec218792e] drm/hyperv: Remove firmware framebuffers with aperture helper
git bisect good 778e73d2411abc8f3a2d60dbf038acaec218792e
# good: [df67699c9cb0ceb70f6cc60630ca938c06773eda] firmware/sysfb: Clear screen_info state after consuming it
git bisect good df67699c9cb0ceb70f6cc60630ca938c06773eda
# good: [df67699c9cb0ceb70f6cc60630ca938c06773eda] firmware/sysfb: Clear screen_info state after consuming it
git b
Re: [BUG][BISECTED] Freeze at loading init ramdisk
On 1/18/24 08:45, Uwe Kleine-König wrote:

Hello Mirsad,

On Wed, Jan 17, 2024 at 07:47:49PM +0100, Mirsad Todorovac wrote:
On 1/16/24 01:32, Mirsad Todorovac wrote:

On the Ubuntu 22.04 LTS Jammy platform, on a mainline vanilla torvalds tree kernel, the boot freezes after the first two lines and before any systemd messages. (Please find the config attached.)

Bisecting the bug led to this result:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
d97a78423c33f68ca6543de510a409167baed6f5 is the first bad commit
commit d97a78423c33f68ca6543de510a409167baed6f5
Merge: 61da593f4458 689237ab37c5
Author: Linus Torvalds
Date: Fri Jan 12 14:38:08 2024 -0800
[...]

Hope this helps.

P.S. As I see that this is a larger merge commit, with 5K+ lines changed, I don't think I can bisect further to determine the culprit.

Actually it's not that hard. If a merge commit is the first bad commit for a bisection, then either the merge wasn't done correctly (less likely; looking at d97a78423c33f68ca6543de510a409167baed6f5 I'd bet this isn't the problem), or changes on different sides conflict, or you did something wrong during bisection.

To rule out the third option, you can just retest d97a78423c33, 61da593f4458 and 689237ab37c5. If d97a78423c33 is the only bad one, you did it right.

This was confirmed.

Then to further debug the second option you can find the offending commit on each side with a bisection as follows, here for the RHS (i.e. 689237ab37c5):

    git bisect start 689237ab37c5 $(git merge-base 61da593f4458 689237ab37c5)

and then in each bisection step do:

    git merge --no-commit 61da593f4458
    test if the problem is present
    git reset --hard
    git bisect good/bad

In this case you get merge conflicts in drivers/video/fbdev/amba-clcd.c and drivers/video/fbdev/vermilion/vermilion.c. Assuming that you don't have these enabled in your .config, you can just ignore them.

Side note: A problem during bisection can be that the .config changes along the process.
You should put your config into (say) arch/x86/configs/lala_defconfig and do make lala_defconfig before building each step to prevent this.

I must have done something wrong:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect log
# bad: [689237ab37c59b9909bc9371d7fece3081683fba] fbdev/intelfb: Remove driver
# good: [de927f6c0b07d9e698416c5b287c521b07694cac] Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect start '689237ab37c5' 'de927f6c0b07d9e698416c5b287c521b07694cac'
# good: [d9f25b59ed85ae45801cf45fe17eb269b0ef3038] fbdev: Remove support for Carillo Ranch driver
git bisect good d9f25b59ed85ae45801cf45fe17eb269b0ef3038
# good: [e2e0b838a1849f92612a8305c09aaf31bf824350] video/sticore: Remove info field from STI struct
git bisect good e2e0b838a1849f92612a8305c09aaf31bf824350
# good: [778e73d2411abc8f3a2d60dbf038acaec218792e] drm/hyperv: Remove firmware framebuffers with aperture helper
git bisect good 778e73d2411abc8f3a2d60dbf038acaec218792e
# good: [df67699c9cb0ceb70f6cc60630ca938c06773eda] firmware/sysfb: Clear screen_info state after consuming it
git bisect good df67699c9cb0ceb70f6cc60630ca938c06773eda
marvin@defiant:~/linux/kernel/linux_torvalds$

with the error:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 0 steps)
drivers/video/fbdev/amba-clcd.c: needs merge
drivers/video/fbdev/vermilion/vermilion.c: needs merge
error: you need to resolve your current index first
marvin@defiant:~/linux/kernel/linux_torvalds$

Best regards,
Mirsad

Best regards
Uwe
Re: [BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
Hi Christian,

Aha, thanks, that explains it. Then the KCSAN report must be a false positive.

Kind regards,
Mirsad

On 8/25/23 09:05, Christian König wrote:

Hi Mirsad,

the name SPSC stands for Single Producer Single Consumer, so yes, even by the name of the component we make it clear that this can only be used by one producer and one consumer thread at the same time.

Here disabling preemption is just done so that the consumer thread doesn't busy loop waiting for the producer thread to be scheduled in again.

Regards,
Christian.

On 24.08.23 at 19:46, Mirsad Goran Todorovac wrote:

Thank you, Christian. Glad to hear about that.

However, I guess this assumes that this piece of code between -<>-

    preempt_disable();
    tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
    WRITE_ONCE(*tail, node);
    atomic_inc(&queue->job_count);
    /*
     * In case of first element verify new node will be visible to the consumer
     * thread when we ping the kernel thread that there is new work to do.
     */
    smp_wmb();
    preempt_enable();

-<>- ... executes only on one CPU/core/thread?

I understood that preempt_disable() disables preemption only on one core/CPU:

https://kernelnewbies.kernelnewbies.narkive.com/6LTlgsAe/preempt-disable-disables-preemption-on-all-processors

So, we might have a race in theory between WRITE_ONCE() and atomic_inc().

Kind regards,
Mirsad

On 8/21/2023 8:22 PM, Christian König wrote:

I'm not sure about that.

On the one hand it might generate some noise. I know tons of cases where the logic is: OK, if we see the updated value immediately it will optimize things, but if not it's unproblematic because there is another check after the next memory barrier.

On the other hand we probably have cases where this is not correctly implemented. So double checking those would most likely be a good idea.

Regards,
Christian.

On 21.08.23 at 16:28, Mirsad Todorovac wrote:

Hi Christian,

Thank you for the update. Should I continue reporting what KCSAN gives? I will try to filter these to save your time for evaluation ...
Kind regards, Mirsad On 8/21/23 15:20, Christian König wrote: Hi Mirsad, well this is a false positive. That drm_sched_entity_is_ready() doesn't see the data written by drm_sched_entity_push_job() is part of the logic here. Regards, Christian. Am 18.08.23 um 15:44 schrieb Mirsad Todorovac: On 8/17/23 21:54, Mirsad Todorovac wrote: Hi, This is your friendly bug reporter. The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS and a Ryzen 7950X box. Please find attached the complete dmesg output from the ring buffer and lshw output. NOTE: The kernel reports tainted kernel, but to my knowledge there are no proprietary (G) modules, but this taint is turned on by the previous bugs. dmesg excerpt: [ 8791.864576] == [ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched] [ 8791.864776] write (marked) to 0x9b74491b7c40 of 8 bytes by task 3807 on cpu 18: [ 8791.864788] drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched] [ 8791.864852] amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu] [ 8791.868731] drm_ioctl_kernel+0x127/0x210 [drm] [ 8791.869222] drm_ioctl+0x38f/0x6f0 [drm] [ 8791.869711] amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu] [ 8791.873660] __x64_sys_ioctl+0xd2/0x120 [ 8791.873676] do_syscall_64+0x58/0x90 [ 8791.873688] entry_SYSCALL_64_after_hwframe+0x73/0xdd [ 8791.873710] read to 0x9b74491b7c40 of 8 bytes by task 1119 on cpu 27: [ 8791.873722] drm_sched_entity_is_ready+0x16/0x50 [gpu_sched] [ 8791.873786] drm_sched_select_entity+0x1c7/0x220 [gpu_sched] [ 8791.873849] drm_sched_main+0xd2/0x500 [gpu_sched] [ 8791.873912] kthread+0x18b/0x1d0 [ 8791.873924] ret_from_fork+0x43/0x70 [ 8791.873939] ret_from_fork_asm+0x1b/0x30 [ 8791.873955] value changed: 0x -> 0x9b750ebcfc00 [ 8791.873971] Reported by Kernel Concurrency Sanitizer on: [ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: G L 6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35 [ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, 
BIOS 1.21 04/26/2023 [ 8791.874002] == P.S. According to Mr. Heo's instructions, I am adding the unwound trace here: [ 1879.706518] == [ 1879.706616] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched] [ 1879.706737] write (marked) to 0x8f3672748c40 of 8 bytes by task 4087 on cpu 10: [ 1879.706748] drm_sched_entity_push_job (./include/drm/spsc_queue.h:74 drivers/gpu/drm/scheduler/sched_entity.c:574) gpu_sched [ 1879.706808] amdgpu_cs_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1375 drivers/gpu/drm/amd/amdgpu/
Re: [BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
Hi Christian, Thank you for the update. Should I continue reporting what KCSAN gives? I will try to filter these to save your time for evaluation ... Kind regards, Mirsad On 8/21/23 15:20, Christian König wrote: Hi Mirsad, well this is a false positive. That drm_sched_entity_is_ready() doesn't see the data written by drm_sched_entity_push_job() is part of the logic here. Regards, Christian. Am 18.08.23 um 15:44 schrieb Mirsad Todorovac: On 8/17/23 21:54, Mirsad Todorovac wrote: Hi, This is your friendly bug reporter. The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS and a Ryzen 7950X box. Please find attached the complete dmesg output from the ring buffer and lshw output. NOTE: The kernel reports tainted kernel, but to my knowledge there are no proprietary (G) modules, but this taint is turned on by the previous bugs. dmesg excerpt: [ 8791.864576] == [ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched] [ 8791.864776] write (marked) to 0x9b74491b7c40 of 8 bytes by task 3807 on cpu 18: [ 8791.864788] drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched] [ 8791.864852] amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu] [ 8791.868731] drm_ioctl_kernel+0x127/0x210 [drm] [ 8791.869222] drm_ioctl+0x38f/0x6f0 [drm] [ 8791.869711] amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu] [ 8791.873660] __x64_sys_ioctl+0xd2/0x120 [ 8791.873676] do_syscall_64+0x58/0x90 [ 8791.873688] entry_SYSCALL_64_after_hwframe+0x73/0xdd [ 8791.873710] read to 0x9b74491b7c40 of 8 bytes by task 1119 on cpu 27: [ 8791.873722] drm_sched_entity_is_ready+0x16/0x50 [gpu_sched] [ 8791.873786] drm_sched_select_entity+0x1c7/0x220 [gpu_sched] [ 8791.873849] drm_sched_main+0xd2/0x500 [gpu_sched] [ 8791.873912] kthread+0x18b/0x1d0 [ 8791.873924] ret_from_fork+0x43/0x70 [ 8791.873939] ret_from_fork_asm+0x1b/0x30 [ 8791.873955] value changed: 0x -> 0x9b750ebcfc00 [ 8791.873971] Reported by Kernel Concurrency Sanitizer on: [ 8791.873980] CPU: 27 PID: 1119 Comm: 
gfx_0.0.0 Tainted: G L 6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35 [ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023 [ 8791.874002] == P.S. According to Mr. Heo's instructions, I am adding the unwound trace here: [ 1879.706518] == [ 1879.706616] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched] [ 1879.706737] write (marked) to 0x8f3672748c40 of 8 bytes by task 4087 on cpu 10: [ 1879.706748] drm_sched_entity_push_job (./include/drm/spsc_queue.h:74 drivers/gpu/drm/scheduler/sched_entity.c:574) gpu_sched [ 1879.706808] amdgpu_cs_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1375 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1469) amdgpu [ 1879.710589] drm_ioctl_kernel (drivers/gpu/drm/drm_ioctl.c:788) drm [ 1879.711068] drm_ioctl (drivers/gpu/drm/drm_ioctl.c:892) drm [ 1879.711551] amdgpu_drm_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:2748) amdgpu [ 1879.715319] __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:870 fs/ioctl.c:856 fs/ioctl.c:856) [ 1879.715334] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 1879.715345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) [ 1879.715365] read to 0x8f3672748c40 of 8 bytes by task 1098 on cpu 11: [ 1879.715376] drm_sched_entity_is_ready (drivers/gpu/drm/scheduler/sched_entity.c:134) gpu_sched [ 1879.715435] drm_sched_select_entity (drivers/gpu/drm/scheduler/sched_main.c:248 drivers/gpu/drm/scheduler/sched_main.c:893) gpu_sched [ 1879.715495] drm_sched_main (drivers/gpu/drm/scheduler/sched_main.c:1019) gpu_sched [ 1879.715554] kthread (kernel/kthread.c:389) [ 1879.715563] ret_from_fork (arch/x86/kernel/process.c:145) [ 1879.715575] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) [ 1879.715590] value changed: 0x -> 0x8f360663dc00 [ 1879.715604] Reported by Kernel Concurrency Sanitizer on: [ 1879.715612] CPU: 11 PID: 1098 Comm: gfx_0.0.0 Tainted: G L 6.5.0-rc6+ #47 [ 1879.715624] Hardware name: 
ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[ 1879.715631] ==

It seems that the line in question might be:

    first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);

which expands to:

    static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
    {
        struct spsc_node **tail;

        node->next = NULL;

        preempt_disable();

        tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
        WRITE_ONCE(*tail, node);
        atomic_inc(&queue->job_count);

        /*
         * In case of first element verify
[BUG]: amdgpu: soft lockup - CPU#1 stuck for 26s! [systemd-udevd:635]
69.199110] ? vcn_v1_0_enc_ring_emit_fence (drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1647 (discriminator 1)) amdgpu [ 69.204958] ? __pfx_vcn_v1_0_enc_ring_emit_fence (drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu [ 69.210899] ? __pfx_vcn_v1_0_enc_ring_emit_fence (drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu [ 69.216910] ? __pfx_vcn_v1_0_enc_ring_emit_fence (drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu [ 69.222561] module_address_lookup+0x8c/0xe0 [ 69.222573] ? __pfx_vcn_v1_0_enc_ring_emit_fence (drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu [ 69.228237] kallsyms_lookup_buildid+0x107/0x1b0 [ 69.228251] ? __pfx_vcn_v1_0_enc_ring_emit_fence (drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu [ 69.234368] kallsyms_lookup+0x14/0x30 [ 69.234381] test_for_valid_rec+0x38/0x90 [ 69.234411] ? sched_clock_noinstr+0x9/0x10 [ 69.234448] ? srso_alias_return_thunk+0x5/0x7f [ 69.234459] ? __mutex_lock_slowpath+0x13/0x20 [ 69.234470] ? srso_alias_return_thunk+0x5/0x7f [ 69.234481] ? mutex_lock+0xa7/0xb0 [ 69.234492] ftrace_module_enable+0x22e/0x3b0 [ 69.234525] load_module+0x3357/0x3980 [ 69.234533] ? aa_file_perm+0x1fc/0x800 [ 69.234562] ? srso_alias_return_thunk+0x5/0x7f [ 69.234593] ? security_kernel_post_read_file+0x79/0x90 [ 69.234618] init_module_from_file+0xdf/0x130 [ 69.234642] ? srso_alias_return_thunk+0x5/0x7f [ 69.234653] ? init_module_from_file+0xdf/0x130 [ 69.234668] idempotent_init_module+0x241/0x360 [ 69.234683] __x64_sys_finit_module+0x8e/0xf0 [ 69.234693] do_syscall_64+0x58/0x90 [ 69.234705] ? srso_alias_return_thunk+0x5/0x7f [ 69.234716] ? exit_to_user_mode_prepare+0x76/0x230 [ 69.234748] ? srso_alias_return_thunk+0x5/0x7f [ 69.234758] ? syscall_exit_to_user_mode+0x29/0x40 [ 69.234769] ? srso_alias_return_thunk+0x5/0x7f [ 69.234780] ? do_syscall_64+0x68/0x90 [ 69.234803] ? srso_alias_return_thunk+0x5/0x7f [ 69.234830] ? exit_to_user_mode_prepare+0x76/0x230 [ 69.234841] ? srso_alias_return_thunk+0x5/0x7f [ 69.234852] ? 
syscall_exit_to_user_mode+0x29/0x40 [ 69.234869] ? srso_alias_return_thunk+0x5/0x7f [ 69.234888] ? do_syscall_64+0x68/0x90 [ 69.234897] ? srso_alias_return_thunk+0x5/0x7f [ 69.234922] ? do_syscall_64+0x68/0x90 [ 69.234952] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 69.234978] RIP: 0033:0x7f452d11ea3d [ 69.234996] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 a3 0f 00 f7 d8 64 89 01 48 All code 0:5b pop%rbx 1:41 5cpop%r12 3:c3 ret 4:66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) b:00 00 d:f3 0f 1e fa endbr64 11:48 89 f8 mov%rdi,%rax 14:48 89 f7 mov%rsi,%rdi 17:48 89 d6 mov%rdx,%rsi 1a:48 89 ca mov%rcx,%rdx 1d:4d 89 c2 mov%r8,%r10 20:4d 89 c8 mov%r9,%r8 23:4c 8b 4c 24 08 mov0x8(%rsp),%r9 28:0f 05syscall 2a:*48 3d 01 f0 ff ffcmp$0xf001,%rax<-- trapping instruction 30:73 01jae0x33 32:c3 ret 33:48 8b 0d c3 a3 0f 00 mov0xfa3c3(%rip),%rcx# 0xfa3fd 3a:f7 d8neg%eax 3c:64 89 01 mov%eax,%fs:(%rcx) 3f:48 rex.W Code starting with the faulting instruction === 0:48 3d 01 f0 ff ffcmp$0xf001,%rax 6:73 01jae0x9 8:c3 ret 9:48 8b 0d c3 a3 0f 00 mov0xfa3c3(%rip),%rcx# 0xfa3d3 10:f7 d8neg%eax 12:64 89 01 mov%eax,%fs:(%rcx) 15:48 rex.W [ 69.235005] RSP: 002b:7ffda20bffe8 EFLAGS: 0246 ORIG_RAX: 0139 [ 69.235020] RAX: ffda RBX: 5561184c0f30 RCX: 7f452d11ea3d [ 69.235028] RDX: RSI: 55611837ad80 RDI: 001a [ 69.235035] RBP: 0002 R08: R09: 0002 [ 69.235052] R10: 001a R11: 0246 R12: 55611837ad80 [ 69.235059] R13: 55611836bc10 R14: R15: 5561184ba330 [ 69.235072] [ 69.462372] == Best regards, Mirsad Todorovac
Attachments: config-6.5.0-rc7-kcsan-g706a74159504.xz (application/xz), lshw.txt.xz (application/xz)
Re: [BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
On 8/17/23 21:54, Mirsad Todorovac wrote: Hi, This is your friendly bug reporter. The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS and a Ryzen 7950X box. Please find attached the complete dmesg output from the ring buffer and lshw output. NOTE: The kernel reports tainted kernel, but to my knowledge there are no proprietary (G) modules, but this taint is turned on by the previous bugs. dmesg excerpt: [ 8791.864576] == [ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched] [ 8791.864776] write (marked) to 0x9b74491b7c40 of 8 bytes by task 3807 on cpu 18: [ 8791.864788] drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched] [ 8791.864852] amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu] [ 8791.868731] drm_ioctl_kernel+0x127/0x210 [drm] [ 8791.869222] drm_ioctl+0x38f/0x6f0 [drm] [ 8791.869711] amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu] [ 8791.873660] __x64_sys_ioctl+0xd2/0x120 [ 8791.873676] do_syscall_64+0x58/0x90 [ 8791.873688] entry_SYSCALL_64_after_hwframe+0x73/0xdd [ 8791.873710] read to 0x9b74491b7c40 of 8 bytes by task 1119 on cpu 27: [ 8791.873722] drm_sched_entity_is_ready+0x16/0x50 [gpu_sched] [ 8791.873786] drm_sched_select_entity+0x1c7/0x220 [gpu_sched] [ 8791.873849] drm_sched_main+0xd2/0x500 [gpu_sched] [ 8791.873912] kthread+0x18b/0x1d0 [ 8791.873924] ret_from_fork+0x43/0x70 [ 8791.873939] ret_from_fork_asm+0x1b/0x30 [ 8791.873955] value changed: 0x -> 0x9b750ebcfc00 [ 8791.873971] Reported by Kernel Concurrency Sanitizer on: [ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: G L 6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35 [ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023 [ 8791.874002] == P.S. According to Mr. 
Heo's instructions, I am adding the unwound trace here: [ 1879.706518] == [ 1879.706616] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched] [ 1879.706737] write (marked) to 0x8f3672748c40 of 8 bytes by task 4087 on cpu 10: [ 1879.706748] drm_sched_entity_push_job (./include/drm/spsc_queue.h:74 drivers/gpu/drm/scheduler/sched_entity.c:574) gpu_sched [ 1879.706808] amdgpu_cs_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1375 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1469) amdgpu [ 1879.710589] drm_ioctl_kernel (drivers/gpu/drm/drm_ioctl.c:788) drm [ 1879.711068] drm_ioctl (drivers/gpu/drm/drm_ioctl.c:892) drm [ 1879.711551] amdgpu_drm_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:2748) amdgpu [ 1879.715319] __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:870 fs/ioctl.c:856 fs/ioctl.c:856) [ 1879.715334] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 1879.715345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) [ 1879.715365] read to 0x8f3672748c40 of 8 bytes by task 1098 on cpu 11: [ 1879.715376] drm_sched_entity_is_ready (drivers/gpu/drm/scheduler/sched_entity.c:134) gpu_sched [ 1879.715435] drm_sched_select_entity (drivers/gpu/drm/scheduler/sched_main.c:248 drivers/gpu/drm/scheduler/sched_main.c:893) gpu_sched [ 1879.715495] drm_sched_main (drivers/gpu/drm/scheduler/sched_main.c:1019) gpu_sched [ 1879.715554] kthread (kernel/kthread.c:389) [ 1879.715563] ret_from_fork (arch/x86/kernel/process.c:145) [ 1879.715575] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) [ 1879.715590] value changed: 0x -> 0x8f360663dc00 [ 1879.715604] Reported by Kernel Concurrency Sanitizer on: [ 1879.715612] CPU: 11 PID: 1098 Comm: gfx_0.0.0 Tainted: G L 6.5.0-rc6+ #47 [ 1879.715624] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023 [ 1879.715631] == It seems that the line in question might be: first = spsc_queue_push(>job_queue, _job->queue_node); which expands to: static 
inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
{
    struct spsc_node **tail;

    node->next = NULL;

    preempt_disable();

    tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
    WRITE_ONCE(*tail, node);
    atomic_inc(&queue->job_count);

    /*
     * In case of first element verify new node will be visible to the consumer
     * thread when we ping the kernel thread that there is new work to do.
     */
    smp_wmb();

    preempt_enable();

    return tail == &queue->head;
}

According to the manual, preempt_disable() only guarantees exclusion on a single CPU/core/thread, so we might be stuck with slow, old-fashioned locking unless anyone has a better idea.

Best regards,
Mirsad Todorovac
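For readers who want to experiment with the push sequence quoted above outside the kernel, here is a minimal user-space sketch in C11. It is only an illustration under stated assumptions: <stdatomic.h> stands in for atomic_long_xchg(), WRITE_ONCE() and smp_wmb(), there is no preempt_disable() equivalent, and the names node_slot, queue_init, queue_push and queue_peek are hypothetical, not kernel API. It shows the two-step publish (swap the tail, then link the node), so a consumer reading the link between the two steps still sees NULL, which seems to be the kind of window the report is about.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node;
typedef _Atomic(struct node *) node_slot;   /* one atomically updated link */

struct node {
    node_slot next;
    int payload;
};

struct queue {
    node_slot head;               /* first node, NULL while empty */
    _Atomic(node_slot *) tail;    /* address of the last link in the chain */
    atomic_int count;
};

static void queue_init(struct queue *q)
{
    atomic_init(&q->head, NULL);
    atomic_init(&q->tail, &q->head);
    atomic_init(&q->count, 0);
}

/* Single producer only. Returns true if the queue was empty before the push. */
static bool queue_push(struct queue *q, struct node *n)
{
    node_slot *prev;

    atomic_store_explicit(&n->next, NULL, memory_order_relaxed);

    /* Step 1: claim the tail; later pushes will chain after n. */
    prev = atomic_exchange(&q->tail, &n->next);

    /* Step 2: publish n through the previous link. A consumer that reads
     * that link between step 1 and step 2 still sees NULL. */
    atomic_store_explicit(prev, n, memory_order_release);

    atomic_fetch_add(&q->count, 1);
    return prev == &q->head;
}

/* Single consumer only: the oldest node, or NULL if none is visible yet. */
static struct node *queue_peek(struct queue *q)
{
    return atomic_load_explicit(&q->head, memory_order_acquire);
}
```

A caller would run queue_init() once, call queue_push() from exactly one producer thread and queue_peek() from exactly one consumer thread; a NULL result from queue_peek() just means "nothing visible yet, check again later".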
[BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
Hi,

This is your friendly bug reporter.

The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS and a Ryzen 7950X box.

Please find attached the complete dmesg output from the ring buffer and lshw output.

NOTE: The kernel reports itself as tainted, but to my knowledge no proprietary modules are loaded; the taint was set by the previous bugs.

dmesg excerpt:

[ 8791.864576] ==
[ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
[ 8791.864776] write (marked) to 0x9b74491b7c40 of 8 bytes by task 3807 on cpu 18:
[ 8791.864788] drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched]
[ 8791.864852] amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu]
[ 8791.868731] drm_ioctl_kernel+0x127/0x210 [drm]
[ 8791.869222] drm_ioctl+0x38f/0x6f0 [drm]
[ 8791.869711] amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
[ 8791.873660] __x64_sys_ioctl+0xd2/0x120
[ 8791.873676] do_syscall_64+0x58/0x90
[ 8791.873688] entry_SYSCALL_64_after_hwframe+0x73/0xdd
[ 8791.873710] read to 0x9b74491b7c40 of 8 bytes by task 1119 on cpu 27:
[ 8791.873722] drm_sched_entity_is_ready+0x16/0x50 [gpu_sched]
[ 8791.873786] drm_sched_select_entity+0x1c7/0x220 [gpu_sched]
[ 8791.873849] drm_sched_main+0xd2/0x500 [gpu_sched]
[ 8791.873912] kthread+0x18b/0x1d0
[ 8791.873924] ret_from_fork+0x43/0x70
[ 8791.873939] ret_from_fork_asm+0x1b/0x30
[ 8791.873955] value changed: 0x -> 0x9b750ebcfc00
[ 8791.873971] Reported by Kernel Concurrency Sanitizer on:
[ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: G L 6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
[ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[ 8791.874002] ==

Best regards,
Mirsad Todorovac

dmesg-3.log.xz Description: application/xz
lshw.txt.xz Description: application/xz
Re: [Intel-gfx] [PATCH 2/2] drm/i915: Fix a memory leak with reused mmap_offset
Hi,

On 1/18/23 11:39, Das, Nirmoy wrote:
On 1/18/2023 11:26 AM, Mirsad Todorovac wrote:

Hi,

On 1/18/23 10:19, Tvrtko Ursulin wrote:

Thanks for working on this, it looks good to me and it aligns with how i915 uses the facility.

Copying Mirsad who reported the issue in case he is still happy to give it a quick test. Mirsad, I don't know if you are subscribed to one of the two mailing lists where the series was posted. In case not, you can grab both patches from https://patchwork.freedesktop.org/series/112952/.

Nirmoy - we also have an IGT written by Chuansheng - https://patchwork.freedesktop.org/patch/515720/?series=101035=4. A more generic one could be placed in the gem_mmap_offset test, but this one works too in my testing and is IMO better than nothing.

Finally, let me add some tags below:

On 17/01/2023 17:52, Nirmoy Das wrote:

drm_vma_node_allow() and drm_vma_node_revoke() should be called in balanced pairs. We call drm_vma_node_allow() once per-file every time a user calls mmap_offset, but only call drm_vma_node_revoke() once per-file on each mmap_offset. As the mmap_offset is reused by the client, the per-file vm_count may remain non-zero and the rbtree leaked.

Call drm_vma_node_allow_once() instead to prevent that memory leak.
Cc: Tvrtko Ursulin
Cc: Andi Shyti
Fixes: 786555987207 ("drm/i915/gem: Store mmap_offsets in an rbtree rather than a plain list")
Reported-by: Chuansheng Liu
Reported-by: Mirsad Todorovac
Cc: # v5.7+

Reviewed-by: Tvrtko Ursulin

Regards,
Tvrtko

Signed-off-by: Nirmoy Das
---
 drivers/gpu/drm/i915/gem/i915_gem_mman.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index 4f69bff63068..2aac6bf78740 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -697,7 +697,7 @@ mmap_offset_attach(struct drm_i915_gem_object *obj,
 	GEM_BUG_ON(lookup_mmo(obj, mmap_type) != mmo);
 out:
 	if (file)
-		drm_vma_node_allow(&mmo->vma_node, file);
+		drm_vma_node_allow_once(&mmo->vma_node, file);
 	return mmo;
 err:

The drm/i915 patch seems OK, and /sys/kernel/debug/kmemleak currently reports no memory leaks under the same Chrome load that triggered the initial bug ...

Thanks, Mirsad, for quickly checking this!

There was no problem, Nirmoy, everything applied neatly :)

Regards,
Mirsad

--
Mirsad Goran Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu

System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia
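The imbalance described in the commit message quoted above can be reproduced with a toy counter model in plain C. This is only a sketch under stated assumptions: struct toy_entry, struct toy_node, toy_allow() and toy_revoke() are hypothetical stand-ins invented for illustration, not the DRM VMA manager, and the ref_counted flag merely mimics the difference between a counting "allow" and an "allow once".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for one per-file access entry; NOT the DRM data structure. */
struct toy_entry {
    unsigned long vm_count;
};

struct toy_node {
    struct toy_entry *entry;   /* NULL while the file has no access entry */
};

/* ref_counted == true:  every call bumps the count (counting "allow").
 * ref_counted == false: the count never exceeds one ("allow once"). */
static int toy_allow(struct toy_node *node, bool ref_counted)
{
    if (node->entry) {
        if (ref_counted)
            node->entry->vm_count++;
        return 0;
    }

    node->entry = malloc(sizeof(*node->entry));
    if (!node->entry)
        return -1;
    node->entry->vm_count = 1;
    return 0;
}

/* Drops one reference and frees the entry when the count reaches zero. */
static void toy_revoke(struct toy_node *node)
{
    if (node->entry && --node->entry->vm_count == 0) {
        free(node->entry);
        node->entry = NULL;
    }
}
```

With three counting allows followed by a single revoke, the entry survives with a count of two, which is the leak in this model; with the "once" variant the same single revoke frees it.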
Re: [PATCH 2/2] drm/i915: Fix a memory leak with reused mmap_offset
Hi,

On 1/18/23 10:19, Tvrtko Ursulin wrote:
Thanks for working on this, it looks good to me and it aligns with how i915 uses the facility.

Copying Mirsad who reported the issue in case he is still happy to give it a quick test. Mirsad, I don't know if you are subscribed to one of the two mailing lists where the series was posted. In case not, you can grab both patches from https://patchwork.freedesktop.org/series/112952/.

Nirmoy - we also have an IGT written by Chuansheng - https://patchwork.freedesktop.org/patch/515720/?series=101035=4. A more generic one could be placed in gem_mmap_offset test but this one works too in my testing and is IMO better than nothing.

Finally, let me add some tags below:

On 17/01/2023 17:52, Nirmoy Das wrote:
drm_vma_node_allow() and drm_vma_node_revoke() should be called in balanced pairs. We call drm_vma_node_allow() once per-file every time a user calls mmap_offset, but only call drm_vma_node_revoke() once per-file on each mmap_offset. As the mmap_offset is reused by the client, the per-file vm_count may remain non-zero and the rbtree leaked.

Call drm_vma_node_allow_once() instead to prevent that memory leak.
Cc: Tvrtko Ursulin
Cc: Andi Shyti
Fixes: 786555987207 ("drm/i915/gem: Store mmap_offsets in an rbtree rather than a plain list")
Reported-by: Chuansheng Liu
Reported-by: Mirsad Todorovac
Cc: # v5.7+

Reviewed-by: Tvrtko Ursulin

Regards,
Tvrtko

Signed-off-by: Nirmoy Das
---
 drivers/gpu/drm/i915/gem/i915_gem_mman.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index 4f69bff63068..2aac6bf78740 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -697,7 +697,7 @@ mmap_offset_attach(struct drm_i915_gem_object *obj,
 	GEM_BUG_ON(lookup_mmo(obj, mmap_type) != mmo);
 out:
 	if (file)
-		drm_vma_node_allow(&mmo->vma_node, file);
+		drm_vma_node_allow_once(&mmo->vma_node, file);
 	return mmo;

 err:

The drm/i915 patch seems OK, and there are currently no memory leaks reported by /sys/kernel/debug/kmemleak under the same Chrome load that triggered the initial bug ...

I will let you know if anything changes.

Regards,
Mirsad

--
Mirsad Goran Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu

System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia
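
For anyone following the reference-count argument in the commit message quoted above, here is a minimal userspace sketch in C of why unbalanced allow/revoke calls leak the per-file entry. It is a toy model and not the kernel implementation: the model_* names, the singly linked list (the kernel keeps these entries in an rbtree) and the live_entries counter are all assumptions made up for illustration. Only the counting behaviour described in the commit message is modelled.

```c
/* Toy userspace model (NOT kernel code): one entry per (node, file) pair. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct model_file_entry {
	const void *tag;                 /* stands in for the struct drm_file pointer */
	unsigned long vm_count;          /* references held by this file */
	struct model_file_entry *next;
};

struct model_node {
	struct model_file_entry *files;  /* hypothetical list; the kernel uses an rbtree */
	size_t live_entries;             /* entries allocated and not yet freed */
};

static struct model_file_entry *model_find(struct model_node *node, const void *tag)
{
	for (struct model_file_entry *e = node->files; e; e = e->next)
		if (e->tag == tag)
			return e;
	return NULL;
}

/* ref_counted selects the allow() behaviour (true) or allow_once() (false). */
static void model_allow_common(struct model_node *node, const void *tag,
			       bool ref_counted)
{
	struct model_file_entry *e = model_find(node, tag);

	if (e) {
		/* Entry already exists: only the ref-counted variant bumps it. */
		if (ref_counted)
			e->vm_count++;
		return;
	}

	e = malloc(sizeof(*e));
	assert(e != NULL);
	e->tag = tag;
	e->vm_count = 1;
	e->next = node->files;
	node->files = e;
	node->live_entries++;
}

void model_allow(struct model_node *node, const void *tag)
{
	model_allow_common(node, tag, true);
}

void model_allow_once(struct model_node *node, const void *tag)
{
	model_allow_common(node, tag, false);
}

/* Drop one reference; the entry is freed only when the count reaches zero. */
void model_revoke(struct model_node *node, const void *tag)
{
	for (struct model_file_entry **pp = &node->files; *pp; pp = &(*pp)->next) {
		struct model_file_entry *e = *pp;

		if (e->tag != tag)
			continue;
		if (--e->vm_count == 0) {
			*pp = e->next;
			free(e);
			node->live_entries--;
		}
		return;
	}
}
```

In this model, three model_allow() calls for the same file followed by a single model_revoke() leave live_entries at 1, which corresponds to the leaked entry. Three model_allow_once() calls followed by a single model_revoke() bring live_entries back to 0.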