Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008
Hi Ma Jun,

Greetings again.

I have just tested the recommended patch, and the issue with the graphical login screen is resolved.

Thank you very much for your prompt reviews and the recommended patches.

God bless.

Best regards,
Mirsad Todorovac

On 1/25/24 10:29, Mirsad Todorovac wrote:
> Hi Ma Jun,
>
> Copy that. This appears to be the exact problem, and thank you for reviewing the bug report at such short notice.
>
> I apologise for the wrong assertion.
>
> The patch you sent then just triggered another bug, one that does not manifest without the patch (where a NULL pointer dereference occurs instead).
>
> Of course, there is no point in removing your patch and getting the NULL pointer dereference back; a proper fix is required.
>
> Thanks again.
>
> Best regards,
> Mirsad Todorovac
>
> On 1/25/2024 8:38 AM, Ma, Jun wrote:
>> Hi Mirsad,
>>
>> On 1/25/2024 1:48 AM, Mirsad Todorovac wrote:
>>> Hi, Ma Jun,
>>>
>>> Normally, I would reply under the quoted text, but I will adjust to your convention.
>>>
>>> I have just discovered that your patch causes the Ubuntu 22.04 LTS GNOME XWayland session to block after typing the password and ENTER in the graphical logon screen (tested several times).
>>
>> This problem is not caused by my patch. Based on your syslog, it looks more like a schedule issue. I just saw a similar problem; please refer to the link below:
>>
>> https://gitlab.freedesktop.org/drm/amd/-/issues/3124
>>
>> Regards,
>> Ma Jun
>>
>>> After that, I was not even able to log in from another box with ssh, or the session would block (tested one time, second time too; the third time it passed after I connected before attempting to log in on the XWayland console).
Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008
Hi Ma Jun,

Copy that. This appears to be the exact problem, and thank you for reviewing the bug report at such short notice.

I apologise for the wrong assertion.

The patch you sent then just triggered another bug, one that does not manifest without the patch (where a NULL pointer dereference occurs instead).

Of course, there is no point in removing your patch and getting the NULL pointer dereference back; a proper fix is required.

Thanks again.

Best regards,
Mirsad Todorovac

On 1/25/2024 8:38 AM, Ma, Jun wrote:
> Hi Mirsad,
>
> On 1/25/2024 1:48 AM, Mirsad Todorovac wrote:
>> Hi, Ma Jun,
>>
>> Normally, I would reply under the quoted text, but I will adjust to your convention.
>>
>> I have just discovered that your patch causes the Ubuntu 22.04 LTS GNOME XWayland session to block after typing the password and ENTER in the graphical logon screen (tested several times).
>
> This problem is not caused by my patch. Based on your syslog, it looks more like a schedule issue. I just saw a similar problem; please refer to the link below:
>
> https://gitlab.freedesktop.org/drm/amd/-/issues/3124
>
> Regards,
> Ma Jun
>
>> After that, I was not even able to log in from another box with ssh, or the session would block (tested one time, second time too; the third time it passed after I connected before attempting to log in on the XWayland console).
Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008
Hi Mirsad,

On 1/25/2024 1:48 AM, Mirsad Todorovac wrote:
> Hi, Ma Jun,
>
> Normally, I would reply under the quoted text, but I will adjust to your convention.
>
> I have just discovered that your patch causes the Ubuntu 22.04 LTS GNOME XWayland session to block after typing the password and ENTER in the graphical logon screen (tested several times).
>

This problem is not caused by my patch. Based on your syslog, it looks more like a schedule issue. I just saw a similar problem; please refer to the link below:

https://gitlab.freedesktop.org/drm/amd/-/issues/3124

Regards,
Ma Jun

> After that, I was not even able to log in from another box with ssh, or the session would block (tested one time, second time too; the third time it passed after I connected before attempting to log in on the XWayland console).
>
> You might find the syslog and dmesg of the freeze useful; they are at this link (they were over 100K):
>
> https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/
>
> The exact applied patch was this:
>
> marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 73f6d7e72c73..6ef333df9adf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct amdgpu_device *adev)
>  
>  	if (!amdgpu_sriov_vf(adev)) {
>  		snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", ucode_prefix);
> -		err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
> -		/* don't check this. There are apparently firmwares in the wild with
> -		 * incorrect size in the header
> -		 */
> -		if (err == -ENODEV)
> -			goto out;
> +		err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
>  		if (err)
> -			dev_dbg(adev->dev,
> -				"gfx10: amdgpu_ucode_request() failed \"%s\"\n",
> -				fw_name);
> +			goto out;
> +
> +		/* don't validate this firmware. There are apparently firmwares
> +		 * in the wild with incorrect size in the header
> +		 */
>  		rlc_hdr = (const struct rlc_firmware_header_v2_0 *)adev->gfx.rlc_fw->data;
>  		version_major = le16_to_cpu(rlc_hdr->header.header_version_major);
>  		version_minor = le16_to_cpu(rlc_hdr->header.header_version_minor);
> marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
> Linux 6.7.0-xway-09721-g61da593f4458 x86_64
> marvin@defiant:~/linux/kernel/linux_torvalds$
>
> So, there seems to be a problem with the way the patch affects XWayland.
>
> I checked the exact commit multiple times, with and without the diff.
>
> I hope this helps, because I am not familiar with the amdgpu driver.
>
> Best regards,
> Mirsad Todorovac
>
> On 1/22/24 09:34, Ma, Jun wrote:
>> Perhaps this is similar to the problem I encountered earlier; you can try the following patch:
>>
>> https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html
>>
>> Regards,
>> Ma Jun
>>
>> On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:
>>> Hi,
>>>
>>> The last email did not reach most of the recipients because of the banned .xz attachment.
>>>
>>> As the .config is too big to send inline or uncompressed either, I will omit it in this attempt. In the meantime, I had some success in decoding the stack trace, but sadly it is not complete.
>>>
>>> I don't think this Oops is deterministic, but I am working on a reproducer.
>>>
>>> The platform is Ubuntu 22.04 LTS.
>>>
>>> Complete list of hardware and .config is available here:
>>>
>>> https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/
>>>
>>> Best regards,
>>> Mirsad
>>>
>>> ---
>>> kernel: [    5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
>>> kernel: [    5.576707] #PF: supervisor read access in kernel mode
>>> kernel: [    5.576710] #PF: error_code(0x0000) - not-present page
>>> kernel: [    5.576712] PGD 0 P4D 0
>>> kernel: [    5.576715] Oops: 0000 [#1] PREEMPT SMP NOPTI
>>> kernel: [    5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
>>> kernel: [    5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>>> kernel: [    5.576726] RIP: 0010:gfx_v10_0_early_init (drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
>>> kernel: [    5.576872] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 ed 0f 84 b3 fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87
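An illustrative aside on the quoted diff: it replaces a check that jumped to `out` only for -ENODEV with one that jumps to `out` on any error, so the firmware pointer is never dereferenced after a failed request. The following is a minimal userspace sketch of that pattern only, not the amdgpu code. The names `struct fw_blob`, `load_blob()` and `first_header_byte()` are hypothetical stand-ins, and the assumption that a failed load leaves the pointer NULL is part of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a loaded firmware image. */
struct fw_blob {
	size_t size;
	const uint8_t *data;
};

/*
 * Hypothetical loader: on failure it returns a negative errno and
 * leaves *out set to NULL (an assumption made for this sketch).
 */
static int load_blob(const struct fw_blob **out, const struct fw_blob *src)
{
	if (src == NULL || src->data == NULL) {
		*out = NULL;
		return -ENOENT;
	}
	*out = src;
	return 0;
}

/* Returns the first header byte, or a negative errno if loading failed. */
static int first_header_byte(const struct fw_blob *src)
{
	const struct fw_blob *fw = NULL;
	int err = load_blob(&fw, src);

	if (err)		/* bail out on every error, not just one code */
		return err;

	assert(fw != NULL);	/* holds because every failure returned above */
	if (fw->size < 1)
		return -EINVAL;
	return fw->data[0];
}
```

If the check were limited to a single error code, any other failure would fall through to `fw->data` with `fw` still NULL, which is the kind of access the Oops in this thread reports.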
Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008
Hi, Ma Jun,

Normally, I would reply under the quoted text, but I will adjust to your convention.

I have just discovered that your patch causes the Ubuntu 22.04 LTS GNOME XWayland session to block after typing the password and ENTER in the graphical logon screen (tested several times).

After that, I was not even able to log in from another box with ssh, or the session would block (tested one time, second time too; the third time it passed after I connected before attempting to log in on the XWayland console).

You might find the syslog and dmesg of the freeze useful; they are at this link (they were over 100K):

https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/

The exact applied patch was this:

marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 73f6d7e72c73..6ef333df9adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct amdgpu_device *adev)
 
 	if (!amdgpu_sriov_vf(adev)) {
 		snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", ucode_prefix);
-		err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
-		/* don't check this. There are apparently firmwares in the wild with
-		 * incorrect size in the header
-		 */
-		if (err == -ENODEV)
-			goto out;
+		err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
 		if (err)
-			dev_dbg(adev->dev,
-				"gfx10: amdgpu_ucode_request() failed \"%s\"\n",
-				fw_name);
+			goto out;
+
+		/* don't validate this firmware. There are apparently firmwares
+		 * in the wild with incorrect size in the header
+		 */
 		rlc_hdr = (const struct rlc_firmware_header_v2_0 *)adev->gfx.rlc_fw->data;
 		version_major = le16_to_cpu(rlc_hdr->header.header_version_major);
 		version_minor = le16_to_cpu(rlc_hdr->header.header_version_minor);
marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
Linux 6.7.0-xway-09721-g61da593f4458 x86_64
marvin@defiant:~/linux/kernel/linux_torvalds$

So, there seems to be a problem with the way the patch affects XWayland.

I checked the exact commit multiple times, with and without the diff.

I hope this helps, because I am not familiar with the amdgpu driver.

Best regards,
Mirsad Todorovac

On 1/22/24 09:34, Ma, Jun wrote:
> Perhaps this is similar to the problem I encountered earlier; you can try the following patch:
>
> https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html
>
> Regards,
> Ma Jun
>
> On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:
>> Hi,
>>
>> The last email did not reach most of the recipients because of the banned .xz attachment.
>>
>> As the .config is too big to send inline or uncompressed either, I will omit it in this attempt. In the meantime, I had some success in decoding the stack trace, but sadly it is not complete.
>>
>> I don't think this Oops is deterministic, but I am working on a reproducer.
>>
>> The platform is Ubuntu 22.04 LTS.
>>
>> Complete list of hardware and .config is available here:
>>
>> https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/
>>
>> Best regards,
>> Mirsad
>>
>> ---
>> kernel: [    5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
>> kernel: [    5.576707] #PF: supervisor read access in kernel mode
>> kernel: [    5.576710] #PF: error_code(0x0000) - not-present page
>> kernel: [    5.576712] PGD 0 P4D 0
>> kernel: [    5.576715] Oops: 0000 [#1] PREEMPT SMP NOPTI
>> kernel: [    5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
>> kernel: [    5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>> kernel: [    5.576726] RIP: 0010:gfx_v10_0_early_init (drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
>> kernel: [    5.576872] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 ed 0f 84 b3 fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87 01 00 4c 89 ff <48> 8b 40 08 0f b7 50 0a 0f b7 70 08 e8 e4 42 fb ff 41 89 c2 85 c0
>> All code
>>    0:	8d 55 a8             	lea    -0x58(%rbp),%edx
>>    3:	4c 89 ff             	mov    %r15,%rdi
>>    6:	e8 e4 83 ec ff       	call   0xffffffffffec83ef
>>    b:	41 89 c2             	mov    %eax,%r10d
>>    e:	83 f8 ed             	cmp    $0xffffffed,%eax
>>   11:	0f 84 b3 fd ff ff    	je     0xfffffffffffffdca
>>   17:	85 c0                	test   %eax,%eax
>>   19:	74 05                	je     0x20
>>   1b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
>>   20:	49 8b 87 08 87 01 00
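An illustrative aside on the faulting address: the trapping bytes in the Code: line are `<48> 8b 40 08`, a load from offset 8 of the pointer held in %rax, and the reported address is 0000000000000008. That would be consistent with reading the `data` member through a NULL `struct firmware` pointer, because in the kernel's `struct firmware` the `data` pointer follows a `size_t size` member. The userspace sketch below mirrors that layout to show where offset 8 comes from on a 64-bit build. `struct firmware_mirror` and `firmware_data_offset()` are hypothetical names used only for this illustration, and the mirrored layout is an assumption about the kernel version in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of the kernel's struct firmware layout:
 * a size_t followed by the data pointer, then a private pointer.
 */
struct firmware_mirror {
	size_t size;
	const uint8_t *data;
	void *priv;
};

/* The data pointer sits directly after the size field. */
static_assert(offsetof(struct firmware_mirror, data) == sizeof(size_t),
	      "data is expected to follow size without padding");

/* Offset a NULL-based access to ->data would fault at. */
static size_t firmware_data_offset(void)
{
	return offsetof(struct firmware_mirror, data);
}
```

On x86_64, `sizeof(size_t)` is 8, so `firmware_data_offset()` returns 8, matching the address in the BUG line under the assumption above.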