Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-25 Thread Mirsad Todorovac

Hi Ma Jun,

Greetings again.

So, I just tested the recommended patch and the issue with the graphical login
screen was successfully resolved.

Thank you very much for your prompt reviews and recommended patches.

God bless.

Best regards,
Mirsad Todorovac

On 1/25/24 10:29, Mirsad Todorovac wrote:

Hi Ma Jun,

Copy that. This appears to be the exact problem, and thank you for
reviewing the bug report at such a short notice.

I apologise for the wrong assertion.

The patch you sent then just triggered another bug, and it is not manifested 
without the patch (but a NULL pointer dereference instead).

But of course, it is not profitable to remove your patch and have
the NULL ptr dereference, but a proper fix is required.

Thanks again.

Best regards,
Mirsad Todorovac

On 1/25/2024 8:38 AM, Ma, Jun wrote:

Hi Mirsad,


On 1/25/2024 1:48 AM, Mirsad Todorovac wrote:

Hi, Ma Jun,

Normally, I would reply under the quoted text, but I will adjust to your 
convention.

I have just discovered that your patch causes Ubuntu 22.04 LTS GNOME XWayland 
session
to block at typing password and ENTER in the graphical logon screen (tested 
several times).


This problem is not caused by my patch.
Based on your syslog, it looks more like a schedule issue.
I just saw a similar problem, please refer to the link below
https://gitlab.freedesktop.org/drm/amd/-/issues/3124

Regards,
Ma Jun

After that, I was not able to even log from another box with ssh, or the 
session would
block (tested one time, second time too, third time it passed after I connected 
before
attempt to login on XWayland console).

You might find useful syslog and dmesg of the freeze on this link (they were 
+100K):

https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/

The exact applied patch was this:

marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 73f6d7e72c73..6ef333df9adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct 
amdgpu_device *adev)
   if (!amdgpu_sriov_vf(adev)) {
   snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", 
ucode_prefix);
-   err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
-   /* don't check this.  There are apparently firmwares in the 
wild with
-    * incorrect size in the header
-    */
-   if (err == -ENODEV)
-   goto out;
+   err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
   if (err)
-   dev_dbg(adev->dev,
-   "gfx10: amdgpu_ucode_request() failed \"%s\"\n",
-   fw_name);
+   goto out;
+
+   /* don't validate this firmware.  There are apparently firmwares
+    * in the wild with incorrect size in the header
+    */
   rlc_hdr = (const struct rlc_firmware_header_v2_0 
*)adev->gfx.rlc_fw->data;
   version_major = 
le16_to_cpu(rlc_hdr->header.header_version_major);
   version_minor = 
le16_to_cpu(rlc_hdr->header.header_version_minor);
marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
Linux 6.7.0-xway-09721-g61da593f4458 x86_64
marvin@defiant:~/linux/kernel/linux_torvalds$

So, there seems to be a problem with the way the patch affects XWayland.

Checked multiple times the exact commit with and without the diff.

Hope this helps, because I am not familiar with the amdgpu driver.

Best regards,
Mirsad Todorovac

On 1/22/24 09:34, Ma, Jun wrote:

Perhaps similar to the problem I encountered earlier, you can
try the following patch

https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html

Regards,
Ma Jun

On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:

Hi,

The last email did not pass to the most of the recipients due to banned .xz 
attachment.

As the .config is too big to send inline or uncompressed either, I will omit it 
in this
attempt. In the meantime, I had some success in decoding the stack trace, but 
sadly not
complete.

I don't think this Oops is deterministic, but I am working on a reproducer.

The platform is Ubuntu 22.04 LTS.

Complete list of hardware and .config is available here:

https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/

Best regards,
Mirsad

---
kernel: [    5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
kernel: [    5.576707] #PF: supervisor read access in kernel mode
kernel: [    5.576710] #PF: error_code(0x) - not-present page
kernel: [    5.576712] PGD 0 P4D 0
kernel: [    5.576715] Oops:  [#1] PREEMPT SMP NOPTI
kernel: [  

Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-25 Thread Mirsad Todorovac

Hi Ma Jun,

Copy that. This appears to be the exact problem, and thank you for
reviewing the bug report at such a short notice.

I apologise for the wrong assertion.

The patch you sent then just triggered another bug, and it is not 
manifested without the patch (but a NULL pointer dereference instead).


But of course, it is not profitable to remove your patch and have
the NULL ptr dereference, but a proper fix is required.

Thanks again.

Best regards,
Mirsad Todorovac

On 1/25/2024 8:38 AM, Ma, Jun wrote:

Hi Mirsad,


On 1/25/2024 1:48 AM, Mirsad Todorovac wrote:

Hi, Ma Jun,

Normally, I would reply under the quoted text, but I will adjust to your 
convention.

I have just discovered that your patch causes Ubuntu 22.04 LTS GNOME XWayland 
session
to block at typing password and ENTER in the graphical logon screen (tested 
several times).


This problem is not caused by my patch.
Based on your syslog, it looks more like a schedule issue.
I just saw a similar problem, please refer to the link below
https://gitlab.freedesktop.org/drm/amd/-/issues/3124

Regards,
Ma Jun

After that, I was not able to even log from another box with ssh, or the 
session would
block (tested one time, second time too, third time it passed after I connected 
before
attempt to login on XWayland console).

You might find useful syslog and dmesg of the freeze on this link (they were 
+100K):

https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/

The exact applied patch was this:

marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 73f6d7e72c73..6ef333df9adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct 
amdgpu_device *adev)

   if (!amdgpu_sriov_vf(adev)) {

   snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", 
ucode_prefix);
-   err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
-   /* don't check this.  There are apparently firmwares in the 
wild with
-* incorrect size in the header
-*/
-   if (err == -ENODEV)
-   goto out;
+   err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
   if (err)
-   dev_dbg(adev->dev,
-   "gfx10: amdgpu_ucode_request() failed \"%s\"\n",
-   fw_name);
+   goto out;
+
+   /* don't validate this firmware.  There are apparently firmwares
+* in the wild with incorrect size in the header
+*/
   rlc_hdr = (const struct rlc_firmware_header_v2_0 
*)adev->gfx.rlc_fw->data;
   version_major = 
le16_to_cpu(rlc_hdr->header.header_version_major);
   version_minor = 
le16_to_cpu(rlc_hdr->header.header_version_minor);
marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
Linux 6.7.0-xway-09721-g61da593f4458 x86_64
marvin@defiant:~/linux/kernel/linux_torvalds$

So, there seems to be a problem with the way the patch affects XWayland.

Checked multiple times the exact commit with and without the diff.

Hope this helps, because I am not familiar with the amdgpu driver.

Best regards,
Mirsad Todorovac

On 1/22/24 09:34, Ma, Jun wrote:

Perhaps similar to the problem I encountered earlier, you can
try the following patch

https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html

Regards,
Ma Jun

On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:

Hi,

The last email did not pass to the most of the recipients due to banned .xz 
attachment.

As the .config is too big to send inline or uncompressed either, I will omit it 
in this
attempt. In the meantime, I had some success in decoding the stack trace, but 
sadly not
complete.

I don't think this Oops is deterministic, but I am working on a reproducer.

The platform is Ubuntu 22.04 LTS.

Complete list of hardware and .config is available here:

https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/

Best regards,
Mirsad

---
kernel: [5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
kernel: [5.576707] #PF: supervisor read access in kernel mode
kernel: [5.576710] #PF: error_code(0x) - not-present page
kernel: [5.576712] PGD 0 P4D 0
kernel: [5.576715] Oops:  [#1] PREEMPT SMP NOPTI
kernel: [5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 
6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
kernel: [5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG 
Lightning, BIOS 1.21 04/26/2023
kernel: [5.576726] RIP: 0010:gfx_v10_0_early_init 

Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-24 Thread Ma, Jun
Hi Mirsad,


On 1/25/2024 1:48 AM, Mirsad Todorovac wrote:
> Hi, Ma Jun,
> 
> Normally, I would reply under the quoted text, but I will adjust to your 
> convention.
> 
> I have just discovered that your patch causes Ubuntu 22.04 LTS GNOME XWayland 
> session
> to block at typing password and ENTER in the graphical logon screen (tested 
> several times).
> 
This problem is not caused by my patch. 
Based on your syslog, it looks more like a schedule issue.
I just saw a similar problem, please refer to the link below
https://gitlab.freedesktop.org/drm/amd/-/issues/3124

Regards,
Ma Jun
> After that, I was not able to even log from another box with ssh, or the 
> session would
> block (tested one time, second time too, third time it passed after I 
> connected before
> attempt to login on XWayland console).
> 
> You might find useful syslog and dmesg of the freeze on this link (they were 
> +100K):
> 
> https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/
> 
> The exact applied patch was this:
> 
> marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 73f6d7e72c73..6ef333df9adf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct 
> amdgpu_device *adev)
>
>   if (!amdgpu_sriov_vf(adev)) {
>   snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", 
> ucode_prefix);
> -   err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
> -   /* don't check this.  There are apparently firmwares in the 
> wild with
> -* incorrect size in the header
> -*/
> -   if (err == -ENODEV)
> -   goto out;
> +   err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
>   if (err)
> -   dev_dbg(adev->dev,
> -   "gfx10: amdgpu_ucode_request() failed 
> \"%s\"\n",
> -   fw_name);
> +   goto out;
> +
> +   /* don't validate this firmware.  There are apparently 
> firmwares
> +* in the wild with incorrect size in the header
> +*/
>   rlc_hdr = (const struct rlc_firmware_header_v2_0 
> *)adev->gfx.rlc_fw->data;
>   version_major = 
> le16_to_cpu(rlc_hdr->header.header_version_major);
>   version_minor = 
> le16_to_cpu(rlc_hdr->header.header_version_minor);
> marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
> Linux 6.7.0-xway-09721-g61da593f4458 x86_64
> marvin@defiant:~/linux/kernel/linux_torvalds$
> 
> So, there seems to be a problem with the way the patch affects XWayland.
> 
> Checked multiple times the exact commit with and without the diff.
> 
> Hope this helps, because I am not familiar with the amdgpu driver.
> 
> Best regards,
> Mirsad Todorovac
> 
> On 1/22/24 09:34, Ma, Jun wrote:
>> Perhaps similar to the problem I encountered earlier, you can
>> try the following patch
>>
>> https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html
>>
>> Regards,
>> Ma Jun
>>
>> On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:
>>> Hi,
>>>
>>> The last email did not pass to the most of the recipients due to banned .xz 
>>> attachment.
>>>
>>> As the .config is too big to send inline or uncompressed either, I will 
>>> omit it in this
>>> attempt. In the meantime, I had some success in decoding the stack trace, 
>>> but sadly not
>>> complete.
>>>
>>> I don't think this Oops is deterministic, but I am working on a reproducer.
>>>
>>> The platform is Ubuntu 22.04 LTS.
>>>
>>> Complete list of hardware and .config is available here:
>>>
>>> https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/
>>>
>>> Best regards,
>>> Mirsad
>>>
>>> ---
>>> kernel: [5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
>>> kernel: [5.576707] #PF: supervisor read access in kernel mode
>>> kernel: [5.576710] #PF: error_code(0x) - not-present page
>>> kernel: [5.576712] PGD 0 P4D 0
>>> kernel: [5.576715] Oops:  [#1] PREEMPT SMP NOPTI
>>> kernel: [5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 
>>> 6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
>>> kernel: [5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG 
>>> Lightning, BIOS 1.21 04/26/2023
>>> kernel: [5.576726] RIP: 0010:gfx_v10_0_early_init 
>>> (drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 
>>> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
>>> kernel: [ 5.576872] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 
>>> ed 0f 84 b3 fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87 

Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-24 Thread Mirsad Todorovac

Hi, Ma Jun,

Normally, I would reply under the quoted text, but I will adjust to your 
convention.

I have just discovered that your patch causes Ubuntu 22.04 LTS GNOME XWayland 
session
to block at typing password and ENTER in the graphical logon screen (tested 
several times).

After that, I was not able to even log from another box with ssh, or the 
session would
block (tested one time, second time too, third time it passed after I connected 
before
attempt to login on XWayland console).

You might find useful syslog and dmesg of the freeze on this link (they were 
+100K):

https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/

The exact applied patch was this:

marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 73f6d7e72c73..6ef333df9adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct 
amdgpu_device *adev)
  
 if (!amdgpu_sriov_vf(adev)) {

 snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", 
ucode_prefix);
-   err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
-   /* don't check this.  There are apparently firmwares in the 
wild with
-* incorrect size in the header
-*/
-   if (err == -ENODEV)
-   goto out;
+   err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
 if (err)
-   dev_dbg(adev->dev,
-   "gfx10: amdgpu_ucode_request() failed \"%s\"\n",
-   fw_name);
+   goto out;
+
+   /* don't validate this firmware.  There are apparently firmwares
+* in the wild with incorrect size in the header
+*/
 rlc_hdr = (const struct rlc_firmware_header_v2_0 
*)adev->gfx.rlc_fw->data;
 version_major = 
le16_to_cpu(rlc_hdr->header.header_version_major);
 version_minor = 
le16_to_cpu(rlc_hdr->header.header_version_minor);
marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
Linux 6.7.0-xway-09721-g61da593f4458 x86_64
marvin@defiant:~/linux/kernel/linux_torvalds$

So, there seems to be a problem with the way the patch affects XWayland.

Checked multiple times the exact commit with and without the diff.

Hope this helps, because I am not familiar with the amdgpu driver.

Best regards,
Mirsad Todorovac

On 1/22/24 09:34, Ma, Jun wrote:

Perhaps similar to the problem I encountered earlier, you can
try the following patch

https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html

Regards,
Ma Jun

On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:

Hi,

The last email did not pass to the most of the recipients due to banned .xz 
attachment.

As the .config is too big to send inline or uncompressed either, I will omit it 
in this
attempt. In the meantime, I had some success in decoding the stack trace, but 
sadly not
complete.

I don't think this Oops is deterministic, but I am working on a reproducer.

The platform is Ubuntu 22.04 LTS.

Complete list of hardware and .config is available here:

https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/

Best regards,
Mirsad

---
kernel: [5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
kernel: [5.576707] #PF: supervisor read access in kernel mode
kernel: [5.576710] #PF: error_code(0x) - not-present page
kernel: [5.576712] PGD 0 P4D 0
kernel: [5.576715] Oops:  [#1] PREEMPT SMP NOPTI
kernel: [5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 
6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
kernel: [5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG 
Lightning, BIOS 1.21 04/26/2023
kernel: [5.576726] RIP: 0010:gfx_v10_0_early_init 
(drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
kernel: [ 5.576872] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 ed 0f 84 b3 
fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87 01 00 4c 89 ff <48> 8b 40 08 
0f b7 50 0a 0f b7 70 08 e8 e4 42 fb ff 41 89 c2 85 c0
All code

 0: 8d 55 a8                lea    -0x58(%rbp),%edx
 3: 4c 89 ff                mov    %r15,%rdi
 6: e8 e4 83 ec ff          call   0xffffffffffec83ef
 b: 41 89 c2                mov    %eax,%r10d
 e: 83 f8 ed                cmp    $0xffffffed,%eax
11: 0f 84 b3 fd ff ff       je     0xfffffffffffffdca
17: 85 c0                   test   %eax,%eax
19: 74 05                   je     0x20
1b: 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
20: 49 8b 87 08 87 01 00
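
The marked bytes `<48> 8b 40 08` in the Code: line decode to a load from offset 8 of the pointer in %rax, which would match reading the `data` member through a NULL `rlc_fw` pointer and would explain the fault address 0000000000000008. The following is only a user-space sketch of that arithmetic, not driver code: the struct is a simplified stand-in for the kernel's `struct firmware`, and `data_field_address()`, `firmware_data_or_null()` and `check_layout()` are hypothetical helpers introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct firmware (assumption:
 * a size field followed by a data pointer, as in include/linux/firmware.h). */
struct firmware {
	size_t size;
	const uint8_t *data;
	void *priv;
};

/* Hypothetical helper: the address the CPU would touch when reading
 * fw->data. It only does the arithmetic and never dereferences fw. */
uintptr_t data_field_address(const struct firmware *fw)
{
	return (uintptr_t)fw + offsetof(struct firmware, data);
}

/* Hypothetical guard: hand back the payload only if the firmware
 * object exists and is non-empty, otherwise NULL. */
const uint8_t *firmware_data_or_null(const struct firmware *fw)
{
	if (!fw || !fw->data || fw->size == 0)
		return NULL;
	return fw->data;
}

/* Sanity check of the layout assumption: data sits right after size. */
void check_layout(void)
{
	assert(offsetof(struct firmware, data) == sizeof(size_t));
}
```

On x86_64, where `size_t` is 8 bytes wide, `data_field_address(NULL)` evaluates to 8. Whether the driver should guard against a missing firmware object in this way or simply bail out when the request fails is a design question for the amdgpu maintainers; the sketch only shows why a NULL `rlc_fw` yields this particular address.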