Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-25 Thread Mirsad Todorovac

Hi Ma Jun,

Greetings again.

So, I just tested the recommended patch and the issue with the graphical login
screen was successfully resolved.

Thank you very much for your prompt reviews and recommended patches.

God bless.

Best regards,
Mirsad Todorovac

On 1/25/24 10:29, Mirsad Todorovac wrote:

Hi Ma Jun,

Copy that. This appears to be the exact problem, and thank you for
reviewing the bug report at such short notice.

I apologise for the wrong assertion.

So the patch you sent merely exposed another bug, one that does not
manifest without the patch (the NULL pointer dereference occurs instead).

Of course, there is no point in removing your patch and bringing back
the NULL pointer dereference; a proper fix is required.

Thanks again.

Best regards,
Mirsad Todorovac

On 1/25/2024 8:38 AM, Ma, Jun wrote:

Hi Mirsad,


On 1/25/2024 1:48 AM, Mirsad Todorovac wrote:

Hi, Ma Jun,

Normally, I would reply under the quoted text, but I will adjust to your 
convention.

I have just discovered that your patch causes the Ubuntu 22.04 LTS GNOME
XWayland session to block after typing the password and pressing ENTER on
the graphical login screen (tested several times).


This problem is not caused by my patch.
Based on your syslog, it looks more like a schedule issue.
I just saw a similar problem; please refer to the link below:
https://gitlab.freedesktop.org/drm/amd/-/issues/3124

Regards,
Ma Jun

After that, I was not even able to log in from another box with ssh, or the
session would block (tested once, then a second time; the third time it
passed, after I connected before attempting to log in on the XWayland
console).

The syslog and dmesg of the freeze might be useful; they are at this link
(each was over 100K):

https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/

The exact applied patch was this:

marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 73f6d7e72c73..6ef333df9adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct amdgpu_device *adev)
 
 	if (!amdgpu_sriov_vf(adev)) {
 		snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", ucode_prefix);
-		err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
-		/* don't check this.  There are apparently firmwares in the wild with
-		 * incorrect size in the header
-		 */
-		if (err == -ENODEV)
-			goto out;
+		err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
 		if (err)
-			dev_dbg(adev->dev,
-				"gfx10: amdgpu_ucode_request() failed \"%s\"\n",
-				fw_name);
+			goto out;
+
+		/* don't validate this firmware.  There are apparently firmwares
+		 * in the wild with incorrect size in the header
+		 */
 		rlc_hdr = (const struct rlc_firmware_header_v2_0 *)adev->gfx.rlc_fw->data;
 		version_major = le16_to_cpu(rlc_hdr->header.header_version_major);
 		version_minor = le16_to_cpu(rlc_hdr->header.header_version_minor);
marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
Linux 6.7.0-xway-09721-g61da593f4458 x86_64
marvin@defiant:~/linux/kernel/linux_torvalds$

So, there seems to be a problem with the way the patch affects XWayland.

I checked the exact commit multiple times, with and without the diff.

Hope this helps, because I am not familiar with the amdgpu driver.

Best regards,
Mirsad Todorovac

On 1/22/24 09:34, Ma, Jun wrote:

Perhaps similar to the problem I encountered earlier, you can
try the following patch

https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html

Regards,
Ma Jun

On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:

Hi,

The last email did not reach most of the recipients because of the banned
.xz attachment.

As the .config is too big to send either inline or uncompressed, I will omit
it in this attempt. In the meantime, I had some success in decoding the stack
trace, but sadly it is not complete.

I don't think this Oops is deterministic, but I am working on a reproducer.

The platform is Ubuntu 22.04 LTS.

Complete list of hardware and .config is available here:

https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/

Best regards,
Mirsad

---
kernel: [    5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
kernel: [    5.576707] #PF: supervisor read access in kernel mode
kernel: [    5.576710] #PF: error_code(0x) - not-present page
kernel: [    5.576712] PGD 0 P4D 0
kernel: [

Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-25 Thread Mirsad Todorovac

Hi Ma Jun,

Copy that. This appears to be the exact problem, and thank you for
reviewing the bug report at such short notice.

I apologise for the wrong assertion.

So the patch you sent merely exposed another bug, one that does not
manifest without the patch (the NULL pointer dereference occurs instead).

Of course, there is no point in removing your patch and bringing back
the NULL pointer dereference; a proper fix is required.

Thanks again.

Best regards,
Mirsad Todorovac

On 1/25/2024 8:38 AM, Ma, Jun wrote:

Hi Mirsad,


On 1/25/2024 1:48 AM, Mirsad Todorovac wrote:

Hi, Ma Jun,

Normally, I would reply under the quoted text, but I will adjust to your 
convention.

I have just discovered that your patch causes the Ubuntu 22.04 LTS GNOME
XWayland session to block after typing the password and pressing ENTER on
the graphical login screen (tested several times).


This problem is not caused by my patch.
Based on your syslog, it looks more like a schedule issue.
I just saw a similar problem; please refer to the link below:
https://gitlab.freedesktop.org/drm/amd/-/issues/3124

Regards,
Ma Jun

After that, I was not even able to log in from another box with ssh, or the
session would block (tested once, then a second time; the third time it
passed, after I connected before attempting to log in on the XWayland
console).

The syslog and dmesg of the freeze might be useful; they are at this link
(each was over 100K):

https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/

The exact applied patch was this:

marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 73f6d7e72c73..6ef333df9adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct amdgpu_device *adev)
 
 	if (!amdgpu_sriov_vf(adev)) {
 		snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", ucode_prefix);
-		err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
-		/* don't check this.  There are apparently firmwares in the wild with
-		 * incorrect size in the header
-		 */
-		if (err == -ENODEV)
-			goto out;
+		err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
 		if (err)
-			dev_dbg(adev->dev,
-				"gfx10: amdgpu_ucode_request() failed \"%s\"\n",
-				fw_name);
+			goto out;
+
+		/* don't validate this firmware.  There are apparently firmwares
+		 * in the wild with incorrect size in the header
+		 */
 		rlc_hdr = (const struct rlc_firmware_header_v2_0 *)adev->gfx.rlc_fw->data;
 		version_major = le16_to_cpu(rlc_hdr->header.header_version_major);
 		version_minor = le16_to_cpu(rlc_hdr->header.header_version_minor);
marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
Linux 6.7.0-xway-09721-g61da593f4458 x86_64
marvin@defiant:~/linux/kernel/linux_torvalds$

So, there seems to be a problem with the way the patch affects XWayland.

I checked the exact commit multiple times, with and without the diff.

Hope this helps, because I am not familiar with the amdgpu driver.

Best regards,
Mirsad Todorovac

On 1/22/24 09:34, Ma, Jun wrote:

Perhaps similar to the problem I encountered earlier, you can
try the following patch

https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html

Regards,
Ma Jun

On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:

Hi,

The last email did not reach most of the recipients because of the banned
.xz attachment.

As the .config is too big to send either inline or uncompressed, I will omit
it in this attempt. In the meantime, I had some success in decoding the stack
trace, but sadly it is not complete.

I don't think this Oops is deterministic, but I am working on a reproducer.

The platform is Ubuntu 22.04 LTS.

Complete list of hardware and .config is available here:

https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/

Best regards,
Mirsad

---
kernel: [5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
kernel: [5.576707] #PF: supervisor read access in kernel mode
kernel: [5.576710] #PF: error_code(0x) - not-present page
kernel: [5.576712] PGD 0 P4D 0
kernel: [5.576715] Oops:  [#1] PREEMPT SMP NOPTI
kernel: [5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
kernel: [5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
kernel: [5.576726

Re: BUG [RESEND][NEW BUG]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-24 Thread Mirsad Todorovac

Hi, Ma Jun,

Normally, I would reply under the quoted text, but I will adjust to your 
convention.

I have just discovered that your patch causes the Ubuntu 22.04 LTS GNOME
XWayland session to block after typing the password and pressing ENTER on
the graphical login screen (tested several times).

After that, I was not even able to log in from another box with ssh, or the
session would block (tested once, then a second time; the third time it
passed, after I connected before attempting to log in on the XWayland
console).

The syslog and dmesg of the freeze might be useful; they are at this link
(each was over 100K):

https://magrf.grf.hr/~mtodorov/linux/bugreports/6.7.0/amdgpu/6.7.0-xway-09721-g61da593f4458/

The exact applied patch was this:

marvin@defiant:~/linux/kernel/linux_torvalds$ git diff
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 73f6d7e72c73..6ef333df9adf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3996,16 +3996,13 @@ static int gfx_v10_0_init_microcode(struct amdgpu_device *adev)
 
 	if (!amdgpu_sriov_vf(adev)) {
 		snprintf(fw_name, sizeof(fw_name), "amdgpu/%s_rlc.bin", ucode_prefix);
-		err = amdgpu_ucode_request(adev, &adev->gfx.rlc_fw, fw_name);
-		/* don't check this.  There are apparently firmwares in the wild with
-		 * incorrect size in the header
-		 */
-		if (err == -ENODEV)
-			goto out;
+		err = request_firmware(&adev->gfx.rlc_fw, fw_name, adev->dev);
 		if (err)
-			dev_dbg(adev->dev,
-				"gfx10: amdgpu_ucode_request() failed \"%s\"\n",
-				fw_name);
+			goto out;
+
+		/* don't validate this firmware.  There are apparently firmwares
+		 * in the wild with incorrect size in the header
+		 */
 		rlc_hdr = (const struct rlc_firmware_header_v2_0 *)adev->gfx.rlc_fw->data;
 		version_major = le16_to_cpu(rlc_hdr->header.header_version_major);
 		version_minor = le16_to_cpu(rlc_hdr->header.header_version_minor);
marvin@defiant:~/linux/kernel/linux_torvalds$ uname -rms
Linux 6.7.0-xway-09721-g61da593f4458 x86_64
marvin@defiant:~/linux/kernel/linux_torvalds$

So, there seems to be a problem with the way the patch affects XWayland.

I checked the exact commit multiple times, with and without the diff.

Hope this helps, because I am not familiar with the amdgpu driver.

Best regards,
Mirsad Todorovac

On 1/22/24 09:34, Ma, Jun wrote:

Perhaps similar to the problem I encountered earlier, you can
try the following patch

https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html

Regards,
Ma Jun

On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:

Hi,

The last email did not reach most of the recipients because of the banned
.xz attachment.

As the .config is too big to send either inline or uncompressed, I will omit
it in this attempt. In the meantime, I had some success in decoding the stack
trace, but sadly it is not complete.

I don't think this Oops is deterministic, but I am working on a reproducer.

The platform is Ubuntu 22.04 LTS.

Complete list of hardware and .config is available here:

https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/

Best regards,
Mirsad

---
kernel: [5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
kernel: [5.576707] #PF: supervisor read access in kernel mode
kernel: [5.576710] #PF: error_code(0x) - not-present page
kernel: [5.576712] PGD 0 P4D 0
kernel: [5.576715] Oops:  [#1] PREEMPT SMP NOPTI
kernel: [5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
kernel: [5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
kernel: [5.576726] RIP: 0010:gfx_v10_0_early_init (drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
kernel: [ 5.576872] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 ed 0f 84 b3 fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87 01 00 4c 89 ff <48> 8b 40 08 0f b7 50 0a 0f b7 70 08 e8 e4 42 fb ff 41 89 c2 85 c0
All code

   0:   8d 55 a8                lea    -0x58(%rbp),%edx
   3:   4c 89 ff                mov    %r15,%rdi
   6:   e8 e4 83 ec ff          call   0xffffffffffec83ef
   b:   41 89 c2                mov    %eax,%r10d
   e:   83 f8 ed                cmp    $0xffffffed,%eax
  11:   0f 84 b3 fd ff ff       je     0xfffffffffffffdca
  17:   85 c0                   test   %eax,%eax
  19:   74 05                   je     0x20
1b: 0f 1f 44 00 00  

Re: BUG [RESEND]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-22 Thread Mirsad Todorovac
On 22. 01. 2024. 09:34, Ma, Jun wrote:
> Perhaps similar to the problem I encountered earlier, you can
> try the following patch
> 
> https://lists.freedesktop.org/archives/amd-gfx/2024-January/103259.html

Apparently, this patch prevented the NULL dereference; it was no longer in the log.

However, there is another hang in the XWayland password entry dialog, and I do
not think I have figured out what is wrong.

Best regards,
Mirsad

> Regards,
> Ma Jun
> 
> On 1/21/2024 3:54 AM, Mirsad Todorovac wrote:
>> Hi,
>>
>> The last email did not reach most of the recipients because of the banned
>> .xz attachment.
>>
>> As the .config is too big to send either inline or uncompressed, I will omit
>> it in this attempt. In the meantime, I had some success in decoding the stack
>> trace, but sadly it is not complete.
>>
>> I don't think this Oops is deterministic, but I am working on a reproducer.
>>
>> The platform is Ubuntu 22.04 LTS.
>>
>> Complete list of hardware and .config is available here:
>>
>> https://domac.alu.unizg.hr/~mtodorov/linux/bugreports/amdgpu/6.7.0-rtl-v02-nokcsan-09928-g052d534373b7/
>>
>> Best regards,
>> Mirsad
>>
>> ---
>> kernel: [5.576702] BUG: kernel NULL pointer dereference, address: 0000000000000008
>> kernel: [5.576707] #PF: supervisor read access in kernel mode
>> kernel: [5.576710] #PF: error_code(0x) - not-present page
>> kernel: [5.576712] PGD 0 P4D 0
>> kernel: [5.576715] Oops:  [#1] PREEMPT SMP NOPTI
>> kernel: [5.576718] CPU: 9 PID: 650 Comm: systemd-udevd Not tainted 6.7.0-rtl-v0.2-nokcsan-09928-g052d534373b7 #2
>> kernel: [5.576723] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>> kernel: [5.576726] RIP: 0010:gfx_v10_0_early_init (drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
>> kernel: [ 5.576872] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 ed 0f 84 b3 fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87 01 00 4c 89 ff <48> 8b 40 08 0f b7 50 0a 0f b7 70 08 e8 e4 42 fb ff 41 89 c2 85 c0
>> All code
>>
>>    0:   8d 55 a8                lea    -0x58(%rbp),%edx
>>    3:   4c 89 ff                mov    %r15,%rdi
>>    6:   e8 e4 83 ec ff          call   0xffffffffffec83ef
>>    b:   41 89 c2                mov    %eax,%r10d
>>    e:   83 f8 ed                cmp    $0xffffffed,%eax
>>   11:   0f 84 b3 fd ff ff       je     0xfffffffffffffdca
>>   17:   85 c0                   test   %eax,%eax
>>   19:   74 05                   je     0x20
>>   1b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>   20:   49 8b 87 08 87 01 00    mov    0x18708(%r15),%rax
>>   27:   4c 89 ff                mov    %r15,%rdi
>>   2a:*  48 8b 40 08             mov    0x8(%rax),%rax           <-- trapping instruction
>>   2e:   0f b7 50 0a             movzwl 0xa(%rax),%edx
>>   32:   0f b7 70 08             movzwl 0x8(%rax),%esi
>>   36:   e8 e4 42 fb ff          call   0xfffffffffffb431f
>>   3b:   41 89 c2                mov    %eax,%r10d
>>   3e:   85 c0                   test   %eax,%eax
>>
>> Code starting with the faulting instruction
>> ===
>>    0:   48 8b 40 08             mov    0x8(%rax),%rax
>>    4:   0f b7 50 0a             movzwl 0xa(%rax),%edx
>>    8:   0f b7 70 08             movzwl 0x8(%rax),%esi
>>    c:   e8 e4 42 fb ff          call   0xfffffffffffb42f5
>>   11:   41 89 c2                mov    %eax,%r10d
>>   14:   85 c0                   test   %eax,%eax
>> kernel: [5.576878] RSP: 0018:a5b3c103f720 EFLAGS: 00010282
>> kernel: [5.576881] RAX:  RBX: c1d73489 RCX: 
>> 
>> kernel: [5.576884] RDX:  RSI:  RDI: 
>> 91ae4fa8
>> kernel: [5.576886] RBP: a5b3c103f7b0 R08:  R09: 
>> 
>> kernel: [5.576889] R10: ffea R11:  R12: 
>> 91ae4fa986e8
>> kernel: [5.576892] R13: 91ae4fa986d8 R14: 91ae4fa986f8 R15: 
>> 91ae4fa8
>> kernel: [5.576895] FS:  7fdaa343c8c0() GS:91bd5844() 
>> knlGS:
>> kernel: [5.576898] CS:  0010 DS:  ES: 00

Re: [BUG][BISECTED] Freeze at loading init ramdisk

2024-01-22 Thread Mirsad Todorovac




On 1/22/24 11:20, Uwe Kleine-König wrote:

On Thu, Jan 18, 2024 at 09:04:05PM +0100, Mirsad Todorovac wrote:



On 1/18/24 08:45, Uwe Kleine-König wrote:

Hello Mirsad,

On Wed, Jan 17, 2024 at 07:47:49PM +0100, Mirsad Todorovac wrote:

On 1/16/24 01:32, Mirsad Todorovac wrote:

On the Ubuntu 22.04 LTS Jammy platform, on a mainline vanilla torvalds tree
kernel, the boot freezes after the first two lines and before any systemd
messages.

(Please find the config attached.)

Bisecting the bug led to this result:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
d97a78423c33f68ca6543de510a409167baed6f5 is the first bad commit
commit d97a78423c33f68ca6543de510a409167baed6f5
Merge: 61da593f4458 689237ab37c5
Author: Linus Torvalds 
Date:   Fri Jan 12 14:38:08 2024 -0800

[...]

Hope this helps.


P.S.

As I see that this is a larger merge commit, with 5K+ lines changed, I don't
think I can bisect further to determine the culprit.


Actually it's not that hard. If a merge commit is the first bad commit
for a bisection, either the merge wasn't done correctly (less likely,
looking at d97a78423c33f68ca6543de510a409167baed6f5 I'd bet this isn't
the problem); or changes on different sides conflict or you did
something wrong during bisection.

To rule out the third option, you can just retest d97a78423c33,
61da593f4458 and 689237ab37c5. If d97a78423c33 is the only bad one, you
did it right.


This was confirmed.


Then to further debug the second option you can find out the offending
commit on each side with a bisection as follows, here for the RHS (i.e.
689237ab37c5):

git bisect start 689237ab37c5 $(git merge-base 61da593f4458 689237ab37c5)

and then in each bisection step do:

git merge --no-commit 61da593f4458
test if the problem is present
git reset --hard
git bisect good/bad

In this case you get merge conflicts in drivers/video/fbdev/amba-clcd.c
and drivers/video/fbdev/vermilion/vermilion.c. In the assumption that
you don't have these enabled in your .config, you can just ignore these.

Side note: A problem during bisection can be that the .config changes
along the process. You should put your config into (say)
arch/x86/configs/lala_defconfig and do

make lala_defconfig

before building each step to prevent this.
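
The per-step routine described above can be sketched as two small shell helpers. This is only an illustration under stated assumptions: the function names `try_merge` and `record_verdict` and the `GIT` override are hypothetical and not part of git; only `git merge --no-commit`, `git reset --hard` and `git bisect good/bad` come from the procedure itself. The build and boot test stay manual.

```shell
#!/bin/sh
# Sketch: helpers for one step of bisecting a single side of a merge.
# GIT may be overridden (for example GIT=echo) to print the git commands
# instead of running them; this override is an illustrative convenience.
GIT=${GIT:-git}

# try_merge OTHER_SIDE
# Merge the other parent into the commit that bisect checked out, without
# committing, so that the combined tree can be built and boot-tested.
try_merge() {
    $GIT merge --no-commit "$1"
}

# record_verdict good|bad
# Throw away the trial merge first, then mark the commit bisect checked out.
record_verdict() {
    case "$1" in
        good|bad) ;;
        *) echo "usage: record_verdict good|bad" >&2; return 2 ;;
    esac
    $GIT reset --hard && $GIT bisect "$1"
}
```

After the `git bisect start` shown above, each step would then be `try_merge 61da593f4458`, `make lala_defconfig`, build and boot, and finally `record_verdict good` or `record_verdict bad`. Tying the reset to the verdict means the trial merge cannot be left behind by accident between steps.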


I must have done something wrong:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect log
# bad: [689237ab37c59b9909bc9371d7fece3081683fba] fbdev/intelfb: Remove driver
# good: [de927f6c0b07d9e698416c5b287c521b07694cac] Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect start '689237ab37c5' 'de927f6c0b07d9e698416c5b287c521b07694cac'
# good: [d9f25b59ed85ae45801cf45fe17eb269b0ef3038] fbdev: Remove support for Carillo Ranch driver
git bisect good d9f25b59ed85ae45801cf45fe17eb269b0ef3038
# good: [e2e0b838a1849f92612a8305c09aaf31bf824350] video/sticore: Remove info field from STI struct
git bisect good e2e0b838a1849f92612a8305c09aaf31bf824350
# good: [778e73d2411abc8f3a2d60dbf038acaec218792e] drm/hyperv: Remove firmware framebuffers with aperture helper
git bisect good 778e73d2411abc8f3a2d60dbf038acaec218792e
# good: [df67699c9cb0ceb70f6cc60630ca938c06773eda] firmware/sysfb: Clear screen_info state after consuming it
git bisect good df67699c9cb0ceb70f6cc60630ca938c06773eda


FTR: Now that you identified df67699c9cb0ce as the culprit, calling
git bisect good on it was wrong, so something was fishy in your testing
and it's no surprise the bisection found a wrong result.


Copy that. But it is my first attempt at bisecting a merge commit, so I will
simply ask to be forgiven. Maybe I forgot "git reset --hard" in some step.

I have to do thorough homework on the merge commit magic.

Best regards,
Mirsad



Best regards
Uwe



BUG [RESEND]: kernel NULL pointer dereference, address: 0000000000000008

2024-01-20 Thread Mirsad Todorovac
%rax
   6:   73 01                   jae    0x9
   8:   c3                      ret
   9:   48 8b 0d 73 b5 0f 00    mov    0xfb573(%rip),%rcx        # 0xfb583
  10:   f7 d8                   neg    %eax
  12:   64 89 01                mov    %eax,%fs:(%rcx)
  15:   48                      rex.W
kernel: [5.577729] RSP: 002b:7ffeb4f87d28 EFLAGS: 0246 ORIG_RAX: 
0139
kernel: [5.577733] RAX: ffda RBX: 55aedf3eeeb0 RCX: 
7fdaa331e88d
kernel: [5.577736] RDX:  RSI: 55aedf3efb80 RDI: 
001a
kernel: [5.577738] RBP: 0002 R08:  R09: 
0002
kernel: [5.577741] R10: 001a R11: 0246 R12: 
55aedf3efb80
kernel: [5.577744] R13: 55aedf3f2060 R14:  R15: 
55aedf2b1220
kernel: [5.577748]  
kernel: [5.577750] Modules linked in: intel_rapl_msr intel_rapl_common 
amdgpu(+) edac_mce_amd kvm_amd kvm snd_hda_codec_realtek snd_hda_codec_generic 
irqbypass ledtrig_audio crct10dif_pclmul polyval_clmulni polyval_generic 
snd_hda_codec_hdmi ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 
amdxcp snd_hda_intel aesni_intel drm_exec snd_intel_dspcfg crypto_simd 
gpu_sched snd_intel_sdw_acpi cryptd nls_iso8859_1 drm_buddy snd_hda_codec 
snd_seq_midi drm_suballoc_helper snd_seq_midi_event drm_ttm_helper joydev 
snd_hda_core input_leds ttm rapl snd_rawmidi snd_hwdep drm_display_helper 
snd_seq snd_pcm wmi_bmof cec k10temp snd_seq_device ccp rc_core snd_timer snd 
drm_kms_helper i2c_algo_bit soundcore mac_hid tcp_bbr sch_fq msr parport_pc 
ppdev lp drm parport efi_pstore ip_tables x_tables autofs4 btrfs 
blake2b_generic xor raid6_pq libcrc32c hid_generic usbhid hid crc32_pclmul nvme 
r8169 ahci nvme_core i2c_piix4 xhci_pci libahci xhci_pci_renesas realtek video 
wmi gpio_amdpt
kernel: [5.577817] CR2: 0008
kernel: [5.577820] ---[ end trace  ]---
kernel: [5.914230] RIP: 0010:gfx_v10_0_early_init (drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:4009 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c:7478) amdgpu
kernel: [ 5.914388] Code: 8d 55 a8 4c 89 ff e8 e4 83 ec ff 41 89 c2 83 f8 ed 0f 84 b3 fd ff ff 85 c0 74 05 0f 1f 44 00 00 49 8b 87 08 87 01 00 4c 89 ff <48> 8b 40 08 0f b7 50 0a 0f b7 70 08 e8 e4 42 fb ff 41 89 c2 85 c0
All code

   0:   8d 55 a8                lea    -0x58(%rbp),%edx
   3:   4c 89 ff                mov    %r15,%rdi
   6:   e8 e4 83 ec ff          call   0xffffffffffec83ef
   b:   41 89 c2                mov    %eax,%r10d
   e:   83 f8 ed                cmp    $0xffffffed,%eax
  11:   0f 84 b3 fd ff ff       je     0xfffffffffffffdca
  17:   85 c0                   test   %eax,%eax
  19:   74 05                   je     0x20
  1b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  20:   49 8b 87 08 87 01 00    mov    0x18708(%r15),%rax
  27:   4c 89 ff                mov    %r15,%rdi
  2a:*  48 8b 40 08             mov    0x8(%rax),%rax           <-- trapping instruction
  2e:   0f b7 50 0a             movzwl 0xa(%rax),%edx
  32:   0f b7 70 08             movzwl 0x8(%rax),%esi
  36:   e8 e4 42 fb ff          call   0xfffffffffffb431f
  3b:   41 89 c2                mov    %eax,%r10d
  3e:   85 c0                   test   %eax,%eax

Code starting with the faulting instruction
===
   0:   48 8b 40 08             mov    0x8(%rax),%rax
   4:   0f b7 50 0a             movzwl 0xa(%rax),%edx
   8:   0f b7 70 08             movzwl 0x8(%rax),%esi
   c:   e8 e4 42 fb ff          call   0xfffffffffffb42f5
  11:   41 89 c2                mov    %eax,%r10d
  14:   85 c0                   test   %eax,%eax
rsyslogd: rsyslogd's groupid changed to 111
kernel: [5.914394] RSP: 0018:a5b3c103f720 EFLAGS: 00010282
kernel: [5.914397] RAX:  RBX: c1d73489 RCX: 

kernel: [5.914399] RDX:  RSI:  RDI: 
91ae4fa8
kernel: [5.914402] RBP: a5b3c103f7b0 R08:  R09: 

kernel: [5.914405] R10: ffea R11:  R12: 
91ae4fa986e8
kernel: [5.914408] R13: 91ae4fa986d8 R14: 91ae4fa986f8 R15: 
91ae4fa8
kernel: [5.914410] FS:  7fdaa343c8c0() GS:91bd5844() 
knlGS:
kernel: [5.914414] CS:  0010 DS:  ES:  CR0: 80050033
kernel: [5.914416] CR2: 0008 CR3: 0001222d CR4: 
00750ef0
kernel: [5.914419] PKRU: 5554
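
Two values in the oops above are easier to read as signed 32-bit numbers: the operand of the compare encoded by the bytes `83 f8 ed` (a sign-extended 8-bit immediate) and the value left in R10 by `mov %eax,%r10d`. A minimal shell sketch follows. The helper names `to_s32` and `imm8_to_s32` are made up for illustration, and reading the archived `R10: ffea` as 0xffffffea is an assumption, since leading digits appear to have been lost in the archive. It also assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# Interpret a 32-bit register value (hex or decimal) as a signed integer.
to_s32() {
    v=$(( $1 & 0xffffffff ))
    if [ "$v" -ge 2147483648 ]; then
        v=$(( v - 4294967296 ))
    fi
    echo "$v"
}

# Sign-extend an 8-bit immediate, such as the 0xed in "83 f8 ed".
imm8_to_s32() {
    v=$(( $1 & 0xff ))
    if [ "$v" -ge 128 ]; then
        v=$(( v - 256 ))
    fi
    echo "$v"
}

imm8_to_s32 0xed     # prints -19
to_s32 0xffffffea    # prints -22 (assumed full value of R10)
```

On Linux, 19 is ENODEV and 22 is EINVAL. The first result is consistent with an `err == -ENODEV` check right after the call; if the assumption about R10 holds, the second would mean that the call returned -EINVAL just before the trapping instruction.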

Best regards,
Mirsad

On 1/18/24 18:23, Mirsad Todorovac wrote:

Hi,

Unfortunately, I was not able to reboot into this kernel again to do the stack
decode, but I thought that any information about the NULL pointer dereference
is better than no info.

The system is Ubuntu 23.10 Mantic with an AMD Navi 23 [Radeon RX 6600/6600
XT/6600M] graphics card.

Please find the config and the hw listing attached.

Best regards,
Mirsad

Re: [BUG][BISECTED] Freeze at loading init ramdisk

2024-01-20 Thread Mirsad Todorovac

On 1/20/24 12:25, Bagas Sanjaya wrote:

On Wed, Jan 17, 2024 at 07:47:49PM +0100, Mirsad Todorovac wrote:

On 1/16/24 01:32, Mirsad Todorovac wrote:

Hi,

On the Ubuntu 22.04 LTS Jammy platform, on a mainline vanilla torvalds tree
kernel, the boot freezes after the first two lines and before any systemd
messages.

(Please find the config attached.)

Bisecting the bug led to this result:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
d97a78423c33f68ca6543de510a409167baed6f5 is the first bad commit
commit d97a78423c33f68ca6543de510a409167baed6f5
Merge: 61da593f4458 689237ab37c5
Author: Linus Torvalds 
Date:   Fri Jan 12 14:38:08 2024 -0800

      Merge tag 'fbdev-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
      Pull fbdev updates from Helge Deller:
   "Three fbdev drivers (~8500 lines of code) removed. The Carillo Ranch
    fbdev driver is for an Intel product which was never shipped, and for
    the intelfb and the amba-clcd drivers the drm drivers can be used
    instead.
    The other code changes are minor: some fb_deferred_io flushing fixes,
    imxfb margin fixes and stifb cleanups.
    Summary:
     - Remove intelfb fbdev driver (Thomas Zimmermann)
     - Remove amba-clcd fbdev driver (Linus Walleij)
     - Remove vmlfb Carillo Ranch fbdev driver (Matthew Wilcox)
     - fb_deferred_io flushing fixes (Nam Cao)
     - imxfb code fixes and cleanups (Dario Binacchi)
     - stifb primary screen detection cleanups (Thomas Zimmermann)"
      * tag 'fbdev-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: (28 commits)
    fbdev/intelfb: Remove driver
    fbdev/hyperv_fb: Do not clear global screen_info
    firmware/sysfb: Clear screen_info state after consuming it
    fbdev/hyperv_fb: Remove firmware framebuffers with aperture helpers
    drm/hyperv: Remove firmware framebuffers with aperture helper
    fbdev/sis: Remove dependency on screen_info
    video/logo: use %u format specifier for unsigned int values
    video/sticore: Remove info field from STI struct
    arch/parisc: Detect primary video device from device instance
    fbdev/stifb: Allocate fb_info instance with framebuffer_alloc()
    video/sticore: Store ROM device in STI struct
    fbdev: flush deferred IO before closing
    fbdev: flush deferred work in fb_deferred_io_fsync()
    fbdev: amba-clcd: Delete the old CLCD driver
    fbdev: Remove support for Carillo Ranch driver
    fbdev: hgafb: fix kernel-doc comments
    fbdev: mmp: Fix typo and wording in code comment
    fbdev: fsl-diu-fb: Fix sparse warning due to virt_to_phys() prototype change
    fbdev: imxfb: add '*/' on a separate line in block comment
    fbdev: imxfb: use __func__ for function name
    ...

   Documentation/fb/index.rst |    1 -
   Documentation/fb/intelfb.rst   |  155 --
   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 -
   MAINTAINERS    |   12 -
   arch/parisc/video/fbdev.c  |    2 +-
   drivers/Makefile   |    3 +-
   drivers/firmware/sysfb.c   |   14 +-
   drivers/gpu/drm/hyperv/hyperv_drm_drv.c    |    8 +-
   drivers/video/backlight/Kconfig    |    7 -
   drivers/video/backlight/Makefile   |    1 -
   drivers/video/backlight/cr_bllcd.c |  264 ---
   drivers/video/fbdev/Kconfig    |   72 -
   drivers/video/fbdev/Makefile   |    2 -
   drivers/video/fbdev/amba-clcd.c    |  986 -
   drivers/video/fbdev/core/fb_defio.c    |    8 +-
   drivers/video/fbdev/fsl-diu-fb.c   |    2 +-
   drivers/video/fbdev/hgafb.c    |   13 +-
   drivers/video/fbdev/hyperv_fb.c    |   20 +-
   drivers/video/fbdev/imxfb.c    |  179 +-
   drivers/video/fbdev/intelfb/Makefile   |    8 -
   drivers/video/fbdev/intelfb/intelfb.h  |  382 
   drivers/video/fbdev/intelfb/intelfb_i2c.c  |  209 --
   drivers/video/fbdev/intelfb/intelfbdrv.c   | 1680 
   drivers/video/fbdev/intelfb/intelfbhw.c    | 2115 

   drivers/video/fbdev/intelfb/intelfbhw.h    |  609 --
   drivers/video/fbdev/mmp/hw/mmp_spi.c   |    2 +-
   drivers/video/fbdev/sis/sis_main.c |   37 -
   drivers/video/fbdev/stifb.c    |  109 +-
   drivers/video/fbdev/vermilion/Makefile |    6 -
   drivers/video/fbdev/vermilion/cr_pll.c |  195 --
   drivers/video/fbdev/vermilion/vermilion.c  | 1175 ---
   drivers/video/fbdev/vermilion/vermilion.h  

Re: REGRESSION: no console on current -git

2024-01-20 Thread Mirsad Todorovac

On 1/20/24 01:32, Jens Axboe wrote:

On 1/19/24 5:27 PM, Helge Deller wrote:

On 1/19/24 22:22, Jens Axboe wrote:

On 1/19/24 2:14 PM, Helge Deller wrote:

On 1/19/24 22:01, Jens Axboe wrote:

On 1/19/24 1:55 PM, Helge Deller wrote:

Adding Mirsad Todorovac (who reported a similar issue).

On 1/19/24 19:39, Jens Axboe wrote:

My trusty R7525 test box is failing to show a console, or in fact anything,
on current -git. There's no output after:

Loading Linux 6.7.0+ ...
Loading initial ramdisk ...

and I don't get a console up. I went through the bisection pain and
found this was the culprit:

commit df67699c9cb0ceb70f6cc60630ca938c06773eda
Author: Thomas Zimmermann 
Date:   Wed Jan 3 11:15:11 2024 +0100

firmware/sysfb: Clear screen_info state after consuming it

With this commit reverted, everything is fine. Looking at dmesg with a
buggy kernel, I get no frame or fb messages. On a good kernel, it looks
like this:

[1.416486] efifb: probing for efifb
[1.416602] efifb: framebuffer at 0xde00, using 3072k, total 3072k
[1.416605] efifb: mode is 1024x768x32, linelength=4096, pages=1
[1.416607] efifb: scrolling: redraw
[1.416608] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[1.449746] fb0: EFI VGA frame buffer device

Happy to test a fix, or barring that, can someone just revert this
commit please?


I've temporarily added a revert patch into the fbdev for-next tree for now,
so people should not face the issue in the for-next series:
https://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev.git/commit/?h=for-next
I'd like to wait for Thomas to return on monday to check the issue
as there are some other upcoming patches in this area from him.


Given the issue (and that I'm not the only one reporting it), can we
please just get that pushed so it'll make -rc1? It can always get
re-introduced in a fixed fashion. I don't run -next here, I rely on
mainline working for my testing.


I agree, it would be good to get it fixed for -rc1.
So, it's ok for me, but I won't be able to test the revert at such short notice.
If you can assure me that the revert fixes it, and builds in git-head,
I can prepare the pull request for Linus now (or he just reverts
commit df67699c9cb0 manually).


I already tested a revert on top of the current tree, and it builds just
fine and boots with a working console. So reverting it does work and
solves the issue.


I sent a pull request with the revert.


Thanks! You forgot the Reported-by, but no big deal.


Hi,

I confirm that reverting df67699c9cb0ce also solved the original initrd boot
problem here:

 git checkout d97a78423c33
 git revert df67699c9cb0ce
 make clean; make olddefconfig
 time nice make -j 36 bindeb-pkg |& tee ../err-6.8-mrg-1.log; date
 sudo apt-get -s install ../linux-image-6.7.0-bagas-vanilla-rvt-09751-g6b082430adc8_6.7.0-09751-g6b082430adc8-26_amd64.deb
 sudo apt-get -y install ../linux-image-6.7.0-bagas-vanilla-rvt-09751-g6b082430adc8_6.7.0-09751-g6b082430adc8-26_amd64.deb

You might add:

Tested-by: Mirsad Goran Todorovac 

at your convenience.

Best regards,
Mirsad


Re: [BUG][BISECTED] Freeze at loading init ramdisk

2024-01-18 Thread Mirsad Todorovac




On 1/18/24 22:14, Uwe Kleine-König wrote:

On Thu, Jan 18, 2024 at 09:04:05PM +0100, Mirsad Todorovac wrote:



On 1/18/24 08:45, Uwe Kleine-König wrote:

Hello Mirsad,

On Wed, Jan 17, 2024 at 07:47:49PM +0100, Mirsad Todorovac wrote:

On 1/16/24 01:32, Mirsad Todorovac wrote:

On the Ubuntu 22.04 LTS Jammy platform, on a mainline vanilla torvalds tree
kernel, the boot freezes after the first two lines and before any systemd
messages.

(Please find the config attached.)

Bisecting the bug led to this result:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
d97a78423c33f68ca6543de510a409167baed6f5 is the first bad commit
commit d97a78423c33f68ca6543de510a409167baed6f5
Merge: 61da593f4458 689237ab37c5
Author: Linus Torvalds 
Date:   Fri Jan 12 14:38:08 2024 -0800

[...]

Hope this helps.


P.S.

As I see that this is a larger merge commit, with 5K+ lines changed, I don't
think I can bisect further to determine the culprit.


Actually it's not that hard. If a merge commit is the first bad commit
for a bisection, either the merge wasn't done correctly (less likely,
looking at d97a78423c33f68ca6543de510a409167baed6f5 I'd bet this isn't
the problem); or changes on different sides conflict or you did
something wrong during bisection.

To rule out the third option, you can just retest d97a78423c33,
61da593f4458 and 689237ab37c5. If d97a78423c33 is the only bad one, you
did it right.


This was confirmed.


Then to further debug the second option you can find out the offending
commit on each side with a bisection as follows, here for the RHS (i.e.
689237ab37c5):

git bisect start 689237ab37c5 $(git merge-base 61da593f4458 689237ab37c5)

and then in each bisection step do:

git merge --no-commit 61da593f4458
test if the problem is present
git reset --hard
git bisect good/bad

In this case you get merge conflicts in drivers/video/fbdev/amba-clcd.c
and drivers/video/fbdev/vermilion/vermilion.c. On the assumption that
you don't have these enabled in your .config, you can just ignore them.

Side note: A problem during bisection can be that the .config changes
along the process. You should put your config into (say)
arch/x86/configs/lala_defconfig and do

make lala_defconfig

before building each step to prevent this.


I must have done something wrong:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect log
# bad: [689237ab37c59b9909bc9371d7fece3081683fba] fbdev/intelfb: Remove driver
# good: [de927f6c0b07d9e698416c5b287c521b07694cac] Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect start '689237ab37c5' 'de927f6c0b07d9e698416c5b287c521b07694cac'
# good: [d9f25b59ed85ae45801cf45fe17eb269b0ef3038] fbdev: Remove support for Carillo Ranch driver
git bisect good d9f25b59ed85ae45801cf45fe17eb269b0ef3038
# good: [e2e0b838a1849f92612a8305c09aaf31bf824350] video/sticore: Remove info field from STI struct
git bisect good e2e0b838a1849f92612a8305c09aaf31bf824350
# good: [778e73d2411abc8f3a2d60dbf038acaec218792e] drm/hyperv: Remove firmware framebuffers with aperture helper
git bisect good 778e73d2411abc8f3a2d60dbf038acaec218792e
# good: [df67699c9cb0ceb70f6cc60630ca938c06773eda] firmware/sysfb: Clear screen_info state after consuming it
git bisect good df67699c9cb0ceb70f6cc60630ca938c06773eda
marvin@defiant:~/linux/kernel/linux_torvalds$

with the error:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 0 steps)
drivers/video/fbdev/amba-clcd.c: needs merge
drivers/video/fbdev/vermilion/vermilion.c: needs merge
error: you need to resolve your current index first


It seems you forgot the "git reset --hard" step.  Doing it in this state
should still be possible.


Well, it was possible, but I obviously got the wrong result:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect log
# bad: [689237ab37c59b9909bc9371d7fece3081683fba] fbdev/intelfb: Remove driver
# good: [de927f6c0b07d9e698416c5b287c521b07694cac] Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect start '689237ab37c5' 'de927f6c0b07d9e698416c5b287c521b07694cac'
# good: [d9f25b59ed85ae45801cf45fe17eb269b0ef3038] fbdev: Remove support for Carillo Ranch driver
git bisect good d9f25b59ed85ae45801cf45fe17eb269b0ef3038
# good: [e2e0b838a1849f92612a8305c09aaf31bf824350] video/sticore: Remove info field from STI struct
git bisect good e2e0b838a1849f92612a8305c09aaf31bf824350
# good: [778e73d2411abc8f3a2d60dbf038acaec218792e] drm/hyperv: Remove firmware framebuffers with aperture helper
git bisect good 778e73d2411abc8f3a2d60dbf038acaec218792e
# good: [df67699c9cb0ceb70f6cc60630ca938c06773eda] firmware/sysfb: Clear screen_info state after consuming it
git bisect good df67699c9cb0ceb70f6cc60630ca938c06773eda
# good: [df67699c9cb0ceb70f6cc60630ca938c06773eda] firmware/sysfb: Clear screen_info state after consuming it
git b

Re: [BUG][BISECTED] Freeze at loading init ramdisk

2024-01-18 Thread Mirsad Todorovac




On 1/18/24 08:45, Uwe Kleine-König wrote:

Hello Mirsad,

On Wed, Jan 17, 2024 at 07:47:49PM +0100, Mirsad Todorovac wrote:

On 1/16/24 01:32, Mirsad Todorovac wrote:

On the Ubuntu 22.04 LTS Jammy platform, on a mainline vanilla torvalds tree
kernel, the boot freezes after the first two lines and before any systemd
messages.

(Please find the config attached.)

Bisecting the bug led to this result:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
d97a78423c33f68ca6543de510a409167baed6f5 is the first bad commit
commit d97a78423c33f68ca6543de510a409167baed6f5
Merge: 61da593f4458 689237ab37c5
Author: Linus Torvalds 
Date:   Fri Jan 12 14:38:08 2024 -0800

[...]

Hope this helps.


P.S.

As I see that this is a larger merge commit, with 5K+ lines changed, I don't
think I can bisect further to determine the culprit.


Actually it's not that hard. If a merge commit is the first bad commit
for a bisection, either the merge wasn't done correctly (less likely,
looking at d97a78423c33f68ca6543de510a409167baed6f5 I'd bet this isn't
the problem); or changes on different sides conflict or you did
something wrong during bisection.

To rule out the third option, you can just retest d97a78423c33,
61da593f4458 and 689237ab37c5. If d97a78423c33 is the only bad one, you
did it right.


This was confirmed.


Then to further debug the second option you can find out the offending
commit on each side with a bisection as follows, here for the RHS (i.e.
689237ab37c5):

git bisect start 689237ab37c5 $(git merge-base 61da593f4458 689237ab37c5)

and then in each bisection step do:

git merge --no-commit 61da593f4458
test if the problem is present
git reset --hard
git bisect good/bad

In this case you get merge conflicts in drivers/video/fbdev/amba-clcd.c
and drivers/video/fbdev/vermilion/vermilion.c. On the assumption that
you don't have these enabled in your .config, you can just ignore them.

Side note: A problem during bisection can be that the .config changes
along the process. You should put your config into (say)
arch/x86/configs/lala_defconfig and do

make lala_defconfig

before building each step to prevent this.


I must have done something wrong:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect log
# bad: [689237ab37c59b9909bc9371d7fece3081683fba] fbdev/intelfb: Remove driver
# good: [de927f6c0b07d9e698416c5b287c521b07694cac] Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect start '689237ab37c5' 'de927f6c0b07d9e698416c5b287c521b07694cac'
# good: [d9f25b59ed85ae45801cf45fe17eb269b0ef3038] fbdev: Remove support for Carillo Ranch driver
git bisect good d9f25b59ed85ae45801cf45fe17eb269b0ef3038
# good: [e2e0b838a1849f92612a8305c09aaf31bf824350] video/sticore: Remove info field from STI struct
git bisect good e2e0b838a1849f92612a8305c09aaf31bf824350
# good: [778e73d2411abc8f3a2d60dbf038acaec218792e] drm/hyperv: Remove firmware framebuffers with aperture helper
git bisect good 778e73d2411abc8f3a2d60dbf038acaec218792e
# good: [df67699c9cb0ceb70f6cc60630ca938c06773eda] firmware/sysfb: Clear screen_info state after consuming it
git bisect good df67699c9cb0ceb70f6cc60630ca938c06773eda
marvin@defiant:~/linux/kernel/linux_torvalds$

with the error:

marvin@defiant:~/linux/kernel/linux_torvalds$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 0 steps)
drivers/video/fbdev/amba-clcd.c: needs merge
drivers/video/fbdev/vermilion/vermilion.c: needs merge
error: you need to resolve your current index first
marvin@defiant:~/linux/kernel/linux_torvalds$

Best regards,
Mirsad


Best regards
Uwe



Re: [BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]

2023-08-25 Thread Mirsad Todorovac

Hi Christian,

Aha, thanks, that explains it. Then the KCSAN report must be a false positive.

Kind regards,
Mirsad

On 8/25/23 09:05, Christian König wrote:

Hi Mirsad,

the name SPSC stands for SingleProducerSingleConsumer, so yes, even the name
of the component makes it clear that it can only be used by one producer
thread and one consumer thread at the same time.

Here, preemption is disabled only so that the consumer thread doesn't busy-loop
waiting for the producer thread to be scheduled in again.

Regards,
Christian.
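
For readers following along, the single-producer/single-consumer contract
described above can be sketched in userspace with C11 atomics. This is only an
illustration under assumptions, not the kernel header: C11 atomics stand in for
atomic_long_xchg(), WRITE_ONCE() and smp_wmb(), there is no userspace
equivalent of preempt_disable(), the payload field and the spsc_link typedef
are made up for the example, and the pop side is a reconstruction of the idea
rather than a copy of the kernel code. It is only safe with exactly one
pushing thread and one popping thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct spsc_node;
typedef _Atomic(struct spsc_node *) spsc_link;	/* atomic "next" pointer */

struct spsc_node {
	spsc_link next;
	int payload;			/* hypothetical job data */
};

struct spsc_queue {
	spsc_link head;			/* read by the consumer */
	_Atomic(spsc_link *) tail;	/* address of the last link in the chain */
	atomic_int job_count;
};

static void spsc_queue_init(struct spsc_queue *q)
{
	atomic_init(&q->head, NULL);
	atomic_init(&q->tail, &q->head);
	atomic_init(&q->job_count, 0);
}

/* Producer only. Returns true if the queue was empty before the push. */
static bool spsc_queue_push(struct spsc_queue *q, struct spsc_node *node)
{
	spsc_link *prev;

	assert(node != NULL);
	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	/* Claim the tail slot first, then link the node into it. */
	prev = atomic_exchange(&q->tail, &node->next);
	atomic_store_explicit(prev, node, memory_order_release);
	atomic_fetch_add(&q->job_count, 1);
	return prev == &q->head;
}

/* Consumer only. Returns NULL when no node is linked in yet. */
static struct spsc_node *spsc_queue_pop(struct spsc_queue *q)
{
	struct spsc_node *node, *next;
	spsc_link *expected;

	node = atomic_load_explicit(&q->head, memory_order_acquire);
	if (!node)
		return NULL;

	next = atomic_load_explicit(&node->next, memory_order_acquire);
	atomic_store_explicit(&q->head, next, memory_order_relaxed);
	if (!next) {
		/* Last element: try to point tail back at head. */
		expected = &node->next;
		if (!atomic_compare_exchange_strong(&q->tail, &expected, &q->head)) {
			/* The producer already claimed the slot; wait for its link. */
			do {
				next = atomic_load_explicit(&node->next,
							    memory_order_acquire);
			} while (!next);
			atomic_store_explicit(&q->head, next, memory_order_relaxed);
		}
	}
	atomic_fetch_sub(&q->job_count, 1);
	return node;
}
```

Note that between the exchange on tail and the store that links the node, the
consumer can observe an empty head even though a push is in progress. In this
sketch that window is tolerated by design: the consumer simply sees the node on
a later look, which is the kind of benign, intended race discussed in this
thread.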

On 24.08.23 at 19:46, Mirsad Goran Todorovac wrote:

Thank you, Christian.

Glad to hear about that.

However, I guess this assumes that this piece of code between

-<>-
 preempt_disable();

 tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
 WRITE_ONCE(*tail, node);
 atomic_inc(&queue->job_count);

 /*
  * In case of first element verify new node will be visible to the consumer
  * thread when we ping the kernel thread that there is new work to do.
  */
 smp_wmb();

 preempt_enable();
-<>-

... executes only on one CPU/core/thread?

I understood that preempt_disable() disables preemption only on the local core/CPU:

https://kernelnewbies.kernelnewbies.narkive.com/6LTlgsAe/preempt-disable-disables-preemption-on-all-processors

So, we might have a race in theory between WRITE_ONCE() and atomic_inc().

Kind regards,
Mirsad


On 8/21/2023 8:22 PM, Christian König wrote:

I'm not sure about that.

On the one hand it might generate some noise. I know tons of cases where the
logic is: OK, if we see the updated value immediately it will optimize things,
but if not it's unproblematic because there is another check after the next
memory barrier.

On the other hand we probably have cases where this is not correctly
implemented. So double checking those would most likely be a good idea.

Regards,
Christian.
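
The "optimistic check now, authoritative check after the next barrier" logic
mentioned above can be illustrated with a small C11 sketch. Everything here is
hypothetical (the toy_sched structure, the kicked flag and both functions are
invented for the example) and it is not how the DRM scheduler is implemented.
It only shows why a stale read of the hint is harmless when a later ordered
check catches whatever the hint missed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_sched {
	atomic_int queued;	/* jobs waiting; read unordered as a hint */
	atomic_bool kicked;	/* set by the producer after queuing */
};

/* Producer side: queue one job, then publish a wakeup flag. */
static void toy_push(struct toy_sched *s)
{
	assert(s);
	atomic_fetch_add_explicit(&s->queued, 1, memory_order_relaxed);
	/* Release: the increment above is visible to whoever sees the kick. */
	atomic_store_explicit(&s->kicked, true, memory_order_release);
}

/* Single consumer: runs jobs until idle and returns how many were run. */
static int toy_run_until_idle(struct toy_sched *s)
{
	int done = 0;

	assert(s);
	for (;;) {
		/* Optimistic check: may miss a push that is still in flight. */
		while (atomic_load_explicit(&s->queued, memory_order_relaxed) > 0) {
			atomic_fetch_sub_explicit(&s->queued, 1, memory_order_relaxed);
			done++;
		}
		/* Ordered re-check before going idle: consume the kick flag. */
		if (!atomic_exchange_explicit(&s->kicked, false, memory_order_acq_rel))
			return done;
	}
}
```

If the hint is read too early, the worst case in this sketch is one extra pass
through the loop, or a job that is picked up on the next call because the kick
flag is still set. A data-race detector would still flag the unordered read if
it were a plain access, which is the sort of noise referred to above.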

On 21.08.23 at 16:28, Mirsad Todorovac wrote:

Hi Christian,

Thank you for the update.

Should I continue reporting what KCSAN gives? I will try to filter these to
save you time on evaluation ...

Kind regards,
Mirsad

On 8/21/23 15:20, Christian König wrote:

Hi Mirsad,

well this is a false positive.

That drm_sched_entity_is_ready() doesn't see the data written by 
drm_sched_entity_push_job() is part of the logic here.

Regards,
Christian.

On 18.08.23 at 15:44, Mirsad Todorovac wrote:

On 8/17/23 21:54, Mirsad Todorovac wrote:

Hi,

This is your friendly bug reporter.

The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS and a Ryzen 
7950X box.

Please find attached the complete dmesg output from the ring buffer and lshw 
output.

NOTE: The kernel reports itself as tainted, but to my knowledge no proprietary
modules are loaded; the taint was set by earlier bugs.

dmesg excerpt:

[ 8791.864576] 
==
[ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / 
drm_sched_entity_push_job [gpu_sched]

[ 8791.864776] write (marked) to 0x9b74491b7c40 of 8 bytes by task 3807 on 
cpu 18:
[ 8791.864788]  drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched]
[ 8791.864852]  amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu]
[ 8791.868731]  drm_ioctl_kernel+0x127/0x210 [drm]
[ 8791.869222]  drm_ioctl+0x38f/0x6f0 [drm]
[ 8791.869711]  amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
[ 8791.873660]  __x64_sys_ioctl+0xd2/0x120
[ 8791.873676]  do_syscall_64+0x58/0x90
[ 8791.873688]  entry_SYSCALL_64_after_hwframe+0x73/0xdd

[ 8791.873710] read to 0x9b74491b7c40 of 8 bytes by task 1119 on cpu 27:
[ 8791.873722]  drm_sched_entity_is_ready+0x16/0x50 [gpu_sched]
[ 8791.873786]  drm_sched_select_entity+0x1c7/0x220 [gpu_sched]
[ 8791.873849]  drm_sched_main+0xd2/0x500 [gpu_sched]
[ 8791.873912]  kthread+0x18b/0x1d0
[ 8791.873924]  ret_from_fork+0x43/0x70
[ 8791.873939]  ret_from_fork_asm+0x1b/0x30

[ 8791.873955] value changed: 0x -> 0x9b750ebcfc00

[ 8791.873971] Reported by Kernel Concurrency Sanitizer on:
[ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: G L 
6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
[ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, 
BIOS 1.21 04/26/2023
[ 8791.874002] 
==


P.S.

According to Mr. Heo's instructions, I am adding the unwound trace here:

[ 1879.706518] 
==
[ 1879.706616] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / 
drm_sched_entity_push_job [gpu_sched]

[ 1879.706737] write (marked) to 0x8f3672748c40 of 8 bytes by task 4087 on 
cpu 10:
[ 1879.706748] drm_sched_entity_push_job (./include/drm/spsc_queue.h:74 
drivers/gpu/drm/scheduler/sched_entity.c:574) gpu_sched
[ 1879.706808] amdgpu_cs_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1375 
drivers/gpu/drm/amd/amdgpu/

Re: [BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]

2023-08-21 Thread Mirsad Todorovac

Hi Christian,

Thank you for the update.

Should I continue reporting what KCSAN gives? I will try to filter these to
save you time on evaluation ...

Kind regards,
Mirsad

On 8/21/23 15:20, Christian König wrote:

Hi Mirsad,

well this is a false positive.

That drm_sched_entity_is_ready() doesn't see the data written by 
drm_sched_entity_push_job() is part of the logic here.

Regards,
Christian.

On 18.08.23 at 15:44, Mirsad Todorovac wrote:

On 8/17/23 21:54, Mirsad Todorovac wrote:

Hi,

This is your friendly bug reporter.

The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS and a Ryzen 
7950X box.

Please find attached the complete dmesg output from the ring buffer and lshw 
output.

NOTE: The kernel reports itself as tainted, but to my knowledge no proprietary
modules are loaded; the taint was set by earlier bugs.

dmesg excerpt:

[ 8791.864576] 
==
[ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / 
drm_sched_entity_push_job [gpu_sched]

[ 8791.864776] write (marked) to 0x9b74491b7c40 of 8 bytes by task 3807 on 
cpu 18:
[ 8791.864788]  drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched]
[ 8791.864852]  amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu]
[ 8791.868731]  drm_ioctl_kernel+0x127/0x210 [drm]
[ 8791.869222]  drm_ioctl+0x38f/0x6f0 [drm]
[ 8791.869711]  amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
[ 8791.873660]  __x64_sys_ioctl+0xd2/0x120
[ 8791.873676]  do_syscall_64+0x58/0x90
[ 8791.873688]  entry_SYSCALL_64_after_hwframe+0x73/0xdd

[ 8791.873710] read to 0x9b74491b7c40 of 8 bytes by task 1119 on cpu 27:
[ 8791.873722]  drm_sched_entity_is_ready+0x16/0x50 [gpu_sched]
[ 8791.873786]  drm_sched_select_entity+0x1c7/0x220 [gpu_sched]
[ 8791.873849]  drm_sched_main+0xd2/0x500 [gpu_sched]
[ 8791.873912]  kthread+0x18b/0x1d0
[ 8791.873924]  ret_from_fork+0x43/0x70
[ 8791.873939]  ret_from_fork_asm+0x1b/0x30

[ 8791.873955] value changed: 0x -> 0x9b750ebcfc00

[ 8791.873971] Reported by Kernel Concurrency Sanitizer on:
[ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: G L 
6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
[ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, 
BIOS 1.21 04/26/2023
[ 8791.874002] 
==


P.S.

According to Mr. Heo's instructions, I am adding the unwound trace here:

[ 1879.706518] 
==
[ 1879.706616] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / 
drm_sched_entity_push_job [gpu_sched]

[ 1879.706737] write (marked) to 0x8f3672748c40 of 8 bytes by task 4087 on 
cpu 10:
[ 1879.706748] drm_sched_entity_push_job (./include/drm/spsc_queue.h:74 
drivers/gpu/drm/scheduler/sched_entity.c:574) gpu_sched
[ 1879.706808] amdgpu_cs_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1375 
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1469) amdgpu
[ 1879.710589] drm_ioctl_kernel (drivers/gpu/drm/drm_ioctl.c:788) drm
[ 1879.711068] drm_ioctl (drivers/gpu/drm/drm_ioctl.c:892) drm
[ 1879.711551] amdgpu_drm_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:2748) 
amdgpu
[ 1879.715319] __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:870 fs/ioctl.c:856 
fs/ioctl.c:856)
[ 1879.715334] do_syscall_64 (arch/x86/entry/common.c:50 
arch/x86/entry/common.c:80)
[ 1879.715345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

[ 1879.715365] read to 0x8f3672748c40 of 8 bytes by task 1098 on cpu 11:
[ 1879.715376] drm_sched_entity_is_ready 
(drivers/gpu/drm/scheduler/sched_entity.c:134) gpu_sched
[ 1879.715435] drm_sched_select_entity 
(drivers/gpu/drm/scheduler/sched_main.c:248 
drivers/gpu/drm/scheduler/sched_main.c:893) gpu_sched
[ 1879.715495] drm_sched_main (drivers/gpu/drm/scheduler/sched_main.c:1019) 
gpu_sched
[ 1879.715554] kthread (kernel/kthread.c:389)
[ 1879.715563] ret_from_fork (arch/x86/kernel/process.c:145)
[ 1879.715575] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

[ 1879.715590] value changed: 0x -> 0x8f360663dc00

[ 1879.715604] Reported by Kernel Concurrency Sanitizer on:
[ 1879.715612] CPU: 11 PID: 1098 Comm: gfx_0.0.0 Tainted: G L 
6.5.0-rc6+ #47
[ 1879.715624] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, 
BIOS 1.21 04/26/2023
[ 1879.715631] 
==

It seems that the line in question might be:

first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);

which expands to:

static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
{
	struct spsc_node **tail;

	node->next = NULL;

	preempt_disable();

	tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
	WRITE_ONCE(*tail, node);
	atomic_inc(&queue->job_count);

	/*
	 * In case of first element verify 

[BUG]: amdgpu: soft lockup - CPU#1 stuck for 26s! [systemd-udevd:635]

2023-08-20 Thread Mirsad Todorovac
69.199110] ? vcn_v1_0_enc_ring_emit_fence 
(drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1647 (discriminator 1)) amdgpu
[   69.204958] ? __pfx_vcn_v1_0_enc_ring_emit_fence 
(drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu
[   69.210899] ? __pfx_vcn_v1_0_enc_ring_emit_fence 
(drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu
[   69.216910] ? __pfx_vcn_v1_0_enc_ring_emit_fence 
(drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu
[   69.222561] module_address_lookup+0x8c/0xe0
[   69.222573] ? __pfx_vcn_v1_0_enc_ring_emit_fence 
(drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu
[   69.228237] kallsyms_lookup_buildid+0x107/0x1b0
[   69.228251] ? __pfx_vcn_v1_0_enc_ring_emit_fence 
(drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu
[   69.234368] kallsyms_lookup+0x14/0x30
[   69.234381] test_for_valid_rec+0x38/0x90
[   69.234411] ? sched_clock_noinstr+0x9/0x10
[   69.234448] ? srso_alias_return_thunk+0x5/0x7f
[   69.234459] ? __mutex_lock_slowpath+0x13/0x20
[   69.234470] ? srso_alias_return_thunk+0x5/0x7f
[   69.234481] ? mutex_lock+0xa7/0xb0
[   69.234492] ftrace_module_enable+0x22e/0x3b0
[   69.234525] load_module+0x3357/0x3980
[   69.234533] ? aa_file_perm+0x1fc/0x800
[   69.234562] ? srso_alias_return_thunk+0x5/0x7f
[   69.234593] ? security_kernel_post_read_file+0x79/0x90
[   69.234618] init_module_from_file+0xdf/0x130
[   69.234642] ? srso_alias_return_thunk+0x5/0x7f
[   69.234653] ? init_module_from_file+0xdf/0x130
[   69.234668] idempotent_init_module+0x241/0x360
[   69.234683] __x64_sys_finit_module+0x8e/0xf0
[   69.234693] do_syscall_64+0x58/0x90
[   69.234705] ? srso_alias_return_thunk+0x5/0x7f
[   69.234716] ? exit_to_user_mode_prepare+0x76/0x230
[   69.234748] ? srso_alias_return_thunk+0x5/0x7f
[   69.234758] ? syscall_exit_to_user_mode+0x29/0x40
[   69.234769] ? srso_alias_return_thunk+0x5/0x7f
[   69.234780] ? do_syscall_64+0x68/0x90
[   69.234803] ? srso_alias_return_thunk+0x5/0x7f
[   69.234830] ? exit_to_user_mode_prepare+0x76/0x230
[   69.234841] ? srso_alias_return_thunk+0x5/0x7f
[   69.234852] ? syscall_exit_to_user_mode+0x29/0x40
[   69.234869] ? srso_alias_return_thunk+0x5/0x7f
[   69.234888] ? do_syscall_64+0x68/0x90
[   69.234897] ? srso_alias_return_thunk+0x5/0x7f
[   69.234922] ? do_syscall_64+0x68/0x90
[   69.234952] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[   69.234978] RIP: 0033:0x7f452d11ea3d
[ 69.234996] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 
73 01 c3 48 8b 0d c3 a3 0f 00 f7 d8 64 89 01 48
All code

   0:   5b                      pop    %rbx
   1:   41 5c                   pop    %r12
   3:   c3                      ret
   4:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
   b:   00 00
   d:   f3 0f 1e fa             endbr64
  11:   48 89 f8                mov    %rdi,%rax
  14:   48 89 f7                mov    %rsi,%rdi
  17:   48 89 d6                mov    %rdx,%rsi
  1a:   48 89 ca                mov    %rcx,%rdx
  1d:   4d 89 c2                mov    %r8,%r10
  20:   4d 89 c8                mov    %r9,%r8
  23:   4c 8b 4c 24 08          mov    0x8(%rsp),%r9
  28:   0f 05                   syscall
  2a:*  48 3d 01 f0 ff ff       cmp    $0xf001,%rax     <-- trapping instruction
  30:   73 01                   jae    0x33
  32:   c3                      ret
  33:   48 8b 0d c3 a3 0f 00    mov    0xfa3c3(%rip),%rcx        # 0xfa3fd
  3a:   f7 d8                   neg    %eax
  3c:   64 89 01                mov    %eax,%fs:(%rcx)
  3f:   48                      rex.W

Code starting with the faulting instruction
===
   0:   48 3d 01 f0 ff ff       cmp    $0xf001,%rax
   6:   73 01                   jae    0x9
   8:   c3                      ret
   9:   48 8b 0d c3 a3 0f 00    mov    0xfa3c3(%rip),%rcx        # 0xfa3d3
  10:   f7 d8                   neg    %eax
  12:   64 89 01                mov    %eax,%fs:(%rcx)
  15:   48                      rex.W
[   69.235005] RSP: 002b:7ffda20bffe8 EFLAGS: 0246 ORIG_RAX: 0139
[   69.235020] RAX: ffda RBX: 5561184c0f30 RCX: 7f452d11ea3d
[   69.235028] RDX:  RSI: 55611837ad80 RDI: 001a
[   69.235035] RBP: 0002 R08:  R09: 0002
[   69.235052] R10: 001a R11: 0246 R12: 55611837ad80
[   69.235059] R13: 55611836bc10 R14:  R15: 5561184ba330
[   69.235072]  
[   69.462372] 
==

Best regards,
Mirsad Todorovac


[BUG]: amdgpu: soft lockup - CPU#1 stuck for 26s! [systemd-udevd:635]

2023-08-20 Thread Mirsad Todorovac
69.210899] ? __pfx_vcn_v1_0_enc_ring_emit_fence 
(drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu
[   69.216910] ? __pfx_vcn_v1_0_enc_ring_emit_fence 
(drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu
[   69.222561] module_address_lookup+0x8c/0xe0
[   69.222573] ? __pfx_vcn_v1_0_enc_ring_emit_fence 
(drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu
[   69.228237] kallsyms_lookup_buildid+0x107/0x1b0
[   69.228251] ? __pfx_vcn_v1_0_enc_ring_emit_fence 
(drivers/gpu/drm/amd/amdgpu/vcn_v1_0.c:1646) amdgpu
[   69.234368] kallsyms_lookup+0x14/0x30
[   69.234381] test_for_valid_rec+0x38/0x90
[   69.234411] ? sched_clock_noinstr+0x9/0x10
[   69.234448] ? srso_alias_return_thunk+0x5/0x7f
[   69.234459] ? __mutex_lock_slowpath+0x13/0x20
[   69.234470] ? srso_alias_return_thunk+0x5/0x7f
[   69.234481] ? mutex_lock+0xa7/0xb0
[   69.234492] ftrace_module_enable+0x22e/0x3b0
[   69.234525] load_module+0x3357/0x3980
[   69.234533] ? aa_file_perm+0x1fc/0x800
[   69.234562] ? srso_alias_return_thunk+0x5/0x7f
[   69.234593] ? security_kernel_post_read_file+0x79/0x90
[   69.234618] init_module_from_file+0xdf/0x130
[   69.234642] ? srso_alias_return_thunk+0x5/0x7f
[   69.234653] ? init_module_from_file+0xdf/0x130
[   69.234668] idempotent_init_module+0x241/0x360
[   69.234683] __x64_sys_finit_module+0x8e/0xf0
[   69.234693] do_syscall_64+0x58/0x90
[   69.234705] ? srso_alias_return_thunk+0x5/0x7f
[   69.234716] ? exit_to_user_mode_prepare+0x76/0x230
[   69.234748] ? srso_alias_return_thunk+0x5/0x7f
[   69.234758] ? syscall_exit_to_user_mode+0x29/0x40
[   69.234769] ? srso_alias_return_thunk+0x5/0x7f
[   69.234780] ? do_syscall_64+0x68/0x90
[   69.234803] ? srso_alias_return_thunk+0x5/0x7f
[   69.234830] ? exit_to_user_mode_prepare+0x76/0x230
[   69.234841] ? srso_alias_return_thunk+0x5/0x7f
[   69.234852] ? syscall_exit_to_user_mode+0x29/0x40
[   69.234869] ? srso_alias_return_thunk+0x5/0x7f
[   69.234888] ? do_syscall_64+0x68/0x90
[   69.234897] ? srso_alias_return_thunk+0x5/0x7f
[   69.234922] ? do_syscall_64+0x68/0x90
[   69.234952] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[   69.234978] RIP: 0033:0x7f452d11ea3d
[ 69.234996] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 
73 01 c3 48 8b 0d c3 a3 0f 00 f7 d8 64 89 01 48
All code

   0:   5b                      pop    %rbx
   1:   41 5c                   pop    %r12
   3:   c3                      ret
   4:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
   b:   00 00
   d:   f3 0f 1e fa             endbr64
  11:   48 89 f8                mov    %rdi,%rax
  14:   48 89 f7                mov    %rsi,%rdi
  17:   48 89 d6                mov    %rdx,%rsi
  1a:   48 89 ca                mov    %rcx,%rdx
  1d:   4d 89 c2                mov    %r8,%r10
  20:   4d 89 c8                mov    %r9,%r8
  23:   4c 8b 4c 24 08          mov    0x8(%rsp),%r9
  28:   0f 05                   syscall
  2a:*  48 3d 01 f0 ff ff       cmp    $0xf001,%rax     <-- trapping instruction
  30:   73 01                   jae    0x33
  32:   c3                      ret
  33:   48 8b 0d c3 a3 0f 00    mov    0xfa3c3(%rip),%rcx        # 0xfa3fd
  3a:   f7 d8                   neg    %eax
  3c:   64 89 01                mov    %eax,%fs:(%rcx)
  3f:   48                      rex.W

Code starting with the faulting instruction
===
   0:   48 3d 01 f0 ff ff       cmp    $0xf001,%rax
   6:   73 01                   jae    0x9
   8:   c3                      ret
   9:   48 8b 0d c3 a3 0f 00    mov    0xfa3c3(%rip),%rcx        # 0xfa3d3
  10:   f7 d8                   neg    %eax
  12:   64 89 01                mov    %eax,%fs:(%rcx)
  15:   48                      rex.W
[   69.235005] RSP: 002b:7ffda20bffe8 EFLAGS: 0246 ORIG_RAX: 0139
[   69.235020] RAX: ffda RBX: 5561184c0f30 RCX: 7f452d11ea3d
[   69.235028] RDX:  RSI: 55611837ad80 RDI: 001a
[   69.235035] RBP: 0002 R08:  R09: 0002
[   69.235052] R10: 001a R11: 0246 R12: 55611837ad80
[   69.235059] R13: 55611836bc10 R14:  R15: 5561184ba330
[   69.235072]  
[   69.462372] 
==

Best regards,
Mirsad Todorovac

config-6.5.0-rc7-kcsan-g706a74159504.xz
Description: application/xz


lshw.txt.xz
Description: application/xz


Re: [BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]

2023-08-18 Thread Mirsad Todorovac

On 8/17/23 21:54, Mirsad Todorovac wrote:

Hi,

This is your friendly bug reporter.

The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS and a Ryzen 
7950X box.

Please find attached the complete dmesg output from the ring buffer and lshw 
output.

NOTE: The kernel reports itself as tainted, but to my knowledge no proprietary
modules are loaded; the taint was set by earlier bugs.

dmesg excerpt:

[ 8791.864576] 
==
[ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / 
drm_sched_entity_push_job [gpu_sched]

[ 8791.864776] write (marked) to 0x9b74491b7c40 of 8 bytes by task 3807 on 
cpu 18:
[ 8791.864788]  drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched]
[ 8791.864852]  amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu]
[ 8791.868731]  drm_ioctl_kernel+0x127/0x210 [drm]
[ 8791.869222]  drm_ioctl+0x38f/0x6f0 [drm]
[ 8791.869711]  amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
[ 8791.873660]  __x64_sys_ioctl+0xd2/0x120
[ 8791.873676]  do_syscall_64+0x58/0x90
[ 8791.873688]  entry_SYSCALL_64_after_hwframe+0x73/0xdd

[ 8791.873710] read to 0x9b74491b7c40 of 8 bytes by task 1119 on cpu 27:
[ 8791.873722]  drm_sched_entity_is_ready+0x16/0x50 [gpu_sched]
[ 8791.873786]  drm_sched_select_entity+0x1c7/0x220 [gpu_sched]
[ 8791.873849]  drm_sched_main+0xd2/0x500 [gpu_sched]
[ 8791.873912]  kthread+0x18b/0x1d0
[ 8791.873924]  ret_from_fork+0x43/0x70
[ 8791.873939]  ret_from_fork_asm+0x1b/0x30

[ 8791.873955] value changed: 0x -> 0x9b750ebcfc00

[ 8791.873971] Reported by Kernel Concurrency Sanitizer on:
[ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: G L 
6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
[ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, 
BIOS 1.21 04/26/2023
[ 8791.874002] 
==


P.S.

According to Mr. Heo's instructions, I am adding the unwound trace here:

[ 1879.706518] 
==
[ 1879.706616] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / 
drm_sched_entity_push_job [gpu_sched]

[ 1879.706737] write (marked) to 0x8f3672748c40 of 8 bytes by task 4087 on 
cpu 10:
[ 1879.706748] drm_sched_entity_push_job (./include/drm/spsc_queue.h:74 
drivers/gpu/drm/scheduler/sched_entity.c:574) gpu_sched
[ 1879.706808] amdgpu_cs_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1375 
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1469) amdgpu
[ 1879.710589] drm_ioctl_kernel (drivers/gpu/drm/drm_ioctl.c:788) drm
[ 1879.711068] drm_ioctl (drivers/gpu/drm/drm_ioctl.c:892) drm
[ 1879.711551] amdgpu_drm_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:2748) 
amdgpu
[ 1879.715319] __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:870 fs/ioctl.c:856 
fs/ioctl.c:856)
[ 1879.715334] do_syscall_64 (arch/x86/entry/common.c:50 
arch/x86/entry/common.c:80)
[ 1879.715345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

[ 1879.715365] read to 0x8f3672748c40 of 8 bytes by task 1098 on cpu 11:
[ 1879.715376] drm_sched_entity_is_ready 
(drivers/gpu/drm/scheduler/sched_entity.c:134) gpu_sched
[ 1879.715435] drm_sched_select_entity 
(drivers/gpu/drm/scheduler/sched_main.c:248 
drivers/gpu/drm/scheduler/sched_main.c:893) gpu_sched
[ 1879.715495] drm_sched_main (drivers/gpu/drm/scheduler/sched_main.c:1019) 
gpu_sched
[ 1879.715554] kthread (kernel/kthread.c:389)
[ 1879.715563] ret_from_fork (arch/x86/kernel/process.c:145)
[ 1879.715575] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

[ 1879.715590] value changed: 0x -> 0x8f360663dc00

[ 1879.715604] Reported by Kernel Concurrency Sanitizer on:
[ 1879.715612] CPU: 11 PID: 1098 Comm: gfx_0.0.0 Tainted: G L 
6.5.0-rc6+ #47
[ 1879.715624] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, 
BIOS 1.21 04/26/2023
[ 1879.715631] 
==

It seems that the line in question might be:

first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);

which expands to:

static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
{
	struct spsc_node **tail;

	node->next = NULL;

	preempt_disable();

	tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
	WRITE_ONCE(*tail, node);
	atomic_inc(&queue->job_count);

	/*
	 * In case of first element verify new node will be visible to the consumer
	 * thread when we ping the kernel thread that there is new work to do.
	 */
	smp_wmb();

	preempt_enable();

	return tail == &queue->head;
}

According to the manual, preempt_disable() only guarantees exclusion on a
single CPU/core/thread, so we might be stuck with slow, old-fashioned locking
unless anyone has a better idea.

Best regards,
Mirsad Todorovac


[BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]

2023-08-17 Thread Mirsad Todorovac

Hi,

This is your friendly bug reporter.

The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS and a Ryzen 
7950X box.

Please find attached the complete dmesg output from the ring buffer and lshw 
output.

NOTE: The kernel reports itself as tainted, but to my knowledge no proprietary
modules are loaded; the taint was set by earlier bugs.

dmesg excerpt:

[ 8791.864576] 
==
[ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / 
drm_sched_entity_push_job [gpu_sched]

[ 8791.864776] write (marked) to 0x9b74491b7c40 of 8 bytes by task 3807 on 
cpu 18:
[ 8791.864788]  drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched]
[ 8791.864852]  amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu]
[ 8791.868731]  drm_ioctl_kernel+0x127/0x210 [drm]
[ 8791.869222]  drm_ioctl+0x38f/0x6f0 [drm]
[ 8791.869711]  amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
[ 8791.873660]  __x64_sys_ioctl+0xd2/0x120
[ 8791.873676]  do_syscall_64+0x58/0x90
[ 8791.873688]  entry_SYSCALL_64_after_hwframe+0x73/0xdd

[ 8791.873710] read to 0x9b74491b7c40 of 8 bytes by task 1119 on cpu 27:
[ 8791.873722]  drm_sched_entity_is_ready+0x16/0x50 [gpu_sched]
[ 8791.873786]  drm_sched_select_entity+0x1c7/0x220 [gpu_sched]
[ 8791.873849]  drm_sched_main+0xd2/0x500 [gpu_sched]
[ 8791.873912]  kthread+0x18b/0x1d0
[ 8791.873924]  ret_from_fork+0x43/0x70
[ 8791.873939]  ret_from_fork_asm+0x1b/0x30

[ 8791.873955] value changed: 0x -> 0x9b750ebcfc00

[ 8791.873971] Reported by Kernel Concurrency Sanitizer on:
[ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: G L 
6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
[ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, 
BIOS 1.21 04/26/2023
[ 8791.874002] 
==

Best regards,
Mirsad Todorovac

dmesg-3.log.xz
Description: application/xz


lshw.txt.xz
Description: application/xz


Re: [Intel-gfx] [PATCH 2/2] drm/i915: Fix a memory leak with reused mmap_offset

2023-01-18 Thread Mirsad Todorovac

Hi,

On 1/18/23 11:39, Das, Nirmoy wrote:


On 1/18/2023 11:26 AM, Mirsad Todorovac wrote:

Hi,

On 1/18/23 10:19, Tvrtko Ursulin wrote:


Thanks for working on this, it looks good to me and it aligns with how i915 
uses the facility.

Copying Mirsad who reported the issue in case he is still happy to give it a quick test. Mirsad, I don't know if you are 
subscribed to one of the two mailing lists where series was posted. In case not, you can grab both patches from 
https://patchwork.freedesktop.org/series/112952/.


Nirmoy - we also have an IGT written by Chuansheng - https://patchwork.freedesktop.org/patch/515720/?series=101035&rev=4. A more 
generic one could be placed in gem_mmap_offset test but this one works too in my testing and is IMO better than nothing.


Finally, let me add some tags below:

On 17/01/2023 17:52, Nirmoy Das wrote:

drm_vma_node_allow() and drm_vma_node_revoke() should be called in
balanced pairs. We call drm_vma_node_allow() once per-file every time a
user calls mmap_offset, but only call drm_vma_node_revoke() once per-file
on each mmap_offset. As the mmap_offset is reused by the client, the
per-file vm_count may remain non-zero and the rbtree leaked.

Call drm_vma_node_allow_once() instead to prevent that memory leak.

Cc: Tvrtko Ursulin 
Cc: Andi Shyti 


Fixes: 786555987207 ("drm/i915/gem: Store mmap_offsets in an rbtree rather than a 
plain list")
Reported-by: Chuansheng Liu 
Reported-by: Mirsad Todorovac 
Cc:  # v5.7+
Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko



Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gem/i915_gem_mman.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index 4f69bff63068..2aac6bf78740 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -697,7 +697,7 @@ mmap_offset_attach(struct drm_i915_gem_object *obj,
  GEM_BUG_ON(lookup_mmo(obj, mmap_type) != mmo);
  out:
  if (file)
-    drm_vma_node_allow(&mmo->vma_node, file);
+    drm_vma_node_allow_once(&mmo->vma_node, file);
  return mmo;
  err:


The drm/i915 patch seems OK and there are currently no memory leaks
reported by /sys/kernel/debug/kmemleak under the same Chrome load that triggered
the initial bug ...
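
For readers following along, the imbalance described in the commit message
can be modelled in a few lines of plain C. This is a simplified sketch, not
the DRM implementation: struct file_entry and the model_*() helpers are
hypothetical stand-ins for the per-file rbtree entry and for
drm_vma_node_allow(), drm_vma_node_allow_once() and drm_vma_node_revoke().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-file entry kept in the node's rbtree. */
struct file_entry {
	bool present;            /* entry is currently in the tree */
	unsigned long vm_count;  /* per-file reference count */
};

/* Models drm_vma_node_allow(): every call takes another reference. */
static void model_allow(struct file_entry *e)
{
	e->present = true;
	e->vm_count++;
}

/* Models drm_vma_node_allow_once(): at most one reference per file. */
static void model_allow_once(struct file_entry *e)
{
	if (!e->present) {
		e->present = true;
		e->vm_count = 1;
	}
}

/* Models drm_vma_node_revoke(): drops one reference, frees the entry at zero. */
static void model_revoke(struct file_entry *e)
{
	if (!e->present)
		return;
	assert(e->vm_count > 0);
	if (--e->vm_count == 0)
		e->present = false;
}
```

Calling model_allow() twice followed by a single model_revoke() leaves the
entry present with vm_count == 1, which corresponds to the leaked entry; with
model_allow_once() the single revoke releases it.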



Thanks, Mirsad for quickly checking this!


There was no problem, Nirmoy, everything applied neatly :)

Regards,
Mirsad

--
Mirsad Goran Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu

System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia


Re: [PATCH 2/2] drm/i915: Fix a memory leak with reused mmap_offset

2023-01-18 Thread Mirsad Todorovac

Hi,

On 1/18/23 10:19, Tvrtko Ursulin wrote:


Thanks for working on this, it looks good to me and it aligns with how i915 
uses the facility.

Copying Mirsad who reported the issue in case he is still happy to give it a quick test. Mirsad, I don't know if you are subscribed 
to one of the two mailing lists where series was posted. In case not, you can grab both patches from 
https://patchwork.freedesktop.org/series/112952/.


Nirmoy - we also have an IGT written by Chuansheng - https://patchwork.freedesktop.org/patch/515720/?series=101035&rev=4. A more 
generic one could be placed in gem_mmap_offset test but this one works too in my testing and is IMO better than nothing.


Finally, let me add some tags below:

On 17/01/2023 17:52, Nirmoy Das wrote:

drm_vma_node_allow() and drm_vma_node_revoke() should be called in
balanced pairs. We call drm_vma_node_allow() once per-file every time a
user calls mmap_offset, but only call drm_vma_node_revoke() once per-file
on each mmap_offset. As the mmap_offset is reused by the client, the
per-file vm_count may remain non-zero and the rbtree leaked.

Call drm_vma_node_allow_once() instead to prevent that memory leak.

Cc: Tvrtko Ursulin 
Cc: Andi Shyti 


Fixes: 786555987207 ("drm/i915/gem: Store mmap_offsets in an rbtree rather than a 
plain list")
Reported-by: Chuansheng Liu 
Reported-by: Mirsad Todorovac 
Cc:  # v5.7+
Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko



Signed-off-by: Nirmoy Das 
---
  drivers/gpu/drm/i915/gem/i915_gem_mman.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index 4f69bff63068..2aac6bf78740 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -697,7 +697,7 @@ mmap_offset_attach(struct drm_i915_gem_object *obj,
  GEM_BUG_ON(lookup_mmo(obj, mmap_type) != mmo);
  out:
  if (file)
-    drm_vma_node_allow(&mmo->vma_node, file);
+    drm_vma_node_allow_once(&mmo->vma_node, file);
  return mmo;
  err:


The drm/i915 patch seems OK and there are currently no memory leaks
reported by /sys/kernel/debug/kmemleak under the same Chrome load that triggered
the initial bug ...

I will let you know if anything changes.

Regards,
Mirsad

--
Mirsad Goran Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu

System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia