Control: tag -1 - moreinfo

-------- Forwarded Message --------
From: Gabriel Francisco <frc.gabr...@gmail.com>
To: Ben Hutchings <b...@decadent.org.uk>
Subject: Re: Bug#1053122: linux-image-6.5.0-1-amd64: using
smp_processor_id() in preemptible
Date: 12/10/23 20:23:30
Message-Id:
<CAFXNTqMQtkYNhZANXapZtDBstcy74DL9giCX7eH8MV=wpky...@mail.gmail.com>

Hi,

> The CPU registers contain several addresses starting ffff89, except for
> rbx which starts ffff99 (and is the faulting address).  That looks like
> a single bit got flipped.

Thanks for the explanation! (now I know how to detect bit flips) :D
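
For anyone reading this later, here is a minimal sketch of the comparison described above. It takes the register values from the oops in this report and checks which high bits of RBX differ from the other kernel pointers. The helper name `differing_high_bits` and the choice of comparing bits 40 and up (the ffff89/ffff99 prefix) are my own assumptions for illustration, not anything from the kernel.

```python
# Register values copied from the oops quoted in this thread.
FAULTING_RBX = 0xffff991989e936b8
NEIGHBOURS = {
    "RCX": 0xffff891797aaef00,
    "RSI": 0xffff891989e645c0,
    "R10": 0xffff8918cbfeb7f8,
    "R14": 0xffff891989e645c0,
    "R15": 0xffff891989e64958,
}


def differing_high_bits(a, b, shift=40):
    """Return the positions of bits >= shift where a and b differ.

    Hypothetical helper: shift=40 compares only the top 24 bits,
    i.e. the "ffff89" / "ffff99" prefix discussed above.
    """
    diff = (a >> shift) ^ (b >> shift)
    return [i + shift for i in range(diff.bit_length()) if (diff >> i) & 1]


if __name__ == "__main__":
    for name, value in NEIGHBOURS.items():
        # A one-element list means the prefixes differ by exactly one bit.
        print(name, differing_high_bits(FAULTING_RBX, value))
```

A single differing bit in the prefix is only a hint that a bit flip occurred; it does not say whether the cause was hardware or software.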

> The first BUG message should be more meaningful than what comes after.
> This shows the kernel tried to access non-existent memory.

Yes, I should indeed have reported the first one. I overthought it and
ended up reporting the second one. Sorry about that.
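
As a note on reading that first message: the `error_code(0x0002)` value in the oops can be decoded from the standard x86 page-fault error code bits. The sketch below is only an illustration; the function name `decode_pf_error_code` and the constant names are my own, and it covers just the low five bits.

```python
# Low bits of the x86 page-fault error code.
PF_PROT = 0x01   # 0: page not present, 1: protection violation
PF_WRITE = 0x02  # 0: read access, 1: write access
PF_USER = 0x04   # 0: supervisor mode, 1: user mode
PF_RSVD = 0x08   # 1: reserved bit set in a page-table entry
PF_INSTR = 0x10  # 1: fault on an instruction fetch


def decode_pf_error_code(code):
    """Hypothetical helper: turn a #PF error code into readable fields."""
    return {
        "page": "protection violation" if code & PF_PROT else "not-present page",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "supervisor",
        "reserved_bit": bool(code & PF_RSVD),
        "instruction_fetch": bool(code & PF_INSTR),
    }


if __name__ == "__main__":
    # Value taken from the oops in this report.
    print(decode_pf_error_code(0x0002))
```

For 0x0002 this gives a supervisor write to a not-present page, which matches the two `#PF:` lines in the log.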

> This could be due to a kernel bug, but is more likely a hardware
> problem.  Please test the RAM with memtest86+.  Also if you've enabled
> any overclocking options, turn those off.

Even with XMP (3000@1.35v) enabled (F4-3000C16-16GISB), memtest86+ ran for 3
hours and printed PASS on the screen.
I have since disabled the XMP profile for my memory and ordered new RAM
modules to check whether my current ones are faulty.

The message in dmesg appeared only once (but I reported it anyway).

The hang still happens, with or without XMP, when running the 6.5.x kernel
series. It happens when maximizing a video (or from time to time when my
cursor enters the video area). It does not happen with the 6.1.x kernel
series.

I'm using the amdgpu module.

Greetings,

*Gabriel Francisco*
Linux User #507840
email: frc.gabriel[at]gmail.com <frc.gabr...@gmail.com>


On Thu, Oct 5, 2023 at 1:15 AM Ben Hutchings <b...@decadent.org.uk> wrote:

> Control: retitle -1 linux-image-6.5.0-1-amd64: Kernel page fault in
> process exit due to bit flip
> Control: tag -1 moreinfo
> 
> On Wed, 2023-09-27 at 20:45 +0200, Gabriel Francisco wrote:
> > Package: src:linux
> > Version: 6.5.3-1
> > Severity: important
> > Tags: upstream
> > X-Debbugs-Cc: frc.gabr...@gmail.com
> > 
> > Dear Maintainer,
> > 
> > First of all thanks for your hard work!
> > 
> > I noticed my computer started freezing for a few seconds when entering/exiting
> > full screen videos in YouTube using Firefox and while trying to check if the
> > issue also affected Chromium I saw the following message in dmesg:
> > 
> > [12569.564300] BUG: unable to handle page fault for address: ffff991989e936b8
> > [12569.564304] #PF: supervisor write access in kernel mode
> > [12569.564306] #PF: error_code(0x0002) - not-present page
> 
> The first BUG message should be more meaningful than what comes after.
> This shows the kernel tried to access non-existent memory.
> 
> > [12569.564308] PGD 0 P4D 0
> > [12569.564311] Oops: 0002 [#1] PREEMPT SMP NOPTI
> > [12569.564314] CPU: 10 PID: 328649 Comm: Chroot Helper Not tainted 6.5.0-1-amd64 #1  Debian 6.5.3-1
> > [12569.564317] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING WIFI II, BIOS 3205 08/14/2023
> > [12569.564318] RIP: 0010:down_write+0x23/0x70
> > [12569.564324] Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 48 89 fb e8 2e bc ff ff bf 01 00 00 00 e8 74 3a 53 ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 33 65 48 8b 04 25 80 29 03 00 48 89 43 08 bf 01
> > [12569.564326] RSP: 0018:ffffa189d736fc70 EFLAGS: 00010246
> > [12569.564328] RAX: 0000000000000000 RBX: ffff991989e936b8 RCX: ffff891797aaef00
> > [12569.564330] RDX: 0000000000000001 RSI: ffff891989e645c0 RDI: ffffffff8e7c95dc
> > [12569.564331] RBP: ffffffffffffffff R08: 0000000000000060 R09: 0000000080400014
> > [12569.564333] R10: ffff8918cbfeb7f8 R11: 0000000000000006 R12: 00007f7e5fd00000
> > [12569.564334] R13: 0000000000000001 R14: ffff891989e645c0 R15: ffff891989e64958
> 
> The CPU registers contain several addresses starting ffff89, except for
> rbx which starts ffff99 (and is the faulting address).  That looks like
> a single bit got flipped.
> 
> This could be due to a kernel bug, but is more likely a hardware
> problem.  Please test the RAM with memtest86+.  Also if you've enabled
> any overclocking options, turn those off.
> 
> [...]
> > After that the computer can't shut down and systemd keeps waiting on process PID 328649 (Chroot Helper).
> 
> This (and the other BUG messages) are because that process crashed in
> kernel mode and couldn't properly exit.
> 
> Ben.
> 
> --
> Ben Hutchings
> Beware of bugs in the above code;
> I have only proved it correct, not tried it. - Donald Knuth
> 
> 
