Bug#1072004: linux: regression in the 9p protocol in 6.8 breaks autopkgtest qemu jobs (affecting debci)

2024-05-27 Thread Thorsten Leemhuis
On 27.05.24 14:22, Luca Boccassi wrote:
>> https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461
> 
> This has been reported upstream 3 weeks ago, but so far it seems no
> action has been taken:
> 
> https://lore.kernel.org/all/Zj0ErxVBE3DYT2Ea@gpd/

Hmmm, that thread is strange, why are David's replies not where they are
supposed to be? Whatever. The last thing from just a few days ago seems
to be an inquiry from David to Andrea that was not yet answered afaics:
https://lore.kernel.org/all/531994.1716450...@warthog.procyon.org.uk/

Would also help a lot to know if this is a 6.8.y only thing, or happens
with 6.9 and mainline as well, as 6.8.y will likely be EOLed soon.

Ciao, Thorsten



Bug#1071420: linux-image-6.8.9-1-amd64: cannot mount btrfs root partition

2024-05-21 Thread Thorsten Leemhuis
TWIMC, the problem systemd is facing due to the removal of an obsolete
option (that might or might not lead to the problem this bug is about)
was finally properly reported upstream now – and from the first reply it
sounds like a workaround is likely to be expected:

https://lore.kernel.org/all/ZkxZT0J-z0GYvfy8@gardel-login/



Bug#1071420: linux-image-6.8.9-1-amd64: cannot mount btrfs root partition

2024-05-19 Thread Linux regression tracking (Thorsten Leemhuis)
On Sat, 18 May 2024 22:25:14 +0200 Matteo Settenvini wrote:
> 
> booting kernel 6.8.9-1 with dracut, systemd, and btrfs as the root device
> fails to mount the root partition. I just tried the kernel from sid and it
> seems indeed affected. The 6.7 kernel from trixie is instead booting fine
> even after regenerating all initrds.
> 
> According to bl...@debian.org, this is likely due to
> https://github.com/torvalds/linux/commit/a1912f712188291f9d7d434fba155461f1ebef66

Would be great to know what the actual problem is. Are there any error
messages from systemd or the kernel?

The upstream bug (https://github.com/systemd/systemd/pull/32892 ) about
this also does not state what goes wrong (either in general or certain
situations).

Such details would likely be needed to convince the btrfs upstream devs
to revert the change or apply a workaround -- especially as I'm pretty
sure there are already a lot of btrfs systems with systemd and 6.8
(released upstream 2+ months ago and regularly used in Arch, Fedora and
Tumbleweed for weeks now) out there and working just fine (including the
Fedora machine I write from).

Thorsten



Bug#1054514: [PATCH 1/1] drm/qxl: fixes qxl_fence_wait

2024-03-08 Thread Thorsten Leemhuis
On 08.03.24 02:08, Alex Constantino wrote:
> Fix OOM scenario by doing multiple notifications to the OOM handler through
> a busy wait logic.
> Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
> result in a '[TTM] Buffer eviction failed' exception whenever it reached a
> timeout.

Thx for working on this.

> Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
> Link: https://lore.kernel.org/regressions/fb0fda6a-3750-4e1b-893f-97a3e402b...@leemhuis.info

Nitpicking: that ideally should be pointing to
https://lore.kernel.org/regressions/ztgydqrlk6wx_...@eldamar.lan/ , as
that's the report and not just a reply to prod things.

Ciao, Thorsten



Bug#1061449: linux-image-6.7-amd64: a boot message from amdgpu

2024-01-28 Thread Linux regression tracking (Thorsten Leemhuis)
On 27.01.24 14:14, Salvatore Bonaccorso wrote:
>
> In Debian (https://bugs.debian.org/1061449) we got the following
> quoted report:
> 
> On Wed, Jan 24, 2024 at 07:38:16PM +0100, Patrice Duroux wrote:
>>
>> Giving a try to 6.7, here is a message extracted from dmesg:
>> [4.177226] [ cut here ]
>> [4.177227] WARNING: CPU: 6 PID: 248 at
>> drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_factory.c:387
>> construct_phy+0xb26/0xd60 [amdgpu]
> [...]

Not my area of expertise, but looks a lot like a duplicate of
https://gitlab.freedesktop.org/drm/amd/-/issues/3122#note_2252835

Mario (now CCed) already prepared a patch for that issue that seems to work.

HTH, Ciao, Thorsten



Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently

2023-12-06 Thread Linux regression tracking (Thorsten Leemhuis)
Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Gerd, it seems this regression[1] fell through the cracks. Could you
please take a look? Or is there a good reason why this can't be
addressed? Or was it dealt with and I just missed it?

[1] apparently caused by 5a838e5d5825c8 ("drm/qxl: simplify
qxl_fence_wait") [v5.13-rc1] from Gerd; for details see
https://lore.kernel.org/regressions/ztgydqrlk6wx_...@eldamar.lan/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

On 24.10.23 23:39, Timo Lindfors wrote:
> Hi,
> 
> On Tue, 24 Oct 2023, Salvatore Bonaccorso wrote:
>> Thanks for the excellently constructed report! I think it's best to
>> forward this directly to upstream including the people for the
>> bisected commit to get some idea.
> 
> Thanks for the quick reply!
> 
>> Can you reproduce the issue with 6.5.8-1 in unstable as well?
> 
> Unfortunately yes:
> 
> ansible@target:~$ uname -r
> 6.5.0-3-amd64
> ansible@target:~$ time sudo ./reproduce.bash
> Wed 25 Oct 2023 12:27:00 AM EEST starting round 1
> Wed 25 Oct 2023 12:27:24 AM EEST starting round 2
> Wed 25 Oct 2023 12:27:48 AM EEST starting round 3
> bug was reproduced after 3 tries
> 
> real    0m48.838s
> user    0m1.115s
> sys     0m45.530s
> 
> I also tested upstream tag v6.6-rc6:
> 
> ...
> + detected_version=6.6.0-rc6
> + '[' 6.6.0-rc6 '!=' 6.6.0-rc6 ']'
> + exec ssh target sudo ./reproduce.bash
> Wed 25 Oct 2023 12:37:16 AM EEST starting round 1
> Wed 25 Oct 2023 12:37:42 AM EEST starting round 2
> Wed 25 Oct 2023 12:38:10 AM EEST starting round 3
> Wed 25 Oct 2023 12:38:36 AM EEST starting round 4
> Wed 25 Oct 2023 12:39:01 AM EEST starting round 5
> Wed 25 Oct 2023 12:39:27 AM EEST starting round 6
> bug was reproduced after 6 tries
> 
> 
> For completeness, here is also the grub_set_default_version.bash script
> that I had to write to automate this (maybe these could be in debian
> wiki?):
> 
> #!/bin/bash
> set -x
> 
> version="$1"
> 
> idx=$(expr $(grep "menuentry " /boot/grub/grub.cfg | sed 1d | grep -n "'Debian GNU/Linux, with Linux $version'" | cut -d: -f1) - 1)
> exec sudo grub-set-default "1>$idx"
> 
> 
> 
> -Timo
> 
> 
> 
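
A minimal sketch of the index lookup from the quoted
grub_set_default_version.bash, wrapped in a function so it can be tried
against a copy of grub.cfg before touching the boot default. The function
name grub_entry_index and the optional second argument are assumptions made
for illustration; they are not part of Timo's script.

```shell
#!/bin/sh
# Hypothetical helper (not from the original script): print the zero-based
# position of the "with Linux <version>" entry among the submenu entries.
#   $1 = kernel version, e.g. 6.5.0-3-amd64
#   $2 = grub.cfg to inspect (defaults to /boot/grub/grub.cfg)
grub_entry_index() {
    gei_version="$1"
    gei_cfg="${2:-/boot/grub/grub.cfg}"
    # Same pipeline as the quoted script: take all menuentry lines, drop the
    # first (top-level) one, then find the line number of the wanted entry.
    gei_line=$(grep "menuentry " "$gei_cfg" | sed 1d |
        grep -n -F "'Debian GNU/Linux, with Linux $gei_version'" |
        head -n 1 | cut -d: -f1)
    # Fail instead of printing a bogus index when the version is not listed.
    [ -n "$gei_line" ] || return 1
    echo $((gei_line - 1))
}
```

Something like `sudo grub-set-default "1>$(grub_entry_index 6.5.0-3-amd64)"`
would then mirror the last line of the original script.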



Bug#1051592: Regression: Commit "netfilter: nf_tables: disallow rule addition to bound chain via NFTA_RULE_CHAIN_ID" breaks ruleset loading in linux-stable

2023-09-12 Thread Linux regression tracking (Thorsten Leemhuis)
On 12.09.23 00:57, Pablo Neira Ayuso wrote:
> On Mon, Sep 11, 2023 at 11:37:50PM +0200, Timo Sigurdsson wrote:
>>
>> recently, Debian updated their stable kernel from 6.1.38 to 6.1.52
>> which broke nftables ruleset loading on one of my machines with lots
>> of "Operation not supported" errors. I've reported this to the
>> Debian project (see link below) and Salvatore Bonaccorso and I
>> identified "netfilter: nf_tables: disallow rule addition to bound
>> chain via NFTA_RULE_CHAIN_ID" (0ebc1064e487) as the offending commit
>> that introduced the regression. Salvatore also found that this issue
>> affects the 5.10 stable tree as well (observed in 5.10.191), but he
>> cannot reproduce it on 6.4.13 and 6.5.2.
>>
>> The issue only occurs with some rulesets. While I can't trigger it
>> with simple/minimal rulesets that I use on some machines, it does
>> occur with a more complex ruleset that has been in use for months
>> (if not years, for large parts of it). I'm attaching a somewhat
>> stripped down version of the ruleset from the machine I originally
>> observed this issue on. It's still not a small or simple ruleset,
>> but I'll try to reduce it further when I have more time.
>>
>> The error messages shown when trying to load the ruleset don't seem
>> to be helpful. Just to give two simple examples from the log when
>> nftables fails to start:
>> /etc/nftables.conf:99:4-44: Error: Could not process rule: Operation not supported
>> tcp option maxseg size 1-500 counter drop
>> ^
>> /etc/nftables.conf:308:4-27: Error: Could not process rule: Operation not supported
>> tcp dport sip-tls accept
>> 
> 
> I can reproduce this issue with 5.10.191 and 6.1.52 and nftables v1.0.6,
> this is not reproducible with v1.0.7 and v1.0.8.
> 
>> Since the issue only affects some stable trees, Salvatore thought it
>> might be an incomplete backport that causes this.
>>
>> If you need further information, please let me know.
> 
> Userspace nftables v1.0.6 generates incorrect bytecode that hits a new
> kernel check that rejects adding rules to bound chains. The incorrect
> bytecode adds the chain binding, attaches it to the rule, and then adds the
> rules to the chain binding. I have cherry-picked these three patches
> for nftables v1.0.6 userspace and your ruleset restores fine.
> [...]

Hmmm. Well, this sounds like a kernel regression to me that normally
should be dealt with on the kernel level, as users after updating the
kernel should never have to update any userspace stuff to continue what
they have been doing before the kernel update.

Can't the kernel somehow detect the incorrect bytecode and do the right
thing(tm) somehow?

But yes, don't worry, I know that reality is not black and white and
that it's crucial that things like packet filtering do exactly what the
user expects them to do; that's why this might be one of those rare
situations where "user has to update userspace components to support
newer kernels" might be the better of two bad choices. But I had to ask
to ensure it's something like that.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.



Bug#1042753: nouveau bug in linux/6.1.38-2

2023-08-04 Thread Thorsten Leemhuis
Hi!

On 02.08.23 23:28, Olaf Skibbe wrote:
> Dear Maintainers,
> 
> Hereby I would like to report an apparent bug in the nouveau driver in
> linux/6.1.38-2.

Thx for your report. Maybe your problem is caused by an incomplete
backport. I CCed the maintainers for the drivers (and the regressions
and the stable list), maybe one of them has an idea, as they know the
driver.

If they don't reply in the next few days, please check if the problem is
also present in mainline. If not, check if the latest 6.1.y release
already fixes this. If not, try to check which of the four patches you
reverted to get things going is actually causing this (e.g. first only
revert the one that was applied last; then the two last ones; ...).
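
The revert sequence suggested above could be laid out roughly like this. It
is only a sketch: revert_rounds is a hypothetical helper, it merely prints
the commands for each round instead of running them, and the commit IDs have
to be passed with the most recently applied one first.

```shell
#!/bin/sh
# Hypothetical helper: print one "git revert" command line per test round.
# Round 1 reverts only the newest commit, round 2 the two newest, and so on.
# Arguments: commit IDs, ordered from most recently applied to oldest.
revert_rounds() {
    rr_set=""
    for rr_commit in "$@"; do
        # Grow the set by one commit per round, keeping newest-first order so
        # the reverts apply in the reverse order of the original patches.
        rr_set="${rr_set:+$rr_set }$rr_commit"
        echo "git revert --no-edit $rr_set"
    done
}
```

Each printed line is meant to be applied to a fresh checkout of the
unmodified 6.1.y release, followed by a rebuild and a boot test; assuming a
single culprit, the commit added in the first round that fixes the display
is the likely one.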

> Running a current debian stable on a Dell Latitude E6510 with a
> "NVIDIA Corporation GT218M" graphic card, the monitor turns black
> after the grub screen. Also switching to a console (Strg-Alt-F2) shows
> just a black screen. Access via ssh is possible.
> 
> ~# uname -r
> 6.1.0-10-amd64
> 
> dmesg shows the following error message:
> 
> [    3.560153] WARNING: CPU: 0 PID: 176 at
> drivers/gpu/drm/nouveau/nvkm/engine/disp/dp.c:460
> nvkm_dp_acquire+0x26a/0x490 [nouveau]
> [    3.560287] Modules linked in: sd_mod t10_pi sr_mod crc64_rocksoft
> cdrom crc64 crc_t10dif crct10dif_generic nouveau(+) ahci libahci mxm_wmi
> i2c_algo_bit drm_display_helper libata cec rc_core drm_ttm_helper ttm
> scsi_mod e1000e drm_kms_helper ptp firewire_ohci sdhci_pci cqhci
> ehci_pci sdhci ehci_hcd firewire_core i2c_i801 crct10dif_pclmul
> crct10dif_common drm crc32_pclmul crc32c_intel psmouse usbcore mmc_core
> crc_itu_t pps_core scsi_common i2c_smbus lpc_ich usb_common battery
> video wmi button
> [    3.560322] CPU: 0 PID: 176 Comm: kworker/u16:5 Not tainted
> 6.1.0-10-amd64 #1  Debian 6.1.38-2
> [    3.560325] Hardware name: Dell Inc. Latitude E6510/0N5KHN, BIOS A17
> 05/12/2017
> [    3.560327] Workqueue: nvkm-disp nv50_disp_super [nouveau]
> [    3.560433] RIP: 0010:nvkm_dp_acquire+0x26a/0x490 [nouveau]
> [    3.560538] Code: 48 8b 44 24 58 65 48 2b 04 25 28 00 00 00 0f 85 37
> 02 00 00 48 83 c4 60 44 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc
> cc <0f> 0b c1 e8 03 41 88 6d 62 44 89 fe 48 89 df 48 69 c0 cf 0d d6 26
> [    3.560541] RSP: 0018:9899c048bd60 EFLAGS: 00010246
> [    3.560542] RAX: 00041eb0 RBX: 88e0209d2600 RCX:
> 00041eb0
> [    3.560544] RDX: c079f760 RSI:  RDI:
> 9899c048bcf0
> [    3.560545] RBP: 0001 R08: 9899c048bc64 R09:
> 5b76
> [    3.560546] R10: 000d R11: 9899c048bde0 R12:
> ffea
> [    3.560548] R13: 88e00b39e480 R14: 00044d45 R15:
> 
> [    3.560549] FS:  () GS:88e123c0()
> knlGS:
> [    3.560551] CS:  0010 DS:  ES:  CR0: 80050033
> [    3.560552] CR2: 7f57f4e90451 CR3: 00018141 CR4:
> 06f0
> [    3.560554] Call Trace:
> [    3.560558]  
> [    3.560560]  ? __warn+0x7d/0xc0
> [    3.560566]  ? nvkm_dp_acquire+0x26a/0x490 [nouveau]
> [    3.560671]  ? report_bug+0xe6/0x170
> [    3.560675]  ? handle_bug+0x41/0x70
> [    3.560679]  ? exc_invalid_op+0x13/0x60
> [    3.560681]  ? asm_exc_invalid_op+0x16/0x20
> [    3.560685]  ? init_reset_begun+0x20/0x20 [nouveau]
> [    3.560769]  ? nvkm_dp_acquire+0x26a/0x490 [nouveau]
> [    3.560888]  nv50_disp_super_2_2+0x70/0x430 [nouveau]
> [    3.560997]  nv50_disp_super+0x113/0x210 [nouveau]
> [    3.561103]  process_one_work+0x1c7/0x380
> [    3.561109]  worker_thread+0x4d/0x380
> [    3.561113]  ? rescuer_thread+0x3a0/0x3a0
> [    3.561116]  kthread+0xe9/0x110
> [    3.561120]  ? kthread_complete_and_exit+0x20/0x20
> [    3.561122]  ret_from_fork+0x22/0x30
> [    3.561130]  
> 
> Further information:
> 
> $ lspci -v -s $(lspci | grep -i vga | awk '{ print $1 }')
> 01:00.0 VGA compatible controller: NVIDIA Corporation GT218M [NVS 3100M]
> (rev a2) (prog-if 00 [VGA controller])
> Subsystem: Dell Latitude E6510
> Flags: bus master, fast devsel, latency 0, IRQ 27
> Memory at e200 (32-bit, non-prefetchable) [size=16M]
> Memory at d000 (64-bit, prefetchable) [size=256M]
> Memory at e000 (64-bit, prefetchable) [size=32M]
> I/O ports at 7000 [size=128]
> Expansion ROM at 000c [disabled] [size=128K]
> Capabilities: 
> Kernel driver in use: nouveau
> Kernel modules: nouveau
> 
> I reported this bug to debian already, see
> https://bugs.debian.org/1042753 for context.
> 
> With support (thanks Diederik!) I managed to figure out that the cause
> was a regression between upstream kernel version 6.1.27 and 6.1.38.
> 
> I build a new 6.1.38 kernel with these commits reverted:
> 
> 62aecf23f3d1 drm/nouveau: add nv_encoder pointer check for NULL
> fb725beca62d drm/nouveau/dp: check for NULL nv_connector->native_mode
> 90748be0f4f3 drm/nouveau: don't detect DSM for non-NVIDIA 

Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-30 Thread Thorsten Leemhuis
On 27.06.23 00:34, Nick Hastings wrote:
> * Linux regression tracking (Thorsten Leemhuis) [230626 21:09]:
>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
>> for once, to make this easily accessible to everyone.
>>
>> Nick, what's the status/was there any progress? Did you do what Mario
>> suggested and file a nouveau bug?
> 
> It was not apparent that the suggestion to open "a Nouveau drm bug" was
> addressed to me.

I wish things were easier for reporters, but from what I can see this
is the only way forward if you or some silent bystander cares.

>> I ask, as I still have this on my list of regressions and it seems there
>> was no progress in three+ weeks now.
> 
> I have not pursued this further since as far as I could tell I already
> provided all requested information and I don't actually use nouveau, so
> I blacklisted it.

I doubt any developer cares enough to take a closer look[1] without a
proper nouveau bug and some help & prodding from someone affected. And
looks to me like reverting the culprit now might create even bigger
problems for users.

Hence I guess then this won't be fixed in the end. In an ideal world this
would not happen, but we don't live in one and all have just 24 hours in
a day. :-/

Nevertheless: thx for your report and your help through this thread.

[1] some points on the following page kinda explain this
https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot inconclusive: reporting deadlock (see thread for details)



Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

2023-06-26 Thread Linux regression tracking (Thorsten Leemhuis)
Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Nick, what's the status/was there any progress? Did you do what Mario
suggested and file a nouveau bug?

I ask, as I still have this on my list of regressions and it seems there
was no progress in three+ weeks now.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot backburner: slow progress, likely just affects one machine
#regzbot poke


On 02.06.23 02:57, Limonciello, Mario wrote:
> [AMD Official Use Only - General]
> 
>> -Original Message-
>> From: Nick Hastings 
>> Sent: Thursday, June 1, 2023 7:02 PM
>> To: Karol Herbst 
>> Cc: Limonciello, Mario ; Lyude Paul
>> ; Lukas Wunner ; Salvatore
>> Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
>> Wysocki ; Len Brown ; linux-
>> a...@vger.kernel.org; linux-ker...@vger.kernel.org;
>> regressi...@lists.linux.dev
>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>>
>> Hi,
>>
>> * Karol Herbst  [230602 03:10]:
>>> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
>>>  wrote:
> -Original Message-
> From: Karol Herbst 
> Sent: Thursday, June 1, 2023 12:19 PM
> To: Limonciello, Mario 
> Cc: Nick Hastings ; Lyude Paul
> ; Lukas Wunner ; Salvatore
> Bonaccorso ; 1036...@bugs.debian.org; Rafael J.
> Wysocki ; Len Brown ; linux-
> a...@vger.kernel.org; linux-ker...@vger.kernel.org;
> regressi...@lists.linux.dev
> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
>> system)
>
> On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
>  wrote:
>>
>> [AMD Official Use Only - General]
>>
>>> -Original Message-
>>> From: Karol Herbst 
>>> Sent: Thursday, June 1, 2023 11:33 AM
>>> To: Limonciello, Mario 
>>> Cc: Nick Hastings ; Lyude Paul
>>> ; Lukas Wunner ; Salvatore
>>> Bonaccorso ; 1036...@bugs.debian.org; Rafael
>> J.
>>> Wysocki ; Len Brown ; linux-
>>> a...@vger.kernel.org; linux-ker...@vger.kernel.org;
>>> regressi...@lists.linux.dev
>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video
>> _OSI
>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> system)
>>>
>>> On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario

 Lyude, Lukas, Karol

 This thread is in relation to this commit:

 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")

 Nick has found that runtime PM is *not* working for nouveau.

>>>
>>> keep in mind we have a list of PCIe controllers where we apply a
>>> workaround:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
>>>
>>> And I suspect there might be one or two more IDs we'll have to add
>>> there. Do we have any logs?
>>
>> There's some archived onto the distro bug.  Search this page for
> "journalctl.log.gz"
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
>>
>
> interesting.. It seems to be the same controller used here. I wonder
> if the pci topology is different or if the workaround is applied at
> all.

 I didn't see the message in the log about the workaround being applied
 in that log, so I guess PCI topology difference is a likely suspect.

>>>
>>> yeah, but I also couldn't see a log with the usual nouveau messages,
>>> so it's kinda weird.
>>>
>>> Anyway, the output of `lspci -tvnn` would help
>>
>> % lspci -tvnn
>> -[:00]-+-00.0  Intel Corporation Device [8086:3e20]
>>+-01.0-[01]00.0  NVIDIA Corporation TU117M [GeForce GTX 1650
>> Mobile / Max-Q] [10de:1f91]
> 
> So the bridge it's connected to is the same that the quirk *should have been* triggering.
> 
> May 29 15:02:42 xps kernel: pci :00:01.0: [8086:1901] type 01 class 0x060400
> 
> Since the quirk isn't working and this is still a problem in 6.4-rc4 I suggest opening a
> Nouveau drm bug to figure out why.
> 
>>+-02.0  Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630]
>> [8086:3e9b]
>>+-04.0  Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core
>> Processor Thermal Subsystem [8086:1903]
>>+-08.0  Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 /
>> 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
>>+-12.0  Intel Corporation Cannon Lake PCH Thermal Controller
>> [8086:a379]
>>+-14.0  

Re: virtio_balloon regression in 5.19-rc3 #forregzbot

2022-08-15 Thread Thorsten Leemhuis
On 10.07.22 10:06, Thorsten Leemhuis wrote:
> On 04.07.22 11:40, Thorsten Leemhuis wrote:
>> TWIMC: this mail is primarily sent for documentation purposes and for
>> regzbot, my Linux kernel regression tracking bot. These mails usually
>> contain '#forregzbot' in the subject, to make them easy to spot and filter.
>>
>> On 21.06.22 11:35, Thorsten Leemhuis wrote:
>>> On 20.06.22 20:49, Ben Hutchings wrote:
>>>> I've tested a 5.19-rc3 kernel on top of QEMU/KVM with machine type
>>>> pc-q35-5.2.  It has a virtio balloon device defined in libvirt as:
>>>>
>>>> 
>>>>   >>> function="0x0"/>
>>>> 
>>>>
>>>> but the virtio_balloon driver fails to bind to it:
>>>>
>>>> virtio_balloon virtio4: init_vqs: add stat_vq failed
>>>> virtio_balloon: probe of virtio4 failed with error -5
>>>>
>>> [...]
>>> #regzbot ^introduced v5.18..v5.19-rc3
>>> #regzbot ignore-activity
>>
>> #regzbot introduced 8b4ec69d7e09
>> #regzbot monitor
>> https://lore.kernel.org/all/20220622012940.21441-1-jasow...@redhat.com/
> 
> #regzbot fixed-by: 6a9720576c
> #regzbot ignore-activity

For the record: the fix was merged through a different branch and thus
got a different commit id:

#regzbot fixed-by: ebe797f25f68f28581f46a9cb9c1997ac15c39a0



Re: virtio_balloon regression in 5.19-rc3 #forregzbot

2022-07-10 Thread Thorsten Leemhuis



On 04.07.22 11:40, Thorsten Leemhuis wrote:
> TWIMC: this mail is primarily sent for documentation purposes and for
> regzbot, my Linux kernel regression tracking bot. These mails usually
> contain '#forregzbot' in the subject, to make them easy to spot and filter.
> 
> On 21.06.22 11:35, Thorsten Leemhuis wrote:
>> [TLDR: I'm adding this regression report to the list of tracked
>> regressions; all text from me you find below is based on a few template
>> paragraphs you might have encountered already in similar form.]
>>
>> On 20.06.22 20:49, Ben Hutchings wrote:
>>> I've tested a 5.19-rc3 kernel on top of QEMU/KVM with machine type
>>> pc-q35-5.2.  It has a virtio balloon device defined in libvirt as:
>>>
>>> 
>>>   >> function="0x0"/>
>>> 
>>>
>>> but the virtio_balloon driver fails to bind to it:
>>>
>>> virtio_balloon virtio4: init_vqs: add stat_vq failed
>>> virtio_balloon: probe of virtio4 failed with error -5
>>>
>> [...]
>> #regzbot ^introduced v5.18..v5.19-rc3
>> #regzbot ignore-activity
> 
> #regzbot introduced 8b4ec69d7e09
> #regzbot monitor
> https://lore.kernel.org/all/20220622012940.21441-1-jasow...@redhat.com/

#regzbot fixed-by: 6a9720576c
#regzbot ignore-activity

For details see:
https://lore.kernel.org/all/cacgkmeu8eecpamy__oqqnf7iuku7nho_-mij2zwulfv2rv+...@mail.gmail.com/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.



Re: virtio_balloon regression in 5.19-rc3 #forregzbot

2022-07-04 Thread Thorsten Leemhuis
TWIMC: this mail is primarily sent for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.

On 21.06.22 11:35, Thorsten Leemhuis wrote:
> [TLDR: I'm adding this regression report to the list of tracked
> regressions; all text from me you find below is based on a few template
> paragraphs you might have encountered already in similar form.]
> 
> On 20.06.22 20:49, Ben Hutchings wrote:
>> I've tested a 5.19-rc3 kernel on top of QEMU/KVM with machine type
>> pc-q35-5.2.  It has a virtio balloon device defined in libvirt as:
>>
>> 
>>   > function="0x0"/>
>> 
>>
>> but the virtio_balloon driver fails to bind to it:
>>
>> virtio_balloon virtio4: init_vqs: add stat_vq failed
>> virtio_balloon: probe of virtio4 failed with error -5
>>
> [...]
> #regzbot ^introduced v5.18..v5.19-rc3
> #regzbot ignore-activity

#regzbot introduced 8b4ec69d7e09
#regzbot monitor
https://lore.kernel.org/all/20220622012940.21441-1-jasow...@redhat.com/



Re: virtio_balloon regression in 5.19-rc3

2022-06-21 Thread Thorsten Leemhuis
[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me you find below is based on a few template
paragraphs you might have encountered already in similar form.]

On 20.06.22 20:49, Ben Hutchings wrote:
> I've tested a 5.19-rc3 kernel on top of QEMU/KVM with machine type
> pc-q35-5.2.  It has a virtio balloon device defined in libvirt as:
> 
> 
>function="0x0"/>
> 
> 
> but the virtio_balloon driver fails to bind to it:
> 
> virtio_balloon virtio4: init_vqs: add stat_vq failed
> virtio_balloon: probe of virtio4 failed with error -5
> 
> On a 5.18 kernel with similar configuration, it binds successfully.
> 
> I've attached the kernel config for 5.19-rc3.

CCing the regression mailing list, as it should be in the loop for all
regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

Thanks for the report. To be sure the issue below doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:

#regzbot ^introduced v5.18..v5.19-rc3
#regzbot ignore-activity
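
For readers skimming a saved copy of a mail like this one, the regzbot
commands can be pulled out with a simple filter. This is a minimal sketch,
not a description of how regzbot itself parses mails: regzbot_commands is a
hypothetical helper that only looks at unquoted lines starting with
'#regzbot'.

```shell
#!/bin/sh
# Hypothetical helper: read a mail body on stdin and print each regzbot
# command found on an unquoted line, with the leading "#regzbot" removed.
# Quoted lines ("> #regzbot ...") are skipped because they do not start
# with "#regzbot".
regzbot_commands() {
    grep -E '^#regzbot[[:space:]]' | sed 's/^#regzbot[[:space:]]*//'
}
```

Fed with the text of this mail, it would list only the two commands above
and ignore any that appear in quoted replies.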

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply -- ideally with also
telling regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/

Reminder for developers: When fixing the issue, add 'Link:' tags
pointing to the report (the mail this one replies to), as explained
in the Linux kernel's documentation; the above webpage explains why this is
important for tracked regressions.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.



Bug#1005005: Regression from 3c196f056666 ("drm/amdgpu: always reset the asic in suspend (v2)") on suspend?

2022-03-21 Thread Thorsten Leemhuis
On 21.03.22 19:49, Dominique Dumont wrote:
> On Monday, 21 March 2022 09:57:59 CET Thorsten Leemhuis wrote:
>> Dominique/Salvatore/Eric, what's the status of this regression?
>> According to the debian bug tracker the problem is solved with 5.16 and
>> 5.17, but was 5.15 ever fixed?
> 
> I don't think so.
> 
> On kernel side, the commit fixing this issue is
> e55a3aea418269266d84f426b3bd70794d3389c8 . 
> 
> According to the logs of [1] , this commit landed in v5.17-rc3
> 
> HTH
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

And from there it among others got backported to 5.15.22:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.15.y&id=8a15ac1786c92dce6ecbeb4e4c237f5f80c2c703

https://lwn.net/Articles/884107/

Another indicator that Eric's problem is something else.
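
Since 5.15.y point releases build on each other, a quick way to tell
whether a given 5.15.y release should already contain that backport is to
compare its version against 5.15.22. A minimal sketch, assuming GNU sort
with its -V (version sort) option; version_ge is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on GNU sort -V: the smaller of the two versions sorts first, so
# $1 >= $2 exactly when $2 ends up on the first line.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_ge 5.15.28 5.15.22 && echo "should contain the backport"`
covers the 5.15.28 release Eric mentions; note that the upstream version has
to be used here, not the Debian ABI name printed by `uname -r`.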

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.



Bug#1005005: Regression from 3c196f056666 ("drm/amdgpu: always reset the asic in suspend (v2)") on suspend?

2022-03-21 Thread Thorsten Leemhuis
On 21.03.22 13:07, Éric Valette wrote:
> My problem has never been fixed.
>
> The proposed patch has been applied to 5.15. I do not remember which version,
> 28 maybe.
> 
> I still have a RIP in pm_suspend. Did not test the last two 5.15 versions.
> 
> I can live with 5.10 and my own compiled kernels.
> 
> Thanks for asking.

This thread/the debian bug report
(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1005005 ) is getting
long which makes things hard to grasp. But to me it looks a lot like the
problem you are facing is different from the problem that others ran
into and bisected -- but I might be totally wrong there. Have you ever
tried reverting 3c196f05 to see if it helps (sorry if that's
mentioned in the bug report somewhere, as I said, it became long)? I
guess a bisection from your side really would help a lot; but before you
go down that route you might want to give 5.17 and the latest 5.15.y
kernel a try.
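
For readers unfamiliar with bisection: it is a binary search for the first bad entry in an ordered history. The sketch below illustrates the idea on the release versions Eric reported in the bug, not on individual commits as 'git bisect' would. The is_good callback is a hypothetical stand-in for "build, boot, suspend, resume and check dmesg", and the good/bad results are simply those from his report.

```python
from typing import Callable, Sequence


def first_bad(candidates: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Binary-search for the first bad entry.

    Assumes candidates[0] is good, candidates[-1] is bad, and every entry
    after the first bad one is bad as well (the usual bisection assumption).
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(candidates[mid]):
            lo = mid
        else:
            hi = mid
    return candidates[hi]


# Results as reported in the bug: 5.10 to 5.12 work, 5.13 and later do not.
reported_good = {"5.10", "5.11", "5.12"}
versions = ["5.10", "5.11", "5.12", "5.13", "5.14", "5.15", "5.16"]
print(first_bad(versions, lambda version: version in reported_good))
```

A real bisection between the last good and first bad release would apply the same search to the commits in that range.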

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)




Bug#1005005: Regression from 3c196f056666 ("drm/amdgpu: always reset the asic in suspend (v2)") on suspend?

2022-03-21 Thread Thorsten Leemhuis
Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

Dominique/Salvatore/Eric, what's the status of this regression?
According to the debian bug tracker the problem is solved with 5.16 and
5.17, but was 5.15 ever fixed?

Ciao, Thorsten

On 21.02.22 15:16, Alex Deucher wrote:
> On Mon, Feb 21, 2022 at 3:29 AM Eric Valette  wrote:
>>
>> On 20/02/2022 16:48, Dominique Dumont wrote:
>>> On Monday, 14 February 2022 22:52:27 CET Alex Deucher wrote:
>>>> Does the system actually suspend?
>>>
>>> Not really. The screen looks like it's going to suspend, but it does come
>>> back after 10s or so. The light mounted in the middle of the power button
>>> does not switch off.
>>
>>
>> As I have a very similar problem and also commented on the original
>> debian bug report
>> (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1005005), I will add
>> some information here on another amd only laptop (renoir AMD Ryzen 7
>> 4800H with Radeon Graphics + Radeon RX 5500/5500M / Pro 5500M).
>>
>> For me the suspend works once, but after the first resume (I do not
>> know if it is in the suspend path or the resume path) I see a RIP in the
>> dmesg (see additional info in the Debian bug), and later suspends do not
>> work: it only goes to the KDE login screen.
>>
>> I was unable due to network connectivity to do a full bisect but tested
>> with the patch I had on my laptop:
>>
>> 5.10.101 works, 5.10 from debian works
>> 5.11 works
>> 5.12 works
>> 5.13 suspend works but when resuming the PC is dead I have to reboot
>> 5.14 seems to work but looking at dmesg it is full of RIP messages at
>> various places.
>> 5.15.24 is as described; 5.15 from debian is behaving identically
>> 5.16 from debian is behaving identically.
>>
>>>> Is this system S0i3 or regular S3?
>>
>> For me it is real S3.
>>
>> The proposed patch is intended for Intel + Intel GPU + amdgpu but I have
>> dual AMD GPUs.
> 
> It doesn't really matter what the platform is, it could still
> potentially help on your system, it depends on the bios implementation
> for your platform and how it handles suspend. You can try the patch,
> but I don't think you are hitting the same issue.  A bisect would be
> helpful in your case.
> 
> Alex



Bug#1005005: Regression from 3c196f056666 ("drm/amdgpu: always reset the asic in suspend (v2)") on suspend?

2022-02-14 Thread Thorsten Leemhuis


[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text you find below is compiled from
a few template paragraphs you might have encountered already in
similar mails.]

Hi, this is your Linux kernel regression tracker speaking.

CCing the regression mailing list, as it should be in the loop for all
regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure this issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced 3c196f05
#regzbot title amdgfx: suspend stopped working
#regzbot ignore-activity
#regzbot link: https://bugs.debian.org/1005005
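
As an illustration of how such command lines could be picked out of a mail body, here is a minimal sketch. It is not regzbot's actual parser; the regular expression and the function name are assumptions made for this example, and only the four command lines shown above are covered.

```python
import re

# Hypothetical pattern: '#regzbot', a command word (optionally prefixed with
# '^' and optionally followed by ':'), then the rest of the line as argument.
REGZBOT_LINE = re.compile(r"^#regzbot\s+(\^?[\w-]+):?\s*(.*)$")


def parse_regzbot_commands(body: str) -> list[tuple[str, str]]:
    """Return (command, argument) pairs for every '#regzbot' line in `body`."""
    commands = []
    for line in body.splitlines():
        match = REGZBOT_LINE.match(line.strip())
        if match:
            commands.append((match.group(1), match.group(2).strip()))
    return commands


mail_body = """\
#regzbot ^introduced 3c196f05
#regzbot title amdgfx: suspend stopped working
#regzbot ignore-activity
#regzbot link: https://bugs.debian.org/1005005
"""

for command, argument in parse_regzbot_commands(mail_body):
    print(command, "->", argument)
```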

Reminder for developers: when fixing the issue, please add 'Link:'
tags pointing to the report (the mail quoted above) using
lore.kernel.org/r/, as explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'. This allows the bot to connect
the report with any patches posted or committed to fix the issue; this
again allows the bot to show the current status of regressions and
automatically resolve the issue when the fix hits the right tree.
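
To make the mechanism concrete, here is a minimal sketch of how a tool could connect a commit to a report via its 'Link:' tags. It is an illustration under assumptions, not regzbot's implementation: the function names, the sample commit message, and the lore URL with its message-id are all made up for this example.

```python
import re

# A 'Link:' tag sits on a line of its own and is followed by a URL.
LINK_TAG = re.compile(r"^Link:[ \t]*(https?://\S+)[ \t]*$", re.MULTILINE)


def report_links(commit_message: str) -> list[str]:
    """Return all URLs given in 'Link:' tags of a commit message."""
    return LINK_TAG.findall(commit_message)


def refers_to_report(commit_message: str, report_url: str) -> bool:
    """Check whether any 'Link:' tag points to `report_url`
    (a trailing slash is ignored on both sides)."""
    wanted = report_url.rstrip("/")
    return any(link.rstrip("/") == wanted for link in report_links(commit_message))


# Hypothetical commit message; the message-id in the URL is invented.
message = """\
drm/amdgpu: example fix for a suspend regression

Explain the problem and the fix here.

Link: https://lore.kernel.org/r/example-message-id@example.org/
Signed-off-by: Jane Developer <jane@example.org>
"""

print(refers_to_report(message, "https://lore.kernel.org/r/example-message-id@example.org"))
```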

I'm sending this to everyone that got the initial report, to make them
aware of the tracking. I also hope that messages like this motivate
people to directly get at least the regression mailing list and ideally
even regzbot involved when dealing with regressions, as messages like
this wouldn't be needed then.

Don't worry, I'll send further messages wrt this regression just to
the lists (with a tag in the subject so people can filter them away), if
they are relevant just for regzbot. With a bit of luck no such messages
will be needed anyway.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)



On 12.02.22 19:23, Salvatore Bonaccorso wrote:
> Hi Alex, hi all
> 
> In Debian we got a regression report from Dominique Dumont, CC'ed in
> https://bugs.debian.org/1005005 that after an update to a 5.15.15 based
> kernel, his machine no longer suspends correctly: after the screen goes
> black as usual, it comes back. The Debian bug above contains a trace.
> 
> Dominique confirmed that this issue persisted after updating to 5.16.7;
> furthermore, he bisected the issue and found
> 
>   3c196f0510912645c7c5d9107706003f67c3 is the first bad commit
>   commit 3c196f0510912645c7c5d9107706003f67c3
>   Author: Alex Deucher 
>   Date:   Fri Nov 12 11:25:30 2021 -0500
> 
>   drm/amdgpu: always reset the asic in suspend (v2)
> 
>   [ Upstream commit daf8de0874ab5b74b38a38726fdd3d07ef98a7ee ]
> 
>   If the platform suspend happens to fail and the power rail
>   is not turned off, the GPU will be in an unknown state on
>   resume, so reset the asic so that it will be in a known
>   good state on resume even if the platform suspend failed.
> 
>   v2: handle s0ix
> 
>   Acked-by: Luben Tuikov 
>   Acked-by: Evan Quan 
>   Signed-off-by: Alex Deucher 
>   Signed-off-by: Sasha Levin 
> 
>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 ++++-
>    1 file changed, 4 insertions(+), 1 deletion(-)
> 
> to be the first bad commit, see https://bugs.debian.org/1005005#34 .
> 
> Does this ring any bell? Any idea on the problem?
> 
> Regards,
> Salvatore

-- 
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface, the
getting started guide, and the reference documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression it's in your interest to
CC the regression list and tell regzbot about the issue, as that ensures
the regression makes it onto the radar of the Linux kernel's regression
tracker -- that's in your interest, as it ensures your report won't fall
through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include a 'Link:' tag in the patch description pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in