Re: Bug#1061449: linux-image-6.7-amd64: a boot message from amdgpu

2024-01-29 Thread Salvatore Bonaccorso
Hi,

[For this reply I am dropping the Debian bug report from the recipients, to
avoid later follow-ups sending the ack to the mailing list and adding noise.]

On Sun, Jan 28, 2024 at 11:44:59AM +0100, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 27.01.24 14:14, Salvatore Bonaccorso wrote:
> >
> > In Debian (https://bugs.debian.org/1061449) we got the following
> > quoted report:
> > 
> > On Wed, Jan 24, 2024 at 07:38:16PM +0100, Patrice Duroux wrote:
> >>
> >> Giving a try to 6.7, here is a message extracted from dmesg:
> >> [4.177226] [ cut here ]
> >> [4.177227] WARNING: CPU: 6 PID: 248 at
> >> drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_factory.c:387
> >> construct_phy+0xb26/0xd60 [amdgpu]
> > [...]
> 
> Not my area of expertise, but looks a lot like a duplicate of
> https://gitlab.freedesktop.org/drm/amd/-/issues/3122#note_2252835
> 
> Mario (now CCed) already prepared a patch for that issue that seems to work.

#regzbot link: https://gitlab.freedesktop.org/drm/amd/-/issues/3122

Thanks. The reporter has indeed confirmed in
https://bugs.debian.org/1061449#55 that the patch fixes the issue.

So this is a duplicate of the above.

Regards,
Salvatore


Re: Bug#1061449: linux-image-6.7-amd64: a boot message from amdgpu

2024-01-27 Thread Salvatore Bonaccorso
Hi

In Debian (https://bugs.debian.org/1061449) we got the following
quoted report:

On Wed, Jan 24, 2024 at 07:38:16PM +0100, Patrice Duroux wrote:
> Package: src:linux
> Version: 6.7.1-1~exp1
> Severity: normal
> 
> Dear Maintainer,
> 
> Giving a try to 6.7, here is a message extracted from dmesg:
> 
> [4.177226] [ cut here ]
> [4.177227] WARNING: CPU: 6 PID: 248 at
> drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_factory.c:387
> construct_phy+0xb26/0xd60 [amdgpu]
> [4.177658] Modules linked in: amdgpu(+) i915(+) sd_mod drm_exec amdxcp
> gpu_sched drm_buddy nvme i2c_algo_bit drm_suballoc_helper drm_display_helper
> ahci nvme_core hid_generic crc32_pclmul libahci crc32c_intel t10_pi cec libata
> crc64_rocksoft_generic ghash_clmulni_intel rc_core drm_ttm_helper
> crc64_rocksoft sha512_ssse3 i2c_hid_acpi ttm rtsx_pci_sdmmc i2c_hid xhci_pci
> crc_t10dif sha512_generic mmc_core scsi_mod xhci_hcd drm_kms_helper video hid
> crct10dif_generic intel_lpss_pci crct10dif_pclmul i2c_i801 sha256_ssse3
> intel_lpss crc64 thunderbolt drm e1000e usbcore sha1_ssse3 rtsx_pci i2c_smbus
> scsi_common crct10dif_common idma64 usb_common battery wmi button aesni_intel
> crypto_simd cryptd
> [4.177689] CPU: 6 PID: 248 Comm: (udev-worker) Not tainted 6.7-amd64 #1
> Debian 6.7.1-1~exp1
> [4.177691] Hardware name: Dell Inc. Precision 7540/0T2FXT, BIOS 1.29.0
> 11/03/2023
> [4.177692] RIP: 0010:construct_phy+0xb26/0xd60 [amdgpu]
> [4.178050] Code: b9 01 00 00 00 83 fe 01 74 40 48 8b 82 f8 03 00 00 89 f2 48 c7 c6 00 35 a7 c1 48 8b 40 10 48 8b 00 48 8b 78 08 e8 ba b7 5b fb <0f> 0b 49 8b 87 d0 01 00 00 b9 0f 00 00 00 48 8b 80 e8 04 00 00 48
> [4.178052] RSP: 0018:aad300857408 EFLAGS: 00010246
> [4.178053] RAX:  RBX: 96df636a1700 RCX:
> c000efff
> [4.178054] RDX:  RSI: efff RDI:
> 0001
> [4.178055] RBP: 96df4d379c00 R08:  R09:
> aad3008571d0
> [4.178056] R10: 0003 R11: bded2428 R12:
> aad300857474
> [4.178057] R13: c1933140 R14: aad3008577d0 R15:
> 96df43e82000
> [4.178058] FS:  7fcd5d9648c0() GS:96e2cc38()
> knlGS:
> [4.178060] CS:  0010 DS:  ES:  CR0: 80050033
> [4.178061] CR2: 7fcd5d932a6d CR3: 000103e9a004 CR4:
> 003706f0
> [4.178062] DR0:  DR1:  DR2:
> 
> [4.178063] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [4.178063] Call Trace:
> [4.178066]  
> [4.178067]  ? construct_phy+0xb26/0xd60 [amdgpu]
> [4.178422]  ? __warn+0x81/0x130
> [4.178426]  ? construct_phy+0xb26/0xd60 [amdgpu]
> [4.178784]  ? report_bug+0x171/0x1a0
> [4.178787]  ? handle_bug+0x3c/0x80
> [4.178789]  ? exc_invalid_op+0x17/0x70
> [4.178790]  ? asm_exc_invalid_op+0x1a/0x20
> [4.178793]  ? construct_phy+0xb26/0xd60 [amdgpu]
> [4.179149]  ? construct_phy+0xb26/0xd60 [amdgpu]
> [4.179507]  link_create+0x1b2/0x200 [amdgpu]
> [4.179865]  create_links+0x135/0x420 [amdgpu]
> [4.180196]  dc_create+0x321/0x640 [amdgpu]
> [4.180529]  amdgpu_dm_init.isra.0+0x2a0/0x1ed0 [amdgpu]
> [4.180881]  ? sysvec_apic_timer_interrupt+0xe/0x90
> [4.180883]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [4.180885]  ? delay_tsc+0x37/0xa0
> [4.180889]  dm_hw_init+0x12/0x30 [amdgpu]
> [4.181240]  amdgpu_device_init+0x1e42/0x24a0 [amdgpu]
> [4.181517]  amdgpu_driver_load_kms+0x19/0x190 [amdgpu]
> [4.181793]  amdgpu_pci_probe+0x165/0x4c0 [amdgpu]
> [4.182067]  local_pci_probe+0x42/0xa0
> [4.182070]  pci_device_probe+0xc7/0x240
> [4.182072]  really_probe+0x19b/0x3e0
> [4.182075]  ? __pfx___driver_attach+0x10/0x10
> [4.182076]  __driver_probe_device+0x78/0x160
> [4.182078]  driver_probe_device+0x1f/0x90
> [4.182079]  __driver_attach+0xd2/0x1c0
> [4.182081]  bus_for_each_dev+0x85/0xd0
> [4.182083]  bus_add_driver+0x116/0x220
> [4.182085]  driver_register+0x59/0x100
> [4.182087]  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
> [4.182356]  do_one_initcall+0x58/0x320
> [4.182359]  do_init_module+0x60/0x240
> [4.182361]  init_module_from_file+0x89/0xe0
> [4.182364]  idempotent_init_module+0x120/0x2b0
> [4.182366]  __x64_sys_finit_module+0x5e/0xb0
> [4.182367]  do_syscall_64+0x61/0x120
> [4.182370]  ? do_syscall_64+0x70/0x120
> [4.182372]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [4.182375] RIP: 0033:0x7fcd5e130f19
> [4.182376] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cf 1e 0d 00 f7 d8 64 89 01 48
> [4.182378] RSP: 002b:7ffd314afa38 EFLAGS: 0246 ORIG_RAX:
> 0139
> [4.182379] RAX: ffda RBX:

Re: Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently

2023-10-24 Thread Salvatore Bonaccorso
Hi Timo,

On Tue, Oct 24, 2023 at 11:14:32PM +0300, Timo Lindfors wrote:
> Package: src:linux
> Version: 6.1.55-1
> Severity: normal
> 
> Steps to reproduce:
> 1) Install Debian 12 as a virtual machine using virt-manager and choose the qxl
>    graphics card. You only need a basic installation without Wayland or X.
> 2) Log in from the console and save the following to reproduce.bash:
> 
> #!/bin/bash
> 
> chvt 3
> for j in $(seq 80); do
>     echo "$(date) starting round $j"
>     if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
>         echo "bug was reproduced after $j tries"
>         exit 1
>     fi
>     for i in $(seq 100); do
>         dmesg > /dev/tty3
>     done
> done
> 
> echo "bug could not be reproduced"
> exit 0
> 
> 
> 3) Run chmod a+x reproduce.bash
> 4) Run ./reproduce.bash and wait for up to 20 minutes.
> 
> Expected results:
> 4) The system prints a steady flow of text without kernel error messages
> 
> Actual messages:
> 4) At some point the text stops flowing and the script prints "bug was
>    reproduced". If you run "journalctl --boot", you see:
> 
> kernel: [TTM] Buffer eviction failed
> kernel: qxl 0000:00:02.0: object_init failed for (3149824, 0x0001)
> kernel: [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
> 
> 
> 
> More info:
> 1) The bug does not occur if I downgrade the kernel to
>    linux-image-5.10.0-26-amd64_5.10.197-1_amd64.deb from Debian 11.
> 2) I used the following test_linux.bash to bisect this issue against the
>    upstream sources:
> 
> #!/bin/bash
> set -x
> 
> gitversion="$(git describe HEAD|sed 's@^v@@')"
> 
> git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> git show bec771b5e0901f4b0bc861bcb58056de5151ae3a | patch -p1
> # Build
> cp ~/kernel.config .config
> # cp /boot/config-$(uname -r) .config
> # scripts/config --enable LOCALVERSION_AUTO
> # scripts/config --disable DEBUG_INFO
> # scripts/config --disable SYSTEM_TRUSTED_KEYRING
> # scripts/config --set-str SYSTEM_TRUSTED_KEYS ''
> # scripts/config --disable STACKPROTECTOR_STRONG
> make olddefconfig
> # make localmodconfig
> make -j$(nproc --all) bindeb-pkg
> rc="$?"
> if [ "$rc" != "0" ]; then
>     exit 125
> fi
> git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> 
> package="$(ls --sort=time ../linux-image-*_amd64.deb|head -n1)"
> version=$(echo $package | cut -d_ -f1|cut -d- -f3-)
> 
> if [ "$gitversion" != "$version" ]; then
>     echo "Build produced version $gitversion but got $version, ignoring"
>     #exit 255
> fi
> 
> # Deploy
> scp $package target:a.deb
> ssh target sudo apt install ./a.deb
> ssh target rm -f a.deb
> ssh target ./grub_set_default_version.bash $version
> ssh target sudo shutdown -r now
> sleep 40
> 
> detected_version=$(ssh target uname -r)
> if [ "$detected_version" != "$version" ]; then
>     echo "Booted to $detected_version but expected $version"
>     exit 255
> fi
> 
> # Test
> exec ssh target sudo ./reproduce.bash
> 
> 
> Bisect printed the following log:
> 
> git bisect start
> # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.
> git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> # bad: [762949bb1da78941b25e63f7e952af037eee15a9] drm: fix drm_mode_create_blob comment
> git bisect bad 762949bb1da78941b25e63f7e952af037eee15a9
> # bad: [e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc] drm/gma500: Drop DRM_GMA600 config option
> git bisect bad e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc
> # bad: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> git bisect bad 5a838e5d5825c85556011478abde708251cc0776
> # bad: [d2b6f8a179194de0ffc4886ffc2c4358d86047b8] Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> git bisect bad d2b6f8a179194de0ffc4886ffc2c4358d86047b8
> # bad: [68a32ba14177d4a21c4a9a941cf1d7aea86d436f] Merge tag 'drm-next-2021-04-28' of git://anongit.freedesktop.org/drm/drm
> git bisect bad 68a32ba14177d4a21c4a9a941cf1d7aea86d436f
> # bad: [0698b13403788a646073fcd9b2294f2dce0ce429] drm/amdgpu: skip PP_MP1_STATE_UNLOAD on aldebaran
> git bisect bad 0698b13403788a646073fcd9b2294f2dce0ce429
> # bad: [e1a5e6a8c48bf99ea374fb3e535661cfe226bca4] drm/doc: Add RFC section
> git bisect bad e1a5e6a8c48bf99ea374fb3e535661cfe226bca4
> # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.
> git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> # bad: [2c8ab3339e398bbbcb0980933e266b93bedaae52] drm/i915: Pin timeline map after first timeline pin, v4.
> git bisect bad 2c8ab3339e398bbbcb0980933e266b93bedaae52
> # bad: [2eb8e1a69d9f8cc9c0a75e327f854957224ba421] drm/i915/gem: Drop relocation support on all new hardware (v6)
> git bisect bad 2eb8e1a69d9f8cc9c0a75e327f854957224ba421
> # bad: [b5b6f6a610127b17f20c0ca03dd27beee4ddc2b2] drm/i915/gem: Drop legacy execbuffer support (v

Re: [PATCH 1/2] fbdev/offb: Update expected device name

2023-04-16 Thread Salvatore Bonaccorso
Hi

Looping in the regressions list as well (hoping I do not make any mistakes
with the regzbot commands):

On Wed, Apr 12, 2023 at 11:55:08AM +0200, Cyril Brulebois wrote:
> Since commit 241d2fb56a18 ("of: Make OF framebuffer device names unique"),
> as spotted by Frédéric Bonnard, the historical "of-display" device is
> gone: the updated logic creates "of-display.0" instead, then as many
> "of-display.N" as required.
> 
> This means that offb no longer finds the expected device, which prevents
> the Debian Installer from setting up its interface, at least on ppc64el.
> 
> It might be better to iterate on all possible nodes, but updating the
> hardcoded device from "of-display" to "of-display.0" is confirmed to fix
> the Debian Installer at the very least.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217328
> Link: https://bugs.debian.org/1033058
> Fixes: 241d2fb56a18 ("of: Make OF framebuffer device names unique")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Cyril Brulebois 
> ---
>  drivers/video/fbdev/offb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/video/fbdev/offb.c b/drivers/video/fbdev/offb.c
> index b97d251d894b..6264c7184457 100644
> --- a/drivers/video/fbdev/offb.c
> +++ b/drivers/video/fbdev/offb.c
> @@ -698,7 +698,7 @@ MODULE_DEVICE_TABLE(of, offb_of_match_display);
>  
>  static struct platform_driver offb_driver_display = {
>   .driver = {
> - .name = "of-display",
> + .name = "of-display.0",
>   .of_match_table = offb_of_match_display,
>   },
>   .probe = offb_probe_display,

#regzbot ^introduced 241d2fb56a18
#regzbot title: Open Firmware framebuffer cannot find of-display
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=217328
#regzbot link: https://lore.kernel.org/all/20230412095509.2196162-1-cy...@debamax.com/T/#m34493480243a2cad2ae359abfd9db5e755f41add
#regzbot link: https://bugs.debian.org/1033058

Regards,
Salvatore


Assertion failure in i915 intel_display.c#assert_plane() after resume from hibernation

2023-01-23 Thread Salvatore Bonaccorso
Hi

A user in Debian (CC'ed) reported the following issue when resuming from
hibernation; it was also tested on the recent 6.1.7 kernel. For context, see
https://bugs.debian.org/971068

> Can repro on the sid kernel, uname -a of
>   Linux nabtop 6.1.0-2-686-pae #1 SMP PREEMPT_DYNAMIC Debian 6.1.7-1 
> (2023-01-18) i686 GNU/Linux
> 
> Log below. Backtrace is only trivially different.
> 
> Best,
> наб
> 
> -- >8 --
> Jan 22 14:06:46 nabtop kernel: OOM killer disabled.
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Marking nosave pages: [mem 
> 0x-0x0fff]
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Marking nosave pages: [mem 
> 0x0009f000-0x000f]
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Marking nosave pages: [mem 
> 0xb5aa1000-0xb5aa6fff]
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Marking nosave pages: [mem 
> 0xb5bba000-0xb5c0efff]
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Marking nosave pages: [mem 
> 0xb5d08000-0xb5f0efff]
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Marking nosave pages: [mem 
> 0xb5f18000-0xb5f1efff]
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Marking nosave pages: [mem 
> 0xb5f65000-0xb5f9efff]
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Marking nosave pages: [mem 
> 0xb5fe1000-0xb5ffefff]
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Marking nosave pages: [mem 
> 0xb600-0x]
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Basic memory bitmaps created
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Preallocating image memory
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Allocated 183519 pages for 
> snapshot
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Allocated 734076 kbytes in 
> 0.70 seconds (1048.68 MB/s)
> Jan 22 14:06:46 nabtop kernel: Freezing remaining freezable tasks ... 
> (elapsed 0.001 seconds) done.
> Jan 22 14:06:46 nabtop kernel: wifi1: deauthenticating from de:0d:17:ad:80:55 
> by local choice (Reason: 3=DEAUTH_LEAVING)
> Jan 22 14:06:46 nabtop kernel: ACPI: EC: interrupt blocked
> Jan 22 14:06:46 nabtop kernel: ACPI: PM: Preparing to enter system sleep 
> state S4
> Jan 22 14:06:46 nabtop kernel: ACPI: EC: event blocked
> Jan 22 14:06:46 nabtop kernel: ACPI: EC: EC stopped
> Jan 22 14:06:46 nabtop kernel: ACPI: PM: Saving platform NVS memory
> Jan 22 14:06:46 nabtop kernel: Disabling non-boot CPUs ...
> Jan 22 14:06:46 nabtop kernel: smpboot: CPU 1 is now offline
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Creating image:
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Need to copy 175700 pages
> Jan 22 14:06:46 nabtop kernel: PM: hibernation: Normal pages needed: 57765 + 
> 1024, available pages: 167322
> Jan 22 14:06:46 nabtop kernel: ACPI: PM: Restoring platform NVS memory
> Jan 22 14:06:46 nabtop kernel: ACPI: EC: EC started
> Jan 22 14:06:46 nabtop kernel: Enabling non-boot CPUs ...
> Jan 22 14:06:46 nabtop kernel: x86: Booting SMP configuration:
> Jan 22 14:06:46 nabtop kernel: smpboot: Booting Node 0 Processor 1 APIC 0x1
> Jan 22 14:06:46 nabtop kernel: CPU1 is up
> Jan 22 14:06:46 nabtop kernel: ACPI: PM: Waking up from system sleep state S4
> Jan 22 14:06:46 nabtop kernel: ACPI: EC: interrupt unblocked
> Jan 22 14:06:46 nabtop kernel: ACPI: EC: event unblocked
> Jan 22 14:06:46 nabtop kernel: usb usb1: root hub lost power or was reset
> Jan 22 14:06:46 nabtop kernel: usb usb2: root hub lost power or was reset
> Jan 22 14:06:46 nabtop kernel: usb usb4: root hub lost power or was reset
> Jan 22 14:06:46 nabtop kernel: usb usb3: root hub lost power or was reset
> Jan 22 14:06:46 nabtop kernel: usb usb6: root hub lost power or was reset
> Jan 22 14:06:46 nabtop kernel: usb usb7: root hub lost power or was reset
> Jan 22 14:06:46 nabtop kernel: usb usb8: root hub lost power or was reset
> Jan 22 14:06:46 nabtop kernel: usb usb5: root hub lost power or was reset
> Jan 22 14:06:46 nabtop kernel: sd 0:0:0:0: [sda] Starting disk
> Jan 22 14:06:46 nabtop kernel: iwlwifi 0000:08:00.0: Radio type=0x1-0x2-0x0
> Jan 22 14:06:46 nabtop kernel: iwlwifi 0000:08:00.0: Radio type=0x1-0x2-0x0
> Jan 22 14:06:46 nabtop kernel: [ cut here ]
> Jan 22 14:06:46 nabtop kernel: primary B assertion failure (expected off, 
> current on)
> Jan 22 14:06:46 nabtop kernel: WARNING: CPU: 0 PID: 1038 at 
> drivers/gpu/drm/i915/display/intel_display.c:476 assert_plane+0x9f/0xb0 [i915]
> Jan 22 14:06:46 nabtop kernel: Modules linked in: ghash_generic gf128mul gcm 
> ccm algif_aead des_generic libdes ecb algif_skcipher bnep cmac md4 algif_hash 
> af_alg binfmt_misc btusb btrtl btbcm btintel btmtk bluetooth 
> jitterentropy_rng sha512_generic ctr drbg joydev ansi_cprng ecdh_generic ecc 
> iwldvm mac80211 libarc4 iTCO_wdt intel_pmc_bxt snd_hda_codec_conexant 
> iTCO_vendor_support uvcvideo watchdog snd_hda_codec_generic ledtrig_audio 
> videobuf2_vmalloc videobuf2_memops i915 videobuf2_v4l2 nls_ascii 
> snd_hda_intel iwlwifi videobuf2_common snd_intel_dspcfg drm_buddy 
> snd_intel_sdw_acpi nl

Re: [PATCH] nouveau: explicitly wait on the fence in nouveau_bo_move_m2mf

2022-09-20 Thread Salvatore Bonaccorso
Hi,

On Tue, Sep 20, 2022 at 01:36:32PM +0200, Karol Herbst wrote:
> On Tue, Sep 20, 2022 at 12:42 PM Salvatore Bonaccorso  
> wrote:
> >
> > Hi,
> >
> > On Fri, Aug 19, 2022 at 10:09:28PM +0200, Karol Herbst wrote:
> > > It is a bit unclear to us why that's helping, but it does and unbreaks
> > > suspend/resume on a lot of GPUs without any known drawbacks.
> > >
> > > Cc: sta...@vger.kernel.org # v5.15+
> > > Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
> > > Signed-off-by: Karol Herbst 
> > > ---
> > >  drivers/gpu/drm/nouveau/nouveau_bo.c | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c 
> > > b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > index 35bb0bb3fe61..126b3c6e12f9 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > @@ -822,6 +822,15 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, 
> > > int evict,
> > >   if (ret == 0) {
> > >   ret = nouveau_fence_new(chan, false, &fence);
> > >   if (ret == 0) {
> > > + /* TODO: figure out a better solution here
> > > +  *
> > > +  * wait on the fence here explicitly as 
> > > going through
> > > +  * ttm_bo_move_accel_cleanup somehow 
> > > doesn't seem to do it.
> > > +  *
> > > +  * Without this the operation can timeout 
> > > and we'll fallback to a
> > > +  * software copy, which might take several 
> > > minutes to finish.
> > > +  */
> > > + nouveau_fence_wait(fence, false, false);
> > >   ret = ttm_bo_move_accel_cleanup(bo,
> > >   
> > > &fence->base,
> > >   evict, 
> > > false,
> > > --
> > > 2.37.1
> > >
> > >
> >
> > While this is marked for 5.15+ only, a user in Debian was also seeing the
> > suspend issue on 5.10.y and confirmed that the commit fixes the issue in
> > the 5.10.y series as well:
> >
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#69
> >
> > Karol, Lyude, should this also be picked for 5.10.y?
> >
> 
> mhh from the original report 5.10 was fine, but maybe something got
> backported and it broke it? I'll try to do some testing on my machine
> and see what I can figure out, but it could also be a debian only
> issue at this point.

Right, this is a possibility; thanks for looking into it!

Computer Enthusiastic, can you reproduce the problem with an upstream
kernel from the 5.10.y series without the Debian patches (latest 5.10.144)
and verify the fix there as well?

Regards,
Salvatore


Re: [PATCH] nouveau: explicitly wait on the fence in nouveau_bo_move_m2mf

2022-09-20 Thread Salvatore Bonaccorso
Hi,

On Fri, Aug 19, 2022 at 10:09:28PM +0200, Karol Herbst wrote:
> It is a bit unclear to us why that's helping, but it does and unbreaks
> suspend/resume on a lot of GPUs without any known drawbacks.
> 
> Cc: sta...@vger.kernel.org # v5.15+
> Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
> Signed-off-by: Karol Herbst 
> ---
>  drivers/gpu/drm/nouveau/nouveau_bo.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c 
> b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 35bb0bb3fe61..126b3c6e12f9 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -822,6 +822,15 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int 
> evict,
>   if (ret == 0) {
>   ret = nouveau_fence_new(chan, false, &fence);
>   if (ret == 0) {
> + /* TODO: figure out a better solution here
> +  *
> +  * wait on the fence here explicitly as going 
> through
> +  * ttm_bo_move_accel_cleanup somehow doesn't 
> seem to do it.
> +  *
> +  * Without this the operation can timeout and 
> we'll fallback to a
> +  * software copy, which might take several 
> minutes to finish.
> +  */
> + nouveau_fence_wait(fence, false, false);
>   ret = ttm_bo_move_accel_cleanup(bo,
>   &fence->base,
>   evict, false,
> -- 
> 2.37.1
> 
> 

While this is marked for 5.15+ only, a user in Debian was also seeing the
suspend issue on 5.10.y and confirmed that the commit fixes the issue in
the 5.10.y series as well:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#69

Karol, Lyude, should this also be picked for 5.10.y?

Regards,
Salvatore


Regression from 3c196f056666 ("drm/amdgpu: always reset the asic in suspend (v2)") on suspend?

2022-02-12 Thread Salvatore Bonaccorso
Hi Alex, hi all

In Debian we got a regression report from Dominique Dumont, CC'ed in
https://bugs.debian.org/1005005 that afer an update to 5.15.15 based
kernel, his machine noe longer suspends correctly, after screen going
black as usual it comes back. The Debian bug above contians a trace.

Dominique confirmed that this issue persisted after updating to 5.16.7.
Furthermore, he bisected the issue and found

3c196f05666610912645c7c5d9107706003f67c3 is the first bad commit
commit 3c196f05666610912645c7c5d9107706003f67c3
Author: Alex Deucher 
Date:   Fri Nov 12 11:25:30 2021 -0500

drm/amdgpu: always reset the asic in suspend (v2)

[ Upstream commit daf8de0874ab5b74b38a38726fdd3d07ef98a7ee ]

If the platform suspend happens to fail and the power rail
is not turned off, the GPU will be in an unknown state on
resume, so reset the asic so that it will be in a known
good state on resume even if the platform suspend failed.

v2: handle s0ix

Acked-by: Luben Tuikov 
Acked-by: Evan Quan 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 

 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

to be the first bad commit, see https://bugs.debian.org/1005005#34 .

Does this ring a bell? Any ideas about the problem?

Regards,
Salvatore


Re: [PATCH v2 0/3] drm/nouveau: fix a use-after-free in postclose()

2021-10-11 Thread Salvatore Bonaccorso
Hi Ben,

On Tue, Aug 17, 2021 at 04:32:31PM -0400, Lyude Paul wrote:
> It may have been, we're in the process of trying to change around how we
> currently accept nouveau patches to stop this from happening in the future.
> 
> Ben, whenever you get a moment can you take a look at this?
> 
> On Mon, 2021-08-16 at 09:03 +0200, Salvatore Bonaccorso wrote:
> > Hi,
> > 
> > On Fri, Mar 26, 2021 at 06:00:51PM -0400, Lyude Paul wrote:
> > > This patch series is:
> > > 
> > > Reviewed-by: Lyude Paul 
> > > 
> > > Btw - in the future if you need to send a respin of multiple patches, you
> > > need to send it as its own separate series instead of replying to the
> > > previous one (one-off respins can just be posted as replies though),
> > > otherwise patchwork won't pick it up
> > 
> > Did this patch series somehow fall through the cracks or get lost?

While looking through some older threads, I noticed this one. Ben, did you
get a chance to look at it, or has it become irrelevant by other means?

Regards,
Salvatore


Re: [PATCH v2 0/3] drm/nouveau: fix a use-after-free in postclose()

2021-08-16 Thread Salvatore Bonaccorso
Hi,

On Fri, Mar 26, 2021 at 06:00:51PM -0400, Lyude Paul wrote:
> This patch series is:
> 
> Reviewed-by: Lyude Paul 
> 
> Btw - in the future if you need to send a respin of multiple patches, you need
> to send it as its own separate series instead of replying to the previous one
> (one-off respins can just be posted as replies though), otherwise patchwork
> won't pick it up

Did this patch series somehow fall through the cracks or get lost?

Regards,
Salvatore


Re: [PATCH v3] drm: Use USB controller's DMA mask when importing dmabufs

2021-02-25 Thread Salvatore Bonaccorso
Hi,

On Thu, Feb 25, 2021 at 07:01:57PM +, Sudip Mukherjee wrote:
> On Tue, Feb 23, 2021 at 02:09:58PM +0100, Greg KH wrote:
> > On Tue, Feb 23, 2021 at 01:51:09PM +0100, Thomas Zimmermann wrote:
> > > Hi
> > > 
> > > Am 23.02.21 um 13:24 schrieb Greg KH:
> > > > On Tue, Feb 23, 2021 at 01:14:30PM +0100, Daniel Vetter wrote:
> > > > > On Tue, Feb 23, 2021 at 1:02 PM Greg KH  
> > > > > wrote:
> > > > > > 
> > > > > > On Tue, Feb 23, 2021 at 12:46:20PM +0100, Daniel Vetter wrote:
> > > > > > > On Tue, Feb 23, 2021 at 12:19:56PM +0100, Greg KH wrote:
> > > > > > > > On Tue, Feb 23, 2021 at 11:58:42AM +0100, Thomas Zimmermann 
> > > > > > > > wrote:
> > > > > > > > > USB devices cannot perform DMA and hence have no dma_mask set in
> > > > > > > > > their device structure. Importing dmabuf into a USB-based driver
> > > > > > > > > fails, which breaks joining and mirroring of display in X11.
> > > > > > > > > 
> > > > > > > > > For USB devices, pick the associated USB controller as attachment
> > > > > > > > > device, so that it can perform DMA. If the DMA controller does not
> > > > > > > > > support DMA transfers, we're out of luck and cannot import.
> > > > > > > > > 
> > > > > > > > > Drivers should use DRM_GEM_SHMEM_DROVER_OPS_USB to initialize their
> > > > > > > > > instance of struct drm_driver.
> > > > > > > > > 
> > > > > > > > > Tested by joining/mirroring displays of udl and radeon under
> > > > > > > > > Gnome/X11.
> > > > > > > > > 
> > > > > > > > > v3:
> > > > > > > > >  * drop gem_create_object
> > > > > > > > >  * use DMA mask of USB controller, if any (Daniel, Christian, Noralf)
> > > > > > > > > v2:
> > > > > > > > >  * move fix to importer side (Christian, Daniel)
> > > > > > > > >  * update SHMEM and CMA helpers for new PRIME callbacks
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Thomas Zimmermann 
> > > > > > > > > Fixes: 6eb0233ec2d0 ("usb: don't inherity DMA properties for USB devices")
> > > > > > > > > Cc: Christoph Hellwig 
> > > > > > > > > Cc: Greg Kroah-Hartman 
> > > > > > > > > Cc: Johan Hovold 
> > > > > > > > > Cc: Alan Stern 
> > > > > > > > > Cc: Andy Shevchenko 
> > > > > > > > > Cc: Sebastian Andrzej Siewior 
> > > > > > > > > Cc: Mathias Nyman 
> > > > > > > > > Cc: Oliver Neukum 
> > > > > > > > > Cc: Thomas Gleixner 
> > > > > > > > > Cc:  # v5.10+
> > > > > > > > > ---
> 
> > > > > > > > 
> > > > > > > > There shouldn't be anything "special" about a DRM driver that needs
> > > > > > > > this vs. any other driver that might want to know about DMA things
> > > > > > > > related to a specific USB device.  Why isn't this an issue with the
> > > > > > > > existing storage or v4l USB devices?
> > > > > > > 
> > > > > > > The trouble is that this is a regression fix for 5.9, because the
> > > > > > > dma-api got more opinionated about what it allows. The proper fix is
> > > > > > > a lot more invasive (we essentially need to rework the drm_prime.c to
> > > > > > > allow dma-buf importing for just cpu access), and that's a ton more
> > > > > > > invasive than just a small patch which we can stuff into stable
> > > > > > > kernels.
> > > > > > > 
> > > > > > > This here is ugly, but it should at least get rid of black screens
> > > > > > > again.
> > > > > > > 
> > > > > > > I think a solid FIXME comment explaining the situation would be good.
> > > > > > 
> > > > > > Why can't I take a USB patch for a regression fix?  Is drm somehow
> > > > > > stand-alone that you make changes here that should belong in other
> > > > > > subsystems?
> > > > > > 
> > > > > > {hint, it shouldn't be}
> > > > > > 
> > > > > > When you start poking in the internals of usb controller structures,
> > > > > > that logic belongs in the USB core for all drivers to use, not in a
> > > > > > random tiny subsystem where no USB developer will ever notice it?  :)
> > > > > 
> > > > > Because the usb fix isn't the right fix here, it's just the duct-tape.
> > > > > We don't want to dig around in these internals, it's just a convenient
> > > > > way to shut up the dma-api until drm has sorted out its confusion.
> > > > > 
> > > > > We can polish the turd if you want, but the thing is, it's still a turd ...
> > > > > 
> > > > > The right fix is to change drm_prime.c code to not call dma_map_sg
> > > > > when we don't need it. The problem is that roughly 3 layers of code
> > > > > (drm_prime, dma-buf, gem shmem helpers) are involved. Plus, since
> > > > > drm_prime is shared by all drm drivers, all other drm drivers are
> > > > > impacted too. We're not going to be able to cc: stable that kind of
> > > > > stuff. Thomas actually started with that series, until I pointed out
> > > > > how bad things really are.
> > > > > 
> > > > > A

Re: [Nouveau] [PATCH] drm/nouveau: bail out of nouveau_channel_new if channel init fails

2021-02-07 Thread Salvatore Bonaccorso
Hi Ben,

On Mon, Nov 16, 2020 at 09:04:32AM +1000, Ben Skeggs wrote:
> On Mon, 16 Nov 2020 at 05:19, Karol Herbst  wrote:
> >
> > On Sun, Nov 15, 2020 at 6:43 PM Salvatore Bonaccorso  
> > wrote:
> > >
> > > Hi,
> > >
> > > On Fri, Aug 28, 2020 at 11:28:46AM +0200, Frantisek Hrbata wrote:
> > > > An unprivileged user can crash the kernel by using the
> > > > DRM_IOCTL_NOUVEAU_CHANNEL_ALLOC ioctl. This was reported by the trinity[1] fuzzer.
> > > >
> > > > [   71.073906] nouveau 0000:01:00.0: crashme[1329]: channel failed to initialise, -17
> > > > [   71.081730] BUG: kernel NULL pointer dereference, address: 
> > > > 00a0
> > > > [   71.088928] #PF: supervisor read access in kernel mode
> > > > [   71.094059] #PF: error_code(0x) - not-present page
> > > > [   71.099189] PGD 119590067 P4D 119590067 PUD 1054f5067 PMD 0
> > > > [   71.104842] Oops:  [#1] SMP NOPTI
> > > > [   71.108498] CPU: 2 PID: 1329 Comm: crashme Not tainted 5.8.0-rc6+ #2
> > > > [   71.114993] Hardware name: AMD Pike/Pike, BIOS RPK1506A 09/03/2014
> > > > [   71.121213] RIP: 0010:nouveau_abi16_ioctl_channel_alloc+0x108/0x380 
> > > > [nouveau]
> > > > [   71.128339] Code: 48 89 9d f0 00 00 00 41 8b 4c 24 04 41 8b 14 24 45 
> > > > 31 c0 4c 8d 4b 10 48 89 ee 4c 89 f7 e8 10 11 00 00 85 c0 75 78 48 8b 43 
> > > > 10 <8b> 90 a0 00 00 00 41 89 54 24 08 80 7d 3d 05 0f 86 bb 01 00 00 41
> > > > [   71.147074] RSP: 0018:b4a1809cfd38 EFLAGS: 00010246
> > > > [   71.152526] RAX:  RBX: 98cedbaa1d20 RCX: 
> > > > 03bf
> > > > [   71.159651] RDX: 03be RSI:  RDI: 
> > > > 00030160
> > > > [   71.166774] RBP: 98cee776de00 R08: dc0144198a08 R09: 
> > > > 98ceeefd4000
> > > > [   71.173901] R10: 98cee7e81780 R11: 0001 R12: 
> > > > b4a1809cfe08
> > > > [   71.181214] R13: 98cee776d000 R14: 98cec519e000 R15: 
> > > > 98cee776def0
> > > > [   71.188339] FS:  7fd926250500() GS:98ceeac8() 
> > > > knlGS:
> > > > [   71.196418] CS:  0010 DS:  ES:  CR0: 80050033
> > > > [   71.202155] CR2: 00a0 CR3: 000106622000 CR4: 
> > > > 000406e0
> > > > [   71.209297] Call Trace:
> > > > [   71.211777]  ? nouveau_abi16_ioctl_getparam+0x1f0/0x1f0 [nouveau]
> > > > [   71.218053]  drm_ioctl_kernel+0xac/0xf0 [drm]
> > > > [   71.222421]  drm_ioctl+0x211/0x3c0 [drm]
> > > > [   71.226379]  ? nouveau_abi16_ioctl_getparam+0x1f0/0x1f0 [nouveau]
> > > > [   71.232500]  nouveau_drm_ioctl+0x57/0xb0 [nouveau]
> > > > [   71.237285]  ksys_ioctl+0x86/0xc0
> > > > [   71.240595]  __x64_sys_ioctl+0x16/0x20
> > > > [   71.244340]  do_syscall_64+0x4c/0x90
> > > > [   71.248110]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > [   71.253162] RIP: 0033:0x7fd925d4b88b
> > > > [   71.256731] Code: Bad RIP value.
> > > > [   71.259955] RSP: 002b:7ffc743592d8 EFLAGS: 0206 ORIG_RAX: 
> > > > 0010
> > > > [   71.267514] RAX: ffda RBX:  RCX: 
> > > > 7fd925d4b88b
> > > > [   71.274637] RDX: 00601080 RSI: c0586442 RDI: 
> > > > 0003
> > > > [   71.281986] RBP: 7ffc74359340 R08: 7fd926016ce0 R09: 
> > > > 7fd926016ce0
> > > > [   71.289111] R10: 0003 R11: 0206 R12: 
> > > > 00400620
> > > > [   71.296235] R13: 7ffc74359420 R14:  R15: 
> > > > 
> > > > [   71.303361] Modules linked in: rfkill sunrpc snd_hda_codec_realtek 
> > > > snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg 
> > > > snd_hda_codec snd_hda_core edac_mce_amd snd_hwdep kvm_amd snd_seq ccp 
> > > > snd_seq_device snd_pcm kvm snd_timer snd irqbypass soundcore sp5100_tco 
> > > > pcspkr crct10dif_pclmul crc32_pclmul ghash_clmulni_intel wmi_bmof 
> > > > joydev i2c_piix4 fam15h_power k10temp acpi_cpufreq ip_tables xfs 
> > > > libcrc32c sd_mod t10_pi sg nouveau video mxm_wmi i2c_algo_bit 
> > > > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm 
> > > > broadcom bcm_phy_l