Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers
On Thu, Jun 15, 2023 at 1:19 PM Christian König wrote:
>
> Am 13.06.23 um 16:18 schrieb Karol Herbst:
> > On Tue, Jun 13, 2023 at 3:59 PM Christian König wrote:
> >> Am 13.06.23 um 15:05 schrieb Karol Herbst:
> >>> On Mon, Dec 5, 2022 at 2:40 PM Christian König wrote:
> >>> [review discussion with Felix Kuehling snipped; see below in the
> >>> thread]
> >>>
> >>> This patch causes a heap use-after-free when using nouveau, with the
> >>> potential of trashing filesystems. Is there a way to revert it until
> >>> we figure out a proper solution to the problem?
> >> Uff, I don't think so, we have quite some work based on top of this.
> >> But let me double check.
> >>
> > yeah.. I already talked with Dave about fixing this issue as Dave has
> > more knowledge on this part of the driver (I hope), so we might have a
> > fix soonish. But the concerning part is that it's already out to
> > users, so it might be better to be able to revert it if the fix takes
> > a while to emerge.
>
> We at least can't revert easily. This was fixing issues where we have
> seen OOM situations because TTM wasn't releasing resources fast enough.
>
> >> On the other hand, have you tried running this with KASAN to catch
> >> use-after-free errors?
> > yes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213#note_1942777
> >
> >> Since we now block for work to finish instead of checking every few
> >> milliseconds to garbage collect, memory is now reclaimed much faster
> >> after freeing it.
> > yeah, that kinda makes sense. This entire issue feels like a race,
> > as I need to run the OpenGL CTS in parallel with 8+ threads to
> > trigger it reliably.
>
> Yeah, from the bug report that's a classic use-after-free because of a
> race.
>
> Easiest would be to make sure everybody has a reference to the fence.

turns out nouveau was using `dma_fence_is_signaled_locked` without taking
the lock. Fixing that seems to improve the situation by a lot, so I think
we have a fix to resolve this problem.

> Regards,
> Christian.

> > >>> Bug: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213
> > >>>
> > >>> [example trace snipped; see the original report below]
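For readers following the fix: the distinction Karol points at is the
locking contract of the dma-fence API. Below is a minimal sketch of the
two variants; the wrapper functions are illustrative only and are not
taken from the nouveau patch.

#include <linux/dma-fence.h>

/* Illustrative wrappers, not the actual nouveau code. */

static bool fence_signaled_unlocked(struct dma_fence *fence)
{
        /* Safe to call without holding fence->lock. */
        return dma_fence_is_signaled(fence);
}

static bool fence_signaled_locked(struct dma_fence *fence)
{
        unsigned long flags;
        bool signaled;

        /* dma_fence_is_signaled_locked() requires fence->lock to be
         * held by the caller; calling it without the lock races with
         * signaling and teardown, which is the bug described above. */
        spin_lock_irqsave(fence->lock, flags);
        signaled = dma_fence_is_signaled_locked(fence);
        spin_unlock_irqrestore(fence->lock, flags);
        return signaled;
}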
Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers
Am 13.06.23 um 16:18 schrieb Karol Herbst:
> On Tue, Jun 13, 2023 at 3:59 PM Christian König wrote:
> [earlier review discussion snipped]
> >> This patch causes a heap use-after-free when using nouveau, with the
> >> potential of trashing filesystems. Is there a way to revert it until
> >> we figure out a proper solution to the problem?
> > Uff, I don't think so, we have quite some work based on top of this.
> > But let me double check.
>
> yeah.. I already talked with Dave about fixing this issue as Dave has
> more knowledge on this part of the driver (I hope), so we might have a
> fix soonish. But the concerning part is that it's already out to users,
> so it might be better to be able to revert it if the fix takes a while
> to emerge.

We at least can't revert easily. This was fixing issues where we have
seen OOM situations because TTM wasn't releasing resources fast enough.

> > On the other hand, have you tried running this with KASAN to catch
> > use-after-free errors?
> yes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213#note_1942777
>
> > Since we now block for work to finish instead of checking every few
> > milliseconds to garbage collect, memory is now reclaimed much faster
> > after freeing it.
> yeah, that kinda makes sense. This entire issue feels like a race,
> as I need to run the OpenGL CTS in parallel with 8+ threads to trigger
> it reliably.

Yeah, from the bug report that's a classic use-after-free because of a
race.

Easiest would be to make sure everybody has a reference to the fence.

Regards,
Christian.

> Bug: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213
>
> [example trace snipped; see the original report below]
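Christian's "make sure everybody has a reference" maps onto the standard
dma-fence reference pattern. A hedged sketch follows, using a
hypothetical container struct rather than the actual nouveau structures:

#include <linux/dma-fence.h>
#include <linux/rcupdate.h>

/* 'struct obj' and its 'fence' field are hypothetical stand-ins. */
struct obj {
        struct dma_fence __rcu *fence;
};

static bool obj_fence_signaled(struct obj *obj)
{
        struct dma_fence *fence;
        bool signaled = true;

        /* Take a reference under RCU so a concurrent release cannot
         * free the fence while we look at it. */
        rcu_read_lock();
        fence = dma_fence_get_rcu_safe(&obj->fence);
        rcu_read_unlock();

        if (fence) {
                /* We own a reference; no use-after-free is possible. */
                signaled = dma_fence_is_signaled(fence);
                dma_fence_put(fence);
        }
        return signaled;
}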
Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers
On Tue, Jun 13, 2023 at 3:59 PM Christian König wrote:
>
> Am 13.06.23 um 15:05 schrieb Karol Herbst:
> > On Mon, Dec 5, 2022 at 2:40 PM Christian König wrote:
> > [review discussion with Felix Kuehling snipped]
> >
> > This patch causes a heap use-after-free when using nouveau, with the
> > potential of trashing filesystems. Is there a way to revert it until
> > we figure out a proper solution to the problem?
>
> Uff, I don't think so, we have quite some work based on top of this.
> But let me double check.

yeah.. I already talked with Dave about fixing this issue as Dave has
more knowledge on this part of the driver (I hope), so we might have a
fix soonish. But the concerning part is that it's already out to users,
so it might be better to be able to revert it if the fix takes a while
to emerge.

> On the other hand, have you tried running this with KASAN to catch
> use-after-free errors?

yes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213#note_1942777

> Since we now block for work to finish instead of checking every few
> milliseconds to garbage collect, memory is now reclaimed much faster
> after freeing it.

yeah, that kinda makes sense. This entire issue feels like a race,
as I need to run the OpenGL CTS in parallel with 8+ threads to trigger
it reliably.

> Regards,
> Christian.

> > Bug: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213
> >
> > [example trace snipped; see the original report below]
Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers
Am 13.06.23 um 15:05 schrieb Karol Herbst:
> On Mon, Dec 5, 2022 at 2:40 PM Christian König wrote:
> [review discussion with Felix Kuehling snipped]
>
> This patch causes a heap use-after-free when using nouveau, with the
> potential of trashing filesystems. Is there a way to revert it until we
> figure out a proper solution to the problem?

Uff, I don't think so, we have quite some work based on top of this.
But let me double check.

On the other hand, have you tried running this with KASAN to catch
use-after-free errors?

Since we now block for work to finish instead of checking every few
milliseconds to garbage collect, memory is now reclaimed much faster
after freeing it.

Regards,
Christian.

> Bug: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213
>
> [example trace snipped; see the original report below]
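The "check every few milliseconds" Christian mentions is the delayed-work
polling this series removes. Roughly, the pre-patch mechanism looked like
the sketch below; this is reconstructed from the old ttm_device.c, so
treat the details as approximate:

/* Pre-patch scheme (approximate): a single delayed work item per device
 * walks the ddestroy list and, if anything is left over, re-arms itself
 * every ~10 ms, so freed memory could linger for several ticks. */
static void ttm_device_delayed_workqueue(struct work_struct *work)
{
        struct ttm_device *bdev =
                container_of(work, struct ttm_device, wq.work);

        if (!ttm_bo_delayed_delete(bdev, false))
                schedule_delayed_work(&bdev->wq,
                                      ((HZ / 100) < 1) ? 1 : HZ / 100);
}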
Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers
On Mon, Dec 5, 2022 at 2:40 PM Christian König wrote:
>
> Am 29.11.22 um 22:14 schrieb Felix Kuehling:
> [review discussion snipped; see the full exchange below in the thread]
>
> Thanks,
> Christian.

This patch causes a heap use-after-free when using nouveau, with the
potential of trashing filesystems. Is there a way to revert it until we
figure out a proper solution to the problem?

Bug: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213

example trace on affected systems:

[ 4102.946946] general protection fault, probably for non-canonical address 0x5f775ce3bd949b45: [#3] PREEMPT SMP NOPTI
[ 4102.957794] CPU: 12 PID: 89561 Comm: glcts Tainted: G D 6.3.5-200.fc38.x86_64 #1
[ 4102.966556] Hardware name: ASUS System Product Name/PRIME B660-PLUS D4, BIOS 0418 10/13/2021
[ 4102.974972] RIP: 0010:__kmem_cache_alloc_node+0x1ba/0x320
[ 4102.980362] Code: 2b 14 25 28 00 00 00 0f 85 74 01 00 00 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41 8b 47 28 4d 8b 07 48 01 f8 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 48 0f c9 48 31 cb 41 f6 c0
[ 4102.999073] RSP: 0018:9764e0057b40 EFLAGS: 00010202
[ 4103.004291] RAX: 5f775ce3bd949b45 RBX: 0dc0 RCX: 0046
[ 4103.011408] RDX: 0002cf87600c RSI: 0dc0 RDI: 5f775ce3bd949b15
[ 4103.018528] RBP: 0dc0 R08: 000390c0 R09: 30302d6d
[ 4103.025649] R10: 756c7473 R11: 20090298 R12:
[ 4103.032767] R13: R14: 0046 R15: 8bda80042600
[ 4103.039887] FS: 7f386a85ef00() GS:8be1df70() knlGS:
[ 4103.047958] CS: 0010 DS: ES: CR0: 80050033
[ 4103.053692] CR2: 0493b868 CR3: 00014c3ba000 CR4: 00f50ee0
[ 4103.060812] PKRU: 5554
[ 4103.063520] Call Trace:
[ 4103.065970] <TASK>
[ 4103.068071] ? die_addr+0x36/0x90
[ 4103.071384] ? exc_general_protection+0x1be/0x420
[ 4103.076081] ? asm_exc_general_protection+0x26/0x30
[ 4103.080952] ? __kmem_cache_alloc_node+0x1ba/0x320
[ 4103.085734] ? ext4_htree_store_dirent+0x42/0x180
[ 4103.090431] ? ext4_htree_store_dirent+0x42/0x180
[ 4103.095132] __kmalloc+0x4d/0x150
[ 4103.098444] ext4_htree_store_dirent+0x42/0x180
[ 4103.102970] htree_dirblock_to_tree+0x1ed/0x370
[ 4103.107494] ext4_htree_fill_tree+0x109/0x3d0
[ 4103.111846] ext4_readdir+0x6d4/0xa80
[ 4103.115505] iterate_dir+0x178/0x1c0
[ 4103.119076] __x64_sys_getdents64+0x88/0x130
[ 4103.123341] ? __pfx_filldir64+0x10/0x10
[ 4103.127260] do_syscall_64+0x5d/0x90
[ 4103.130835] ? handle_mm_fault+0x11e/0x310
[ 4103.134927] ? do_user_addr_fault+0x1e0/0x720
[ 4103.139278] ? exc_page_fault+0x7c/0x180
[ 4103.143195] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 4103.148240] RIP: 0033:0x7f386a418047
[ 4103.151828] Code: 24 fb ff 4c 89 e0 5b 41 5c 5d c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 ff ff ff 7f 48 39 c2 48 0f 47 d0 b8 d9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 91 cd 0f 00 f7 d8 64 89 02 48
[ 4103.170543] RSP: 002b:7ffd4793ff38 EFLAGS: 0293 ORIG_RAX: 00d9
[ 4103.178095] RAX: ffda RBX: 04933830 RCX: 7f386a418047
[ 4103.185214] RDX: 8000 RSI: 04933860 RDI: 0006
[ 4103.192335] RBP: 7ffd4793ff70 R08: R09: 0001
[ 4103.199454] R10: 0004 R11: 0293 R12: 04933834
[ 4103.206573] R13: 04933860 R14: ff60 R15:
[ 4103.213695] </TASK>
[ 4103.215883] Modul
Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers
Am 29.11.22 um 22:14 schrieb Felix Kuehling:
> On 2022-11-25 05:21, Christian König wrote:
> > Instead of a single worker going over the list of delete BOs in
> > regular intervals use a per BO worker which blocks for the resv
> > object and locking of the BO.
> >
> > This not only simplifies the handling massively, but also results in
> > much better response time when cleaning up buffers.
> >
> > Signed-off-by: Christian König
>
> Just thinking out loud: If I understand it correctly, this can cause a
> lot of sleeping worker threads when
> AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE is used and many BOs are freed
> at the same time. This happens e.g. when a KFD process terminates or
> crashes. I guess with a concurrency-managed workqueue this isn't going
> to be excessive. And since it's on a per-device workqueue, it doesn't
> stall work items on the system work queue or from other devices.

Yes, exactly that. The last parameter to alloc_workqueue() limits how
many work items can be sleeping.

> I'm trying to understand why you set WQ_MEM_RECLAIM. This work queue
> is not about freeing ttm_resources but about freeing the BOs. But it
> affects freeing of ghost_objs that are holding the ttm_resources being
> freed.

Well, if the BO is idle but not immediately lockable, we delegate
freeing the backing pages in the TT object to those workers as well. It
might even be a good idea to use a separate wq for this case.

> If those assumptions all make sense, patches 1-3 are
>
> Reviewed-by: Felix Kuehling

Thanks,
Christian.

> [quoted patch snipped; the full diff is in the original posting below]
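For reference, the workqueue knobs being discussed: alloc_workqueue()'s
third argument (max_active) caps how many work items can execute, and
therefore sleep, concurrently, while WQ_MEM_RECLAIM reserves a rescuer
thread so the queue keeps making progress under memory pressure. A
minimal sketch with an assumed limit of 16; the exact name, flags and
limit in the merged patch may differ:

#include <linux/workqueue.h>

static struct workqueue_struct *bo_cleanup_wq_create(void)
{
        /* max_active = 16: at most 16 in-flight (possibly sleeping)
         * cleanup items at once. WQ_MEM_RECLAIM: guaranteed forward
         * progress even when no new kworkers can be spawned. */
        return alloc_workqueue("ttm", WQ_MEM_RECLAIM, 16);
}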
Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers
On 2022-11-25 05:21, Christian König wrote:
> Instead of a single worker going over the list of delete BOs in regular
> intervals use a per BO worker which blocks for the resv object and
> locking of the BO.
>
> This not only simplifies the handling massively, but also results in
> much better response time when cleaning up buffers.
>
> Signed-off-by: Christian König

Just thinking out loud: If I understand it correctly, this can cause a
lot of sleeping worker threads when AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE
is used and many BOs are freed at the same time. This happens e.g. when a
KFD process terminates or crashes. I guess with a concurrency-managed
workqueue this isn't going to be excessive. And since it's on a per-device
workqueue, it doesn't stall work items on the system work queue or from
other devices.

I'm trying to understand why you set WQ_MEM_RECLAIM. This work queue is
not about freeing ttm_resources but about freeing the BOs. But it affects
freeing of ghost_objs that are holding the ttm_resources being freed.

If those assumptions all make sense, patches 1-3 are

Reviewed-by: Felix Kuehling

> [quoted patch snipped; the full diff is in the original posting below]
[PATCH 3/9] drm/ttm: use per BO cleanup workers
Instead of a single worker going over the list of delete BOs in regular
intervals use a per BO worker which blocks for the resv object and
locking of the BO.

This not only simplifies the handling massively, but also results in
much better response time when cleaning up buffers.

Signed-off-by: Christian König
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +-
 drivers/gpu/drm/i915/i915_gem.c            |   2 +-
 drivers/gpu/drm/i915/intel_region_ttm.c    |   2 +-
 drivers/gpu/drm/ttm/ttm_bo.c               | 112 -
 drivers/gpu/drm/ttm/ttm_bo_util.c          |   1 -
 drivers/gpu/drm/ttm/ttm_device.c           |  24 ++---
 include/drm/ttm/ttm_bo_api.h               |  18 +---
 include/drm/ttm/ttm_device.h               |   7 +-
 8 files changed, 57 insertions(+), 111 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2b1db37e25c1..74ccbd566777 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3984,7 +3984,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
        amdgpu_fence_driver_hw_fini(adev);

        if (adev->mman.initialized)
-               flush_delayed_work(&adev->mman.bdev.wq);
+               drain_workqueue(adev->mman.bdev.wq);

        if (adev->pm_sysfs_en)
                amdgpu_pm_sysfs_fini(adev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8468ca9885fd..c38306f156d6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1099,7 +1099,7 @@ void i915_gem_drain_freed_objects(struct drm_i915_private *i915)
 {
        while (atomic_read(&i915->mm.free_count)) {
                flush_work(&i915->mm.free_work);
-               flush_delayed_work(&i915->bdev.wq);
+               drain_workqueue(i915->bdev.wq);
                rcu_barrier();
        }
 }
diff --git a/drivers/gpu/drm/i915/intel_region_ttm.c b/drivers/gpu/drm/i915/intel_region_ttm.c
index cf89d0c2a2d9..657bbc16a48a 100644
--- a/drivers/gpu/drm/i915/intel_region_ttm.c
+++ b/drivers/gpu/drm/i915/intel_region_ttm.c
@@ -132,7 +132,7 @@ int intel_region_ttm_fini(struct intel_memory_region *mem)
                        break;

                msleep(20);
-               flush_delayed_work(&mem->i915->bdev.wq);
+               drain_workqueue(mem->i915->bdev.wq);
        }

        /* If we leaked objects, Don't free the region causing use after free */
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index b77262a623e0..4749b65bedc4 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -280,14 +280,13 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo,
                ret = 0;
        }

-       if (ret || unlikely(list_empty(&bo->ddestroy))) {
+       if (ret) {
                if (unlock_resv)
                        dma_resv_unlock(bo->base.resv);
                spin_unlock(&bo->bdev->lru_lock);
                return ret;
        }

-       list_del_init(&bo->ddestroy);
        spin_unlock(&bo->bdev->lru_lock);
        ttm_bo_cleanup_memtype_use(bo);

@@ -300,47 +299,21 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo,
 }

 /*
- * Traverse the delayed list, and call ttm_bo_cleanup_refs on all
- * encountered buffers.
+ * Block for the dma_resv object to become idle, lock the buffer and clean up
+ * the resource and tt object.
  */
-bool ttm_bo_delayed_delete(struct ttm_device *bdev, bool remove_all)
+static void ttm_bo_delayed_delete(struct work_struct *work)
 {
-       struct list_head removed;
-       bool empty;
-
-       INIT_LIST_HEAD(&removed);
-
-       spin_lock(&bdev->lru_lock);
-       while (!list_empty(&bdev->ddestroy)) {
-               struct ttm_buffer_object *bo;
-
-               bo = list_first_entry(&bdev->ddestroy, struct ttm_buffer_object,
-                                     ddestroy);
-               list_move_tail(&bo->ddestroy, &removed);
-               if (!ttm_bo_get_unless_zero(bo))
-                       continue;
-
-               if (remove_all || bo->base.resv != &bo->base._resv) {
-                       spin_unlock(&bdev->lru_lock);
-                       dma_resv_lock(bo->base.resv, NULL);
-
-                       spin_lock(&bdev->lru_lock);
-                       ttm_bo_cleanup_refs(bo, false, !remove_all, true);
-
-               } else if (dma_resv_trylock(bo->base.resv)) {
-                       ttm_bo_cleanup_refs(bo, false, !remove_all, true);
-               } else {
-                       spin_unlock(&bdev->lru_lock);
-               }
+       struct ttm_buffer_object *bo;

-               ttm_bo_put(bo);
-               spin_lock(&bdev->lru_lock);
-       }
-       list_splice_tail(&removed, &bdev->ddestroy);
-       empty = list_empty(&bdev->ddestroy);
-       spin_unlock(&bdev->lru_lock);
+       bo = container_of(work, typeof(*bo), delayed_delete);
-
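The archived copy of the patch cuts off inside the new
ttm_bo_delayed_delete() above. As a rough sketch of where the pattern is
heading, inferred from the visible prototype and the cover text rather
than copied from the merged commit: the work item blocks on the
reservation object, then locks, cleans up and drops the final reference,
and the release path queues one such item per BO instead of appending the
BO to a shared ddestroy list.

/* Sketch only; names like delayed_delete and bdev->wq follow the hunks
 * visible above, the exact merged code may differ. */
static void ttm_bo_delayed_delete(struct work_struct *work)
{
        struct ttm_buffer_object *bo =
                container_of(work, typeof(*bo), delayed_delete);

        /* Block until every fence on the reservation object signals. */
        dma_resv_wait_timeout(bo->base.resv, DMA_RESV_USAGE_BOOKKEEP,
                              false, MAX_SCHEDULE_TIMEOUT);

        /* The BO is idle now: lock it, release the backing store, then
         * drop the reference held by the work item. */
        dma_resv_lock(bo->base.resv, NULL);
        ttm_bo_cleanup_memtype_use(bo);
        dma_resv_unlock(bo->base.resv);
        ttm_bo_put(bo);
}

/* Release side (schematically, in the BO release path): */
        INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
        queue_work(bo->bdev->wq, &bo->delayed_delete);

This also explains the flush_delayed_work() to drain_workqueue() switch
in the driver hunks above: with many independent work items instead of
one self-re-arming delayed item, device teardown has to wait until the
queue is completely empty rather than flushing a single work instance.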