Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers

2023-06-15 Thread Karol Herbst
On Thu, Jun 15, 2023 at 1:19 PM Christian König
 wrote:
>
> Am 13.06.23 um 16:18 schrieb Karol Herbst:
> > On Tue, Jun 13, 2023 at 3:59 PM Christian König
> >  wrote:
> >> Am 13.06.23 um 15:05 schrieb Karol Herbst:
> >>> On Mon, Dec 5, 2022 at 2:40 PM Christian König  
> >>> wrote:
>  Am 29.11.22 um 22:14 schrieb Felix Kuehling:
> > On 2022-11-25 05:21, Christian König wrote:
> >> Instead of a single worker going over the list of delete BOs in regular
> >> intervals use a per BO worker which blocks for the resv object and
> >> locking of the BO.
> >>
> >> This not only simplifies the handling massively, but also results in
> >> much better response time when cleaning up buffers.
> >>
> >> Signed-off-by: Christian König 
> > Just thinking out loud: If I understand it correctly, this can cause a
> > lot of sleeping worker threads when
> > AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE is used and many BOs are freed
> > at the same time. This happens e.g. when a KFD process terminates or
> > crashes. I guess with a concurrency-managed workqueue this isn't going
> > to be excessive. And since it's on a per device workqueue, it doesn't
> > stall work items on the system work queue or from other devices.
>  Yes, exactly that. The last parameter to alloc_workqueue() limits how
>  many work items can be sleeping.
> 
> > I'm trying to understand why you set WQ_MEM_RECLAIM. This work queue
> > is not about freeing ttm_resources but about freeing the BOs. But it
> > affects freeing of ghost_objs that are holding the ttm_resources being
> > freed.
>  Well if the BO is idle, but not immediately lockable we delegate freeing
>  the backing pages in the TT object to those workers as well. It might
>  even be a good idea to use a separate wq for this case.
> 
> > If those assumptions all make sense, patches 1-3 are
> >
> > Reviewed-by: Felix Kuehling 
>  Thanks,
>  Christian.
> 
> >>> This patch causes a heap use-after-free when using nouveau with the
> >>> potential of trashing filesystems, is there a way to revert it until
> >>> we figure out a proper solution to the problem?
> >> Uff I don't think so, we have quite some work based on top of this. But
> >> let me double check.
> >>
> > yeah.. I already talked with Dave about fixing this issue as Dave has
> > more knowledge on this part of the driver (I hope), so we might have a
> > fix soonish, but the concerning part is, that it's already out to
> > users, so might be better to be able to revert it if the fix takes a
> > while to emerge.
>
> We at least can't revert easily. This was fixing issues where we have
> seen OOM situations because TTM wasn't releasing resources fast enough.
>
> >> On the other hand have you tried running this with KASAN to catch use
> >> after free errors?
> > yes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213#note_1942777
> >
> >> Since we now block for work to finish and not check every few
> >> milliseconds to garbage collect, memory will now be reclaimed much faster
> >> after freeing it.
> > yeah, that kinda makes sense. This entire issue feels like a race
> > happening as I need to run the OpenGL CTS in parallel with 8+ threads
> > to trigger it reliably.
>
> Yeah, from the bug report that's a classic use after free because of a race.
>
> Easiest would be to make sure everybody has a reference to the fence.
>

Turns out nouveau was calling `dma_fence_is_signaled_locked` without
holding the fence's lock. Fixing that seems to improve the situation by a
lot, so I think we have a fix for this problem.
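
For context, a minimal sketch of the API contract involved (illustrative
code, not the actual nouveau call site): `dma_fence_is_signaled_locked()`
may only be called with `fence->lock` already held, while
`dma_fence_is_signaled()` performs the check without that requirement.

```c
#include <linux/dma-fence.h>
#include <linux/spinlock.h>

/* Sketch only: two ways to check whether a fence has signaled.  The
 * _locked variant relies on the caller holding fence->lock; calling it
 * without the lock races with the signaling path. */
static bool check_signaled_with_lock(struct dma_fence *fence)
{
	unsigned long flags;
	bool signaled;

	spin_lock_irqsave(fence->lock, flags);	/* required by the _locked helper */
	signaled = dma_fence_is_signaled_locked(fence);
	spin_unlock_irqrestore(fence->lock, flags);

	return signaled;
}

static bool check_signaled(struct dma_fence *fence)
{
	/* The unlocked helper handles the synchronization itself. */
	return dma_fence_is_signaled(fence);
}
```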

> Regards,
> Christian.
>
> >
> >> Regards,
> >> Christian.
> >>
> >>> Bug: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213
> >>>
> >>> example trace on affected systems:
> >>>
> >>> [ 4102.946946] general protection fault, probably for non-canonical
> >>> address 0x5f775ce3bd949b45:  [#3] PREEMPT SMP NOPTI
> >>> [ 4102.957794] CPU: 12 PID: 89561 Comm: glcts Tainted: G  D
> >>>   6.3.5-200.fc38.x86_64 #1
> >>> [ 4102.966556] Hardware name: ASUS System Product Name/PRIME B660-PLUS
> >>> D4, BIOS 0418 10/13/2021
> >>> [ 4102.974972] RIP: 0010:__kmem_cache_alloc_node+0x1ba/0x320
> >>> [ 4102.980362] Code: 2b 14 25 28 00 00 00 0f 85 74 01 00 00 48 83 c4
> >>> 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41 8b 47 28 4d 8b 07
> >>> 48 01 f8 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 48 0f c9 48 31 cb 41
> >>> f6 c0
> >>> [ 4102.999073] RSP: 0018:9764e0057b40 EFLAGS: 00010202
> >>> [ 4103.004291] RAX: 5f775ce3bd949b45 RBX: 0dc0 RCX: 
> >>> 0046
> >>> [ 4103.011408] RDX: 0002cf87600c RSI: 0dc0 RDI: 
> >>> 5f775ce3bd949b15
> >>> [ 4103.018528] RBP: 0dc0 R08: 000390c0 R09: 
> >>> 30302d6d
> >>> [ 4103.025649] R10: 756c7473 R11: 20090298 R12: 
> >>> 
> >>> [ 4103.032767] R13:  R14: 

Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers

2023-06-15 Thread Christian König

Am 13.06.23 um 16:18 schrieb Karol Herbst:

On Tue, Jun 13, 2023 at 3:59 PM Christian König
 wrote:

Am 13.06.23 um 15:05 schrieb Karol Herbst:

On Mon, Dec 5, 2022 at 2:40 PM Christian König  wrote:

Am 29.11.22 um 22:14 schrieb Felix Kuehling:

On 2022-11-25 05:21, Christian König wrote:

Instead of a single worker going over the list of delete BOs in regular
intervals use a per BO worker which blocks for the resv object and
locking of the BO.

This not only simplifies the handling massively, but also results in
much better response time when cleaning up buffers.

Signed-off-by: Christian König 

Just thinking out loud: If I understand it correctly, this can cause a
lot of sleeping worker threads when
AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE is used and many BOs are freed
at the same time. This happens e.g. when a KFD process terminates or
crashes. I guess with a concurrency-managed workqueue this isn't going
to be excessive. And since it's on a per device workqueue, it doesn't
stall work items on the system work queue or from other devices.

Yes, exactly that. The last parameter to alloc_workqueue() limits how
many work items can be sleeping.


I'm trying to understand why you set WQ_MEM_RECLAIM. This work queue
is not about freeing ttm_resources but about freeing the BOs. But it
affects freeing of ghost_objs that are holding the ttm_resources being
freed.

Well if the BO is idle, but not immediately lockable we delegate freeing
the backing pages in the TT object to those workers as well. It might
even be a good idea to use a separate wq for this case.


If those assumptions all make sense, patches 1-3 are

Reviewed-by: Felix Kuehling 

Thanks,
Christian.


This patch causes a heap use-after-free when using nouveau with the
potential of trashing filesystems, is there a way to revert it until
we figure out a proper solution to the problem?

Uff I don't think so, we have quite some work based on top of this. But
let me double check.


yeah.. I already talked with Dave about fixing this issue as Dave has
more knowledge on this part of the driver (I hope), so we might have a
fix soonish, but the concerning part is, that it's already out to
users, so might be better to be able to revert it if the fix takes a
while to emerge.


We at least can't revert easily. This was fixing issues where we have 
seen OOM situations because TTM wasn't releasing resources fast enough.



On the other hand have you tried running this with KASAN to catch use
after free errors?

yes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213#note_1942777


Since we now block for work to finish and not check every few
milliseconds to garbage collect, memory will now be reclaimed much faster
after freeing it.

yeah, that kinda makes sense. This entire issue feels like a race
happening as I need to run the OpenGL CTS in parallel with 8+ threads
to trigger it reliably.


Yeah, from the bug report that's a classic use after free because of a race.

Easiest would be to make sure everybody has a reference to the fence.
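
As a rough sketch of that idea (illustrative structure and names, not the
actual TTM or nouveau code): take the reference while the fence pointer is
still protected, then wait and drop it.

```c
#include <linux/dma-fence.h>
#include <linux/spinlock.h>

/* Illustrative container; not a real TTM or nouveau structure. */
struct my_obj {
	spinlock_t lock;
	struct dma_fence *fence;	/* last fence published for this object */
};

/* Sketch only: grab a reference while the pointer is still protected so a
 * concurrent signal/free cannot leave us with a dangling fence, then wait
 * and drop the reference again. */
static long wait_on_published_fence(struct my_obj *obj)
{
	struct dma_fence *fence;
	long ret = 0;

	spin_lock(&obj->lock);
	fence = dma_fence_get(obj->fence);	/* NULL-safe, takes a reference */
	spin_unlock(&obj->lock);

	if (fence) {
		ret = dma_fence_wait(fence, false);	/* non-interruptible */
		dma_fence_put(fence);
	}

	return ret;
}
```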

Regards,
Christian.




Regards,
Christian.


Bug: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213

example trace on affected systems:

[ 4102.946946] general protection fault, probably for non-canonical
address 0x5f775ce3bd949b45:  [#3] PREEMPT SMP NOPTI
[ 4102.957794] CPU: 12 PID: 89561 Comm: glcts Tainted: G  D
  6.3.5-200.fc38.x86_64 #1
[ 4102.966556] Hardware name: ASUS System Product Name/PRIME B660-PLUS
D4, BIOS 0418 10/13/2021
[ 4102.974972] RIP: 0010:__kmem_cache_alloc_node+0x1ba/0x320
[ 4102.980362] Code: 2b 14 25 28 00 00 00 0f 85 74 01 00 00 48 83 c4
18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41 8b 47 28 4d 8b 07
48 01 f8 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 48 0f c9 48 31 cb 41
f6 c0
[ 4102.999073] RSP: 0018:9764e0057b40 EFLAGS: 00010202
[ 4103.004291] RAX: 5f775ce3bd949b45 RBX: 0dc0 RCX: 0046
[ 4103.011408] RDX: 0002cf87600c RSI: 0dc0 RDI: 5f775ce3bd949b15
[ 4103.018528] RBP: 0dc0 R08: 000390c0 R09: 30302d6d
[ 4103.025649] R10: 756c7473 R11: 20090298 R12: 
[ 4103.032767] R13:  R14: 0046 R15: 8bda80042600
[ 4103.039887] FS:  7f386a85ef00() GS:8be1df70()
knlGS:
[ 4103.047958] CS:  0010 DS:  ES:  CR0: 80050033
[ 4103.053692] CR2: 0493b868 CR3: 00014c3ba000 CR4: 00f50ee0
[ 4103.060812] PKRU: 5554
[ 4103.063520] Call Trace:
[ 4103.065970]  
[ 4103.068071]  ? die_addr+0x36/0x90
[ 4103.071384]  ? exc_general_protection+0x1be/0x420
[ 4103.076081]  ? asm_exc_general_protection+0x26/0x30
[ 4103.080952]  ? __kmem_cache_alloc_node+0x1ba/0x320
[ 4103.085734]  ? ext4_htree_store_dirent+0x42/0x180
[ 4103.090431]  ? ext4_htree_store_dirent+0x42/0x180
[ 4103.095132]  __kmalloc+0x4d/0x150
[ 4103.098444]  ext4_htree_store_dirent+0x42/0x180
[ 4103.102970]  htree_dirblock_to_tree+0x1ed/0x370
[ 4

Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers

2023-06-13 Thread Karol Herbst
On Tue, Jun 13, 2023 at 3:59 PM Christian König
 wrote:
>
> Am 13.06.23 um 15:05 schrieb Karol Herbst:
> > On Mon, Dec 5, 2022 at 2:40 PM Christian König  
> > wrote:
> >> Am 29.11.22 um 22:14 schrieb Felix Kuehling:
> >>> On 2022-11-25 05:21, Christian König wrote:
>  Instead of a single worker going over the list of delete BOs in regular
>  intervals use a per BO worker which blocks for the resv object and
>  locking of the BO.
> 
>  This not only simplifies the handling massively, but also results in
>  much better response time when cleaning up buffers.
> 
>  Signed-off-by: Christian König 
> >>> Just thinking out loud: If I understand it correctly, this can cause a
> >>> lot of sleeping worker threads when
> >>> AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE is used and many BOs are freed
> >>> at the same time. This happens e.g. when a KFD process terminates or
> >>> crashes. I guess with a concurrency-managed workqueue this isn't going
> >>> to be excessive. And since it's on a per device workqueue, it doesn't
> >>> stall work items on the system work queue or from other devices.
> >> Yes, exactly that. The last parameter to alloc_workqueue() limits how
> >> many work items can be sleeping.
> >>
> >>> I'm trying to understand why you set WQ_MEM_RECLAIM. This work queue
> >>> is not about freeing ttm_resources but about freeing the BOs. But it
> >>> affects freeing of ghost_objs that are holding the ttm_resources being
> >>> freed.
> >> Well if the BO is idle, but not immediately lockable we delegate freeing
> >> the backing pages in the TT object to those workers as well. It might
> >> even be a good idea to use a separate wq for this case.
> >>
> >>> If those assumptions all make sense, patches 1-3 are
> >>>
> >>> Reviewed-by: Felix Kuehling 
> >> Thanks,
> >> Christian.
> >>
> > This patch causes a heap use-after-free when using nouveau with the
> > potential of trashing filesystems, is there a way to revert it until
> > we figure out a proper solution to the problem?
>
> Uff I don't think so, we have quite some work based on top of this. But
> let me double check.
>

yeah.. I already talked with Dave about fixing this issue as Dave has
more knowledge on this part of the driver (I hope), so we might have a
fix soonish, but the concerning part is, that it's already out to
users, so might be better to be able to revert it if the fix takes a
while to emerge.

> On the other hand have you tried running this with KASAN to catch use
> after free errors?

yes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213#note_1942777

>
> Since we now block for work to finish and not check every few
> milliseconds to garbage collect, memory will now be reclaimed much faster
> after freeing it.

yeah, that kinda makes sense. This entire issue feels like a race
happening as I need to run the OpenGL CTS in parallel with 8+ threads
to trigger it reliably.

>
> Regards,
> Christian.
>
> >
> > Bug: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213
> >
> > example trace on affected systems:
> >
> > [ 4102.946946] general protection fault, probably for non-canonical
> > address 0x5f775ce3bd949b45:  [#3] PREEMPT SMP NOPTI
> > [ 4102.957794] CPU: 12 PID: 89561 Comm: glcts Tainted: G  D
> >  6.3.5-200.fc38.x86_64 #1
> > [ 4102.966556] Hardware name: ASUS System Product Name/PRIME B660-PLUS
> > D4, BIOS 0418 10/13/2021
> > [ 4102.974972] RIP: 0010:__kmem_cache_alloc_node+0x1ba/0x320
> > [ 4102.980362] Code: 2b 14 25 28 00 00 00 0f 85 74 01 00 00 48 83 c4
> > 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41 8b 47 28 4d 8b 07
> > 48 01 f8 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 48 0f c9 48 31 cb 41
> > f6 c0
> > [ 4102.999073] RSP: 0018:9764e0057b40 EFLAGS: 00010202
> > [ 4103.004291] RAX: 5f775ce3bd949b45 RBX: 0dc0 RCX: 
> > 0046
> > [ 4103.011408] RDX: 0002cf87600c RSI: 0dc0 RDI: 
> > 5f775ce3bd949b15
> > [ 4103.018528] RBP: 0dc0 R08: 000390c0 R09: 
> > 30302d6d
> > [ 4103.025649] R10: 756c7473 R11: 20090298 R12: 
> > 
> > [ 4103.032767] R13:  R14: 0046 R15: 
> > 8bda80042600
> > [ 4103.039887] FS:  7f386a85ef00() GS:8be1df70()
> > knlGS:
> > [ 4103.047958] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 4103.053692] CR2: 0493b868 CR3: 00014c3ba000 CR4: 
> > 00f50ee0
> > [ 4103.060812] PKRU: 5554
> > [ 4103.063520] Call Trace:
> > [ 4103.065970]  
> > [ 4103.068071]  ? die_addr+0x36/0x90
> > [ 4103.071384]  ? exc_general_protection+0x1be/0x420
> > [ 4103.076081]  ? asm_exc_general_protection+0x26/0x30
> > [ 4103.080952]  ? __kmem_cache_alloc_node+0x1ba/0x320
> > [ 4103.085734]  ? ext4_htree_store_dirent+0x42/0x180
> > [ 4103.090431]  ? ext4_htree_store_dirent+0x42/0x180
> > [ 4103.095132]  __kmalloc+0x4d/0x150
> > [ 4103.098444]  ext4_htree_store_dirent+0x42/0x180
> > [ 4103.102

Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers

2023-06-13 Thread Christian König

Am 13.06.23 um 15:05 schrieb Karol Herbst:

On Mon, Dec 5, 2022 at 2:40 PM Christian König  wrote:

Am 29.11.22 um 22:14 schrieb Felix Kuehling:

On 2022-11-25 05:21, Christian König wrote:

Instead of a single worker going over the list of delete BOs in regular
intervals use a per BO worker which blocks for the resv object and
locking of the BO.

This not only simplifies the handling massively, but also results in
much better response time when cleaning up buffers.

Signed-off-by: Christian König 

Just thinking out loud: If I understand it correctly, this can cause a
lot of sleeping worker threads when
AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE is used and many BOs are freed
at the same time. This happens e.g. when a KFD process terminates or
crashes. I guess with a concurrency-managed workqueue this isn't going
to be excessive. And since it's on a per device workqueue, it doesn't
stall work items on the system work queue or from other devices.

Yes, exactly that. The last parameter to alloc_workqueue() limits how
many work items can be sleeping.


I'm trying to understand why you set WQ_MEM_RECLAIM. This work queue
is not about freeing ttm_resources but about freeing the BOs. But it
affects freeing of ghost_objs that are holding the ttm_resources being
freed.

Well if the BO is idle, but not immediately lockable we delegate freeing
the backing pages in the TT object to those workers as well. It might
even be a good idea to use a separate wq for this case.


If those assumptions all make sense, patches 1-3 are

Reviewed-by: Felix Kuehling 

Thanks,
Christian.


This patch causes a heap use-after-free when using nouveau with the
potential of trashing filesystems, is there a way to revert it until
we figure out a proper solution to the problem?


Uff I don't think so, we have quite some work based on top of this. But 
let me double check.


On the other hand have you tried running this with KASAN to catch use 
after free errors?


Since we now block for work to finish and not check every few
milliseconds to garbage collect, memory will now be reclaimed much faster
after freeing it.


Regards,
Christian.



Bug: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213

example trace on affected systems:

[ 4102.946946] general protection fault, probably for non-canonical
address 0x5f775ce3bd949b45:  [#3] PREEMPT SMP NOPTI
[ 4102.957794] CPU: 12 PID: 89561 Comm: glcts Tainted: G  D
 6.3.5-200.fc38.x86_64 #1
[ 4102.966556] Hardware name: ASUS System Product Name/PRIME B660-PLUS
D4, BIOS 0418 10/13/2021
[ 4102.974972] RIP: 0010:__kmem_cache_alloc_node+0x1ba/0x320
[ 4102.980362] Code: 2b 14 25 28 00 00 00 0f 85 74 01 00 00 48 83 c4
18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41 8b 47 28 4d 8b 07
48 01 f8 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 48 0f c9 48 31 cb 41
f6 c0
[ 4102.999073] RSP: 0018:9764e0057b40 EFLAGS: 00010202
[ 4103.004291] RAX: 5f775ce3bd949b45 RBX: 0dc0 RCX: 0046
[ 4103.011408] RDX: 0002cf87600c RSI: 0dc0 RDI: 5f775ce3bd949b15
[ 4103.018528] RBP: 0dc0 R08: 000390c0 R09: 30302d6d
[ 4103.025649] R10: 756c7473 R11: 20090298 R12: 
[ 4103.032767] R13:  R14: 0046 R15: 8bda80042600
[ 4103.039887] FS:  7f386a85ef00() GS:8be1df70()
knlGS:
[ 4103.047958] CS:  0010 DS:  ES:  CR0: 80050033
[ 4103.053692] CR2: 0493b868 CR3: 00014c3ba000 CR4: 00f50ee0
[ 4103.060812] PKRU: 5554
[ 4103.063520] Call Trace:
[ 4103.065970]  
[ 4103.068071]  ? die_addr+0x36/0x90
[ 4103.071384]  ? exc_general_protection+0x1be/0x420
[ 4103.076081]  ? asm_exc_general_protection+0x26/0x30
[ 4103.080952]  ? __kmem_cache_alloc_node+0x1ba/0x320
[ 4103.085734]  ? ext4_htree_store_dirent+0x42/0x180
[ 4103.090431]  ? ext4_htree_store_dirent+0x42/0x180
[ 4103.095132]  __kmalloc+0x4d/0x150
[ 4103.098444]  ext4_htree_store_dirent+0x42/0x180
[ 4103.102970]  htree_dirblock_to_tree+0x1ed/0x370
[ 4103.107494]  ext4_htree_fill_tree+0x109/0x3d0
[ 4103.111846]  ext4_readdir+0x6d4/0xa80
[ 4103.115505]  iterate_dir+0x178/0x1c0
[ 4103.119076]  __x64_sys_getdents64+0x88/0x130
[ 4103.123341]  ? __pfx_filldir64+0x10/0x10
[ 4103.127260]  do_syscall_64+0x5d/0x90
[ 4103.130835]  ? handle_mm_fault+0x11e/0x310
[ 4103.134927]  ? do_user_addr_fault+0x1e0/0x720
[ 4103.139278]  ? exc_page_fault+0x7c/0x180
[ 4103.143195]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 4103.148240] RIP: 0033:0x7f386a418047
[ 4103.151828] Code: 24 fb ff 4c 89 e0 5b 41 5c 5d c3 0f 1f 84 00 00
00 00 00 f3 0f 1e fa b8 ff ff ff 7f 48 39 c2 48 0f 47 d0 b8 d9 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 91 cd 0f 00 f7 d8 64 89
02 48
[ 4103.170543] RSP: 002b:7ffd4793ff38 EFLAGS: 0293 ORIG_RAX:
00d9
[ 4103.178095] RAX: ffda RBX: 04933830 RCX: 7f386a418047
[ 4103.185214] RDX: 8000 RSI: 04933860 RDI: 0

Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers

2023-06-13 Thread Karol Herbst
On Mon, Dec 5, 2022 at 2:40 PM Christian König  wrote:
>
> Am 29.11.22 um 22:14 schrieb Felix Kuehling:
> > On 2022-11-25 05:21, Christian König wrote:
> >> Instead of a single worker going over the list of delete BOs in regular
> >> intervals use a per BO worker which blocks for the resv object and
> >> locking of the BO.
> >>
> >> This not only simplifies the handling massively, but also results in
> >> much better response time when cleaning up buffers.
> >>
> >> Signed-off-by: Christian König 
> >
> > Just thinking out loud: If I understand it correctly, this can cause a
> > lot of sleeping worker threads when
> > AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE is used and many BOs are freed
> > at the same time. This happens e.g. when a KFD process terminates or
> > crashes. I guess with a concurrency-managed workqueue this isn't going
> > to be excessive. And since it's on a per device workqueue, it doesn't
> > stall work items on the system work queue or from other devices.
>
> Yes, exactly that. The last parameter to alloc_workqueue() limits how
> many work items can be sleeping.
>
> > I'm trying to understand why you set WQ_MEM_RECLAIM. This work queue
> > is not about freeing ttm_resources but about freeing the BOs. But it
> > affects freeing of ghost_objs that are holding the ttm_resources being
> > freed.
>
> Well if the BO is idle, but not immediately lockable we delegate freeing
> the backing pages in the TT object to those workers as well. It might
> even be a good idea to use a separate wq for this case.
>
> >
> > If those assumptions all make sense, patches 1-3 are
> >
> > Reviewed-by: Felix Kuehling 
>
> Thanks,
> Christian.
>

This patch causes a heap use-after-free when using nouveau with the
potential of trashing filesystems, is there a way to revert it until
we figure out a proper solution to the problem?

Bug: https://gitlab.freedesktop.org/drm/nouveau/-/issues/213

example trace on affected systems:

[ 4102.946946] general protection fault, probably for non-canonical
address 0x5f775ce3bd949b45:  [#3] PREEMPT SMP NOPTI
[ 4102.957794] CPU: 12 PID: 89561 Comm: glcts Tainted: G  D
6.3.5-200.fc38.x86_64 #1
[ 4102.966556] Hardware name: ASUS System Product Name/PRIME B660-PLUS
D4, BIOS 0418 10/13/2021
[ 4102.974972] RIP: 0010:__kmem_cache_alloc_node+0x1ba/0x320
[ 4102.980362] Code: 2b 14 25 28 00 00 00 0f 85 74 01 00 00 48 83 c4
18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 41 8b 47 28 4d 8b 07
48 01 f8 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 48 0f c9 48 31 cb 41
f6 c0
[ 4102.999073] RSP: 0018:9764e0057b40 EFLAGS: 00010202
[ 4103.004291] RAX: 5f775ce3bd949b45 RBX: 0dc0 RCX: 0046
[ 4103.011408] RDX: 0002cf87600c RSI: 0dc0 RDI: 5f775ce3bd949b15
[ 4103.018528] RBP: 0dc0 R08: 000390c0 R09: 30302d6d
[ 4103.025649] R10: 756c7473 R11: 20090298 R12: 
[ 4103.032767] R13:  R14: 0046 R15: 8bda80042600
[ 4103.039887] FS:  7f386a85ef00() GS:8be1df70()
knlGS:
[ 4103.047958] CS:  0010 DS:  ES:  CR0: 80050033
[ 4103.053692] CR2: 0493b868 CR3: 00014c3ba000 CR4: 00f50ee0
[ 4103.060812] PKRU: 5554
[ 4103.063520] Call Trace:
[ 4103.065970]  
[ 4103.068071]  ? die_addr+0x36/0x90
[ 4103.071384]  ? exc_general_protection+0x1be/0x420
[ 4103.076081]  ? asm_exc_general_protection+0x26/0x30
[ 4103.080952]  ? __kmem_cache_alloc_node+0x1ba/0x320
[ 4103.085734]  ? ext4_htree_store_dirent+0x42/0x180
[ 4103.090431]  ? ext4_htree_store_dirent+0x42/0x180
[ 4103.095132]  __kmalloc+0x4d/0x150
[ 4103.098444]  ext4_htree_store_dirent+0x42/0x180
[ 4103.102970]  htree_dirblock_to_tree+0x1ed/0x370
[ 4103.107494]  ext4_htree_fill_tree+0x109/0x3d0
[ 4103.111846]  ext4_readdir+0x6d4/0xa80
[ 4103.115505]  iterate_dir+0x178/0x1c0
[ 4103.119076]  __x64_sys_getdents64+0x88/0x130
[ 4103.123341]  ? __pfx_filldir64+0x10/0x10
[ 4103.127260]  do_syscall_64+0x5d/0x90
[ 4103.130835]  ? handle_mm_fault+0x11e/0x310
[ 4103.134927]  ? do_user_addr_fault+0x1e0/0x720
[ 4103.139278]  ? exc_page_fault+0x7c/0x180
[ 4103.143195]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 4103.148240] RIP: 0033:0x7f386a418047
[ 4103.151828] Code: 24 fb ff 4c 89 e0 5b 41 5c 5d c3 0f 1f 84 00 00
00 00 00 f3 0f 1e fa b8 ff ff ff 7f 48 39 c2 48 0f 47 d0 b8 d9 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 91 cd 0f 00 f7 d8 64 89
02 48
[ 4103.170543] RSP: 002b:7ffd4793ff38 EFLAGS: 0293 ORIG_RAX:
00d9
[ 4103.178095] RAX: ffda RBX: 04933830 RCX: 7f386a418047
[ 4103.185214] RDX: 8000 RSI: 04933860 RDI: 0006
[ 4103.192335] RBP: 7ffd4793ff70 R08:  R09: 0001
[ 4103.199454] R10: 0004 R11: 0293 R12: 04933834
[ 4103.206573] R13: 04933860 R14: ff60 R15: 
[ 4103.213695]  
[ 4103.215883] Modul

Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers

2022-12-05 Thread Christian König

Am 29.11.22 um 22:14 schrieb Felix Kuehling:

On 2022-11-25 05:21, Christian König wrote:

Instead of a single worker going over the list of delete BOs in regular
intervals use a per BO worker which blocks for the resv object and
locking of the BO.

This not only simplifies the handling massively, but also results in
much better response time when cleaning up buffers.

Signed-off-by: Christian König 


Just thinking out loud: If I understand it correctly, this can cause a 
lot of sleeping worker threads when 
AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE is used and many BOs are freed 
at the same time. This happens e.g. when a KFD process terminates or 
crashes. I guess with a concurrency-managed workqueue this isn't going 
to be excessive. And since it's on a per device workqueue, it doesn't 
stall work items on the system work queue or from other devices.


Yes, exactly that. The last parameter to alloc_workqueue() limits how 
many work items can be sleeping.
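
A minimal sketch of such an allocation (the queue name, flags, and the
limit of 16 here are illustrative, not necessarily what the patch uses):

```c
#include <linux/errno.h>
#include <linux/workqueue.h>

struct my_device {
	struct workqueue_struct *wq;	/* per-device delayed-delete queue */
};

/* Sketch only: WQ_MEM_RECLAIM gives the queue a rescuer thread so it can
 * make forward progress under memory pressure, and the last argument
 * (max_active) bounds how many work items may be active, and therefore
 * sleeping in their work function, at the same time. */
static int my_device_init_cleanup_wq(struct my_device *mdev)
{
	mdev->wq = alloc_workqueue("bo-cleanup", WQ_MEM_RECLAIM, 16);
	return mdev->wq ? 0 : -ENOMEM;
}
```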


I'm trying to understand why you set WQ_MEM_RECLAIM. This work queue 
is not about freeing ttm_resources but about freeing the BOs. But it 
affects freeing of ghost_objs that are holding the ttm_resources being 
freed.


Well if the BO is idle, but not immediately lockable we delegate freeing 
the backing pages in the TT object to those workers as well. It might 
even be a good idea to use a separate wq for this case.
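
As an illustration of the per-BO pattern being discussed, here is a hedged
sketch (simplified, with placeholder names; the actual patch embeds a
`delayed_delete` work item in the TTM BO): the work item blocks on the
reservation object, locks the BO, cleans up, and drops the reference held
by the work.

```c
#include <linux/dma-resv.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

/* Illustrative BO and device types, not the real TTM structures. */
struct my_bo {
	struct dma_resv *resv;
	struct work_struct delayed_delete;
};

struct my_device {
	struct workqueue_struct *wq;	/* per-device WQ_MEM_RECLAIM queue */
};

/* Sketch only: instead of a periodic scan over a global destroy list, each
 * freed-but-busy BO queues its own work item, which blocks until the
 * reservation object is idle and then finishes the cleanup. */
static void bo_delayed_delete(struct work_struct *work)
{
	struct my_bo *bo = container_of(work, struct my_bo, delayed_delete);

	/* Wait for all fences attached to the reservation object. */
	dma_resv_wait_timeout(bo->resv, DMA_RESV_USAGE_BOOKKEEP, false,
			      MAX_SCHEDULE_TIMEOUT);

	dma_resv_lock(bo->resv, NULL);
	/* ... release the resource and the backing pages here ... */
	dma_resv_unlock(bo->resv);

	/* Finally drop the reference the work item holds on the BO. */
}

static void bo_schedule_delayed_delete(struct my_device *mdev, struct my_bo *bo)
{
	INIT_WORK(&bo->delayed_delete, bo_delayed_delete);
	queue_work(mdev->wq, &bo->delayed_delete);
}
```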




If those assumptions all make sense, patches 1-3 are

Reviewed-by: Felix Kuehling 


Thanks,
Christian.





---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +-
  drivers/gpu/drm/i915/i915_gem.c    |   2 +-
  drivers/gpu/drm/i915/intel_region_ttm.c    |   2 +-
  drivers/gpu/drm/ttm/ttm_bo.c   | 112 -
  drivers/gpu/drm/ttm/ttm_bo_util.c  |   1 -
  drivers/gpu/drm/ttm/ttm_device.c   |  24 ++---
  include/drm/ttm/ttm_bo_api.h   |  18 +---
  include/drm/ttm/ttm_device.h   |   7 +-
  8 files changed, 57 insertions(+), 111 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 2b1db37e25c1..74ccbd566777 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3984,7 +3984,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device 
*adev)

  amdgpu_fence_driver_hw_fini(adev);
    if (adev->mman.initialized)
-    flush_delayed_work(&adev->mman.bdev.wq);
+    drain_workqueue(adev->mman.bdev.wq);
    if (adev->pm_sysfs_en)
  amdgpu_pm_sysfs_fini(adev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c 
b/drivers/gpu/drm/i915/i915_gem.c

index 8468ca9885fd..c38306f156d6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1099,7 +1099,7 @@ void i915_gem_drain_freed_objects(struct 
drm_i915_private *i915)

  {
  while (atomic_read(&i915->mm.free_count)) {
  flush_work(&i915->mm.free_work);
-    flush_delayed_work(&i915->bdev.wq);
+    drain_workqueue(i915->bdev.wq);
  rcu_barrier();
  }
  }
diff --git a/drivers/gpu/drm/i915/intel_region_ttm.c 
b/drivers/gpu/drm/i915/intel_region_ttm.c

index cf89d0c2a2d9..657bbc16a48a 100644
--- a/drivers/gpu/drm/i915/intel_region_ttm.c
+++ b/drivers/gpu/drm/i915/intel_region_ttm.c
@@ -132,7 +132,7 @@ int intel_region_ttm_fini(struct 
intel_memory_region *mem)

  break;
    msleep(20);
-    flush_delayed_work(&mem->i915->bdev.wq);
+    drain_workqueue(mem->i915->bdev.wq);
  }
    /* If we leaked objects, Don't free the region causing use 
after free */

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index b77262a623e0..4749b65bedc4 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -280,14 +280,13 @@ static int ttm_bo_cleanup_refs(struct 
ttm_buffer_object *bo,

  ret = 0;
  }
  -    if (ret || unlikely(list_empty(&bo->ddestroy))) {
+    if (ret) {
  if (unlock_resv)
  dma_resv_unlock(bo->base.resv);
  spin_unlock(&bo->bdev->lru_lock);
  return ret;
  }
  -    list_del_init(&bo->ddestroy);
  spin_unlock(&bo->bdev->lru_lock);
  ttm_bo_cleanup_memtype_use(bo);
  @@ -300,47 +299,21 @@ static int ttm_bo_cleanup_refs(struct 
ttm_buffer_object *bo,

  }
    /*
- * Traverse the delayed list, and call ttm_bo_cleanup_refs on all
- * encountered buffers.
+ * Block for the dma_resv object to become idle, lock the buffer and 
clean up

+ * the resource and tt object.
   */
-bool ttm_bo_delayed_delete(struct ttm_device *bdev, bool remove_all)
+static void ttm_bo_delayed_delete(struct work_struct *work)
  {
-    struct list_head removed;
-    bool empty;
-
-    INIT_LIST_HEAD(&removed);
-
-    spin_lock(&bdev->lru_lock);
-    while (!list_empty(&bdev->ddestroy)) {
-    struct ttm_buffer_object *bo;
-
-    bo = list_first_entry(&bdev->ddestroy, struct 
ttm_buffer_object,

-   

Re: [PATCH 3/9] drm/ttm: use per BO cleanup workers

2022-11-29 Thread Felix Kuehling

On 2022-11-25 05:21, Christian König wrote:

Instead of a single worker going over the list of delete BOs in regular
intervals use a per BO worker which blocks for the resv object and
locking of the BO.

This not only simplifies the handling massively, but also results in
much better response time when cleaning up buffers.

Signed-off-by: Christian König 


Just thinking out loud: If I understand it correctly, this can cause a 
lot of sleeping worker threads when 
AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE is used and many BOs are freed at 
the same time. This happens e.g. when a KFD process terminates or 
crashes. I guess with a concurrency-managed workqueue this isn't going 
to be excessive. And since it's on a per device workqueue, it doesn't 
stall work items on the system work queue or from other devices.
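
For illustration (a hedged, generic workqueue sketch rather than the TTM
code): queuing on a dedicated per-device queue instead of the shared
system queue keeps a flood of sleeping cleanup items from interfering with
unrelated work, and lets the device drain just its own queue on teardown.

```c
#include <linux/workqueue.h>

/* Sketch only: queue cleanup work on the device's own queue rather than
 * the shared system workqueue (schedule_work()), so many sleeping items
 * neither crowd out unrelated work nor have to be flushed globally. */
static void queue_cleanup(struct workqueue_struct *dev_wq,
			  struct work_struct *work)
{
	queue_work(dev_wq, work);	/* device-local */
	/* vs. schedule_work(work), which uses the shared system_wq */
}

/* On device teardown, wait for everything this device queued, including
 * work items that re-queue themselves, then free the queue. */
static void teardown_cleanup_queue(struct workqueue_struct *dev_wq)
{
	drain_workqueue(dev_wq);
	destroy_workqueue(dev_wq);
}
```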


I'm trying to understand why you set WQ_MEM_RECLAIM. This work queue is 
not about freeing ttm_resources but about freeing the BOs. But it 
affects freeing of ghost_objs that are holding the ttm_resources being 
freed.


If those assumptions all make sense, patches 1-3 are

Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +-
  drivers/gpu/drm/i915/i915_gem.c|   2 +-
  drivers/gpu/drm/i915/intel_region_ttm.c|   2 +-
  drivers/gpu/drm/ttm/ttm_bo.c   | 112 -
  drivers/gpu/drm/ttm/ttm_bo_util.c  |   1 -
  drivers/gpu/drm/ttm/ttm_device.c   |  24 ++---
  include/drm/ttm/ttm_bo_api.h   |  18 +---
  include/drm/ttm/ttm_device.h   |   7 +-
  8 files changed, 57 insertions(+), 111 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2b1db37e25c1..74ccbd566777 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3984,7 +3984,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
amdgpu_fence_driver_hw_fini(adev);
  
  	if (adev->mman.initialized)

-   flush_delayed_work(&adev->mman.bdev.wq);
+   drain_workqueue(adev->mman.bdev.wq);
  
  	if (adev->pm_sysfs_en)

amdgpu_pm_sysfs_fini(adev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8468ca9885fd..c38306f156d6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1099,7 +1099,7 @@ void i915_gem_drain_freed_objects(struct drm_i915_private 
*i915)
  {
while (atomic_read(&i915->mm.free_count)) {
flush_work(&i915->mm.free_work);
-   flush_delayed_work(&i915->bdev.wq);
+   drain_workqueue(i915->bdev.wq);
rcu_barrier();
}
  }
diff --git a/drivers/gpu/drm/i915/intel_region_ttm.c 
b/drivers/gpu/drm/i915/intel_region_ttm.c
index cf89d0c2a2d9..657bbc16a48a 100644
--- a/drivers/gpu/drm/i915/intel_region_ttm.c
+++ b/drivers/gpu/drm/i915/intel_region_ttm.c
@@ -132,7 +132,7 @@ int intel_region_ttm_fini(struct intel_memory_region *mem)
break;
  
  		msleep(20);

-   flush_delayed_work(&mem->i915->bdev.wq);
+   drain_workqueue(mem->i915->bdev.wq);
}
  
  	/* If we leaked objects, Don't free the region causing use after free */

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index b77262a623e0..4749b65bedc4 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -280,14 +280,13 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object 
*bo,
ret = 0;
}
  
-	if (ret || unlikely(list_empty(&bo->ddestroy))) {

+   if (ret) {
if (unlock_resv)
dma_resv_unlock(bo->base.resv);
spin_unlock(&bo->bdev->lru_lock);
return ret;
}
  
-	list_del_init(&bo->ddestroy);

spin_unlock(&bo->bdev->lru_lock);
ttm_bo_cleanup_memtype_use(bo);
  
@@ -300,47 +299,21 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo,

  }
  
  /*

- * Traverse the delayed list, and call ttm_bo_cleanup_refs on all
- * encountered buffers.
+ * Block for the dma_resv object to become idle, lock the buffer and clean up
+ * the resource and tt object.
   */
-bool ttm_bo_delayed_delete(struct ttm_device *bdev, bool remove_all)
+static void ttm_bo_delayed_delete(struct work_struct *work)
  {
-   struct list_head removed;
-   bool empty;
-
-   INIT_LIST_HEAD(&removed);
-
-   spin_lock(&bdev->lru_lock);
-   while (!list_empty(&bdev->ddestroy)) {
-   struct ttm_buffer_object *bo;
-
-   bo = list_first_entry(&bdev->ddestroy, struct ttm_buffer_object,
- ddestroy);
-   list_move_tail(&bo->ddestroy, &removed);
-   if (!ttm_bo_get_unless_zero(bo))
-   continue;
-
-   if (remove_all || bo->base.resv != &bo->base._resv) {
- 

[PATCH 3/9] drm/ttm: use per BO cleanup workers

2022-11-25 Thread Christian König
Instead of a single worker going over the list of delete BOs in regular
intervals use a per BO worker which blocks for the resv object and
locking of the BO.

This not only simplifies the handling massively, but also results in
much better response time when cleaning up buffers.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +-
 drivers/gpu/drm/i915/i915_gem.c|   2 +-
 drivers/gpu/drm/i915/intel_region_ttm.c|   2 +-
 drivers/gpu/drm/ttm/ttm_bo.c   | 112 -
 drivers/gpu/drm/ttm/ttm_bo_util.c  |   1 -
 drivers/gpu/drm/ttm/ttm_device.c   |  24 ++---
 include/drm/ttm/ttm_bo_api.h   |  18 +---
 include/drm/ttm/ttm_device.h   |   7 +-
 8 files changed, 57 insertions(+), 111 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2b1db37e25c1..74ccbd566777 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3984,7 +3984,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
amdgpu_fence_driver_hw_fini(adev);
 
if (adev->mman.initialized)
-   flush_delayed_work(&adev->mman.bdev.wq);
+   drain_workqueue(adev->mman.bdev.wq);
 
if (adev->pm_sysfs_en)
amdgpu_pm_sysfs_fini(adev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8468ca9885fd..c38306f156d6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1099,7 +1099,7 @@ void i915_gem_drain_freed_objects(struct drm_i915_private 
*i915)
 {
while (atomic_read(&i915->mm.free_count)) {
flush_work(&i915->mm.free_work);
-   flush_delayed_work(&i915->bdev.wq);
+   drain_workqueue(i915->bdev.wq);
rcu_barrier();
}
 }
diff --git a/drivers/gpu/drm/i915/intel_region_ttm.c 
b/drivers/gpu/drm/i915/intel_region_ttm.c
index cf89d0c2a2d9..657bbc16a48a 100644
--- a/drivers/gpu/drm/i915/intel_region_ttm.c
+++ b/drivers/gpu/drm/i915/intel_region_ttm.c
@@ -132,7 +132,7 @@ int intel_region_ttm_fini(struct intel_memory_region *mem)
break;
 
msleep(20);
-   flush_delayed_work(&mem->i915->bdev.wq);
+   drain_workqueue(mem->i915->bdev.wq);
}
 
/* If we leaked objects, Don't free the region causing use after free */
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index b77262a623e0..4749b65bedc4 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -280,14 +280,13 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object 
*bo,
ret = 0;
}
 
-   if (ret || unlikely(list_empty(&bo->ddestroy))) {
+   if (ret) {
if (unlock_resv)
dma_resv_unlock(bo->base.resv);
spin_unlock(&bo->bdev->lru_lock);
return ret;
}
 
-   list_del_init(&bo->ddestroy);
spin_unlock(&bo->bdev->lru_lock);
ttm_bo_cleanup_memtype_use(bo);
 
@@ -300,47 +299,21 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object 
*bo,
 }
 
 /*
- * Traverse the delayed list, and call ttm_bo_cleanup_refs on all
- * encountered buffers.
+ * Block for the dma_resv object to become idle, lock the buffer and clean up
+ * the resource and tt object.
  */
-bool ttm_bo_delayed_delete(struct ttm_device *bdev, bool remove_all)
+static void ttm_bo_delayed_delete(struct work_struct *work)
 {
-   struct list_head removed;
-   bool empty;
-
-   INIT_LIST_HEAD(&removed);
-
-   spin_lock(&bdev->lru_lock);
-   while (!list_empty(&bdev->ddestroy)) {
-   struct ttm_buffer_object *bo;
-
-   bo = list_first_entry(&bdev->ddestroy, struct ttm_buffer_object,
- ddestroy);
-   list_move_tail(&bo->ddestroy, &removed);
-   if (!ttm_bo_get_unless_zero(bo))
-   continue;
-
-   if (remove_all || bo->base.resv != &bo->base._resv) {
-   spin_unlock(&bdev->lru_lock);
-   dma_resv_lock(bo->base.resv, NULL);
-
-   spin_lock(&bdev->lru_lock);
-   ttm_bo_cleanup_refs(bo, false, !remove_all, true);
-
-   } else if (dma_resv_trylock(bo->base.resv)) {
-   ttm_bo_cleanup_refs(bo, false, !remove_all, true);
-   } else {
-   spin_unlock(&bdev->lru_lock);
-   }
+   struct ttm_buffer_object *bo;
 
-   ttm_bo_put(bo);
-   spin_lock(&bdev->lru_lock);
-   }
-   list_splice_tail(&removed, &bdev->ddestroy);
-   empty = list_empty(&bdev->ddestroy);
-   spin_unlock(&bdev->lru_lock);
+   bo = container_of(work, typeof(*bo), delayed_delete);
 
-