On 2018-01-17 03:33 PM, Andrey Grodzovsky wrote:
> I have a private libdrm amdgpu test which allocates very big BOs in
> a loop until all VRAM, GTT and swap are full, and I don't release them
> in the test (yet).
>
> Once the test process terminates, everything always gets cleared,
> including swap. Could this point to a KFD-specific issue?
That's possible.
I added some WARNs:
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 5a046a3..d68141e 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -175,7 +175,7 @@ void ttm_tt_destroy(struct ttm_tt *ttm)
 	if (ttm->state == tt_unbound)
 		ttm_tt_unpopulate(ttm);
-
+	WARN_ON(ttm->page_flags & TTM_PAGE_FLAG_PERSISTENT_SWAP);
 	if (!(ttm->page_flags & TTM_PAGE_FLAG_PERSISTENT_SWAP) &&
 	    ttm->swap_storage)
 		fput(ttm->swap_storage);
@@ -321,6 +321,7 @@ int ttm_tt_swapin(struct ttm_tt *ttm)
 	return 0;
 out_err:
+	WARN(1, "Returning error, not freeing swap_storage");
 	return ret;
 }
@@ -336,7 +337,8 @@ int ttm_tt_swapout(struct ttm_tt *ttm, struct file *persistent_swap_storage)
 	BUG_ON(ttm->state != tt_unbound && ttm->state != tt_unpopulated);
 	BUG_ON(ttm->caching_state != tt_cached);
-	if (!persistent_swap_storage) {
+	if (!persistent_swap_storage) {
+		WARN(ttm->swap_storage, "already has swap storage");
 		swap_storage = shmem_file_setup("ttm swap",
 						ttm->num_pages << PAGE_SHIFT,
 						0);
With those in place, I noticed that ttm_bo_swapout is getting called on
BOs that already have swap storage. I think that means it's trying to
swap out a BO that's already swapped out, and that's where it's leaking
the pointer to the already allocated swap space:
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602083] ------------[ cut here ]------------
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602086] already has swap storage
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602124] WARNING: CPU: 8 PID: 1940 at /home/fkuehlin/compute/kernel/drivers/gpu/drm/ttm/ttm_tt.c:341 ttm_tt_swapout+0x230/0x250 [ttm]
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602126] Modules linked in: ip6_tables(E) ip_tables(E) x_tables(E) x86_pkg_temp_thermal(E) amdkfd(E) amd_iommu_v2(E) amdgpu(E) chash(E) gpu_sched(E) ttm(E)
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602139] CPU: 8 PID: 1940 Comm: kworker/u24:6 Tainted: G W E 4.15.0-rc2-kfd-fkuehlin #7
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602141] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 2006 04/07/2016
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602144] Workqueue: ttm_swap ttm_shrink_work [ttm]
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602147] task: 00000000894fffc6 task.stack: 000000008f73bd43
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602150] RIP: 0010:ttm_tt_swapout+0x230/0x250 [ttm]
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602151] RSP: 0018:ffffa87a43633ce8 EFLAGS: 00010296
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602153] RAX: 0000000000000018 RBX: ffff90ba3af6b858 RCX: 0000000000000006
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602154] RDX: 0000000000001027 RSI: ffff90bbe3306df8 RDI: 0000000000000202
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602155] RBP: ffff90bbddabde00 R08: 0000000000000000 R09: 0000000000000000
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602156] R10: 000000005b2635d0 R11: 0000000000000000 R12: ffff90ba3af6b88c
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602157] R13: ffffa87a43633e3a R14: ffff90bbdda59d70 R15: ffff90bbdda59ce0
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602159] FS: 0000000000000000(0000) GS:ffff90bbe7400000(0000) knlGS:0000000000000000
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602160] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602161] CR2: 0000000001afbd58 CR3: 00000003f5010002 CR4: 00000000001606e0
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602162] Call Trace:
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602171] ttm_bo_swapout+0x23a/0x260 [ttm]
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602175] ? ttm_shrink+0xa8/0xf0 [ttm]
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602179] ttm_shrink+0xb6/0xf0 [ttm]
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602184] ttm_shrink_work+0x31/0x40 [ttm]
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602189] process_one_work+0x19d/0x430
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602191] ? process_one_work+0x136/0x430
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602196] worker_thread+0x45/0x430
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602202] kthread+0x134/0x170
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602204] ? process_one_work+0x430/0x430
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602206] ? kthread_delayed_work_timer_fn+0x80/0x80
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602212] ret_from_fork+0x24/0x30
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602218] Code: 89 45 40 31 c0 e9 34 ff ff ff 48 c7 c7 a8 ea 13 c0 e8 0b ff f8 ed 8b 44 24 08 e9 1f ff ff ff 48 c7 c7 de fa 13 c0 e8 d0 85 f2 ed <0f> ff eb 80 48 89 ef e8 14 fc ff ff 48 8b 04 24 48 89 45 40 8b
Jan 17 18:40:06 fkuehlin-hsatest2 kernel: [ 196.602268] ---[ end trace 019b6398cabc8266 ]---
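
To illustrate the suspected leak, here is a minimal userspace sketch. It is
not kernel code: fake_file, fake_tt, fake_swapout and the other fake_* names
are hypothetical stand-ins, and the model assumes that a non-persistent
swapout always allocates fresh backing storage and overwrites the
swap_storage pointer without releasing the old one. Under that assumption,
every swapout after the first strands one backing file that destroy can no
longer reach.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a shmem-backed struct file (not the kernel type). */
struct fake_file {
	int refcount;
};

/* Hypothetical stand-in for struct ttm_tt, reduced to the one relevant field. */
struct fake_tt {
	struct fake_file *swap_storage;
};

/* Number of backing files currently allocated; models swap space in use. */
static int live_files;

/* Models shmem_file_setup(): allocate backing storage with one reference. */
static struct fake_file *fake_file_setup(void)
{
	struct fake_file *f = malloc(sizeof(*f));

	assert(f != NULL);
	f->refcount = 1;
	live_files++;
	return f;
}

/* Models fput(): drop a reference and free the storage on the last one. */
static void fake_fput(struct fake_file *f)
{
	if (--f->refcount == 0) {
		free(f);
		live_files--;
	}
}

/*
 * Assumed behaviour of the non-persistent swapout path: allocate new
 * storage and overwrite the pointer without checking the old value.
 */
static void fake_swapout(struct fake_tt *tt)
{
	tt->swap_storage = fake_file_setup();
}

/* Models destroy: only the storage that is still referenced gets released. */
static void fake_destroy(struct fake_tt *tt)
{
	if (tt->swap_storage) {
		fake_fput(tt->swap_storage);
		tt->swap_storage = NULL;
	}
}

/*
 * Swap out the same object n times, destroy it, and return how many
 * backing files are still allocated afterwards (i.e. leaked).
 */
int leaked_after(int n)
{
	struct fake_tt tt = { .swap_storage = NULL };
	int before = live_files;
	int i;

	for (i = 0; i < n; i++)
		fake_swapout(&tt);
	fake_destroy(&tt);
	return live_files - before;
}
```

In this model a single swapout followed by destroy leaks nothing, while n
swapouts of the same object leak n - 1 backing files. That would match swap
usage that only goes up, if the same BO really is being swapped out
repeatedly.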
Regards,
Felix
>
> Thanks,
>
> Andrey
>
>
> On 01/16/2018 10:21 PM, Felix Kuehling wrote:
>> I'm running an eviction stress test with KFD and find that sometimes it
>> starts swapping. When that happens, swap usage goes up rapidly, but it
>> never comes down. Even after the processes terminate, and all VRAM and
>> GTT allocations are freed (checked in
>> /sys/kernel/debug/dri/0/amdgpu_{gtt|vram}_mm), swap space is still not
>> released.
>>
>> Running the test repeatedly I was able to trigger the OOM killer quite
>> easily. The system died with a panic, running out of processes to kill.
>>
>> The symptoms look like swap space is only allocated but never released.
>>
>> A quick look at the swapping code in ttm_tt.c doesn't show any obvious
>> problems. I'm assuming that fput should free swap space. That should
>> happen when BOs are swapped back in, or destroyed. As far as I can tell,
>> amdgpu doesn't use persistent swap space, so I'm ignoring
>> TTM_PAGE_FLAG_PERSISTENT_SWAP.
>>
>> Any other ideas or pointers?
>>
>> Thanks,
>> Felix
>>
>
_______________________________________________
amd-gfx mailing list
https://lists.freedesktop.org/mailman/listinfo/amd-gfx