Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
Hi guys, On Wed, 7 Apr 2021 at 11:27, Christian König wrote: > Am 07.04.21 um 09:47 schrieb Daniel Gomez: > > On Tue, 6 Apr 2021 at 22:56, Alex Deucher wrote: > >> On Mon, Mar 22, 2021 at 6:34 AM Christian König > >> wrote: > >>> Hi Daniel, > >>> > >>> Am 22.03.21 um 10:38 schrieb Daniel Gomez: > >>>> On Fri, 19 Mar 2021 at 21:29, Felix Kuehling > wrote: > >>>>> This caused a regression in kfdtest in a large-buffer stress test > after > >>>>> memory allocation for user pages fails: > >>>> I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and > >>>> not this one. > >>>> Just some background for the mem leak patch if helps to understand > this: > >>>> The leak was introduce here: > >>>> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fid%3D0b988ca1c7c4c73983b4ea96ef7c2af2263c87ebdata=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=%2FeOQf12NBkC3YGZ7QW66%2FT%2FpyM3DjEb9IMbqUvISMfo%3Dreserved=0 > >>>> where the bound status was introduced for all drm drivers including > >>>> radeon and amdgpu. So this patch just reverts the logic to the > >>>> original code but keeping the bound status. In my case, the binding > >>>> code allocates the user pages memory and returns without bounding (at > >>>> amdgpu_gtt_mgr_has_gart_addr). So, > >>>> when the unbinding happens, the memory needs to be cleared to prevent > the leak. > >>> Ah, now I understand what's happening here. Daniel your patch is not > >>> really correct. > >>> > >>> The problem is rather that we don't set the tt object to bound if it > >>> doesn't have a GTT address. > >>> > >>> Going to provide a patch for this. > >> Did this patch ever land? > > I don't think so but I might send a v2 following Christian's comment > > if you guys agree. > > Somebody else already provided a patch which I reviewed, but I'm not > sure if that landed either. > > > Also, the patch here is for radeon but the pagefault issue reported by > > Felix is affected by the amdgpu one: > > > > radeon patch: drm/radeon/ttm: Fix memory leak userptr pages > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210318083236.43578-1-daniel%40qtec.com%2Fdata=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=HSMK%2FqYz%2Bzz9qbKc%2FITUWluBDeaW9YrgyH8p0L640%2F0%3Dreserved=0 > > > > amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210317160840.36019-1-daniel%40qtec.com%2Fdata=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=UsUZ4YjCSjHmzlPB07oTaGrsntTrQSwlGk%2BxUjwDiag%3Dreserved=0 > > > > I assume both need to be fixed with the same approach. > > Yes correct. Let me double check where that fix went. This patch (actually, the memory leak fix for amdgpu not radeon) has landed in mainline and has been back-ported to the stable branches. I just want to verify with you if that’s okay and the NULL pointer issue reported by Felix is fixed by this other patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c?id=3c3dc654333f6389803cdcaf03912e94173ae510 Thanks, Daniel > > > Thanks, > Christian. > > > > > Daniel > >> Alex > >> > >>> Regards, > >>> Christian. > >>> > >>>>> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16 > >>>>> [17359.543746] BUG: kernel NULL pointer dereference, address: > > >>>>> [17359.551494] #PF: supervisor read access in kernel mode > >>>>> [17359.557375] #PF: error_code(0x) - not-present page > >>>>> [17359.563247] PGD 0 P4D 0 &
Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
Am 07.04.21 um 09:47 schrieb Daniel Gomez: On Tue, 6 Apr 2021 at 22:56, Alex Deucher wrote: On Mon, Mar 22, 2021 at 6:34 AM Christian König wrote: Hi Daniel, Am 22.03.21 um 10:38 schrieb Daniel Gomez: On Fri, 19 Mar 2021 at 21:29, Felix Kuehling wrote: This caused a regression in kfdtest in a large-buffer stress test after memory allocation for user pages fails: I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and not this one. Just some background for the mem leak patch if helps to understand this: The leak was introduce here: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fid%3D0b988ca1c7c4c73983b4ea96ef7c2af2263c87ebdata=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=%2FeOQf12NBkC3YGZ7QW66%2FT%2FpyM3DjEb9IMbqUvISMfo%3Dreserved=0 where the bound status was introduced for all drm drivers including radeon and amdgpu. So this patch just reverts the logic to the original code but keeping the bound status. In my case, the binding code allocates the user pages memory and returns without bounding (at amdgpu_gtt_mgr_has_gart_addr). So, when the unbinding happens, the memory needs to be cleared to prevent the leak. Ah, now I understand what's happening here. Daniel your patch is not really correct. The problem is rather that we don't set the tt object to bound if it doesn't have a GTT address. Going to provide a patch for this. Did this patch ever land? I don't think so but I might send a v2 following Christian's comment if you guys agree. Somebody else already provided a patch which I reviewed, but I'm not sure if that landed either. Also, the patch here is for radeon but the pagefault issue reported by Felix is affected by the amdgpu one: radeon patch: drm/radeon/ttm: Fix memory leak userptr pages https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210318083236.43578-1-daniel%40qtec.com%2Fdata=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=HSMK%2FqYz%2Bzz9qbKc%2FITUWluBDeaW9YrgyH8p0L640%2F0%3Dreserved=0 amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210317160840.36019-1-daniel%40qtec.com%2Fdata=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=UsUZ4YjCSjHmzlPB07oTaGrsntTrQSwlGk%2BxUjwDiag%3Dreserved=0 I assume both need to be fixed with the same approach. Yes correct. Let me double check where that fix went. Thanks, Christian. Daniel Alex Regards, Christian. [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16 [17359.543746] BUG: kernel NULL pointer dereference, address: [17359.551494] #PF: supervisor read access in kernel mode [17359.557375] #PF: error_code(0x) - not-present page [17359.563247] PGD 0 P4D 0 [17359.566514] Oops: [#1] SMP PTI [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193 [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016 [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu] [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45 [17359.614340] RSP: 0018:a4764971fc98 EFLAGS: 00010206 [17359.620315] RAX: RBX: 950e8d4edf00 RCX: [17359.628204] RDX: RSI: 950e8d4edf00 RDI: 950eadec5e80 [17359.636084] RBP: 950eadec5e80 R08: R09: [17359.643958] R10: 0246 R11: 0001 R12: 950c03377800 [17359.651833] R13: 950eadec5e80 R14: 950c03377858 R15: [17359.659701] FS: 7febb20cb740() GS:950ebfc0() knlGS: [17359.668528] CS: 0010 DS: ES: CR0: 80050033 [17359.675012] CR2: CR3: 0006d700e005 CR4: 001706e0 [17359.682883] Call Trace: [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu] [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm] [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm] [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu] [17359.
Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
On Tue, 6 Apr 2021 at 22:56, Alex Deucher wrote: > > On Mon, Mar 22, 2021 at 6:34 AM Christian König > wrote: > > > > Hi Daniel, > > > > Am 22.03.21 um 10:38 schrieb Daniel Gomez: > > > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling > > > wrote: > > >> This caused a regression in kfdtest in a large-buffer stress test after > > >> memory allocation for user pages fails: > > > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and > > > not this one. > > > Just some background for the mem leak patch if helps to understand this: > > > The leak was introduce here: > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb > > > where the bound status was introduced for all drm drivers including > > > radeon and amdgpu. So this patch just reverts the logic to the > > > original code but keeping the bound status. In my case, the binding > > > code allocates the user pages memory and returns without bounding (at > > > amdgpu_gtt_mgr_has_gart_addr). So, > > > when the unbinding happens, the memory needs to be cleared to prevent the > > > leak. > > > > Ah, now I understand what's happening here. Daniel your patch is not > > really correct. > > > > The problem is rather that we don't set the tt object to bound if it > > doesn't have a GTT address. > > > > Going to provide a patch for this. > > Did this patch ever land? I don't think so but I might send a v2 following Christian's comment if you guys agree. Also, the patch here is for radeon but the pagefault issue reported by Felix is affected by the amdgpu one: radeon patch: drm/radeon/ttm: Fix memory leak userptr pages https://patchwork.kernel.org/project/dri-devel/patch/20210318083236.43578-1-dan...@qtec.com/ amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages https://patchwork.kernel.org/project/dri-devel/patch/20210317160840.36019-1-dan...@qtec.com/ I assume both need to be fixed with the same approach. Daniel > > Alex > > > > > Regards, > > Christian. > > > > > > > >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16 > > >> [17359.543746] BUG: kernel NULL pointer dereference, address: > > >> > > >> [17359.551494] #PF: supervisor read access in kernel mode > > >> [17359.557375] #PF: error_code(0x) - not-present page > > >> [17359.563247] PGD 0 P4D 0 > > >> [17359.566514] Oops: [#1] SMP PTI > > >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted > > >> 5.11.0-kfd-fkuehlin #193 > > >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS > > >> 3201 06/17/2016 > > >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu] > > >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 > > >> 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e > > >> 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45 > > >> [17359.614340] RSP: 0018:a4764971fc98 EFLAGS: 00010206 > > >> [17359.620315] RAX: RBX: 950e8d4edf00 RCX: > > >> > > >> [17359.628204] RDX: RSI: 950e8d4edf00 RDI: > > >> 950eadec5e80 > > >> [17359.636084] RBP: 950eadec5e80 R08: R09: > > >> > > >> [17359.643958] R10: 0246 R11: 0001 R12: > > >> 950c03377800 > > >> [17359.651833] R13: 950eadec5e80 R14: 950c03377858 R15: > > >> > > >> [17359.659701] FS: 7febb20cb740() GS:950ebfc0() > > >> knlGS: > > >> [17359.668528] CS: 0010 DS: ES: CR0: 80050033 > > >> [17359.675012] CR2: CR3: 0006d700e005 CR4: > > >> 001706e0 > > >> [17359.682883] Call Trace: > > >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu] > > >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm] > > >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm] > > >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu] > > >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 > > >> [amdgpu] > > >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu] > > >> [17359.723036]
Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
404 > >> [17359.806352] R10: R11: 0202 R12: > >> 559416e4e2aa > >> [17359.814251] R13: R14: 0021 R15: > >> > >> [17359.822140] Modules linked in: ip6table_filter ip6_tables > >> iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 > >> gpu_sched ip_tables x_tables > >> [17359.837776] CR2: > >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]--- > >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu] > >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 > >> 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 > >> <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45 > >> [17359.874929] RSP: 0018:a4764971fc98 EFLAGS: 00010206 > >> [17359.881014] RAX: RBX: 950e8d4edf00 RCX: > >> > >> [17359.889007] RDX: RSI: 950e8d4edf00 RDI: > >> 950eadec5e80 > >> [17359.897008] RBP: 950eadec5e80 R08: R09: > >> > >> [17359.905020] R10: 0246 R11: 0001 R12: > >> 950c03377800 > >> [17359.913034] R13: 950eadec5e80 R14: 950c03377858 R15: > >> > >> [17359.921050] FS: 7febb20cb740() GS:950ebfc0() > >> knlGS: > >> [17359.930047] CS: 0010 DS: ES: CR0: 80050033 > >> [17359.936674] CR2: CR3: 0006d700e005 CR4: > >> 001706e0 > > From what I understand, the init_user_pages fails (returns EBUSY) and > > the code goes to allocate_init_user_pages_failed where the unbind and > > the userptr clear occurs. > > Can we prevent this if we save the bounding status + userptr alloc? so > > the function amdgpu_ttm_backend_unbind returns without trying to clear > > the userptr memory? > > > > Something like: > > > > amdgpu_ttm_backend_bind: > > if (gtt->userptr) { > > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm); > > if (r) ... > > gtt->sg_table = true; > > } > > > > amdgpu_ttm_backend_unbind: > > if (gtt->sg_table) { > > if (gtt->user_ptr) ... > > } > > > > If you agree, I'll send a v2 patch. Otherwise, maybe we could return > > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated. > > > > Any other ideas? > > > > Regards, > > Daniel > > > >> Reverting this patch fixes the problem for me. > >> > >> Regards, > >> Felix > >> > >> On 2021-03-18 10:57 p.m., Alex Deucher wrote: > >>> Applied. Thanks! > >>> > >>> Alex > >>> > >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian > >>> wrote: > >>>> Reviewed-by: Christian König > >>>> > >>>> Von: Daniel Gomez > >>>> Gesendet: Donnerstag, 18. März 2021 09:32 > >>>> Cc: dag...@gmail.com ; Daniel Gomez ; > >>>> Deucher, Alexander ; Koenig, Christian > >>>> ; David Airlie ; Daniel > >>>> Vetter ; amd-gfx@lists.freedesktop.org > >>>> ; dri-de...@lists.freedesktop.org > >>>> ; linux-ker...@vger.kernel.org > >>>> > >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages > >>>> > >>>> If userptr pages have been pinned but not bounded, > >>>> they remain uncleared. > >>>> > >>>> Signed-off-by: Daniel Gomez > >>>> --- > >>>>drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++-- > >>>>1 file changed, 3 insertions(+), 2 deletions(-) > >>>> > >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c > >>>> b/drivers/gpu/drm/radeon/radeon_ttm.c > >>>> index e8c66d10478f..bbcc6264d48f 100644 > >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c > >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c > >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct > >>>> ttm_bo_device *bdev, struct ttm_tt > >>>>struct radeon_ttm_tt *gtt = (void *)ttm; > >>>>struct radeon_device *rdev = radeon_get_rdev(bdev); > >>>> > >>>> + if (gtt->userptr) > >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm); > >>>> + > >>>>if (!gtt->bound) > >>>>return; > >>>> > >>>>radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages); > >>>> > >>>> - if (gtt->userptr) > >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm); > >>>>gtt->bound = false; > >>>>} > >>>> > >>>> -- > >>>> 2.30.2 > >>>> > >>>> ___ > >>>> dri-devel mailing list > >>>> dri-de...@lists.freedesktop.org > >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel > >>> ___ > >>> dri-devel mailing list > >>> dri-de...@lists.freedesktop.org > >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel > > ___ > > amd-gfx mailing list > > amd-gfx@lists.freedesktop.org > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx > ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
dd0 R09: > >> c404 > >> [17359.806352] R10: R11: 0202 R12: > >> 559416e4e2aa > >> [17359.814251] R13: R14: 0021 R15: > >> > >> [17359.822140] Modules linked in: ip6table_filter ip6_tables > >> iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 > >> gpu_sched ip_tables x_tables > >> [17359.837776] CR2: > >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]--- > >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu] > >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 > >> 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 > >> <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45 > >> [17359.874929] RSP: 0018:a4764971fc98 EFLAGS: 00010206 > >> [17359.881014] RAX: RBX: 950e8d4edf00 RCX: > >> > >> [17359.889007] RDX: RSI: 950e8d4edf00 RDI: > >> 950eadec5e80 > >> [17359.897008] RBP: 950eadec5e80 R08: R09: > >> > >> [17359.905020] R10: 0246 R11: 0001 R12: > >> 950c03377800 > >> [17359.913034] R13: 950eadec5e80 R14: 950c03377858 R15: > >> > >> [17359.921050] FS: 7febb20cb740() GS:950ebfc0() > >> knlGS: > >> [17359.930047] CS: 0010 DS: ES: CR0: 80050033 > >> [17359.936674] CR2: CR3: 0006d700e005 CR4: > >> 001706e0 > > From what I understand, the init_user_pages fails (returns EBUSY) and > > the code goes to allocate_init_user_pages_failed where the unbind and > > the userptr clear occurs. > > Can we prevent this if we save the bounding status + userptr alloc? so > > the function amdgpu_ttm_backend_unbind returns without trying to clear > > the userptr memory? > > > > Something like: > > > > amdgpu_ttm_backend_bind: > > if (gtt->userptr) { > > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm); > > if (r) ... > > gtt->sg_table = true; > > } > > > > amdgpu_ttm_backend_unbind: > > if (gtt->sg_table) { > > if (gtt->user_ptr) ... > > } > > > > If you agree, I'll send a v2 patch. Otherwise, maybe we could return > > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated. > > > > Any other ideas? > > > > Regards, > > Daniel > > > >> Reverting this patch fixes the problem for me. > >> > >> Regards, > >> Felix > >> > >> On 2021-03-18 10:57 p.m., Alex Deucher wrote: > >>> Applied. Thanks! > >>> > >>> Alex > >>> > >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian > >>> wrote: > >>>> Reviewed-by: Christian König > >>>> > >>>> Von: Daniel Gomez > >>>> Gesendet: Donnerstag, 18. März 2021 09:32 > >>>> Cc: dag...@gmail.com ; Daniel Gomez ; > >>>> Deucher, Alexander ; Koenig, Christian > >>>> ; David Airlie ; Daniel > >>>> Vetter ; amd-gfx@lists.freedesktop.org > >>>> ; dri-de...@lists.freedesktop.org > >>>> ; linux-ker...@vger.kernel.org > >>>> > >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages > >>>> > >>>> If userptr pages have been pinned but not bounded, > >>>> they remain uncleared. > >>>> > >>>> Signed-off-by: Daniel Gomez > >>>> --- > >>>>drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++-- > >>>>1 file changed, 3 insertions(+), 2 deletions(-) > >>>> > >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c > >>>> b/drivers/gpu/drm/radeon/radeon_ttm.c > >>>> index e8c66d10478f..bbcc6264d48f 100644 > >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c > >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c > >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct > >>>> ttm_bo_device *bdev, struct ttm_tt > >>>>struct radeon_ttm_tt *gtt = (void *)ttm; > >>>>struct radeon_device *rdev = radeon_get_rdev(bdev); > >>>> > >>>> + if (gtt->userptr) > >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm); > >>>> + > >>>>if (!gtt->bound) > >>>>return; > >>>> > >>>>radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages); > >>>> > >>>> - if (gtt->userptr) > >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm); > >>>>gtt->bound = false; > >>>>} > >>>> > >>>> -- > >>>> 2.30.2 > >>>> > >>>> ___ > >>>> dri-devel mailing list > >>>> dri-de...@lists.freedesktop.org > >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel > >>> ___ > >>> dri-devel mailing list > >>> dri-de...@lists.freedesktop.org > >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel > > ___ > > amd-gfx mailing list > > amd-gfx@lists.freedesktop.org > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx > ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
: 950eadec5e80 R14: 950c03377858 R15: [17359.921050] FS: 7febb20cb740() GS:950ebfc0() knlGS: [17359.930047] CS: 0010 DS: ES: CR0: 80050033 [17359.936674] CR2: CR3: 0006d700e005 CR4: 001706e0 From what I understand, the init_user_pages fails (returns EBUSY) and the code goes to allocate_init_user_pages_failed where the unbind and the userptr clear occurs. Can we prevent this if we save the bounding status + userptr alloc? so the function amdgpu_ttm_backend_unbind returns without trying to clear the userptr memory? Something like: amdgpu_ttm_backend_bind: if (gtt->userptr) { r = amdgpu_ttm_tt_pin_userptr(bdev, ttm); if (r) ... gtt->sg_table = true; } amdgpu_ttm_backend_unbind: if (gtt->sg_table) { if (gtt->user_ptr) ... } If you agree, I'll send a v2 patch. Otherwise, maybe we could return within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated. Any other ideas? Regards, Daniel Reverting this patch fixes the problem for me. Regards, Felix On 2021-03-18 10:57 p.m., Alex Deucher wrote: Applied. Thanks! Alex On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian wrote: Reviewed-by: Christian König Von: Daniel Gomez Gesendet: Donnerstag, 18. März 2021 09:32 Cc: dag...@gmail.com ; Daniel Gomez ; Deucher, Alexander ; Koenig, Christian ; David Airlie ; Daniel Vetter ; amd-gfx@lists.freedesktop.org ; dri-de...@lists.freedesktop.org ; linux-ker...@vger.kernel.org Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages If userptr pages have been pinned but not bounded, they remain uncleared. Signed-off-by: Daniel Gomez --- drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index e8c66d10478f..bbcc6264d48f 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt struct radeon_ttm_tt *gtt = (void *)ttm; struct radeon_device *rdev = radeon_get_rdev(bdev); + if (gtt->userptr) + radeon_ttm_tt_unpin_userptr(bdev, ttm); + if (!gtt->bound) return; radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages); - if (gtt->userptr) - radeon_ttm_tt_unpin_userptr(bdev, ttm); gtt->bound = false; } -- 2.30.2 ___ dri-devel mailing list dri-de...@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ___ dri-devel mailing list dri-de...@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> [17359.905020] R10: 0246 R11: 0001 R12: > 950c03377800 > [17359.913034] R13: 950eadec5e80 R14: 950c03377858 R15: > > [17359.921050] FS: 7febb20cb740() GS:950ebfc0() > knlGS: > [17359.930047] CS: 0010 DS: ES: CR0: 80050033 > [17359.936674] CR2: CR3: 0006d700e005 CR4: > 001706e0 From what I understand, the init_user_pages fails (returns EBUSY) and the code goes to allocate_init_user_pages_failed where the unbind and the userptr clear occurs. Can we prevent this if we save the bounding status + userptr alloc? so the function amdgpu_ttm_backend_unbind returns without trying to clear the userptr memory? Something like: amdgpu_ttm_backend_bind: if (gtt->userptr) { r = amdgpu_ttm_tt_pin_userptr(bdev, ttm); if (r) ... gtt->sg_table = true; } amdgpu_ttm_backend_unbind: if (gtt->sg_table) { if (gtt->user_ptr) ... } If you agree, I'll send a v2 patch. Otherwise, maybe we could return within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated. Any other ideas? Regards, Daniel > > Reverting this patch fixes the problem for me. > > Regards, >Felix > > On 2021-03-18 10:57 p.m., Alex Deucher wrote: > > Applied. Thanks! > > > > Alex > > > > On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian > > wrote: > >> Reviewed-by: Christian König > >> ____________________ > >> Von: Daniel Gomez > >> Gesendet: Donnerstag, 18. März 2021 09:32 > >> Cc: dag...@gmail.com ; Daniel Gomez ; > >> Deucher, Alexander ; Koenig, Christian > >> ; David Airlie ; Daniel Vetter > >> ; amd-gfx@lists.freedesktop.org > >> ; dri-de...@lists.freedesktop.org > >> ; linux-ker...@vger.kernel.org > >> > >> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages > >> > >> If userptr pages have been pinned but not bounded, > >> they remain uncleared. > >> > >> Signed-off-by: Daniel Gomez > >> --- > >> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++-- > >> 1 file changed, 3 insertions(+), 2 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c > >> b/drivers/gpu/drm/radeon/radeon_ttm.c > >> index e8c66d10478f..bbcc6264d48f 100644 > >> --- a/drivers/gpu/drm/radeon/radeon_ttm.c > >> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c > >> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct > >> ttm_bo_device *bdev, struct ttm_tt > >> struct radeon_ttm_tt *gtt = (void *)ttm; > >> struct radeon_device *rdev = radeon_get_rdev(bdev); > >> > >> + if (gtt->userptr) > >> + radeon_ttm_tt_unpin_userptr(bdev, ttm); > >> + > >> if (!gtt->bound) > >> return; > >> > >> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages); > >> > >> - if (gtt->userptr) > >> - radeon_ttm_tt_unpin_userptr(bdev, ttm); > >> gtt->bound = false; > >> } > >> > >> -- > >> 2.30.2 > >> > >> ___ > >> dri-devel mailing list > >> dri-de...@lists.freedesktop.org > >> https://lists.freedesktop.org/mailman/listinfo/dri-devel > > ___ > > dri-devel mailing list > > dri-de...@lists.freedesktop.org > > https://lists.freedesktop.org/mailman/listinfo/dri-devel ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
This caused a regression in kfdtest in a large-buffer stress test after memory allocation for user pages fails: [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16 [17359.543746] BUG: kernel NULL pointer dereference, address: [17359.551494] #PF: supervisor read access in kernel mode [17359.557375] #PF: error_code(0x) - not-present page [17359.563247] PGD 0 P4D 0 [17359.566514] Oops: [#1] SMP PTI [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193 [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016 [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu] [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45 [17359.614340] RSP: 0018:a4764971fc98 EFLAGS: 00010206 [17359.620315] RAX: RBX: 950e8d4edf00 RCX: [17359.628204] RDX: RSI: 950e8d4edf00 RDI: 950eadec5e80 [17359.636084] RBP: 950eadec5e80 R08: R09: [17359.643958] R10: 0246 R11: 0001 R12: 950c03377800 [17359.651833] R13: 950eadec5e80 R14: 950c03377858 R15: [17359.659701] FS: 7febb20cb740() GS:950ebfc0() knlGS: [17359.668528] CS: 0010 DS: ES: CR0: 80050033 [17359.675012] CR2: CR3: 0006d700e005 CR4: 001706e0 [17359.682883] Call Trace: [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu] [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm] [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm] [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu] [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu] [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu] [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu] [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu] [17359.734152] __x64_sys_ioctl+0x8b/0xd0 [17359.738796] do_syscall_64+0x2d/0x40 [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [17359.749205] RIP: 0033:0x7febb083b6d7 [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48 [17359.774340] RSP: 002b:7ffdb5522cd8 EFLAGS: 0202 ORIG_RAX: 0010 [17359.782668] RAX: ffda RBX: 0001 RCX: 7febb083b6d7 [17359.790566] RDX: 7ffdb5522d60 RSI: c0284b16 RDI: 0003 [17359.798459] RBP: 7ffdb5522d10 R08: 7ffdb5522dd0 R09: c404 [17359.806352] R10: R11: 0202 R12: 559416e4e2aa [17359.814251] R13: R14: 0021 R15: [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables [17359.837776] CR2: [17359.841888] ---[ end trace a6f27d64475b28c8 ]--- [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu] [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45 [17359.874929] RSP: 0018:a4764971fc98 EFLAGS: 00010206 [17359.881014] RAX: RBX: 950e8d4edf00 RCX: [17359.889007] RDX: RSI: 950e8d4edf00 RDI: 950eadec5e80 [17359.897008] RBP: 950eadec5e80 R08: R09: [17359.905020] R10: 0246 R11: 0001 R12: 950c03377800 [17359.913034] R13: 950eadec5e80 R14: 950c03377858 R15: [17359.921050] FS: 7febb20cb740() GS:950ebfc0() knlGS: [17359.930047] CS: 0010 DS: ES: CR0: 80050033 [17359.936674] CR2: CR3: 0006d700e005 CR4: 001706e0 Reverting this patch fixes the problem for me. Regards, Felix On 2021-03-18 10:57 p.m., Alex Deucher wrote: Applied. Thanks! Alex On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian wrote: Reviewed-by: Christian König Von: Daniel Gomez Gesendet: Donnerstag, 18. März 2021 09:32 Cc: dag...@gmail.com ; Daniel Gomez ; Deucher, Alexander ; Koenig, Christian ; David Airlie ; Daniel Vetter ; amd-gfx@lists.freedesktop.org ; dri-de...@lists.freedesktop.org ; linux-ker...@vger.kernel.org Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages If userptr pages have been pinned but not bounded, they remain uncleared. Signed-off-by: Daniel Gomez --- drivers/gpu/drm/radeon/radeon_ttm.c
Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
Applied. Thanks! Alex On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian wrote: > > Reviewed-by: Christian König > > Von: Daniel Gomez > Gesendet: Donnerstag, 18. März 2021 09:32 > Cc: dag...@gmail.com ; Daniel Gomez ; > Deucher, Alexander ; Koenig, Christian > ; David Airlie ; Daniel Vetter > ; amd-gfx@lists.freedesktop.org > ; dri-de...@lists.freedesktop.org > ; linux-ker...@vger.kernel.org > > Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages > > If userptr pages have been pinned but not bounded, > they remain uncleared. > > Signed-off-by: Daniel Gomez > --- > drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c > b/drivers/gpu/drm/radeon/radeon_ttm.c > index e8c66d10478f..bbcc6264d48f 100644 > --- a/drivers/gpu/drm/radeon/radeon_ttm.c > +++ b/drivers/gpu/drm/radeon/radeon_ttm.c > @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct > ttm_bo_device *bdev, struct ttm_tt > struct radeon_ttm_tt *gtt = (void *)ttm; > struct radeon_device *rdev = radeon_get_rdev(bdev); > > + if (gtt->userptr) > + radeon_ttm_tt_unpin_userptr(bdev, ttm); > + > if (!gtt->bound) > return; > > radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages); > > - if (gtt->userptr) > - radeon_ttm_tt_unpin_userptr(bdev, ttm); > gtt->bound = false; > } > > -- > 2.30.2 > > ___ > dri-devel mailing list > dri-de...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
AW: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
Reviewed-by: Christian König Von: Daniel Gomez Gesendet: Donnerstag, 18. März 2021 09:32 Cc: dag...@gmail.com ; Daniel Gomez ; Deucher, Alexander ; Koenig, Christian ; David Airlie ; Daniel Vetter ; amd-gfx@lists.freedesktop.org ; dri-de...@lists.freedesktop.org ; linux-ker...@vger.kernel.org Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages If userptr pages have been pinned but not bounded, they remain uncleared. Signed-off-by: Daniel Gomez --- drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index e8c66d10478f..bbcc6264d48f 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt struct radeon_ttm_tt *gtt = (void *)ttm; struct radeon_device *rdev = radeon_get_rdev(bdev); + if (gtt->userptr) + radeon_ttm_tt_unpin_userptr(bdev, ttm); + if (!gtt->bound) return; radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages); - if (gtt->userptr) - radeon_ttm_tt_unpin_userptr(bdev, ttm); gtt->bound = false; } -- 2.30.2 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/radeon/ttm: Fix memory leak userptr pages
If userptr pages have been pinned but not bounded, they remain uncleared. Signed-off-by: Daniel Gomez --- drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index e8c66d10478f..bbcc6264d48f 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt struct radeon_ttm_tt *gtt = (void *)ttm; struct radeon_device *rdev = radeon_get_rdev(bdev); + if (gtt->userptr) + radeon_ttm_tt_unpin_userptr(bdev, ttm); + if (!gtt->bound) return; radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages); - if (gtt->userptr) - radeon_ttm_tt_unpin_userptr(bdev, ttm); gtt->bound = false; } -- 2.30.2 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx