zone_device_page_init() uses set_page_count() to set the vram page
refcount to 1. This races if step 2 below happens between steps 1 and 3:

1. The CPU page fault handler gets the vram page and migrates the vram
   page to a system page.
2. A GPU page fault migrates to the vram page and sets the page refcount
   to 1.
3. The CPU page fault handler puts the vram page; the vram page refcount
   drops to 0 and the vram_bo refcount is reduced.
4. The vram_bo refcount is now off by 1 because the vram page is still
   in use.

Afterwards, this causes a use-after-free bug and a page refcount warning.

zone_device_page_init() should not be used in page migration. Change it
to get_page() to fix the race.

Add a WARN_ONCE to report this issue early, because the refcount bug is
hard to investigate.

Signed-off-by: Philip Yang <[email protected]>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index d10c6673f4de..15ab2db4af1d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -217,7 +217,8 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
 	page = pfn_to_page(pfn);
 	svm_range_bo_ref(prange->svm_bo);
 	page->zone_device_data = prange->svm_bo;
-	zone_device_page_init(page);
+	get_page(page);
+	lock_page(page);
 }
 
 static void
@@ -552,6 +553,17 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
 	if (mpages) {
 		prange->actual_loc = best_loc;
 		prange->vram_pages += mpages;
+		/*
+		 * To guarantee we hold correct page refcount for all prange vram
+		 * pages and svm_bo refcount.
+		 * After prange migrated to VRAM, each vram page refcount hold
+		 * one svm_bo refcount, and vram node hold one refcount.
+		 * After page migrated to system memory, vram page refcount
+		 * reduced to 0, svm_migrate_page_free reduce svm_bo refcount.
+		 * svm_range_vram_node_free will free the svm_bo.
+		 */
+		WARN_ONCE(prange->vram_pages == kref_read(&prange->svm_bo->kref),
+			  "svm_bo refcount leaking\n");
 	} else if (!prange->actual_loc) {
 		/* if no page migrated and all pages from prange are at
 		 * sys ram drop svm_bo got from svm_range_vram_node_new
-- 
2.49.0
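
For illustration only, the difference between overwriting the page
refcount and incrementing it can be replayed in a small user-space model
of steps 1 to 3 from the commit message. This is a sketch under stated
assumptions, not driver code: struct toy_page and every toy_* function
are hypothetical stand-ins, only the page refcount is modelled (the
svm_bo refcount is left out), and the model assumes the CPU fault
handler holds the only page reference when the GPU fault arrives.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; only the reference count is modelled. */
struct toy_page {
	int refcount;
};

/* Models get_page(): add one reference on top of whatever is held. */
static void toy_get_page(struct toy_page *page)
{
	page->refcount++;
}

/* Models set_page_count(page, 1): overwrite the count, ignoring holders. */
static void toy_set_page_count_one(struct toy_page *page)
{
	page->refcount = 1;
}

/* Models put_page(): drop one reference; true when the count hits 0. */
static bool toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	return --page->refcount == 0;
}

/*
 * Replays steps 1-3 on a single page. Returns true if the CPU handler's
 * final put drops the count to 0 while the GPU side still uses the page.
 */
static bool toy_replay_race(bool overwrite_count)
{
	struct toy_page page = { .refcount = 0 };

	/* Step 1: the CPU fault handler holds a reference on the page. */
	toy_get_page(&page);

	/* Step 2: the GPU fault migrates to the same page. */
	if (overwrite_count)
		toy_set_page_count_one(&page);	/* behaviour before the patch */
	else
		toy_get_page(&page);		/* behaviour after the patch */

	/* Step 3: the CPU fault handler drops its reference. */
	return toy_put_page(&page);
}
```

In this model toy_replay_race(true) reports that the page count reaches
0 while the page is still in use, and toy_replay_race(false) leaves one
reference in place for the GPU side.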
