On 2025-09-17 19:12, Chen, Xiaogang wrote:
On 9/16/2025 5:41 PM, Philip Yang wrote:
If the mmap write lock is taken while draining retry faults, it is never
released, because svm_range_restore_pages calls mmap_read_unlock and then
returns. This causes a deadlock and system hang later, because the mmap
read or write lock can no longer be taken.
Downgrading the mmap write lock to a read lock on the draining retry
fault path fixes this bug.
Signed-off-by: Philip Yang <philip.y...@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 6604a37b304f..fb02ff9ae62a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -3043,6 +3043,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
if (svms->checkpoint_ts[gpuidx] != 0) {
if (amdgpu_ih_ts_after_or_equal(ts,
svms->checkpoint_ts[gpuidx])) {
pr_debug("draining retry fault, drop fault 0x%llx\n",
addr);
+ if (write_locked)
+ mmap_write_downgrade(mm);
Is there an unlock order issue? The code now holds svms->lock first, then
the mmap read lock after mmap_write_downgrade. The unlock order should be
mmap_read_unlock(mm), then mutex_unlock(&svms->lock). "goto
out_unlock_svms" does it in the reverse order.
Downgrading the write lock doesn't change the lock order (it is a single
step, not two steps like up_write followed by down_read). We already
downgrade the mmap write lock in the normal path; the downgrade was just
missing in this error handling path.
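
To illustrate the point, here is a minimal userspace sketch of the lock state on the early-exit path. It is a toy model, not kernel code: struct toy_mm, the toy_* helpers and toy_drain_path are hypothetical names, and the "lock" only tracks state. It assumes, as a simplification, that a read unlock has no effect on a write hold, which stands in for the write lock never being released.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for mmap_lock: tracks state only, does no real locking. */
enum toy_state { TOY_UNLOCKED, TOY_READ, TOY_WRITE };

struct toy_mm {
	enum toy_state state;
};

static void toy_read_lock(struct toy_mm *mm)
{
	mm->state = TOY_READ;
}

static void toy_write_lock(struct toy_mm *mm)
{
	mm->state = TOY_WRITE;
}

/* Single-step write -> read transition; only valid on a write hold. */
static void toy_write_downgrade(struct toy_mm *mm)
{
	assert(mm->state == TOY_WRITE);
	mm->state = TOY_READ;
}

/* Simplification: a read unlock releases only a read hold. */
static void toy_read_unlock(struct toy_mm *mm)
{
	if (mm->state == TOY_READ)
		mm->state = TOY_UNLOCKED;
}

/*
 * Models the "draining retry fault" early exit. The shared exit path
 * always does a read unlock. Returns true if the lock ends up released.
 */
static bool toy_drain_path(bool write_locked, bool apply_fix)
{
	struct toy_mm mm = { TOY_UNLOCKED };

	if (write_locked)
		toy_write_lock(&mm);
	else
		toy_read_lock(&mm);

	if (write_locked && apply_fix)
		toy_write_downgrade(&mm);

	toy_read_unlock(&mm);
	return mm.state == TOY_UNLOCKED;
}
```

In this model, toy_drain_path(true, false) leaves the lock held, while toy_drain_path(true, true) releases it. The model never touches a second lock, which mirrors the point that the downgrade changes only the mode of a lock that is already held.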
Regards,
Philip
Regards
Xiaogang
r = -EAGAIN;
goto out_unlock_svms;
} else {