amdgpu: add graceful VM fault handling v2

Kuehling, Felix Wed, 04 Sep 2019 15:48:26 -0700

On 2019-09-04 11:02 a.m., Christian König wrote:
> Next step towards HMM support. For now just silence the retry fault and
> optionally redirect the request to the dummy page.
>
> v2: make sure the VM is not destroyed while we handle the fault.
>
> Signed-off-by: Christian König <[email protected]>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 74 ++++++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  2 +
>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  4 ++
>   3 files changed, 80 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 951608fc1925..410d89966a66 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -3142,3 +3142,77 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm *vm)
>               }
>       }
>   }
> +
> +/**
> + * amdgpu_vm_handle_fault - graceful handling of VM faults.
> + * @adev: amdgpu device pointer
> + * @pasid: PASID of the VM
> + * @addr: Address of the fault
> + *
> + * Try to gracefully handle a VM fault. Return true if the fault was handled 
> and
> + * shouldn't be reported any more.
> + */
> +bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
> +                         uint64_t addr)
> +{
> +     struct amdgpu_ring *ring = &adev->sdma.instance[0].page;
> +     struct amdgpu_bo *root;
> +     uint64_t value, flags;
> +     struct amdgpu_vm *vm;
> +     long r;
> +
> +     if (!ring->sched.ready)
> +             return false;
> +
> +     spin_lock(&adev->vm_manager.pasid_lock);
> +     vm = idr_find(&adev->vm_manager.pasid_idr, pasid);
> +     if (vm)
> +             root = amdgpu_bo_ref(vm->root.base.bo);
> +     else
> +             root = NULL;
> +     spin_unlock(&adev->vm_manager.pasid_lock);
> +
> +     if (!root)
> +             return false;
> +
> +     r = amdgpu_bo_reserve(root, true);
> +     if (r)
> +             goto error_unref;
> +
> +     spin_lock(&adev->vm_manager.pasid_lock);
> +     vm = idr_find(&adev->vm_manager.pasid_idr, pasid);
> +     spin_unlock(&adev->vm_manager.pasid_lock);


I think this deserves a comment. If I understand it correctly, you're 
looking up the vm twice so that you have the VM root reservation to 
protect against user-after-free. Otherwise the vm pointer is only valid 
as long as you're holding the spin-lock.


> +
> +     if (!vm || vm->root.base.bo != root)

The check of vm->root.base.bo should probably still be under the 
spin_lock. Because you're not sure yet it's the right VM, you can't rely 
on the reservation here to prevent use-after-free.


> +             goto error_unlock;
> +
> +     addr /= AMDGPU_GPU_PAGE_SIZE;
> +     flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED |
> +             AMDGPU_PTE_SYSTEM;
> +
> +     if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_NEVER) {
> +             /* Redirect the access to the dummy page */
> +             value = adev->dummy_page_addr;
> +             flags |= AMDGPU_PTE_EXECUTABLE | AMDGPU_PTE_READABLE |
> +                     AMDGPU_PTE_WRITEABLE;
> +     } else {
> +             value = 0;
> +     }
> +
> +     r = amdgpu_vm_bo_update_mapping(adev, vm, true, NULL, addr, addr + 1,
> +                                     flags, value, NULL, NULL);
> +     if (r)
> +             goto error_unlock;
> +
> +     r = amdgpu_vm_update_pdes(adev, vm, true);
> +
> +error_unlock:
> +     amdgpu_bo_unreserve(root);
> +     if (r < 0)
> +             DRM_ERROR("Can't handle page fault (%ld)\n", r);
> +
> +error_unref:
> +     amdgpu_bo_unref(&root);
> +
> +     return false;
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index 0a97dc839f3b..4dbbe1b6b413 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -413,6 +413,8 @@ void amdgpu_vm_check_compute_bug(struct amdgpu_device 
> *adev);
>   
>   void amdgpu_vm_get_task_info(struct amdgpu_device *adev, unsigned int pasid,
>                            struct amdgpu_task_info *task_info);
> +bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid,
> +                         uint64_t addr);
>   
>   void amdgpu_vm_set_task_info(struct amdgpu_vm *vm);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 9d15679df6e0..15a1ce51befa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -353,6 +353,10 @@ static int gmc_v9_0_process_interrupt(struct 
> amdgpu_device *adev,
>       }
>   
>       /* If it's the first fault for this address, process it normally */
> +     if (retry_fault && !in_interrupt() &&
> +         amdgpu_vm_handle_fault(adev, entry->pasid, addr))
> +             return 1; /* This also prevents sending it to KFD */

The !in_interrupt() is meant to only do this on the rerouted interrupt 
ring that's handled by a worker function?

Looks like amdgpu_vm_handle_fault never returns true for now. So we'll 
never get to the "return 1" here.

Regards,
   Felix


> +
>       if (!amdgpu_sriov_vf(adev)) {
>               /*
>                * Issue a dummy read to wait for the status register to
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 9/9] drm/amdgpu: add graceful VM fault handling v2

Reply via email to