When an eGPU is unplugged, the KFD topology for that GPU should be
destroyed as well. This never happens because the fini_sw callbacks
never get to run. Run amdgpu_amdkfd_device_fini_sw() manually before
calling amdgpu_device_ip_fini_early() when the device has already been
disconnected.

This location is intentionally chosen to make sure that the kfd locking
refcount doesn't get incremented unintentionally.

Cc: [email protected]
Closes: https://community.frame.work/t/amd-egpu-on-linux/8691/33
Signed-off-by: Mario Limonciello (AMD) <[email protected]>
---
v2:
 * Move the call earlier in amdgpu_device_fini_hw() to fix locking
   refcount issues
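
For reviewers, the ordering argument can be sketched with a small
userspace model. This is not driver code: the toy_* names and the
struct are hypothetical, and the model assumes that the suspend done
from ip_fini_early only touches the locking refcount while the KFD
device is still present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the driver state; every name here is hypothetical. */
struct toy_dev {
	bool unplugged;      /* models drm_dev_is_unplugged() */
	bool kfd_present;    /* models a still-registered KFD device/topology */
	int  kfd_lock_count; /* models the kfd locking refcount */
};

/* Models amdgpu_amdkfd_device_fini_sw(): drop the KFD device and topology. */
static void toy_kfd_fini_sw(struct toy_dev *d)
{
	assert(d != NULL);
	d->kfd_present = false;
}

/*
 * Models the suspend done from ip_fini_early. Assumption of this sketch:
 * the refcount is only incremented while the KFD device is present.
 */
static void toy_ip_fini_early(struct toy_dev *d)
{
	assert(d != NULL);
	if (d->kfd_present)
		d->kfd_lock_count++;
}

/* Models the ordering in amdgpu_device_fini_hw(); fix_applied toggles the new check. */
static struct toy_dev toy_fini_hw(bool unplugged, bool fix_applied)
{
	struct toy_dev d = {
		.unplugged = unplugged,
		.kfd_present = true,
		.kfd_lock_count = 0,
	};

	if (fix_applied && d.unplugged)
		toy_kfd_fini_sw(&d);
	toy_ip_fini_early(&d);
	return d;
}
```

In this model, an unplugged device with the check in place ends with
kfd_present == false and kfd_lock_count == 0, while the same device
without the check keeps its topology and ends with kfd_lock_count == 1.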
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 021ecc988ff79..f167ba1b6ffcb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5251,6 +5251,14 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
 
        amdgpu_ttm_set_buffer_funcs_status(adev, false);
 
+       /*
+        * device went through surprise hotplug; we need to destroy topology
+        * before ip_fini_early to prevent kfd locking refcount issues by calling
+        * amdgpu_amdkfd_suspend()
+        */
+       if (drm_dev_is_unplugged(adev_to_drm(adev)))
+               amdgpu_amdkfd_device_fini_sw(adev);
+
        amdgpu_device_ip_fini_early(adev);
 
        amdgpu_irq_fini_hw(adev);
-- 
2.43.0
