Re: [PATCH v2] drm/amdgpu: Check fence emitted count to identify bad jobs

Christian König Tue, 19 Nov 2024 01:47:58 -0800

Hi Shikang,

please completely drop the AMDGPU_PENDING_JOB_TIMEOUT workaround.

This is unnecessary when you use amdgpu_fence_count_emitted() instead oflooking at the jobs.

That's one of the reasons why looking at the jobs is such a reallyreally bad idea in the first place.


Regards,
Christian.

Am 19.11.24 um 09:47 schrieb Fan, Shikang:


[AMD Official Use Only - AMD Internal Distribution Only]


+@Koenig, Christian <mailto:[email protected]>

Hi Christian,

Could you please help take a look at this patch? Compared to theprevious patch, we now use amdgpu_fence_emitted_count to checkunfinished jobs. And this function is currently only used formailbox_flr_work In SRIOV case, soI believe the modification on thisfunction will not have any impact on the rest part of the driver.Thanks for your advice on v1 patch.


Regards,
Shikang

------------------------------------------------------------------------
*From:* Shikang Fan <[email protected]>
*Sent:* Monday, November 18, 2024 6:10 PM
*To:* [email protected] <[email protected]>
*Cc:* Fan, Shikang <[email protected]>; Deng, Emily <[email protected]>

*Subject:* [PATCH v2] drm/amdgpu: Check fence emitted count toidentify bad jobs

In SRIOV, when host driver performs MODE 1 reset and notifies FLR to
guest driver, there is a small chance that there is no job running on hw
but the driver has not updated the pending list yet, causing the driver
not respond the FLR request. Modify the has_job_running function to
make sure if there is still running job.

v2: Use amdgpu_fence_count_emitted to determine job running status.

Signed-off-by: Emily Deng <[email protected]>
Signed-off-by: Shikang Fan <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.cb/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index b3ca911e55d6..ea756eacebdc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -100,6 +100,7 @@ MODULE_FIRMWARE("amdgpu/navi12_gpu_info.bin");
 #define AMDGPU_PCIE_INDEX_FALLBACK (0x38 >> 2)
 #define AMDGPU_PCIE_INDEX_HI_FALLBACK (0x44 >> 2)
 #define AMDGPU_PCIE_DATA_FALLBACK (0x3C >> 2)
+#define AMDGPU_PENDING_JOB_TIMEOUT     (1000000)

 static const struct drm_driver amdgpu_kms_driver;

@@ -5222,15 +5223,19 @@ static int amdgpu_device_reset_sriov(structamdgpu_device *adev,

 }

 /**

- * amdgpu_device_has_job_running - check if there is any job inmirror list

+ * amdgpu_device_has_job_running - check if there is any unfinished job
  *
  * @adev: amdgpu_device pointer
  *
- * check if there is any job in mirror list

+ * check if there is any job running on the device when guest driverreceives+ * FLR notification from host driver. If there are still jobs runningand not+ * signaled after 1s, the hardware is most likely hung already, thenthe guest+ * driver will not respond the FLR reset. Instead, let the job hitthe timeout

+ * and guest driver then issue the reset request.
  */
 bool amdgpu_device_has_job_running(struct amdgpu_device *adev)
 {
-       int i;
+       int i, j;
         struct drm_sched_job *job;

         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {

@@ -5239,11 +5244,12 @@ bool amdgpu_device_has_job_running(structamdgpu_device *adev)

                 if (!amdgpu_ring_sched_ready(ring))
                         continue;

-               spin_lock(&ring->sched.job_list_lock);
-               job = list_first_entry_or_null(&ring->sched.pending_list,

- struct drm_sched_job,list);

- spin_unlock(&ring->sched.job_list_lock);
-               if (job)
+               for (j = 0; j < AMDGPU_PENDING_JOB_TIMEOUT; j++) {
+                       if (!amdgpu_fence_count_emitted(ring))
+                               break;
+                       udelay(1);
+               }
+               if (j == AMDGPU_PENDING_JOB_TIMEOUT)
                         return true;
         }
         return false;
--
2.34.1

Re: [PATCH v2] drm/amdgpu: Check fence emitted count to identify bad jobs

Reply via email to