On 12/11/19 11:05 PM, Ma, Le wrote:
[AMD Official Use Only - Internal Distribution Only]
-----Original Message-----
From: Andrey Grodzovsky <andrey.grodzov...@amd.com>
Sent: Thursday, December 12, 2019 4:39 AM
To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Ma, Le
<le...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>; Quan, Evan
<evan.q...@amd.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>
Subject: [RESEND PATCH 4/5] Subject: drm/amdgpu: Redo XGMI reset
synchronization.
Use task barrier in XGMI hive to synchronize ASIC resets across
devices in XGMI hive.
Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com
<mailto:andrey.grodzov...@amd.com>>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42
+++++++++++++++++++++++++-----
1 file changed, 36 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1d19edfa..e4089a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -67,6 +67,7 @@
#include "amdgpu_tmz.h"
#include <linux/suspend.h>
+#include <drm/task_barrier.h>
MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
@@ -2663,14 +2664,43 @@ static void
amdgpu_device_xgmi_reset_func(struct work_struct *__work) {
struct amdgpu_device *adev =
container_of(__work, struct amdgpu_device, xgmi_reset_work);
+ struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, 0);
- if (amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO)
- adev->asic_reset_res = (adev->in_baco == false) ?
- amdgpu_device_baco_enter(adev->ddev) :
- qamdgpu_device_baco_exit(adev->ddev);
- else
- adev->asic_reset_res = amdgpu_asic_reset(adev);
+ /*
+ * Use task barrier to synchronize all xgmi reset works
across the
+ * hive.
+ * task_barrier_enter and task_barrier_exit will block
untill all the
+ * threads running the xgmi reset works reach those points.
I assume
+ * guarantee of progress here for all the threads as the
workqueue code
+ * creates new worker threads as needed by amount of work
items in queue
+ * (see worker_thread) and also each thread sleeps in the
barrir and by
+ * this yielding the CPU for other work threads to make
progress.
+ */
[Le]: This comments can be adjusted since we switch to
system_unbound_wq in patch #5.
+ if (amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) {
+
+ if (hive)
+ task_barrier_enter(&hive->tb);
[Le]: The multiple hive condition can be checked only once and moved
to the location right after the assignment.
Not sure what you meant here but in fact let's note that while in
amdgpu_device_xgmi_reset_func it's a bug for amdgpu_get_xgmi_hive to
return NULL so I think better instead to add WARN_ON(!hive,"...") and
return right at the beginning of the function if indeed hive == NULL
Andrey
+
+ adev->asic_reset_res = amdgpu_device_baco_enter(adev->ddev);
+
+ if (adev->asic_reset_res)
+ goto fail;
+
+ if (hive)
+ task_barrier_exit(&hive->tb);
[Le]: Same as above.
+
+ adev->asic_reset_res = amdgpu_device_baco_exit(adev->ddev);
+
+ if (adev->asic_reset_res)
+ goto fail;
+ } else {
+ if (hive)
+ task_barrier_full(&hive->tb);
[Le]: Same as above.
With above addressed, Reviewed-by: Le Ma <le...@amd.com
<mailto:le...@amd.com>>
Regards,
Ma Le
+
+ adev->asic_reset_res = amdgpu_asic_reset(adev);
+ }
+fail:
if (adev->asic_reset_res)
DRM_WARN("ASIC reset failed with error, %d for
drm dev, %s",
adev->asic_reset_res, adev->ddev->unique);
--
2.7.4
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx