[AMD Official Use Only - General]

in_gpu_reset is set only after the reset error count and reset error status 
functions have been called, so we can't use amdgpu_in_reset() here. Please 
check the ras->in_recovery flag instead.

Regards,
Stanley
From: Zhou1, Tao <[email protected]>
Sent: Friday, October 13, 2023 5:06 PM
To: Zhang, Hawking <[email protected]>; [email protected]; 
Yang, Stanley <[email protected]>; Li, Candice <[email protected]>; Chai, 
Thomas <[email protected]>; Wang, Yang(Kevin) <[email protected]>
Subject: Re: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions


How about this condition:

if ((amdgpu_in_reset(adev) || amdgpu_ras_intr_triggered()) &&
           mca_funcs && mca_funcs->mca_set_debug_mode)

I use amdgpu_in_reset() to skip the RAS error reset during all GPU resets, not 
only those triggered by a RAS fatal error.
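
For illustration, the proposed condition can be modeled as a standalone 
predicate. This is only a sketch: struct skip_inputs and 
should_skip_ras_reset() are hypothetical names, and each field stands in for 
the driver call or pointer check named in its comment.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the inputs to the proposed condition. */
struct skip_inputs {
	bool in_reset;       /* stands in for amdgpu_in_reset(adev) */
	bool ras_intr;       /* stands in for amdgpu_ras_intr_triggered() */
	bool has_mca_funcs;  /* stands in for mca_funcs != NULL */
	bool has_set_debug;  /* stands in for mca_funcs->mca_set_debug_mode != NULL */
};

/* Returns true when the driver would skip the RAS error reset. */
bool should_skip_ras_reset(const struct skip_inputs *in)
{
	return (in->in_reset || in->ras_intr) &&
	       in->has_mca_funcs && in->has_set_debug;
}
```

With this shape, either a GPU reset in progress or a pending RAS interrupt is 
enough to skip, but only when the MCA callbacks are present.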

Regards,
Tao

________________________________
From: Zhang, Hawking <[email protected]>
Sent: Thursday, October 12, 2023 9:14 PM
To: Zhou1, Tao <[email protected]>; [email protected]; 
Yang, Stanley <[email protected]>; Li, Candice <[email protected]>; Chai, 
Thomas <[email protected]>; Wang, Yang(Kevin) <[email protected]>
Subject: RE: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

-       if (!amdgpu_ras_is_supported(adev, block))
+       /* skip ras error reset in gpu reset */
+       if (amdgpu_in_reset(adev) &&
+           mca_funcs && mca_funcs->mca_set_debug_mode)
+               return 0;

We should check the RAS in_recovery flag in this case. The reset domain is 
locked in a relatively late phase, at least *after* the error counter harvest. 
Please double-check.
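
To make the ordering concern concrete, here is a small standalone model. It is 
a sketch under stated assumptions, not driver code: struct mock_dev, 
mock_reset_error_count() and mock_recovery_sequence() are hypothetical, and 
the sequence (recovery flag set, counters harvested, reset domain locked) is 
taken from the description above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for driver state; not the real amdgpu structures. */
struct mock_dev {
	bool in_recovery;     /* models ras->in_recovery */
	bool in_gpu_reset;    /* models what amdgpu_in_reset() would report */
	bool pmfw_owns_reset; /* models mca_funcs && mca_funcs->mca_set_debug_mode */
	int driver_resets;    /* how many times the driver cleared error state */
};

enum skip_policy { SKIP_ON_IN_RESET, SKIP_ON_IN_RECOVERY };

/* Models the reset helper with a selectable skip condition. */
int mock_reset_error_count(struct mock_dev *dev, enum skip_policy policy)
{
	bool busy = (policy == SKIP_ON_IN_RESET) ? dev->in_gpu_reset
						 : dev->in_recovery;

	if (busy && dev->pmfw_owns_reset)
		return 0; /* leave the reset to PMFW */

	dev->driver_resets++;
	return 0;
}

/* Runs the assumed sequence and returns the number of driver-side resets. */
int mock_recovery_sequence(enum skip_policy policy)
{
	struct mock_dev dev = { .pmfw_owns_reset = true };

	dev.in_recovery = true;               /* recovery starts */
	mock_reset_error_count(&dev, policy); /* called during counter harvest */
	dev.in_gpu_reset = true;              /* reset domain locked afterwards */
	mock_reset_error_count(&dev, policy); /* called again during the reset */
	dev.in_gpu_reset = false;
	dev.in_recovery = false;

	return dev.driver_resets;
}
```

In this model the in-reset check lets the harvest-time call through, while the 
in_recovery check skips both calls.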

Regards,
Hawking
-----Original Message-----
From: Zhou1, Tao <[email protected]>
Sent: Thursday, October 12, 2023 17:01
To: [email protected]; Yang, Stanley <[email protected]>; 
Zhang, Hawking <[email protected]>; Li, Candice <[email protected]>; Chai, 
Thomas <[email protected]>; Wang, Yang(Kevin) <[email protected]>
Cc: Zhou1, Tao <[email protected]>
Subject: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions

PMFW is responsible for RAS error reset in some conditions, so the driver can 
skip the operation.

Signed-off-by: Tao Zhou <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 91ed4fd96ee1..6dddb0423411 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1105,11 +1105,18 @@ int amdgpu_ras_reset_error_count(struct amdgpu_device *adev,
                enum amdgpu_ras_block block)
 {
        struct amdgpu_ras_block_object *block_obj = amdgpu_ras_get_ras_block(adev, block, 0);
+       const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;

        if (!block_obj || !block_obj->hw_ops)
                return 0;

-       if (!amdgpu_ras_is_supported(adev, block))
+       /* skip ras error reset in gpu reset */
+       if (amdgpu_in_reset(adev) &&
+           mca_funcs && mca_funcs->mca_set_debug_mode)
+               return 0;
+
+       if (!amdgpu_ras_is_supported(adev, block) ||
+           !amdgpu_ras_get_mca_debug_mode(adev))
                return 0;

        if (block_obj->hw_ops->reset_ras_error_count)
@@ -1122,6 +1129,7 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device *adev,
                enum amdgpu_ras_block block)
 {
        struct amdgpu_ras_block_object *block_obj = amdgpu_ras_get_ras_block(adev, block, 0);
+       const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;

        if (!block_obj || !block_obj->hw_ops) {
                dev_dbg_once(adev->dev, "%s doesn't config RAS function\n",
@@ -1129,7 +1137,13 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device *adev,
                return 0;
        }

-       if (!amdgpu_ras_is_supported(adev, block))
+       /* skip ras error reset in gpu reset */
+       if (amdgpu_in_reset(adev) &&
+           mca_funcs && mca_funcs->mca_set_debug_mode)
+               return 0;
+
+       if (!amdgpu_ras_is_supported(adev, block) ||
+           !amdgpu_ras_get_mca_debug_mode(adev))
                return 0;

        if (block_obj->hw_ops->reset_ras_error_count)
--
2.35.1
