[AMD Official Use Only - General]

Additional:
This idx/count can help us to distinguish whether those logs are from the same 
query results.

If this information is removed, we cannot confirm from the log whether any log 
is missing.
e.g: [4/8], then we can know that there are 4 more that are not printed due to 
some reason.

Best Regards,
Kevin

_____________________________________________
From: Wang, Yang(Kevin)
Sent: Wednesday, January 3, 2024 5:51 PM
To: Zhang, Hawking <[email protected]>; [email protected]
Cc: Zhou1, Tao <[email protected]>; Chai, Thomas <[email protected]>
Subject: RE: [PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support

_____________________________________________
From: Zhang, Hawking <[email protected]<mailto:[email protected]>>
Sent: Wednesday, January 3, 2024 5:36 PM
To: Wang, Yang(Kevin) <[email protected]<mailto:[email protected]>>; 
[email protected]<mailto:[email protected]>
Cc: Zhou1, Tao <[email protected]<mailto:[email protected]>>; Chai, Thomas 
<[email protected]<mailto:[email protected]>>
Subject: RE: [PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support


[AMD Official Use Only - General]



+       dev_info(adev->dev, "[Hardware error] Accelerator Check Architecture 
events logged\n");
+       /* plus 1 for output format, e.g: ACA[08/08]: xxxx */
+       for (i = 0; i < ARRAY_SIZE(aca_regs); i++)
+               dev_info(adev->dev, "[Hardware error] 
ACA[%02d/%02d].%s=0x%016llx\n",
+                        idx + 1, total, aca_regs[i].name, 
bank->regs[aca_regs[i].reg_idx]);

We should keep the ACA log format simple since there are tools like crash 
dumper that grab these logs.

How about formatting log as below
dev_info(adev->dev, "[Hardware error] Accelerator Check Architecture (ACA) 
events logged\n");
dev_info(adev→dev, "[Hardware error] ACA.%s=0x%016llx\n");

In general, if the idx doesn't convey useful information, then just replace it 
with ACA.Reg.

[Kevin]:

Agree, will update it in next version.

Best Regards,
Kevin

Thoughts?

Regards,
Hawking

-----Original Message-----
From: Wang, Yang(Kevin) <[email protected]<mailto:[email protected]>>
Sent: Wednesday, January 3, 2024 16:02
To: [email protected]<mailto:[email protected]>
Cc: Zhang, Hawking <[email protected]<mailto:[email protected]>>; 
Zhou1, Tao <[email protected]<mailto:[email protected]>>; Chai, Thomas 
<[email protected]<mailto:[email protected]>>; Wang, Yang(Kevin) 
<[email protected]<mailto:[email protected]>>
Subject: [PATCH 02/14] drm/amdgpu: add ACA kernel hardware error log support

add ACA kernel hardware error log support.

Signed-off-by: Yang Wang <[email protected]<mailto:[email protected]>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 29 +++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
index 6a6f167b5380..cadeda64eded 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c
@@ -100,6 +100,33 @@ static int aca_smu_get_valid_aca_count(struct 
amdgpu_device *adev, enum aca_erro
        return smu_funcs->get_valid_aca_count(adev, type, count);  }

+static struct aca_regs_dump {
+       const char *name;
+       int reg_idx;
+} aca_regs[] = {
+       {"CONTROL",             ACA_REG_IDX_CTL},
+       {"STATUS",              ACA_REG_IDX_STATUS},
+       {"ADDR",                ACA_REG_IDX_ADDR},
+       {"MISC",                ACA_REG_IDX_MISC0},
+       {"CONFIG",              ACA_REG_IDX_CONFG},
+       {"IPID",                ACA_REG_IDX_IPID},
+       {"SYND",                ACA_REG_IDX_SYND},
+       {"DESTAT",              ACA_REG_IDX_DESTAT},
+       {"DEADDR",              ACA_REG_IDX_DEADDR},
+       {"CONTROL_MASK",        ACA_REG_IDX_CTL_MASK},
+};
+
+static void aca_smu_bank_dump(struct amdgpu_device *adev, int idx, int
+total, struct aca_bank *bank) {
+       int i;
+
+       dev_info(adev->dev, "[Hardware error] Accelerator Check Architecture 
events logged\n");
+       /* plus 1 for output format, e.g: ACA[08/08]: xxxx */
+       for (i = 0; i < ARRAY_SIZE(aca_regs); i++)
+               dev_info(adev->dev, "[Hardware error] 
ACA[%02d/%02d].%s=0x%016llx\n",
+                        idx + 1, total, aca_regs[i].name, 
bank->regs[aca_regs[i].reg_idx]);
+}
+
 static int aca_smu_get_valid_aca_banks(struct amdgpu_device *adev, enum 
aca_error_type type,
                                       int start, int count,
                                       struct aca_banks *banks)
@@ -137,6 +164,8 @@ static int aca_smu_get_valid_aca_banks(struct amdgpu_device 
*adev, enum aca_erro
                if (ret)
                        return ret;

+               aca_smu_bank_dump(adev, i, count, &bank);
+
                ret = aca_banks_add_bank(banks, &bank);
                if (ret)
                        return ret;
--
2.34.1

Reply via email to