AMD General

Reviewed-by: Hawking Zhang <[email protected]>

Regards,
Hawking
-----Original Message-----
From: amd-gfx <[email protected]> On Behalf Of Yunxiang Li
Sent: Tuesday, June 2, 2026 5:03 AM
To: Deucher, Alexander <[email protected]>; Koenig, Christian <[email protected]>
Cc: [email protected]; Li, Yunxiang (Teddy) <[email protected]>
Subject: [PATCH] drm/amdgpu: set sub_block_index for mca ras sub-blocks

The mca ras sub-blocks (mp0, mp1, mpio) all share the
AMDGPU_RAS_BLOCK__MCA block id and are distinguished only by
sub_block_index. The ras manager object for an mca block is selected
with:

        con->objs[AMDGPU_RAS_BLOCK__LAST + head->sub_block_index]

Since the rework in commit 7f544c5488cf ("drm/amdgpu: Rework mca ras
sw_init") moved the ras_comm setup into amdgpu_mca_mp*_ras_sw_init() but
left sub_block_index unset, mp0/mp1/mpio all default to index 0 and
collide on the same object slot. mp0 grabs the slot and creates its sysfs
node first; mp1 (and mpio) then find the slot already in use, so
amdgpu_ras_block_late_init() -> amdgpu_ras_sysfs_create() returns -EINVAL:

  amdgpu: mca.mp1 failed to execute ras_block_late_init_default! ret:-22
  amdgpu: amdgpu_ras_late_init failed -22
  amdgpu: amdgpu_device_ip_late_init failed
  amdgpu: Fatal error during GPU init

The error is currently masked because amdgpu_ras_late_init() does not
check the return value of amdgpu_ras_block_late_init_default(). Even so,
the collision already leaves mp1/mpio without their sysfs nodes, and it
becomes a fatal init failure as soon as that return value is honored.

Restore the per-sub-block sub_block_index assignment so each mca
sub-block maps to its own object slot.
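
The collision described above can be reproduced in a small standalone
model. This is only a sketch, not driver code: every MODEL_* and model_*
name is hypothetical, the enum values are illustrative, and the in_use
flag merely stands in for whatever state makes the real sysfs creation
path return -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-ins for the driver's enums; values are illustrative. */
enum { MODEL_RAS_BLOCK_LAST = 4 };
enum model_mca_sub_block {
	MODEL_MCA_MP0 = 0,
	MODEL_MCA_MP1,
	MODEL_MCA_MPIO,
	MODEL_MCA_COUNT
};

/* Simplified manager object: only tracks whether the slot is taken. */
struct model_obj {
	int in_use;
	const char *name;
};

/* Regular block slots first, then one slot per mca sub-block. */
static struct model_obj model_objs[MODEL_RAS_BLOCK_LAST + MODEL_MCA_COUNT];

/* Clear all slots so the two scenarios can be run back to back. */
static void model_reset(void)
{
	memset(model_objs, 0, sizeof(model_objs));
}

/*
 * Claim the slot selected by sub_block_index, using the same indexing
 * scheme as objs[AMDGPU_RAS_BLOCK__LAST + head->sub_block_index].
 * Returns 0 on success, -EINVAL if the slot is already taken.
 */
static int model_claim_slot(const char *name, int sub_block_index)
{
	struct model_obj *obj;

	assert(sub_block_index >= 0 && sub_block_index < MODEL_MCA_COUNT);

	obj = &model_objs[MODEL_RAS_BLOCK_LAST + sub_block_index];
	if (obj->in_use)
		return -EINVAL;

	obj->in_use = 1;
	obj->name = name;
	return 0;
}
```

With every sub-block left at index 0, the first model_claim_slot() call
succeeds and the later ones return -EINVAL. With distinct indices, all
three calls succeed.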

Fixes: 7f544c5488cf ("drm/amdgpu: Rework mca ras sw_init")
Signed-off-by: Yunxiang Li <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
index 3ca03b5e0f913..e1e4a61b1301c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
@@ -99,6 +99,7 @@ int amdgpu_mca_mp0_ras_sw_init(struct amdgpu_device *adev)

        strcpy(ras->ras_block.ras_comm.name, "mca.mp0");
        ras->ras_block.ras_comm.block = AMDGPU_RAS_BLOCK__MCA;
+       ras->ras_block.ras_comm.sub_block_index = AMDGPU_RAS_MCA_BLOCK__MP0;
        ras->ras_block.ras_comm.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
        adev->mca.mp0.ras_if = &ras->ras_block.ras_comm;

@@ -123,6 +124,7 @@ int amdgpu_mca_mp1_ras_sw_init(struct amdgpu_device *adev)

        strcpy(ras->ras_block.ras_comm.name, "mca.mp1");
        ras->ras_block.ras_comm.block = AMDGPU_RAS_BLOCK__MCA;
+       ras->ras_block.ras_comm.sub_block_index = AMDGPU_RAS_MCA_BLOCK__MP1;
        ras->ras_block.ras_comm.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
        adev->mca.mp1.ras_if = &ras->ras_block.ras_comm;

@@ -147,6 +149,7 @@ int amdgpu_mca_mpio_ras_sw_init(struct amdgpu_device *adev)

        strcpy(ras->ras_block.ras_comm.name, "mca.mpio");
        ras->ras_block.ras_comm.block = AMDGPU_RAS_BLOCK__MCA;
+       ras->ras_block.ras_comm.sub_block_index = AMDGPU_RAS_MCA_BLOCK__MPIO;
        ras->ras_block.ras_comm.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
        adev->mca.mpio.ras_if = &ras->ras_block.ras_comm;

--
2.51.2
