[PATCH AUTOSEL 6.19-6.18] drm/amdgpu: return when ras table checksum is error

Sasha Levin Mon, 23 Feb 2026 04:38:02 -0800

From: Gangliang Xie <[email protected]>

[ Upstream commit 044f8d3b1fac6ac89c560f61415000e6bdab3a03 ]


end the function flow when ras table checksum is error

Signed-off-by: Gangliang Xie <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Reviewed-by: Kent Russell <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---

LLM Generated explanations, may be completely bogus:

## Analysis

### What the commit does

The commit fixes a missing early return in `amdgpu_ras_eeprom_check()`
when the RAS EEPROM table checksum verification fails for the
`RAS_TABLE_HDR_VAL` case (valid header).

**The bug:** When `hdr->header == RAS_TABLE_HDR_VAL` and
`__verify_ras_table_checksum()` fails, the original code logs an error
but continues execution. The function goes on to check whether bad pages
exceed 90% of the threshold and eventually returns 0 (success): the
final statement is `return res < 0 ? res : 0;`, and
`__verify_ras_table_checksum()` returns a positive value on checksum
mismatch. The caller then treats the corrupt table as valid.

**The inconsistency:** In the other branch (`RAS_TABLE_HDR_BAD`), the
same checksum failure already results in `return -EINVAL` (line 1728).
The fix makes both code paths behave consistently - returning an error
on checksum failure.
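
The return-value collapse described above can be sketched as a small
user-space model. This is an illustration only, not the driver code:
every `model_*` name and the `TABLE_*` enum are hypothetical, `-EIO` is
an assumed example of a negative I/O error, and the model assumes `res`
is not reassigned between the checksum call and the final return.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical table states used only to drive this model. */
enum model_table_state { TABLE_OK, TABLE_BAD_CHECKSUM, TABLE_IO_ERROR };

/*
 * Hypothetical stand-in for __verify_ras_table_checksum(). Per the
 * analysis: 0 = OK, positive = checksum mismatch, negative = I/O error.
 */
static int model_verify_checksum(enum model_table_state state)
{
	switch (state) {
	case TABLE_BAD_CHECKSUM:
		return 1;	/* positive: checksum mismatch */
	case TABLE_IO_ERROR:
		return -EIO;	/* negative: assumed read failure code */
	default:
		return 0;	/* checksum verified */
	}
}

/* Pre-patch shape: the error is only logged, then execution continues. */
static int model_check_before(enum model_table_state state)
{
	int res = model_verify_checksum(state);

	if (res) {
		/* dev_err() would be called here; no early return */
	}
	/* ... 90% threshold warning elided ... */
	return res < 0 ? res : 0;	/* a positive res collapses to 0 */
}

/* Post-patch shape: any nonzero checksum result ends the function. */
static int model_check_after(enum model_table_state state)
{
	int res = model_verify_checksum(state);

	if (res)
		return -EINVAL;
	/* ... 90% threshold warning elided ... */
	return res < 0 ? res : 0;
}

/* Exercises both shapes for each table state. */
void model_selfcheck(void)
{
	assert(model_check_before(TABLE_OK) == 0);
	assert(model_check_after(TABLE_OK) == 0);
	/* The bug: a mismatch is reported as success before the patch. */
	assert(model_check_before(TABLE_BAD_CHECKSUM) == 0);
	assert(model_check_after(TABLE_BAD_CHECKSUM) == -EINVAL);
	assert(model_check_before(TABLE_IO_ERROR) == -EIO);
	assert(model_check_after(TABLE_IO_ERROR) == -EINVAL);
}
```

Under these assumptions, only the checksum-mismatch case changes from
success to failure. A negative result was already propagated as an
error before the patch; afterwards it is reported as `-EINVAL` rather
than the original code.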

### Why this matters

If the RAS table has a corrupt checksum and the function returns
success:
1. The caller `amdgpu_ras_load_bad_pages()` proceeds to use potentially
   corrupt bad page data
2. Corrupt bad page tracking could lead to incorrect GPU memory
   management decisions
3. Pages that should be retired (due to hardware errors) might not be,
   or vice versa, potentially leading to GPU errors, data corruption, or
   instability

### Stable criteria assessment

- **Fixes a real bug:** Yes - using corrupt EEPROM data when checksum
  fails is a genuine bug
- **Obviously correct:** Yes - the `RAS_TABLE_HDR_BAD` path already
  returns `-EINVAL` on checksum failure; this is making the
  `RAS_TABLE_HDR_VAL` path consistent
- **Small and contained:** Yes - adds `return -EINVAL` and wraps the
  existing `if` body in braces (3 insertions, 1 deletion)
- **No new features:** Correct - purely a bug fix
- **Risk:** Very low - only affects the error path when checksum is
  already corrupt

### Verification

- Read the full `amdgpu_ras_eeprom_check()` function (lines 1670-1762)
  confirming the two parallel branches and the inconsistent handling
- Confirmed `__verify_ras_table_checksum()` returns positive for
  checksum mismatch, negative for I/O error (lines 1429-1437)
- Confirmed the return statement `return res < 0 ? res : 0` at line 1761
  means a positive `res` (checksum mismatch) would be returned as 0
  (success) - this is the core bug
- Confirmed the `RAS_TABLE_HDR_BAD` branch already returns `-EINVAL` on
  checksum failure (line 1728), confirming this is an oversight
- Confirmed the caller in `amdgpu_ras.c` uses the return value to decide
  whether to proceed with loading bad pages (`if (ret) goto out;`)
- The patch carries Reviewed-by tags from two AMD engineers (Tao Zhou
  and Kent Russell)

The fix is small, surgical, obviously correct (matching the existing
pattern in the parallel code path), and prevents using corrupt EEPROM
data. It meets all stable kernel criteria.

**YES**

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 64dd7a81bff5f..710a8fe79fccd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1701,10 +1701,12 @@ int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control *control)
                }
 
                res = __verify_ras_table_checksum(control);
-               if (res)
+               if (res) {
                        dev_err(adev->dev,
                                "RAS table incorrect checksum or error:%d\n",
                                res);
+                       return -EINVAL;
+               }
 
                /* Warn if we are at 90% of the threshold or above
                 */
-- 
2.51.0
