From: Philip Yang <[email protected]>

[ Upstream commit 5b57c3c3f22336e8fd5edb7f0fef3c7823f8eac1 ]

Only check and drain IH1 ring if CAM is not enabled.

If GPU is under reset, don't access IH to drain retry fault.

Signed-off-by: Philip Yang <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed for a thorough analysis. Let me
compile my findings.

---

## Detailed Analysis

### 1. Commit Message Analysis

**Subject:** "drm/amdkfd: Handle GPU reset and drain retry fault race"

The commit addresses a **race condition** between GPU reset and the
drain retry fault mechanism. The message describes two specific fixes:

1. "Only check and drain IH1 ring if CAM is not enabled" - a correctness
   fix for interrupt handler ring selection logic
2. "If GPU is under reset, don't access IH to drain retry fault" -
   protecting against hardware access during GPU reset

Both are bug fix descriptions, not feature additions.

### 2. Code Change Analysis

The patch makes two distinct changes in `kfd_svm.c`:

#### Change A: `svm_range_drain_retry_fault()` - GPU reset protection

**Before (current code):**

```2337:2364:drivers/gpu/drm/amd/amdkfd/kfd_svm.c
static void svm_range_drain_retry_fault(struct svm_range_list *svms)
{
        // ... iterates over GPUs ...
                amdgpu_ih_wait_on_checkpoint_process_ts(pdd->dev->adev,
                                pdd->dev->adev->irq.retry_cam_enabled ?
                                &pdd->dev->adev->irq.ih :
                                &pdd->dev->adev->irq.ih1);
                // ... no reset protection ...
}
```

**After (patched code):** Adds
`down_read_trylock(&pdd->dev->adev->reset_domain->sem)` before accessing
the IH hardware and `up_read(...)` after. If the trylock fails (GPU is
resetting), it `continue`s to the next GPU.

**Why this matters:** `svm_range_drain_retry_fault()` calls
`amdgpu_ih_wait_on_checkpoint_process_ts()` which calls
`amdgpu_ih_get_wptr()`. Looking at `vega20_ih_get_wptr()`, it does
`RREG32_NO_KIQ(ih_regs->ih_rb_wptr)` - a **direct hardware register
read**. During GPU reset (`amdgpu_device_lock_reset_domain` takes a
write lock on `reset_domain->sem`), the hardware is being torn down and
reinitialized. Accessing registers during this window can cause:
- **System hangs** (MMIO reads to offline hardware can hang the CPU bus)
- **Garbage data reads** leading to incorrect behavior
- **Kernel crashes** if the driver acts on invalid data

This is the **exact same pattern** used throughout the amdgpu driver -
there are **20+ existing call sites** that use
`down_read_trylock(&adev->reset_domain->sem)` to protect hardware
access. The fix follows an established, well-tested pattern.

#### Change B: `svm_range_unmap_from_cpu()` - CAM-aware IH1 check

**Before:**
```c
if (adev->irq.ih1.ring_size) {
```

**After:**
```c
if (!adev->irq.retry_cam_enabled && adev->irq.ih1.ring_size) {
```

The comment already says "Check and drain ih1 ring if cam not available"
but the code was missing the `!retry_cam_enabled` check. When CAM
(Content Addressable Memory) is enabled for retry filtering, retry
faults go through the primary IH ring (`ih`), not `ih1`. Checking `ih1`
when CAM is enabled is incorrect because:
- The timestamp from `ih1` would be stale/irrelevant
- It could cause the code to `continue` early (bypassing the `ih_soft`
  check below) with an incorrect `checkpoint_ts`
- This leads to retry faults being incorrectly dropped or not dropped
  when they should be

This same logic (`if (adev->irq.retry_cam_enabled) return;` before `ih1`
access) already exists in `amdgpu_gmc_filter_faults_remove()`,
confirming that the fix aligns with the intended design.

### 3. Bug Classification

- **Change A:** Race condition fix - accessing hardware registers
  without reset domain protection. This is a **system hang / crash**
  bug.
- **Change B:** Logic bug - checking the wrong interrupt handler ring
  when CAM is enabled, leading to incorrect retry fault handling
  (potential data corruption or stale page mappings).

### 4. Scope and Risk Assessment

- **Lines changed:** ~10 lines of actual logic (plus 1 include)
- **Files changed:** 1 file (`kfd_svm.c`)
- **Complexity:** Low - follows a well-established pattern (20+ examples
  in the codebase)
- **Risk of regression:** Very low
  - The `down_read_trylock` pattern is used everywhere in amdgpu and is
    proven safe
  - If trylock fails, we simply skip draining for that GPU (graceful
    degradation)
  - The CAM check aligns with existing logic in
    `amdgpu_gmc_filter_faults_remove()`

### 5. User Impact

- **Affected users:** Anyone using AMD GPUs with KFD
  (compute/ROCm/OpenCL workloads) who experiences GPU resets
- **Severity:** HIGH - GPU reset + drain retry fault = potential system
  hang when register access hangs the CPU
- **Without fix:** If a GPU reset happens concurrently with
  `svm_range_list_fini()` (process exit), the system could hang trying
  to read hardware registers from an offline GPU

### 6. Dependencies

- Requires `amdgpu_reset.h` (available since v6.1)
- Requires `reset_domain->sem` infrastructure (available since v6.1)
- Requires `retry_cam_enabled` field (available since the CAM commit in
  6.6)
- The `svm_range_drain_retry_fault()` function in its current form (with
  the `retry_cam_enabled` ternary) was introduced in 6.12 (commit
  `6ef29715ac06`)
- This patch applies cleanly only to trees with `6ef29715ac06` and
  `96316211eb5c4` (both in 6.12+)

### 7. Stability Indicators

- **Reviewed-by:** Harish Kasiviswanathan (AMD kernel engineer)
- **Author:** Philip Yang (AMD KFD maintainer, regularly contributes SVM
  retry fault fixes)
- **Signed-off-by:** Alex Deucher (AMD GPU subsystem maintainer)
- Pattern is well-established in the driver (20+ existing similar uses)

### Conclusion

This commit fixes two real bugs:

1. A **race condition** between GPU reset and IH ring access that can
   cause **system hangs** - this is the critical fix
2. A **logic error** in IH ring selection when CAM is enabled that
   causes incorrect retry fault handling

The fix is small (10 lines of logic), follows an extremely well-
established pattern in the amdgpu driver (20+ existing call sites use
the same `down_read_trylock` on `reset_domain->sem`), has been reviewed
by an AMD engineer, was authored by the KFD SVM subsystem expert, and
addresses a real crash/hang scenario. The risk of regression is minimal.

The fix applies to kernels 6.12+ where `6ef29715ac06` ("drm/amdkfd:
Change kfd/svm page fault drain handling") exists.

**YES**

 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 79ea138897fcf..a10cf8650c92b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -33,6 +33,7 @@
 #include "amdgpu_hmm.h"
 #include "amdgpu.h"
 #include "amdgpu_xgmi.h"
+#include "amdgpu_reset.h"
 #include "kfd_priv.h"
 #include "kfd_svm.h"
 #include "kfd_migrate.h"
@@ -2349,6 +2350,9 @@ static void svm_range_drain_retry_fault(struct 
svm_range_list *svms)
 
                pr_debug("drain retry fault gpu %d svms %p\n", i, svms);
 
+               if (!down_read_trylock(&pdd->dev->adev->reset_domain->sem))
+                       continue;
+
                amdgpu_ih_wait_on_checkpoint_process_ts(pdd->dev->adev,
                                pdd->dev->adev->irq.retry_cam_enabled ?
                                &pdd->dev->adev->irq.ih :
@@ -2358,6 +2362,7 @@ static void svm_range_drain_retry_fault(struct 
svm_range_list *svms)
                        amdgpu_ih_wait_on_checkpoint_process_ts(pdd->dev->adev,
                                &pdd->dev->adev->irq.ih_soft);
 
+               up_read(&pdd->dev->adev->reset_domain->sem);
 
                pr_debug("drain retry fault gpu %d svms 0x%p done\n", i, svms);
        }
@@ -2541,7 +2546,7 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct 
svm_range *prange,
                adev = pdd->dev->adev;
 
                /* Check and drain ih1 ring if cam not available */
-               if (adev->irq.ih1.ring_size) {
+               if (!adev->irq.retry_cam_enabled && adev->irq.ih1.ring_size) {
                        ih = &adev->irq.ih1;
                        checkpoint_wptr = amdgpu_ih_get_wptr(adev, ih);
                        if (ih->rptr != checkpoint_wptr) {
-- 
2.51.0

Reply via email to