Am 10.12.21 um 10:35 schrieb Lang Yu:
It is useful to maintain error context when debugging
SW/FW issues. Introduce amdgpu_device_halt() for this
purpose. It will bring hardware to a kind of halt state,
so that no one can touch it any more.

Compare to a simple hang, the system will keep stable
at least for SSH access. Then it should be trivial to
inspect the hardware state and see what's going on.

Suggested-by: Christian Koenig <[email protected]>
Suggested-by: Andrey Grodzovsky <[email protected]>
Signed-off-by: Lang Yu <[email protected]>

I'm not 100% sure if the amdgpu_device_unmap_mmio() in necessary here, but for now that patch is Reviewed-by: Christian König <[email protected]>

Thanks,
Christian.


v2:
  - Set adev->no_hw_access earlier to avoid crashes.(Christian)
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++
  2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c5cfe2926ca1..3f5f8f62aa5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,
  void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
                struct amdgpu_ring *ring);
+void amdgpu_device_halt(struct amdgpu_device *adev);
+
  /* atpx handler */
  #if defined(CONFIG_VGA_SWITCHEROO)
  void amdgpu_register_atpx_handler(void);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a1c14466f23d..8fe7ea6cee18 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device 
*adev,
amdgpu_asic_invalidate_hdp(adev, ring);
  }
+
+/**
+ * amdgpu_device_halt() - bring hardware to some kind of halt state
+ *
+ * @adev: amdgpu_device pointer
+ *
+ * Bring hardware to some kind of halt state so that no one can touch it
+ * any more. It will help to maintain error context when error occurred.
+ * Compare to a simple hang, the system will keep stable at least for SSH
+ * access. Then it should be trivial to inspect the hardware state and
+ * see what's going on. Implemented as following:
+ *
+ * 1. drm_dev_unplug() makes device inaccessible to user space(IOCTLs, etc),
+ *    clears all CPU mappings to device, disallows remappings through page 
faults
+ * 2. amdgpu_irq_disable_all() disables all interrupts
+ * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
+ * 4. set adev->no_hw_access to avoid potential crashes after setp 5
+ * 5. amdgpu_device_unmap_mmio() clears all MMIO mappings
+ * 6. pci_disable_device() and pci_wait_for_pending_transaction()
+ *    flush any in flight DMA operations
+ */
+void amdgpu_device_halt(struct amdgpu_device *adev)
+{
+       struct pci_dev *pdev = adev->pdev;
+       struct drm_device *ddev = &adev->ddev;
+
+       drm_dev_unplug(ddev);
+
+       amdgpu_irq_disable_all(adev);
+
+       amdgpu_fence_driver_hw_fini(adev);
+
+       adev->no_hw_access = true;
+
+       amdgpu_device_unmap_mmio(adev);
+
+       pci_disable_device(pdev);
+       pci_wait_for_pending_transaction(pdev);
+}

Reply via email to