Am 01.11.2017 um 04:29 schrieb Liu, Monk:
The thing is triggering gpu_recover() in irq routine give you NULL for the 
@bad/job parameter, so gpu_recover() actually did nothing meaningful, it just 
repeat scheduling un-signaled jobs again and again, and finally your GPU is 
stuck with infinite recovering ....

Yeah, completely agree that this isn't a good implementation. But we need to somehow do better than that.

here is my initial thought:
In " gfx_v9_0_priv_reg_irq(struct amdgpu_device *adev, struct amdgpu_iv_entry 
*entry)" routine:
For lockup_timeout==0 case: we put a new parameter "@sched" to 
amd_gpu_recover(), that way although we don't have @bad/job, but with @sched we can still 
find out the pending job on his scheduler,
But that make code ugly, cuz we need to change amdgpu_gpu_recover(adev, job) to 
amdgpu_gpu_recover(adev, job, sched). Anyway need to find a way to tell 
gpu_recover() at least which ring has problem.

For lockup_timeout!=0 case: did nothing and just return since TDR will correct 
this error and do recover() gracefully.

I think waiting for the TDR will take to long. How about this instead:

1. In gfx_v9_0_priv_reg_irq() (and all the other IRQ handlers as well) we add functionality to figure out the ring/scheduler which caused the illegal operation.

2. Then we add a new function amd_sched_job_fault() which gets the scheduler which had a fault as parameter.

3. amd_sched_job_fault() then grabs job_list_lock and takes the first job from the ring_mirror_list.

4. We call cancel_delayed_work() to cancel the timeout of the job.

5. If cancel_delayed_work() returns true (which means the TDR isn't already running anyway) we call schedule_delayed_work() with a zero timeout.

This way the actual timeout value is truncated to zero and we run the recovery immediately just in the same way as it would when we have waited.

Should be trivial to implement actually. What do you think?

Regards,
Christian.




-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
Sent: 2017年10月31日 23:01
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/2] drm/amdgpu:cleanup deprecated gpu reset logic

Am 31.10.2017 um 12:55 schrieb Monk Liu:
trigger gpu reset/recovery from illegle instruction IRQ is deprecated
long time ago, we already switch to recover gpu by TDR triggering.

now please set lockup_timeout to non-zero value in driver loading to
enable TDR.
The patch is ok, but NAK to the general approach. We should have illegal 
instruction/access alongside the timeout.

But instead of trying to trigger the reset directly inform the scheduler that 
the we detected a problem on the engine. The scheduler can then cancel the 
timeout and do the appropriate things.

This patch would be ok if you install this new functionality directly after 
removing the old (and broken) one.

Regards,
Christian.

Change-Id: I45a576a97fd9859e1098e785ce857c2cf5adfba5
Signed-off-by: Monk Liu <monk....@amd.com>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 -
   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 21 ---------------------
   drivers/gpu/drm/amd/amdgpu/cik_sdma.c   |  1 -
   drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c   |  2 --
   drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c   |  3 ---
   drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c   |  2 --
   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   |  2 --
   drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c  |  1 -
   drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c  |  1 -
   drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c  |  1 -
   drivers/gpu/drm/amd/amdgpu/si_dma.c     |  1 -
   11 files changed, 36 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 6e89be5..b3453e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1459,7 +1459,6 @@ struct amdgpu_device {
        bool                            shutdown;
        bool                            need_dma32;
        bool                            accel_working;
-       struct work_struct              reset_work;
        struct notifier_block           acpi_nb;
        struct amdgpu_i2c_chan          *i2c_bus[AMDGPU_MAX_I2C_BUS];
        struct amdgpu_debugfs           debugfs[AMDGPU_DEBUGFS_MAX_COMPONENTS];
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index c340774..989b530 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -73,23 +73,6 @@ static void amdgpu_hotplug_work_func(struct work_struct 
*work)
        drm_helper_hpd_irq_event(dev);
   }
-/**
- * amdgpu_irq_reset_work_func - execute gpu reset
- *
- * @work: work struct
- *
- * Execute scheduled gpu reset (cayman+).
- * This function is called when the irq handler
- * thinks we need a gpu reset.
- */
-static void amdgpu_irq_reset_work_func(struct work_struct *work) -{
-       struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
-                                                 reset_work);
-
-       if (!amdgpu_sriov_vf(adev))
-               amdgpu_gpu_recover(adev, NULL);
-}
/* Disable *all* interrupts */
   static void amdgpu_irq_disable_all(struct amdgpu_device *adev) @@
-251,14 +234,11 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
                                amdgpu_hotplug_work_func);
        }
- INIT_WORK(&adev->reset_work, amdgpu_irq_reset_work_func);
-
        adev->irq.installed = true;
        r = drm_irq_install(adev->ddev, adev->ddev->pdev->irq);
        if (r) {
                adev->irq.installed = false;
                flush_work(&adev->hotplug_work);
-               cancel_work_sync(&adev->reset_work);
                return r;
        }
@@ -283,7 +263,6 @@ void amdgpu_irq_fini(struct amdgpu_device *adev)
                if (adev->irq.msi_enabled)
                        pci_disable_msi(adev->pdev);
                flush_work(&adev->hotplug_work);
-               cancel_work_sync(&adev->reset_work);
        }
for (i = 0; i < AMDGPU_IH_CLIENTID_MAX; ++i) { diff --git
a/drivers/gpu/drm/amd/amdgpu/cik_sdma.c
b/drivers/gpu/drm/amd/amdgpu/cik_sdma.c
index ed26dcb..c8d9bc1 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik_sdma.c
@@ -1227,7 +1227,6 @@ static int cik_sdma_process_illegal_inst_irq(struct 
amdgpu_device *adev,
                                             struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal instruction in SDMA command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c
index 9430d48..25b32b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c
@@ -3443,7 +3443,6 @@ static int gfx_v6_0_priv_reg_irq(struct amdgpu_device 
*adev,
                                 struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal register access in command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
@@ -3452,7 +3451,6 @@ static int gfx_v6_0_priv_inst_irq(struct amdgpu_device *adev,
                                  struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal instruction in command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
index 3c2b15a..b0e127d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
@@ -5020,7 +5020,6 @@ static int gfx_v7_0_priv_reg_irq(struct amdgpu_device 
*adev,
                                 struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal register access in command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
@@ -5029,8 +5028,6 @@ static int gfx_v7_0_priv_inst_irq(struct amdgpu_device *adev,
                                  struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal instruction in command stream\n");
-       // XXX soft reset the gfx block only
-       schedule_work(&adev->reset_work);
        return 0;
   }
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index a74515a..5fd4996 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -6795,7 +6795,6 @@ static int gfx_v8_0_priv_reg_irq(struct amdgpu_device 
*adev,
                                 struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal register access in command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
@@ -6804,7 +6803,6 @@ static int gfx_v8_0_priv_inst_irq(struct amdgpu_device *adev,
                                  struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal instruction in command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 9855dc0..dce1960 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -4136,7 +4136,6 @@ static int gfx_v9_0_priv_reg_irq(struct amdgpu_device 
*adev,
                                 struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal register access in command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
@@ -4145,7 +4144,6 @@ static int gfx_v9_0_priv_inst_irq(struct amdgpu_device *adev,
                                  struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal instruction in command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
index 92f8c44..7e2580c 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v2_4.c
@@ -1161,7 +1161,6 @@ static int sdma_v2_4_process_illegal_inst_irq(struct 
amdgpu_device *adev,
                                              struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal instruction in SDMA command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
index 52e6bf2..c12f994 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v3_0.c
@@ -1480,7 +1480,6 @@ static int sdma_v3_0_process_illegal_inst_irq(struct 
amdgpu_device *adev,
                                              struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal instruction in SDMA command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index fe78c00..09b7586 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -1416,7 +1416,6 @@ static int sdma_v4_0_process_illegal_inst_irq(struct 
amdgpu_device *adev,
                                              struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal instruction in SDMA command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }
diff --git a/drivers/gpu/drm/amd/amdgpu/si_dma.c
b/drivers/gpu/drm/amd/amdgpu/si_dma.c
index ee469a9..408e145 100644
--- a/drivers/gpu/drm/amd/amdgpu/si_dma.c
+++ b/drivers/gpu/drm/amd/amdgpu/si_dma.c
@@ -683,7 +683,6 @@ static int si_dma_process_illegal_inst_irq(struct 
amdgpu_device *adev,
                                              struct amdgpu_iv_entry *entry)
   {
        DRM_ERROR("Illegal instruction in SDMA command stream\n");
-       schedule_work(&adev->reset_work);
        return 0;
   }


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to