On 2021-07-30 5:26 p.m., Felix Kuehling wrote:
Am 2021-07-28 um 1:31 p.m. schrieb Eric Huang:
It is to fix a bug of gpu_recovery on multiple GPUs,
When one gpu is reset, the application running on other
gpu hangs, because kfd post reset doesn't restore the
running process.
This will resume all processes, even those that were affected by the GPU
reset.

The assumption we made here is, that KFD processes can use all GPUs. So
when one GPU is reset, all processes are affected. If we want to refine
that, we'll need some more complex logic to identify the processes
affected by a GPU reset and keep only those processes suspended. This
could be based on the GPUs that a process has created queues on, or
allocated memory on.

What we don't want, is processes continuing with bad data or
inconsistent state after a GPU reset.

Current code doesn't take care of this assumption. When a GPU hangs, evicting queues will fail on it and roll back to restore all processes queues on other GPUs, and continue to running with unclear state and data after a GPU reset.

The original thought about this patch is to call kfd_suspend_all_processes and kfd_restore_all_processes in pair on pre_reset and post_reset. And It keeps the consistent behavior for both amdgpu_gpu_recover and hang_hws.
  And it also fixes a bug in the function
kfd_process_evict_queues, when one gpu hangs, process
running on other gpus can't be evicted.
This should be a separate patch. The code you're removing was there as
an attempt to make kfd_process_evict_queues transactional. That means,
it either succeeds completely or it fails completely. I wanted to avoid
putting the system into an unknown state where some queues are suspended
and others are not and the caller has no idea how to proceed. So I roll
back a partial queue eviction if something failed.
Can we let the caller to decide if roll-back is needed? because no all the callers need to roll back, e.g. kfd_suspend_all_processes and kgd2kfd_quiesce_mm.

Your changing this to "try to evict as much as possible". Then a failure
does not mean "eviction failed" but "eviction completed but something
hung". Then the GPU reset can take care of the hanging part. I think
that's a reasonable strategy. But we need to be sure that there are no
other error conditions (other than hangs) that could cause a queue
eviction to fail.

There were some recent changes in pqm_destroy_queue that check the
return value of dqm->ops.destroy_queue, which indirectly comes from
execute_queues (sam as in the eviction case). -ETIME means a hang. Other
errors are possible. Those other errors may still need the roll-back.
Otherwise we'd leave stuff running on a non-hanging GPU after we
promised that we evicted everything.
I think it depends on case scenario. GPU reset doesn't need to know the return state. Memory eviction may need. Does Memory notifier invalidate range need?

See one more comment inline.


Signed-off-by: Eric Huang <jinhuieric.hu...@amd.com>
---
  drivers/gpu/drm/amd/amdkfd/kfd_device.c  |  2 +-
  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 24 +-----------------------
  2 files changed, 2 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 24b5e0aa1eac..daf1c19bd799 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -984,7 +984,7 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
        if (!kfd->init_complete)
                return 0;
- ret = kfd_resume(kfd);
+       ret = kgd2kfd_resume(kfd, false, true);
Which branch is this for? On amd-staging-drm-next kgd2kfd_resume only
has two parameters.
Sorry, it is based on dkms staging 5.11. I didn't notice there is difference between two branches.

Regards,
Eric

Regards,
   Felix


        if (ret)
                return ret;
        atomic_dec(&kfd_locked);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 38a9dee40785..9272a12c1db8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1879,36 +1879,14 @@ int kfd_process_evict_queues(struct kfd_process *p)
  {
        int r = 0;
        int i;
-       unsigned int n_evicted = 0;
for (i = 0; i < p->n_pdds; i++) {
                struct kfd_process_device *pdd = p->pdds[i];
r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
                                                            &pdd->qpd);
-               if (r) {
+               if (r)
                        pr_err("Failed to evict process queues\n");
-                       goto fail;
-               }
-               n_evicted++;
-       }
-
-       return r;
-
-fail:
-       /* To keep state consistent, roll back partial eviction by
-        * restoring queues
-        */
-       for (i = 0; i < p->n_pdds; i++) {
-               struct kfd_process_device *pdd = p->pdds[i];
-
-               if (n_evicted == 0)
-                       break;
-               if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
-                                                             &pdd->qpd))
-                       pr_err("Failed to restore queues\n");
-
-               n_evicted--;
        }
return r;

Reply via email to