sriov:return -ENODEV if gpu reseted

Christian König Mon, 09 Oct 2017 01:55:03 -0700

I think the approach which is currently used by Vulkan actually doesn'tsounds correct to me and should be fixed from the very beginning.

The kernel driver should expose the hardware capabilities as cleanly aspossible to userspace and *NOT* just try to fulfill any random userspaceor customer requirements.

Otherwise we run into having only specialized code which fits just onerequirement instead of a complete solution which works for everyone.

What we need at least is signaling that a problem occurred, blocking allcommand submission until userspace reacted, canceling all submission inflight and be able to reset this state from the userspace side.

You implemented this by blocking everything an assuming that userspacewould somehow magically react gracefully.

What needs to be done instead is to implement proper reset handling inMesa and/or the DDX and while doing so we can talk about requirementsfor the kernel driver.

Adding Nicolai and Marek for this to come up with a plan how to fix thisin Mesa.

Until that is properly done I will block any attempt to push a halvebacked solution upstream which only works for the closed source. I willalso revert the existing changed cause they seem to have caused a bunchof regression to the GPU reset code (not that the code was good in thefirst place, but now it doesn't work any more at all).


Regards,
Christian.

Am 09.10.2017 um 10:35 schrieb Liu, Monk:

Please be aware that this policy is what the strict mode defined and what 
customer want,
And also please check VK spec, it defines that after GPU reset all vk INSTANCE 
should close/release its resource/device/ctx and all buffers, and call 
re-initvkinstance after gpu reset

So this whole approach is what just aligned with the spec, and to not influence 
with current MESA/OGL client that's why I put the whole approach into the 
strict mode
And by default strict mode is not selected


BR Monk

-----Original Message-----
From: Christian König [mailto:[email protected]]
Sent: 2017年10月9日 16:26
To: Liu, Monk <[email protected]>; [email protected]
Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

Am 30.09.2017 um 08:03 schrieb Monk Liu:

for SRIOV strict mode gpu reset:

In kms open we mark the latest adev->gpu_reset_counter in fpriv we
return -ENODEV in cs_ioctl or info_ioctl if they found
fpriv->gpu_reset_counter != adev->gpu_reset_counter.

this way we prevent a potential bad process/FD from submitting cmds
and notify userspace with -ENODEV.

userspace should close all BO/ctx and re-open dri FD to re-create
virtual memory system for this process

The whole aproach is a NAK from my side.

We need to enable userspace to continue, not force it into process termination 
to recover. Otherwise we could send a SIGTERM in the first place.

Regards,
Christian.

Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
Signed-off-by: Monk Liu <[email protected]>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu.h     | 1 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 5 +++++
   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
   3 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index de9c164..b40d4ba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -772,6 +772,7 @@ struct amdgpu_fpriv {
        struct idr              bo_list_handles;
        struct amdgpu_ctx_mgr   ctx_mgr;
        u32                     vram_lost_counter;
+       int gpu_reset_counter;
   };

/*

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 9467cf6..6a1515e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, 
struct drm_file *filp)
        if (amdgpu_kms_vram_lost(adev, fpriv))
                return -ENODEV;

+ if (amdgpu_sriov_vf(adev) &&

+               amdgpu_sriov_reset_level == 1 &&
+               fpriv->gpu_reset_counter < 
atomic_read(&adev->gpu_reset_counter))
+               return -ENODEV;
+
        parser.adev = adev;
        parser.filp = filp;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c

b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 282f45b..bd389cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void 
*data, struct drm_file
        if (amdgpu_kms_vram_lost(adev, fpriv))
                return -ENODEV;

+ if (amdgpu_sriov_vf(adev) &&

+               amdgpu_sriov_reset_level == 1 &&
+               fpriv->gpu_reset_counter < 
atomic_read(&adev->gpu_reset_counter))
+               return -ENODEV;
+
        switch (info->query) {
        case AMDGPU_INFO_ACCEL_WORKING:
                ui32 = adev->accel_working;
@@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct 
drm_file *file_priv)
                goto out_suspend;
        }

+ fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);

+
        r = amdgpu_vm_init(adev, &fpriv->vm,
                           AMDGPU_VM_CONTEXT_GFX, 0);
        if (r) {


_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted

Reply via email to