When poison is consumed on the guest before the guest receives the host's poison creation message, poison_handler can, in a corner case, complete processing earlier than it should. The guest then hangs waiting for the req_bad_pages reply during a VF FLR, and the VM becomes inaccessible in stress tests.
To work around this issue, introduce a 3s delay before the poison_handler message is sent. This ensures the correct processing order for the poison_handler and req_bad_pages events.

Signed-off-by: Ellen Pan <yu...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index f6d8597452ed..64e631c996e2 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -499,6 +499,9 @@ void xgpu_nv_mailbox_put_irq(struct amdgpu_device *adev)
 static void xgpu_nv_ras_poison_handler(struct amdgpu_device *adev,
 		enum amdgpu_ras_block block)
 {
+	// delay 3s to make sure any other intr is properly handled first
+	msleep(3000);
+
 	if (amdgpu_ip_version(adev, UMC_HWIP, 0) < IP_VERSION(12, 0, 0)) {
 		xgpu_nv_send_access_requests(adev, IDH_RAS_POISON);
 	} else {
-- 
2.25.1