On Fri, Apr 25, 2025 at 11:58 AM Nikola Petrovic <[email protected]> wrote:
>
> I've identified a critical issue with the existing GPU reset mechanism that
> causes a BSOD on the Windows Hyper-V platform. The current function:
>
>   static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>
> incorrectly sets the AMDGPU_HOST_FLR flag if any engine is hanging. This
> approach wrongly assumes that the Host PF is always responsible for
> triggering FLR. However, a VF (VM-guest) can also cause a GPU hang, which
> results in an unsuccessful VM reset. This ultimately causes a
> FULL_ACCESS_TIMEOUT on the host side, leading to six attempts to retrigger a
> Whole Guest Reset (WGR), which results in a BSOD after five to six failed
> restarts.
>
> Additionally, the current sequence sends a READY_TO_RESTART event and then
> requests FULL_ACCESS, which seems incorrect to me.
>
> My fix addresses this problem by using REQ_GPU_RESET to initiate the
> necessary restart while appropriately handling the FULL ACCESS request. My
> implementation has successfully passed 100 loop tests, confirming its
> effectiveness.
>
> Signed-off-by: Nikola Petrovic <[email protected]>

Acked-by: Alex Deucher <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7f354cd532dc..a2a436707200 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5314,11 +5314,12 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>  	struct amdgpu_hive_info *hive = NULL;
>
>  	if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
> +		r = amdgpu_virt_wait_reset(adev);
> +		if (r)
> +			return r;
>  		if (!amdgpu_ras_get_fed_status(adev))
> -			amdgpu_virt_ready_to_reset(adev);
> -		amdgpu_virt_wait_reset(adev);
> +			amdgpu_virt_reset_gpu(adev);
>  		clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
> -		r = amdgpu_virt_request_full_gpu(adev, true);
>  	} else {
>  		r = amdgpu_virt_reset_gpu(adev);
>  	}
> --
> 2.43.0
>
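
For anyone following the thread, the ordering the hunk leaves in the
AMDGPU_HOST_FLR branch can be sketched as a small standalone C fragment.
This is only an illustration under stated assumptions: struct sketch_ctx
and the sketch_* functions are hypothetical stand-ins that count calls and
return preset codes. They are not the real amdgpu_virt_* helpers and do not
touch the mailbox.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state the hunk looks at; not a driver struct. */
struct sketch_ctx {
	bool host_flr;       /* stands in for the AMDGPU_HOST_FLR flag bit */
	bool fed;            /* stands in for amdgpu_ras_get_fed_status() != 0 */
	int wait_reset_rc;   /* preset result of the stubbed wait step */
	int reset_gpu_rc;    /* preset result of the stubbed reset request */
	int wait_calls;      /* how often the wait stub ran */
	int reset_gpu_calls; /* how often the reset-request stub ran */
};

/* Stub for the "wait for the host to finish its reset" step. */
static int sketch_wait_reset(struct sketch_ctx *c)
{
	c->wait_calls++;
	return c->wait_reset_rc;
}

/* Stub for the "ask the host to reset the GPU" step. */
static int sketch_reset_gpu(struct sketch_ctx *c)
{
	c->reset_gpu_calls++;
	return c->reset_gpu_rc;
}

/* Mirrors the post-patch ordering shown in the quoted hunk. */
int sketch_reset_sriov(struct sketch_ctx *c)
{
	int r;

	assert(c != NULL);

	if (c->host_flr) {
		/* Wait first; a failed wait returns before the flag is cleared. */
		r = sketch_wait_reset(c);
		if (r)
			return r;
		/* Request the reset unless the fatal-error status is set;
		 * the return value is ignored here, as in the hunk. */
		if (!c->fed)
			sketch_reset_gpu(c);
		c->host_flr = false;
	} else {
		r = sketch_reset_gpu(c);
	}
	return r;
}
```

Calling sketch_reset_sriov() with host_flr set and a non-zero wait_reset_rc
returns that code without issuing a reset request and leaves host_flr set,
which is the early-return behaviour the patch introduces.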
