Hi Christian,

I gave your suggestion a quick try. It also works and is cleaner. I 
will send a new patch to revise retry_init. Please help review it later.
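
A rough sketch of the retry flow suggested in the quoted mail below, not the actual patch. The names fake_dev_register(), fake_pci_probe() and MAX_PROBE_RETRY are hypothetical stand-ins for drm_dev_register() and amdgpu_pci_probe(), and the retry limit of 3 is an assumption carried over from the current retry_init. The stub simulates two exclusive mode timeouts followed by a successful init.

```c
#include <errno.h>
#include <stdio.h>

/* Assumption: keep the same retry limit as the current retry_init. */
#define MAX_PROBE_RETRY 3

/* Counts how many times registration was attempted. */
static int attempts;

/*
 * Hypothetical stand-in for drm_dev_register(), which ends up calling
 * amdgpu_driver_load_kms(). It returns -EAGAIN for the first two
 * attempts (simulated exclusive mode timeout) and 0 afterwards.
 */
static int fake_dev_register(void)
{
	return (++attempts <= 2) ? -EAGAIN : 0;
}

/*
 * Hypothetical stand-in for the probe routine: retry registration only
 * when the load callback reports -EAGAIN, and give up after
 * MAX_PROBE_RETRY retries or on any other error.
 */
static int fake_pci_probe(void)
{
	int retry = 0;
	int ret;

	for (;;) {
		ret = fake_dev_register();
		if (ret != -EAGAIN || ++retry > MAX_PROBE_RETRY)
			break;
		/*
		 * The real driver would sleep here before asking for
		 * exclusive mode again, and could optionally call
		 * pci_disable_device() / pci_enable_device().
		 */
		printf("retry init %d\n", retry);
	}

	return ret;
}
```

With this stub, fake_pci_probe() returns 0 on the third attempt. Any error other than -EAGAIN is returned to the caller without a retry.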
-- 
Sincerely Yours,
Pixel








On 08/11/2017, 10:40 AM, "Ding, Pixel" <[email protected]> wrote:

>Hi Christian,
>
>The retry_init only handles failures caused by an exclusive mode timeout. It 
>checks the MMIO to see whether an exclusive mode timeout occurred and retries 
>init if so; otherwise it just returns the error.
>
>For the exclusive timeout case, the host layer issues an FLR on this VF, so the 
>driver needn't clean up hardware state; amdgpu_device_fini is used here only to 
>clean up the software state.
>
>It has been tested and proven to work correctly. Although the debugfs files are 
>only the tip of the iceberg, they are the only issue we found in this version of 
>retry_init.
>
>-- 
>Sincerely Yours,
>Pixel
>
>
>
>
>
>
>
>On 07/11/2017, 5:56 PM, "Koenig, Christian" <[email protected]> wrote:
>
>>Hi Gary,
>>
>>well that patch is nonsense to begin with.
>>
>>amdgpu_device_init() does quite a bunch of other initialization which is 
>>not cleaned up by amdgpu_device_fini(), so the debugfs files are only 
>>the tip of the iceberg here.
>>
>>Please revert 2316518efc459928ad1d3d2d3511ea5fbda19475 and then we can 
>>try again from scratch.
>>
>>What we need to do is return -EAGAIN from amdgpu_driver_load_kms. Then 
>>in amdgpu_pci_probe() we can catch that error and call 
>>drm_dev_register() multiple times if necessary.
>>
>>This way we can also optionally pci_disable_device() / 
>>pci_enable_device() between tries if appropriate.
>>
>>Regards,
>>Christian.
>>
>>On 07.11.2017 at 09:02, Sun, Gary wrote:
>>> Hi Christian,
>>>
>>> The feature is for GPU virtualization and has been checked in, you can 
>>> refer to the following patch or commit 
>>> 75b126427778218b36cfb68637e4f8d0e584b8ef.
>>>
>>>  From 2316518efc459928ad1d3d2d3511ea5fbda19475 Mon Sep 17 00:00:00 2001
>>> From: pding <[email protected]>
>>> Date: Mon, 23 Oct 2017 17:22:09 +0800
>>> Subject: [PATCH 001/121] drm/amdgpu: retry init if it fails due to 
>>> exclusive mode timeout (v3)
>>>
>>> Exclusive mode has a real-time limit in practice, such as having to
>>> complete within 300ms. The timeout is easily observed when running many
>>> VF/VMs on a single host with a heavy CPU workload.
>>>
>>> If we find the init fails due to exclusive mode timeout, try it again.
>>>
>>> v2:
>>>   - rewrite the condition for readable value.
>>>
>>> v3:
>>>   - fix typo, add comments for sleep
>>>
>>> Acked-by: Alex Deucher <[email protected]>
>>> Signed-off-by: pding <[email protected]>
>>> Signed-off-by: Alex Deucher <[email protected]>
>>> Signed-off-by: Gary Sun <[email protected]>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   10 ++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |   15 +++++++++++++--
>>>   2 files changed, 23 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 125f77d..385b10e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2303,6 +2303,15 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>   
>>>     r = amdgpu_init(adev);
>>>     if (r) {
>>> +           /* failed in exclusive mode due to timeout */
>>> +           if (amdgpu_sriov_vf(adev) &&
>>> +               !amdgpu_sriov_runtime(adev) &&
>>> +               amdgpu_virt_mmio_blocked(adev) &&
>>> +               !amdgpu_virt_wait_reset(adev)) {
>>> +                   dev_err(adev->dev, "VF exclusive mode timeout\n");
>>> +                   r = -EAGAIN;
>>> +                   goto failed;
>>> +           }
>>>             dev_err(adev->dev, "amdgpu_init failed\n");
>>>             amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_AMDGPU_INIT_FAIL, 0, 0);
>>>             amdgpu_fini(adev);
>>> @@ -2390,6 +2399,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>     amdgpu_vf_error_trans_all(adev);
>>>     if (runtime)
>>>             vga_switcheroo_fini_domain_pm_ops(adev->dev);
>>> +
>>>     return r;
>>>   }
>>>   
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> index 720139e..f313eee 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> @@ -86,7 +86,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
>>>   int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>>>   {
>>>     struct amdgpu_device *adev;
>>> -   int r, acpi_status;
>>> +   int r, acpi_status, retry = 0;
>>>   
>>>   #ifdef CONFIG_DRM_AMDGPU_SI
>>>     if (!amdgpu_si_support) {
>>> @@ -122,6 +122,7 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>>>             }
>>>     }
>>>   #endif
>>> +retry_init:
>>>   
>>>     adev = kzalloc(sizeof(struct amdgpu_device), GFP_KERNEL);
>>>     if (adev == NULL) {
>>> @@ -144,7 +145,17 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>>>      * VRAM allocation
>>>      */
>>>     r = amdgpu_device_init(adev, dev, dev->pdev, flags);
>>> -   if (r) {
>>> +   if (r == -EAGAIN && ++retry <= 3) {
>>> +           adev->virt.caps &= ~AMDGPU_SRIOV_CAPS_RUNTIME;
>>> +           adev->virt.ops = NULL;
>>> +           amdgpu_device_fini(adev);
>>> +           kfree(adev);
>>> +           dev->dev_private = NULL;
>>> +           /* Don't request EX mode too frequently which is attacking */
>>> +           msleep(5000);
>>> +           dev_err(&dev->pdev->dev, "retry init %d\n", retry);
>>> +           goto retry_init;
>>> +   } else if (r) {
>>>             dev_err(&dev->pdev->dev, "Fatal error during GPU init\n");
>>>             goto out;
>>>     }
>>
>>
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
