Re: [PATCH] drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

2024-03-05 Thread Felix Kuehling


On 2024-03-04 19:20, Rehman, Ahmad wrote:


[AMD Official Use Only - General]


Hey,


Due to mode-1 reset (pending_reset), the amdgpu_amdkfd_device_init 
will not be called and hence adev->kfd.init_complete will not be set. 
The function amdgpu_amdkfd_drm_client_create has condition:

if (!adev->kfd.init_complete)
              return 0;
So, in probe function, when we return from device_init the KFD is not 
initialized and amdgpu_amdkfd_drm_client_create returns without doing 
anything.


I think your change could result in calling 
amdgpu_amdkfd_drm_client_create multiple times. IIRC, one purpose of 
moving the call to amdgpu_pci_probe was to ensure that it is only called 
once, because it only gets unregistered once when the driver is 
unloaded. Maybe it would be better to remove the if 
(!adev->kfd.init_complete) condition from 
amdgpu_amdkfd_drm_client_create. That way we would always create the 
client at probe and it would be ready when it's needed after the GPU reset.



There is a chance that the client would get created unnecessarily if KFD 
init never succeeds. But that should be rare, and it's not a big 
resource waste.



There were some comments on a previous code review, that creating the 
DRM client too early could cause problems. But I don't understand what 
that problem could be. As I understand it, the adev->kfd.client is just 
a place to put GEM handles for KFD BOs that we don't want to expose to 
user mode. I see no harm in creating this client too early or when it's 
not needed.



Regards,
  Felix




Thanks,
Ahmad


*From:* Kuehling, Felix 
*Sent:* Monday, March 4, 2024 6:39 PM
*To:* Rehman, Ahmad ; 
amd-gfx@lists.freedesktop.org 

*Cc:* Wan, Gavin 
*Subject:* Re: [PATCH] drm/amdgpu: Init zone device and drm client 
after mode-1 reset on reload


On 2024-03-04 17:05, Ahmad Rehman wrote:
> In passthrough environment, when amdgpu is reloaded after unload, mode-1
> is triggered after initializing the necessary IPs, That init does not
> include KFD, and KFD init waits until the reset is completed. KFD init
> is called in the reset handler, but in this case, the zone device and
> drm client is not initialized, causing app to create kernel panic.
>
> Signed-off-by: Ahmad Rehman 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 -
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c

> index 15b188aaf681..80b9642f2bc4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2479,8 +2479,11 @@ static void 
amdgpu_drv_delayed_reset_work_handler(struct work_struct *work)

>    }
>    for (i = 0; i < mgpu_info.num_dgpu; i++) {
>    adev = mgpu_info.gpu_ins[i].adev;
> - if (!adev->kfd.init_complete)
> + if (!adev->kfd.init_complete) {
> + kgd2kfd_init_zone_device(adev);
>    amdgpu_amdkfd_device_init(adev);
> + amdgpu_amdkfd_drm_client_create(adev);
I don't see what's preventing the DRM client initialization in the
reset-on-driver-load case. It only needs to be created once and that
happens in amdgpu_pci_probe. Am I missing anything?

Regards,
   Felix


> + }
>    amdgpu_ttm_set_buffer_funcs_status(adev, true);
>    }
>   }

Re: [PATCH] drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

2024-03-04 Thread Rehman, Ahmad
[AMD Official Use Only - General]

Hey,


Due to mode-1 reset (pending_reset), the amdgpu_amdkfd_device_init will not be 
called and hence adev->kfd.init_complete will not be set. The function 
amdgpu_amdkfd_drm_client_create has condition:
if (!adev->kfd.init_complete)
return 0;
So, in probe function, when we return from device_init the KFD is not 
initialized and amdgpu_amdkfd_drm_client_create returns without doing anything.

Thanks,
Ahmad


From: Kuehling, Felix 
Sent: Monday, March 4, 2024 6:39 PM
To: Rehman, Ahmad ; amd-gfx@lists.freedesktop.org 

Cc: Wan, Gavin 
Subject: Re: [PATCH] drm/amdgpu: Init zone device and drm client after mode-1 
reset on reload


On 2024-03-04 17:05, Ahmad Rehman wrote:
> In passthrough environment, when amdgpu is reloaded after unload, mode-1
> is triggered after initializing the necessary IPs, That init does not
> include KFD, and KFD init waits until the reset is completed. KFD init
> is called in the reset handler, but in this case, the zone device and
> drm client is not initialized, causing app to create kernel panic.
>
> Signed-off-by: Ahmad Rehman 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 -
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 15b188aaf681..80b9642f2bc4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2479,8 +2479,11 @@ static void 
> amdgpu_drv_delayed_reset_work_handler(struct work_struct *work)
>}
>for (i = 0; i < mgpu_info.num_dgpu; i++) {
>adev = mgpu_info.gpu_ins[i].adev;
> - if (!adev->kfd.init_complete)
> + if (!adev->kfd.init_complete) {
> + kgd2kfd_init_zone_device(adev);
>amdgpu_amdkfd_device_init(adev);
> + amdgpu_amdkfd_drm_client_create(adev);
I don't see what's preventing the DRM client initialization in the
reset-on-driver-load case. It only needs to be created once and that
happens in amdgpu_pci_probe. Am I missing anything?

Regards,
   Felix


> + }
>amdgpu_ttm_set_buffer_funcs_status(adev, true);
>}
>   }


Re: [PATCH] drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

2024-03-04 Thread Felix Kuehling



On 2024-03-04 17:05, Ahmad Rehman wrote:

In passthrough environment, when amdgpu is reloaded after unload, mode-1
is triggered after initializing the necessary IPs, That init does not
include KFD, and KFD init waits until the reset is completed. KFD init
is called in the reset handler, but in this case, the zone device and
drm client is not initialized, causing app to create kernel panic.

Signed-off-by: Ahmad Rehman 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 15b188aaf681..80b9642f2bc4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2479,8 +2479,11 @@ static void amdgpu_drv_delayed_reset_work_handler(struct 
work_struct *work)
}
for (i = 0; i < mgpu_info.num_dgpu; i++) {
adev = mgpu_info.gpu_ins[i].adev;
-   if (!adev->kfd.init_complete)
+   if (!adev->kfd.init_complete) {
+   kgd2kfd_init_zone_device(adev);
amdgpu_amdkfd_device_init(adev);
+   amdgpu_amdkfd_drm_client_create(adev);
I don't see what's preventing the DRM client initialization in the 
reset-on-driver-load case. It only needs to be created once and that 
happens in amdgpu_pci_probe. Am I missing anything?


Regards,
  Felix



+   }
amdgpu_ttm_set_buffer_funcs_status(adev, true);
}
  }