On 5/7/25 13:03, Sam wrote: > > On 2025/5/7 18:03, Lazar, Lijo wrote: >> On 5/7/2025 11:52 AM, Zhang, GuoQing (Sam) wrote: >>> [AMD Official Use Only - AMD Internal Distribution Only] >>> >>> >>> >>>> Please keep in mind that this is not the only scenario addressed by the >>>> driver - for ex: a resume sequence is executed after a device reset. >>>> This patch itself introduces unwanted sequences for other commonly used >>>> usecases. Please rework on the series without breaking existing usecases. >>>> Thanks, >>>> Lijo >>> >>> Hi @Lazar, Lijo<mailto:[email protected]>, Thank you for the feedback. >>> >>> >>> I also think the new code should be inside a check so that new code is >>> executed only on resume with different VF and do not break existing >>> usecases. Following is the implementation of this approach I can think of. >>> >>> - introduce new field `prev_physical_node_id ` in `struct amdgpu_xgmi `. >>> update the fields on resume. >>> >>> - put new code inside code block `if (prev_physical_node_id != >>> physical_node_id )` >>> >>> >> Can this happen only with XGMI under this condition? Any other method >> possible like preparing a 'unique signature' and matching it to identify >> if it resumed on an identically configured system? > > Yes, this hibernate-resume with different VF feature is only for devices with > XGMI. Detecting XGMI node id change is the only way I can think of to > identify the case. It's also a very simple way. > > @Koenig, Christian <mailto:[email protected]> Are you OK with this > approach, adding a check for the new code sequence?
Well you still need to avoid calling gmc_v9_0_mc_init() since that is most likely incorrect. Regards, Christian. > >> Regardless, instead of having a direct check, better to wrap it inside >> something like >> if (amdgpu_virt_need_migration()) or something more appropriate. > > Yes, I will do that. Thank you! > > Regards > Sam > >> Thanks, >> Lijo >> >>> Is this approach acceptable? If not, can you suggest a better approach? >>> @Lazar, Lijo<mailto:[email protected]> @Koenig, Christian >>> <mailto:[email protected]> Thank you! >>> >>> >>> Regards >>> >>> Sam >>> >>> >>> *From: *Lazar, Lijo<[email protected]> >>> *Date: *Tuesday, May 6, 2025 at 19:55 >>> *To: *Zhang, GuoQing (Sam)<[email protected]>, amd- >>> [email protected] <[email protected]> >>> *Cc: *Zhao, Victor<[email protected]>, Chang, HaiJun >>> <[email protected]>, Koenig, Christian<[email protected]>, >>> Deucher, Alexander<[email protected]>, Zhang, Owen(SRDC) >>> <[email protected]>, Ma, Qing (Mark)<[email protected]>, Jiang Liu >>> <[email protected]> >>> *Subject: *Re: [PATCH v3 1/7] drm/amdgpu: update XGMI physical node id >>> and GMC configs on resume >>> >>> >>> >>> On 5/6/2025 3:06 PM, Samuel Zhang wrote: >>>> For virtual machine with vGPUs in SRIOV single device mode and XGMI >>>> is enabled, XGMI physical node ids may change when waking up from >>>> hiberation with different vGPU devices. So update XGMI physical node >>>> ids on resume. >>>> >>> Please keep in mind that this is not the only scenario addressed by the >>> driver - for ex: a resume sequence is executed after a device reset. >>> This patch itself introduces unwanted sequences for other commonly used >>> usecases. Please rework on the series without breaking existing usecases. >>> >>> Thanks, >>> Lijo >>> >>>> Update GPU memory controller configuration on resume if XGMI physical >>>> node ids are changed. >>>> >>>> Signed-off-by: Jiang Liu<[email protected]> >>>> Signed-off-by: Samuel Zhang<[email protected]> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++++++++++++++++++++++ >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +-- >>>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++++ >>>> 3 files changed, 29 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/ >>> drm/amd/amdgpu/amdgpu_device.c >>>> index d477a901af84..e795af5067e5 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> @@ -5040,6 +5040,27 @@ int amdgpu_device_suspend(struct drm_device >>> *dev, bool notify_clients) >>>> return 0; >>>> } >>>> +static int amdgpu_device_update_xgmi_info(struct amdgpu_device *adev) >>>> +{ >>>> + int r; >>>> + unsigned int prev_physical_node_id; >>>> + >>>> + /* Get xgmi info again for sriov to detect device changes */ >>>> + if (amdgpu_sriov_vf(adev) && >>>> + !(adev->flags & AMD_IS_APU) && >>>> + adev->gmc.xgmi.supported && >>>> + !adev->gmc.xgmi.connected_to_cpu) { >>>> + prev_physical_node_id = adev->gmc.xgmi.physical_node_id; >>>> + r = adev->gfxhub.funcs->get_xgmi_info(adev); >>>> + if (r) >>>> + return r; >>>> + >>>> + dev_info(adev->dev, "xgmi node, old id %d, new id %d\n", >>>> + prev_physical_node_id, adev- >>>> gmc.xgmi.physical_node_id); >>>> + } >>>> + return 0; >>>> +} >>>> + >>>> /** >>>> * amdgpu_device_resume - initiate device resume >>>> * >>>> @@ -5059,6 +5080,9 @@ int amdgpu_device_resume(struct drm_device *dev, >>> bool notify_clients) >>>> r = amdgpu_virt_request_full_gpu(adev, true); >>>> if (r) >>>> return r; >>>> + r = amdgpu_device_update_xgmi_info(adev); >>>> + if (r) >>>> + return r; >>>> } >>>> if (dev->switch_power_state == DRM_SWITCH_POWER_OFF) >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/ >>> drm/amd/amdgpu/amdgpu_gmc.c >>>> index d1fa5e8e3937..a2abddf3c110 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c >>>> @@ -1298,8 +1298,7 @@ int amdgpu_gmc_get_nps_memranges(struct >>> amdgpu_device *adev, >>>> if (!mem_ranges || !exp_ranges) >>>> return -EINVAL; >>>> - refresh = (adev->init_lvl->level != >>> AMDGPU_INIT_LEVEL_MINIMAL_XGMI) && >>>> - (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS); >>>> + refresh = true; >>>> ret = amdgpu_discovery_get_nps_info(adev, &nps_type, &ranges, >>>> &range_cnt, refresh); >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/ >>> amd/amdgpu/gmc_v9_0.c >>>> index 59385da80185..1eb451a3743b 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c >>>> @@ -2533,6 +2533,10 @@ static int gmc_v9_0_resume(struct >>> amdgpu_ip_block *ip_block) >>>> struct amdgpu_device *adev = ip_block->adev; >>>> int r; >>>> + r = gmc_v9_0_mc_init(adev); >>>> + if (r) >>>> + return r; >>>> + >>>> /* If a reset is done for NPS mode switch, read the memory range >>>> * information again. >>>> */
