On 02.07.25 09:28, Samuel Zhang wrote: > > On 2025/7/1 16:22, Christian König wrote: >> On 01.07.25 10:18, Zhang, GuoQing (Sam) wrote: >>> [AMD Official Use Only - AMD Internal Distribution Only] >>> >>> >>> Hi Christian, >>> >>> >>> Thank you for the feedback. >>> >>> >>> For “return ret < 0 ? ret : 0;”, it is equivalent to “return ret;” since >>> ret is always <= 0 after the loop. >> No it isn't. >> >> ttm_global_swapout() returns the number of pages swapped out and only a >> negative error code if something went wrong. > > > /** > * move GTT BOs to shmem for hibernation. > * > * returns 0 on success, negative on failure. > */ > int ttm_device_prepare_hibernation(void) > { > struct ttm_operation_ctx ctx = { > .interruptible = false, > .no_wait_gpu = false, > .force_alloc = true > }; > int ret; > > do { > ret = ttm_global_swapout(&ctx, GFP_KERNEL); > } while (ret > 0); > return ret; > } > > This is the new code version. > If ttm_global_swapout() return positive number, the while loop will continue > to the next iteration. > The while loop stops only when ttm_global_swapout() returns 0 or negative > number. In both case, the new function can just return the ret.
Ok, now I at least got what you wanted to do. But that is not really what I had in mind and isn't really good coding style. Please use ttm_device_swapout() instead of ttm_global_swapout(), apart from that we can probably keep it that way. Regards, Christian. > > The ret values printed in the do while loop: > [ 53.745892] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 512 > [ 53.950975] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 35840 > [ 53.951713] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 9 > [ 67.712196] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 2187264 > [ 67.713726] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 512 > [ 67.759212] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 32768 > [ 67.761946] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1024 > [ 67.762685] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 85 > [ 67.763518] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 175 > [ 67.767318] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 2367 > [ 67.767942] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1 > [ 67.768499] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1 > [ 67.769054] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1 > ... > [ 67.783554] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1 > [ 67.785755] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1 > [ 67.788607] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 1 > [ 67.789906] [TTM DEVICE] ttm_device_prepare_hibernation:164 ret 0 > > > Regards > Sam > > > >> >> And it's probably not a good idea to return that from the new function. >> >> Regards, >> Christian. >> >>> >>> For all other comments, I will revise the patch accordingly in v2. >>> >>> >>> Regards >>> >>> Sam >>> >>> >>> >>> *From: *Koenig, Christian <christian.koe...@amd.com> >>> *Date: *Monday, June 30, 2025 at 19:54 >>> *To: *Zhang, GuoQing (Sam) <guoqing.zh...@amd.com>, raf...@kernel.org >>> <raf...@kernel.org>, len.br...@intel.com <len.br...@intel.com>, >>> pa...@kernel.org <pa...@kernel.org>, Deucher, Alexander >>> <alexander.deuc...@amd.com>, Limonciello, Mario >>> <mario.limoncie...@amd.com>, Lazar, Lijo <lijo.la...@amd.com> >>> *Cc: *Zhao, Victor <victor.z...@amd.com>, Chang, HaiJun >>> <haijun.ch...@amd.com>, Ma, Qing (Mark) <qing...@amd.com>, >>> amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>, >>> dri-de...@lists.freedesktop.org <dri-de...@lists.freedesktop.org>, >>> linux...@vger.kernel.org <linux...@vger.kernel.org>, >>> linux-ker...@vger.kernel.org <linux-ker...@vger.kernel.org> >>> *Subject: *Re: [PATCH 1/3] drm/amdgpu: move GTT to SHM after eviction for >>> hibernation >>> >>> On 30.06.25 12:41, Samuel Zhang wrote: >>>> When hibernate with data center dGPUs, huge number of VRAM BOs evicted >>>> to GTT and takes too much system memory. This will cause hibernation >>>> fail due to insufficient memory for creating the hibernation image. >>>> >>>> Move GTT BOs to shmem in KMD, then shmem to swap disk in kernel >>>> hibernation code to make room for hibernation image. >>> This should probably be two patches, one for TTM and then an amdgpu patch >>> to forward the event. >>> >>>> Signed-off-by: Samuel Zhang <guoqing.zh...@amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 13 ++++++++++++- >>>> drivers/gpu/drm/ttm/ttm_resource.c | 18 ++++++++++++++++++ >>>> include/drm/ttm/ttm_resource.h | 1 + >>>> 3 files changed, 31 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >>>> index 4d57269c9ca8..5aede907a591 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >>>> @@ -2889,6 +2889,7 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo, >>>> int amdgpu_ttm_evict_resources(struct amdgpu_device *adev, int mem_type) >>>> { >>>> struct ttm_resource_manager *man; >>>> + int r; >>>> switch (mem_type) { >>>> case TTM_PL_VRAM: >>>> @@ -2903,7 +2904,17 @@ int amdgpu_ttm_evict_resources(struct amdgpu_device >>>> *adev, int mem_type) >>>> return -EINVAL; >>>> } >>>> - return ttm_resource_manager_evict_all(&adev->mman.bdev, man); >>>> + r = ttm_resource_manager_evict_all(&adev->mman.bdev, man); >>>> + if (r) { >>>> + DRM_ERROR("Failed to evict memory type %d\n", mem_type); >>>> + return r; >>>> + } >>>> + if (adev->in_s4 && mem_type == TTM_PL_VRAM) { >>>> + r = ttm_resource_manager_swapout(); >>>> + if (r) >>>> + DRM_ERROR("Failed to swap out, %d\n", r); >>>> + } >>>> + return r; >>>> } >>>> #if defined(CONFIG_DEBUG_FS) >>>> diff --git a/drivers/gpu/drm/ttm/ttm_resource.c >>>> b/drivers/gpu/drm/ttm/ttm_resource.c >>>> index fd41b56e2c66..07b1f5a5afc2 100644 >>>> --- a/drivers/gpu/drm/ttm/ttm_resource.c >>>> +++ b/drivers/gpu/drm/ttm/ttm_resource.c >>>> @@ -534,6 +534,24 @@ void ttm_resource_manager_init(struct >>>> ttm_resource_manager *man, >>>> } >>>> EXPORT_SYMBOL(ttm_resource_manager_init); >>>> +int ttm_resource_manager_swapout(void) >>> This needs documentation, better placement and a better name. >>> >>> First of all put it into ttm_device.c instead of the resource manager. >>> >>> Then call it something like ttm_device_prepare_hibernation or similar. >>> >>> >>>> +{ >>>> + struct ttm_operation_ctx ctx = { >>>> + .interruptible = false, >>>> + .no_wait_gpu = false, >>>> + .force_alloc = true >>>> + }; >>>> + int ret; >>>> + >>>> + while (true) { >>> Make that: >>> >>> do { >>> ret = ... >>> } while (ret > 0); >>> >>>> + ret = ttm_global_swapout(&ctx, GFP_KERNEL); >>>> + if (ret <= 0) >>>> + break; >>>> + } >>>> + return ret; >>> It's rather pointless to return the number of swapped out pages. >>> >>> Make that "return ret < 0 ? ret : 0; >>> >>> Regards, >>> Christian. >>> >>>> +} >>>> +EXPORT_SYMBOL(ttm_resource_manager_swapout); >>>> + >>>> /* >>>> * ttm_resource_manager_evict_all >>>> * >>>> diff --git a/include/drm/ttm/ttm_resource.h >>>> b/include/drm/ttm/ttm_resource.h >>>> index b873be9597e2..46181758068e 100644 >>>> --- a/include/drm/ttm/ttm_resource.h >>>> +++ b/include/drm/ttm/ttm_resource.h >>>> @@ -463,6 +463,7 @@ void ttm_resource_manager_init(struct >>>> ttm_resource_manager *man, >>>> int ttm_resource_manager_evict_all(struct ttm_device *bdev, >>>> struct ttm_resource_manager *man); >>>> +int ttm_resource_manager_swapout(void); >>>> uint64_t ttm_resource_manager_usage(struct ttm_resource_manager >>>> *man); >>>> void ttm_resource_manager_debug(struct ttm_resource_manager *man,