KFD VRAM allocations set only AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE
(clear on free) and not AMDGPU_GEM_CREATE_VRAM_CLEARED (clear on
create). Freshly allocated VRAM BOs can therefore contain stale data
from prior use, which is observable by GPU compute kernels.

The GEM ioctl path unconditionally sets VRAM_CLEARED, but the KFD
path was missing this flag.

This causes data corruption in applications that depend on
VMM-allocated memory being zero-initialized, such as RCCL P2P
transport where stale data in ptrExchange/head/tail fields leads
to HSA_STATUS_ERROR_MEMORY_FAULT crashes.

Signed-off-by: Amir Shetaia <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
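
Note (not part of the commit message): the flag composition in the VRAM
branch can be sketched as a small standalone C program. This is only an
illustration under stated assumptions. The bit values are copied by hand
and are assumed to match include/uapi/drm/amdgpu_drm.h and
include/uapi/linux/kfd_ioctl.h, so check them against the headers. The
helpers vram_flags_before() and vram_flags_after() are hypothetical
names and do not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit values, mirrored by hand from the uapi headers for
 * illustration only; the kernel headers are authoritative. */
#define AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED   (1ull << 0)
#define AMDGPU_GEM_CREATE_VRAM_CLEARED          (1ull << 3)
#define AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE  (1ull << 9)
#define KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC          (1u << 29)

/* The two clear flags are distinct bits, so setting one never
 * implies the other. */
static_assert(AMDGPU_GEM_CREATE_VRAM_CLEARED !=
              AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE,
              "clear-on-create and wipe-on-release must be separate bits");

/* Hypothetical helper: VRAM-branch flags before this patch.
 * Only wipe-on-release is requested. */
uint64_t vram_flags_before(uint32_t flags)
{
        uint64_t alloc_flags = AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE;

        alloc_flags |= (flags & KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC) ?
                AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED : 0;
        return alloc_flags;
}

/* Hypothetical helper: VRAM-branch flags with this patch applied.
 * Clear-on-create is requested in addition to wipe-on-release. */
uint64_t vram_flags_after(uint32_t flags)
{
        uint64_t alloc_flags = AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE |
                AMDGPU_GEM_CREATE_VRAM_CLEARED;

        alloc_flags |= (flags & KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC) ?
                AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED : 0;
        return alloc_flags;
}
```

For any input, the two helpers differ only in the
AMDGPU_GEM_CREATE_VRAM_CLEARED bit; the handling of
KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC is unchanged.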

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 8a869fe41acd..7c01492e69dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1735,7 +1735,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
                        alloc_domain = AMDGPU_GEM_DOMAIN_GTT;
                        alloc_flags = 0;
                } else {
-                       alloc_flags = AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE;
+                       alloc_flags = AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE |
+                               AMDGPU_GEM_CREATE_VRAM_CLEARED;
                        alloc_flags |= (flags & KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC) ?
                        AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED : 0;
 
-- 
2.43.0
