On 2025-11-17 13:13, Martin, Andrew wrote:
[AMD Official Use Only - AMD Internal Distribution Only]

Greetings @Kuehling, Felix,

Thanks for the review comments, please see some discovery comments in lined!  
Patch 2 with the suggested fixes is on its way.

Note: I have added 1 change in the new patch (re-running the scan, found it)

file: kfd_crat.c:2360
...
peer_dev = kfd_topology_device_by_proximity_domain_no_lock(nid);
         if (!peer_dev)
                 return -ENODEV;

Hmm, not sure I agree with this one. This is in a loop that looks for XGMI-connected peer devices. Currently we create devices with increasing proximity domains and no gaps. So kfd_topology_device_by_proximity_domain_no_lock should never fail. Maybe in some future scenario where a device was hot-unplugged this assumption may be broken.

To be more robust here, I'd just continue if peer_dev is NULL. That way a topology with gaps doesn't cause the creation of new devices to fail.


...

-----Original Message-----
From: Kuehling, Felix <[email protected]>
Sent: Friday, November 14, 2025 4:48 PM
To: Russell, Kent <[email protected]>; Martin, Andrew
<[email protected]>; [email protected]
Subject: Re: [PATCH] drm/amdkfd: FORWARD NULL

On 2025-11-14 12:02, Russell, Kent wrote:
[AMD Official Use Only - AMD Internal Distribution Only]

-----Original Message-----
From: amd-gfx <[email protected]> On Behalf Of
Andrew Martin
Sent: Friday, November 14, 2025 9:41 AM
To: [email protected]
Cc: Martin, Andrew <[email protected]>
Subject: [PATCH] drm/amdkfd: FORWARD NULL

This patch fixes issues when the code moves forward with a potential
NULL pointer, without checking.

Signed-off-by: Andrew Martin <[email protected]>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 2 ++
   drivers/gpu/drm/amd/amdkfd/kfd_debug.c           | 6 +++++-
   drivers/gpu/drm/amd/amdkfd/kfd_process.c         | 3 +++
   3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
index 1ef758ac5076..71b7db5de69f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
@@ -104,6 +104,8 @@ static const char
*amdkfd_fence_get_driver_name(struct
dma_fence *f)
   static const char *amdkfd_fence_get_timeline_name(struct dma_fence *f)
   {
        struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
+     if (!fence)
+             return NULL;
Felix can hopefully confirm, but we may need to have something here,
since the references here expect something. Like in #define
AMDGPU_JOB_GET_TIMELINE_NAME(job) \
job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence-
finished)
For amdgpu Job fences the above makes sense. But KFD fences are our evictions
fences. There is no job associated with them.

I don't think the NULL check is needed here. to_amdgpu_amdkfd_fence returns
NULL if the f is NULL or the fence is not an amdgpu_amdkfd_fence, based on
the fence_ops. But we're in a fence_ops callback here that should only be called
for amdgpu_amdkfd_fences.

That said, if you need a check to satisfy a static checker, I suggest this:

      return fence ? fence->timeline_name : NULL;
This worked perfectly.

        return fence->timeline_name;
   }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index ba99e0f258ae..42fa137bdb2f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -517,7 +517,9 @@ int kfd_dbg_trap_set_flags(struct kfd_process
*target, uint32_t *flags)

        for (i = 0; i < target->n_pdds; i++) {
                struct kfd_topology_device *topo_dev =
-                             
kfd_topology_device_by_id(target->pdds[i]->dev->id);
+                     kfd_topology_device_by_id(target->pdds[i]->dev->id);
+             if (!topo_dev)
+                     continue;
This loop checks the capabilities of all the devices. If a device is not found, 
we
should assume the worst and return an error, instead of just assuming that it'll
be fine.

Done

                uint32_t caps = topo_dev->node_props.capability;

                if (!(caps &
HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED)
&&
@@ -1071,6 +1073,8 @@ int kfd_dbg_trap_device_snapshot(struct
kfd_process *target,
        for (i = 0; i < tmp_num_devices; i++) {
                struct kfd_process_device *pdd = target->pdds[i];
                struct kfd_topology_device *topo_dev =
kfd_topology_device_by_id(pdd->dev->id);
+             if (!topo_dev)
+                     continue;
I'd return an error here as well, because we obviously don't have accurate
device info.

Done

                device_info.gpu_id = pdd->dev->id;
                device_info.exception_status = pdd->exception_status;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index f5d173f1ca3b..f40d93f82ede 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1685,6 +1685,9 @@ int kfd_process_device_init_vm(struct
kfd_process_device *pdd,
        struct kfd_node *dev;
        int ret;

+     if (!pdd)
+             return -EINVAL;
+
We generally assume that functions are called with valid parameters.
With that argument, we should probably remove the check for !drm_file
below as well.
Done.  This is a good thing to note, but I want to have another lesson here.  
Is this true in general or only for internal/private APIs? I asked b/c this 
comment forced me to really look carefully the kfd_process_device_init_vm(), 
which is prototyped in kfd_priv.h.  This mean this and any other function from 
here should be called with valid params.  But public APIs should check that the 
params are valid, b/c this how we prevent crashes in our library?

Right, I should have been more precise. Public APIs where you don't know and don't trust the callers need to check their parameters. Internal APIs where you know and trust the callers don't need to do redundant checks.

Regards,
  Felix


Regards,
    Felix


        if (!drm_file)
                return -EINVAL;
Probably easier to just combine the !pdd and !drm_file into the same check.

   Kent
--
2.43.0

Reply via email to