[AMD Official Use Only - AMD Internal Distribution Only]

Greetings @Kuehling, Felix,

Thanks for the review comments, please see some discovery comments in lined!  
Patch 2 with the suggested fixes is on its way.

Note: I have added 1 change in the new patch (re-running the scan, found it)

file: kfd_crat.c:2360
...
peer_dev = kfd_topology_device_by_proximity_domain_no_lock(nid);
        if (!peer_dev)
                return -ENODEV;
...

> -----Original Message-----
> From: Kuehling, Felix <[email protected]>
> Sent: Friday, November 14, 2025 4:48 PM
> To: Russell, Kent <[email protected]>; Martin, Andrew
> <[email protected]>; [email protected]
> Subject: Re: [PATCH] drm/amdkfd: FORWARD NULL
>
> On 2025-11-14 12:02, Russell, Kent wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> >> -----Original Message-----
> >> From: amd-gfx <[email protected]> On Behalf Of
> >> Andrew Martin
> >> Sent: Friday, November 14, 2025 9:41 AM
> >> To: [email protected]
> >> Cc: Martin, Andrew <[email protected]>
> >> Subject: [PATCH] drm/amdkfd: FORWARD NULL
> >>
> >> This patch fixes issues when the code moves forward with a potential
> >> NULL pointer, without checking.
> >>
> >> Signed-off-by: Andrew Martin <[email protected]>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 2 ++
> >>   drivers/gpu/drm/amd/amdkfd/kfd_debug.c           | 6 +++++-
> >>   drivers/gpu/drm/amd/amdkfd/kfd_process.c         | 3 +++
> >>   3 files changed, 10 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> >> index 1ef758ac5076..71b7db5de69f 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
> >> @@ -104,6 +104,8 @@ static const char
> >> *amdkfd_fence_get_driver_name(struct
> >> dma_fence *f)
> >>   static const char *amdkfd_fence_get_timeline_name(struct dma_fence *f)
> >>   {
> >>        struct amdgpu_amdkfd_fence *fence = to_amdgpu_amdkfd_fence(f);
> >> +     if (!fence)
> >> +             return NULL;
> > Felix can hopefully confirm, but we may need to have something here,
> > since the references here expect something. Like in #define
> AMDGPU_JOB_GET_TIMELINE_NAME(job) \
> >
> > job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence-
> > >finished)
>
> For amdgpu Job fences the above makes sense. But KFD fences are our evictions
> fences. There is no job associated with them.
>
> I don't think the NULL check is needed here. to_amdgpu_amdkfd_fence returns
> NULL if the f is NULL or the fence is not an amdgpu_amdkfd_fence, based on
> the fence_ops. But we're in a fence_ops callback here that should only be 
> called
> for amdgpu_amdkfd_fences.
>
> That said, if you need a check to satisfy a static checker, I suggest this:
>
>      return fence ? fence->timeline_name : NULL;

This worked perfectly.
>
>
> >
> >>        return fence->timeline_name;
> >>   }
> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> >> b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> >> index ba99e0f258ae..42fa137bdb2f 100644
> >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
> >> @@ -517,7 +517,9 @@ int kfd_dbg_trap_set_flags(struct kfd_process
> >> *target, uint32_t *flags)
> >>
> >>        for (i = 0; i < target->n_pdds; i++) {
> >>                struct kfd_topology_device *topo_dev =
> >> -                             
> >> kfd_topology_device_by_id(target->pdds[i]->dev->id);
> >> +                     kfd_topology_device_by_id(target->pdds[i]->dev->id);
> >> +             if (!topo_dev)
> >> +                     continue;
>
> This loop checks the capabilities of all the devices. If a device is not 
> found, we
> should assume the worst and return an error, instead of just assuming that 
> it'll
> be fine.
>

Done

>
> >>                uint32_t caps = topo_dev->node_props.capability;
> >>
> >>                if (!(caps &
> >> HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED)
> >> &&
> >> @@ -1071,6 +1073,8 @@ int kfd_dbg_trap_device_snapshot(struct
> >> kfd_process *target,
> >>        for (i = 0; i < tmp_num_devices; i++) {
> >>                struct kfd_process_device *pdd = target->pdds[i];
> >>                struct kfd_topology_device *topo_dev =
> >> kfd_topology_device_by_id(pdd->dev->id);
> >> +             if (!topo_dev)
> >> +                     continue;
>
> I'd return an error here as well, because we obviously don't have accurate
> device info.
>

Done

>
> >>
> >>                device_info.gpu_id = pdd->dev->id;
> >>                device_info.exception_status = pdd->exception_status;
> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> index f5d173f1ca3b..f40d93f82ede 100644
> >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> >> @@ -1685,6 +1685,9 @@ int kfd_process_device_init_vm(struct
> >> kfd_process_device *pdd,
> >>        struct kfd_node *dev;
> >>        int ret;
> >>
> >> +     if (!pdd)
> >> +             return -EINVAL;
> >> +
>
> We generally assume that functions are called with valid parameters.
> With that argument, we should probably remove the check for !drm_file
> below as well.

Done.  This is a good thing to note, but I want to have another lesson here.  
Is this true in general or only for internal/private APIs? I asked b/c this 
comment forced me to really look carefully the kfd_process_device_init_vm(), 
which is prototyped in kfd_priv.h.  This mean this and any other function from 
here should be called with valid params.  But public APIs should check that the 
params are valid, b/c this how we prevent crashes in our library?
>
> Regards,
>    Felix
>
>
> >>        if (!drm_file)
> >>                return -EINVAL;
> > Probably easier to just combine the !pdd and !drm_file into the same check.
> >
> >   Kent
> >> --
> >> 2.43.0

Reply via email to