[Public]
> -----Original Message-----
> From: Kim, Jonathan <[email protected]>
> Sent: Friday, October 17, 2025 2:38 PM
> To: Clement, Sunday <[email protected]>; amd-
> [email protected]
> Cc: Kasiviswanathan, Harish <[email protected]>; Kuehling,
> Felix <[email protected]>; Clement, Sunday
> <[email protected]>
> Subject: RE: [PATCH] drm/amdkfd: Fix nullpointer dereference
>
> [Public]
>
> > -----Original Message-----
> > From: amd-gfx <[email protected]> On Behalf Of
> > Sunday Clement
> > Sent: Friday, October 17, 2025 10:33 AM
> > To: [email protected]
> > Cc: Kasiviswanathan, Harish <[email protected]>;
> > Kuehling, Felix <[email protected]>; Clement, Sunday
> > <[email protected]>
> > Subject: [PATCH] drm/amdkfd: Fix nullpointer dereference
> >
> > If no device is found for the given proximity domain,
> > kfd_topology_device_by_proximity_domain_no_lock() returns NULL, and
> > the immediate !peer_dev->gpu check then dereferences a null pointer.
> >
> > Signed-off-by: Sunday Clement <[email protected]>
> > ---
> > drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > index 4a7180b46b71..6093d96c5892 100644
> > --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > @@ -2357,7 +2357,7 @@ static int kfd_create_vcrat_image_gpu(void
> > *pcrat_image,
> > if (kdev->kfd->hive_id) {
> > for (nid = 0; nid < proximity_domain; ++nid) {
> > peer_dev =
> > kfd_topology_device_by_proximity_domain_no_lock(nid);
> > - if (!peer_dev->gpu)
> > + if (!peer_dev || !peer_dev->gpu)
>
> Is this a real failure?
> If so, we should figure out why our assumption that proximity domain ids
> work as a counter for valid devices doesn't actually hold.
> Either way, probably better to throw an error (something like -ENODEV) rather
> than continue since IO link data has now been assigned garbage and we
> probably don't want to keep building the hive at this point.
>
> Jon
I don't think this is a real failure, or at least it could only happen in a
contrived situation, such as a race where a GPU is hot-unplugged (which I
don't think is even supported) and the topology changes while the vcrat for
another GPU is being created, so the lookup by proximity domain fails for the
removed device. That is probably not a realistic scenario. But as you said,
instead of silently continuing I could return -ENODEV, with something like
the following:

	if (!peer_dev)
		return -ENODEV;

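
For reference, here is a rough userspace sketch of the control flow I have in mind. It is only an illustration: every mock_* name is a hypothetical stand-in, not the real kfd structures or helpers, and it counts matching peers rather than building IO links. The one thing it is meant to show is that a failed lookup returns -ENODEV, while a device without a GPU or in a different hive is still skipped.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kfd structures; not the real definitions. */
struct mock_gpu {
	uint64_t hive_id;
};

struct mock_topology_dev {
	struct mock_gpu *gpu;	/* NULL for a node without a GPU */
};

#define MOCK_MAX_DOMAINS 8

/* Mock topology indexed by proximity domain; empty slots are NULL. */
static struct mock_topology_dev *mock_topology[MOCK_MAX_DOMAINS];

/* Mock lookup by proximity domain: returns NULL when no device is found. */
static struct mock_topology_dev *mock_dev_by_proximity_domain(uint32_t nid)
{
	return nid < MOCK_MAX_DOMAINS ? mock_topology[nid] : NULL;
}

/*
 * Count the peers in hive_id among domains [0, proximity_domain).
 * A failed lookup aborts with -ENODEV instead of being skipped.
 */
static int mock_count_hive_peers(uint64_t hive_id, uint32_t proximity_domain)
{
	struct mock_topology_dev *peer_dev;
	uint32_t nid;
	int peers = 0;

	for (nid = 0; nid < proximity_domain; ++nid) {
		peer_dev = mock_dev_by_proximity_domain(nid);
		if (!peer_dev)
			return -ENODEV;
		if (!peer_dev->gpu)
			continue;
		if (peer_dev->gpu->hive_id != hive_id)
			continue;
		peers++;
	}

	return peers;
}
```

In the real function the caller would also need to propagate the error, which I have not sketched here.
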
Thanks,
Sunday
>
> > continue;
> > if (peer_dev->gpu->kfd->hive_id != kdev->kfd->hive_id)
> > continue;
> > --
> > 2.43.0
>