[Public]

> -----Original Message-----
> From: Clement, Sunday <[email protected]>
> Sent: Friday, October 24, 2025 11:54 AM
> To: Kim, Jonathan <[email protected]>; [email protected]
> Cc: Kasiviswanathan, Harish <[email protected]>; Kuehling,
> Felix <[email protected]>
> Subject: RE: [PATCH] drm/amdkfd: Fix nullpointer dereference
>
> [Public]
>
> > -----Original Message-----
> > From: Kim, Jonathan <[email protected]>
> > Sent: Friday, October 17, 2025 2:38 PM
> > To: Clement, Sunday <[email protected]>; amd-
> > [email protected]
> > Cc: Kasiviswanathan, Harish <[email protected]>; Kuehling,
> > Felix <[email protected]>; Clement, Sunday
> > <[email protected]>
> > Subject: RE: [PATCH] drm/amdkfd: Fix nullpointer dereference
> >
> > [Public]
> >
> > > -----Original Message-----
> > > From: amd-gfx <[email protected]> On Behalf Of
> > > Sunday Clement
> > > Sent: Friday, October 17, 2025 10:33 AM
> > > To: [email protected]
> > > Cc: Kasiviswanathan, Harish <[email protected]>;
> > > Kuehling, Felix <[email protected]>; Clement, Sunday
> > > <[email protected]>
> > > Subject: [PATCH] drm/amdkfd: Fix nullpointer dereference
> > >
> > > In the event no device is found with the given proximity domain and
> > > kfd_topology_device_by_proximity_domain_no_lock() returns a null
> > > device immediately checking !peer_Dev->gpu will result in a null
> > > pointer dereference.
> > >
> > > Signed-off-by: Sunday Clement <[email protected]>
> > > ---
> > >  drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > > b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > > index 4a7180b46b71..6093d96c5892 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
> > > @@ -2357,7 +2357,7 @@ static int kfd_create_vcrat_image_gpu(void
> > > *pcrat_image,
> > >       if (kdev->kfd->hive_id) {
> > >               for (nid = 0; nid < proximity_domain; ++nid) {
> > >                       peer_dev =
> > > kfd_topology_device_by_proximity_domain_no_lock(nid);
> > > -                     if (!peer_dev->gpu)
> > > +                     if (!peer_dev || !peer_dev->gpu)
> >
> > Is this a real failure?
> > If so, we should figure out why our assumption that proximity domain ids as 
> > a
> > counter for valid devices should work but actually don't.
> > Either way, probably better to throw an error (something like -ENODEV) 
> > rather
> > than continue since IO link data has now been assigned garbage and we
> > probably don't want to keep building the hive at this point.
> >
> > Jon
>
> I think this failure is not real or at least could only really happen in a 
> contrived
> situation, like a race condition of a GPU being Hot unplugged(I don't think 
> this is
> even supported) so the topology dynamically changing while the vcrat for 
> another
> GPU is being created and that messing up finding a device by proximity domain 
> if
> the device was removed, probably not exactly a realistic scenario. But like 
> you
> said instead of silently continuing I could do something like the following 
> and
> return -ENODEV,
>                 if(!peer_dev) {
>                         return -ENODEV;
>                 }

I guess the break on check wouldn't hurt but yeah ... hot plugging an xGMI hive 
node during load seems physically impossible and surely not advised.

Jon

> Thanks,
> Sunday
>
> >
> > >                               continue;
> > >                       if (peer_dev->gpu->kfd->hive_id != 
> > > kdev->kfd->hive_id)
> > >                               continue;
> > > --
> > > 2.43.0
> >
>

Reply via email to