[Public] > -----Original Message----- > From: Clement, Sunday <[email protected]> > Sent: Friday, October 24, 2025 11:54 AM > To: Kim, Jonathan <[email protected]>; [email protected] > Cc: Kasiviswanathan, Harish <[email protected]>; Kuehling, > Felix <[email protected]> > Subject: RE: [PATCH] drm/amdkfd: Fix nullpointer dereference > > [Public] > > > -----Original Message----- > > From: Kim, Jonathan <[email protected]> > > Sent: Friday, October 17, 2025 2:38 PM > > To: Clement, Sunday <[email protected]>; amd- > > [email protected] > > Cc: Kasiviswanathan, Harish <[email protected]>; Kuehling, > > Felix <[email protected]>; Clement, Sunday > > <[email protected]> > > Subject: RE: [PATCH] drm/amdkfd: Fix nullpointer dereference > > > > [Public] > > > > > -----Original Message----- > > > From: amd-gfx <[email protected]> On Behalf Of > > > Sunday Clement > > > Sent: Friday, October 17, 2025 10:33 AM > > > To: [email protected] > > > Cc: Kasiviswanathan, Harish <[email protected]>; > > > Kuehling, Felix <[email protected]>; Clement, Sunday > > > <[email protected]> > > > Subject: [PATCH] drm/amdkfd: Fix nullpointer dereference > > > > > > In the event no device is found with the given proximity domain and > > > kfd_topology_device_by_proximity_domain_no_lock() returns a null > > > device immediately checking !peer_Dev->gpu will result in a null > > > pointer dereference. > > > > > > Signed-off-by: Sunday Clement <[email protected]> > > > --- > > > drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c > > > b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c > > > index 4a7180b46b71..6093d96c5892 100644 > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c > > > @@ -2357,7 +2357,7 @@ static int kfd_create_vcrat_image_gpu(void > > > *pcrat_image, > > > if (kdev->kfd->hive_id) { > > > for (nid = 0; nid < proximity_domain; ++nid) { > > > peer_dev = > > > kfd_topology_device_by_proximity_domain_no_lock(nid); > > > - if (!peer_dev->gpu) > > > + if (!peer_dev || !peer_dev->gpu) > > > > Is this a real failure? > > If so, we should figure out why our assumption that proximity domain ids as > > a > > counter for valid devices should work but actually don't. > > Either way, probably better to throw an error (something like -ENODEV) > > rather > > than continue since IO link data has now been assigned garbage and we > > probably don't want to keep building the hive at this point. > > > > Jon > > I think this failure is not real or at least could only really happen in a > contrived > situation, like a race condition of a GPU being Hot unplugged(I don't think > this is > even supported) so the topology dynamically changing while the vcrat for > another > GPU is being created and that messing up finding a device by proximity domain > if > the device was removed, probably not exactly a realistic scenario. But like > you > said instead of silently continuing I could do something like the following > and > return -ENODEV, > if(!peer_dev) { > return -ENODEV; > }
I guess the break on check wouldn't hurt but yeah ... hot plugging an xGMI hive node during load seems physically impossible and surely not advised. Jon > Thanks, > Sunday > > > > > > continue; > > > if (peer_dev->gpu->kfd->hive_id != > > > kdev->kfd->hive_id) > > > continue; > > > -- > > > 2.43.0 > > >
