On 08-Mar-26 1:48 AM, Mario Limonciello wrote:


On 3/6/2026 12:41 AM, Lazar, Lijo wrote:


On 06-Mar-26 11:40 AM, Mario Limonciello wrote:


On 3/5/2026 11:07 PM, Lazar, Lijo wrote:


On 06-Mar-26 3:35 AM, Mario Limonciello wrote:
I found more cases where a NULL version causes problems.
Add NULL checks as applicable.

Fixes: 39fc2bc4da00 ("drm/amdgpu: Protect GPU register accesses in powergated state in some paths")
Signed-off-by: Mario Limonciello <[email protected]>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
  1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index bc6f714e8763a..74cbe58484fe2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3463,6 +3463,9 @@ static void amdgpu_ip_block_hw_fini(struct amdgpu_ip_block *ip_block)
      struct amdgpu_device *adev = ip_block->adev;
      int r;
+    if (!ip_block->version)
+        return;
+

ip block versions are set during discovery phase itself. This is a very early init failure

Yes; this case is an NPI system where not all blocks are in discovery yet. The system panics at bootup with NULL pointer dereferences in multiple places instead of recovering cleanly and keeping fbdev.  This patch series sorts it out.


Blocks not in discovery shouldn't be added to ip list or should be added differently.

and ideally the fix should be not to call any fini for such an early failure.

As an alternative to this series?


Yes, if it's a failure as early as in discovery stage, probably we should skip amdgpu_device_fini_hw altogether.

I experimented some more and feel that the solution I came up with is correct. There are valid versions of everything at this time (the failed IP block isn't there at that time).


My understanding of the situation is that this is an early exit, since the driver doesn't recognize one IP block and hence cannot assign the corresponding version functions. Without the discovery mechanism, the equivalent case is the driver not detecting the device ID. In both cases, there shouldn't be any need to run through the sw/hw fini sequences of the IP blocks.

So how would you know to skip fini?  I guess check asic_funcs not to be NULL?


One way is to undo the effect of set_ip_block within discovery itself. For example, if there is a discovery error, call amdgpu_ip_block_clear() or similar and remove any added IP blocks. num_ip_blocks will then be 0, and in such cases don't run through any unwind sequence (it shouldn't really be needed then). The same applies if the driver is not able to detect a valid discovery binary blob.
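
To make the idea concrete, here is a rough userspace sketch of the "clear on discovery error" approach. It is only a simplified model under stated assumptions, not the amdgpu code: struct device_model, ip_block_add(), ip_block_clear(), discovery_set_ip_blocks() and device_fini_hw() are hypothetical stand-ins, and an unrecognized block is modelled as a NULL version pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_IP_BLOCKS 16 /* hypothetical limit for this model */

/* Simplified stand-ins for the driver structures; not the kernel definitions. */
struct ip_block_version {
	const char *name;
};

struct ip_block {
	const struct ip_block_version *version;
};

struct device_model {
	struct ip_block ip_blocks[MAX_IP_BLOCKS];
	int num_ip_blocks;
	int fini_calls; /* counts fini invocations, for illustration only */
};

/* Hypothetical helper: drop every block added so far. */
static void ip_block_clear(struct device_model *dev)
{
	memset(dev->ip_blocks, 0, sizeof(dev->ip_blocks));
	dev->num_ip_blocks = 0;
}

/* Add one block; a NULL version models a block the driver does not know. */
static int ip_block_add(struct device_model *dev,
			const struct ip_block_version *version)
{
	if (!version || dev->num_ip_blocks >= MAX_IP_BLOCKS)
		return -1;

	dev->ip_blocks[dev->num_ip_blocks++].version = version;
	return 0;
}

/* Models discovery: on any failure, undo all additions before returning. */
static int discovery_set_ip_blocks(struct device_model *dev,
				   const struct ip_block_version *const *versions,
				   int count)
{
	for (int i = 0; i < count; i++) {
		if (ip_block_add(dev, versions[i]) != 0) {
			ip_block_clear(dev);
			return -1;
		}
	}
	return 0;
}

/* Unwind path: with num_ip_blocks == 0 the loop body never runs. */
static void device_fini_hw(struct device_model *dev)
{
	for (int i = dev->num_ip_blocks - 1; i >= 0; i--) {
		/* Invariant: every block still in the list has a version. */
		assert(dev->ip_blocks[i].version != NULL);
		dev->fini_calls++;
	}
}
```

With this shape, a failed discovery leaves num_ip_blocks at 0, so the unwind loop is a no-op and the fini paths would not need per-block NULL checks on version. Whether that maps cleanly onto the real driver is exactly what is being discussed here.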

Thanks,
Lijo

But then it's the same as the second commit is doing already.


Thanks,
Lijo


Thanks,
Lijo

      if (!ip_block->version->funcs->hw_fini) {
          dev_err(adev->dev, "hw_fini of IP block <%s> not defined\n",
              ip_block->version->funcs->name);
@@ -3496,6 +3499,8 @@ static void amdgpu_device_smu_fini_early(struct amdgpu_device *adev)
      for (i = 0; i < adev->num_ip_blocks; i++) {
          if (!adev->ip_blocks[i].status.hw)
              continue;
+        if (!adev->ip_blocks[i].version)
+            continue;
          if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC) {
              amdgpu_ip_block_hw_fini(&adev->ip_blocks[i]);
              break;




