On 08-Mar-26 1:48 AM, Mario Limonciello wrote:
On 3/6/2026 12:41 AM, Lazar, Lijo wrote:
On 06-Mar-26 11:40 AM, Mario Limonciello wrote:
On 3/5/2026 11:07 PM, Lazar, Lijo wrote:
On 06-Mar-26 3:35 AM, Mario Limonciello wrote:
I found more cases where a NULL version causes problems.
Add NULL checks as applicable.
Fixes: 39fc2bc4da00 ("drm/amdgpu: Protect GPU register accesses in powergated state in some paths")
Signed-off-by: Mario Limonciello <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index bc6f714e8763a..74cbe58484fe2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3463,6 +3463,9 @@ static void amdgpu_ip_block_hw_fini(struct amdgpu_ip_block *ip_block)
struct amdgpu_device *adev = ip_block->adev;
int r;
+ if (!ip_block->version)
+ return;
+
IP block versions are set during the discovery phase itself. This is a
very early init failure
Yes; this case is an NPI system where not all blocks are in discovery
yet. The system panics at bootup with NULL pointer dereferences in
multiple places instead of recovering cleanly and keeping fbdev. This
patch series sorts it out.
Blocks not in discovery shouldn't be added to ip list or should be
added differently.
and ideally the fix should be not to call any fini for such an early
failure.
As an alternative to this series?
Yes, if it's a failure as early as the discovery stage, we should
probably skip amdgpu_device_fini_hw altogether.
I experimented some more and feel that the solution I came up with is
correct. There are valid versions of everything at this time (the failed
IP block isn't there at that time).
My understanding of the situation is that this is an early exit, since
the driver doesn't recognize one IP block and hence cannot assign the
corresponding version functions. Without the discovery mechanism, the
equivalent case is the driver not detecting the device id. In both
cases, there shouldn't be any need to run through the sw/hw fini
sequences of the ip blocks.
So how would you know to skip fini? I guess check asic_funcs not to be
NULL?
One way is to undo the effect of set_ip_block within discovery itself.
For example, if there is a discovery error, call amdgpu_ip_block_clear()
or similar and remove any added ip blocks. num_ip_blocks will then be 0,
and in such cases don't run through any unwind sequence (that shouldn't
really be needed then). That is the same as the case where the driver is
not able to detect a valid discovery binary blob.
Thanks,
Lijo
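
For illustration, the clear-on-discovery-failure idea described above
could be modeled in plain userspace C roughly as follows. This is only a
sketch under assumptions: demo_device, ip_block_clear(), discovery() and
fini_hw() are hypothetical stand-ins, not the amdgpu structures or
functions, and the real unwind paths do considerably more than this.

```c
#include <stddef.h>
#include <stdbool.h>

#define MAX_IP_BLOCKS 8

/* Simplified stand-ins for the amdgpu structures; not the kernel definitions. */
struct ip_block_version {
	const char *name;
	void (*hw_fini)(void);
};

struct ip_block {
	const struct ip_block_version *version; /* NULL when nothing is assigned */
	bool hw;                                /* true once hw init has succeeded */
};

struct demo_device {
	struct ip_block ip_blocks[MAX_IP_BLOCKS];
	int num_ip_blocks;
};

static int fini_calls; /* counts hw_fini invocations, for demonstration only */

static void demo_hw_fini(void)
{
	fini_calls++;
}

static const struct ip_block_version demo_version = { "demo", demo_hw_fini };

/* Hypothetical helper: undo every block registration made so far. */
static void ip_block_clear(struct demo_device *dev)
{
	for (int i = 0; i < dev->num_ip_blocks; i++) {
		dev->ip_blocks[i].version = NULL;
		dev->ip_blocks[i].hw = false;
	}
	dev->num_ip_blocks = 0;
}

/*
 * Model of discovery: register each recognized block. A NULL entry stands
 * for a block the driver does not recognize; in that case everything added
 * so far is removed and an error is returned.
 */
static int discovery(struct demo_device *dev,
		     const struct ip_block_version **found, int n)
{
	for (int i = 0; i < n; i++) {
		if (!found[i] || dev->num_ip_blocks >= MAX_IP_BLOCKS) {
			ip_block_clear(dev);
			return -1;
		}
		dev->ip_blocks[dev->num_ip_blocks].version = found[i];
		dev->ip_blocks[dev->num_ip_blocks].hw = false;
		dev->num_ip_blocks++;
	}
	return 0;
}

/* Unwind path: with num_ip_blocks == 0 the loop body never executes. */
static void fini_hw(struct demo_device *dev)
{
	for (int i = dev->num_ip_blocks - 1; i >= 0; i--) {
		if (!dev->ip_blocks[i].hw)
			continue;
		dev->ip_blocks[i].version->hw_fini();
		dev->ip_blocks[i].hw = false;
	}
}
```

In this model, a failed discovery() leaves num_ip_blocks at 0, so
fini_hw() never reaches a block whose version is NULL and no per-block
NULL check is needed in the unwind loop.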
But then it's the same as what the second commit is already doing.
if (!ip_block->version->funcs->hw_fini) {
dev_err(adev->dev, "hw_fini of IP block <%s> not defined\n",
ip_block->version->funcs->name);
@@ -3496,6 +3499,8 @@ static void amdgpu_device_smu_fini_early(struct amdgpu_device *adev)
for (i = 0; i < adev->num_ip_blocks; i++) {
if (!adev->ip_blocks[i].status.hw)
continue;
+ if (!adev->ip_blocks[i].version)
+ continue;
if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_SMC) {
amdgpu_ip_block_hw_fini(&adev->ip_blocks[i]);
break;