On Thursday 03/12 at 01:02 -0700, Nathan Chancellor wrote:
> Hi Calvin,
>
> On Mon, Mar 09, 2026 at 09:24:57PM -0700, Calvin Owens wrote:
> > Commit e1b385726f7f ("drm/amd/display: Add additional checks for PSP
> > footer size") introduced a use of an uninitialized stack variable
> > in dm_dmub_sw_init() (region_params.bss_data_size).
> >
> > Interestingly, this seems to cause no issue on normal kernels. But when
> > full LTO is enabled, it causes the compiler to "optimize" out huge
> > swaths of amdgpu initialization code, and the driver is unusable:
>
> Yeah, this appears to be a very unfortunate case of "clang encountered known
> undefined behavior and stopped code generation", which we would like to
> avoid but figuring out a proper upstreamable solution is hard. The most
> recent attempt:
>
> https://github.com/llvm/llvm-project/pull/146791
>
> My guess is that LTO allows inlining of
> dmub_srv_get_fw_meta_info_from_raw_fw() into dm_dmub_sw_init(), at which
> point it can see that the result of accessing an uninitialized
> region_params.bss_data_size will be used through
> fw_meta_info_params.fw_bss_data and gives up generating the rest of the
> function.

Thanks for looking, Nathan. I'll keep an eye on that and see if it's able
to catch this example. I've tried to come up with a minimal reproducer,
but I haven't had any luck yet (so far I always get the warning). Would
that be helpful at all?

I put the full W=2 output for the one file here in case anyone else
wants to look:

https://github.com/jcalvinowens/lkml-debug/blob/main/amdgpu-lto/gcc-warns.txt
https://github.com/jcalvinowens/lkml-debug/blob/main/amdgpu-lto/llvm-warns.txt

Somehow 'make drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.o' doesn't
work. I want to look at that later because it was mildly annoying while
digging into this.

> > amdgpu 0000:03:00.0: [drm] Loading DMUB firmware via PSP: version=0x07002F00
> > amdgpu 0000:03:00.0: sw_init of IP block <dm> failed 5
> > amdgpu 0000:03:00.0: amdgpu_device_ip_init failed
> > amdgpu 0000:03:00.0: Fatal error during GPU init
> >
> > It surprises me that neither gcc nor clang emit a warning about this: I
> > only found it by bisecting the LTO breakage.
>
> gcc's -Wmaybe-uninitialized is disabled by default for the kernel but
> even enabling it with KCFLAGS does not show an instance here, which I
> find quite surprising... for clang, it is harder because the warning
> happens early in the frontend where it might not be able to track a
> value that well.

GCC does flag what seems to me to be a real but benign warning about an
ERR_PTR check that doesn't handle NULL in the same file:

https://lore.kernel.org/lkml/6aaf2cf4bd19363a85f35e649685d7bdae400253.1773157137.git.cal...@wbinvd.org/

I'm also trying to find a minimal reproducer for GCC, no luck yet.

> > Fix by using the old value for region_params.bss_data_size in place of
> > the uninitialized reference, which makes amdgpu work with LTO again.
> >
> > Fixes: e1b385726f7f ("drm/amd/display: Add additional checks for PSP footer size")
> > Signed-off-by: Calvin Owens <[email protected]>
> > ---
> > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index b3d6f2cd8ab6..e69e61163ae9 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -2554,7 +2554,7 @@ static int dm_dmub_sw_init(struct amdgpu_device *adev)
> >  	fw_meta_info_params.fw_inst_const = adev->dm.dmub_fw->data +
> >  		le32_to_cpu(hdr->header.ucode_array_offset_bytes) +
> >  		PSP_HEADER_BYTES_256;
> > -	fw_meta_info_params.fw_bss_data = region_params.bss_data_size ? adev->dm.dmub_fw->data +
> > +	fw_meta_info_params.fw_bss_data = le32_to_cpu(hdr->bss_data_bytes) ? adev->dm.dmub_fw->data +
>
> Maybe it would be better to use fw_meta_info_params.bss_data_size
> instead of le32_to_cpu(hdr->bss_data_bytes)? Obviously it is the same
> value but it would result in a smaller change. It seems likely that this
> was just a copy and paste failure.

Agreed. That ends up being almost self-evidently correct if I force git
to add an extra context line with the assignment (I always forget I can
do that):

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index b3d6f2cd8ab6..0d1c772ef713 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2553,9 +2553,9 @@ static int dm_dmub_sw_init(struct amdgpu_device *adev)
 	fw_meta_info_params.bss_data_size = le32_to_cpu(hdr->bss_data_bytes);
 	fw_meta_info_params.fw_inst_const = adev->dm.dmub_fw->data +
 		le32_to_cpu(hdr->header.ucode_array_offset_bytes) +
 		PSP_HEADER_BYTES_256;
-	fw_meta_info_params.fw_bss_data = region_params.bss_data_size ? adev->dm.dmub_fw->data +
+	fw_meta_info_params.fw_bss_data = fw_meta_info_params.bss_data_size ? adev->dm.dmub_fw->data +
 		le32_to_cpu(hdr->header.ucode_array_offset_bytes) +
 		le32_to_cpu(hdr->inst_const_bytes) :
 		NULL;
 	fw_meta_info_params.custom_psp_footer_size = 0;
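
For anyone reading without the driver source at hand, here is a standalone
sketch of the pattern that hunk corrects. It is not the amdgpu code and not a
reproducer for the LTO behavior: the struct and function names
(meta_info_params, fill_meta_info) and the flat offset parameter are
hypothetical stand-ins. The only point is that the pointer decision is keyed
off a size that was assigned a few lines earlier, rather than off a field
that nothing has written yet.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the firmware meta-info parameters. */
struct meta_info_params {
	uint32_t bss_data_size;
	const uint8_t *fw_bss_data;
};

/*
 * Fill in the BSS size and pointer from already-decoded header values.
 * bss_offset and bss_bytes are assumed to come from the firmware header.
 */
static void fill_meta_info(struct meta_info_params *meta,
			   const uint8_t *fw_data,
			   uint32_t bss_offset,
			   uint32_t bss_bytes)
{
	meta->bss_data_size = bss_bytes;

	/*
	 * The buggy form tested a field of a different, still-uninitialized
	 * stack struct at this point:
	 *
	 *     meta->fw_bss_data = region.bss_data_size ? ... : NULL;
	 *
	 * Reading that field is undefined behavior. The fixed form reuses
	 * the value stored just above.
	 */
	meta->fw_bss_data = meta->bss_data_size ? fw_data + bss_offset : NULL;
}
```

With a nonzero bss_bytes the pointer lands at fw_data + bss_offset; with zero
it is NULL, which is the behavior the original code intended.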

I'll send a v2 in a little bit.

Thanks,
Calvin

> >
> > le32_to_cpu(hdr->header.ucode_array_offset_bytes) +
> > le32_to_cpu(hdr->inst_const_bytes) :
> > NULL;
> > fw_meta_info_params.custom_psp_footer_size = 0;
> > --
> > 2.47.3
> >