[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Lazar, Lijo <[email protected]>
> Sent: Monday, March 23, 2026 11:46 AM
> To: Zhang, Jesse(Jie) <[email protected]>; [email protected]
> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: harden discovery TMR buffer allocation
>
>
>
> On 23-Mar-26 6:56 AM, Zhang, Jesse(Jie) wrote:
> >> -----Original Message-----
> >> From: Lazar, Lijo <[email protected]>
> >> Sent: Friday, March 20, 2026 6:32 PM
> >> To: Zhang, Jesse(Jie) <[email protected]>;
> >> [email protected]
> >> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> >> <[email protected]>
> >> Subject: Re: [PATCH] drm/amdgpu: harden discovery TMR buffer
> >> allocation
> >>
> >>
> >>
> >> On 20-Mar-26 3:25 PM, Jesse.Zhang wrote:
> >>> Some platforms report an invalidly large IP discovery TMR size, which
> >>> leads amdgpu_discovery_init() to attempt a large kmalloc allocation and
> >>> trigger page allocator warnings/failures during probe.
> >>>
> >>> Observed log excerpt:
> >>>     WARNING: mm/page_alloc.c:5216 at __alloc_frozen_pages_noprof+0x29e/0x340
> >>>     ...
> >>>     ___kmalloc_large_node+0xf2/0x130
> >>>     __kmalloc_noprof+0x442/0x6b0
> >>>     amdgpu_discovery_init+0x161/0xa00 [amdgpu]
> >>>    Fatal error during GPU init
> >>>    probe with driver amdgpu failed with error -12
> >>
> >> This looks like a different issue. Do you have a trace of which path
> >> it takes and the value seen?
> > The function amdgpu_discovery_get_tmr_info() reads the discovery table size
> > from the TMR info via ACPI. In the attached log, the discovered size is
> > 0x11800000 (280 MiB).
> > This size is then passed to kzalloc() later in amdgpu_discovery_init(),
> > which leads to an allocation failure (-12) and the page-allocator warning.
>
> Thanks, it's a regression introduced with a recent change.
>
> The fix should be here.
>
> https://gitlab.freedesktop.org/agd5f/linux/-/blob/drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c#L327
>
> The ACPI function returns the full TMR size (not that of the discovery table
> alone). The discovery table size remains DISCOVERY_TMR_SIZE. Please change
> this to DISCOVERY_TMR_SIZE and also add a Fixes tag.
Thanks Lijo, I will update the patch.

Thanks
Jesse
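
For readers following the thread, the suggestion above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the updated patch: `struct sketch_tmr_info` and `sketch_discovery_alloc_size()` are hypothetical names, and the value given to `SKETCH_DISCOVERY_TMR_SIZE` is a placeholder rather than the driver's real `DISCOVERY_TMR_SIZE`. The point it shows is that the size reported for the whole TMR is not used as the size of the discovery buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMPTION: placeholder value for illustration only. The real
 * DISCOVERY_TMR_SIZE is defined in the amdgpu driver headers.
 */
#define SKETCH_DISCOVERY_TMR_SIZE (10u << 10)

/* Hypothetical stand-in for what the ACPI query reports. */
struct sketch_tmr_info {
	uint64_t tmr_offset; /* where the TMR starts */
	uint64_t tmr_size;   /* size of the whole TMR, not of the discovery table */
};

/*
 * Hypothetical helper: size of the buffer to allocate for the discovery
 * binary. The full TMR size is deliberately ignored here, because per the
 * review comment the discovery table size stays fixed.
 */
static uint32_t sketch_discovery_alloc_size(const struct sketch_tmr_info *info)
{
	assert(info != NULL);
	(void)info->tmr_size; /* reported, but not used for the allocation */
	return SKETCH_DISCOVERY_TMR_SIZE;
}
```

With `tmr_size` set to the 0x11800000 seen in the log, the helper still returns the fixed table size, so the allocation request stays small.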
>
> Thanks,
> Lijo
>
> >
> > [ 1337.084630] ------------[ cut here ]------------
> > [ 1337.084634] WARNING: mm/page_alloc.c:5216 at __alloc_frozen_pages_noprof+0x29e/0x340, CPU#0: kworker/0:0/9
> > [ 1337.084652] Modules linked in: amdgpu(E+) amdxcp drm_panel_backlight_quirks gpu_sched drm_buddy drm_ttm_helper ttm drm_exec drm_suballoc_helper drm_client_lib drm_display_helper cec rc_core drm_kms_helper video xt_comment xt_conntrack xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat x_tables nf_tables nfnetlink xfrm_user xfrm_algo overlay binfmt_misc nls_iso8859_1 intel_rapl_msr amd_atl intel_rapl_common amd64_edac edac_mce_amd kvm_amd ccp kvm rapl wmi_bmof mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore drm autofs4 btrfs blake2b libblake2b raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear dax_hmem cxl_acpi cxl_port nvme igb cxl_core ghash_clmulni_intel einj dca nvme_core i2c_piix4 i2c_algo_bit i2c_smbus wmi aesni_intel
> > [ 1337.084915] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Tainted: G E 6.19.0+ #79 PREEMPT(voluntary)
> > [ 1337.084925] Tainted: [E]=UNSIGNED_MODULE
> > [ 1337.084929] Hardware name: AMD Corporation Sh54p/Sh54p, BIOS RMP100CAS 11/07/2025
> > [ 1337.084934] Workqueue: events work_for_cpu_fn
> > [ 1337.084948] RIP: 0010:__alloc_frozen_pages_noprof+0x29e/0x340
> > [ 1337.084954] Code: e9 b6 fe ff ff 83 fe 0a 0f 86 ec fd ff ff 0f b6 1d a4 b0 13 02 80 fb 01 0f 87 75 66 b6 ff 83 e3 01 75 09 c6 05 8f b0 13 02 01 <0f> 0b 45 31 ff e9 12 ff ff ff a9 00 00 08 00 75 62 44 89 e1 80 e1
> > [ 1337.084960] RSP: 0018:ffffc90000123970 EFLAGS: 00010246
> > [ 1337.084967] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > [ 1337.084971] RDX: 0000000000000000 RSI: 0000000000000011 RDI: 0000000000000000
> > [ 1337.084975] RBP: ffffc900001239c8 R08: 0000000000000000 R09: 0000000000000001
> > [ 1337.084979] R10: ffffc90000123b18 R11: 0000000000000004 R12: 0000000000040dc0
> > [ 1337.084984] R13: 0000000000000011 R14: ffffffffffffffff R15: 0000000000000000
> > [ 1337.084988] FS: 0000000000000000(0000) GS:ffff88a64bec9000(0000) knlGS:0000000000000000
> > [ 1337.084994] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1337.084998] CR2: 000000c002881000 CR3: 000000046b090009 CR4: 0000000000770ef0
> > [ 1337.085003] PKRU: 55555554
> > [ 1337.085006] Call Trace:
> > [ 1337.085010] <TASK>
> > [ 1337.085030] alloc_pages_mpol+0x7e/0x190
> > [ 1337.085041] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.085057] ? amdgpu_discovery_init+0x161/0xa00 [amdgpu]
> > [ 1337.085836] alloc_frozen_pages_noprof+0x58/0x80
> > [ 1337.085842] ___kmalloc_large_node+0xf2/0x130
> > [ 1337.085848] ? vprintk_emit+0x2aa/0x590
> > [ 1337.085857] __kmalloc_large_node_noprof+0x25/0xc0
> > [ 1337.085862] __kmalloc_noprof+0x442/0x6b0
> > [ 1337.085867] ? vprintk+0x1c/0x50
> > [ 1337.085870] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.085873] ? _printk+0x5b/0x80
> > [ 1337.085880] amdgpu_discovery_init+0x161/0xa00 [amdgpu]
> > [ 1337.086193] ? amdgpu_discovery_init+0x161/0xa00 [amdgpu]
> > [ 1337.086475] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.086481] amdgpu_discovery_reg_base_init+0x1e/0x6e0 [amdgpu]
> > [ 1337.086755] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.086760] amdgpu_discovery_set_ip_blocks+0x1cf7/0x2b30 [amdgpu]
> > [ 1337.087045] ? raw_pci_read+0x2d/0x50
> > [ 1337.087053] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087056] ? pci_read+0x30/0x40
> > [ 1337.087059] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087062] ? pci_bus_read_config_dword+0x4d/0x80
> > [ 1337.087069] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087073] ? pcie_capability_read_dword+0xb7/0xe0
> > [ 1337.087079] amdgpu_device_init+0x1072/0x3470 [amdgpu]
> > [ 1337.087402] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087409] ? pci_read+0x30/0x40
> > [ 1337.087414] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087420] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087426] ? pci_read_config_word+0x2d/0x50
> > [ 1337.087434] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087441] ? do_pci_enable_device+0x11b/0x150
> > [ 1337.087450] ? pci_update_current_state+0x6f/0xa0
> > [ 1337.087463] amdgpu_driver_load_kms+0x1e/0xc0 [amdgpu]
> > [ 1337.087750] amdgpu_pci_probe+0x2c5/0x730 [amdgpu]
> > [ 1337.088036] local_pci_probe+0x4f/0xb0
> > [ 1337.088046] work_for_cpu_fn+0x1e/0x30
> > [ 1337.088051] process_scheduled_works+0xa6/0x420
> > [ 1337.088060] worker_thread+0x12a/0x270
> > [ 1337.088066] kthread+0x10d/0x230
> > [ 1337.088074] ? __pfx_worker_thread+0x10/0x10
> > [ 1337.088078] ? __pfx_kthread+0x10/0x10
> > [ 1337.088086] ret_from_fork+0x17c/0x1f0
> > [ 1337.088094] ? __pfx_kthread+0x10/0x10
> > [ 1337.088099] ret_from_fork_asm+0x1a/0x30
> > [ 1337.088114] </TASK>
> > [ 1337.088118] ---[ end trace 0000000000000000 ]---
> > [ 1337.096795] amdgpu 0000:01:00.0: Fatal error during GPU init
> > [ 1337.103593] amdgpu 0000:01:00.0: probe with driver amdgpu failed with error -12
> >
> > Thanks
> > Jesse
> >>
> >> Thanks,
> >> Lijo
> >>
> >>>
> >>> Fix by:
> >>> - validating discovery size and falling back to DISCOVERY_TMR_SIZE when
> >>>     size is zero or out of expected range;
> >>> - using kvzalloc() for discovery buffer allocation to avoid high-order
> >>>     contiguous-page allocation failures;
> >>> - using kvfree() on all release paths.
> >>>
> >>> Signed-off-by: Jesse Zhang <[email protected]>
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 19 ++++++++++++++++---
> >>>    1 file changed, 16 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> index 5a4e63e1ad93..a6b49378c495 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> @@ -329,7 +329,20 @@ static int amdgpu_discovery_get_tmr_info(struct amdgpu_device *adev,
> >>>              }
> >>>      }
> >>>    out:
> >>> -   adev->discovery.bin = kzalloc(adev->discovery.size, GFP_KERNEL);
> >>> +   if (!adev->discovery.size || adev->discovery.size > DISCOVERY_TMR_SIZE) {
> >>> +           dev_warn(adev->dev,
> >>> +                    "invalid discovery size 0x%x, fallback to default 0x%x\n",
> >>> +                    adev->discovery.size, DISCOVERY_TMR_SIZE);
> >>> +           /*
> >>> +            * Some platforms may expose garbage TMR size through scratch/ACPI.
> >>> +            * Fall back to legacy layout in VRAM when available.
> >>> +            */
> >>> +           if (!*is_tmr_in_sysmem && vram_size)
> >>> +                   adev->discovery.offset = (vram_size << 20) - DISCOVERY_TMR_OFFSET;
> >>> +           adev->discovery.size = DISCOVERY_TMR_SIZE;
> >>> +   }
> >>> +
> >>> +   adev->discovery.bin = kvzalloc(adev->discovery.size, GFP_KERNEL);
> >>>      if (!adev->discovery.bin)
> >>>              return -ENOMEM;
> >>>      adev->discovery.debugfs_blob.data = adev->discovery.bin;
> >>> @@ -694,7 +707,7 @@ static int amdgpu_discovery_init(struct amdgpu_device *adev)
> >>>      return 0;
> >>>
> >>>    out:
> >>> -   kfree(adev->discovery.bin);
> >>> +   kvfree(adev->discovery.bin);
> >>>      adev->discovery.bin = NULL;
> >>>      if ((amdgpu_discovery != 2) &&
> >>>          (RREG32(mmIP_DISCOVERY_VERSION) == 4))
> >>> @@ -707,7 +720,7 @@ static void amdgpu_discovery_sysfs_fini(struct amdgpu_device *adev);
> >>>    void amdgpu_discovery_fini(struct amdgpu_device *adev)
> >>>    {
> >>>      amdgpu_discovery_sysfs_fini(adev);
> >>> -   kfree(adev->discovery.bin);
> >>> +   kvfree(adev->discovery.bin);
> >>>      adev->discovery.bin = NULL;
> >>>    }
> >>>
> >
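
The allocation failure discussed in the thread can be checked with a little arithmetic. The following is a userspace sketch, not kernel code: `sketch_get_order()` only models what the kernel's `get_order()` computes, and the two constants are assumptions (4 KiB pages and a maximum page order of 10, the usual x86-64 defaults). For the reported size 0x11800000 it yields order 17, which appears consistent with the 0x11 in RSI and R13 in the register dump, and is well above the assumed limit.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT     12 /* ASSUMPTION: 4 KiB pages, as on x86-64 */
#define SKETCH_MAX_PAGE_ORDER 10 /* ASSUMPTION: usual x86-64 default limit */

/*
 * Smallest order such that (1 << order) pages cover `size` bytes.
 * Models the result of the kernel's get_order(); hypothetical helper.
 */
static unsigned int sketch_get_order(unsigned long long size)
{
	unsigned long long pages;
	unsigned int order = 0;

	assert(size > 0);
	/* Round the byte count up to whole pages. */
	pages = (size + (1ULL << SKETCH_PAGE_SHIFT) - 1) >> SKETCH_PAGE_SHIFT;
	/* Round the page count up to the next power of two. */
	while ((1ULL << order) < pages)
		order++;
	return order;
}

/* Nonzero when a physically contiguous request of `size` exceeds the limit. */
static int sketch_order_too_large(unsigned long long size)
{
	return sketch_get_order(size) > SKETCH_MAX_PAGE_ORDER;
}
```

Under these assumptions the largest contiguous request is 2^10 pages of 4 KiB, i.e. 4 MiB, so a 280 MiB request cannot be served by the page allocator regardless of how much memory is free.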
