[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Lazar, Lijo <[email protected]>
> Sent: Monday, March 23, 2026 11:46 AM
> To: Zhang, Jesse(Jie) <[email protected]>; [email protected]
> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: harden discovery TMR buffer allocation
>
>
>
> On 23-Mar-26 6:56 AM, Zhang, Jesse(Jie) wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> >> -----Original Message-----
> >> From: Lazar, Lijo <[email protected]>
> >> Sent: Friday, March 20, 2026 6:32 PM
> >> To: Zhang, Jesse(Jie) <[email protected]>;
> >> [email protected]
> >> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> >> <[email protected]>
> >> Subject: Re: [PATCH] drm/amdgpu: harden discovery TMR buffer
> >> allocation
> >>
> >>
> >>
> >> On 20-Mar-26 3:25 PM, Jesse.Zhang wrote:
> >>> Some platforms report an invalidly large IP discovery TMR size,
> >>> which leads amdgpu_discovery_init() to attempt a large kmalloc
> >>> allocation and trigger page allocator warnings/failures during probe.
> >>>
> >>> Observed log excerpt:
> >>>   WARNING: mm/page_alloc.c:5216 at __alloc_frozen_pages_noprof+0x29e/0x340
> >>>   ...
> >>>   ___kmalloc_large_node+0xf2/0x130
> >>>   __kmalloc_noprof+0x442/0x6b0
> >>>   amdgpu_discovery_init+0x161/0xa00 [amdgpu]
> >>>   Fatal error during GPU init
> >>>   probe with driver amdgpu failed with error -12
> >>
> >> This looks like a different issue. Do you have a trace of which path
> >> it takes and the value seen?
> >
> > The function amdgpu_discovery_get_tmr_info() reads the discovery table
> > size from the TMR info via ACPI. In the attached log, the discovered
> > size is 0x11800000 (280 MiB).
> > This size is then passed to kzalloc() later in amdgpu_discovery_init(),
> > which leads to an allocation failure (-12) and the page-allocator warning.
> >
> Thanks, it's a regression introduced with a recent change.
>
> The fix should be here.
>
> https://gitlab.freedesktop.org/agd5f/linux/-/blob/drm-next/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c#L327
>
> The ACPI function will return the full TMR size (not that of discovery
> alone). The discovery table size remains DISCOVERY_TMR_SIZE. Please change
> this to DISCOVERY_TMR_SIZE and also add a Fixes tag.

Thanks Lijo, I will update the patch.

Thanks
Jesse

>
> Thanks,
> Lijo
>
> >
> > [ 1337.084630] ------------[ cut here ]------------
> > [ 1337.084634] WARNING: mm/page_alloc.c:5216 at __alloc_frozen_pages_noprof+0x29e/0x340, CPU#0: kworker/0:0/9
> > [ 1337.084652] Modules linked in: amdgpu(E+) amdxcp
> >   drm_panel_backlight_quirks gpu_sched drm_buddy drm_ttm_helper ttm
> >   drm_exec drm_suballoc_helper drm_client_lib drm_display_helper cec
> >   rc_core drm_kms_helper video xt_comment xt_conntrack xt_MASQUERADE
> >   bridge stp llc xt_set ip_set nft_chain_nat nf_nat nf_conntrack
> >   nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat x_tables
> >   nf_tables nfnetlink xfrm_user xfrm_algo overlay binfmt_misc
> >   nls_iso8859_1 intel_rapl_msr amd_atl intel_rapl_common amd64_edac
> >   edac_mce_amd kvm_amd ccp kvm rapl wmi_bmof mac_hid sch_fq_codel
> >   dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore drm
> >   autofs4 btrfs blake2b libblake2b raid10 raid456 async_raid6_recov
> >   async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0
> >   linear dax_hmem cxl_acpi cxl_port nvme igb cxl_core
> >   ghash_clmulni_intel einj dca nvme_core i2c_piix4 i2c_algo_bit
> >   i2c_smbus wmi aesni_intel
> > [ 1337.084915] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Tainted: G E 6.19.0+ #79 PREEMPT(voluntary)
> > [ 1337.084925] Tainted: [E]=UNSIGNED_MODULE
> > [ 1337.084929] Hardware name: AMD Corporation Sh54p/Sh54p, BIOS RMP100CAS 11/07/2025
> > [ 1337.084934] Workqueue: events work_for_cpu_fn
> > [ 1337.084948] RIP: 0010:__alloc_frozen_pages_noprof+0x29e/0x340
> > [ 1337.084954] Code: e9 b6 fe ff ff 83 fe 0a 0f 86 ec fd ff ff 0f b6
> >   1d a4 b0 13 02 80 fb 01 0f 87 75 66 b6 ff 83 e3 01 75 09 c6 05 8f b0
> >   13 02 01 <0f> 0b 45 31 ff e9 12 ff ff ff a9 00 00 08 00 75 62 44 89 e1
> >   80 e1
> > [ 1337.084960] RSP: 0018:ffffc90000123970 EFLAGS: 00010246
> > [ 1337.084967] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> > [ 1337.084971] RDX: 0000000000000000 RSI: 0000000000000011 RDI: 0000000000000000
> > [ 1337.084975] RBP: ffffc900001239c8 R08: 0000000000000000 R09: 0000000000000001
> > [ 1337.084979] R10: ffffc90000123b18 R11: 0000000000000004 R12: 0000000000040dc0
> > [ 1337.084984] R13: 0000000000000011 R14: ffffffffffffffff R15: 0000000000000000
> > [ 1337.084988] FS: 0000000000000000(0000) GS:ffff88a64bec9000(0000) knlGS:0000000000000000
> > [ 1337.084994] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1337.084998] CR2: 000000c002881000 CR3: 000000046b090009 CR4: 0000000000770ef0
> > [ 1337.085003] PKRU: 55555554
> > [ 1337.085006] Call Trace:
> > [ 1337.085010] <TASK>
> > [ 1337.085030] alloc_pages_mpol+0x7e/0x190
> > [ 1337.085041] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.085057] ? amdgpu_discovery_init+0x161/0xa00 [amdgpu]
> > [ 1337.085836] alloc_frozen_pages_noprof+0x58/0x80
> > [ 1337.085842] ___kmalloc_large_node+0xf2/0x130
> > [ 1337.085848] ? vprintk_emit+0x2aa/0x590
> > [ 1337.085857] __kmalloc_large_node_noprof+0x25/0xc0
> > [ 1337.085862] __kmalloc_noprof+0x442/0x6b0
> > [ 1337.085867] ? vprintk+0x1c/0x50
> > [ 1337.085870] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.085873] ? _printk+0x5b/0x80
> > [ 1337.085880] amdgpu_discovery_init+0x161/0xa00 [amdgpu]
> > [ 1337.086193] ? amdgpu_discovery_init+0x161/0xa00 [amdgpu]
> > [ 1337.086475] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.086481] amdgpu_discovery_reg_base_init+0x1e/0x6e0 [amdgpu]
> > [ 1337.086755] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.086760] amdgpu_discovery_set_ip_blocks+0x1cf7/0x2b30 [amdgpu]
> > [ 1337.087045] ? raw_pci_read+0x2d/0x50
> > [ 1337.087053] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087056] ? pci_read+0x30/0x40
> > [ 1337.087059] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087062] ? pci_bus_read_config_dword+0x4d/0x80
> > [ 1337.087069] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087073] ? pcie_capability_read_dword+0xb7/0xe0
> > [ 1337.087079] amdgpu_device_init+0x1072/0x3470 [amdgpu]
> > [ 1337.087402] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087409] ? pci_read+0x30/0x40
> > [ 1337.087414] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087420] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087426] ? pci_read_config_word+0x2d/0x50
> > [ 1337.087434] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 1337.087441] ? do_pci_enable_device+0x11b/0x150
> > [ 1337.087450] ? pci_update_current_state+0x6f/0xa0
> > [ 1337.087463] amdgpu_driver_load_kms+0x1e/0xc0 [amdgpu]
> > [ 1337.087750] amdgpu_pci_probe+0x2c5/0x730 [amdgpu]
> > [ 1337.088036] local_pci_probe+0x4f/0xb0
> > [ 1337.088046] work_for_cpu_fn+0x1e/0x30
> > [ 1337.088051] process_scheduled_works+0xa6/0x420
> > [ 1337.088060] worker_thread+0x12a/0x270
> > [ 1337.088066] kthread+0x10d/0x230
> > [ 1337.088074] ? __pfx_worker_thread+0x10/0x10
> > [ 1337.088078] ? __pfx_kthread+0x10/0x10
> > [ 1337.088086] ret_from_fork+0x17c/0x1f0
> > [ 1337.088094] ? __pfx_kthread+0x10/0x10
> > [ 1337.088099] ret_from_fork_asm+0x1a/0x30
> > [ 1337.088114] </TASK>
> > [ 1337.088118] ---[ end trace 0000000000000000 ]---
> > [ 1337.096795] amdgpu 0000:01:00.0: Fatal error during GPU init
> > [ 1337.103593] amdgpu 0000:01:00.0: probe with driver amdgpu failed with error -12
> >
> > Thanks
> > Jesse
> >>
> >> Thanks,
> >> Lijo
> >>
> >>>
> >>> Fix by:
> >>> - validating discovery size and falling back to DISCOVERY_TMR_SIZE when
> >>>   size is zero or out of expected range;
> >>> - using kvzalloc() for discovery buffer allocation to avoid high-order
> >>>   contiguous-page allocation failures;
> >>> - using kvfree() on all release paths.
> >>>
> >>> Signed-off-by: Jesse Zhang <[email protected]>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 19 ++++++++++++++++---
> >>>  1 file changed, 16 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> index 5a4e63e1ad93..a6b49378c495 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> >>> @@ -329,7 +329,20 @@ static int amdgpu_discovery_get_tmr_info(struct amdgpu_device *adev,
> >>>  		}
> >>>  	}
> >>>  out:
> >>> -	adev->discovery.bin = kzalloc(adev->discovery.size, GFP_KERNEL);
> >>> +	if (!adev->discovery.size || adev->discovery.size > DISCOVERY_TMR_SIZE) {
> >>> +		dev_warn(adev->dev,
> >>> +			 "invalid discovery size 0x%x, fallback to default 0x%x\n",
> >>> +			 adev->discovery.size, DISCOVERY_TMR_SIZE);
> >>> +		/*
> >>> +		 * Some platforms may expose garbage TMR size through scratch/ACPI.
> >>> +		 * Fall back to legacy layout in VRAM when available.
> >>> +		 */
> >>> +		if (!*is_tmr_in_sysmem && vram_size)
> >>> +			adev->discovery.offset = (vram_size << 20) - DISCOVERY_TMR_OFFSET;
> >>> +		adev->discovery.size = DISCOVERY_TMR_SIZE;
> >>> +	}
> >>> +
> >>> +	adev->discovery.bin = kvzalloc(adev->discovery.size, GFP_KERNEL);
> >>>  	if (!adev->discovery.bin)
> >>>  		return -ENOMEM;
> >>>  	adev->discovery.debugfs_blob.data = adev->discovery.bin;
> >>> @@ -694,7 +707,7 @@ static int amdgpu_discovery_init(struct amdgpu_device *adev)
> >>>  	return 0;
> >>>
> >>>  out:
> >>> -	kfree(adev->discovery.bin);
> >>> +	kvfree(adev->discovery.bin);
> >>>  	adev->discovery.bin = NULL;
> >>>  	if ((amdgpu_discovery != 2) &&
> >>>  	    (RREG32(mmIP_DISCOVERY_VERSION) == 4))
> >>> @@ -707,7 +720,7 @@ static void amdgpu_discovery_sysfs_fini(struct amdgpu_device *adev);
> >>>  void amdgpu_discovery_fini(struct amdgpu_device *adev)
> >>>  {
> >>>  	amdgpu_discovery_sysfs_fini(adev);
> >>> -	kfree(adev->discovery.bin);
> >>> +	kvfree(adev->discovery.bin);
> >>>  	adev->discovery.bin = NULL;
> >>>  }
> >>>
> >
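
For reference, the arithmetic behind the failure and the size check from the
patch can be sketched in plain userspace C. This is only an illustration, not
driver code: the sketch_* names are hypothetical, and the 4 KiB page size, the
maximum page order of 10, and the value used as a stand-in for
DISCOVERY_TMR_SIZE are all assumptions that may not match the kernel
configuration or the amdgpu headers in question.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration only; not taken from the driver headers. */
#define SKETCH_PAGE_SHIFT     12u          /* assumes 4 KiB pages */
#define SKETCH_MAX_PAGE_ORDER 10u          /* assumed page allocator limit */
#define SKETCH_DEFAULT_SIZE   (10u << 10)  /* stand-in for DISCOVERY_TMR_SIZE */

/* Smallest order such that (1 << order) pages cover `size` bytes. */
static unsigned int sketch_get_order(uint64_t size)
{
	uint64_t pages;
	unsigned int order = 0;

	assert(size != 0);
	pages = (size + (1ull << SKETCH_PAGE_SHIFT) - 1) >> SKETCH_PAGE_SHIFT;
	while ((1ull << order) < pages)
		order++;
	return order;
}

/*
 * Same shape as the check in the patch: a reported size of zero, or one
 * larger than the default, is replaced by the default size.
 */
static uint32_t sketch_pick_discovery_size(uint32_t reported)
{
	if (reported == 0 || reported > SKETCH_DEFAULT_SIZE)
		return SKETCH_DEFAULT_SIZE;
	return reported;
}
```

Under these assumptions, the reported 0x11800000 bytes would need an order-17
contiguous allocation, far above the assumed limit, while the fallback size
fits in order 2. An order of 17 would also be consistent with the 0x11 seen in
RSI and R13 in the trace, if those registers carry the requested order.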
