[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Lazar, Lijo <[email protected]>
> Sent: Friday, March 20, 2026 6:32 PM
> To: Zhang, Jesse(Jie) <[email protected]>; [email protected]
> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: harden discovery TMR buffer allocation
>
>
>
> On 20-Mar-26 3:25 PM, Jesse.Zhang wrote:
> > Some platforms report an invalidly large IP discovery TMR size, which
> > leads amdgpu_discovery_init() to attempt a large kmalloc allocation and
> > trigger page allocator warnings/failures during probe.
> >
> > Observed log excerpt:
> >   WARNING: mm/page_alloc.c:5216 at __alloc_frozen_pages_noprof+0x29e/0x340
> >   ...
> >   ___kmalloc_large_node+0xf2/0x130
> >   __kmalloc_noprof+0x442/0x6b0
> >   amdgpu_discovery_init+0x161/0xa00 [amdgpu]
> >   Fatal error during GPU init
> >   probe with driver amdgpu failed with error -12
>
> This looks like a different issue. Do you have a trace of which path it
> takes and the value seen?

The function amdgpu_discovery_get_tmr_info() reads the discovery table size
from the TMR info via ACPI. In the attached log, the discovered size is
0x11800000 (280 MiB). This size is then passed to kzalloc() later in
amdgpu_discovery_init(), which fails with -ENOMEM (-12) and triggers the
page-allocator warning.
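
As a rough illustration of why a request of that size cannot succeed through kzalloc(), the userspace sketch below computes the page order a physically contiguous allocation would need. It is not driver code. The helper names, the 4 KiB page size, and the maximum order of 10 are assumptions made for the example; the real limits depend on architecture and kernel configuration.

```c
#include <stddef.h>

/*
 * Assumptions for illustration only: 4 KiB pages and a maximum
 * buddy-allocator order of 10. Real kernels may be configured differently.
 */
#define SKETCH_PAGE_SHIFT     12u
#define SKETCH_MAX_PAGE_ORDER 10u

/* Smallest order such that (1 << order) pages cover `size` bytes. */
static unsigned int sketch_page_order(size_t size)
{
	size_t page_size = (size_t)1 << SKETCH_PAGE_SHIFT;
	size_t pages = (size + page_size - 1) >> SKETCH_PAGE_SHIFT;
	unsigned int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}

/*
 * Nonzero when a physically contiguous allocation of `size` bytes would
 * need more than the assumed maximum order.
 */
static int sketch_exceeds_max_order(size_t size)
{
	return sketch_page_order(size) > SKETCH_MAX_PAGE_ORDER;
}
```

Under these assumptions, sketch_page_order(0x11800000) works out to 17 (71680 pages, rounded up to 2^17), well above the assumed limit of 10, so the request is rejected regardless of how much memory is free.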

[ 1337.084630] ------------[ cut here ]------------
[ 1337.084634] WARNING: mm/page_alloc.c:5216 at __alloc_frozen_pages_noprof+0x29e/0x340, CPU#0: kworker/0:0/9
[ 1337.084652] Modules linked in: amdgpu(E+) amdxcp drm_panel_backlight_quirks gpu_sched drm_buddy drm_ttm_helper ttm drm_exec drm_suballoc_helper drm_client_lib drm_display_helper cec rc_core drm_kms_helper video xt_comment xt_conntrack xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat x_tables nf_tables nfnetlink xfrm_user xfrm_algo overlay binfmt_misc nls_iso8859_1 intel_rapl_msr amd_atl intel_rapl_common amd64_edac edac_mce_amd kvm_amd ccp kvm rapl wmi_bmof mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore drm autofs4 btrfs blake2b libblake2b raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear dax_hmem cxl_acpi cxl_port nvme igb cxl_core ghash_clmulni_intel einj dca nvme_core i2c_piix4 i2c_algo_bit i2c_smbus wmi aesni_intel
[ 1337.084915] CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Tainted: G E 6.19.0+ #79 PREEMPT(voluntary)
[ 1337.084925] Tainted: [E]=UNSIGNED_MODULE
[ 1337.084929] Hardware name: AMD Corporation Sh54p/Sh54p, BIOS RMP100CAS 11/07/2025
[ 1337.084934] Workqueue: events work_for_cpu_fn
[ 1337.084948] RIP: 0010:__alloc_frozen_pages_noprof+0x29e/0x340
[ 1337.084954] Code: e9 b6 fe ff ff 83 fe 0a 0f 86 ec fd ff ff 0f b6 1d a4 b0 13 02 80 fb 01 0f 87 75 66 b6 ff 83 e3 01 75 09 c6 05 8f b0 13 02 01 <0f> 0b 45 31 ff e9 12 ff ff ff a9 00 00 08 00 75 62 44 89 e1 80 e1
[ 1337.084960] RSP: 0018:ffffc90000123970 EFLAGS: 00010246
[ 1337.084967] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 1337.084971] RDX: 0000000000000000 RSI: 0000000000000011 RDI: 0000000000000000
[ 1337.084975] RBP: ffffc900001239c8 R08: 0000000000000000 R09: 0000000000000001
[ 1337.084979] R10: ffffc90000123b18 R11: 0000000000000004 R12: 0000000000040dc0
[ 1337.084984] R13: 0000000000000011 R14: ffffffffffffffff R15: 0000000000000000
[ 1337.084988] FS: 0000000000000000(0000) GS:ffff88a64bec9000(0000) knlGS:0000000000000000
[ 1337.084994] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1337.084998] CR2: 000000c002881000 CR3: 000000046b090009 CR4: 0000000000770ef0
[ 1337.085003] PKRU: 55555554
[ 1337.085006] Call Trace:
[ 1337.085010] <TASK>
[ 1337.085030] alloc_pages_mpol+0x7e/0x190
[ 1337.085041] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.085057] ? amdgpu_discovery_init+0x161/0xa00 [amdgpu]
[ 1337.085836] alloc_frozen_pages_noprof+0x58/0x80
[ 1337.085842] ___kmalloc_large_node+0xf2/0x130
[ 1337.085848] ? vprintk_emit+0x2aa/0x590
[ 1337.085857] __kmalloc_large_node_noprof+0x25/0xc0
[ 1337.085862] __kmalloc_noprof+0x442/0x6b0
[ 1337.085867] ? vprintk+0x1c/0x50
[ 1337.085870] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.085873] ? _printk+0x5b/0x80
[ 1337.085880] amdgpu_discovery_init+0x161/0xa00 [amdgpu]
[ 1337.086193] ? amdgpu_discovery_init+0x161/0xa00 [amdgpu]
[ 1337.086475] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.086481] amdgpu_discovery_reg_base_init+0x1e/0x6e0 [amdgpu]
[ 1337.086755] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.086760] amdgpu_discovery_set_ip_blocks+0x1cf7/0x2b30 [amdgpu]
[ 1337.087045] ? raw_pci_read+0x2d/0x50
[ 1337.087053] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.087056] ? pci_read+0x30/0x40
[ 1337.087059] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.087062] ? pci_bus_read_config_dword+0x4d/0x80
[ 1337.087069] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.087073] ? pcie_capability_read_dword+0xb7/0xe0
[ 1337.087079] amdgpu_device_init+0x1072/0x3470 [amdgpu]
[ 1337.087402] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.087409] ? pci_read+0x30/0x40
[ 1337.087414] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.087420] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.087426] ? pci_read_config_word+0x2d/0x50
[ 1337.087434] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1337.087441] ? do_pci_enable_device+0x11b/0x150
[ 1337.087450] ? pci_update_current_state+0x6f/0xa0
[ 1337.087463] amdgpu_driver_load_kms+0x1e/0xc0 [amdgpu]
[ 1337.087750] amdgpu_pci_probe+0x2c5/0x730 [amdgpu]
[ 1337.088036] local_pci_probe+0x4f/0xb0
[ 1337.088046] work_for_cpu_fn+0x1e/0x30
[ 1337.088051] process_scheduled_works+0xa6/0x420
[ 1337.088060] worker_thread+0x12a/0x270
[ 1337.088066] kthread+0x10d/0x230
[ 1337.088074] ? __pfx_worker_thread+0x10/0x10
[ 1337.088078] ? __pfx_kthread+0x10/0x10
[ 1337.088086] ret_from_fork+0x17c/0x1f0
[ 1337.088094] ? __pfx_kthread+0x10/0x10
[ 1337.088099] ret_from_fork_asm+0x1a/0x30
[ 1337.088114] </TASK>
[ 1337.088118] ---[ end trace 0000000000000000 ]---
[ 1337.096795] amdgpu 0000:01:00.0: Fatal error during GPU init
[ 1337.103593] amdgpu 0000:01:00.0: probe with driver amdgpu failed with error -12

Thanks
Jesse

>
> Thanks,
> Lijo
>
> >
> > Fix by:
> > - validating discovery size and falling back to DISCOVERY_TMR_SIZE when
> >   size is zero or out of expected range;
> > - using kvzalloc() for discovery buffer allocation to avoid high-order
> >   contiguous-page allocation failures;
> > - using kvfree() on all release paths.
> >
> > Signed-off-by: Jesse Zhang <[email protected]>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 19 ++++++++++++++++---
> >  1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > index 5a4e63e1ad93..a6b49378c495 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
> > @@ -329,7 +329,20 @@ static int amdgpu_discovery_get_tmr_info(struct amdgpu_device *adev,
> >  		}
> >  	}
> >  out:
> > -	adev->discovery.bin = kzalloc(adev->discovery.size, GFP_KERNEL);
> > +	if (!adev->discovery.size || adev->discovery.size > DISCOVERY_TMR_SIZE) {
> > +		dev_warn(adev->dev,
> > +			 "invalid discovery size 0x%x, fallback to default 0x%x\n",
> > +			 adev->discovery.size, DISCOVERY_TMR_SIZE);
> > +		/*
> > +		 * Some platforms may expose garbage TMR size through scratch/ACPI.
> > +		 * Fall back to legacy layout in VRAM when available.
> > +		 */
> > +		if (!*is_tmr_in_sysmem && vram_size)
> > +			adev->discovery.offset = (vram_size << 20) - DISCOVERY_TMR_OFFSET;
> > +		adev->discovery.size = DISCOVERY_TMR_SIZE;
> > +	}
> > +
> > +	adev->discovery.bin = kvzalloc(adev->discovery.size, GFP_KERNEL);
> >  	if (!adev->discovery.bin)
> >  		return -ENOMEM;
> >  	adev->discovery.debugfs_blob.data = adev->discovery.bin;
> > @@ -694,7 +707,7 @@ static int amdgpu_discovery_init(struct amdgpu_device *adev)
> >  	return 0;
> >
> >  out:
> > -	kfree(adev->discovery.bin);
> > +	kvfree(adev->discovery.bin);
> >  	adev->discovery.bin = NULL;
> >  	if ((amdgpu_discovery != 2) &&
> >  	    (RREG32(mmIP_DISCOVERY_VERSION) == 4))
> > @@ -707,7 +720,7 @@ static void amdgpu_discovery_sysfs_fini(struct amdgpu_device *adev);
> >  void amdgpu_discovery_fini(struct amdgpu_device *adev)
> >  {
> >  	amdgpu_discovery_sysfs_fini(adev);
> > -	kfree(adev->discovery.bin);
> > +	kvfree(adev->discovery.bin);
> >  	adev->discovery.bin = NULL;
> >  }
> >
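
The fallback in the first hunk can be exercised outside the kernel with a small standalone sketch such as the one below. It only mirrors the validation logic quoted above. The struct, the function name, and both constant values are placeholders chosen for the example and are not the driver's actual definitions of DISCOVERY_TMR_SIZE or DISCOVERY_TMR_OFFSET.

```c
#include <stdint.h>

/*
 * Placeholder values for illustration only; the driver defines the real
 * DISCOVERY_TMR_SIZE and DISCOVERY_TMR_OFFSET in its own headers.
 */
#define SKETCH_TMR_SIZE   (10u << 10)
#define SKETCH_TMR_OFFSET (64u << 10)

/* Hypothetical stand-in for the discovery fields the patch touches. */
struct sketch_discovery {
	uint64_t offset;  /* byte offset of the discovery table */
	uint32_t size;    /* byte size of the discovery table */
};

/*
 * A zero or oversized size is replaced by the default. The offset is
 * recomputed from the VRAM size (given in MiB) only when the table is not
 * in system memory and the VRAM size is known.
 * Returns 1 if the fallback was taken, 0 if the reported size was kept.
 */
static int sketch_sanitize_discovery(struct sketch_discovery *d,
				     uint64_t vram_size_mb, int tmr_in_sysmem)
{
	if (d->size && d->size <= SKETCH_TMR_SIZE)
		return 0;

	if (!tmr_in_sysmem && vram_size_mb)
		d->offset = (vram_size_mb << 20) - SKETCH_TMR_OFFSET;
	d->size = SKETCH_TMR_SIZE;
	return 1;
}
```

With the reported size of 0x11800000 the function takes the fallback path and resets the size to the placeholder default. When the table is in system memory the offset is left untouched, as in the patch.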
