On 5/14/26 21:49, Ahmed Elmetwally wrote: > When the user-supplied amdgpu.gartsize= module parameter requests a GART > aperture so large that the resulting page-table BO cannot be allocated > from VRAM, gmc_v11_0_sw_init() aborts with -ENOMEM during the kernel BO > pin in amdgpu_gart_table_vram_alloc(), which fails amdgpu probe entirely > and leaves the device without a /dev/dri node.
Yeah that's the expected behavior. > > The GART page table is 8 bytes per AMDGPU_GPU_PAGE_SIZE-sized page, > i.e. table_size = gart_size / 512. The table BO is allocated from real > VRAM. On small-VRAM SoCs (APUs with stolen VRAM -- Strix Halo at ~1 GiB > is the case at hand) a user-supplied gartsize that produces a table BO > larger than what can be pinned in VRAM is silently accepted today and > only fails far downstream. > > Reproducer on Strix Halo (gfx1151, 1 GiB stolen VRAM): > > /etc/modprobe.d/amdgpu-tuning.conf: > options amdgpu gartsize=262144 # 256 GiB -> 512 MiB page table > > dmesg: > amdgpu: [gmc_v11_0]*ERROR* GART aperture is needed by the driver but > no memory has been pinned for it ... > amdgpu 0000:c6:00.0: amdgpu: SW IP initialize failed > amdgpu 0000:c6:00.0: amdgpu: sw_init of IP block <gmc_v11_0> failed -12 > > Recovery requires booting with amdgpu.gartsize=N on the kernel command > line -- i.e. the user must already know the cause, from rescue media, > with the GPU offline. > > The user-supplied gartsize is a tuning hint, Nope it is a parameter we use for testing. End users have absolutely no business changing that. > not a correctness > requirement. Compute the maximum sensible value such that the page-table > BO fits within real_vram_size / 8 (leaves 7/8 of VRAM for everything > else), and if the user value exceeds it, log a warning and fall back to > the per-IP auto default. With this patch the same modprobe.d entry > above produces: > > amdgpu 0000:c6:00.0: amdgpu: amdgpu.gartsize=262144 MiB exceeds device > capacity (real_vram=1024 MiB, max sensible=65536 MiB); clamping to > default 512 MiB > > and the device probes normally. > > The helper lives in amdgpu_gmc.c so the other gmc_v*_0 backends can > adopt it in follow-up patches; this patch only wires gmc_v11_0 because > that is where the failure was observed. Well if you shoot into your own foot it is supposed to hurt. When the end user gives incorrect or nonsense module parameters the driver should clearly not override them, but gracefully fail to load. Otherwise we loose part of the testing functionality those module parameters provide. So clear NAK to that patch here. Regards, Christian. > > Signed-off-by: Ahmed Elmetwally <[email protected]> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 47 +++++++++++++++++++++++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 + > drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 6 +--- > 3 files changed, 50 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > @@ -330,6 +330,53 @@ void amdgpu_gmc_gart_location(struct amdgpu_device *adev, > dev_info(adev->dev, "GART: %lluM 0x%016llX - 0x%016llX\n", > mc->gart_size >> 20, mc->gart_start, mc->gart_end); > } > + > +/** > + * amdgpu_gmc_validate_gart_size - clamp amdgpu.gartsize against VRAM > capacity > + * > + * @adev: amdgpu device (real_vram_size must already be populated) > + * @user_mb: value of the amdgpu_gart_size module parameter (MiB), > + * or -1 for auto > + * @default_mb: per-IP auto default in MiB > + * > + * The GART page table is allocated from VRAM and sized as > + * gart_size / AMDGPU_GPU_PAGE_SIZE * sizeof(u64). A typoed or mis-pasted > + * amdgpu.gartsize value (e.g. 262144 MiB on a 1 GiB-VRAM APU) produces a > + * page-table BO that cannot be pinned, aborting GPU probe. Cap the page > + * table at 1/8 of real VRAM, warn, and fall back to the auto default. > + * Returns gart_size in bytes. > + */ > +u64 amdgpu_gmc_validate_gart_size(struct amdgpu_device *adev, > + int user_mb, u32 default_mb) > +{ > + u64 want_bytes, max_bytes; > + > + if (user_mb == -1) > + return (u64)default_mb << 20; > + > + want_bytes = (u64)user_mb << 20; > + > + /* > + * page-table BO bytes = gart_bytes / AMDGPU_GPU_PAGE_SIZE * 8 > + * Constraint: page-table BO <= real_vram_size / 8 > + * gart_bytes <= (real_vram_size / 8) * (AMDGPU_GPU_PAGE_SIZE / 8) > + */ > + max_bytes = (adev->gmc.real_vram_size / 8) * > + (AMDGPU_GPU_PAGE_SIZE / 8); > + > + if (want_bytes > max_bytes) { > + dev_warn(adev->dev, > + "amdgpu.gartsize=%u MiB exceeds device capacity " > + "(real_vram=%llu MiB, max sensible=%llu MiB); " > + "clamping to default %u MiB\n", > + user_mb, > + adev->gmc.real_vram_size >> 20, > + max_bytes >> 20, > + default_mb); > + return (u64)default_mb << 20; > + } > + > + return want_bytes; > +} > > /** > * amdgpu_gmc_agp_location - try to find AGP location > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h > @@ -416,6 +416,8 @@ void amdgpu_gmc_vram_location(struct amdgpu_device *adev, > struct amdgpu_gmc *mc, > void amdgpu_gmc_gart_location(struct amdgpu_device *adev, > struct amdgpu_gmc *mc, > enum amdgpu_gart_placement gart_placement); > +u64 amdgpu_gmc_validate_gart_size(struct amdgpu_device *adev, > + int user_mb, u32 default_mb); > void amdgpu_gmc_agp_location(struct amdgpu_device *adev, > struct amdgpu_gmc *mc); > void amdgpu_gmc_set_agp_default(struct amdgpu_device *adev, > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > @@ -708,11 +708,7 @@ static int gmc_v11_0_mc_init(struct amdgpu_device *adev) > if (adev->gmc.visible_vram_size > adev->gmc.real_vram_size) > adev->gmc.visible_vram_size = adev->gmc.real_vram_size; > > - /* set the gart size */ > - if (amdgpu_gart_size == -1) > - adev->gmc.gart_size = 512ULL << 20; > - else > - adev->gmc.gart_size = (u64)amdgpu_gart_size << 20; > + adev->gmc.gart_size = amdgpu_gmc_validate_gart_size(adev, > + amdgpu_gart_size, 512); > > gmc_v11_0_vram_gtt_location(adev, &adev->gmc); > > -- > 2.45.0
