On 12/16/25 11:08, Donet Tom wrote:
>> The GPU page tables are 4k in size no matter what the CPU page size is and
>> there is some special handling so that we can allocate them even under
>> memory pressure. Background is that you sometimes need to split up higher
>> order pages (1G, 2M) into lower order pages (2M, 4k) to be able to swap
>> things to system memory for example and for that you need an extra
>> layer of page tables.
>>
>> The problem is now that those 4k pages are rounded up to your CPU page size,
>> resulting both in wasting quite a bit of memory and in messing up the
>> special handling to not run into OOM situations when swapping things to
>> system memory....
>>
>> What we could potentially do is to switch to 64k pages on the GPU as well
>> (the HW is flexible enough to be re-configurable), but that is tons of
>> changes and probably not easily testable.
>
>
> If possible, could you share the steps to change the hardware page size? I
> can try testing it on our system.
Just typing this down off the top of my head, so don't nail me on 100%
correctness.
Modern HW, e.g. gfx9/Vega and newer including all MI* products, has a maximum
of 48 bits of address space.
Those 48 bits are divided among multiple page directories (PDs) and a leaf
page table (PT).
IIRC the vm_block_size module parameter controls the size of the PDs. If you
set that to 13 instead of the default 9 you should already get 64k PDs instead
of 4k PDs. But take that with a grain of salt; I don't think we have tested
that parameter in the last 10 years or so.
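
As a quick sanity check of that arithmetic (PDs consist of 8-byte entries),
here is a trivial standalone snippet, not driver code:

#include <stdio.h>

int main(void)
{
        /* A PD with 2^bits entries of 8 bytes each. */
        for (unsigned int bits = 9; bits <= 13; bits += 4) {
                unsigned long entries = 1UL << bits;

                printf("%u bits -> %lu entries * 8 bytes = %lu KiB\n",
                       bits, entries, entries * 8 >> 10);
        }
        return 0;
}

That prints 4 KiB for 9 bits and 64 KiB for 13 bits.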
Then each page directory entry on level 0 (PDE0) has a field called block
fragment size (see AMDGPU_PDE_BFS for MI products). This controls how much
memory each page table entry (PTE) finally points to.
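
For illustration, here is roughly what selecting 64k blocks in a PDE0 could
look like. AMDGPU_PDE_BFS is copied from memory and the "4k << x" reading of
the value is my assumption, so double check both against amdgpu_vm.h and the
HW docs:

#include <stdint.h>
#include <stdio.h>

/* Copied from memory, see amdgpu_vm.h for the real definition. */
#define AMDGPU_PDE_BFS(a)       ((uint64_t)(a) << 59)

int main(void)
{
        uint64_t pde_flags = 0;

        /* Assumption: the BFS value is log2(block size / 4k), so 4
         * should mean 4k << 4 = 64k per PTE.
         */
        pde_flags |= AMDGPU_PDE_BFS(4);
        printf("PDE0 flags: 0x%016llx\n", (unsigned long long)pde_flags);
        return 0;
}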
So putting it all together you should be able to get a configuration with two
levels of PDs, each covering 13 bits of address space and 64k in size, plus a
PT covering 18 bits of address space and 2M in size, where each PTE points to
a 64k block.
Here are the relevant bits from function amdgpu_vm_adjust_size():
...
tmp = roundup_pow_of_two(adev->vm_manager.max_pfn);
if (amdgpu_vm_block_size != -1)
        tmp >>= amdgpu_vm_block_size - 9;
tmp = DIV_ROUND_UP(fls64(tmp) - 1, 9) - 1;
adev->vm_manager.num_level = min_t(unsigned int, max_level, tmp);

switch (adev->vm_manager.num_level) {
case 3:
        adev->vm_manager.root_level = AMDGPU_VM_PDB2;
        break;
case 2:
        adev->vm_manager.root_level = AMDGPU_VM_PDB1;
        break;
case 1:
        adev->vm_manager.root_level = AMDGPU_VM_PDB0;
        break;
default:
        dev_err(adev->dev, "VMPT only supports 2~4+1 levels\n");
}

/* block size depends on vm size and hw setup */
if (amdgpu_vm_block_size != -1)
        adev->vm_manager.block_size =
                min((unsigned)amdgpu_vm_block_size, max_bits
                    - AMDGPU_GPU_PAGE_SHIFT
                    - 9 * adev->vm_manager.num_level);
else if (adev->vm_manager.num_level > 1)
        adev->vm_manager.block_size = 9;
else
        adev->vm_manager.block_size = amdgpu_vm_get_block_size(tmp);

if (amdgpu_vm_fragment_size == -1)
        adev->vm_manager.fragment_size = fragment_size_default;
else
        adev->vm_manager.fragment_size = amdgpu_vm_fragment_size;
...
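
If you want to play with vm_block_size without reloading the module, the
computation above is easy to mirror in userspace. A minimal sketch, assuming
48 address bits, 4k GPU pages and max_level=3 (which I believe is what
gmc_v9_0_sw_init passes in, but double check):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define GPU_PAGE_SHIFT  12      /* AMDGPU_GPU_PAGE_SHIFT */

static unsigned int fls64_(uint64_t x)
{
        return x ? 64 - __builtin_clzll(x) : 0;
}

int main(int argc, char **argv)
{
        const unsigned int max_bits = 48, max_level = 3;
        int vm_block_size = argc > 1 ? atoi(argv[1]) : -1;
        uint64_t tmp = 1ULL << (max_bits - GPU_PAGE_SHIFT);     /* max_pfn */
        unsigned int num_level, block_size;

        if (vm_block_size != -1)
                tmp >>= vm_block_size - 9;

        /* DIV_ROUND_UP(fls64(tmp) - 1, 9) - 1 */
        num_level = (fls64_(tmp) - 1 + 8) / 9 - 1;
        if (num_level > max_level)
                num_level = max_level;

        if (vm_block_size != -1) {
                block_size = max_bits - GPU_PAGE_SHIFT - 9 * num_level;
                if ((unsigned int)vm_block_size < block_size)
                        block_size = (unsigned int)vm_block_size;
        } else {
                block_size = 9; /* num_level > 1 on this config */
        }

        printf("num_level=%u block_size=%u -> %lu KiB per PT\n",
               num_level, block_size, (1UL << block_size) * 8 >> 10);
        return 0;
}

Pass the intended vm_block_size as the first argument; without an argument it
uses the -1 default. Note how block_size gets clamped against max_bits, just
like in the kernel code above.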
But again, that is probably tons of work since the AMDGPU_GPU_PAGE_SIZE macro
needs to change as well, and I'm not sure that the FW doesn't internally
assume 4k pages somewhere.
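
For reference, those definitions in amdgpu.h are currently hard-coded to 4k
(quoting from memory, so verify against your tree):

#define AMDGPU_GPU_PAGE_SIZE 4096
#define AMDGPU_GPU_PAGE_MASK (AMDGPU_GPU_PAGE_SIZE - 1)
#define AMDGPU_GPU_PAGE_SHIFT 12
#define AMDGPU_GPU_PAGE_ALIGN(a) (((a) + AMDGPU_GPU_PAGE_MASK) & ~AMDGPU_GPU_PAGE_MASK)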
Regards,
Christian.