On 12/16/25 11:08, Donet Tom wrote:
>> The GPU page tables are 4k in size no matter what the CPU page size is, and
>> there is some special handling so that we can allocate them even under
>> memory pressure. Background is that you sometimes need to split up higher
>> order pages (1G, 2M) into lower order pages (2M, 4k), for example to be
>> able to swap things out to system memory, and for that you need an extra
>> layer of page tables.
>>
>> The problem is now that those 4k pages are rounded up to your CPU page size,
>> which both wastes quite a bit of memory and messes up the special handling
>> meant to avoid running into OOM situations when swapping things out to
>> system memory....
>>
>> What we could potentially do is to switch to 64k pages on the GPU as well 
>> (the HW is flexible enough to be re-configurable), but that is tons of 
>> changes and probably not easily testable.
> 
> 
> If possible, could you share the steps to change the hardware page size? I 
> can try testing it on our system.

Just typing that down off the top of my head, so don't nail me down on 100%
correctness.

Modern HW, e.g. gfx9/Vega and newer, including all MI* products, has a maximum
of 48 bits of address space.

Those 48 bits are divided among multiple levels of page directories (PDs) and a
leaf page table (PT).
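
To make that concrete, this is roughly the default gfx9 layout with the full
48 bit address space and the default block size of 9 (the exact split is my
reading of the code, so double-check):

        VA[47:39] -> PDB2 (root PD)
        VA[38:30] -> PDB1
        VA[29:21] -> PDB0
        VA[20:12] -> PT (leaf)
        VA[11:0]  -> offset into the 4k page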

IIRC the vm_block_size module parameter controls the size of the PDs. If you
set that to 13 instead of the default 9 you should already get 64k PDs instead
of 4k PDs. But take that with a grain of salt; I think we haven't tested that
parameter in the last 10 years or so.
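
If you want to experiment with it, it is set like any other amdgpu module
parameter, e.g. (untested with this value, as said):

        # on the kernel command line
        amdgpu.vm_block_size=13

        # or persistently, e.g. in /etc/modprobe.d/amdgpu.conf
        options amdgpu vm_block_size=13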

Then each page directory entry on level 0 (PDE0) has a field called block
fragment size (see AMDGPU_PDE_BFS for MI products). This controls how much
memory each page table entry (PTE) finally points to.
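
As a sketch of what that means in code, assuming the gfx9 definitions from
amdgpu_vm.h, where a PTE below a PDE0 with block fragment size x covers
2^x * 4k (the value 4 is just my example for 64k, not something the driver
sets today):

        /* Hypothetical PDE0 flags for 64k PTEs: 2^4 * 4k = 64k. Compare
         * gmc_v9_0_get_vm_pde(), which IIRC uses AMDGPU_PDE_BFS(0x9) for
         * the 2M granularity of the translate-further layout. */
        flags |= AMDGPU_PDE_BFS(4);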

So putting it all together you should be able to get a configuration with two
levels of PDs, each covering 13 bits of address space and 64k in size, plus a
PT covering 18 bits of address space and 2M in size, where each PTE points to
a 64k block.
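
Sanity checking the sizes, assuming the usual 8 bytes per GPUVM entry (my
arithmetic, not measured):

        PD: 2^13 entries * 8 bytes = 64k
        PT: 2^18 entries * 8 bytes = 2M
        block fragment size for 64k blocks: log2(64k / 4k) = 4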

Here are the relevant bits from function amdgpu_vm_adjust_size():
...
        tmp = roundup_pow_of_two(adev->vm_manager.max_pfn);
        if (amdgpu_vm_block_size != -1)
                tmp >>= amdgpu_vm_block_size - 9;
        tmp = DIV_ROUND_UP(fls64(tmp) - 1, 9) - 1;
        adev->vm_manager.num_level = min_t(unsigned int, max_level, tmp);
        switch (adev->vm_manager.num_level) {
        case 3:
                adev->vm_manager.root_level = AMDGPU_VM_PDB2;
                break;
        case 2:
                adev->vm_manager.root_level = AMDGPU_VM_PDB1;
                break;
        case 1:
                adev->vm_manager.root_level = AMDGPU_VM_PDB0;
                break;
        default:
                dev_err(adev->dev, "VMPT only supports 2~4+1 levels\n");
        }
        /* block size depends on vm size and hw setup*/
        if (amdgpu_vm_block_size != -1)
                adev->vm_manager.block_size =
                        min((unsigned)amdgpu_vm_block_size, max_bits
                            - AMDGPU_GPU_PAGE_SHIFT
                            - 9 * adev->vm_manager.num_level);
        else if (adev->vm_manager.num_level > 1)
                adev->vm_manager.block_size = 9;
        else
                adev->vm_manager.block_size = amdgpu_vm_get_block_size(tmp);

        if (amdgpu_vm_fragment_size == -1)
                adev->vm_manager.fragment_size = fragment_size_default;
        else
                adev->vm_manager.fragment_size = amdgpu_vm_fragment_size;
...
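
Walking through that with vm_block_size=13 and numbers I just picked myself,
e.g. a 1TB VM size (max_pfn = 2^28) and max_level=3/max_bits=48 as on gfx9:

        tmp = roundup_pow_of_two(2^28)              -> 2^28
        tmp >>= (13 - 9)                            -> 2^24
        DIV_ROUND_UP(fls64(2^24) - 1, 9) - 1        -> ceil(24 / 9) - 1 = 2
        num_level = min(3, 2) = 2                   -> root_level = AMDGPU_VM_PDB1
        block_size = min(13, 48 - 12 - 9 * 2)       -> 13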

But again, that is probably tons of work, since the AMDGPU_GPU_PAGE_SIZE macro
needs to change as well, and I'm not sure the FW doesn't internally assume
somewhere that we have 4k pages.
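
For reference, those are the definitions I mean, as I remember them from
amdgpu.h (double-check against your tree):

        /* GPU does all page translation at 4K page size */
        #define AMDGPU_GPU_PAGE_SIZE 4096
        #define AMDGPU_GPU_PAGE_MASK (AMDGPU_GPU_PAGE_SIZE - 1)
        #define AMDGPU_GPU_PAGE_SHIFT 12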

Regards,
Christian.
