Hi Jason,

On Wed, Sep 03, 2025 at 02:46:40PM -0300, Jason Gunthorpe wrote:
> From: Alejandro Jimenez <alejandro.j.jime...@oracle.com>
> 
> Replace the io_pgtable versions with pt_iommu versions. The v2 page table
> uses the x86 implementation that will be eventually shared with VT-D.
> 
> This supports the same special features as the original code:
>  - increase_top for the v1 format to allow scaling from 3 to 6 levels
>  - non-present flushing
>  - Dirty tracking for v1 only
>  - __sme_set() to adjust the PTEs for CC
>  - Optimization for flushing with virtualization to minimize the range
>  - amd_iommu_pgsize_bitmap override of the native page sizes
>  - page tables allocate from the device's NUMA node
> 
> Rework the domain ops so that v1/v2 get their own ops. Make dedicated
> allocation functions for v1 and v2. Hook up invalidation for a top change
> to struct pt_iommu_flush_ops. Delete some of the iopgtable related code
> that becomes unused in this patch. The next patch will delete the rest of
> it.
> 
> This fixes a race bug in AMD's increase_address_space() implementation. It
> stores the top level and top pointer in different memory, which prevents
> other threads from reading a coherent version:
> 
>    increase_address_space()   alloc_pte()
>                                 level = pgtable->mode - 1;
>       pgtable->root  = pte;
>       pgtable->mode += 1;
>                                 pte = &pgtable->root[PM_LEVEL_INDEX(level, 
> address)];
> 
> The iommupt version is careful to put mode and root under a single
> READ_ONCE and then is careful to only READ_ONCE a single time per
> walk.
> 
> Signed-off-by: Alejandro Jimenez <alejandro.j.jime...@oracle.com>
> Tested-by: Alejandro Jimenez <alejandro.j.jime...@oracle.com>
> Signed-off-by: Jason Gunthorpe <j...@nvidia.com>
> ---
>  drivers/iommu/amd/Kconfig           |   5 +-
>  drivers/iommu/amd/amd_iommu.h       |   1 -
>  drivers/iommu/amd/amd_iommu_types.h |  12 +-
>  drivers/iommu/amd/io_pgtable.c      |   1 -
>  drivers/iommu/amd/iommu.c           | 538 ++++++++++++++--------------
>  5 files changed, 282 insertions(+), 275 deletions(-)
> 

../..

> +
> +static struct iommu_domain *amd_iommu_domain_alloc_paging_v1(struct device 
> *dev,
> +                                                          u32 flags)
> +{
> +     struct pt_iommu_amdv1_cfg cfg = {};
>       struct protection_domain *domain;
>       int ret;
>  
> +     if (amd_iommu_hatdis)
> +             return ERR_PTR(-EOPNOTSUPP);
> +
>       domain = protection_domain_alloc();
>       if (!domain)
>               return ERR_PTR(-ENOMEM);
>  
> -     domain->pd_mode = pgtable;
> -     ret = pdom_setup_pgtable(domain, dev);
> +     domain->pd_mode = PD_MODE_V1;
> +     domain->iommu.hw_flush_ops = &amd_hw_flush_ops_v1;
> +     domain->iommu.nid = dev_to_node(dev);
> +     if (flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING)
> +             domain->domain.dirty_ops = &amdv1_dirty_ops;
> +
> +     /*
> +      * Someday FORCE_COHERENCE should be set by
> +      * amd_iommu_enforce_cache_coherency() like VT-D does.
> +      */
> +     cfg.common.features = BIT(PT_FEAT_DYNAMIC_TOP) |
> +                           BIT(PT_FEAT_AMDV1_ENCRYPT_TABLES) |
> +                           BIT(PT_FEAT_AMDV1_FORCE_COHERENCE);
> +
> +     /*
> +      * AMD's IOMMU can flush as many pages as necessary in a single flush.
> +      * Unless we run in a virtual machine, which can be inferred according
> +      * to whether "non-present cache" is on, it is probably best to prefer
> +      * (potentially) too extensive TLB flushing (i.e., more misses) over
> +      * multiple TLB flushes (i.e., more flushes). For virtual machines the
> +      * hypervisor needs to synchronize the host IOMMU PTEs with those of
> +      * the guest, and the trade-off is different: unnecessary TLB flushes
> +      * should be avoided.
> +      */
> +     if (amd_iommu_np_cache)
> +             cfg.common.features |= BIT(PT_FEAT_FLUSH_RANGE_NO_GAPS);
> +     else
> +             cfg.common.features |= BIT(PT_FEAT_FLUSH_RANGE);
> +
> +     cfg.common.hw_max_vasz_lg2 =
> +             max(64, (amd_iommu_hpt_level - 1) * 9 + 21);

s/max/min
with 6 level page table max will be 66 and i am seeing boot failure with
this. Please refer below fail log.

[    5.458439] BUG: kernel NULL pointer dereference, address: 0000000000000030
[    5.459338] #PF: supervisor read access in kernel mode
[    5.459338] #PF: error_code(0x0000) - not-present page
[    5.459338] PGD 0
[    5.459338] Oops: Oops: 0000 [#1] SMP NOPTI
[    5.459338] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-rc3+ #83 
PREEMPT(voluntary)
[    5.459338] Hardware name: AMD Corporation RUBY/RUBY, BIOS RRR100DB 
08/09/2024
[    5.459338] RIP: 0010:sysfs_add_link_to_group+0x16/0x60
[    5.459338] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 
90 0f 1f 44 00 00 55 48 89 e5 41 55 49 89 cd 41 54 49 89 d4 31 d2 53 <48> 8b 7f 
30 e8 b1 8d ff ff 48 85 c0 74 32 48 89 c3 4c 89 e6 48 89
[    5.459338] RSP: 0018:ff7421b640067c00 EFLAGS: 00010246
[    5.459338] RAX: ff1b1885cb669a20 RBX: ff1b1885c028e0c8 RCX: ff1b1885cb6289b0
[    5.459338] RDX: 0000000000000000 RSI: ffffffffacb14932 RDI: 0000000000000000
[    5.459338] RBP: ff7421b640067c18 R08: 0000000000000000 R09: 0000000000000000
[    5.459338] R10: 0000000000000000 R11: 0000000000000000 R12: ff1b1885c028e0c8
[    5.459338] R13: ff1b1885cb6289b0 R14: ff7421b640067d20 R15: ff1b1885c0a20800
[    5.459338] FS:  0000000000000000(0000) GS:ff1b18951a09a000(0000) 
knlGS:0000000000000000
[    5.459338] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.459338] CR2: 0000000000000030 CR3: 0000000f65640001 CR4: 0000000000f71ef0
[    5.459338] PKRU: 55555554
[    5.459338] Call Trace:
[    5.459338]  <TASK>
[    5.459338]  iommu_device_link+0x36/0xb0
[    5.459338]  __iommu_probe_device+0x123/0x550
[    5.459338]  ? __pfx_probe_iommu_group+0x10/0x10
[    5.459338]  probe_iommu_group+0x29/0x50
[    5.459338]  bus_for_each_dev+0x89/0xf0
[    5.459338]  iommu_device_register+0xca/0x240
[    5.459338]  iommu_go_to_state+0x7d2/0x2340
[    5.459338]  ? __pfx_pci_iommu_init+0x10/0x10
[    5.459338]  amd_iommu_init+0x14/0x40
[    5.459338]  ? __pfx_pci_iommu_init+0x10/0x10
[    5.459338]  pci_iommu_init+0x13/0x70
[    5.459338]  ? __pfx_pci_iommu_init+0x10/0x10
[    5.459338]  do_one_initcall+0x5a/0x340
[    5.459338]  kernel_init_freeable+0x34e/0x530
[    5.459338]  ? __pfx_kernel_init+0x10/0x10
[    5.459338]  kernel_init+0x1b/0x200
[    5.459338]  ? __pfx_kernel_init+0x10/0x10
[    5.459338]  ret_from_fork+0x1a1/0x1d0
[    5.459338]  ? __pfx_kernel_init+0x10/0x10
[    5.459338]  ret_from_fork_asm+0x1a/0x30
[    5.459338] RIP: 1f0f:0x0
[    5.459338] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[    5.459338] RSP: 0000:0000000000000000 EFLAGS: 841f0f2e66 ORIG_RAX: 
1f0f2e6600000000
[    5.459338] RAX: 0000000000000000 RBX: 1f0f2e6600000000 RCX: 2e66000000000084
[    5.459338] RDX: 0000000000841f0f RSI: 000000841f0f2e66 RDI: 00841f0f2e660000
[    5.459338] RBP: 00841f0f2e660000 R08: 00841f0f2e660000 R09: 000000841f0f2e66
[    5.459338] R10: 0000000000841f0f R11: 2e66000000000084 R12: 000000841f0f2e66
[    5.459338] R13: 0000000000841f0f R14: 2e66000000000084 R15: 1f0f2e6600000000
[    5.459338]  </TASK>
[    5.459338] Modules linked in:
[    5.459338] CR2: 0000000000000030
[    5.459338] ---[ end trace 0000000000000000 ]---
[    5.459338] RIP: 0010:sysfs_add_link_to_group+0x16/0x60
[    5.459338] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 
90 0f 1f 44 00 00 55 48 89 e5 41 55 49 89 cd 41 54 49 89 d4 31 d2 53 <48> 8b 7f 
30 e8 b1 8d ff ff 48 85 c0 74 32 48 89 c3 4c 89 e6 48 89
[    5.459338] RSP: 0018:ff7421b640067c00 EFLAGS: 00010246
[    5.459338] RAX: ff1b1885cb669a20 RBX: ff1b1885c028e0c8 RCX: ff1b1885cb6289b0
[    5.459338] RDX: 0000000000000000 RSI: ffffffffacb14932 RDI: 0000000000000000
[    5.459338] RBP: ff7421b640067c18 R08: 0000000000000000 R09: 0000000000000000
[    5.459338] R10: 0000000000000000 R11: 0000000000000000 R12: ff1b1885c028e0c8
[    5.459338] R13: ff1b1885cb6289b0 R14: ff7421b640067d20 R15: ff1b1885c0a20800
[    5.459338] FS:  0000000000000000(0000) GS:ff1b18951a09a000(0000) 
knlGS:0000000000000000
[    5.459338] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.459338] CR2: 0000000000000030 CR3: 0000000f65640001 CR4: 0000000000f71ef0
[    5.459338] PKRU: 55555554
[    5.459338] note: swapper/0[1] exited with irqs disabled
[    5.874958] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x00000009
[    5.875940] ---[ end Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x00000009 ]---

> +     cfg.common.hw_max_oasz_lg2 = 52;
> +     cfg.starting_level = 2;
> +     domain->domain.ops = &amdv1_ops;
> +
> +     ret = pt_iommu_amdv1_init(&domain->amdv1, &cfg, GFP_KERNEL);
>       if (ret) {
> -             pdom_id_free(domain->id);
> -             kfree(domain);
> +             amd_iommu_domain_free(&domain->domain);
>               return ERR_PTR(ret);
>       }
>  
> -     domain->domain.geometry.aperture_start = 0;
> -     domain->domain.geometry.aperture_end   = dma_max_address(pgtable);
> -     domain->domain.geometry.force_aperture = true;
> -     domain->domain.pgsize_bitmap = domain->iop.pgtbl.cfg.pgsize_bitmap;
> +     /*
> +      * Narrow the supported page sizes to those selected by the kernel
> +      * command line.
> +      */
> +     domain->domain.pgsize_bitmap &= amd_iommu_pgsize_bitmap;
> +     return &domain->domain;
> +}
> 2.43.0
> 

Thanks,
Ankit

Reply via email to