Jason,

On 9/3/2025 11:16 PM, Jason Gunthorpe wrote:
> map is slightly complicated because it has to handle a number of special
> edge cases:
>  - Overmapping a previously shared table with an OA - requires validating
>    and freeing the possibly empty tables
>  - Doing the above across an entire to-be-created contiguous entry
>  - Installing a new shared table level concurrently with another thread
>  - Expanding the table by adding more top levels
> 
> Table expansion is a unique feature of AMDv1, this version is quite
> similar except we handle racing concurrent lockless map. The table top
> pointer and starting level are encoded in a single uintptr_t which ensures
> we can READ_ONCE() without tearing. Any op will do the READ_ONCE() and use
> that fixed point as its starting point. Concurrent expansion is handled
> with a table global spinlock.
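
To make the single-word encoding described above concrete, here is a minimal
userspace sketch. It assumes the level is kept in the low alignment bits of the
table pointer; the commit message does not say how the two are packed, so
TOP_LEVEL_MASK, top_encode(), top_table(), top_level(), top_publish() and
top_snapshot() are all hypothetical names, and a C11 atomic load stands in for
READ_ONCE().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: tables are at least 8-byte aligned, so 3 low bits are free. */
#define TOP_LEVEL_MASK ((uintptr_t)0x7)

/* Pack the table pointer and the starting level into one word. */
static inline uintptr_t top_encode(void *table, unsigned int level)
{
	return (uintptr_t)table | ((uintptr_t)level & TOP_LEVEL_MASK);
}

/* Recover the table pointer from a snapshot. */
static inline void *top_table(uintptr_t top)
{
	return (void *)(top & ~TOP_LEVEL_MASK);
}

/* Recover the starting level from the same snapshot. */
static inline unsigned int top_level(uintptr_t top)
{
	return (unsigned int)(top & TOP_LEVEL_MASK);
}

/* One word holds both fields, so a reader can never see a mixed pair. */
static _Atomic uintptr_t table_top;

/* Writer side; in the patch, expansion is serialized by a spinlock. */
static inline void top_publish(void *table, unsigned int level)
{
	atomic_store_explicit(&table_top, top_encode(table, level),
			      memory_order_release);
}

/* Reader side: take one snapshot and use it for the whole operation. */
static inline uintptr_t top_snapshot(void)
{
	return atomic_load_explicit(&table_top, memory_order_acquire);
}
```

A walker would call top_snapshot() once and derive both the pointer and the
level from that value, rather than reading them separately.
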
> 
> When inserting a new table entry map checks that the entire portion of the
> table is empty. This includes freeing any empty lower tables that will be
> overwritten by an OA. A separate free list is used while checking and
> collecting all the empty lower tables so that writing the new entry is
> uninterrupted, either the new entry fully writes or nothing changes.
> 
> A special fast path for PAGE_SIZE is implemented that does a direct walk
> to the leaf level and installs a single entry. This gives ~15% improvement
> for iommu_map() when mapping lists of single pages.
> 
> This version sits under the iommu_domain_ops as map_pages() but does not
> require the external page size calculation. The implementation is actually
> map_range() and can do arbitrary ranges, internally handling all the
> validation and supporting any arrangement of page sizes. A future series
> can optimize iommu_map() to take advantage of this.
> 
> Tested-by: Alejandro Jimenez <[email protected]>
> Signed-off-by: Jason Gunthorpe <[email protected]>
> ---
>  drivers/iommu/generic_pt/iommu_pt.h | 481 ++++++++++++++++++++++++++++
>  include/linux/generic_pt/iommu.h    |  58 ++++
>  2 files changed, 539 insertions(+)
> 

.../...

> +static int __map_range_leaf(struct pt_range *range, void *arg,
> +                         unsigned int level, struct pt_table_p *table)
> +{
> +     struct pt_state pts = pt_init(range, level, table);
> +     struct pt_iommu_map_args *map = arg;
> +     unsigned int leaf_pgsize_lg2 = map->leaf_pgsize_lg2;
> +     unsigned int start_index;
> +     pt_oaddr_t oa = map->oa;
> +     unsigned int step;
> +     bool need_contig;
> +     int ret = 0;
> +
> +     PT_WARN_ON(map->leaf_level != level);
> +     PT_WARN_ON(!pt_can_have_leaf(&pts));
> +
> +     step = log2_to_int_t(unsigned int,
> +                          leaf_pgsize_lg2 - pt_table_item_lg2sz(&pts));
> +     need_contig = leaf_pgsize_lg2 != pt_table_item_lg2sz(&pts);
> +
> +     _pt_iter_first(&pts);
> +     start_index = pts.index;
> +     do {
> +             pts.type = pt_load_entry_raw(&pts);
> +             if (pts.type != PT_ENTRY_EMPTY || need_contig) {
> +                     if (pts.index != start_index)
> +                             pt_index_to_va(&pts);
> +                     ret = clear_contig(&pts, map->iotlb_gather, step,
> +                                        leaf_pgsize_lg2);
> +                     if (ret)
> +                             break;
> +             }
> +
> +             PT_WARN_ON(compute_best_pgsize(&pts, oa) != leaf_pgsize_lg2);


If I select CONFIG_DEBUG_GENERIC_PT=y and boot an AMD system with V1 (host page
table), we hit this warning in some cases. The code path looks OK. Maybe we
should silence this warning?


[   31.985383] pt_iommu_amdv1_map_pages : oa 0x208b95d000 va 0xfef80000 last_va
0xfef9ffff pgsz_lg 0xc pgsize 0x1000 pgcount 0x20
[   31.985384] __map_range_leaf oa 0x208b95e000 va 0xfef80000 last_va 0xfef9ffff
pgsize 0xd leaf_pgsize 0xc possible_sz 0x1ff000
[   31.985391] ------------[ cut here ]------------
[   31.985392] WARNING: CPU: 359 PID: 2540 at
drivers/iommu/generic_pt/fmt/../iommu_pt.h:493 __map_range_leaf+0x636/0x860
[   31.985399] Modules linked in:
[   31.985402] CPU: 359 UID: 0 PID: 2540 Comm: systemd-udevd Not tainted
6.17.0-rc3-genricpt+ #444 VOLUNTARY
[   31.985405] Hardware name: AMD Corporation Titanite_4G/Titanite_4G, BIOS
RTI100EB 12/05/2024
[   31.985406] RIP: 0010:__map_range_leaf+0x636/0x860
[   31.985409] Code: 49 89 6e 18 48 8b 54 24 58 65 48 2b 15 6b 4d b8 01 0f 85 2a
02 00 00 48 83 c4 60 5b 5d 41 5c 41 5d 41 5e 41 5f e9 55 2e 67 ff <0f> 0b e9 07
fe ff ff 0f b6 48 21 e9 e5 fb ff ff 48 8b 7c 24 18 44
[   31.985411] RSP: 0018:ff78b42ad7063558 EFLAGS: 00010297
[   31.985413] RAX: 0000000000000000 RBX: ff453e2c423cdc08 RCX: 000000000000000d
[   31.985414] RDX: 0000000000000000 RSI: 0000000000002000 RDI: ffffff7fffffffff
[   31.985415] RBP: 000000208b95e000 R08: 00000000fef9ffff R09: 00000000fffeffff
[   31.985416] R10: 000000000000000c R11: ff453e6b4c696000 R12: 0000000000003000
[   31.985417] R13: ff78b42ad7063770 R14: ff78b42ad7063748 R15: 000000000000000c
[   31.985418] FS:  00007f46c7e888c0(0000) GS:ff453e6aabbc2000(0000)
knlGS:0000000000000000
[   31.985420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   31.985421] CR2: 00007f46c7e03000 CR3: 0000000141f6b002 CR4: 0000000000771ef0
[   31.985422] PKRU: 55555554
[   31.985423] Call Trace:
[   31.985424]  <TASK>
[   31.985426]  __map_range+0x399/0x5a0
[   31.985429]  ? down_trylock+0x20/0x30
[   31.985434]  __map_range+0x1af/0x5a0
[   31.985436]  ? _printk+0x52/0x70
[   31.985441]  pt_iommu_amdv1_map_pages+0x6e6/0xca0
[   31.985444]  ? srso_alias_return_thunk+0x5/0xfbef5
[   31.985448]  ? iommu_map_nosync+0x129/0x230
[   31.985451]  iommu_map_nosync+0x129/0x230
[   31.985454]  blk_rq_dma_map_iter_start+0x186/0x1c0
[   31.985458]  nvme_prep_rq+0x4ff/0x8b0
[   31.985461]  ? srso_alias_return_thunk+0x5/0xfbef5
[   31.985463]  nvme_queue_rqs+0xc0/0x1d0
[   31.985466]  blk_mq_dispatch_queue_requests+0xf2/0x140
[   31.985469]  blk_mq_flush_plug_list+0x71/0x170
[   31.985472]  __blk_flush_plug+0xcc/0x120
[   31.985476]  blk_finish_plug+0x1f/0x30
[   31.985478]  read_pages+0x1a8/0x260
[   31.985483]  ? filemap_add_folio+0xae/0xd0
[   31.985485]  page_cache_ra_unbounded+0x174/0x230
[   31.985488]  force_page_cache_ra+0x89/0xb0
[   31.985491]  filemap_get_pages+0x12a/0x720
[   31.985494]  filemap_read+0xda/0x3e0
[   31.985497]  ? srso_alias_return_thunk+0x5/0xfbef5
[   31.985499]  ? alloc_pages_mpol+0x76/0x140
[   31.985502]  ? srso_alias_return_thunk+0x5/0xfbef5
[   31.985504]  ? mod_memcg_lruvec_state+0x96/0x1a0
[   31.985507]  ? srso_alias_return_thunk+0x5/0xfbef5
[   31.985509]  ? __lruvec_stat_mod_folio+0x6d/0xa0
[   31.985511]  ? srso_alias_return_thunk+0x5/0xfbef5
[   31.985512]  ? srso_alias_return_thunk+0x5/0xfbef5
[   31.985514]  ? set_ptes.constprop.0+0x36/0x80
[   31.985517]  ? srso_alias_return_thunk+0x5/0xfbef5
[   31.985519]  ? __handle_mm_fault+0xa2c/0x14d0
[   31.985522]  blkdev_read_iter+0x6f/0x140
[   31.985525]  vfs_read+0x207/0x330
[   31.985528]  ksys_read+0x5c/0xd0
[   31.985530]  do_syscall_64+0x50/0x1e0
[   31.985533]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   31.985535] RIP: 0033:0x7f46c8576852
[   31.985537] Code: c0 e9 b2 fe ff ff 50 48 8d 3d 1a b4 0c 00 e8 a5 1d 02 00 0f
1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0
ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[   31.985538] RSP: 002b:00007ffc06f9c638 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[   31.985540] RAX: ffffffffffffffda RBX: 00007f46c7e02028 RCX: 00007f46c8576852
[   31.985541] RDX: 0000000000040000 RSI: 00007f46c7e02038 RDI: 000000000000000c
[   31.985542] RBP: 0000555f80925280 R08: 00007f46c7e02010 R09: 00007f46c7e02010
[   31.985543] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000040000
[   31.985544] R13: 0000000000040000 R14: 00007f46c7e02010 R15: 0000555f809252d0
[   31.985546]  </TASK>
[   31.985547] ---[ end trace 0000000000000000 ]---
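
The numbers in the debug print above can be checked by hand. The sketch below
is a simplified stand-in, not the kernel's compute_best_pgsize():
best_pgsize_lg2() is a hypothetical helper that picks the largest size from a
bitmap that is aligned for both va and oa and fits in the remaining length.
With the values from the second print (possible_sz 0x1ff000, va 0xfef80000,
last_va 0xfef9ffff, oa 0x208b95e000) it yields 0xd, while the original oa
0x208b95d000 yields 0xc. Note that the printed va is still 0xfef80000 although
oa has advanced by one page, which may be why the check sees 0xd.

```c
#include <stdint.h>

/*
 * Hypothetical helper for illustration only. Returns the log2 of the largest
 * page size in pgsz_bitmap that:
 *   1) is aligned for both va and oa, and
 *   2) does not exceed the length va..last_va.
 * Returns 0 if no size qualifies.
 */
static unsigned int best_pgsize_lg2(uint64_t pgsz_bitmap, uint64_t va,
				    uint64_t last_va, uint64_t oa)
{
	uint64_t len = last_va - va + 1;
	uint64_t align = va | oa; /* set bits mark misalignment */
	unsigned int lg2;

	for (lg2 = 63; lg2 > 0; lg2--) {
		uint64_t sz = 1ULL << lg2;

		if (!(pgsz_bitmap & sz))
			continue; /* size not supported */
		if (align & (sz - 1))
			continue; /* va or oa not aligned to this size */
		if (sz > len)
			continue; /* would map past last_va */
		return lg2;
	}
	return 0;
}
```
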


-Vasant


