This bug is awaiting verification that the linux/6.14.0-32.32 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-plucky-linux' to 'verification-done-plucky-linux'. If the problem still exists, change the tag 'verification-needed-plucky-linux' to 'verification-failed-plucky-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!
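For convenience, a rough sketch of the verification steps from the EnableProposed page is given below. The image package name (linux-image-6.14.0-32-generic) is an assumption derived from the version string above; confirm the exact name with apt-cache before installing.

$ echo "deb http://archive.ubuntu.com/ubuntu plucky-proposed main restricted universe multiverse" | sudo tee /etc/apt/sources.list.d/plucky-proposed.list
$ sudo apt-get update
$ sudo apt-get install -t plucky-proposed linux-image-6.14.0-32-generic
$ sudo reboot
$ uname -r    # expect 6.14.0-32-generic before re-running the fio test from the bug description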
** Tags added: kernel-spammed-plucky-linux-v2 verification-needed-plucky-linux

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2115738

Title:
  I/O performance regression on NVMes under same bridge (dual port nvme)

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Oracular:
  Won't Fix
Status in linux source package in Plucky:
  Fix Committed
Status in linux source package in Questing:
  In Progress

Bug description:

[ Impact ]

iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes

The iotlb_sync_map iommu op allows drivers to perform necessary cache flushes when new mappings are established. For the Intel iommu driver, this callback specifically serves two purposes:

- To flush caches when a second-stage page table is attached to a device whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-resident remapping structures are visible to hardware (CAP_REG.RWBF==1).

However, in scenarios where neither caching mode nor the RWBF flag is active, the cache_tag_flush_range_np() helper, which is called in the iotlb_sync_map path, effectively becomes a no-op.

Despite being a no-op, cache_tag_flush_range_np() involves iterating through all cache tags of the IOMMUs attached to the domain, protected by a spinlock. This unnecessary execution path introduces overhead, leading to a measurable I/O performance regression. On systems with NVMes under the same bridge, performance was observed to drop from approximately ~6150 MiB/s down to ~4985 MiB/s.

Introduce a flag in the dmar_domain structure. This flag is only set when iotlb_sync_map is required (i.e., when CM or RWBF is set), and cache_tag_flush_range_np() is called only for domains where this flag is set. The flag, once set, is immutable, given that there won't be mixed configurations in real-world scenarios where some IOMMUs in a system operate in caching mode while others do not. Theoretically, the immutability of this flag does not impact functionality.

[ Fix ]

Backport the following commits to Plucky:

- 12724ce3fe1a iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
- b9434ba97c44 iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
- b33125296b50 iommu/vt-d: Create unique domain ops for each stage
- 0fa6f0893466 iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
- 85cfaacc9937 iommu/vt-d: Split paging_domain_compatible()
- cee686775f9c iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain

[ Test Plan ]

Run fio against two NVMes under the same PCI bridge (dual port NVMe):

$ sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8 --time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting --new_group --name=job1 --filename=/dev/nvmeXnY --new_group --name=job2 --filename=/dev/nvmeWnZ

Verify that the speed reached with the two NVMes under the same bridge is the same as would have been reached if the two NVMes were not under the same bridge.

[ Regression Potential ]

This fix affects the Intel IOMMU (VT-d) driver. An issue with this fix may introduce problems such as incorrect omission of required IOTLB cache or write buffer flushes when attaching devices to a domain. This could result in memory remapping structures not being visible to hardware in configurations that actually require synchronization. As a consequence, devices performing DMA may exhibit data corruption, access violations, or inconsistent behavior due to stale or incomplete translations being used by the hardware.
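The regression-potential scenario above only applies to hosts whose IOMMUs actually require these flushes. As a quick cross-check during verification, the VT-d capability register is exposed in sysfs; per the VT-d specification, CAP_REG bit 4 is RWBF and bit 7 is CM. A rough sketch, assuming the usual /sys/class/iommu/dmar* layout, shows whether the optimized no-op path is the one being exercised (both bits 0 on all units):

$ for d in /sys/class/iommu/dmar*; do
      cap=$(cat "$d/intel-iommu/cap")
      printf '%s RWBF=%d CM=%d\n' "$d" $(( (0x$cap >> 4) & 1 )) $(( (0x$cap >> 7) & 1 ))
  done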
---

[Description]

A performance regression has been reported when running fio against two NVMe devices under the same PCI bridge (dual port NVMe). The issue was initially reported for the 6.11-hwe kernel for Noble. The performance regression was introduced in the 6.10 upstream kernel and is still present in 6.16 (build at commit e540341508ce2f6e27810106253d5). Bisection pointed to commit 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map").

In our tests we observe ~6150 MiB/s when the NVMe devices are on different bridges and ~4985 MiB/s when under the same bridge. Before the offending commit we observe ~6150 MiB/s, regardless of NVMe device placement.

[Test Case]

We can reproduce the issue on GCP on the Z3 metal instance type (z3-highmem-192-highlssd-metal) [1]. You need to have 2 NVMe devices under the same bridge, e.g.:

# nvme list -v
...
Device   SN                    MN            FR        TxPort  Address        Slot  Subsystem      Namespaces
-------  --------------------  ------------  --------  ------  -------------  ----  -------------  ----------
nvme0    nvme_card-pd          nvme_card-pd  (null)    pcie    0000:05:00.1         nvme-subsys0   nvme0n1
nvme1    3DE4D285C21A7C001.0   nvme_card     00000000  pcie    0000:3d:00.0         nvme-subsys1   nvme1n1
nvme10   3DE4D285C21A7C001.1   nvme_card     00000000  pcie    0000:3d:00.1         nvme-subsys10  nvme10n1
nvme11   3DE4D285C2027C000.0   nvme_card     00000000  pcie    0000:3e:00.0         nvme-subsys11  nvme11n1
nvme12   3DE4D285C2027C000.1   nvme_card     00000000  pcie    0000:3e:00.1         nvme-subsys12  nvme12n1
nvme2    3DE4D285C2368C001.0   nvme_card     00000000  pcie    0000:b7:00.0         nvme-subsys2   nvme2n1
nvme3    3DE4D285C22A74001.0   nvme_card     00000000  pcie    0000:86:00.0         nvme-subsys3   nvme3n1
nvme4    3DE4D285C22A74001.1   nvme_card     00000000  pcie    0000:86:00.1         nvme-subsys4   nvme4n1
nvme5    3DE4D285C2368C001.1   nvme_card     00000000  pcie    0000:b7:00.1         nvme-subsys5   nvme5n1
nvme6    3DE4D285C21274000.0   nvme_card     00000000  pcie    0000:87:00.0         nvme-subsys6   nvme6n1
nvme7    3DE4D285C21094000.0   nvme_card     00000000  pcie    0000:b8:00.0         nvme-subsys7   nvme7n1
nvme8    3DE4D285C21274000.1   nvme_card     00000000  pcie    0000:87:00.1         nvme-subsys8   nvme8n1
nvme9    3DE4D285C21094000.1   nvme_card     00000000  pcie    0000:b8:00.1         nvme-subsys9   nvme9n1
...

For the output above, drives nvme1n1 and nvme10n1 are under the same bridge, and looking at the SN it seems it is a dual port NVMe (a sysfs cross-check for the bridge placement is sketched after the fio runs below).

- Under the same bridge

Run fio against nvme1n1 and nvme10n1; observe 4897MiB/s after a short spike at ~6150MiB/s in the beginning.

# sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8 --time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting --new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2 --filename=/dev/nvme10n1
...
Jobs: 16 (f=16): [r(16)][100.0%][r=4897MiB/s][r=1254k IOPS][eta 00m:00s]
...

- Under a different bridge

Run fio against nvme1n1 and nvme11n1; observe ~6153MiB/s.

# sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8 --time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting --new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2 --filename=/dev/nvme11n1
...
Jobs: 16 (f=16): [r(16)][100.0%][r=6153MiB/s][r=1575k IOPS][eta 00m:00s]
...
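If it is not obvious from the nvme list output which namespaces sit behind the same bridge, a rough sysfs cross-check along these lines can help; the controller names below are just the ones from the listing above, so adjust them to your machine. Controllers that print the same parent path share the upstream bridge ('lspci -tv' shows the same topology graphically):

$ for c in nvme1 nvme10 nvme11; do
      echo "$c -> $(dirname "$(readlink -f /sys/class/nvme/$c/device)")"
  done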
** So far, we haven't been able to reproduce it on another machine, but we suspect it will be reproducible on any machine with a dual port NVMe.

[Other]

In spreadsheet [2] there are some profiling data for different kernel versions, showing a consistent performance difference between kernel versions.

Offending commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=129dab6e1286525fe5baed860d3dfcd9c6b4b327

The issue has been reported upstream [3].

[1] https://cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types
[2] https://docs.google.com/spreadsheets/d/19F0Vvgz0ztFpDX4E37E_o8JYrJ04iYJz-1cqU-j4Umk/edit?gid=1544333169#gid=1544333169
[3] https://lore.kernel.org/regressions/[email protected]/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

