On Fri, Jan 23, 2026 at 08:26:10PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 23, 2026 at 02:53:59PM -0800, Matthew Brost wrote:
> > > That's a 2x improvement in overall full operation? Wow!
> > >
> > > Did you look at how non-iommu cases perform too?
> >
> > Like intel_iommu=off kernel command line? I haven't checked that but can.
>
> iommu.passthrough=1
>
> This is generally what we recommend everyone who cares about
> performance more than iommu protection should use by default. It
> leaves the iommu HW turned on, which x86 requires for other reasons,
> but eliminates the performance cost to DMA.

Yes, worked in HPC for a long time and we always set the IOMMU to
passthrough.

iommu.passthrough=1 brings the 2M case to roughly 130us - this stat
includes the migrate_vma_* functions btw. Also for reference, this time
drops to ~10us in any scenario with 2M device pages.

> > > I think we can do better still for the non-cached platforms as I have
> > > a way in mind to batch up lines and flush the line instead of flushing
> > > for every 8 byte IOPTE written. Some ARM folks have been talking about
> > > this problem too..
> >
> > Yes, prior to the IOMMU changes I believe the baseline was ~330us so
> > dma-map/unmap are still way slower than before, and if this affects
> > platforms other than Intel x86 there will be complaints everywhere until
> > the entire kernel moves to the IOVA alloc model.
>
> I have managed to get a test showing that when cache flushing is
> turned on the new code is 50% slower. I'm investigating this..
>
> map_pages
> pgsz , avg new,old ns , min new,old ns , min % (+ve is better)
> 2^12 , 331,249        , 289,214        , -35.35
> 2^21 , 335,243        , 306,222        , -37.37
> 2^30 , 226,238        , 205,215        , 4.04
> # test_map_unmap_benchmark:
> unmap_pages
> pgsz , avg new,old ns , min new,old ns , min % (+ve is better)
> 2^12 , 389,272        , 347,237        , -46.46
> 2^21 , 321,261        , 297,239        , -24.24
> 2^30 , 237,251        , 214,228        , 6.06
>
> So it looks to me like this is isolated to Intel GPU for the moment
> because it is the only device that would use the cache flushing flow
> until we convert ARM.
>
> FWIW, on my system enabling cache flushing goes from 60ns to 250ns, it
> has a huge, huge cost to these flows.

I see that you have fixed this one, we verified it, thanks!

> > Also another question, does IOVA alloc support modes similar to
> > dma_map_resource between peer devices? We also do that and I haven't
> > modified that code or checked it for perf regressions.
>
> Yes, and no.. The API does, but Christoph doesn't want to let arbitrary
> drivers use it. So you need to figure out some way to get there.

Yes, I see that the API allows this and it seems to work too.

> For reference Leon added dma_buf_phys_vec_to_sgt() which shows this
> flow to create a sg_table.

That will likely work for dma-buf, let me see if I can convert our
dma-buf flows to use this helper. But it won't work for things like
SVM, so it would be desirable to figure out how to have an API drivers
can use to iova alloc/link/sync/unlink/free for multi-device, or just
agree we trust drivers enough to use the existing API.
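
Roughly the per-device flow I have in mind, as a minimal sketch: the
svm_map_range() name, the paddrs[]/npages parameters, and the error
handling here are made up for illustration, and the dma_iova_*
signatures are as I remember them from <linux/dma-mapping.h>, so treat
the details as approximate rather than exact:

#include <linux/dma-mapping.h>

/* Sketch only - hypothetical helper, not an existing API. */
static dma_addr_t svm_map_range(struct device *dev,
				struct dma_iova_state *state,
				const phys_addr_t *paddrs,
				unsigned int npages)
{
	size_t size = (size_t)npages << PAGE_SHIFT;
	size_t mapped = 0;
	unsigned int i;
	int err;

	/* Allocate one contiguous IOVA range for the whole mapping. */
	if (!dma_iova_try_alloc(dev, state, paddrs[0], size))
		return DMA_MAPPING_ERROR;	/* caller falls back to dma_map_page() */

	/* Link each physical chunk into the range... */
	for (i = 0; i < npages; i++) {
		err = dma_iova_link(dev, state, paddrs[i], mapped, PAGE_SIZE,
				    DMA_BIDIRECTIONAL, 0);
		if (err)
			goto err_destroy;
		mapped += PAGE_SIZE;
	}

	/* ...then do a single sync for the whole range at the end. */
	err = dma_iova_sync(dev, state, 0, mapped);
	if (err)
		goto err_destroy;

	return state->addr;

err_destroy:
	/* Unlink whatever was linked and return the IOVA. */
	dma_iova_destroy(dev, state, mapped, DMA_BIDIRECTIONAL, 0);
	return DMA_MAPPING_ERROR;
}

Teardown would presumably be the mirror of that (dma_iova_unlink() over
the range, then dma_iova_free(), or just dma_iova_destroy()), and the
multi-device case would just carry one dma_iova_state per device being
mapped for.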

Matt

> There are also hmm helpers for the mapping too if this is in a hmm
> context.
>
> A PCI device calling map_resource is incorrect usage of the DMA API,
> but it was the only option till now.
>
> Jason
