On Tue, Apr 19, 2016 at 07:48:16PM +0300, Adam Morrison wrote:
> This patchset improves the scalability of the Intel IOMMU code by
> resolving two spinlock bottlenecks, yielding up to ~5x performance
> improvement and approaching iommu=off performance.
> 
> For example, here's the throughput obtained by 16 memcached instances
> running on a 16-core Sandy Bridge system, accessed using memslap on
> another machine that has iommu=off, using the default memslap config
> (64-byte keys, 1024-byte values, and 10%/90% SET/GET ops):
> 
> stock iommu=off:
>   990,803 memcached transactions/sec (=100%, median of 10 runs).
> stock iommu=on:
>   221,416 memcached transactions/sec (=22%).
>   [61.70%  0.63%  memcached  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave]
> patched iommu=on:
>   963,159 memcached transactions/sec (=97%).
>   [1.29%  1.10%  memcached  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave]
> 
> The two resolved spinlocks:
> 
>  - Deferred IOTLB invalidations are batched in a global data structure
>    and serialized under a spinlock (add_unmap() & flush_unmaps()); this
>    patchset batches IOTLB invalidations in a per-CPU data structure.
> 
>  - IOVA management (alloc_iova() & __free_iova()) is serialized under
>    the rbtree spinlock; this patchset adds per-CPU caches of allocated
>    IOVAs so that the rbtree doesn't get accessed frequently.  (Adding a
>    cache above the existing IOVA allocator is less intrusive than
>    dynamic identity mapping and helps keep IOMMU page table usage low;
>    see Patch 7.)
> 
> The paper "Utilizing the IOMMU Scalably" (presented at the 2015 USENIX
> Annual Technical Conference) contains many more details and experiments:
> 
>   https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf
> 
> v3:
>  * Patch 7/7: Respect the caller-passed limit IOVA when satisfying an
>    IOVA allocation from the cache.
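To make the per-CPU IOVA caching idea described in the quoted message
concrete, here is a minimal, self-contained user-space sketch.  The names
(iova_magazine, cached_alloc_pfn, rbtree_alloc_pfn) and sizes are
hypothetical illustrations, not the patch code; the real patches layer
their caches on top of the rbtree-based allocator in drivers/iommu/iova.c,
which is modeled below by nothing more than a counter behind a mutex.

/*
 * Sketch only: per-CPU "magazine" of free IOVA page frame numbers,
 * with a locked global allocator as the slow path.
 */
#include <pthread.h>
#include <stddef.h>

#define MAG_SIZE 128            /* IOVAs held per CPU before spilling */

struct iova_magazine {
        unsigned long pfns[MAG_SIZE];
        size_t size;
};

/* Global allocator state, protected by one lock (the old bottleneck). */
static pthread_mutex_t rbtree_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long next_pfn = 0x1000;

/* Slow path: allocate from the shared structure under the lock. */
static unsigned long rbtree_alloc_pfn(void)
{
        unsigned long pfn;

        pthread_mutex_lock(&rbtree_lock);
        pfn = next_pfn++;       /* stands in for an rbtree search */
        pthread_mutex_unlock(&rbtree_lock);
        return pfn;
}

/* Fast path: satisfy allocations and frees from this CPU's magazine. */
static unsigned long cached_alloc_pfn(struct iova_magazine *mag)
{
        if (mag->size > 0)
                return mag->pfns[--mag->size];  /* no lock taken */
        return rbtree_alloc_pfn();
}

static void cached_free_pfn(struct iova_magazine *mag, unsigned long pfn)
{
        if (mag->size < MAG_SIZE) {
                mag->pfns[mag->size++] = pfn;   /* keep for local reuse */
                return;
        }
        /*
         * Magazine full: a fuller implementation would return a whole
         * magazine to a global depot under the lock; elided here.
         */
}

int main(void)
{
        struct iova_magazine mag = { .size = 0 };
        unsigned long pfn = cached_alloc_pfn(&mag);

        cached_free_pfn(&mag, pfn);
        return 0;
}

The point of the scheme is that, once warmed up, allocation and free touch
only the local magazine and take no lock; the shared structure is consulted
only on a miss or when a full magazine has to be handed back.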
Thanks, looks good.  I'm still thinking about having two caches, one for
DMA32 and the other for DMA64; mixing them in one cache might make
allocations from the cache fail more often.  But we can do that later if
it turns out to be a real problem.  So for the whole series:

Reviewed-by: Shaohua Li <[email protected]>
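For illustration, the DMA32/DMA64 split suggested above might look roughly
like the following sketch.  The names (pfn_cache, pick_cache) and constants
are hypothetical, not kernel code; the idea is simply to keep two per-CPU
caches and route each request by the caller-passed limit, so allocations
constrained to 32-bit addresses are not left to miss because the shared
cache happens to hold only IOVAs above 4 GiB.

/* Sketch only: choose a per-CPU cache by the caller's limit. */
#define PAGE_SHIFT      12
#define DMA32_PFN_LIMIT (1UL << (32 - PAGE_SHIFT))      /* 4 GiB boundary */

struct pfn_cache {
        unsigned long pfns[128];
        unsigned int size;
};

struct iova_cpu_caches {
        struct pfn_cache dma32;         /* entries guaranteed below 4 GiB */
        struct pfn_cache dma64;         /* entries with no 32-bit constraint */
};

/* Route an allocation to the cache that can satisfy its limit. */
static struct pfn_cache *pick_cache(struct iova_cpu_caches *c,
                                    unsigned long limit_pfn)
{
        return limit_pfn <= DMA32_PFN_LIMIT ? &c->dma32 : &c->dma64;
}

int main(void)
{
        static struct iova_cpu_caches caches;   /* zero-initialized */

        return pick_cache(&caches, DMA32_PFN_LIMIT) == &caches.dma32 ? 0 : 1;
}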
