On Fri, Jan 22, 2016 at 01:14:56AM +0200, Adam Morrison wrote:
> On Wed, Jan 20, 2016 at 01:14:35PM -0800, Shaohua Li wrote:
> >
> > > My understanding from the above is that the only issue with our
> > > patchset was not dealing with pfn_limit.  I can just fix that and
> > > repost, sounds good?
> >
> > Sure, please do it. For the patches, I'm not comfortable with the
> > per-cpu deferred invalidation. One important benefit of IOMMU is
> > isolation. Deferred invalidation already loosens the isolation;
> > per-cpu invalidation loosens it further. It would be better if we
> > could flush all per-cpu invalidation entries when one cpu hits its
> > per-cpu limit. Also you'd better look at CPU hotplug. We don't want
> > to lose the invalidation entries if one cpu is hot removed.
>
> I'll look into these.
>
> > The per-cpu iova implementation looks unnecessarily complicated. I
> > know you are referring to the paper, but the whole point is batch
> > allocation/free.
>
> Batched allocation/free isn't enough.  It still creates spinlock
> contention, even if there is per-cpu invalidation (that gets rid of
> async_umap_flush_lock).  Here are sample results from our memcached
> test (throughput of querying 16 memcached instances on a 16-core box
> with an Intel XL710 NIC):
>
> batched alloc/free, iommu=on:
>   313,161 memcached transactions/sec (= 29% of iommu=off)
>
> batched alloc/free + per-cpu invalidations, iommu=on:
>   434,590 memcached transactions/sec (= 40% of iommu=off)
>
> perf report:
>   61.15%  0.33%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>           |
>           ---_raw_spin_lock_irqsave
>              |
>              |--87.81%-- free_iova_array
>              |--11.71%-- alloc_iova
>
> In contrast, the per-cpu magazine cache in our patchset enables iova
> allocation/free to complete without accessing the iova allocator at
> all.  So we don't touch the rbtree spinlock, and we also complete iova
> allocation in constant time, which avoids the linear-time allocations
> that the iova allocator suffers from.  (These were described in the
> paper "Efficient intra-operating system protection against harmful
> DMAs", presented at the USENIX FAST 2015 conference.)  The end result:
>
> magazines cache + per-cpu invalidations, iommu=on:
>   1,067,586 memcached transactions/sec (= 98% of iommu=off)
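[Editor's note: for readers unfamiliar with magazine caches, here is a
minimal single-threaded C sketch of the idea quoted above. This is an
illustration, not the patchset's code: the names (struct iova_magazine,
struct cpu_cache, MAG_SIZE) are invented, and the "global allocator" is
a toy bump allocator standing in for the rbtree-based iova allocator.]

```c
#include <assert.h>

#define MAG_SIZE 128            /* entries per magazine (invented) */

struct iova_magazine {
    unsigned long pfns[MAG_SIZE];
    int count;                  /* number of cached entries */
};

/* Stand-in for the global, lock-protected iova allocator. */
static unsigned long next_pfn = 0x1000;
static unsigned long global_alloc(void) { return next_pfn++; }
static void global_free(unsigned long pfn) { (void)pfn; }

/* Per-CPU pair of magazines: alloc and free both hit 'loaded'; swap
 * with 'prev' when one side runs empty/full, so the global allocator
 * (and its spinlock) is only touched once per MAG_SIZE operations. */
struct cpu_cache {
    struct iova_magazine loaded;
    struct iova_magazine prev;
};

static void swap_mags(struct cpu_cache *c)
{
    struct iova_magazine tmp = c->loaded;
    c->loaded = c->prev;
    c->prev = tmp;
}

static unsigned long iova_alloc(struct cpu_cache *c)
{
    if (c->loaded.count == 0) {
        if (c->prev.count > 0) {
            swap_mags(c);
        } else {
            /* Both magazines empty: refill one from the global pool. */
            for (int i = 0; i < MAG_SIZE; i++)
                c->loaded.pfns[c->loaded.count++] = global_alloc();
        }
    }
    return c->loaded.pfns[--c->loaded.count];   /* constant time */
}

static void iova_free(struct cpu_cache *c, unsigned long pfn)
{
    if (c->loaded.count == MAG_SIZE) {
        if (c->prev.count < MAG_SIZE) {
            swap_mags(c);
        } else {
            /* Both magazines full: drain one to the global pool. */
            while (c->prev.count > 0)
                global_free(c->prev.pfns[--c->prev.count]);
        }
    }
    c->loaded.pfns[c->loaded.count++] = pfn;    /* constant time */
}
```

The common case touches only the per-cpu 'loaded' magazine, which is
what makes the alloc/free fast path lock-free and constant-time.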
Yes, I know a typical per-cpu cache algorithm will free entries to the
per-cpu cache first. I didn't do it because it wasn't worthwhile in my
test. We definitely should do it if it's worthwhile, though. My point
is that we don't need to implement the per-cpu cache in such a
complicated way. It could simply be:

alloc:
- refill the per-cpu cache if it's empty
- alloc from the per-cpu cache

free:
- free to the per-cpu cache
- if the per-cpu cache hits a threshold, free back to the global pool

And please don't cache too much. DMA addresses below 4G are still
precious.

Thanks,
Shaohua
_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu
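[Editor's note: the simpler alloc/free scheme Shaohua outlines above can
be sketched in a few lines of single-threaded C. This is an editorial
illustration under invented names and sizes (CACHE_MAX, REFILL,
struct pcpu_cache), with a toy stack standing in for the global
rbtree-based iova allocator.]

```c
#include <assert.h>

#define CACHE_MAX 64    /* per-cpu threshold (invented) */
#define REFILL    16    /* batch size on refill (invented) */

/* Toy global pool: a stack of free pfns behind what would be a lock. */
static unsigned long gpool[1024];
static int gcount;
static unsigned long gnext = 0x1000;

static unsigned long global_alloc(void)
{
    return gcount ? gpool[--gcount] : gnext++;
}

static void global_free(unsigned long pfn)
{
    gpool[gcount++] = pfn;
}

struct pcpu_cache {
    unsigned long pfns[CACHE_MAX];
    int count;
};

static unsigned long pcpu_alloc(struct pcpu_cache *c)
{
    if (c->count == 0)                      /* refill if empty */
        for (int i = 0; i < REFILL; i++)
            c->pfns[c->count++] = global_alloc();
    return c->pfns[--c->count];             /* alloc from per-cpu cache */
}

static void pcpu_free(struct pcpu_cache *c, unsigned long pfn)
{
    c->pfns[c->count++] = pfn;              /* free to per-cpu cache */
    if (c->count == CACHE_MAX)              /* hit threshold: */
        while (c->count > CACHE_MAX / 2)    /* spill half to global */
            global_free(c->pfns[--c->count]);
}
```

Spilling only half the cache at the threshold (rather than all of it)
is one common way to avoid thrashing between the per-cpu and global
pools; the bounded CACHE_MAX also addresses the concern above about not
caching too many scarce below-4G addresses per cpu.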
