On Wed, Jan 20, 2016 at 01:14:35PM -0800, Shaohua Li wrote:

> > My understanding from the above is that the only issue with our
> > patchset was not dealing with pfn_limit.  I can just fix that and
> > repost, sounds good?
> 
> Sure, please do it. As for the patches, I'm not comfortable with the
> per-cpu deferred invalidation. One important benefit of the IOMMU is
> isolation. Deferred invalidation already loosens that isolation, and
> per-cpu invalidation loosens it further. It would be better if we could
> flush all per-cpu invalidation entries when one cpu hits its per-cpu
> limit. Also, you'd better look at CPU hotplug. We don't want to lose
> the invalidation entries if a cpu is hot removed.

I'll look into these.

> The per-cpu iova implementation looks unnecessarily complicated. I
> know you are referring to the paper, but the whole point is batch
> allocation/free.

Batched allocation/free isn't enough.  It still creates spinlock
contention, even with per-cpu invalidation (which gets rid of
async_umap_flush_lock).  Here are sample results from our memcached
test (throughput of querying 16 memcached instances on a 16-core box
with an Intel XL710 NIC):

      batched alloc/free, iommu=on:
      313,161 memcached transactions/sec (= 29% of iommu=off)

      batched alloc/free + per-cpu invalidations, iommu=on:
      434,590 memcached transactions/sec (= 40% of iommu=off)

      perf report:
      61.15%  0.33%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
               |
               ---_raw_spin_lock_irqsave
                  |
                  |--87.81%-- free_iova_array
                  |--11.71%-- alloc_iova

In contrast, the per-cpu magazine cache in our patchset enables iova
allocation/free to complete without accessing the iova allocator at
all.  So we don't touch the rbtree spinlock, and also complete iova
allocation in constant time, which avoids the linear-time allocations
that the iova allocator suffers from.  (These were described in the
paper "Efficient intra-operating system protection against harmful
DMAs", presented at the USENIX FAST 2015 conference.)  The end result:

      magazines cache + per-cpu invalidations, iommu=on:
      1,067,586 memcached transactions/sec (= 98% of iommu=off)

_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu
