On Fri, Jan 22, 2016 at 01:14:56AM +0200, Adam Morrison wrote:
> On Wed, Jan 20, 2016 at 01:14:35PM -0800, Shaohua Li wrote:
> 
> > > My understanding from the above is that the only issue with our
> > > patchset was not dealing with pfn_limit.  I can just fix that and
> > > repost, sounds good?
> > 
> > Sure, please do it. As for the patches, I'm not comfortable with the
> > per-cpu deferred invalidation. One important benefit of the IOMMU is
> > isolation. Deferred invalidation already weakens that isolation, and
> > per-cpu invalidation weakens it further. It would be better if we
> > could flush all per-cpu invalidation entries once one CPU hits its
> > per-cpu limit. You should also look at CPU hotplug: we don't want to
> > lose the invalidation entries when a CPU is hot-removed.
> 
> I'll look into these.
> 
> > The per-cpu iova implementation looks unnecessarily complicated. I
> > know you are referring to the paper, but the whole point is batched
> > allocation/free.
> 
> Batched allocation/free isn't enough.  It still creates spinlock
> contention, even if there is per-cpu invalidation (that gets rid of
> async_umap_flush_lock).  Here are sample results from our memcached
> test (throughput of querying 16 memcached instances on a 16-core box
> with an Intel XL710 NIC):
> 
>       batched alloc/free, iommu=on:
>       313,161 memcached transactions/sec (= 29% of iommu=off)
> 
>       batched alloc/free + per-cpu invalidations, iommu=on:
>       434,590 memcached transactions/sec (= 40% of iommu=off)
> 
>       perf report:
>       61.15%  0.33%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>                      |
>                      ---_raw_spin_lock_irqsave
>                         |
>                         |--87.81%-- free_iova_array
>                         |--11.71%-- alloc_iova
> 
> In contrast, the per-cpu magazine cache in our patchset enables iova
> allocation/free to complete without accessing the iova allocator at
> all.  So we don't touch the rbtree spinlock, and also complete iova
> allocation in constant time, which avoids the linear-time allocations
> that the iova allocator suffers from.  (These were described in the
> paper "Efficient intra-operating system protection against harmful
> DMAs", presented at the USENIX FAST 2015 conference.)  The end result:
> 
>       magazines cache + per-cpu invalidations, iommu=on:
>       1,067,586 memcached transactions/sec (= 98% of iommu=off)
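
[Editor's note: the magazine scheme described above can be sketched
roughly as follows. This is a hedged userspace illustration, not the
patchset's actual code; all names (magazine, cpu_cache, mag_alloc,
mag_free) and the global-allocator stand-in are hypothetical. The real
patchset operates on the kernel's iova rbtree allocator.]

```c
#include <assert.h>

/* Bonwick-style magazine cache sketch: each CPU keeps a "loaded"
 * magazine of free iova pfns plus one spare ("prev"), so alloc and
 * free normally complete without touching the global allocator (and
 * its spinlock) at all. */

#define MAG_SIZE 128          /* iovas per magazine (illustrative) */

struct magazine {
    int nr;                   /* number of cached entries */
    unsigned long pfns[MAG_SIZE];
};

struct cpu_cache {
    struct magazine *loaded;  /* magazine serving this CPU */
    struct magazine *prev;    /* spare magazine, avoids thrashing */
};

/* Stand-in for the global iova allocator (rbtree + lock in the
 * kernel); here it just hands out increasing pfns. */
static unsigned long next_pfn = 0x100000;
static unsigned long global_alloc(void) { return next_pfn++; }
static void global_free(unsigned long pfn) { (void)pfn; }

static unsigned long mag_alloc(struct cpu_cache *cc)
{
    if (cc->loaded->nr == 0 && cc->prev->nr > 0) {
        /* Swap loaded and previous: the classic magazine trick
         * that bounds how often we hit the global slow path. */
        struct magazine *tmp = cc->loaded;
        cc->loaded = cc->prev;
        cc->prev = tmp;
    }
    if (cc->loaded->nr > 0)
        return cc->loaded->pfns[--cc->loaded->nr];  /* fast path */
    return global_alloc();                          /* slow path */
}

static void mag_free(struct cpu_cache *cc, unsigned long pfn)
{
    if (cc->loaded->nr == MAG_SIZE && cc->prev->nr < MAG_SIZE) {
        struct magazine *tmp = cc->loaded;          /* swap again */
        cc->loaded = cc->prev;
        cc->prev = tmp;
    }
    if (cc->loaded->nr < MAG_SIZE)
        cc->loaded->pfns[cc->loaded->nr++] = pfn;   /* fast path */
    else
        global_free(pfn);                           /* slow path */
}
```

Both fast paths are O(1) array operations on per-cpu data, which is
how the magazine design avoids both the rbtree spinlock and the
allocator's linear-time search.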

Yes, I know a typical per-cpu cache algorithm frees entries to the
per-cpu cache first. I didn't do that because it wasn't worthwhile in
my tests. We should definitely do it if it proves worthwhile, though.
My point is that we don't need a per-cpu implementation that
complicated. It could simply be:

alloc:
 - refill per-cpu cache if it's empty
 - alloc from per-cpu cache

free:
 - free to per-cpu cache
 - if per-cpu cache hits threshold, free to global
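
[Editor's note: the simple scheme above can be sketched like this, as
a hedged userspace illustration. All names (pcpu_cache, pcpu_alloc,
pcpu_free, CACHE_MAX, REFILL) and the global-allocator stand-in are
hypothetical, and the threshold values are placeholders.]

```c
#include <assert.h>

/* Simple per-cpu cache: a flat per-cpu array, refilled from the
 * global allocator on empty, drained halfway back to global when
 * a threshold is hit on free. */

#define CACHE_MAX 64   /* per-cpu threshold; kept small because DMA
                        * addresses below 4G are precious */
#define REFILL    16   /* entries pulled from global when empty */

struct pcpu_cache {
    int nr;
    unsigned long pfns[CACHE_MAX];
};

/* Stand-in for the global iova allocator. */
static unsigned long next_pfn = 0x100000;
static unsigned long global_alloc(void) { return next_pfn++; }
static void global_free(unsigned long pfn) { (void)pfn; }

/* alloc: refill the per-cpu cache if empty, then allocate from it */
static unsigned long pcpu_alloc(struct pcpu_cache *pc)
{
    if (pc->nr == 0)
        while (pc->nr < REFILL)
            pc->pfns[pc->nr++] = global_alloc();
    return pc->pfns[--pc->nr];
}

/* free: put into the per-cpu cache; on threshold, drain half back
 * to the global allocator so no CPU hoards too many addresses */
static void pcpu_free(struct pcpu_cache *pc, unsigned long pfn)
{
    if (pc->nr == CACHE_MAX)
        while (pc->nr > CACHE_MAX / 2)
            global_free(pc->pfns[--pc->nr]);
    pc->pfns[pc->nr++] = pfn;
}
```

Draining only half on threshold (rather than the whole cache) keeps
some entries around for the next alloc burst; the exact fractions and
sizes are tuning knobs, not part of the scheme itself.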

And please don't cache too much. DMA addresses below 4G are still
precious.

Thanks,
Shaohua
_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu
