Hi Robin,

On 8/22/2017 11:17 AM, Robin Murphy wrote:
Hi all,

Just a quick repost of v2[1] with a small fix for the bug reported by Nate.
I tested the series and can confirm that the crash I reported on v2
no longer occurs with this version.

To recap, whilst this mostly only improves worst-case performance, those
worst-cases have a tendency to be pathologically bad:

Ard reports general desktop performance with Chromium on AMD Seattle going
from ~1-2FPS to perfectly usable.

Leizhen reports gigabit ethernet throughput going from ~6.5Mbit/s to line

I also inadvertently found that the HiSilicon hns_dsaf driver was taking ~35s
to probe simply because of the number of DMA buffers it maps on startup (perf
shows around 76% of that was spent under the lock in alloc_iova()). With this
series applied it takes a mere ~1s, mostly of unrelated mdelay()s, with
alloc_iova() entirely lost in the noise.

Are any of these cases PCI devices attached to domains that have run
out of 32-bit IOVAs and have to retry the allocation using dma_limit?

iommu_dma_alloc_iova() {
        if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))  [<- TRUE]
                iova = alloc_iova_fast(DMA_BIT_MASK(32));     [<- NULL]
        if (!iova)
                iova = alloc_iova_fast(dma_limit);            [<- OK  ]

I am asking because, when using 64k pages, the Mellanox CX4 adapter
exhausts the supply of 32-bit IOVAs simply allocating per-cpu IOVA space
during 'ifconfig up' so the code path outlined above is taken for
nearly all subsequent allocations. Although I do see a notable (~2x)
performance improvement with this series, I would still characterize it
as "pathologically bad" at < 10% of iommu passthrough performance.
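
To make the path above concrete, here is a small userspace toy model of
the two-step allocation. It is only a sketch: toy_domain, toy_alloc_below
and toy_dma_alloc are made-up names, not kernel APIs, and it assumes the
space above 4GiB never runs out. It just counts how many requests end up
needing the dma_limit retry once the 32-bit space is gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these names are kernel APIs. */
#define TOY_MASK_32 0xffffffffULL   /* stands in for DMA_BIT_MASK(32) */

struct toy_domain {
	uint64_t low_free;        /* bytes still unallocated below 4GiB */
	unsigned long low_hits;   /* requests satisfied by the 32-bit attempt */
	unsigned long fallbacks;  /* requests that needed the dma_limit retry */
};

/* Returns true if size bytes fit under limit in this toy model. */
static bool toy_alloc_below(struct toy_domain *d, uint64_t size, uint64_t limit)
{
	if (limit > TOY_MASK_32)
		return true;      /* assumption: 64-bit space is never exhausted */
	if (d->low_free < size)
		return false;
	d->low_free -= size;
	return true;
}

/* Same shape as the path above: 32-bit first for PCI, then dma_limit. */
static bool toy_dma_alloc(struct toy_domain *d, uint64_t size,
			  uint64_t dma_limit, bool is_pci)
{
	assert(size > 0);
	if (dma_limit > TOY_MASK_32 && is_pci) {
		if (toy_alloc_below(d, size, TOY_MASK_32)) {
			d->low_hits++;
			return true;
		}
		d->fallbacks++;   /* 32-bit attempt failed, retry at dma_limit */
	}
	return toy_alloc_below(d, size, dma_limit);
}
```

With room for only three 64k buffers below 4GiB, ten PCI requests give
three 32-bit hits and seven fallbacks, i.e. every request after
exhaustion pays for a failed 32-bit attempt first.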

This was a bit surprising to me as I thought the iova_rcache would
have eliminated the need to walk the rbtree for runtime allocations.
Unfortunately, it looks like the failed attempt to allocate a 32-bit
IOVA actually drops the cached IOVAs that we could have used when
subsequently performing the allocation at dma_limit.

alloc_iova_fast() {
        iova_pfn = iova_rcache_get(...);     [<- Fail, no 32-bit IOVAs]
        if (iova_pfn)
                return iova_pfn;

        new_iova = alloc_iova(...);          [<- Fail, no 32-bit IOVAs]
        if (!new_iova) {
                unsigned int cpu;

                if (flushed_rcache)
                        return 0;

                /* Try replenishing IOVAs by flushing rcache. */
                flushed_rcache = true;
                for_each_online_cpu(cpu)
                        free_cpu_cached_iovas(cpu, iovad);     [<- :( ]
                goto retry;

As an experiment, I added code to skip the rcache flushing/retry for
the 32-bit allocations. In this configuration, 100% of passthrough mode
performance was achieved. I made the same change in the baseline and
measured performance at ~95% of passthrough mode.
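
For illustration, the effect can be reproduced with a userspace toy model
(again a sketch, not the kernel code: every toy_* name is made up, the
cache is a single small stack rather than per-cpu magazines, the 32-bit
space is assumed to be already exhausted, and a "flush" simply drops the
cached entries). It runs map/unmap cycles and counts how many dma_limit
allocations are served from the cache, with and without the flush on the
failed 32-bit attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: illustrative names, not kernel APIs. */
#define TOY_LIMIT_32  0xfffffULL  /* stand-in for the highest 32-bit pfn */
#define TOY_CACHE_MAX 8

struct toy_iovad {
	uint64_t cached[TOY_CACHE_MAX]; /* recently freed pfns (the "rcache") */
	size_t ncached;
	uint64_t next_high_pfn;         /* bump allocator above TOY_LIMIT_32 */
	unsigned long cache_hits;
};

/* Pop the most recently cached pfn if it fits under limit_pfn; 0 on miss. */
static uint64_t toy_rcache_get(struct toy_iovad *d, uint64_t limit_pfn)
{
	if (d->ncached == 0 || d->cached[d->ncached - 1] > limit_pfn)
		return 0;
	d->cache_hits++;
	return d->cached[--d->ncached];
}

/* Slow path. Assumption: no 32-bit pfns are left, so low requests fail. */
static uint64_t toy_slow_alloc(struct toy_iovad *d, uint64_t limit_pfn)
{
	if (limit_pfn <= TOY_LIMIT_32)
		return 0;
	return d->next_high_pfn++;
}

static uint64_t toy_alloc_fast(struct toy_iovad *d, uint64_t limit_pfn,
			       bool flush_on_fail)
{
	uint64_t pfn = toy_rcache_get(d, limit_pfn);

	if (pfn)
		return pfn;
	pfn = toy_slow_alloc(d, limit_pfn);
	if (!pfn && flush_on_fail) {
		d->ncached = 0;         /* flush: cached pfns are dropped */
		pfn = toy_slow_alloc(d, limit_pfn);
	}
	return pfn;
}

static void toy_free_fast(struct toy_iovad *d, uint64_t pfn)
{
	if (d->ncached < TOY_CACHE_MAX)
		d->cached[d->ncached++] = pfn;
}

/* Run n map/unmap cycles; return how many allocations hit the cache. */
static unsigned long toy_run(unsigned int n, bool skip_flush_32)
{
	struct toy_iovad d = { .ncached = 0, .cache_hits = 0,
			       .next_high_pfn = TOY_LIMIT_32 + 1 };

	assert(n > 0);
	for (unsigned int i = 0; i < n; i++) {
		uint64_t pfn = toy_alloc_fast(&d, TOY_LIMIT_32, !skip_flush_32);

		if (!pfn)               /* retry at dma_limit */
			pfn = toy_alloc_fast(&d, UINT64_MAX, true);
		toy_free_fast(&d, pfn);
	}
	return d.cache_hits;
}
```

In this model toy_run(100, false) never hits the cache, because each
failed 32-bit attempt empties it just before the dma_limit allocation,
whereas toy_run(100, true) hits the cache on every cycle after the first.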

I also got similar results by altogether removing the 32-bit allocation
from iommu_dma_alloc_iova() which makes me wonder why we even bother.
What (PCIe) workloads have been shown to actually benefit from it?

Tested-by: Nate Watterson <nwatt...@codeaurora.org>


[1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg19139.html

Robin Murphy (1):
   iommu/iova: Extend rbtree node caching

Zhen Lei (3):
   iommu/iova: Optimise rbtree searching
   iommu/iova: Optimise the padding calculation
   iommu/iova: Make dma_32bit_pfn implicit

  drivers/gpu/drm/tegra/drm.c      |   3 +-
  drivers/gpu/host1x/dev.c         |   3 +-
  drivers/iommu/amd_iommu.c        |   7 +--
  drivers/iommu/dma-iommu.c        |  18 +------
  drivers/iommu/intel-iommu.c      |  11 ++--
  drivers/iommu/iova.c             | 114 +++++++++++++++++----------------------
  drivers/misc/mic/scif/scif_rma.c |   3 +-
  include/linux/iova.h             |   8 +--
  8 files changed, 62 insertions(+), 105 deletions(-)

Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.