As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we found is that the cmpxchg()
loop to get space on the queue takes approximately 25% of the cycles for
this function.

The cmpxchg() is removed as follows:
- We assume that the cmdq can never fill, by making changes to limit the
  batch size (where necessary) and to always issue a CMD_SYNC for a
  batch. We need to do this since we no longer maintain the cons value
  in software, and so we cannot properly deal with there being no
  available space.
- Replace the cmpxchg() with an atomic inc operation to maintain the
  prod and owner values.
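
To illustrate the second point, here is a minimal userspace sketch in
C11 atomics. It is not the driver code: struct sketch_queue, Q_ENTRIES
and both helper names are hypothetical, and the owner tracking is left
out. It only contrasts a compare-and-exchange retry loop with a single
atomic add for reserving space on a shared producer counter.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical queue size for this sketch; must be a power of two. */
#define Q_ENTRIES 256u

/* Hypothetical stand-in for the queue state: a free-running counter. */
struct sketch_queue {
	_Atomic uint32_t prod;
};

/*
 * cmpxchg-style reservation: read prod, then retry until our update
 * lands. Under contention every failed attempt costs another round trip.
 */
static uint32_t reserve_cmpxchg(struct sketch_queue *q, uint32_t n)
{
	uint32_t old = atomic_load_explicit(&q->prod, memory_order_relaxed);

	/* On failure, 'old' is refreshed with the current value of prod. */
	while (!atomic_compare_exchange_weak_explicit(&q->prod, &old, old + n,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;

	return old % Q_ENTRIES;	/* index of the first reserved slot */
}

/*
 * Atomic-add reservation: one operation, no retry loop. Note there is
 * no check for available space here, so the caller must guarantee by
 * other means that the queue cannot fill.
 */
static uint32_t reserve_fetch_add(struct sketch_queue *q, uint32_t n)
{
	uint32_t old = atomic_fetch_add_explicit(&q->prod, n,
						 memory_order_relaxed);

	return old % Q_ENTRIES;	/* index of the first reserved slot */
}
```

Both helpers hand back the same start index for the same sequence of
callers. The difference is that the atomic-add version cannot refuse a
reservation, which is why the first point above is needed.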

Early experiments have shown that we may see a 25% boost in throughput
(IOPS) for my NVMe test with these changes. Also, some CPUs which were
loaded at ~55% now see a ~45% load.

So, even though the changes are incomplete and other parts of the driver
will need fixing up (and it looks like it may be broken for !MSI
support), the performance boost seen would seem to make this worth
exploring.

Comments requested please.

Thanks

[0] 
https://lore.kernel.org/linux-iommu/b926444035e5e2439431908e3842afd24b8...@dggemi525-mbs.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3

John Garry (2):
  iommu/arm-smmu-v3: Calculate bits for prod and owner
  iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

 drivers/iommu/arm-smmu-v3.c | 92 +++++++++++++++++++++++----------------------
 1 file changed, 47 insertions(+), 45 deletions(-)

-- 
2.16.4
