On 2021-01-25 19:31, Chuck Lever wrote:
On Jan 25, 2021, at 12:39 PM, Chuck Lever <[email protected]> wrote:
Hello Lu -
Many thanks for your prototype.
On Jan 24, 2021, at 9:38 PM, Lu Baolu <[email protected]> wrote:
This patch series is for Request-For-Testing purposes only. It aims to
fix the performance regression reported here.
https://lore.kernel.org/linux-iommu/[email protected]/
The first two patches are borrowed from here.
https://lore.kernel.org/linux-iommu/[email protected]/
Please kindly help with verification.
Best regards,
baolu
Lu Baolu (1):
iommu/vt-d: Add iotlb_sync_map callback
Yong Wu (2):
iommu: Move iotlb_sync_map out from __iommu_map
iommu: Add iova and size as parameters in iotlb_sync_map
drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
drivers/iommu/iommu.c | 23 +++++++---
drivers/iommu/tegra-gart.c | 7 ++-
include/linux/iommu.h | 3 +-
4 files changed, 83 insertions(+), 36 deletions(-)
Here are results with the NFS client at stock v5.11-rc5 and the
NFS server at v5.10, showing the regression I reported earlier.
Children see throughput for 12 initial writers = 4534582.00 kB/sec
Parent sees throughput for 12 initial writers = 4458145.56 kB/sec
Min throughput per process = 373101.59 kB/sec
Max throughput per process = 382669.50 kB/sec
Avg throughput per process = 377881.83 kB/sec
Min xfer = 1022720.00 kB
CPU Utilization: Wall time 2.787 CPU time 1.922 CPU utilization 68.95 %
Children see throughput for 12 rewriters = 4542003.12 kB/sec
Parent sees throughput for 12 rewriters = 4538024.19 kB/sec
Min throughput per process = 374672.00 kB/sec
Max throughput per process = 383983.78 kB/sec
Avg throughput per process = 378500.26 kB/sec
Min xfer = 1022976.00 kB
CPU utilization: Wall time 2.733 CPU time 1.947 CPU utilization 71.25 %
Children see throughput for 12 readers = 4568632.03 kB/sec
Parent sees throughput for 12 readers = 4563672.02 kB/sec
Min throughput per process = 376727.56 kB/sec
Max throughput per process = 383783.91 kB/sec
Avg throughput per process = 380719.34 kB/sec
Min xfer = 1029376.00 kB
CPU utilization: Wall time 2.733 CPU time 1.898 CPU utilization 69.46 %
Children see throughput for 12 re-readers = 4610702.78 kB/sec
Parent sees throughput for 12 re-readers = 4606135.66 kB/sec
Min throughput per process = 381532.78 kB/sec
Max throughput per process = 387072.53 kB/sec
Avg throughput per process = 384225.23 kB/sec
Min xfer = 1034496.00 kB
CPU utilization: Wall time 2.711 CPU time 1.910 CPU utilization 70.45 %
Here's the NFS client at v5.11-rc5 with your series applied.
The NFS server remains at v5.10:
Children see throughput for 12 initial writers = 4434778.81 kB/sec
Parent sees throughput for 12 initial writers = 4408190.69 kB/sec
Min throughput per process = 367865.28 kB/sec
Max throughput per process = 371134.38 kB/sec
Avg throughput per process = 369564.90 kB/sec
Min xfer = 1039360.00 kB
CPU Utilization: Wall time 2.842 CPU time 1.904 CPU utilization 66.99 %
Children see throughput for 12 rewriters = 4476870.69 kB/sec
Parent sees throughput for 12 rewriters = 4471701.48 kB/sec
Min throughput per process = 370985.34 kB/sec
Max throughput per process = 374752.28 kB/sec
Avg throughput per process = 373072.56 kB/sec
Min xfer = 1038592.00 kB
CPU utilization: Wall time 2.801 CPU time 1.902 CPU utilization 67.91 %
Children see throughput for 12 readers = 5865268.88 kB/sec
Parent sees throughput for 12 readers = 5854519.73 kB/sec
Min throughput per process = 487766.81 kB/sec
Max throughput per process = 489623.88 kB/sec
Avg throughput per process = 488772.41 kB/sec
Min xfer = 1044736.00 kB
CPU utilization: Wall time 2.144 CPU time 1.895 CPU utilization 88.41 %
Children see throughput for 12 re-readers = 5847438.62 kB/sec
Parent sees throughput for 12 re-readers = 5839292.18 kB/sec
Min throughput per process = 485835.03 kB/sec
Max throughput per process = 488702.12 kB/sec
Avg throughput per process = 487286.55 kB/sec
Min xfer = 1042688.00 kB
CPU utilization: Wall time 2.148 CPU time 1.909 CPU utilization 88.84 %
NFS READ throughput is almost fully restored. A normal-looking throughput
result, copied from the previous thread, is:
Children see throughput for 12 readers = 5921370.94 kB/sec
Parent sees throughput for 12 readers = 5914106.69 kB/sec
The NFS WRITE throughput result appears to be unchanged, or slightly
worse than before. I don't have an explanation for this result. I also
applied your patches on the NFS server, without seeing any improvement.
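
For a rough sense of scale, the figures quoted above can be turned into ratios. This is only a sketch in Python: the variable names and the helper are illustrative, and the numbers are the "Children see throughput" values copied from this thread.

```python
# "Children see throughput" values (kB/sec) copied from the results in this thread.
baseline_read = 5921370.94   # normal-looking 12-reader result from the previous thread
stock_read = 4568632.03      # v5.11-rc5 client, unpatched
patched_read = 5865268.88    # v5.11-rc5 client with the series applied
stock_write = 4534582.00     # 12 initial writers, unpatched
patched_write = 4434778.81   # 12 initial writers with the series applied


def percent_of(value, reference):
    """Return value expressed as a percentage of reference."""
    return 100.0 * value / reference


read_before = percent_of(stock_read, baseline_read)
read_after = percent_of(patched_read, baseline_read)
write_change = percent_of(patched_write, stock_write) - 100.0

print(f"READ, stock:   {read_before:.1f}% of the normal-looking result")
print(f"READ, patched: {read_after:.1f}% of the normal-looking result")
print(f"WRITE, patched vs. stock: {write_change:+.1f}%")
```

By this arithmetic the patched READ result is within about 1% of the normal-looking figure, while the initial-writer result is roughly 2% below the stock v5.11-rc5 run.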
Function-boundary tracing shows some interesting results.
# trace-cmd record -e rpcrdma -e iommu -p function_graph --max-graph-depth=5 -g dma_map_sg_attrs
Some 120KB SGLs are DMA-mapped in a single call to __iommu_map(). Other SGLs of
the same size need as many as one __iommu_map() call per SGL element (which
would be 30 for a 120KB SGL).
In v5.10, intel_map_sg() was structured such that an SGL is always handled with
a single call to domain_mapping() and thus always just a single TLB flush.
Implementing .iotlb_sync_map means that a single top-level
iommu_map()/iommu_map_sg() call should still only invoke a single "TLB
flush" (really, any maintenance required for the IOMMU to start using
the new mapping) at the end, regardless of how many internal
__iommu_map() calls are made to satisfy the overall request. If you're
seeing something other than that behaviour (with this series), that
implies we've not got things quite right yet.
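
The expected behaviour described above can be illustrated with a toy model. This is a sketch in Python rather than kernel C, and it is not the actual IOMMU core code: the class, its method names, and the counters are all hypothetical stand-ins for iommu_map_sg(), __iommu_map() and the .iotlb_sync_map callback.

```python
PAGE_SIZE = 4096


class ToyDomain:
    """Hypothetical model of a domain that counts map and sync calls."""

    def __init__(self):
        self.map_calls = 0
        self.sync_calls = 0
        self.synced_ranges = []

    def _map(self, iova, paddr, size):
        # Stand-in for the internal __iommu_map(): page-table update only,
        # no IOTLB maintenance here.
        self.map_calls += 1

    def _sync_map(self, iova, size):
        # Stand-in for the driver's .iotlb_sync_map callback.
        self.sync_calls += 1
        self.synced_ranges.append((iova, size))

    def map_sg(self, iova, segments):
        # Stand-in for the top-level call: map every segment, then
        # perform one sync covering the whole mapped range.
        start = iova
        for paddr, length in segments:
            self._map(iova, paddr, length)
            iova += length
        self._sync_map(start, iova - start)
        return iova - start


# 30 non-contiguous 4KB segments, i.e. a 120KB SGL that cannot be merged.
segments = [(0x100000 + 2 * i * PAGE_SIZE, PAGE_SIZE) for i in range(30)]
domain = ToyDomain()
mapped = domain.map_sg(0x4000_0000, segments)
print(domain.map_calls, domain.sync_calls, mapped)
```

In this model the number of internal map calls tracks the number of segments, but the sync count stays at one per top-level call. A function_graph trace that shows more than one flush per dma_map_sg_attrs() would not match this picture.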
My amateur theorizing suggests that the SGL element coalescing done in
__iommu_map_sg() is not working as well as intel_map_sg() used to, which results
in more calls to domain_mapping(). Not only does that take longer, but it creates
many more DMA maps. Could that also have some impact on device TLB resources?
FWIW the old __domain_mapping() code just did a dumb iteration over the
scatterlist segments internally, so __iommu_map_sg() should be no worse
in that regard, and could in principle even be better if it's able to
coalesce things far enough to start fitting large page mappings. The
only appreciable difference *should* be any additional self-time in
__iommu_map() due to the iteration now being performed one layer up.
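
To make the coalescing point concrete, here is a minimal Python sketch of merging physically contiguous scatterlist segments. It is an assumption-laden illustration, not the kernel's __iommu_map_sg() logic: the coalesce() helper is hypothetical, the addresses are made up, and the 4KB element size is inferred from the "30 for a 120KB SGL" figure earlier in the thread.

```python
def coalesce(segments):
    """Merge physically contiguous (paddr, length) segments into runs."""
    runs = []
    for paddr, length in segments:
        if runs and runs[-1][0] + runs[-1][1] == paddr:
            # This segment starts where the previous run ends: extend the run.
            prev_paddr, prev_len = runs[-1]
            runs[-1] = (prev_paddr, prev_len + length)
        else:
            # Physical gap (or first segment): start a new run.
            runs.append((paddr, length))
    return runs


PAGE = 4096
# Hypothetical 120KB SGLs: one physically contiguous, one with gaps.
contiguous = [(0x100000 + i * PAGE, PAGE) for i in range(30)]
scattered = [(0x100000 + 2 * i * PAGE, PAGE) for i in range(30)]

contiguous_runs = coalesce(contiguous)
scattered_runs = coalesce(scattered)
print(len(contiguous_runs), len(scattered_runs))
```

Each run would correspond to one internal mapping call in this model, which is consistent with the trace showing anywhere from one to 30 __iommu_map() calls for SGLs of the same size, depending on how the underlying pages happen to be laid out.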
Is there any significant difference between how the NFS read and write
paths make their DMA API calls and/or get their scatterlists in the
first place, that might help shed some light on the curious
half-recovery you got?
Thanks,
Robin.