> On Jan 25, 2021, at 12:39 PM, Chuck Lever <[email protected]> wrote:
>
> Hello Lu -
>
> Many thanks for your prototype.
>
>
>> On Jan 24, 2021, at 9:38 PM, Lu Baolu <[email protected]> wrote:
>>
>> This patch series is only for Request-For-Testing purpose. It aims to
>> fix the performance regression reported here.
>>
>> https://lore.kernel.org/linux-iommu/[email protected]/
>>
>> The first two patches are borrowed from here.
>>
>> https://lore.kernel.org/linux-iommu/[email protected]/
>>
>> Please kindly help to verification.
>>
>> Best regards,
>> baolu
>>
>> Lu Baolu (1):
>> iommu/vt-d: Add iotlb_sync_map callback
>>
>> Yong Wu (2):
>> iommu: Move iotlb_sync_map out from __iommu_map
>> iommu: Add iova and size as parameters in iotlb_sync_map
>>
>> drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
>> drivers/iommu/iommu.c | 23 +++++++---
>> drivers/iommu/tegra-gart.c | 7 ++-
>> include/linux/iommu.h | 3 +-
>> 4 files changed, 83 insertions(+), 36 deletions(-)
>
> Here are results with the NFS client at stock v5.11-rc5 and the
> NFS server at v5.10, showing the regression I reported earlier.
>
> Children see throughput for 12 initial writers = 4534582.00 kB/sec
> Parent sees throughput for 12 initial writers = 4458145.56 kB/sec
> Min throughput per process = 373101.59 kB/sec
> Max throughput per process = 382669.50 kB/sec
> Avg throughput per process = 377881.83 kB/sec
> Min xfer = 1022720.00 kB
> CPU Utilization: Wall time 2.787 CPU time 1.922 CPU
> utilization 68.95 %
>
>
> Children see throughput for 12 rewriters = 4542003.12 kB/sec
> Parent sees throughput for 12 rewriters = 4538024.19 kB/sec
> Min throughput per process = 374672.00 kB/sec
> Max throughput per process = 383983.78 kB/sec
> Avg throughput per process = 378500.26 kB/sec
> Min xfer = 1022976.00 kB
> CPU utilization: Wall time 2.733 CPU time 1.947 CPU
> utilization 71.25 %
>
>
> Children see throughput for 12 readers = 4568632.03 kB/sec
> Parent sees throughput for 12 readers = 4563672.02 kB/sec
> Min throughput per process = 376727.56 kB/sec
> Max throughput per process = 383783.91 kB/sec
> Avg throughput per process = 380719.34 kB/sec
> Min xfer = 1029376.00 kB
> CPU utilization: Wall time 2.733 CPU time 1.898 CPU
> utilization 69.46 %
>
>
> Children see throughput for 12 re-readers = 4610702.78 kB/sec
> Parent sees throughput for 12 re-readers = 4606135.66 kB/sec
> Min throughput per process = 381532.78 kB/sec
> Max throughput per process = 387072.53 kB/sec
> Avg throughput per process = 384225.23 kB/sec
> Min xfer = 1034496.00 kB
> CPU utilization: Wall time 2.711 CPU time 1.910 CPU
> utilization 70.45 %
>
> Here's the NFS client at v5.11-rc5 with your series applied.
> The NFS server remains at v5.10:
>
> Children see throughput for 12 initial writers = 4434778.81 kB/sec
> Parent sees throughput for 12 initial writers = 4408190.69 kB/sec
> Min throughput per process = 367865.28 kB/sec
> Max throughput per process = 371134.38 kB/sec
> Avg throughput per process = 369564.90 kB/sec
> Min xfer = 1039360.00 kB
> CPU Utilization: Wall time 2.842 CPU time 1.904 CPU
> utilization 66.99 %
>
>
> Children see throughput for 12 rewriters = 4476870.69 kB/sec
> Parent sees throughput for 12 rewriters = 4471701.48 kB/sec
> Min throughput per process = 370985.34 kB/sec
> Max throughput per process = 374752.28 kB/sec
> Avg throughput per process = 373072.56 kB/sec
> Min xfer = 1038592.00 kB
> CPU utilization: Wall time 2.801 CPU time 1.902 CPU
> utilization 67.91 %
>
>
> Children see throughput for 12 readers = 5865268.88 kB/sec
> Parent sees throughput for 12 readers = 5854519.73 kB/sec
> Min throughput per process = 487766.81 kB/sec
> Max throughput per process = 489623.88 kB/sec
> Avg throughput per process = 488772.41 kB/sec
> Min xfer = 1044736.00 kB
> CPU utilization: Wall time 2.144 CPU time 1.895 CPU
> utilization 88.41 %
>
>
> Children see throughput for 12 re-readers = 5847438.62 kB/sec
> Parent sees throughput for 12 re-readers = 5839292.18 kB/sec
> Min throughput per process = 485835.03 kB/sec
> Max throughput per process = 488702.12 kB/sec
> Avg throughput per process = 487286.55 kB/sec
> Min xfer = 1042688.00 kB
> CPU utilization: Wall time 2.148 CPU time 1.909 CPU
> utilization 88.84 %
>
> NFS READ throughput is almost fully restored. A normal-looking throughput
> result, copied from the previous thread, is:
>
> Children see throughput for 12 readers = 5921370.94 kB/sec
> Parent sees throughput for 12 readers = 5914106.69 kB/sec
>
> The NFS WRITE throughput result appears to be unchanged, or slightly
> worse than before. I don't have an explanation for this result. I applied
> your patches on the NFS server also without noting improvement.

Function-boundary tracing shows some interesting results.

# trace-cmd record -e rpcrdma -e iommu -p function_graph --max-graph-depth=5 -g dma_map_sg_attrs

Some 120KB SGLs are DMA-mapped in a single call to __iommu_map(). Other
SGLs of the same size need as many as one __iommu_map() call per SGL
element (which would be 30 for a 120KB SGL).

In v5.10, intel_map_sg() was structured such that an SGL is always
handled with a single call to domain_mapping() and thus always just a
single TLB flush.

My amateur theorizing suggests that the SGL element coalescing done in
__iommu_map_sg() is not working as well as intel_map_sg() used to, which
results in more calls to domain_mapping(). Not only does that take
longer, but it creates many more DMA maps. Could that also have some
impact on device TLB resources?

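To make the coalescing concern concrete, here is a minimal userspace
sketch of how the number of map calls can depend on physical contiguity.
It is not the kernel's code: struct seg and count_map_calls() are
hypothetical names, and the sketch assumes a loop that merges adjacent
elements only when each one starts exactly where the previous one ended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one SGL element (not struct scatterlist). */
struct seg {
	uint64_t phys;	/* physical start address */
	uint64_t len;	/* length in bytes */
};

/*
 * Count how many map calls a coalescing loop would issue. Adjacent
 * elements are merged into one run only while each element begins
 * exactly at the end of the run built so far.
 */
static size_t count_map_calls(const struct seg *segs, size_t nents)
{
	size_t calls = 0;
	uint64_t start = 0, len = 0;

	assert(segs != NULL || nents == 0);

	for (size_t i = 0; i < nents; i++) {
		if (len && segs[i].phys != start + len) {
			calls++;	/* run ended: one map call for it */
			len = 0;
		}
		if (!len)
			start = segs[i].phys;	/* begin a new run */
		len += segs[i].len;
	}
	if (len)
		calls++;	/* map the final run */

	return calls;
}
```

Under these assumptions, 30 physically contiguous 4KB elements collapse
into a single map call, while 30 elements that each start 8KB apart need
30 calls, which matches the two extremes seen in the trace.
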
--
Chuck Lever