> On Jan 12, 2021, at 9:25 PM, Lu Baolu <[email protected]> wrote:
> 
> Hi,
> 
> On 1/12/21 10:38 PM, Will Deacon wrote:
>> [Expanding cc list to include DMA-IOMMU and intel IOMMU folks]
>> On Fri, Jan 08, 2021 at 04:18:36PM -0500, Chuck Lever wrote:
>>> Hi-
>>> 
>>> [ Please cc: me on replies, I'm not currently subscribed to
>>> iommu@lists ].
>>> 
>>> I'm running NFS performance tests on InfiniBand using CX-3 Pro cards
>>> at 56Gb/s. The test is iozone on an NFSv3/RDMA mount:
>>> 
>>> /home/cel/bin/iozone -M -+u -i0 -i1 -s1g -r256k -t12 -I
>>> 
>>> For those not familiar with the way storage protocols use RDMA, The
>>> initiator/client sets up memory regions and the target/server uses
>>> RDMA Read and Write to move data out of and into those regions. The
>>> initiator/client uses only RDMA memory registration and invalidation
>>> operations, and the target/server uses RDMA Read and Write.
>>> 
>>> My NFS client is a two-socket 12-core x86_64 system with its I/O MMU
>>> enabled using the kernel command line options "intel_iommu=on
>>> iommu=strict".
>>> 
>>> Recently I've noticed a significant (25-30%) loss in NFS throughput.
>>> I was able to bisect on my client to the following commits.
>>> 
>>> Here's 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
>>> map_sg"). This is about normal for this test.
>>> 
>>>     Children see throughput for 12 initial writers  = 4732581.09 kB/sec
>>>     Parent sees throughput for 12 initial writers   = 4646810.21 kB/sec
>>>     Min throughput per process                      =  387764.34 kB/sec
>>>     Max throughput per process                      =  399655.47 kB/sec
>>>     Avg throughput per process                      =  394381.76 kB/sec
>>>     Min xfer                                        = 1017344.00 kB
>>>     CPU Utilization: Wall time    2.671    CPU time    1.974    CPU 
>>> utilization  73.89 %
>>>     Children see throughput for 12 rewriters        = 4837741.94 kB/sec
>>>     Parent sees throughput for 12 rewriters         = 4833509.35 kB/sec
>>>     Min throughput per process                      =  398983.72 kB/sec
>>>     Max throughput per process                      =  406199.66 kB/sec
>>>     Avg throughput per process                      =  403145.16 kB/sec
>>>     Min xfer                                        = 1030656.00 kB
>>>     CPU utilization: Wall time    2.584    CPU time    1.959    CPU 
>>> utilization  75.82 %
>>>     Children see throughput for 12 readers          = 5921370.94 kB/sec
>>>     Parent sees throughput for 12 readers           = 5914106.69 kB/sec
>>>     Min throughput per process                      =  491812.38 kB/sec
>>>     Max throughput per process                      =  494777.28 kB/sec
>>>     Avg throughput per process                      =  493447.58 kB/sec
>>>     Min xfer                                        = 1042688.00 kB
>>>     CPU utilization: Wall time    2.122    CPU time    1.968    CPU 
>>> utilization  92.75 %
>>>     Children see throughput for 12 re-readers       = 5947985.69 kB/sec
>>>     Parent sees throughput for 12 re-readers        = 5941348.51 kB/sec
>>>     Min throughput per process                      =  492805.81 kB/sec
>>>     Max throughput per process                      =  497280.19 kB/sec
>>>     Avg throughput per process                      =  495665.47 kB/sec
>>>     Min xfer                                        = 1039360.00 kB
>>>     CPU utilization: Wall time    2.111    CPU time    1.968    CPU 
>>> utilization  93.22 %
>>> 
>>> Here's c062db039f40 ("iommu/vt-d: Update domain geometry in
>>> iommu_ops.at(de)tach_dev"). It's losing some steam here.
>>> 
>>>     Children see throughput for 12 initial writers  = 4342419.12 kB/sec
>>>     Parent sees throughput for 12 initial writers   = 4310612.79 kB/sec
>>>     Min throughput per process                      =  359299.06 kB/sec
>>>     Max throughput per process                      =  363866.16 kB/sec
>>>     Avg throughput per process                      =  361868.26 kB/sec
>>>     Min xfer                                        = 1035520.00 kB
>>>     CPU Utilization: Wall time    2.902    CPU time    1.951    CPU 
>>> utilization  67.22 %
>>>     Children see throughput for 12 rewriters        = 4408576.66 kB/sec
>>>     Parent sees throughput for 12 rewriters         = 4404280.87 kB/sec
>>>     Min throughput per process                      =  364553.88 kB/sec
>>>     Max throughput per process                      =  370029.28 kB/sec
>>>     Avg throughput per process                      =  367381.39 kB/sec
>>>     Min xfer                                        = 1033216.00 kB
>>>     CPU utilization: Wall time    2.836    CPU time    1.956    CPU 
>>> utilization  68.97 %
>>>     Children see throughput for 12 readers          = 5406879.47 kB/sec
>>>     Parent sees throughput for 12 readers           = 5401862.78 kB/sec
>>>     Min throughput per process                      =  449583.03 kB/sec
>>>     Max throughput per process                      =  451761.69 kB/sec
>>>     Avg throughput per process                      =  450573.29 kB/sec
>>>     Min xfer                                        = 1044224.00 kB
>>>     CPU utilization: Wall time    2.323    CPU time    1.977    CPU 
>>> utilization  85.12 %
>>>     Children see throughput for 12 re-readers       = 5410601.12 kB/sec
>>>     Parent sees throughput for 12 re-readers        = 5403504.40 kB/sec
>>>     Min throughput per process                      =  449918.12 kB/sec
>>>     Max throughput per process                      =  452489.28 kB/sec
>>>     Avg throughput per process                      =  450883.43 kB/sec
>>>     Min xfer                                        = 1043456.00 kB
>>>     CPU utilization: Wall time    2.321    CPU time    1.978    CPU 
>>> utilization  85.21 %
>>> 
>>> And here's c588072bba6b ("iommu/vt-d: Convert intel iommu driver to
>>> the iommu ops"). Significant throughput loss.
>>> 
>>>     Children see throughput for 12 initial writers  = 3812036.91 kB/sec
>>>     Parent sees throughput for 12 initial writers   = 3753683.40 kB/sec
>>>     Min throughput per process                      =  313672.25 kB/sec
>>>     Max throughput per process                      =  321719.44 kB/sec
>>>     Avg throughput per process                      =  317669.74 kB/sec
>>>     Min xfer                                        = 1022464.00 kB
>>>     CPU Utilization: Wall time    3.309    CPU time    1.986    CPU 
>>> utilization  60.02 %
>>>     Children see throughput for 12 rewriters        = 3786831.94 kB/sec
>>>     Parent sees throughput for 12 rewriters         = 3783205.58 kB/sec
>>>     Min throughput per process                      =  313654.44 kB/sec
>>>     Max throughput per process                      =  317844.50 kB/sec
>>>     Avg throughput per process                      =  315569.33 kB/sec
>>>     Min xfer                                        = 1035520.00 kB
>>>     CPU utilization: Wall time    3.302    CPU time    1.945    CPU 
>>> utilization  58.90 %
>>>     Children see throughput for 12 readers          = 4265828.28 kB/sec
>>>     Parent sees throughput for 12 readers           = 4261844.88 kB/sec
>>>     Min throughput per process                      =  352305.00 kB/sec
>>>     Max throughput per process                      =  357726.22 kB/sec
>>>     Avg throughput per process                      =  355485.69 kB/sec
>>>     Min xfer                                        = 1032960.00 kB
>>>     CPU utilization: Wall time    2.934    CPU time    1.942    CPU 
>>> utilization  66.20 %
>>>     Children see throughput for 12 re-readers       = 4220651.19 kB/sec
>>>     Parent sees throughput for 12 re-readers        = 4216096.04 kB/sec
>>>     Min throughput per process                      =  348677.16 kB/sec
>>>     Max throughput per process                      =  353467.44 kB/sec
>>>     Avg throughput per process                      =  351720.93 kB/sec
>>>     Min xfer                                        = 1035264.00 kB
>>>     CPU utilization: Wall time    2.969    CPU time    1.952    CPU 
>>> utilization  65.74 %
>>> 
>>> The regression appears to be 100% reproducible.
> 
> The commit 65f746e8285f ("iommu: Add quirk for Intel graphic devices in
> map_sg") is a temporary workaround. We have reverted it recently (5.11-
> rc3). Can you please try the a kernel version after -rc3?

I don't see a change in write results with v5.11-rc3, but read throughput
appears to improve a little.


        Children see throughput for 12 initial writers  = 3854295.72 kB/sec
        Parent sees throughput for 12 initial writers   = 3744064.85 kB/sec
        Min throughput per process                      =  313499.41 kB/sec 
        Max throughput per process                      =  328151.44 kB/sec
        Avg throughput per process                      =  321191.31 kB/sec
        Min xfer                                        = 1001728.00 kB
        CPU Utilization: Wall time    3.289    CPU time    2.075    CPU 
utilization  63.10 %


        Children see throughput for 12 rewriters        = 3692675.22 kB/sec
        Parent sees throughput for 12 rewriters         = 3688975.23 kB/sec
        Min throughput per process                      =  304863.84 kB/sec 
        Max throughput per process                      =  311000.16 kB/sec
        Avg throughput per process                      =  307722.93 kB/sec
        Min xfer                                        = 1028096.00 kB
        CPU utilization: Wall time    3.375    CPU time    2.051    CPU 
utilization  60.76 %


        Children see throughput for 12 readers          = 4521975.69 kB/sec
        Parent sees throughput for 12 readers           = 4516965.08 kB/sec
        Min throughput per process                      =  372762.16 kB/sec 
        Max throughput per process                      =  382233.84 kB/sec
        Avg throughput per process                      =  376831.31 kB/sec
        Min xfer                                        = 1022720.00 kB
        CPU utilization: Wall time    2.747    CPU time    1.961    CPU 
utilization  71.39 %


        Children see throughput for 12 re-readers       = 4684127.06 kB/sec
        Parent sees throughput for 12 re-readers        = 4678990.23 kB/sec
        Min throughput per process                      =  385586.34 kB/sec 
        Max throughput per process                      =  395542.47 kB/sec
        Avg throughput per process                      =  390343.92 kB/sec
        Min xfer                                        = 1022208.00 kB
        CPU utilization: Wall time    2.653    CPU time    1.941    CPU 
utilization  73.16 %



--
Chuck Lever



_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Reply via email to