Hi John,

On 2020-03-20 16:20, John Garry wrote:
>>>> I've run a bunch of netperf instances on multiple cores and
>>>> collected SMMU usage (on TaiShan 2280). I'm getting the following
>>>> ratio pretty consistently:
>>>>
>>>>    - 6.07% arm_smmu_iotlb_sync
>>>>       - 5.74% arm_smmu_tlb_inv_range
>>>>            5.09% arm_smmu_cmdq_issue_cmdlist
>>>>            0.28% __pi_memset
>>>>            0.08% __pi_memcpy
>>>>            0.08% arm_smmu_atc_inv_domain.constprop.37
>>>>            0.07% arm_smmu_cmdq_build_cmd
>>>>            0.01% arm_smmu_cmdq_batch_add
>>>>         0.31% __pi_memset
>>>>
>>>> So arm_smmu_atc_inv_domain() takes about 1.4% of
>>>> arm_smmu_iotlb_sync(), when ATS is not used. According to the
>>>> annotations, the load from the atomic_read(), which checks whether
>>>> the domain uses ATS, is 77% of the samples in
>>>> arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
>>>> there is much room for optimization there.
>>>
>>> Well I did originally suggest using RCU protection to scan the list
>>> of devices, instead of reading an atomic and checking for a non-zero
>>> value. But that would be an optimisation for ATS also, and there were
>>> no ATS devices at the time (to verify performance).
>>
>> Heh, I have yet to get my hands on one. Currently I can't evaluate
>> ATS performance, but I agree that using RCU to scan the list should
>> get better results when using ATS.
>>
>> When ATS isn't in use however, I suspect reading nr_ats_masters
>> should be more efficient than taking the RCU lock + reading an
>> "ats_devices" list (since the smmu_domain->devices list also serves
>> context descriptor invalidation, even when ATS isn't in use). I'll
>> run some tests however, to see if I can micro-optimize this case,
>> but I don't expect noticeable improvements.
>
> ok, cheers. I, too, would not expect a significant improvement there.
>
> JFYI, I've been playing with "perf annotate" today and it's giving
> strange results for my NVMe testing.
> So "report" looks somewhat sane, if not a worryingly high % for
> arm_smmu_cmdq_issue_cmdlist():
>
>     55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
>      9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>      2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
>      1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
>      1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
>      1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
>      1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw
>
> But "annotate" consistently tells me that a specific instruction
> consumes ~99% of the load for the enqueue function:
>
>          : /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
>          : if (sync) {
>     0.00 :   ffff80001071c948: ldr w0, [x29, #108]
>          : int ret = 0;
>     0.00 :   ffff80001071c94c: mov w24, #0x0 // #0
>          : if (sync) {
>     0.00 :   ffff80001071c950: cbnz w0, ffff80001071c990 <arm_smmu_cmdq_issue_cmdlist+0x420>
>          : arch_local_irq_restore():
>     0.00 :   ffff80001071c954: msr daif, x21
>          : arm_smmu_cmdq_issue_cmdlist():
>          :         }
>          :     }
>          :
>          :     local_irq_restore(flags);
>          :     return ret;
>          : }
>    99.51 :   ffff80001071c958: adrp x0, ffff800011909000 <page_wait_table+0x14c0>
This is likely the side effect of re-enabling interrupts (msr daif, x21) on the previous instruction, which causes the perf interrupt to fire right after.
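[Editor's aside: the mechanism as annotated pseudocode. Nothing below is
from the thread except the "msr daif, x21" instruction quoted above;
the comments describe the usual sampling-skid behaviour when the PMU
overflow interrupt is masked.]

```
local_irq_save(flags);      /* DAIF.I set: the PMU overflow interrupt
                             * can no longer be taken, only held pending */
/* ... entire hot path of arm_smmu_cmdq_issue_cmdlist() runs here;
 * any counter overflow in this window leaves the IRQ pending ...  */
local_irq_restore(flags);   /* "msr daif, x21": the pending PMU IRQ is
                             * taken at the next instruction boundary    */
/* => the instruction right after the restore (the adrp in the listing
 * above) gets charged with nearly all samples from the whole section */
```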
Time to enable pseudo-NMIs in the PMUv3 driver...
M.
--
Jazz is not dead. It just smells funny...
_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu
