Hi Robin, > -----Original Message----- > From: Robin Murphy [mailto:[email protected]] > Sent: Thursday, November 03, 2016 16:39 > To: Nipun Gupta <[email protected]>; [email protected]; linux-arm- > [email protected]; [email protected] > Cc: Stuart Yoder <[email protected]> > Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable > caching of bypass entries > > On 02/11/16 19:26, Nipun Gupta wrote: > > > > Hi Robin, > > > >> -----Original Message----- > >> From: Robin Murphy [mailto:[email protected]] > >> Sent: Wednesday, November 02, 2016 16:51 > >> To: Nipun Gupta <[email protected]>; [email protected]; linux-arm- > >> [email protected]; [email protected] > >> Cc: Stuart Yoder <[email protected]> > >> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to > enable > >> caching of bypass entries > >> > >> Hi Nipun, > >> > >> On 02/11/16 13:35, Nipun Gupta wrote: > >>> The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR) > >>> provides an option to enable the updation of TLB in case of bypass > >>> transactions due to no stream match in the stream match table. This > >>> reduces the latencies of the subsequent transactions with the same stream- > id > >> which bypasses the SMMU. > >>> This provides a significant performance benefit for certain networking > >>> workloads. > >> > >> ...at the cost of increased TLB contention against other workloads. > >> However, in the general case we'd expect the system to be fully described, > >> so > if > >> there aren't any unmatched Stream IDs there hopefully shouldn't be an > impact > >> to leaving this switched on. I'd be interested to see some actual > >> performance > >> numbers, though - you already know my opinion about unsubstantiated > quotes > >> from the MMU-500 TRM. > > > > With this change we have seen substantial performance improvement of ~9- > 10% > > with DPDK l3fwd application > (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html) > > on NXP's LS2088a platform (single core as well as multi-core). Also, with > > ODP > reflector application > > (loopback mode - NXP in-house) we have seen 5% improvement in > performance on > > LS1088 platform. > > > > W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from > avg. ~140 > > platform cycles per memory read transactions which follow this bypass path > (on LS2088 > > with DPDK l3fwd application). > > > > (Apologies, I cannot share the DPDK/ODP exact performance numbers on the > mailing list). > > That's understandable, and I'm not sure I'd know how to interpret them > anyway ;) I reckon the percentages make a sufficiently compelling > qualification of the improvement, so it would be good to have that > summarised in the commit log.
Sure, I'll add a part of it in the commit log. > > >> > >>> Signed-off-by: Nipun Gupta <[email protected]> > >>> --- > >>> drivers/iommu/arm-smmu.c | 21 +++++++++++++++------ > >>> 1 file changed, 15 insertions(+), 6 deletions(-) > >>> > >>> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index > >>> ce2a9d4..7010a5c 100644 > >>> --- a/drivers/iommu/arm-smmu.c > >>> +++ b/drivers/iommu/arm-smmu.c > >>> @@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg { > >>> > >>> #define ARM_MMU500_ACTLR_CPRE (1 << 1) > >>> > >>> +#define ACR_SMTNMB_TLBEN (1 << 8) > >> > >> ACR is entirely implementation-defined, so there are no generic field > >> names. > >> Please follow the naming convention handily demonstrated in the subsequent > >> context line. > >> > >>> #define ARM_MMU500_ACR_CACHE_LOCK (1 << 26) > >> > >> Actually, can we also please keep these in descending order of bit position > like > >> everything else? > >> > >>> #define CB_PAR_F (1 << 0) > >>> @@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct > >> arm_smmu_device *smmu) > >>> for (i = 0; i < smmu->num_mapping_groups; ++i) > >>> arm_smmu_write_sme(smmu, i); > >>> > >>> + /* Get the major rev required for configuring ACR */ > >> > >> That comment is nonsense. > >> > >>> + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7); > >>> + major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK; > >>> + > >>> /* > >>> * Before clearing ARM_MMU500_ACTLR_CPRE, need to > >>> * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK > >>> * bit is only present in MMU-500r2 onwards. > >>> */ > >>> - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7); > >>> - major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK; > >>> - if ((smmu->model == ARM_MMU500) && (major >= 2)) { > >>> - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR); > >>> + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR); > >>> + if ((smmu->model == ARM_MMU500) && (major >= 2)) > >>> reg &= ~ARM_MMU500_ACR_CACHE_LOCK; > >>> - writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR); > >>> - } > >>> + > >>> + /* > >>> + * Set the SMTNMB_TLBEN in ACR so that the transactions which > >>> + * bypass with SMMU due to no stream match found in the SMR table > >>> + * are updated in the TLB's. > >> > >> Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries > for > >> reduced latency". It's already clear from the code what bit's being set > >> where, > we > >> only need to remember *why*. > >> > >>> + */ > >>> + reg |= ACR_SMTNMB_TLBEN; > >>> + writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR); > >> > >> Are you sure it's perfectly fine to set that implementation-defined bit on > >> any > >> SMMU implementation other than the two-and-a-half ARM Ltd. ones which > >> happen to share the same meaning? I'm certainly not. > >> > >> The correct flow would be something like this: > >> > >> if (smmu->model == ARM_MMU500) { > >> reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7); > >> major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK; > >> reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR); > >> if (major >= 2) > >> reg &= ~ARM_MMU500_ACR_CACHE_LOCK; > >> reg |= ACR_SMTNMB_TLBEN; > >> writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR); > >> } > >> > >> The shape of the current code avoids an extra level of indentation, but > >> once > you > >> have to have the nested conditional anyway, it might as well all be > >> predicated > >> appropriately. > >> > > > > Thank you for providing the useful comments. I would incorporate them all in > next version :). > > Cool. Just for clarity (I realise I was thinking it, but never said it > outright), whilst MMU-40x do share the same feature with the same ACR > bit definition as MMU-500, I'd be inclined not to bother with them. > Since the monolithic microarchitecture means there's normally a separate > MMU-40x per device, if you don't want translation for that device you > can simply not probe the thing to turn it on in the first place. > This seems a pretty decent reason to have this bit set only for MMU-500. I'll send a patch v2 soon. Thanks, Nipun > Robin. > > > > > Regards, > > Nipun > > > >> Robin. > >> > >>> /* Make sure all context banks are disabled and clear CB_FSR */ > >>> for (i = 0; i < smmu->num_context_banks; ++i) { > >>> > > _______________________________________________ iommu mailing list [email protected] https://lists.linuxfoundation.org/mailman/listinfo/iommu
