Hi Robin,

> -----Original Message-----
> From: Robin Murphy [mailto:[email protected]]
> Sent: Thursday, November 03, 2016 16:39
> To: Nipun Gupta <[email protected]>; [email protected]; linux-arm-
> [email protected]; [email protected]
> Cc: Stuart Yoder <[email protected]>
> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
> caching of bypass entries
> 
> On 02/11/16 19:26, Nipun Gupta wrote:
> >
> > Hi Robin,
> >
> >> -----Original Message-----
> >> From: Robin Murphy [mailto:[email protected]]
> >> Sent: Wednesday, November 02, 2016 16:51
> >> To: Nipun Gupta <[email protected]>; [email protected]; linux-arm-
> >> [email protected]; [email protected]
> >> Cc: Stuart Yoder <[email protected]>
> >> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to
> enable
> >> caching of bypass entries
> >>
> >> Hi Nipun,
> >>
> >> On 02/11/16 13:35, Nipun Gupta wrote:
> >>> The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
> >>> provides an option to enable the updation of TLB in case of bypass
> >>> transactions due to no stream match in the stream match table. This
> >>> reduces the latencies of the subsequent transactions with the same stream-
> id
> >> which bypasses the SMMU.
> >>> This provides a significant performance benefit for certain networking
> >>> workloads.
> >>
> >> ...at the cost of increased TLB contention against other workloads.
> >> However, in the general case we'd expect the system to be fully described, 
> >> so
> if
> >> there aren't any unmatched Stream IDs there hopefully shouldn't be an
> impact
> >> to leaving this switched on. I'd be interested to see some actual 
> >> performance
> >> numbers, though - you already know my opinion about unsubstantiated
> quotes
> >> from the MMU-500 TRM.
> >
> > With this change we have seen substantial performance improvement of ~9-
> 10%
> > with DPDK l3fwd application
> (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
> > on NXP's LS2088a platform (single core as well as multi-core). Also, with 
> > ODP
> reflector application
> > (loopback mode - NXP in-house) we have seen 5% improvement in
> performance on
> > LS1088 platform.
> >
> > W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from
> avg. ~140
> > platform cycles per memory read transactions which follow this bypass path
> (on LS2088
> > with DPDK l3fwd application).
> >
> > (Apologies, I cannot share the DPDK/ODP exact performance numbers on the
> mailing list).
> 
> That's understandable, and I'm not sure I'd know how to interpret them
> anyway ;) I reckon the percentages make a sufficiently compelling
> qualification of the improvement, so it would be good to have that
> summarised in the commit log.

Sure, I'll add a part of it in the commit log.

> 
> >>
> >>> Signed-off-by: Nipun Gupta <[email protected]>
> >>> ---
> >>>  drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
> >>>  1 file changed, 15 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
> >>> ce2a9d4..7010a5c 100644
> >>> --- a/drivers/iommu/arm-smmu.c
> >>> +++ b/drivers/iommu/arm-smmu.c
> >>> @@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
> >>>
> >>>  #define ARM_MMU500_ACTLR_CPRE            (1 << 1)
> >>>
> >>> +#define ACR_SMTNMB_TLBEN         (1 << 8)
> >>
> >> ACR is entirely implementation-defined, so there are no generic field 
> >> names.
> >> Please follow the naming convention handily demonstrated in the subsequent
> >> context line.
> >>
> >>>  #define ARM_MMU500_ACR_CACHE_LOCK        (1 << 26)
> >>
> >> Actually, can we also please keep these in descending order of bit position
> like
> >> everything else?
> >>
> >>>  #define CB_PAR_F                 (1 << 0)
> >>> @@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
> >> arm_smmu_device *smmu)
> >>>   for (i = 0; i < smmu->num_mapping_groups; ++i)
> >>>           arm_smmu_write_sme(smmu, i);
> >>>
> >>> + /* Get the major rev required for configuring ACR */
> >>
> >> That comment is nonsense.
> >>
> >>> + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> >>> + major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> >>> +
> >>>   /*
> >>>    * Before clearing ARM_MMU500_ACTLR_CPRE, need to
> >>>    * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
> >>>    * bit is only present in MMU-500r2 onwards.
> >>>    */
> >>> - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> >>> - major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> >>> - if ((smmu->model == ARM_MMU500) && (major >= 2)) {
> >>> -         reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> >>> + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> >>> + if ((smmu->model == ARM_MMU500) && (major >= 2))
> >>>           reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> >>> -         writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> >>> - }
> >>> +
> >>> + /*
> >>> +  * Set the SMTNMB_TLBEN in ACR so that the transactions which
> >>> +  * bypass with SMMU due to no stream match found in the SMR table
> >>> +  * are updated in the TLB's.
> >>
> >> Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries
> for
> >> reduced latency". It's already clear from the code what bit's being set 
> >> where,
> we
> >> only need to remember *why*.
> >>
> >>> +  */
> >>> + reg |= ACR_SMTNMB_TLBEN;
> >>> + writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> >>
> >> Are you sure it's perfectly fine to set that implementation-defined bit on 
> >> any
> >> SMMU implementation other than the two-and-a-half ARM Ltd. ones which
> >> happen to share the same meaning? I'm certainly not.
> >>
> >> The correct flow would be something like this:
> >>
> >>    if (smmu->model == ARM_MMU500) {
> >>            reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> >>            major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> >>            reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> >>            if (major >= 2)
> >>                    reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> >>            reg |= ACR_SMTNMB_TLBEN;
> >>            writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> >>    }
> >>
> >> The shape of the current code avoids an extra level of indentation, but 
> >> once
> you
> >> have to have the nested conditional anyway, it might as well all be 
> >> predicated
> >> appropriately.
> >>
> >
> > Thank you for providing the useful comments. I would incorporate them all in
> next version :).
> 
> Cool. Just for clarity (I realise I was thinking it, but never said it
> outright), whilst MMU-40x do share the same feature with the same ACR
> bit definition as MMU-500, I'd be inclined not to bother with them.
> Since the monolithic microarchitecture means there's normally a separate
> MMU-40x per device, if you don't want translation for that device you
> can simply not probe the thing to turn it on in the first place.
> 

This seems a pretty decent reason to have this bit set only for MMU-500.
I'll send a patch v2 soon.

Thanks,
Nipun

> Robin.
> 
> >
> > Regards,
> > Nipun
> >
> >> Robin.
> >>
> >>>   /* Make sure all context banks are disabled and clear CB_FSR  */
> >>>   for (i = 0; i < smmu->num_context_banks; ++i) {
> >>>
> >

_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Reply via email to