On Fri, 29 May 2026 22:44:42 +0100 David Laight <[email protected]> wrote:
> On Fri, 29 May 2026 16:29:34 -0300 > Jason Gunthorpe <[email protected]> wrote: > > > On Fri, May 29, 2026 at 05:55:16PM +0100, David Laight wrote: ... > > I can't say, this is copied from the kernel and Will made it: > > > > arm64: io: Ensure calls to delay routines are ordered against prior > > readX() > > > > A relatively standard idiom for ensuring that a pair of MMIO writes to a > > device arrive at that device with a specified minimum delay between them > > is as follows: > > > > writel_relaxed(42, dev_base + CTL1); > > readl(dev_base + CTL1); > > udelay(10); > > writel_relaxed(42, dev_base + CTL2); > > > > the intention being that the read-back from the device will push the > > prior write to CTL1, and the udelay will hold up the write to CTL1 until > > at least 10us have elapsed. > > > > Unfortunately, on arm64 where the underlying delay loop is implemented > > as a read of the architected counter, the CPU does not guarantee > > ordering from the readl() to the delay loop and therefore the delay loop > > could in theory be speculated and not provide the desired interval > > between the two writes. > > > > Fix this in a similar manner to PowerPC by introducing a dummy control > > dependency on the output of readX() which, combined with the ISB in the > > read of the architected counter, guarantees that a subsequent delay loop > > can not be executed until the readX() has returned its result. > > Hmmm... > > Ok so there is some subtlety with the read of the counter that might > make it all work. > > It is better to make the delay loop have a data dependency on the result > of the readl(). > Something like: > u32 z = 0; > OPTIMIZER_HIDE_VAR(z); > writel_relaxed(42, dev_base + CTL1); > udelay(10 + (z & readl(dev_base + CTL1))); > writel_relaxed(42, dev_base + CTL2); > That avoids the potentially mispredicted branch and only adds instructions > when a delay follows. > That sequence is safe for all cpu and doesn't cost much for cpu (like x86) > where it (probably) isn't needed (maybe unless you patch the scale for udelay > into the code so there are no memory reads, just code). > > Probably best refactored as udelay_depends(10, readl(dev_base + CTL1)). > Or maybe udelay_after(). Sleeping on it, all the code can be put in udelay(). You just need a read memory barrier, followed by a memory read (of anywhere 'hot') and then use a data dependency (as above) from the second read into the delay loop. -- David
