On Fri, 29 May 2026 22:44:42 +0100
David Laight <[email protected]> wrote:

> On Fri, 29 May 2026 16:29:34 -0300
> Jason Gunthorpe <[email protected]> wrote:
> 
> > On Fri, May 29, 2026 at 05:55:16PM +0100, David Laight wrote:  
...
> > I can't say, this is copied from the kernel and Will made it:
> > 
> >     arm64: io: Ensure calls to delay routines are ordered against prior 
> > readX()
> >     
> >     A relatively standard idiom for ensuring that a pair of MMIO writes to a
> >     device arrive at that device with a specified minimum delay between them
> >     is as follows:
> >     
> >             writel_relaxed(42, dev_base + CTL1);
> >             readl(dev_base + CTL1);
> >             udelay(10);
> >             writel_relaxed(42, dev_base + CTL2);
> >     
> >     the intention being that the read-back from the device will push the
> >     prior write to CTL1, and the udelay will hold up the write to CTL1 until
> >     at least 10us have elapsed.
> >     
> >     Unfortunately, on arm64 where the underlying delay loop is implemented
> >     as a read of the architected counter, the CPU does not guarantee
> >     ordering from the readl() to the delay loop and therefore the delay loop
> >     could in theory be speculated and not provide the desired interval
> >     between the two writes.
> >     
> >     Fix this in a similar manner to PowerPC by introducing a dummy control
> >     dependency on the output of readX() which, combined with the ISB in the
> >     read of the architected counter, guarantees that a subsequent delay loop
> >     can not be executed until the readX() has returned its result.  
> 
> Hmmm...
> 
> Ok so there is some subtlety with the read of the counter that might
> make it all work.
> 
> It is better to make the delay loop have a data dependency on the result
> of the readl().
> Something like:
>       u32 z = 0;
>       OPTIMIZER_HIDE_VAR(z);
>       writel_relaxed(42, dev_base + CTL1);
>       udelay(10 + (z & readl(dev_base + CTL1)));
>       writel_relaxed(42, dev_base + CTL2);
> That avoids the potentially mispredicted branch and only adds instructions
> when a delay follows.
> That sequence is safe for all cpu and doesn't cost much for cpu (like x86)
> where it (probably) isn't needed (maybe unless you patch the scale for udelay
> into the code so there are no memory reads, just code).
> 
> Probably best refactored as udelay_depends(10, readl(dev_base + CTL1)).
> Or maybe udelay_after().

Sleeping on it, all the code can be put in udelay().
You just need a read memory barrier, followed by a memory read (of anywhere
'hot') and then use a data dependency (as above) from the second read
into the delay loop.

-- David


Reply via email to