On Wed, 2018-03-28 at 10:07 +0100, Will Deacon wrote: > > For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to > add an mb() to make it work. > > Do both of these work on power?
Yes. There's even another quirk, see further down ;-) > If so, I guess I can make readl even more > expensive :/ Feels a bit like the tail wagging the dog, though. Maybe, but then readl is always horribly slow anyway so you may not necessarily be losing that much. > Another thing I just realised is that we restrict the barriers we use in > readl/writel on arm64 so that they don't necessary apply to both loads and > stores. To be specific: > > writel is ordered against prior writes to memory, but not reads That could be tricky... You may end up with something that reads before triggering a DMA and ends up with the post-DMA value ... ugh. > readl is ordered against subsequent reads of memory, but not writes (but > note that in example (1) above, the control dependency ensures that). > > If necessary, I could move the barrier in our readl implementation to be > before the read, then play the control-dependency + instruction-sync (ISB) > trick that you do on power. Yeah so that other trick I'm talking about is also used for timing accuracy. For example, let's say I have a device with a reset bit and the spec says the reset bit needs to be set for at least 10us. This is wrong: writel(1, RESET_REG); usleep(10); writel(0, RESET_REG); Because of write posting, the first write might arrive to the device right before the second one. The typical "fix" is to turn that into: writel(1, RESET_REG); readl(RESET_REG); /* Flush posted writes */ usleep(10); writel(0, RESET_REG); *However* the issue here, at least on power, is that the CPU can issue that readl but doesn't necessarily wait for it to complete (ie, the data to return), before proceeding to the usleep. Now a usleep contains a bunch of loads and stores and is probably fine, but a udelay which just loops on the timebase may not be. Thus we may still violate the timing requirement. What we did inside readl, with the twi;isync sequence (which basically means, trap on return value with "trap never" as a condition, followed by isync that ensures all excpetion conditions are resolved), is force the CPU to "consume" the data from the read before moving on. This effectively makes readl fully synchronous (we would probably avoid that if we were to implement a readl_relaxed). Cheers, Ben.