(replying to my own mail from a different address to deal with the regular one being blacklisted somewhere, sorry for any duplicates)
On Wed, Oct 21, 2020 at 9:16 AM Arnd Bergmann <a...@arndb.de> wrote:
>
> On Wed, Oct 21, 2020 at 12:10 AM Benjamin Herrenschmidt
> <b...@kernel.crashing.org> wrote:
> > On Tue, 2020-10-20 at 21:49 +0200, Arnd Bergmann wrote:
> > > On Tue, Oct 20, 2020 at 11:37 AM Dylan Hung <dylan_h...@aspeedtech.com>
> > > wrote:
> > > > > +1 @first is system memory from dma_alloc_coherent(), right?
> > > > >
> > > > > You shouldn't have to do this. Is coherent DMA memory broken on your
> > > > > platform?
> > > >
> > > > It is about the arbitration on the DRAM controller. There are two
> > > > queues in the dram controller, one is for the CPU access and the other
> > > > is for the HW engines.
> > > > When CPU issues a store command, the dram controller just acknowledges
> > > > cpu's request and pushes the request into the queue. Then CPU triggers
> > > > the HW MAC engine, the HW engine starts to fetch the DMA memory.
> > > > But since the cpu's request may still stay in the queue, the HW engine
> > > > may fetch the wrong data.
> >
> > Actually, I take back what I said earlier, the above seems to imply
> > this is more generic.
> >
> > Dylan, please confirm, does this affect *all* DMA capable devices ? If
> > yes, then it's a really really bad design bug in your chips
> > unfortunately and the proper fix is indeed to make dma_wmb() do a dummy
> > read of some sort (what address though ? would any dummy non-cachable
> > page do ?) to force the data out as *all* drivers will potentially be
> > affected.
> >
> > I was under the impression that it was a specific timing issue in the
> > vhub and ethernet parts, but if it's more generic then it needs to be
> > fixed globally.
>
> We have CONFIG_ARM_HEAVY_MB for SoCs with similar problems,
> it turns mb() and wmb() into a platform specific function call, though it
> doesn't do that for dma_wmb() and smp_wmb(), which should not
> be affected if the problem is only missing serialization between DMA
> and CPU writes.
>
> > > If either of the two is the case, then the READ_ONCE() would just
> > > introduce a long delay before the iowrite32() that makes it more likely
> > > that the data is there, but the inconsistent state would still be
> > > observable
> > > by the device if it is still working on previous frames.
> >
> > I think it just get stuck until we try another packet, ie, it doesn't
> > see the new descriptor valid bit. But Dylan can elaborate.
>
> Ok, that would point to an insufficient barrier in iowrite32() as well,
> not in dma_wmb().
>
> At the moment, the only chips that need the heavy barrier are
> omap4 and mstar_v7, and early l2 cache controllers (not the one
> on Cortex-A7) have another synchronization callback that IIRC
> is used for streaming mappings.
>
> These are the two implementations of soc_mb() we have:
>
> /*
>  * This may need locking to deal with situations where an interrupt
>  * happens while we are in here and mb() gets called by the interrupt handler.
>  *
>  * The vendor code did have a spin lock but it doesn't seem to be needed and
>  * removing it hasn't caused any side effects so far.
>  *
>  * [writel|readl]_relaxed have to be used here because otherwise
>  * we'd end up right back in here.
>  */
> static void mstarv7_mb(void)
> {
>         /* toggle the flush miu pipe fire bit */
>         writel_relaxed(0, l3bridge + MSTARV7_L3BRIDGE_FLUSH);
>         writel_relaxed(MSTARV7_L3BRIDGE_FLUSH_TRIGGER, l3bridge
>                         + MSTARV7_L3BRIDGE_FLUSH);
>         while (!(readl_relaxed(l3bridge + MSTARV7_L3BRIDGE_STATUS)
>                         & MSTARV7_L3BRIDGE_STATUS_DONE)) {
>                 /* wait for flush to complete */
>         }
> }
> /*
>  * OMAP4 interconnect barrier which is called for each mb() and wmb().
>  * This is to ensure that normal paths to DRAM (normal memory, cacheable
>  * accesses) are properly synchronised with writes to DMA coherent memory
>  * (normal memory, uncacheable) and device writes.
>  *
>  * The mb() and wmb() barriers only operate only on the MPU->MA->EMIF
>  * path, as we need to ensure that data is visible to other system
>  * masters prior to writes to those system masters being seen.
>  *
>  * Note: the SRAM path is not synchronised via mb() and wmb().
>  */
> static void omap4_mb(void)
> {
>         if (dram_sync)
>                 writel_relaxed(0, dram_sync);
> }
>
> Obviously, adding one of these for ast2600 would slow down every
> mb() and writel() a lot, but if it is a chip-wide problem rather than
> one isolated to the network device, it would be the correct solution,
> provided that a correct code sequence can be found.
>
> Arnd