On Tue, Jan 27, 2026 at 01:13:04AM +0200, Konstantin Belousov wrote:
> On Mon, Jan 26, 2026 at 09:30:58PM +0100, Marius Strobl wrote:
> > On Mon, Jan 26, 2026 at 06:34:49PM +0200, Konstantin Belousov wrote:
> > > On Mon, Jan 26, 2026 at 03:57:45PM +0000, Marius Strobl wrote:
> > > > The branch main has been updated by marius:
> > > > 
> > > > URL: 
> > > > https://cgit.FreeBSD.org/src/commit/?id=e769bc77184312b6137a9b180c97b87c0760b849
> > > > 
> > > > commit e769bc77184312b6137a9b180c97b87c0760b849
> > > > Author:     Marius Strobl <[email protected]>
> > > > AuthorDate: 2026-01-26 13:58:57 +0000
> > > > Commit:     Marius Strobl <[email protected]>
> > > > CommitDate: 2026-01-26 15:54:48 +0000
> > > > 
> > > >     sym(4): Employ memory barriers also on x86
> > > >     
> > > >     In an MP world, it doesn't hold that x86 requires no memory 
> > > > barriers.
> > > It does hold.  x86 is much more strongly ordered than all other arches
> > > we currently support.
> > 
> > If it does hold, then why is atomic_thread_fence_seq_cst() employing
> > a StoreLoad barrier even on amd64?
> > I agree that x86 is more strongly ordered than the other supported
> > architectures, though.
> Well, it depends on the purpose.
> 
> Can you please explain what is the purpose of this specific barrier, and
> where is the reciprocal barrier for it?
> 
> Often drivers for advanced devices do need fences.  For instance, from
> my experience with the Mellanox networking cards, there are some structures
> that are located in regular cacheable memory.  The readiness of the structure
> for the card is indicated by a write to some location.  If this location is
> BAR, then at least on x86 we do not need any barriers. But if it is also
> in the regular memory, the visibility of writes to the structure before
> the write to a signalling variable must be enforced.
> 
> This is done normally by atomic_thread_fence_rel(), which on x86 becomes
> just compiler barrier, since the ordering is guaranteed by CPU (but not
> compiler).
> 
> In this situation, using rmb() (which is fence) really degrades
> the performance on high rates.
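
For reference, I read that as roughly the following pattern; the names
below are made up for illustration and this is not taken from the mlx
code:

	/* Fill a descriptor living in regular, cacheable memory. */
	desc->addr = htole64(paddr);
	desc->len = htole32(len);
	/*
	 * Order the descriptor stores before the signalling store; on
	 * x86 this is just a compiler barrier, elsewhere a real fence.
	 */
	atomic_thread_fence_rel();
	/* Signal readiness; here also just a store to host memory. */
	ring->producer_index = next;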

The problem at hand is that reads from different memory locations
(neither of them in BAR space) apparently get reordered after the chip
has been kicked. As a result, the data read doesn't match its flag.
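
In (hypothetical, simplified) driver terms, the failing sequence is
roughly the following; the identifiers are made up, not the actual
sym(4) code:

	/* Kick the chip via a register in BAR space. */
	bus_space_write_1(sc->io_tag, sc->io_bsh, ISTAT_REG, SIGP_BIT);
	...
	/*
	 * Later, look at the completion flag the chip DMAed into host
	 * memory.
	 */
	if (cp->host_status == HS_COMPLETE)
		/*
		 * Without a read barrier, this data read may effectively
		 * be issued before the flag read above and return stale
		 * contents.
		 */
		process(cp->scsi_status);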

Several factors contribute to this scenario. First off, this hardware
doesn't have shiny doorbell registers but is a rather convoluted design
dating back to the early days of PCI, using a heavy mix of registers in
BAR space and DMAed control data, with addresses being patched into
programs which are either transferred to the controller RAM by the
driver or may reside in host memory, etc. On top of that, things don't
work equally across all supported chips as only the newer ones provide
load/store instructions, for example.
As such, the operations of these chips might very well escape the bus
snooping of more modern machines and the optimizations therein. There
are PCI bridges which themselves only synchronize DMA on interrupts,
for example.

For drivers, we generally would want to express DMA synchronization
and bus space access ordering needs in terms of bus_dmamap_sync(9) and
bus_space_barrier(9), as there may be buffers, caches, IOMMUs etc.
involved on or in the bus. Those buffers, caches etc. are not taken
into account by atomic_thread_fence_*(9). Apparently, this is also
relevant for x86, as otherwise BUS_SPACE_BARRIER_READ would be at most
a compiler barrier there.
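
A minimal sketch of what I mean, using hypothetical tag, map and field
names:

	/*
	 * Make the chip's DMA writes visible to the CPU before looking
	 * at the completion flag and the data it guards.
	 */
	bus_dmamap_sync(sc->ccb_dmat, sc->ccb_map, BUS_DMASYNC_POSTREAD);
	if (cp->host_status == HS_COMPLETE)
		process(cp->scsi_status);
	...
	/* And before handing a control structure back to the chip: */
	bus_dmamap_sync(sc->ccb_dmat, sc->ccb_map, BUS_DMASYNC_PREWRITE);
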
With things like index or bank switching registers, bus space barriers
also generally don't come in pairs, i.e. there isn't necessarily a
reciprocal one; see e.g. the example in the BARRIERS section of
bus_space(9).
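
Something along the following lines, i.e. paraphrased from memory with
hypothetical register names, not verbatim from that manpage:

	/* Select the register bank/window ... */
	bus_space_write_4(t, h, INDEX_REG, idx);
	/* ... ensure the selection reaches the device first ... */
	bus_space_barrier(t, h, INDEX_REG, 8, BUS_SPACE_BARRIER_WRITE);
	/*
	 * ... then access the register behind it; there is no read
	 * barrier paired with this write barrier.
	 */
	val = bus_space_read_4(t, h, DATA_REG);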

Due to the mess with these chips, and depending on the architecture,
barriers for both DMAed memory and bus space might actually be
required. That's why, already before this change, powerpc used both
sync and eieio, and why the comment above the macros was already
talking about using I/O barriers, too.
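
Conceptually, the per-architecture definitions thus end up looking
something like the following sketch; these are not the actual macros
used in the driver:

	#if defined(__powerpc__)
	/* Orders both cacheable (DMAed) memory and I/O accesses. */
	#define	MEMORY_READ_BARRIER()	\
		__asm __volatile("eieio; sync" : : : "memory")
	#elif defined(__amd64__) || defined(__i386__)
	/* Now a real fence instead of a mere compiler barrier. */
	#define	MEMORY_READ_BARRIER()	rmb()
	#endif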

Actually, I would have expected this hardware to have aged out by now.
However, it apparently still is a thing with OpenStack, so it makes
sense to keep this driver working, with performance not being of great
concern.
I can change the driver back to duplicating *mb(9) as it was before
this change, or back this change out completely if you absolutely
dislike it. I won't waste time working on an approach different from
what the Linux version does, though, especially not given that the
Linux version presumably gets considerably more exposure and testing.

Marius

