On Mon, 2012-05-21 at 13:31 +0300, Michael S. Tsirkin wrote:
> > IE. Just making the default accessors ordered means that all devices
> > written with the assumption that the guest will see accesses in the
> > order they are written in the emulated device will be correct, which
> > means pretty much all of them (well, almost).
> >
> > --> It actually fixes a real bug that affects almost all devices
> >     that do DMA today in qemu
>
> In theory fine but practical examples that affect x86?
> We might want to at least document some of them.
x86 I don't know; I suspect mostly none have actually been hit, but I
could be wrong, I'm not familiar enough with it. I have practical
examples that affect power though :-) And I'm reasonably confident
they'll affect ARM as soon as people start doing serious SMP on it,
etc... IE. the code is provably wrong without barriers.

> wmb and rmb are nops so there's no bug in practice.
> So the only actual rule which might be violated by qemu is that
> read flushes out writes.
> It's unlikely you will find real examples where this matters
> but I'm interested to hear otherwise.

wmb and rmb are not nops on powerpc and arm, to name a couple. mb is
more than just "read flushes writes" (besides, it's not a statement
about flushing, it's a statement about ordering; whether it has a
flushing side effect on x86 is a separate issue, it doesn't on power
for example).

Really flushing writes out matters very much in real life, in two very
different contexts that tend not to affect emulation in qemu as much.

One is flushing writes in the opposite direction (or rather, having the
read response queued up behind those writes), which is critical to
ensuring proper completion of DMA after an LSI, from a guest driver
perspective, on real HW typically.

The other classic case is flushing posted MMIO writes in order to
ensure that a subsequent delay is respected.

Most of those don't actually matter when doing emulation. Besides, a
barrier won't provide you the second guarantee; you need a nastier
construct, at least on some architectures like power.

However, we do need to ensure that reads and writes are properly
ordered vs. each other (regardless of any "flush" semantic) or things
could go very wrong on OO architectures (here too, my understanding of
x86 is limited).

> I also note that guests do use write-combining e.g. for vga.
> One wonders whether stronger barriers are needed
> because of that?
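To make the "provably wrong without barriers" point concrete, here is a
minimal sketch of the classic descriptor + buffer pattern, with a
hypothetical device and C11 fences standing in for qemu's wmb/rmb
(names and layout are illustrative, not qemu code): the emulated device
must make the payload visible before the ownership bit, or a guest
vCPU on a weakly ordered host can see a "done" descriptor pointing at
stale data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest-visible memory: a data buffer and a descriptor
 * status word whose OWN bit tells the guest driver the buffer is valid. */
#define DESC_OWN 0x80000000u

static uint8_t guest_buffer[64];
static _Atomic uint32_t guest_desc_status;

/* Emulated-device side: complete one receive descriptor. */
static void device_complete_rx(const uint8_t *data, size_t len)
{
    memcpy(guest_buffer, data, len);            /* 1: DMA the payload    */
    /* Without this release fence (wmb), a weakly ordered host (power,
     * arm) may let the guest observe DESC_OWN before the payload.      */
    atomic_thread_fence(memory_order_release);  /* 2: wmb                */
    atomic_store_explicit(&guest_desc_status,
                          DESC_OWN | (uint32_t)len,
                          memory_order_relaxed); /* 3: hand to the guest */
}

/* Guest-driver side: poll the descriptor, then read the payload. */
static int guest_poll_rx(uint8_t *out, size_t max)
{
    uint32_t st = atomic_load_explicit(&guest_desc_status,
                                       memory_order_relaxed);
    if (!(st & DESC_OWN))
        return -1;
    atomic_thread_fence(memory_order_acquire);  /* matching rmb */
    size_t len = st & ~DESC_OWN;
    if (len > max)
        len = max;
    memcpy(out, guest_buffer, len);
    return (int)len;
}
```

On x86 the two fences compile to nothing but a compiler barrier, which
is exactly why the bug stays latent there while being real on power and
arm.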
What I'm trying to address here is really to ensure that loads and
stores issued by qemu emulated devices "appear" in the right order with
respect to guest driver code running simultaneously on different CPUs
(both P and V in this context).

I've somewhat purposefully, for now, left alone the problem of ordering
"across" transfer directions, which in the case of PCI (and most "sane"
busses) means a read flushing writes in the other direction. (Here too
it's not really a flush, it's just an ordering statement between the
write and the read response.)

If you want to look at that other problem more closely, it breaks down
into two parts:

 - Guest writes vs. qemu reads. My gut feeling is that this should take
   care of itself for the most part, though you might want to bracket
   PIO operations with barriers for extra safety. But we might want to
   dig more.

 - qemu writes vs. guest reads. Here too, this only matters as long as
   the guest can observe the cat inside the box, i.e. the guest gets to
   "observe" the writes vs. some other event that might need to be
   synchronized with the read, such as an interrupt. Here my gut
   feeling is that we might be safer by having a barrier before any
   interrupt gets shot to the guest (and a syscall/ioctl is not a
   memory barrier, though at least with kvm on power a guest entry is,
   so we should be fine), but here too, it might be a good idea to dig
   more.

However, both of these can (and probably should) be treated separately
and are rather unlikely to cause problems in practice. In most cases
these have to do with flushing non-coherent DMA buffers in intermediary
bridges, and while those could be considered as akin to a store queue
that hasn't yet reached coherency on a core, in practice I think we are
fine.

So let's go back to the problem at hand, which is to make sure that
loads and stores done by emulated devices get observed by the guest
code in the right order.
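The "barrier before any interrupt gets shot to the guest" case can be
sketched like this (hypothetical names, C11 fences standing in for
mb(); this is not the actual qemu interrupt path): the device's
completion status must be globally visible before the IRQ that tells
the guest to go read it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical completion path: a status word in guest memory plus an
 * interrupt line. The full fence ensures the guest's interrupt
 * handler, which reads completion_status, cannot see a stale value. */
static uint32_t completion_status;
static _Atomic int irq_line;

static void device_signal_completion(uint32_t status)
{
    completion_status = status;                 /* result of the DMA      */
    atomic_thread_fence(memory_order_seq_cst);  /* full mb before the IRQ */
    atomic_store_explicit(&irq_line, 1, memory_order_relaxed);
}

/* Guest side: on interrupt, fence then read the status. */
static uint32_t guest_irq_handler(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    return completion_status;
}
```

As noted above, on kvm/power a guest entry already acts as a full
barrier, so the fence before the interrupt injection may be redundant
there; the sketch only illustrates where the ordering requirement sits.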
My preference is to do it in the dma_* accessors (possibly providing
__relaxed versions), which means "low level" stuff using cpu_*
directly, such as virtio, doesn't pay the price (or rather is
responsible for doing its own barriers, which is good since typically
that's our real perf-sensitive stuff).

Anthony earlier preferred going all the way down into cpu_physical_*,
which is what my latest round of patches did. But I can understand that
this makes people uncomfortable.

In any case, we have yet to get some measurements of the actual cost of
those barriers on x86. I can try to give it a go on my laptop next
week, but I'm not an x86 expert and don't have that much different x86
HW around (such as large SMP machines where the cost might differ, or
different generations of x86 processors, etc...).

> > Then, fine-tuning performance critical ones by selectively removing
> > barriers allows to improve performance where it would be otherwise
> > harmed.
>
> So that breaks attempts to bisect performance regressions.
> Not good.

Disagreed. In all cases safety trumps performance. Almost all our
devices were written without any thought given to ordering, so they
basically can and should be considered as all broken. Since thinking
about ordering is something that, by experience, very few programmers
can do and get right, the default should absolutely be fully ordered.

Performance regressions aren't a big deal to bisect in that case: if
there's a regression for a given driver and it points to this specific
patch adding the barriers, then we know precisely where the regression
comes from, and we can get some insight into how this specific driver
can be improved to use more relaxed accessors. I don't see the problem.
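The ordered-by-default accessor with a relaxed opt-out might look
roughly like this (names, signatures, and the flat guest_ram array are
all illustrative, not the actual qemu dma_* API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest RAM backing store. */
static uint8_t guest_ram[4096];

/* Relaxed variant: no ordering implied; the caller (e.g. virtio, or a
 * driver audited for its access patterns) does its own barriers. */
static void dma_memory_write_relaxed(uint64_t addr, const void *buf,
                                     size_t len)
{
    memcpy(&guest_ram[addr], buf, len);
}

/* Default variant: fully ordered, so unaudited devices are safe.
 * The fence orders this store vs. all preceding device accesses. */
static void dma_memory_write(uint64_t addr, const void *buf, size_t len)
{
    atomic_thread_fence(memory_order_seq_cst);
    dma_memory_write_relaxed(addr, buf, len);
}
```

The point of the split is exactly the bisection argument above: every
device gets the safe default for free, and a relaxed call site is an
explicit, greppable statement that someone thought about the ordering.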
One thing that might be worth looking at is whether mb() is indeed so
much more costly than just wmb/rmb, in which case we could have some
smarts in the accessors to make them skip the full mb based on
knowledge of the previous access direction, though here too I would be
tempted to only do that if absolutely necessary (i.e. if we can't
instead just fix the sensitive driver to use explicitly relaxed
accessors).

> > So on that I will not compromise.
> >
> > However, I think it might be better to leave the barrier in the dma
> > accessor since that's how you also get iommu transparency etc... so
> > it's not a bad place to put them, and leave the cpu_physical_* for
> > use by lower level device drivers which are thus responsible also
> > for dealing with ordering if they have to.
> >
> > Cheers,
> > Ben.
>
> You claim to understand what matters for all devices, I doubt that.

It's pretty obvious that anything that does DMA using a classic
descriptor + buffers structure is broken without appropriate ordering.
And yes, I claim to have a fairly good idea of the problem, but I don't
think throwing credentials around is going to be helpful.

> Why don't we add safe APIs, then go over devices and switch over?
> I counted 97 pci_dma_ accesses.
> 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
>
> Let maintainers make a decision where does speed matter.

No. Let's fix the general bug first. Then let people who know the
individual drivers intimately and understand their access patterns make
the call as to when things can/should be relaxed.

Ben.
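For the record, the "skip the full mb based on the previous access
direction" idea mentioned above could be sketched as follows (purely
illustrative names; C11 fences standing in for mb/wmb/rmb): same-
direction back-to-back accesses only need the cheaper one-way fence,
and the full barrier is reserved for direction changes.

```c
#include <assert.h>
#include <stdatomic.h>

enum dma_dir { DMA_DIR_NONE, DMA_DIR_READ, DMA_DIR_WRITE };
static enum dma_dir last_dir = DMA_DIR_NONE;

static void dma_barrier(enum dma_dir dir)
{
    if (dir == last_dir) {
        /* write-after-write needs only a wmb (release), and
         * read-after-read only an rmb (acquire) */
        atomic_thread_fence(dir == DMA_DIR_WRITE ? memory_order_release
                                                 : memory_order_acquire);
    } else {
        /* direction changed (or first access): full barrier */
        atomic_thread_fence(memory_order_seq_cst);
        last_dir = dir;
    }
}
```

Note that tracking last_dir per device (and making it safe vs.
concurrent accessors) is exactly the kind of extra complexity that
argues for only doing this if measurements show the full mb really
hurts.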