On Mon, 2012-05-21 at 13:31 +0300, Michael S. Tsirkin wrote:
> > IE. Just making the default accessors ordered means that all devices
> > written with the assumption that the guest will see accesses in the
> > order they are written in the emulated device will be correct, which
> > means pretty much all of them (well, almost).
> >
> > --> It actually fixes a real bug that affects almost all devices
> >     that do DMA today in qemu
>
> In theory fine but practical examples that affect x86?
> We might want to at least document some of them.
x86 I don't know; I suspect mostly none have actually been hit, but I
could be wrong, I'm not familiar enough with it. I have practical
examples that affect power though :-) And I'm reasonably confident
they'll affect ARM as soon as people start doing serious SMP on it,
etc... IE. the code is provably wrong without barriers.

> wmb and rmb are nops so there's no bug in practice.
> So the only actual rule which might be violated by qemu is that
> read flushes out writes.
> It's unlikely you will find real examples where this matters
> but I'm interested to hear otherwise.

wmb and rmb are not nops on powerpc and arm, to name a couple. mb is
more than just "read flushes writes" (besides, it's not a statement
about flushing, it's a statement about ordering; whether it has a
flushing side effect on x86 is a separate issue, it doesn't on power
for example).

Really flushing writes out matters very much in real life, in two very
different contexts that tend not to affect emulation in qemu as much.

One is flushing writes in the opposite direction (or rather, having the
read response queued up behind those writes), which is critical to
ensuring proper completion of DMA after an LSI, from a guest driver
perspective, on real HW typically.

The other classic case is flushing posted MMIO writes in order to
ensure that a subsequent delay is respected.

Most of those don't actually matter when doing emulation. Besides, a
barrier won't provide you the second guarantee; you need a nastier
construct, at least on some architectures like power.

However, we do need to ensure that reads and writes are properly
ordered vs. each other (regardless of any "flush" semantic) or things
could go very wrong on OO architectures (here too, my understanding of
x86 is limited).

> I also note that guests do use write-combining e.g. for vga.
> One wonders whether stronger barriers are needed
> because of that?
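To make the "provably wrong without barriers" point concrete, here is a
minimal sketch of the classic descriptor + buffer pattern, with a
hypothetical device and C11 fences standing in for qemu's wmb/rmb
(names and layout are illustrative, not qemu code): the emulated device
must make the payload visible before the ownership bit, or a guest
vCPU on a weakly ordered host can see a "done" descriptor pointing at
stale data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest-visible memory: a data buffer and a descriptor
 * status word whose OWN bit tells the guest driver the buffer is valid. */
#define DESC_OWN 0x80000000u

static uint8_t guest_buffer[64];
static _Atomic uint32_t guest_desc_status;

/* Emulated-device side: complete one receive descriptor. */
static void device_complete_rx(const uint8_t *data, size_t len)
{
    memcpy(guest_buffer, data, len);            /* 1: DMA the payload    */
    /* Without this release fence (wmb), a weakly ordered host (power,
     * arm) may let the guest observe DESC_OWN before the payload.      */
    atomic_thread_fence(memory_order_release);  /* 2: wmb                */
    atomic_store_explicit(&guest_desc_status,
                          DESC_OWN | (uint32_t)len,
                          memory_order_relaxed); /* 3: hand to the guest */
}

/* Guest-driver side: poll the descriptor, then read the payload. */
static int guest_poll_rx(uint8_t *out, size_t max)
{
    uint32_t st = atomic_load_explicit(&guest_desc_status,
                                       memory_order_relaxed);
    if (!(st & DESC_OWN))
        return -1;
    atomic_thread_fence(memory_order_acquire);  /* matching rmb */
    size_t len = st & ~DESC_OWN;
    if (len > max)
        len = max;
    memcpy(out, guest_buffer, len);
    return (int)len;
}
```

On x86 the two fences compile to nothing but a compiler barrier, which
is exactly why the bug stays latent there while being real on power and
arm.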
What I'm trying to address here is really to ensure that loads and
stores issued by qemu emulated devices "appear" in the right order with
respect to guest driver code running simultaneously on different CPUs
(both P and V in this context).

I've somewhat purposefully, for now, left alone the problem of ordering
"across" transfer directions, which in the case of PCI (and most "sane"
busses) means a read flushing writes in the other direction. (Here too
it's not really a flush, it's just an ordering statement between the
write and the read response.)

If you want to look at that other problem more closely, it breaks down
into two parts:

 - Guest writes vs. qemu reads. My gut feeling is that this should take
   care of itself for the most part, though you might want to bracket
   PIO operations with barriers for extra safety. But we might want to
   dig more.

 - qemu writes vs. guest reads. Here too, this only matters as long as
   the guest can observe the cat inside the box, i.e. the guest gets to
   "observe" the writes vs. some other event that might need to be
   synchronized with the read, such as an interrupt. Here my gut
   feeling is that we might be safer by having a barrier before any
   interrupt gets shot to the guest (and a syscall/ioctl is not a
   memory barrier, though at least with kvm on power a guest entry is,
   so we should be fine), but here too, it might be a good idea to dig
   more.

However, both of these can (and probably should) be treated separately
and are rather unlikely to cause problems in practice. In most cases
these have to do with flushing non-coherent DMA buffers in intermediary
bridges, and while those could be considered as akin to a store queue
that hasn't yet reached coherency on a core, in practice I think we are
fine.

So let's go back to the problem at hand, which is to make sure that
loads and stores done by emulated devices get observed by the guest
code in the right order.
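The "barrier before any interrupt gets shot to the guest" case can be
sketched like this (hypothetical names, C11 fences standing in for
mb(); this is not the actual qemu interrupt path): the device's
completion status must be globally visible before the IRQ that tells
the guest to go read it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical completion path: a status word in guest memory plus an
 * interrupt line. The full fence ensures the guest's interrupt
 * handler, which reads completion_status, cannot see a stale value. */
static uint32_t completion_status;
static _Atomic int irq_line;

static void device_signal_completion(uint32_t status)
{
    completion_status = status;                 /* result of the DMA      */
    atomic_thread_fence(memory_order_seq_cst);  /* full mb before the IRQ */
    atomic_store_explicit(&irq_line, 1, memory_order_relaxed);
}

/* Guest side: on interrupt, fence then read the status. */
static uint32_t guest_irq_handler(void)
{
    atomic_thread_fence(memory_order_seq_cst);
    return completion_status;
}
```

As noted above, on kvm/power a guest entry already acts as a full
barrier, so the fence before the interrupt injection may be redundant
there; the sketch only illustrates where the ordering requirement sits.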
My preference is to do it in the dma_* accessors (possibly providing
__relaxed versions), which means "low level" stuff using cpu_*
directly, such as virtio, doesn't pay the price (or rather is
responsible for doing its own barriers, which is good since typically
that's our real perf-sensitive stuff).

Anthony earlier preferred going all the way down into cpu_physical_*,
which is what my latest round of patches did. But I can understand that
this makes people uncomfortable.

In any case, we have yet to get some measurements of the actual cost of
those barriers on x86. I can try to give it a go on my laptop next
week, but I'm not an x86 expert and don't have that much different x86
HW around (such as large SMP machines where the cost might differ, or
different generations of x86 processors, etc...).

> > Then, fine-tuning performance critical ones by selectively removing
> > barriers allows to improve performance where it would be otherwise
> > harmed.
>
> So that breaks attempts to bisect performance regressions.
> Not good.

Disagreed. In all cases safety trumps performance. Almost all our
devices were written without any thought given to ordering, so they
basically can and should be considered as all broken. Since thinking
about ordering is something that, by experience, very few programmers
can do and get right, the default should absolutely be fully ordered.

Performance regressions aren't a big deal to bisect in that case: if
there's a regression for a given driver and it points to this specific
patch adding the barriers, then we know precisely where the regression
comes from, and we can get some insight into how this specific driver
can be improved to use more relaxed accessors. I don't see the problem.
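The ordered-by-default accessor with a relaxed opt-out might look
roughly like this (names, signatures, and the flat guest_ram array are
all illustrative, not the actual qemu dma_* API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest RAM backing store. */
static uint8_t guest_ram[4096];

/* Relaxed variant: no ordering implied; the caller (e.g. virtio, or a
 * driver audited for its access patterns) does its own barriers. */
static void dma_memory_write_relaxed(uint64_t addr, const void *buf,
                                     size_t len)
{
    memcpy(&guest_ram[addr], buf, len);
}

/* Default variant: fully ordered, so unaudited devices are safe.
 * The fence orders this store vs. all preceding device accesses. */
static void dma_memory_write(uint64_t addr, const void *buf, size_t len)
{
    atomic_thread_fence(memory_order_seq_cst);
    dma_memory_write_relaxed(addr, buf, len);
}
```

The point of the split is exactly the bisection argument above: every
device gets the safe default for free, and a relaxed call site is an
explicit, greppable statement that someone thought about the ordering.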
One thing that might be worth looking at is whether mb() is indeed so
much more costly than just wmb/rmb, in which case we could have some
smarts in the accessors to make them skip the full mb based on
knowledge of the previous access direction, though here too I would be
tempted to only do that if absolutely necessary (i.e. if we can't
instead just fix the sensitive driver to use explicitly relaxed
accessors).

> > So on that I will not compromise.
> >
> > However, I think it might be better to leave the barrier in the dma
> > accessor since that's how you also get iommu transparency etc... so
> > it's not a bad place to put them, and leave the cpu_physical_* for
> > use by lower level device drivers which are thus responsible also
> > for dealing with ordering if they have to.
> >
> > Cheers,
> > Ben.
>
> You claim to understand what matters for all devices, I doubt that.

It's pretty obvious that anything that does DMA using a classic
descriptor + buffers structure is broken without appropriate ordering.
And yes, I claim to have a fairly good idea of the problem, but I don't
think throwing credentials around is going to be helpful.

> Why don't we add safe APIs, then go over devices and switch over?
> I counted 97 pci_dma_ accesses.
> 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
>
> Let maintainers make a decision where does speed matter.

No. Let's fix the general bug first. Then let people who know the
individual drivers intimately and understand their access patterns make
the call as to when things can/should be relaxed.

Ben.
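For the record, the "skip the full mb based on the previous access
direction" idea mentioned above could be sketched as follows (purely
illustrative names; C11 fences standing in for mb/wmb/rmb): same-
direction back-to-back accesses only need the cheaper one-way fence,
and the full barrier is reserved for direction changes.

```c
#include <assert.h>
#include <stdatomic.h>

enum dma_dir { DMA_DIR_NONE, DMA_DIR_READ, DMA_DIR_WRITE };
static enum dma_dir last_dir = DMA_DIR_NONE;

static void dma_barrier(enum dma_dir dir)
{
    if (dir == last_dir) {
        /* write-after-write needs only a wmb (release), and
         * read-after-read only an rmb (acquire) */
        atomic_thread_fence(dir == DMA_DIR_WRITE ? memory_order_release
                                                 : memory_order_acquire);
    } else {
        /* direction changed (or first access): full barrier */
        atomic_thread_fence(memory_order_seq_cst);
        last_dir = dir;
    }
}
```

Note that tracking last_dir per device (and making it safe vs.
concurrent accessors) is exactly the kind of extra complexity that
argues for only doing this if measurements show the full mb really
hurts.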