On 04/12/2018 07:08 AM, David Gibson wrote:
> On Thu, Dec 21, 2017 at 11:12:06AM +1100, Benjamin Herrenschmidt wrote:
>> On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
>>>
>>> As you've suggested yourself, I think we might need to more
>>> explicitly model the different components of the XIVE system. As part
>>> of that, I think you need to be clearer in this base skeleton about
>>> exactly what component your XIVE object represents.
>>>
>>> If the answer is "the overall thing" I suspect that's not what you
>>> want - I had one of those for XICs which proved to be a mistake
>>> (eventually replaced by the XICSFabric interface).
>>>
>>> Changing the model later isn't impossible, but doing so without
>>> breaking migration can be a real pain, so I think it's worth a
>>> reasonable effort to try and get it right initially.
>>
>> Note: we do need to speed things up a bit, as having exploitation mode
>> in KVM will significantly help with IPI performance among other things.
>>
>> I'm about ready to do the KVM bits. The one thing we need to discuss
>> and figure a good design for is how we map all those interrupt control
>> pages into qemu.
>>
>> Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
>> which are used for guest IPIs and for vio/virtio/emulated interrupts)
>> comes with a "control page" (ESB page) which needs to be mapped into
>> the guest, and the generic IPIs also come with a trigger page which
>> needs to be mapped into the guest for guest IPIs or OpenCAPI
>> interrupts, or just qemu for emulated devices.
>>
>> Now there can be thousands of these critters. I certainly don't want to
>> create thousands of VMAs in qemu and even less thousands of memory
>> regions in KVM.
>>
>> So we need some kind of mechanism by which a single large VMA gets
>> mmap'ed into qemu (or maybe a couple of these, but not too many) and
>> the interrupt pages can be assigned to slots in there and demand
>> faulted.
>
> Ok, I see your point. We'll definitely need to be able to map things
> in as a block, rather than one by one.

So, the approach taken is to use a single mmap() exposed to the guest
through one ram_device memory region. Its size is that of the IRQ number
space, which is hardcoded in QEMU to 4096 (IPIs) + 1024 (virtual device
interrupts). We can change that, but the 4K split is important for XICS
compatibility. The KVM XIVE device should adapt by itself.

C.

>> For the generic interrupts, this can probably be covered by KVM, adding
>> some arch ioctls for allocating IPIs and mmap'ing that region etc...
>>
>> For pass-through, it's trickier, we don't want to mmap each irqfd
>> individually for the above reason, so we want to "link" them to KVM. We
>> don't want to allow qemu to take control of any arbitrary interrupt in
>> the system though, so it has to be related to the ownership of the irqfd
>> coming from vfio.
>>
>> OpenCAPI I suspect will be its own can of worms...
>>
>> Also, have we decided how the process of switching between XICS and
>> XIVE will work vs. CAS? And how will that interact with KVM? I was
>> thinking the kernel would implement a different KVM device type, ie
>> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
>> KVM_DEV_TYPE_XIVE.
>>
>
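
To make the single-mapping idea concrete, here is a minimal userspace
sketch in C. It is not the QEMU or KVM code: an anonymous PROT_NONE
mapping stands in for the mmap() on the KVM XIVE device, and one
page-sized slot per IRQ number is an assumption made purely for
illustration. All esb_* functions and XIVE_NR_* macros are hypothetical
names; only the 4096 + 1024 split comes from the discussion above.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Split quoted in the mail: 4096 IPIs + 1024 virtual device interrupts. */
#define XIVE_NR_IPIS   4096u
#define XIVE_NR_VIRQS  1024u
#define XIVE_NR_IRQS   (XIVE_NR_IPIS + XIVE_NR_VIRQS)

/* ASSUMPTION: each interrupt number owns exactly one host page. */
size_t esb_slot_size(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    return (size_t)page;
}

/* Total size of the window covering the whole IRQ number space. */
size_t esb_window_size(void)
{
    return (size_t)XIVE_NR_IRQS * esb_slot_size();
}

/*
 * Reserve the whole window as ONE mapping (one VMA) instead of one
 * mapping per interrupt. The anonymous PROT_NONE mapping is only a
 * stand-in for the device mmap(); nothing is accessible through it.
 * Returns NULL on failure.
 */
void *esb_window_reserve(void)
{
    void *base = mmap(NULL, esb_window_size(), PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}

/* Address of the slot for a given IRQ number, or NULL if out of range. */
void *esb_slot_addr(void *base, uint32_t irq)
{
    if (base == NULL || irq >= XIVE_NR_IRQS)
        return NULL;
    return (uint8_t *)base + (size_t)irq * esb_slot_size();
}

/* Drop the whole window again; returns 0 on success like munmap(). */
int esb_window_release(void *base)
{
    return munmap(base, esb_window_size());
}
```

With this layout, looking up the page for an interrupt is plain
arithmetic on the IRQ number, so the number of VMAs and memory regions
stays constant however many interrupts are in use.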