On Wed, May 28, 2014 at 01:24:15PM +0200, Alexander Graf wrote: > >On 28.05.14 06:12, Gavin Shan wrote: >>On Wed, May 28, 2014 at 12:41:37AM +0200, Alexander Graf wrote: >>>On 28.05.14 00:27, Benjamin Herrenschmidt wrote: >>>>On Wed, 2014-05-28 at 00:22 +0200, Alexander Graf wrote: >>.../... >> >>>>In any case, the above isn't the problem, we register rtas functions >>>>called rtas_ibm_*, that's fine. >>>They're called rtas_ibm, not rtas_vfio. They simply live in the wrong >>>file for starters. Once you start moving them out of *vfio*.c and >>>make sure you don't sneak in vfio headers into spapr_rtas.c all the >>>abstraction should come naturally. >>> >>>>The problem you have is with what's inside these functions, correct ? >>>>You want some kind of PCI Dev ops... am I understanding properly ? >>>>Instead of having them call the VFIO ioctl's directly. >>>We have a nice bus hierarchy. The RTAS call tells you the PHB's ID. >>>The machine knows about all of its PHBs, so you can find that one. >>> >>>From the PHB we can then find a PCIDevice by identifier if we >>>received one in the RTAS call. >>> >>>If we received a PE, we need to act on a virtual PE object (not >>>available yet I think) which then would call at least into each VFIO >>>container associated with that guest PE. We don't want to call into >>>devices here, since we could have 2 VFIO devices in one PE and don't >>>want to do a PE level operation twice. >>> >>>If we want to simplify things (and I'm fine with that), we can also >>>restrict a "guest PE" to only span at most a single VFIO container >>>context and unlimited emulated devices. That way we don't need to >>>worry about potential side effects. >>> >>>I can't tell you what the "guest PE" would best be modeled at. Maybe >>>a simple list/hash inside the PHB is the answer. But we need to >>>ensure that the guest knows about its scope of action. And we need to >>>make sure that we maintain abstraction levels to keep at least a >>>minimum level of sanity for the reader. >>> >>sPAPRVFIOPHBState has one-to-one mapping with container, which has >>one-to-one mapping relationship with IOMMU group on sPAPR platform. >>So sPAPRVFIOPHBState is representing "guest PE". > >So each PHB only spans a single PE? >
Yep. >> >>Yes. RTAS calls from guest, 2 types of address information as input there: >>PHB BUID+device's PCI config address, or PHB BUID+PE address. After removing >>the logic of translating device's PCI config address to PE address from >>host to QEMU, we don't have to care much about the first case (PHB BUID+ >>device's PCI config address). That means we only have to care "PE address", >>which is basically the IOMMU group ID (plus fixed offset) of >>sPAPRPHBVFIOState. >>So the BUID can be used to locate sPAPRPHBVFIOState and "PE address" can >>help doing same thing. > >... then we can just ignore the "PE address", no? > Yes, we can ignore that. But I plan to use it for more sanity check, which isn't harmful. >> >>As you mentioned above, spapr_rtas.c can't include "vfio.h". I guess I have >>to rework on PCIDevice/PCIDeviceClass for a bit as follows. Alex, could you >>help confirming it's what you're expecting to see? > >I depends. If you need a PCI device level callback, then yes. If you >don't need a PCI level callback, just have one in the PHB. I'm not >quite sure how you can find the "VFIO container" that belongs to that >PHB though, but I'm sure you'll figure that out :). > Yeah, I'll have PHB level callback after having a discussion with Alexey who gave good idea to make the abstraction as follows. Please let me know if I'm in wrong direction again because I don't have QEMU experience at all: RTAS call -> hw/ppc/spapr_pci.c where RTAS calls get registered -> sPAPRPHBClass::rtas_eeh_handler() in hw/ppc/spapr_pci_vfio.c -> vfio_container_ioctl() in hw/misc/vfio.c If you think it's right way to go, I'm going to change the code accordingly and send it to you for review after Alexey's quick scan. That would save you a bit time in future though it already took you much time. >> >>- In include/hw/pci/pci.h, to have following callback for PCIDeviceClass >> >> /* It's basically mapping ioctl commands to another set of macros */ >> #define PCI_DEV_EEH_OP_ENABLE 0 >> #define PCI_DEV_EEH_OP_DISABLE 1 >> #define PCI_DEV_EEH_OP_GET_STATE 2 >> #define PCI_DEV_EEH_OP_PE_RESET 3 >> #define PCI_DEV_EEH_OP_PE_CONFIGURE 4 >> >> #define PCI_DEV_EEH_OPT_HRSET 0 >> #define PCI_DEV_EEH_OPT_FRESET 1 >> : >> int (*eeh_handler)(PCIDevice *pdev, int opcode, int option); > >If we need a PCI device level callback, yes. > >> >>- In hw/ppc/spapr_pci_vfio.c, to register EEH RTAS calls there and I >> shouldn't include linux-headers/vfio.h there. When EEH RTAS calls >> come in, find the PE according to the PE address and pick up any >> one of the included PCI device and call into eeh_handler(). If the >> input address is device's PCI config address, find the PCI device >> and call to its eeh_handler() callback. > >Uh. The RTAS flow should go through a few different files that >basically represent the different layers on a real machine. > >The spapr_rtas.c file should register the RTAS calls, as that one >emulates the RTAS code that would usually run in guest code. That >code determines and calls into the PHB. > >The PHB is implemented in hw/ppc/spapr_pci.c, so that's where the EEH >calls on the PHB level belong. If the target of a call is a PCI >device, the PHB looks up the device, its class and calls into an EEH >call inside the PCI device. If the target is a PE, the PHB needs to >call out on the TCE entity representing the PE. If that is backed by >a VFIO IOMMU, it should get called somehow. > Ah, It's almost same to the above idea that Alexey shared. The different point is that I'm going to have PHB level callback as sPAPRVFIOPHBState could be regarded as "PE" :-) Thanks, Gavin