On Tue, Apr 27, 2021 at 9:56 PM Mahesh Salgaonkar <mah...@linux.ibm.com> wrote: > > With upstream kernel, especially after commit 98ba956f6a389 > ("powerpc/pseries/eeh: Rework device EEH PE determination") we see that KVM > guest isn't able to enable EEH option for PCI pass-through devices anymore.
How are you passing the devices through to the guest? > [root@atest-guest ~]# dmesg | grep EEH > [ 0.032337] EEH: pSeries platform initialized > [ 0.298207] EEH: No capable adapters found: recovery disabled. > [root@atest-guest ~]# > > So far the linux kernel was assuming pe_config_addr equal to device's > config_addr and using it to enable EEH on the PE through ibm,set-eeh-option > RTAS call. Which wasn't the correct way as per PAPR. The linux kernel > commit 98ba956f6a389 fixed this flow. With that fixed, linux now gets the > PE config address first using the ibm,get-config-addr-info2 RTAS call, and > then uses the found address as argument to ibm,set-eeh-option RTAS call to > enable EEH on the PCI device. That's not really correct. EEH being enabled or disabled is a per-PE setting rather than a per-device setting. The fact there's not a 1-1 correspondence between devices and PEs is a large part of why the get-config-addr-info2 RTAS call exists in the first place. Unfortunately, the initial implementation of EEH support in linux conflated the two because in the past there was typically a single device within a PE. However, that assumption was never really correct and it has long outlived its usefulness. > This works on PowerVM lpar but fails in qemu > KVM guests. This is because ibm,set-eeh-option RTAS call in qemu expects PE > config address bits 16:20 be populated with device number (DEVNUM). > > The rtas call ibm,get-config-addr-info2 in qemu always returns the PE > config address in form of "00BB0001" (i.e. <00><BUS><DEVFN><REG>) where > "BB" represents the bus number of PE's primary bus and with device number > information always set to zero. However until commit 98ba956f6a389 this > return value wasn't used to enable EEH on the PCI device. > > Now in qemu guest with recent kernel the ibm,set-eeh-option RTAS call fails > with -3 return value indicating that there is no PCI device exist for the > specified pe config address. The rtas_ibm_set_eeh_option call uses > pci_find_device() to get the PC device that matches specific bus and devfn > extracted from PE config address passed as argument. Since the DEVFN part > of PE config always contains zero, pci_find_device() fails to find the > specific PCI device and hence fails to enable the EEH capability. > > hw/ppc/spapr_pci_vfio.c: spapr_phb_vfio_eeh_set_option() > case RTAS_EEH_ENABLE: { > PCIHostState *phb; > PCIDevice *pdev; > > /* > * The EEH functionality is enabled on basis of PCI device, > * instead of PE. We need check the validity of the PCI > * device address. > */ > phb = PCI_HOST_BRIDGE(sphb); > pdev = pci_find_device(phb->bus, > (addr >> 16) & 0xFF, (addr >> 8) & 0xFF); > if (!pdev || !object_dynamic_cast(OBJECT(pdev), "vfio-pci")) { > return RTAS_OUT_PARAM_ERROR; > } > > This patch fixes this issue by populating DEVNUM field (bits 16:20) of PE > config address with device number. I don't think this is a good idea and I'm fairly sure you're introducing some subtle breakage here. As an example, say that on the host side you have two devices on the same bus: 0000:00:00.0 - card 1 0000:00:01.0 - card 2 On PowerNV (i.e. the hypervisor) these would be placed into the same hardware PE since they're on the same bus. Now, if I take both devices and pass them through to the guest on the same PHB and bus (use 0005:ff) we'll have: 0005:ff:00.0 - card 1 0005:ff:01.0 - card 2 With this patch applied get-config-addr-info2 returns 00BBD001, so the PE we get for each device will be: card 1 - 00ff0001 card 2 - 00ff1001 Which implies the two are in different PEs. As a result, if the guest requests a reset of card 1's PE then the guest will see an unexpected reset of card 2 as well. From the hypervisor's point of view the two are in the same PE so this is a legitimate thing to do, but due to this patch the guest doesn't know that. As far as I can remember this is why you're supposed to pass each EEH capable devices to the guest on a seperate spapr-phb (which matches what PHYP does). Alexy can probably tell you more. Oliver