On Tue, Feb 21, 2017 at 10:08 PM, Peter Xu <pet...@redhat.com> wrote:
> [cc Jintack] > > On Tue, Feb 21, 2017 at 02:43:03PM -0700, Alex Williamson wrote: > > Since commit 4bb571d857d9 ("pci/pcie: don't assume cap id 0 is > > reserved") removes the internal use of extended capability ID 0, the > > comment here becomes invalid. However, peeling back the onion, the > > code is still correct and we still can't seed the capability chain > > with ID 0, unless we want to muck with using the version number to > > force the header to be non-zero, which is much uglier to deal with. > > The comment also now covers some of the subtleties of using cap ID 0, > > such as transparently indicating absence of capabilities if none are > > added. This doesn't detract from the correctness of the referenced > > commit as vfio in the kernel also uses capability ID zero to mask > > capabilties. In fact, we should skip zero capabilities precisely > > because the kernel might also expose such a capability at the head > > position and re-introduce the problem. > > > > Signed-off-by: Alex Williamson <alex.william...@redhat.com> > > Cc: Peter Xu <pet...@redhat.com> > > Cc: Michael S. Tsirkin <m...@redhat.com> > > --- > > hw/vfio/pci.c | 31 +++++++++++++++++++++---------- > > 1 file changed, 21 insertions(+), 10 deletions(-) > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > > index f2ba9b6cfafc..03a3d0154976 100644 > > --- a/hw/vfio/pci.c > > +++ b/hw/vfio/pci.c > > @@ -1880,16 +1880,26 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) > > /* > > * Extended capabilities are chained with each pointing to the > next, so we > > * can drop anything other than the head of the chain simply by > modifying > > - * the previous next pointer. For the head of the chain, we can > modify the > > - * capability ID to something that cannot match a valid > capability. ID > > - * 0 is reserved for this since absence of capabilities is > indicated by > > - * 0 for the ID, version, AND next pointer. However, > pcie_add_capability() > > - * uses ID 0 as reserved for list management and will incorrectly > match and > > - * assert if we attempt to pre-load the head of the chain with this > ID. > > - * Use ID 0xFFFF temporarily since it is also seems to be reserved > in > > - * part for identifying absence of capabilities in a root complex > register > > - * block. If the ID still exists after adding capabilities, switch > back to > > - * zero. We'll mark this entire first dword as emulated for this > purpose. > > + * the previous next pointer. Seed the head of the chain here such > that > > + * we can simply skip any capabilities we want to drop below, > regardless > > + * of their position in the chain. If this stub capability still > exists > > + * after we add the capabilities we want to expose, update the > capability > > + * ID to zero. Note that we cannot seed with the capability header > being > > + * zero as this conflicts with definition of an absent capability > chain > > + * and prevents capabilities beyond the head of the list from being > added. > > + * By replacing the dummy capability ID with zero after walking the > device > > + * chain, we also transparently mark extended capabilities as > absent if > > + * no capabilities were added. Note that the PCIe spec defines an > absence > > + * of extended capabilities to be determined by a value of zero for > the > > + * capability ID, version, AND next pointer. A non-zero next > pointer > > + * should be sufficient to indicate additional capabilities are > present, > > + * which will occur if we call pcie_add_capability() below. The > entire > > + * first dword is emulated to support this. > > + * > > + * NB. The kernel side does similar masking, so be prepared that our > > + * view of the device may also contain a capability ID zero in the > head > > + * of the chain. Skip it for the same reason that we cannot seed > the > > + * chain with a zero capability. > > */ > > pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE, > > PCI_EXT_CAP(0xFFFF, 0, 0)); > > @@ -1915,6 +1925,7 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) > > PCI_EXT_CAP_NEXT_MASK); > > > > switch (cap_id) { > > + case 0: /* kernel masked capability */ > > case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */ > > case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function > virtualization */ > > trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, > cap_id, next); > > > > Reviewed-by: Peter Xu <pet...@redhat.com> > > Since this bug is originally reported by Jintack, maybe we can also > add: > > Reported-by: Jintack Lim <jint...@cs.columbia.edu> > > Jintack, if you want to test it and provide your tested-by, it would > be nice as well. ;) > I believe this patch is to fix the error I got before. qemu-system-x86_64: hw/pci/pcie.c:686: pcie_add_capability: Assertion `prev >= 0x100' failed. If so, Tested-by: Jintack Lim <jint...@cs.columbia.edu> > Actually I just found that the bug still exist after Michael's fix (I > thought it was fixed). So we definitely need this patch or equivalent. > However, I would still slightly prefer removing the wrapping hack > since after all we need to touch it (and I do feel like that's error > prone...). So, Alex, do you like below one instead? > > --------8<---------- > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > index 332f41d..6942c1d 100644 > --- a/hw/vfio/pci.c > +++ b/hw/vfio/pci.c > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) > */ > config = g_memdup(pdev->config, vdev->config_size); > > - /* > - * Extended capabilities are chained with each pointing to the next, > so we > - * can drop anything other than the head of the chain simply by > modifying > - * the previous next pointer. For the head of the chain, we can > modify the > - * capability ID to something that cannot match a valid capability. > ID > - * 0 is reserved for this since absence of capabilities is indicated > by > - * 0 for the ID, version, AND next pointer. However, > pcie_add_capability() > - * uses ID 0 as reserved for list management and will incorrectly > match and > - * assert if we attempt to pre-load the head of the chain with this > ID. > - * Use ID 0xFFFF temporarily since it is also seems to be reserved in > - * part for identifying absence of capabilities in a root complex > register > - * block. If the ID still exists after adding capabilities, switch > back to > - * zero. We'll mark this entire first dword as emulated for this > purpose. > - */ > - pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE, > - PCI_EXT_CAP(0xFFFF, 0, 0)); > - pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0); > - pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0); > - > for (next = PCI_CONFIG_SPACE_SIZE; next; > next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) { > header = pci_get_long(config + next); > @@ -1910,24 +1891,33 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) > */ > size = vfio_ext_cap_max_size(config, next); > > - /* Use emulated next pointer to allow dropping extended caps */ > - pci_long_test_and_set_mask(vdev->emulated_config_bits + next, > - PCI_EXT_CAP_NEXT_MASK); > + /* Use emulated header to allow dropping extended caps */ > + pci_set_long(vdev->emulated_config_bits + next, 0xffffffffULL); > > switch (cap_id) { > case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */ > case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function > virtualization */ > + case PCI_EXT_CAP_ID_VC: > + /* > + * For dropped capabilities, we keep their slot but > + * replace them with a header containing cap_id=0 && > + * cap_ver=1. We do this reservation mostly to make sure > + * the head ecap (at offset 0x100) will always be there. > + * Anyway it won't hurt if we keep the rest of the dropped > + * ones as well. > + * > + * Here we use non-zero cap_ver because we want to "mark" > + * this ecap as "available" - from PCIe spec (chap 7.9.1), > + * it marked out that cap_id=cap_ver=next=0 means empty > + * ecap, and we really don't want it to be an "empty" slot > + * especially for the head ecap (we need a head, always!). > + */ > + pcie_add_capability(pdev, 0, 1, next, size); > trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, > next); > break; > default: > pcie_add_capability(pdev, cap_id, cap_ver, next, size); > } > - > - } > - > - /* Cleanup chain head ID if necessary */ > - if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) { > - pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0); > } > > g_free(config); > ------->8-------- > > The new patch will keep all the dropped ecaps (so we may see more than > one cap_id=0x0000 field), which I don't know whether would be a > drawback. OTOH, it provides benefits like: > > - we can remove the wrapping hack, so the code is much readable and > less error prone imho > > - we can avoid using the assumption that 0xffff cap_id is reserved > > I can live with both patches though. :-) > > Thanks! > > -- peterx > >