[cc Jintack] On Tue, Feb 21, 2017 at 02:43:03PM -0700, Alex Williamson wrote: > Since commit 4bb571d857d9 ("pci/pcie: don't assume cap id 0 is > reserved") removes the internal use of extended capability ID 0, the > comment here becomes invalid. However, peeling back the onion, the > code is still correct and we still can't seed the capability chain > with ID 0, unless we want to muck with using the version number to > force the header to be non-zero, which is much uglier to deal with. > The comment also now covers some of the subtleties of using cap ID 0, > such as transparently indicating absence of capabilities if none are > added. This doesn't detract from the correctness of the referenced > commit as vfio in the kernel also uses capability ID zero to mask > capabilties. In fact, we should skip zero capabilities precisely > because the kernel might also expose such a capability at the head > position and re-introduce the problem. > > Signed-off-by: Alex Williamson <alex.william...@redhat.com> > Cc: Peter Xu <pet...@redhat.com> > Cc: Michael S. Tsirkin <m...@redhat.com> > --- > hw/vfio/pci.c | 31 +++++++++++++++++++++---------- > 1 file changed, 21 insertions(+), 10 deletions(-) > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > index f2ba9b6cfafc..03a3d0154976 100644 > --- a/hw/vfio/pci.c > +++ b/hw/vfio/pci.c > @@ -1880,16 +1880,26 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) > /* > * Extended capabilities are chained with each pointing to the next, so > we > * can drop anything other than the head of the chain simply by modifying > - * the previous next pointer. For the head of the chain, we can modify > the > - * capability ID to something that cannot match a valid capability. ID > - * 0 is reserved for this since absence of capabilities is indicated by > - * 0 for the ID, version, AND next pointer. However, > pcie_add_capability() > - * uses ID 0 as reserved for list management and will incorrectly match > and > - * assert if we attempt to pre-load the head of the chain with this ID. > - * Use ID 0xFFFF temporarily since it is also seems to be reserved in > - * part for identifying absence of capabilities in a root complex > register > - * block. If the ID still exists after adding capabilities, switch back > to > - * zero. We'll mark this entire first dword as emulated for this > purpose. > + * the previous next pointer. Seed the head of the chain here such that > + * we can simply skip any capabilities we want to drop below, regardless > + * of their position in the chain. If this stub capability still exists > + * after we add the capabilities we want to expose, update the capability > + * ID to zero. Note that we cannot seed with the capability header being > + * zero as this conflicts with definition of an absent capability chain > + * and prevents capabilities beyond the head of the list from being > added. > + * By replacing the dummy capability ID with zero after walking the > device > + * chain, we also transparently mark extended capabilities as absent if > + * no capabilities were added. Note that the PCIe spec defines an > absence > + * of extended capabilities to be determined by a value of zero for the > + * capability ID, version, AND next pointer. A non-zero next pointer > + * should be sufficient to indicate additional capabilities are present, > + * which will occur if we call pcie_add_capability() below. The entire > + * first dword is emulated to support this. > + * > + * NB. The kernel side does similar masking, so be prepared that our > + * view of the device may also contain a capability ID zero in the head > + * of the chain. Skip it for the same reason that we cannot seed the > + * chain with a zero capability. > */ > pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE, > PCI_EXT_CAP(0xFFFF, 0, 0)); > @@ -1915,6 +1925,7 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) > PCI_EXT_CAP_NEXT_MASK); > > switch (cap_id) { > + case 0: /* kernel masked capability */ > case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */ > case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */ > trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, > next); >
Reviewed-by: Peter Xu <pet...@redhat.com> Since this bug is originally reported by Jintack, maybe we can also add: Reported-by: Jintack Lim <jint...@cs.columbia.edu> Jintack, if you want to test it and provide your tested-by, it would be nice as well. ;) Actually I just found that the bug still exist after Michael's fix (I thought it was fixed). So we definitely need this patch or equivalent. However, I would still slightly prefer removing the wrapping hack since after all we need to touch it (and I do feel like that's error prone...). So, Alex, do you like below one instead? --------8<---------- diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 332f41d..6942c1d 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) */ config = g_memdup(pdev->config, vdev->config_size); - /* - * Extended capabilities are chained with each pointing to the next, so we - * can drop anything other than the head of the chain simply by modifying - * the previous next pointer. For the head of the chain, we can modify the - * capability ID to something that cannot match a valid capability. ID - * 0 is reserved for this since absence of capabilities is indicated by - * 0 for the ID, version, AND next pointer. However, pcie_add_capability() - * uses ID 0 as reserved for list management and will incorrectly match and - * assert if we attempt to pre-load the head of the chain with this ID. - * Use ID 0xFFFF temporarily since it is also seems to be reserved in - * part for identifying absence of capabilities in a root complex register - * block. If the ID still exists after adding capabilities, switch back to - * zero. We'll mark this entire first dword as emulated for this purpose. - */ - pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE, - PCI_EXT_CAP(0xFFFF, 0, 0)); - pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0); - pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0); - for (next = PCI_CONFIG_SPACE_SIZE; next; next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) { header = pci_get_long(config + next); @@ -1910,24 +1891,33 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) */ size = vfio_ext_cap_max_size(config, next); - /* Use emulated next pointer to allow dropping extended caps */ - pci_long_test_and_set_mask(vdev->emulated_config_bits + next, - PCI_EXT_CAP_NEXT_MASK); + /* Use emulated header to allow dropping extended caps */ + pci_set_long(vdev->emulated_config_bits + next, 0xffffffffULL); switch (cap_id) { case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */ case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */ + case PCI_EXT_CAP_ID_VC: + /* + * For dropped capabilities, we keep their slot but + * replace them with a header containing cap_id=0 && + * cap_ver=1. We do this reservation mostly to make sure + * the head ecap (at offset 0x100) will always be there. + * Anyway it won't hurt if we keep the rest of the dropped + * ones as well. + * + * Here we use non-zero cap_ver because we want to "mark" + * this ecap as "available" - from PCIe spec (chap 7.9.1), + * it marked out that cap_id=cap_ver=next=0 means empty + * ecap, and we really don't want it to be an "empty" slot + * especially for the head ecap (we need a head, always!). + */ + pcie_add_capability(pdev, 0, 1, next, size); trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, next); break; default: pcie_add_capability(pdev, cap_id, cap_ver, next, size); } - - } - - /* Cleanup chain head ID if necessary */ - if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) { - pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0); } g_free(config); ------->8-------- The new patch will keep all the dropped ecaps (so we may see more than one cap_id=0x0000 field), which I don't know whether would be a drawback. OTOH, it provides benefits like: - we can remove the wrapping hack, so the code is much readable and less error prone imho - we can avoid using the assumption that 0xffff cap_id is reserved I can live with both patches though. :-) Thanks! -- peterx