On 12/08/16 15:46, David Gibson wrote:
> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote:
>> On Wed, 10 Aug 2016 15:37:17 +1000
>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>>
>>> On 09/08/16 22:16, Alex Williamson wrote:
>>>> On Tue, 9 Aug 2016 15:19:39 +1000
>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>>>>
>>>>> On 09/08/16 02:43, Alex Williamson wrote:
>>>>>> On Wed, 3 Aug 2016 18:40:55 +1000
>>>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>>>>>>
>>>>>>> This exports helpers which are needed to keep a VFIO container in
>>>>>>> memory while there are external users such as KVM.
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>>>>> ---
>>>>>>>  drivers/vfio/vfio.c                 | 30 ++++++++++++++++++++++++++++++
>>>>>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++-
>>>>>>>  include/linux/vfio.h                |  6 ++++++
>>>>>>>  3 files changed, 51 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>>>> index d1d70e0..baf6a9c 100644
>>>>>>> --- a/drivers/vfio/vfio.c
>>>>>>> +++ b/drivers/vfio/vfio.c
>>>>>>> @@ -1729,6 +1729,36 @@ long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
>>>>>>>  EXPORT_SYMBOL_GPL(vfio_external_check_extension);
>>>>>>>
>>>>>>>  /**
>>>>>>> + * External user API for containers, exported by symbols to be linked
>>>>>>> + * dynamically.
>>>>>>> + *
>>>>>>> + */
>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file *filep)
>>>>>>> +{
>>>>>>> +	struct vfio_container *container = filep->private_data;
>>>>>>> +
>>>>>>> +	if (filep->f_op != &vfio_fops)
>>>>>>> +		return ERR_PTR(-EINVAL);
>>>>>>> +
>>>>>>> +	vfio_container_get(container);
>>>>>>> +
>>>>>>> +	return container;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext);
>>>>>>> +
>>>>>>> +void vfio_container_put_ext(struct vfio_container *container)
>>>>>>> +{
>>>>>>> +	vfio_container_put(container);
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext);
>>>>>>> +
>>>>>>> +void *vfio_container_get_iommu_data_ext(struct vfio_container *container)
>>>>>>> +{
>>>>>>> +	return container->iommu_data;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext);
>>>>>>> +
>>>>>>> +/**
>>>>>>>   * Sub-module support
>>>>>>>   */
>>>>>>>  /*
>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>> index 3594ad3..fceea3d 100644
>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
>>>>>>>  	.detach_group	= tce_iommu_detach_group,
>>>>>>>  };
>>>>>>>
>>>>>>> +struct iommu_table *vfio_container_spapr_tce_table_get_ext(void *iommu_data,
>>>>>>> +		u64 offset)
>>>>>>> +{
>>>>>>> +	struct tce_container *container = iommu_data;
>>>>>>> +	struct iommu_table *tbl = NULL;
>>>>>>> +
>>>>>>> +	if (tce_iommu_find_table(container, offset, &tbl) < 0)
>>>>>>> +		return NULL;
>>>>>>> +
>>>>>>> +	iommu_table_get(tbl);
>>>>>>> +
>>>>>>> +	return tbl;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext);
>>>>>>> +
>>>>>>>  static int __init tce_iommu_init(void)
>>>>>>>  {
>>>>>>>  	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION);
>>>>>>>  MODULE_LICENSE("GPL v2");
>>>>>>>  MODULE_AUTHOR(DRIVER_AUTHOR);
>>>>>>>  MODULE_DESCRIPTION(DRIVER_DESC);
>>>>>>> -
>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>>> index 0ecae0b..1c2138a 100644
>>>>>>> --- a/include/linux/vfio.h
>>>>>>> +++ b/include/linux/vfio.h
>>>>>>> @@ -91,6 +91,12 @@ extern void vfio_group_put_external_user(struct vfio_group *group);
>>>>>>>  extern int vfio_external_user_iommu_id(struct vfio_group *group);
>>>>>>>  extern long vfio_external_check_extension(struct vfio_group *group,
>>>>>>>  		unsigned long arg);
>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct file *filep);
>>>>>>> +extern void vfio_container_put_ext(struct vfio_container *container);
>>>>>>> +extern void *vfio_container_get_iommu_data_ext(
>>>>>>> +		struct vfio_container *container);
>>>>>>> +extern struct iommu_table *vfio_container_spapr_tce_table_get_ext(
>>>>>>> +		void *iommu_data, u64 offset);
>>>>>>>
>>>>>>>  /*
>>>>>>>   * Sub-module helpers
>>>>>>
>>>>>>
>>>>>> I think you need to take a closer look of the lifecycle of a container,
>>>>>> having a reference means the container itself won't go away, but only
>>>>>> having a group set within that container holds the actual IOMMU
>>>>>> references. container->iommu_data is going to be NULL once the
>>>>>> groups are lost. Thanks,
>>>>>
>>>>>
>>>>> Container owns the iommu tables and this is what I care about here, groups
>>>>> attached or not - this is handled separately via IOMMU group list in a
>>>>> specific iommu_table struct, these groups get detached from iommu_table
>>>>> when they are removed from a container.
>>>>
>>>> The container doesn't own anything, the container is privileged by the
>>>> groups being attached to it. When groups are closed, they detach from
>>>> the container and once the container group list is empty the iommu
>>>> backend is released and iommu_data is NULL. A container reference
>>>> doesn't give you what you're looking for. It implies nothing about the
>>>> iommu backend.
>>>
>>>
>>> Well. Backend is a part of a container and since a backend owns tables, a
>>> container owns them too.
>>
>> The IOMMU backend is accessed through the container, but that backend
>> is privileged by the groups it contains. Once those groups are gone,
>> the IOMMU backend is released, regardless of whatever reference you
>> have to the container itself such as you're attempting to do here. In
>> that sense, the container does not own those tables.
>
> So, the thing is that what KVM fundamentally needs is a handle on the
> container. KVM is essentially modelling the DMA address space of a
> single guest bus, and the container is what's attached to that.
>
> The first part of the problem is that KVM wants to basically invoke
> vfio_dma_map() operations without bouncing via qemu. Because
> vfio_dma_map() works on the container level, that's the handle that
> KVM needs to hold.
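For the context of the discussion below, this is roughly how KVM is
expected to consume the new exports. Only the vfio_container_*_ext()
helpers are from this patch; fdget()/fdput() are the usual file helpers,
and the kvmppc_vfio_find_table() wrapper (and the idea of passing a
container fd in via a KVM ioctl) is just an illustration, not code from
this series:

/* Illustrative only (would need linux/file.h, linux/err.h, linux/vfio.h):
 * resolve a VFIO container fd passed from QEMU into a referenced
 * iommu_table for a given window offset.
 */
static struct iommu_table *kvmppc_vfio_find_table(int container_fd, u64 offset)
{
	struct fd f = fdget(container_fd);
	struct vfio_container *container;
	struct iommu_table *tbl = NULL;
	void *iommu_data;

	if (!f.file)
		return NULL;

	container = vfio_container_get_ext(f.file);
	if (IS_ERR(container))
		goto out_fdput;

	/* iommu_data is NULL while no group is attached (no backend yet) */
	iommu_data = vfio_container_get_iommu_data_ext(container);
	if (iommu_data)
		tbl = vfio_container_spapr_tce_table_get_ext(iommu_data, offset);

	/* The container reference was only needed to reach the backend;
	 * on success tbl carries its own iommu_table reference.
	 */
	vfio_container_put_ext(container);
out_fdput:
	fdput(f);
	return tbl;
}

On success the caller ends up holding only an iommu_table reference,
which is the object the TCE hcall emulation actually needs.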
Well, I do not need to hold the reference to the container all the time; I
just need it to get to the IOMMU backend and get+reference an iommu_table
from it - the container reference only makes sure the backend does not go
away before we take the iommu_table reference. After that, I keep a
reference to the container solely to know if/when I can release a
particular iommu_table.

This could be worked around by counting how many groups were attached to
this particular KVM-spapr-tce-table and looking at the IOMMU group list
attached to an iommu_table - if the list is empty, decrement the
iommu_table reference counter and that's it, no extra references to a VFIO
container (a rough sketch of this check is in the PS at the end of this
mail). Or I need an alternative way of getting iommu_table's, i.e. QEMU
should somehow tell KVM that this LIOBN is this VFIO container fd (easy -
can be done via the region_add/region_del interface) or VFIO IOMMU group
fd(s) (more tricky, as this needs to be done from more places - vfio-pci
hotplug/unplug, window add/remove).

> The second part of the problem is that in order to reduce overhead
> further, we want to operate in real mode, which means bypassing most
> of the usual VFIO structure and going directly(ish) from the KVM
> hcall emulation to the IOMMU backend behind VFIO. This complicates
> matters a fair bit. Because it is, explicitly, a performance hack,
> some degree of ugliness is probably inevitable.
>
> Alexey - actually implementing this in two stages might make this
> clearer. The first stage wouldn't allow real mode, and would call
> through the same vfio_dma_map() path as qemu calls through now. The
> second stage would then put in place the necessary hacks to add real
> mode support.
>
>>> The problem I am trying to solve here is when KVM may release the
>>> iommu_table objects.
>>>
>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not really
>>> matter) makes a link between KVM-spapr-tce-table and container and KVM can
>>> start using tables (with referencing them).
>>>
>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, called it
>>> from region_del() and this works if QEMU removes a window. However if QEMU
>>> removes a vfio-pci device, region_del() is not called and KVM does not get
>>> notified that it can release the iommu_table's because the
>>> KVM-spapr-tce-table remains alive and does not get destroyed (as it is
>>> still used by emulated devices or other containers).
>>>
>>> So it was suggested that we could do such "unset" somehow later assuming,
>>> for example, on every "set" I could check if some of currently attached
>>> containers are no more used - and this is where being able to know if there
>>> is no backend helps - KVM remembers a container pointer and can check this
>>> via vfio_container_get_iommu_data_ext().
>>>
>>> The other option would be changing vfio_container_get_ext() to take a
>>> callback+opaque which container would call when it destroys iommu_data.
>>> This looks more intrusive and not very intuitive how to make it right -
>>> container would have to keep track of all registered external users and
>>> vfio_container_put_ext() would have to pass the same callback+opaque to
>>> unregister the exact external user.
>>
>> I'm not in favor of anything resembling the code above or extensions
>> beyond it, the container is the wrong place to do this.
>>
>>> Or I could store container file* in KVM. Then iommu_data would never be
>>> released until KVM-spapr-tce-table is destroyed.
>>
>> See above, holding a file pointer to the container doesn't do squat.
>> The groups that are held by the container empower the IOMMU backend,
>> references to the container itself don't matter. Those references will
>> not maintain the IOMMU data.
>>
>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug (closing its fd
>>> would "unset" container from KVM-spapr-tce-table) is not an option as there
>>> still may be devices using this KVM-spapr-tce-table.
>>>
>>> What obvious and nice solution am I missing here? Thanks.
>>
>> The interactions with the IOMMU backend that seem relevant are
>> vfio_iommu_drivers_ops.{detach_group,release}. The kvm-vfio pseudo
>> device is also used to tell kvm about groups as they come and go and
>> has a way to check extensions, and thus properties of the IOMMU
>> backend. All of these are available for your {ab}use. Thanks,
>
> So, Alexey started trying to do this via the KVM-VFIO device, but it's
> a really bad fit. As noted above, fundamentally it's a container we
> need to attach to the kvm-spapr-tce-table object, since what that
> represents is a guest bus DMA address space, and by definition all the
> groups in a container must have the same DMA address space.

Well, in a bad case a LIOBN/kvm-spapr-tce-table has multiple containers
attached, so it is not 1:1...


--
Alexey
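PS: the "look at the IOMMU group list attached to an iommu_table" check
mentioned above would be roughly this on the KVM side. A sketch only:
iommu_table_get()/iommu_table_put() come from earlier in this series and
it_group_list is the existing list of iommu_table_group_link maintained by
the powerpc IOMMU code, but the kvmppc_spapr_tce_table layout and the
function name here are made up for illustration:

/* Sketch: drop KVM's references to iommu_tables which no VFIO group
 * uses anymore, i.e. whose backend has already detached all groups.
 */
static void kvmppc_tce_put_unused_tables(struct kvmppc_spapr_tce_table *stt)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(stt->tables); ++i) {
		struct iommu_table *tbl = stt->tables[i];

		if (!tbl)
			continue;

		/* No group attached anymore -> KVM can let the table go */
		if (list_empty(&tbl->it_group_list)) {
			stt->tables[i] = NULL;
			iommu_table_put(tbl);
		}
	}
}

Whether it_group_list can be tested like this without extra locking is
exactly the kind of detail that would still need sorting out.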
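PPS: for reference, the kvm-vfio pseudo device Alex mentions is driven
from userspace along these lines (this is the existing API described in
Documentation/virtual/kvm/devices/vfio.txt, nothing new in this series),
and it only ever tells KVM about groups, not containers:

/* Existing kvm-vfio device usage (userspace side), for reference only:
 * the device is created once per VM, then QEMU adds/removes VFIO group
 * fds as devices are plugged and unplugged.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int kvm_vfio_create_device(int vm_fd)
{
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };

	return ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0 ? -1 : cd.fd;
}

static int kvm_vfio_set_group(int dev_fd, int group_fd, uint32_t op)
{
	/* op is KVM_DEV_VFIO_GROUP_ADD or KVM_DEV_VFIO_GROUP_DEL */
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = op,
		.addr = (uint64_t)(uintptr_t)&group_fd,
	};

	return ioctl(dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}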