On Mon, Oct 17, 2016 at 05:06:28PM +1100, Alexey Kardashevskiy wrote: > So far I got one question, below. > > > On 23/09/16 17:12, David Gibson wrote: > > On Wed, Sep 21, 2016 at 04:56:52PM +1000, Alexey Kardashevskiy wrote: > >> On 07/09/16 19:09, Alexey Kardashevskiy wrote: > >>> On 29/08/16 23:27, David Gibson wrote: > >>>> On Mon, Aug 29, 2016 at 04:35:15PM +1000, Alexey Kardashevskiy wrote: > >>>>> On 18/08/16 10:22, Alexey Kardashevskiy wrote: > >>>>>> On 17/08/16 13:17, David Gibson wrote: > >>>>>>> On Fri, Aug 12, 2016 at 09:22:01AM -0600, Alex Williamson wrote: > >>>>>>>> On Fri, 12 Aug 2016 15:46:01 +1000 > >>>>>>>> David Gibson <da...@gibson.dropbear.id.au> wrote: > >>>>>>>> > >>>>>>>>> On Wed, Aug 10, 2016 at 10:46:30AM -0600, Alex Williamson wrote: > >>>>>>>>>> On Wed, 10 Aug 2016 15:37:17 +1000 > >>>>>>>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote: > >>>>>>>>>> > >>>>>>>>>>> On 09/08/16 22:16, Alex Williamson wrote: > >>>>>>>>>>>> On Tue, 9 Aug 2016 15:19:39 +1000 > >>>>>>>>>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> On 09/08/16 02:43, Alex Williamson wrote: > >>>>>>>>>>>>>> On Wed, 3 Aug 2016 18:40:55 +1000 > >>>>>>>>>>>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> This exports helpers which are needed to keep a VFIO > >>>>>>>>>>>>>>> container in > >>>>>>>>>>>>>>> memory while there are external users such as KVM. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> > >>>>>>>>>>>>>>> --- > >>>>>>>>>>>>>>> drivers/vfio/vfio.c | 30 > >>>>>>>>>>>>>>> ++++++++++++++++++++++++++++++ > >>>>>>>>>>>>>>> drivers/vfio/vfio_iommu_spapr_tce.c | 16 +++++++++++++++- > >>>>>>>>>>>>>>> include/linux/vfio.h | 6 ++++++ > >>>>>>>>>>>>>>> 3 files changed, 51 insertions(+), 1 deletion(-) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c > >>>>>>>>>>>>>>> index d1d70e0..baf6a9c 100644 > >>>>>>>>>>>>>>> --- a/drivers/vfio/vfio.c > >>>>>>>>>>>>>>> +++ b/drivers/vfio/vfio.c > >>>>>>>>>>>>>>> @@ -1729,6 +1729,36 @@ long > >>>>>>>>>>>>>>> vfio_external_check_extension(struct vfio_group *group, > >>>>>>>>>>>>>>> unsigned long arg) > >>>>>>>>>>>>>>> EXPORT_SYMBOL_GPL(vfio_external_check_extension); > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> /** > >>>>>>>>>>>>>>> + * External user API for containers, exported by symbols to > >>>>>>>>>>>>>>> be linked > >>>>>>>>>>>>>>> + * dynamically. > >>>>>>>>>>>>>>> + * > >>>>>>>>>>>>>>> + */ > >>>>>>>>>>>>>>> +struct vfio_container *vfio_container_get_ext(struct file > >>>>>>>>>>>>>>> *filep) > >>>>>>>>>>>>>>> +{ > >>>>>>>>>>>>>>> + struct vfio_container *container = filep->private_data; > >>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>> + if (filep->f_op != &vfio_fops) > >>>>>>>>>>>>>>> + return ERR_PTR(-EINVAL); > >>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>> + vfio_container_get(container); > >>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>> + return container; > >>>>>>>>>>>>>>> +} > >>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_ext); > >>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>> +void vfio_container_put_ext(struct vfio_container *container) > >>>>>>>>>>>>>>> +{ > >>>>>>>>>>>>>>> + vfio_container_put(container); > >>>>>>>>>>>>>>> +} > >>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_put_ext); > >>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>> +void *vfio_container_get_iommu_data_ext(struct > >>>>>>>>>>>>>>> vfio_container *container) > >>>>>>>>>>>>>>> +{ > >>>>>>>>>>>>>>> + return container->iommu_data; > >>>>>>>>>>>>>>> +} > >>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_get_iommu_data_ext); > >>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>> +/** > >>>>>>>>>>>>>>> * Sub-module support > >>>>>>>>>>>>>>> */ > >>>>>>>>>>>>>>> /* > >>>>>>>>>>>>>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c > >>>>>>>>>>>>>>> b/drivers/vfio/vfio_iommu_spapr_tce.c > >>>>>>>>>>>>>>> index 3594ad3..fceea3d 100644 > >>>>>>>>>>>>>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c > >>>>>>>>>>>>>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c > >>>>>>>>>>>>>>> @@ -1331,6 +1331,21 @@ const struct vfio_iommu_driver_ops > >>>>>>>>>>>>>>> tce_iommu_driver_ops = { > >>>>>>>>>>>>>>> .detach_group = tce_iommu_detach_group, > >>>>>>>>>>>>>>> }; > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> +struct iommu_table > >>>>>>>>>>>>>>> *vfio_container_spapr_tce_table_get_ext(void *iommu_data, > >>>>>>>>>>>>>>> + u64 offset) > >>>>>>>>>>>>>>> +{ > >>>>>>>>>>>>>>> + struct tce_container *container = iommu_data; > >>>>>>>>>>>>>>> + struct iommu_table *tbl = NULL; > >>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>> + if (tce_iommu_find_table(container, offset, &tbl) < 0) > >>>>>>>>>>>>>>> + return NULL; > >>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>> + iommu_table_get(tbl); > >>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>> + return tbl; > >>>>>>>>>>>>>>> +} > >>>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(vfio_container_spapr_tce_table_get_ext); > >>>>>>>>>>>>>>> + > >>>>>>>>>>>>>>> static int __init tce_iommu_init(void) > >>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>> return > >>>>>>>>>>>>>>> vfio_register_iommu_driver(&tce_iommu_driver_ops); > >>>>>>>>>>>>>>> @@ -1348,4 +1363,3 @@ MODULE_VERSION(DRIVER_VERSION); > >>>>>>>>>>>>>>> MODULE_LICENSE("GPL v2"); > >>>>>>>>>>>>>>> MODULE_AUTHOR(DRIVER_AUTHOR); > >>>>>>>>>>>>>>> MODULE_DESCRIPTION(DRIVER_DESC); > >>>>>>>>>>>>>>> - > >>>>>>>>>>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h > >>>>>>>>>>>>>>> index 0ecae0b..1c2138a 100644 > >>>>>>>>>>>>>>> --- a/include/linux/vfio.h > >>>>>>>>>>>>>>> +++ b/include/linux/vfio.h > >>>>>>>>>>>>>>> @@ -91,6 +91,12 @@ extern void > >>>>>>>>>>>>>>> vfio_group_put_external_user(struct vfio_group *group); > >>>>>>>>>>>>>>> extern int vfio_external_user_iommu_id(struct vfio_group > >>>>>>>>>>>>>>> *group); > >>>>>>>>>>>>>>> extern long vfio_external_check_extension(struct vfio_group > >>>>>>>>>>>>>>> *group, > >>>>>>>>>>>>>>> unsigned long arg); > >>>>>>>>>>>>>>> +extern struct vfio_container *vfio_container_get_ext(struct > >>>>>>>>>>>>>>> file *filep); > >>>>>>>>>>>>>>> +extern void vfio_container_put_ext(struct vfio_container > >>>>>>>>>>>>>>> *container); > >>>>>>>>>>>>>>> +extern void *vfio_container_get_iommu_data_ext( > >>>>>>>>>>>>>>> + struct vfio_container *container); > >>>>>>>>>>>>>>> +extern struct iommu_table > >>>>>>>>>>>>>>> *vfio_container_spapr_tce_table_get_ext( > >>>>>>>>>>>>>>> + void *iommu_data, u64 offset); > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> /* > >>>>>>>>>>>>>>> * Sub-module helpers > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I think you need to take a closer look of the lifecycle of a > >>>>>>>>>>>>>> container, > >>>>>>>>>>>>>> having a reference means the container itself won't go away, > >>>>>>>>>>>>>> but only > >>>>>>>>>>>>>> having a group set within that container holds the actual IOMMU > >>>>>>>>>>>>>> references. container->iommu_data is going to be NULL once the > >>>>>>>>>>>>>> groups are lost. Thanks, > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Container owns the iommu tables and this is what I care about > >>>>>>>>>>>>> here, groups > >>>>>>>>>>>>> attached or not - this is handled separately via IOMMU group > >>>>>>>>>>>>> list in a > >>>>>>>>>>>>> specific iommu_table struct, these groups get detached from > >>>>>>>>>>>>> iommu_table > >>>>>>>>>>>>> when they are removed from a container. > >>>>>>>>>>>> > >>>>>>>>>>>> The container doesn't own anything, the container is privileged > >>>>>>>>>>>> by the > >>>>>>>>>>>> groups being attached to it. When groups are closed, they > >>>>>>>>>>>> detach from > >>>>>>>>>>>> the container and once the container group list is empty the > >>>>>>>>>>>> iommu > >>>>>>>>>>>> backend is released and iommu_data is NULL. A container > >>>>>>>>>>>> reference > >>>>>>>>>>>> doesn't give you what you're looking for. It implies nothing > >>>>>>>>>>>> about the > >>>>>>>>>>>> iommu backend. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Well. Backend is a part of a container and since a backend owns > >>>>>>>>>>> tables, a > >>>>>>>>>>> container owns them too. > >>>>>>>>>> > >>>>>>>>>> The IOMMU backend is accessed through the container, but that > >>>>>>>>>> backend > >>>>>>>>>> is privileged by the groups it contains. Once those groups are > >>>>>>>>>> gone, > >>>>>>>>>> the IOMMU backend is released, regardless of whatever reference you > >>>>>>>>>> have to the container itself such as you're attempting to do here. > >>>>>>>>>> In > >>>>>>>>>> that sense, the container does not own those tables. > >>>>>>>>> > >>>>>>>>> So, the thing is that what KVM fundamentally needs is a handle on > >>>>>>>>> the > >>>>>>>>> container. KVM is essentially modelling the DMA address space of a > >>>>>>>>> single guest bus, and the container is what's attached to that. > >>>>>>>>> > >>>>>>>>> The first part of the problem is that KVM wants to basically invoke > >>>>>>>>> vfio_dma_map() operations without bouncing via qemu. Because > >>>>>>>>> vfio_dma_map() works on the container level, that's the handle that > >>>>>>>>> KVM needs to hold. > >>>>>>>>> > >>>>>>>>> The second part of the problem is that in order to reduce overhead > >>>>>>>>> further, we want to operate in real mode, which means bypassing most > >>>>>>>>> of the usual VFIO structure and going directly(ish) from the KVM > >>>>>>>>> hcall emulation to the IOMMU backend behind VFIO. This complicates > >>>>>>>>> matters a fair bit. Because it is, explicitly, a performance hack, > >>>>>>>>> some degree of ugliness is probably inevitable. > >>>>>>>>> > >>>>>>>>> Alexey - actually implementing this in two stages might make this > >>>>>>>>> clearer. The first stage wouldn't allow real mode, and would call > >>>>>>>>> through the same vfio_dma_map() path as qemu calls through now. The > >>>>>>>>> second stage would then put in place the necessary hacks to add real > >>>>>>>>> mode support. > >>>>>>>>> > >>>>>>>>>>> The problem I am trying to solve here is when KVM may release the > >>>>>>>>>>> iommu_table objects. > >>>>>>>>>>> > >>>>>>>>>>> "Set" ioctl() to KVM-spapr-tce-table (or KVM itself, does not > >>>>>>>>>>> really > >>>>>>>>>>> matter) makes a link between KVM-spapr-tce-table and container > >>>>>>>>>>> and KVM can > >>>>>>>>>>> start using tables (with referencing them). > >>>>>>>>>>> > >>>>>>>>>>> First I tried adding an "unset" ioctl to KVM-spapr-tce-table, > >>>>>>>>>>> called it > >>>>>>>>>>> from region_del() and this works if QEMU removes a window. > >>>>>>>>>>> However if QEMU > >>>>>>>>>>> removes a vfio-pci device, region_del() is not called and KVM > >>>>>>>>>>> does not get > >>>>>>>>>>> notified that it can release the iommu_table's because the > >>>>>>>>>>> KVM-spapr-tce-table remains alive and does not get destroyed (as > >>>>>>>>>>> it is > >>>>>>>>>>> still used by emulated devices or other containers). > >>>>>>>>>>> > >>>>>>>>>>> So it was suggested that we could do such "unset" somehow later > >>>>>>>>>>> assuming, > >>>>>>>>>>> for example, on every "set" I could check if some of currently > >>>>>>>>>>> attached > >>>>>>>>>>> containers are no more used - and this is where being able to > >>>>>>>>>>> know if there > >>>>>>>>>>> is no backend helps - KVM remembers a container pointer and can > >>>>>>>>>>> check this > >>>>>>>>>>> via vfio_container_get_iommu_data_ext(). > >>>>>>>>>>> > >>>>>>>>>>> The other option would be changing vfio_container_get_ext() to > >>>>>>>>>>> take a > >>>>>>>>>>> callback+opaque which container would call when it destroys > >>>>>>>>>>> iommu_data. > >>>>>>>>>>> This looks more intrusive and not very intuitive how to make it > >>>>>>>>>>> right - > >>>>>>>>>>> container would have to keep track of all registered external > >>>>>>>>>>> users and > >>>>>>>>>>> vfio_container_put_ext() would have to pass the same > >>>>>>>>>>> callback+opaque to > >>>>>>>>>>> unregister the exact external user. > >>>>>>>>>> > >>>>>>>>>> I'm not in favor of anything resembling the code above or > >>>>>>>>>> extensions > >>>>>>>>>> beyond it, the container is the wrong place to do this. > >>>>>>>>>> > >>>>>>>>>>> Or I could store container file* in KVM. Then iommu_data would > >>>>>>>>>>> never be > >>>>>>>>>>> released until KVM-spapr-tce-table is destroyed. > >>>>>>>>>> > >>>>>>>>>> See above, holding a file pointer to the container doesn't do > >>>>>>>>>> squat. > >>>>>>>>>> The groups that are held by the container empower the IOMMU > >>>>>>>>>> backend, > >>>>>>>>>> references to the container itself don't matter. Those references > >>>>>>>>>> will > >>>>>>>>>> not maintain the IOMMU data. > >>>>>>>>>> > >>>>>>>>>>> Recreating KVM-spapr-tce-table on every vfio-pci hotunplug > >>>>>>>>>>> (closing its fd > >>>>>>>>>>> would "unset" container from KVM-spapr-tce-table) is not an > >>>>>>>>>>> option as there > >>>>>>>>>>> still may be devices using this KVM-spapr-tce-table. > >>>>>>>>>>> > >>>>>>>>>>> What obvious and nice solution am I missing here? Thanks. > >>>>>>>>>> > >>>>>>>>>> The interactions with the IOMMU backend that seem relevant are > >>>>>>>>>> vfio_iommu_drivers_ops.{detach_group,release}. The kvm-vfio pseudo > >>>>>>>>>> device is also used to tell kvm about groups as they come and go > >>>>>>>>>> and > >>>>>>>>>> has a way to check extensions, and thus properties of the IOMMU > >>>>>>>>>> backend. All of these are available for your {ab}use. Thanks, > >>>>>>>>> > >>>>>>>>> So, Alexey started trying to do this via the KVM-VFIO device, but > >>>>>>>>> it's > >>>>>>>>> a really bad fit. As noted above, fundamentally it's a container we > >>>>>>>>> need to attach to the kvm-spapr-tce-table object, since what that > >>>>>>>>> represents is a guest bus DMA address space, and by definition all > >>>>>>>>> the > >>>>>>>>> groups in a container must have the same DMA address space. > >>>>>>>> > >>>>>>>> That's all fine and good, but the point remains that a reference to > >>>>>>>> the > >>>>>>>> container is no assurance of the iommu state. The iommu state is > >>>>>>>> maintained by the user and the groups attached to the container. If > >>>>>>>> the groups are removed, your container reference no long has any > >>>>>>>> iommu > >>>>>>>> backing and iommu_data is worthless. The user can do this as well by > >>>>>>>> un-setting the iommu. I understand what you're trying to do, it's > >>>>>>>> just > >>>>>>>> wrong. Thanks, > >>>>>>> > >>>>>>> I'm trying to figure out how to do this right, and it's not at all > >>>>>>> obvious. The container may be wrong, but that doesn't have the > >>>>>>> KVM-VFIO device any more useful. Attempting to do this at the group > >>>>>>> level is at least as wrong for the reasons I've mentioned elsewhere. > >>>>>>> > >>>>>> > >>>>>> I could create a new fd, one per iommu_table, the fd would reference > >>>>>> the > >>>>>> iommu_table (not touching an iommu_table_group or a container), VFIO > >>>>>> SPAPR > >>>>>> TCE backend would return it in VFIO_IOMMU_SPAPR_TCE_CREATE (ioctl which > >>>>>> creates windows) or I could add VFIO_IOMMU_SPAPR_TCE_GET_FD_BY_OFFSET; > >>>>>> then > >>>>>> I'd pass this new fd to the KVM or KVM-spapr-tce-table to hook them > >>>>>> up. To > >>>>>> release the reference, KVM-spapr-tce-table would have "unset" ioctl() > >>>>>> or/and on every "set" I would look if all attached tables have at > >>>>>> least one > >>>>>> iommu_table_group attached, if none - release the table. > >>>>>> > >>>>>> This would make no change to generic VFIO code and very little change > >>>>>> in > >>>>>> SPAPR TCE backend. Would that be acceptable or it is horrible again? > >>>>>> Thanks. > >>>>> > >>>>> > >>>>> Ping? > >>>> > >>>> I'm still in Toronto after KVM Forum. I had a detailed discussion > >>>> about this with Alex W, which I'll write up once I get back. > >>>> > >>>> The short version is that Alex more-or-less convinced me that we do > >>>> need to go back to doing this with an interface based on linking > >>>> groups to LIOBNs. That leads to an interface that's kind of weird and > >>>> has some fairly counter-intuitive properties, but in the end it works > >>>> out better than doing it with containers. > >>>> > >>> > >>> > >>> > >>> > >>> Soooo? :) > >> > >> > >> When can I expect a full version of how to do this in-kernel thingy? > >> Thanks. > > > > When I can dig myself out from under other things in my queue. Which > > turns out to be now. > > > > Ok.. here's hoping I can remember enough of the conclusions I came to > > with Alex W. > > > > User <-> KVM interface > > ---------------------- > > > > This needs to take an LIOBN and a group fd and associate (or > > disassociate) them. This should be possible to do by adding each > > group to the vfio-kvm device as on x86, then setting an attribute on > > the device to mark the associated liobn. > > > > Attaching different (overlapping) LIOBNs to different groups in the > > same container is boundedly undefined (i.e. it mustn't break the host, > > but can do anything to the guest). > > > > KVM <-> VFIO (in kernel) interface > > ---------------------------------- > > > > You'll need a special function which takes a vfio group fd and returns > > a reference to an iommu_table object. It would also return an error > > if the group isn't backed by the spapr_tce iommu driver (including any > > calls to it on a non-ppc host). This should probably also increment > > the iommu table's ref count (on success). > > > > Implementation notes > > -------------------- > > > > When a device in a new group is hotplugged, qemu would need to add the > > group to the container *then* tell KVM to attach the group to the > > correct liobn(s). > > > > KVM would add the group to a list for that liobn. It would call the > > vfio hook to get the associated iommu table. If there's an error, > > then it's unable to enable acceleration, and would either return an > > error immediately or ensure that later attempts to PUT_TCE will be > > punted to qemu. > > > > Assuming it is able to accelerate, it would add the iommu table to a > > list of iommu tables associated with the liobn. It will need to > > de-dupe here, since with multiple groups per container you'd expect > > multiple groups with the same iommu table. > > > > H_PUT_TCE would walk the list of attached iommu tables and update them > > using the ppc kernel iommu interfaces. > > > > When a group is removed from a liobn, kvm would need to recalculate > > the list of iommu tables, in case that was the last group attached to > > the table. It would need to decrement the refcount on the iommu table > > and, obviously, make sure everything is sychronized with the real mode > > PUT_TCE. > > > > When a group is hot unplugged, it's qemu's resposibility to tell kvm > > that the group is no longer associated with the liobn, before it > > removes the group from the container. > > > Cannot VFIO KVM device just release this extra reference when QEMU calls > KVM_DEV_VFIO_GROUP_DEL from vfio_instance_finalize->vfio_put_group?
Yes, that should do it. That doesn't contradict the statement above, it's just that we already have it - removing the group from the vfio-kvm device is the mechanism by which qemu informs the kernel the group is no longer associated with a liobn. > > If it doesn't there may be a > > stale iommu table attached to the liobn. That could certainly mess up > > DMA on the guest for other devices, but shouldn't damage the host - > > the group now belongs to the host again, but because the group was > > detached from the container, the HW is no longer using the container's > > iommu table (which KVM is touching) to actually serve the group. > > > > If all the groups are unplugged, so the container becomes quiescent, > > KVM's refcount(s) on the iommu table stop it going away. It won't be > > looked at by the hardware any more, so updates will be useless, but > > again that's only a problem for the guest, not the host. > > > > > > Hope that covers it. > > > > Alex, please let me know if I missed something from our discussion. > > > > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
signature.asc
Description: PGP signature