On Fri, Sep 06, 2013 at 08:40:26PM +1000, Alexey Kardashevskiy wrote:
> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
> them to user space which saves time on switching to user space and back.
> 
> Both real and virtual modes are supported. The kernel tries to
> handle a TCE request in the real mode, if fails it passes the request
> to the virtual mode to complete the operation. If it a virtual mode
> handler fails, the request is passed to user space.
> 
> The first user of this is VFIO on POWER. Trampolines to the VFIO external
> user API functions are required for this patch.
> 
> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
> of map/unmap requests. The device supports a single attribute which is
> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
> establishes the connection between KVM and VFIO.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Paul Mackerras <pau...@samba.org>
> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> 
> ---
> 
> Changes:
> v10:
> * all IOMMU TCE links are handled by one KVM device now
> * KVM device has its own list of TCE descriptors
> * the search-by-liobn function was extended to search through
> emulated and IOMMU lists
> 
> v9:
> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
> KVM device
> * release_spapr_tce_table() is not shared between different TCE types
> * reduced the patch size by moving KVM device bits and VFIO external API
> trampolines to separate patches
> * moved documentation from Documentation/virtual/kvm/api.txt to
> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> 
> v8:
> * fixed warnings from check_patch.pl
> 
> 2013/07/11:
> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
> for KVM_BOOK3S_64
> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
> for this here but the next patch for hugepages support will use it more.
> 
> 2013/07/06:
> * added realmode arch_spin_lock to protect TCE table from races
> in real and virtual modes
> * POWERPC IOMMU API is changed to support real mode
> * iommu_take_ownership and iommu_release_ownership are protected by
> iommu_table's locks
> * VFIO external user API use rewritten
> * multiple small fixes
> 
> 2013/06/27:
> * tce_list page is referenced now in order to protect it from accident
> invalidation during H_PUT_TCE_INDIRECT execution
> * added use of the external user VFIO API
> 
> 2013/06/05:
> * changed capability number
> * changed ioctl number
> * update the doc article number
> 
> 2013/05/20:
> * removed get_user() from real mode handlers
> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
> translated TCEs, tries realmode_get_page() on those and if it fails, it
> passes control over the virtual mode handler which tries to finish
> the request handling
> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
> on a page
> * The only reason to pass the request to user mode now is when the user mode
> did not register TCE table in the kernel, in all other cases the virtual mode
> handler is expected to do the job
> ---
>  .../virtual/kvm/devices/spapr_tce_iommu.txt        |  40 +++
>  arch/powerpc/include/asm/kvm_host.h                |   8 +
>  arch/powerpc/include/uapi/asm/kvm.h                |   5 -
>  arch/powerpc/kvm/book3s_64_vio.c                   | 327 ++++++++++++++++++++-
>  arch/powerpc/kvm/book3s_64_vio_hv.c                | 142 +++++++++
>  arch/powerpc/kvm/powerpc.c                         |   1 +
>  include/linux/kvm_host.h                           |   1 +
>  virt/kvm/kvm_main.c                                |   5 +
>  8 files changed, 517 insertions(+), 12 deletions(-)
>  create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> 
> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> new file mode 100644
> index 0000000..b911945
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
> @@ -0,0 +1,40 @@
> +SPAPR TCE IOMMU device
> +
> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
> +Architectures: powerpc
> +
> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
> +
> +Groups:
> +  KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
> +  Attributes: one VFIO IOMMU fd per LIOBN, indexed by LIOBN
> +
> +This is completely made up device which provides API to link
> +logical bus number (LIOBN) and IOMMU group. The user space has
> +to create a new SPAPR TCE IOMMU device once per KVM session
> +and use "set_attr" to add or remove a logical bus.
> +
> +LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
> +(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
> +IOMMU group is a minimal isolated device set which can be passed to
> +the user space via VFIO.
> +
> +The userspace adds the new LIOBN-IOMMU link by calling KVM_SET_DEVICE_ATTR
> +with the attribute initialized as shown below:
> +struct kvm_device_attr attr = {
> +     .flags = 0,
> +     .group = KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE,
> +     .attr = liobn,
> +     .addr = (uint64_t)(uintptr_t)&group_fd,
> +};
> +
> +To remove the link, the userspace calls KVM_SET_DEVICE_ATTR with
> +the group_fd equal to zero.
> +
Zero is a valid file descriptor. Let's make it -1.
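
To illustrate why zero cannot serve as the "remove" sentinel, here is a
small userspace sketch (not part of the patch; the function name is made
up for the example). POSIX open() returns the lowest free descriptor, so
once fd 0 is closed the next open() hands out 0:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: close fd 0, open a file, and report which
 * descriptor number the kernel handed out. stdin is restored afterwards.
 */
static int lowest_fd_after_closing_stdin(void)
{
	int saved = dup(STDIN_FILENO);	/* may be -1 if stdin is already closed */
	int fd;

	close(STDIN_FILENO);
	fd = open("/dev/null", O_RDONLY);	/* lowest free fd, i.e. 0 */

	if (fd >= 0)
		close(fd);
	if (saved >= 0) {
		dup2(saved, STDIN_FILENO);	/* put the original stdin back */
		close(saved);
	}
	return fd;
}
```

A process that has closed stdin could therefore legitimately end up with
a VFIO group fd equal to 0, which the proposed API would treat as
"unlink".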

> +As the device opens VFIO group fds and holds the file pointer,
> +it does not need to keep an fd internally and therefore KVM_GET_DEVICE_ATTR
> +is not supported.
> +
> +When KVM exits, all links are destroyed automatically.
> +
> +The kernel advertises this feature via KVM_CAP_SPAPR_TCE_IOMMU capability.
Why is KVM_CAP_SPAPR_TCE_IOMMU needed? Supported device types are
discoverable without a capability.
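
For reference, a minimal sketch of how userspace can probe for a device
type through the existing device API: KVM_CREATE_DEVICE with the
KVM_CREATE_DEVICE_TEST flag checks for support without creating
anything. The helper name is made up, and the type is passed as a
parameter because KVM_DEV_TYPE_SPAPR_TCE_IOMMU only exists in this
series:

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: returns 1 if the VM behind vm_fd supports the
 * given KVM device type, 0 otherwise (including on any ioctl error).
 */
static int kvm_device_type_supported(int vm_fd, unsigned int type)
{
	struct kvm_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;
	cd.flags = KVM_CREATE_DEVICE_TEST;	/* probe only, no device is created */

	return ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) == 0;
}
```

Userspace would call it with the VM fd and the new device type, so a
separate capability looks redundant.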

> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index a23f132..a2a481e 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -181,9 +181,15 @@ struct kvmppc_spapr_tce_table {
>       struct kvm *kvm;
>       u64 liobn;
>       u32 window_size;
> +     struct iommu_group *grp;                /* used for IOMMU groups */
> +     struct vfio_group *vfio_grp;            /* used for IOMMU groups */
>       struct page *pages[0];
>  };
>  
> +struct kvmppc_spapr_tce_iommu_device {
> +     struct list_head tables;
> +};
> +
>  struct kvmppc_linear_info {
>       void            *base_virt;
>       unsigned long    base_pfn;
> @@ -264,6 +270,7 @@ struct kvm_arch {
>  #endif /* CONFIG_KVM_BOOK3S_64_HV */
>  #ifdef CONFIG_PPC_BOOK3S_64
>       struct list_head spapr_tce_tables;
> +     struct kvmppc_spapr_tce_iommu_device *tcedev;
>       struct list_head rtas_tokens;
>  #endif
>  #ifdef CONFIG_KVM_MPIC
> @@ -612,6 +619,7 @@ struct kvm_vcpu_arch {
>       u64 busy_preempt;
>  
>       unsigned long *tce_tmp_hpas;    /* TCE cache for TCE_PUT_INDIRECT */
> +     unsigned long tce_tmp_num;      /* Number of handled TCEs in cache */
>       enum {
>               TCERM_NONE,
>               TCERM_GETPAGE,
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index c1ae1e5..a9d3d73 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -512,11 +512,6 @@ struct kvm_get_htab_header {
>  #define  KVM_XICS_PENDING            (1ULL << 42)
>  
>  /* SPAPR TCE IOMMU device specification */
> -struct kvm_create_spapr_tce_iommu_linkage {
> -     __u64 liobn;
> -     __u32 fd;
> -     __u32 flags;
> -};
>  #define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE 0
>  
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index 2880d2b..0978794 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -29,6 +29,8 @@
>  #include <linux/anon_inodes.h>
>  #include <linux/module.h>
>  #include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/file.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/kvm_ppc.h>
> @@ -158,10 +160,8 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>       int i;
>  
>       /* Check this LIOBN hasn't been previously allocated */
> -     list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
> -             if (stt->liobn == args->liobn)
> -                     return -EBUSY;
> -     }
> +     if (kvmppc_find_tce_table(kvm, args->liobn))
> +             return -EBUSY;
>  
>       npages = kvmppc_stt_npages(args->window_size);
>  
> @@ -201,9 +201,175 @@ fail:
>       return ret;
>  }
>  
> -/* Converts guest physical address to host virtual address */
> +static void kvmppc_spapr_tce_iommu_table_destroy(
> +             struct kvm_device *dev,
> +             struct kvmppc_spapr_tce_table *tt)
> +{
> +     struct kvm *kvm = dev->kvm;
> +
> +     mutex_lock(&kvm->lock);
> +     list_del(&tt->list);
> +
> +     if (tt->vfio_grp)
> +             kvmppc_vfio_group_put_external_user(tt->vfio_grp);
> +     iommu_group_put(tt->grp);
> +
> +     kfree(tt);
> +     mutex_unlock(&kvm->lock);
> +}
> +
> +static int kvmppc_spapr_tce_iommu_create(struct kvm_device *dev, u32 type)
> +{
> +     struct kvmppc_spapr_tce_iommu_device *tcedev;
> +     int ret = 0;
> +
> +     tcedev = kzalloc(sizeof(*tcedev), GFP_KERNEL);
> +     if (!tcedev)
> +             return -ENOMEM;
> +     dev->private = tcedev;
> +
> +     INIT_LIST_HEAD(&tcedev->tables);
> +
> +     /* Already there ? */
> +     mutex_lock(&dev->kvm->lock);
> +     if (dev->kvm->arch.tcedev)
> +             ret = -EEXIST;
> +     else
> +             dev->kvm->arch.tcedev = tcedev;
> +     mutex_unlock(&dev->kvm->lock);
> +
> +     if (ret)
tcedev needs to be freed here; otherwise it leaks when -EEXIST is returned.

> +             return ret;
> +
> +     return 0;
> +}
> +
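
Something along these lines is what I have in mind for the error path.
This is a userspace model only, not kernel code: the struct and function
names are stand-ins, calloc()/free() replace kzalloc()/kfree(), and
kvm->lock is left out:

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-ins for kvmppc_spapr_tce_iommu_device, kvm->arch and kvm_device. */
struct tce_dev { int unused; };
struct vm { struct tce_dev *tcedev; };
struct device { struct vm *vm; void *private; };

static int tce_dev_create(struct device *dev)
{
	struct tce_dev *tcedev = calloc(1, sizeof(*tcedev));

	if (!tcedev)
		return -ENOMEM;

	/* In the kernel, kvm->lock would be held around this check. */
	if (dev->vm->tcedev) {
		free(tcedev);		/* do not leak the allocation */
		return -EEXIST;
	}

	dev->vm->tcedev = tcedev;
	dev->private = tcedev;		/* publish only on success */
	return 0;
}
```

Setting dev->private only after the check also avoids leaving a pointer
to freed memory behind on the failure path.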

--
                        Gleb.