Re: [PATCH] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices

Jan Kiszka Mon, 09 Jan 2012 13:25:58 -0800

On 2012-01-09 20:45, Alex Williamson wrote:
> On Mon, 2012-01-09 at 15:03 +0100, Jan Kiszka wrote:
>> PCI 2.3 allows to generically disable IRQ sources at device level. This
>> enables us to share legacy IRQs of such devices with other host devices
>> when passing them to a guest.
>>
>> The new IRQ sharing feature introduced here is optional, user space has
>> to request it explicitly. Moreover, user space can inform us about its
>> view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
>> interrupt and signaling it if the guest masked it via the virtualized
>> PCI config space.
>>
>> Signed-off-by: Jan Kiszka <[email protected]>
>> ---
>>
>> This applies to kvm/master after merging
>>
>>   PCI: Rework config space blocking services
>>   PCI: Introduce INTx check & mask API
>>
>> from current linux-next/master. I suppose those two will make it into
>> 3.3.
>>
>> To recall the history of it: I tried hard to implement an adaptive
>> solution that automatically picks the fastest masking technique whenever
>> possible. However, the changes required to the IRQ core subsystem and
>> the logic of the device assignment code became so complex and partly
>> ugly that I gave up on this. It's simply not worth the pain given that
>> legacy PCI interrupts are rarely raised for performance critical device
>> at such a high rate (KHz...) that you can measure the difference.
>>
>>  Documentation/virtual/kvm/api.txt |   27 +++++
>>  arch/x86/kvm/x86.c                |    1 +
>>  include/linux/kvm.h               |    6 +
>>  include/linux/kvm_host.h          |    2 +
>>  virt/kvm/assigned-dev.c           |  208 
>> +++++++++++++++++++++++++++++++-----
>>  5 files changed, 215 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/api.txt 
>> b/Documentation/virtual/kvm/api.txt
>> index e1d94bf..670015a 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -1159,6 +1159,14 @@ following flags are specified:
>>  
>>  /* Depends on KVM_CAP_IOMMU */
>>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
>> +/* The following two depend on KVM_CAP_PCI_2_3 */
>> +#define KVM_DEV_ASSIGN_PCI_2_3              (1 << 1)
>> +#define KVM_DEV_ASSIGN_MASK_INTX    (1 << 2)
>> +
>> +If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx 
>> interrupts
>> +via the PCI-2.3-compliant device-level mask, thus enable IRQ sharing with 
>> other
>> +assigned devices or host devices. KVM_DEV_ASSIGN_MASK_INTX specifies the
>> +guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
>>  
>>  The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
>>  isolation of the device.  Usages not specifying this flag are deprecated.
>> @@ -1399,6 +1407,25 @@ The following flags are defined:
>>  If datamatch flag is set, the event will be signaled only if the written 
>> value
>>  to the registered address is equal to datamatch in struct kvm_ioeventfd.
>>  
>> +4.59 KVM_ASSIGN_SET_INTX_MASK
>> +
>> +Capability: KVM_CAP_PCI_2_3
>> +Architectures: x86
>> +Type: vm ioctl
>> +Parameters: struct kvm_assigned_pci_dev (in)
>> +Returns: 0 on success, -1 on error
>> +
>> +Informs the kernel about the guest's view on the INTx mask. As long as the
>> +guest masks the legacy INTx, the kernel will refrain from unmasking it at
>> +hardware level and will not assert the guest's IRQ line. User space is still
>> +responsible for applying this state to the assigned device's real config 
>> space.
>> +To avoid that the kernel overwrites the state user space wants to set,
>> +KVM_ASSIGN_SET_INTX_MASK has to be called prior to updating the config 
>> space.
>> +
>> +See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is 
>> specified
>> +by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
>> +evaluated.
>> +
>>  4.62 KVM_CREATE_SPAPR_TCE
>>  
>>  Capability: KVM_CAP_SPAPR_TCE
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 1171def..9381806 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -2057,6 +2057,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>>      case KVM_CAP_XSAVE:
>>      case KVM_CAP_ASYNC_PF:
>>      case KVM_CAP_GET_TSC_KHZ:
>> +    case KVM_CAP_PCI_2_3:
>>              r = 1;
>>              break;
>>      case KVM_CAP_COALESCED_MMIO:
>> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
>> index 68e67e5..da5f7b7 100644
>> --- a/include/linux/kvm.h
>> +++ b/include/linux/kvm.h
>> @@ -556,6 +556,7 @@ struct kvm_ppc_pvinfo {
>>  #define KVM_CAP_PPC_RMA     65
>>  #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */
>>  #define KVM_CAP_PPC_PAPR 68
>> +#define KVM_CAP_PCI_2_3 69
>>  #define KVM_CAP_S390_GMAP 71
>>  #define KVM_CAP_TSC_DEADLINE_TIMER 72
>>  
>> @@ -697,6 +698,9 @@ struct kvm_clock_data {
>>  /* Available with KVM_CAP_TSC_CONTROL */
>>  #define KVM_SET_TSC_KHZ           _IO(KVMIO,  0xa2)
>>  #define KVM_GET_TSC_KHZ           _IO(KVMIO,  0xa3)
>> +/* Available with KVM_CAP_PCI_2_3 */
>> +#define KVM_ASSIGN_SET_INTX_MASK  _IOW(KVMIO,  0xa4, \
>> +                                   struct kvm_assigned_pci_dev)
>>  
>>  /*
>>   * ioctls for vcpu fds
>> @@ -765,6 +769,8 @@ struct kvm_clock_data {
>>  #define KVM_ALLOCATE_RMA      _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
>>  
>>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
>> +#define KVM_DEV_ASSIGN_PCI_2_3              (1 << 1)
>> +#define KVM_DEV_ASSIGN_MASK_INTX    (1 << 2)
>>  
>>  struct kvm_assigned_pci_dev {
>>      __u32 assigned_dev_id;
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 900c763..07461bd 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -546,6 +546,7 @@ struct kvm_assigned_dev_kernel {
>>      unsigned int entries_nr;
>>      int host_irq;
>>      bool host_irq_disabled;
>> +    bool pci_2_3;
>>      struct msix_entry *host_msix_entries;
>>      int guest_irq;
>>      struct msix_entry *guest_msix_entries;
>> @@ -555,6 +556,7 @@ struct kvm_assigned_dev_kernel {
>>      struct pci_dev *dev;
>>      struct kvm *kvm;
>>      spinlock_t intx_lock;
>> +    struct mutex intx_mask_lock;
>>      char irq_name[32];
>>      struct pci_saved_state *pci_saved_state;
>>  };
>> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
>> index 758e3b3..b35aba9 100644
>> --- a/virt/kvm/assigned-dev.c
>> +++ b/virt/kvm/assigned-dev.c
>> @@ -57,22 +57,66 @@ static int find_index_from_host_irq(struct 
>> kvm_assigned_dev_kernel
>>      return index;
>>  }
>>  
>> -static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
>> +static irqreturn_t kvm_assigned_dev_intx(int irq, void *dev_id)
>>  {
>>      struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
>> +    int ret;
>>  
>> -    if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
>> -            spin_lock(&assigned_dev->intx_lock);
>> +    spin_lock(&assigned_dev->intx_lock);
>> +    if (pci_check_and_mask_intx(assigned_dev->dev)) {
>> +            assigned_dev->host_irq_disabled = true;
>> +            ret = IRQ_WAKE_THREAD;
>> +    } else
>> +            ret = IRQ_NONE;
>> +    spin_unlock(&assigned_dev->intx_lock);
>> +
>> +    return ret;
>> +}
>> +
>> +static void
>> +kvm_assigned_dev_raise_guest_irq(struct kvm_assigned_dev_kernel 
>> *assigned_dev,
>> +                             int vector)
>> +{
>> +    if (unlikely(assigned_dev->irq_requested_type &
>> +                 KVM_DEV_IRQ_GUEST_INTX)) {
>> +            mutex_lock(&assigned_dev->intx_mask_lock);
>> +            if (!(assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX))
>> +                    kvm_set_irq(assigned_dev->kvm,
>> +                                assigned_dev->irq_source_id, vector, 1);
>> +            mutex_unlock(&assigned_dev->intx_mask_lock);
>> +    } else
>> +            kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>> +                        vector, 1);
>> +}
> 
> I question whether the above function is really worthwhile...
> 
>> +static irqreturn_t kvm_assigned_dev_thread_intx(int irq, void *dev_id)
>> +{
>> +    struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
>> +
>> +    if (!(assigned_dev->flags & KVM_DEV_ASSIGN_PCI_2_3)) {
>> +            spin_lock_irq(&assigned_dev->intx_lock);
>>              disable_irq_nosync(irq);
>>              assigned_dev->host_irq_disabled = true;
>> -            spin_unlock(&assigned_dev->intx_lock);
>> +            spin_unlock_irq(&assigned_dev->intx_lock);
>>      }
>>  
>> -    kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>> -                assigned_dev->guest_irq, 1);
>> +    kvm_assigned_dev_raise_guest_irq(assigned_dev,
>> +                                     assigned_dev->guest_irq);
> 
> This is the only user of the intx branch and we could avoid the
> irq_requested_type check from this call path.  The MSI paths can call
> kvm_set_irq directly.  A nice code consolidation, but probably trumped
> by a slightly shorter code path.


I see. Possibly, this code became that strange during the various
refactorings. Will have a look.

> 
> BTW, I already updated vfio to the INTx check and mask API, it's a great
> cleanup.

Cool!

> 
>> +
>> +    return IRQ_HANDLED;
>> +}
>> +
>> +#ifdef __KVM_HAVE_MSI
>> +static irqreturn_t kvm_assigned_dev_thread_msi(int irq, void *dev_id)
>> +{
>> +    struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
>> +
>> +    kvm_assigned_dev_raise_guest_irq(assigned_dev,
>> +                                     assigned_dev->guest_irq);
>>  
>>      return IRQ_HANDLED;
>>  }
>> +#endif
>>  
>>  #ifdef __KVM_HAVE_MSIX
>>  static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
>> @@ -83,8 +127,7 @@ static irqreturn_t kvm_assigned_dev_thread_msix(int irq, 
>> void *dev_id)
>>  
>>      if (index >= 0) {
>>              vector = assigned_dev->guest_msix_entries[index].vector;
>> -            kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>> -                        vector, 1);
>> +            kvm_assigned_dev_raise_guest_irq(assigned_dev, vector);
>>      }
>>  
>>      return IRQ_HANDLED;
>> @@ -100,15 +143,31 @@ static void kvm_assigned_dev_ack_irq(struct 
>> kvm_irq_ack_notifier *kian)
>>  
>>      kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);
>>  
>> -    /* The guest irq may be shared so this ack may be
>> -     * from another device.
>> -     */
>> -    spin_lock(&dev->intx_lock);
>> -    if (dev->host_irq_disabled) {
>> -            enable_irq(dev->host_irq);
>> -            dev->host_irq_disabled = false;
>> +    mutex_lock(&dev->intx_mask_lock);
>> +
>> +    if (!(dev->flags & KVM_DEV_ASSIGN_MASK_INTX)) {
>> +            bool reassert = false;
>> +
>> +            spin_lock_irq(&dev->intx_lock);
>> +            /*
>> +             * The guest IRQ may be shared so this ack can come from an
>> +             * IRQ for another guest device.
>> +             */
>> +            if (dev->host_irq_disabled) {
>> +                    if (!(dev->flags & KVM_DEV_ASSIGN_PCI_2_3))
>> +                            enable_irq(dev->host_irq);
>> +                    else if (!pci_check_and_unmask_intx(dev->dev))
>> +                            reassert = true;
>> +                    dev->host_irq_disabled = reassert;
>> +            }
>> +            spin_unlock_irq(&dev->intx_lock);
>> +
>> +            if (reassert)
>> +                    kvm_set_irq(dev->kvm, dev->irq_source_id,
>> +                                dev->guest_irq, 1);
>>      }
>> -    spin_unlock(&dev->intx_lock);
>> +
>> +    mutex_unlock(&dev->intx_mask_lock);
>>  }
>>  
>>  static void deassign_guest_irq(struct kvm *kvm,
>> @@ -156,7 +215,13 @@ static void deassign_host_irq(struct kvm *kvm,
>>              pci_disable_msix(assigned_dev->dev);
>>      } else {
>>              /* Deal with MSI and INTx */
>> -            disable_irq(assigned_dev->host_irq);
>> +            if (assigned_dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
>> +                    spin_lock_irq(&assigned_dev->intx_lock);
>> +                    pci_intx(assigned_dev->dev, false);
>> +                    spin_unlock_irq(&assigned_dev->intx_lock);
>> +                    synchronize_irq(assigned_dev->host_irq);
>> +            } else
>> +                    disable_irq(assigned_dev->host_irq);
>>  
>>              free_irq(assigned_dev->host_irq, assigned_dev);
>>  
>> @@ -237,15 +302,34 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
>>  static int assigned_device_enable_host_intx(struct kvm *kvm,
>>                                          struct kvm_assigned_dev_kernel *dev)
>>  {
>> +    irq_handler_t irq_handler;
>> +    unsigned long flags;
>> +
>>      dev->host_irq = dev->dev->irq;
>> -    /* Even though this is PCI, we don't want to use shared
>> -     * interrupts. Sharing host devices with guest-assigned devices
>> -     * on the same interrupt line is not a happy situation: there
>> -     * are going to be long delays in accepting, acking, etc.
>> +
>> +    /*
>> +     * We can only share the IRQ line with other host devices if we are
>> +     * able to disable the IRQ source at device-level - independently of
>> +     * the guest driver. Otherwise host devices may suffer from unbounded
>> +     * IRQ latencies when the guest keeps the line asserted.
>>       */
>> -    if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
>> -                             IRQF_ONESHOT, dev->irq_name, dev))
>> +    if (dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
>> +            irq_handler = kvm_assigned_dev_intx;
>> +            flags = IRQF_SHARED;
>> +    } else {
>> +            irq_handler = NULL;
>> +            flags = IRQF_ONESHOT;
>> +    }
>> +    if (request_threaded_irq(dev->host_irq, irq_handler,
>> +                             kvm_assigned_dev_thread_intx, flags,
>> +                             dev->irq_name, dev))
>>              return -EIO;
>> +
>> +    if (dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
>> +            spin_lock_irq(&dev->intx_lock);
>> +            pci_intx(dev->dev, true);
>> +            spin_unlock_irq(&dev->intx_lock);
>> +    }
>>      return 0;
>>  }
>>  
>> @@ -262,8 +346,9 @@ static int assigned_device_enable_host_msi(struct kvm 
>> *kvm,
>>      }
>>  
>>      dev->host_irq = dev->dev->irq;
>> -    if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
>> -                             0, dev->irq_name, dev)) {
>> +    if (request_threaded_irq(dev->host_irq, NULL,
>> +                             kvm_assigned_dev_thread_msi, 0,
>> +                             dev->irq_name, dev)) {
>>              pci_disable_msi(dev->dev);
>>              return -EIO;
>>      }
>> @@ -321,7 +406,6 @@ static int assigned_device_enable_guest_msi(struct kvm 
>> *kvm,
>>  {
>>      dev->guest_irq = irq->guest_irq;
>>      dev->ack_notifier.gsi = -1;
>> -    dev->host_irq_disabled = false;
>>      return 0;
>>  }
>>  #endif
>> @@ -333,7 +417,6 @@ static int assigned_device_enable_guest_msix(struct kvm 
>> *kvm,
>>  {
>>      dev->guest_irq = irq->guest_irq;
>>      dev->ack_notifier.gsi = -1;
>> -    dev->host_irq_disabled = false;
>>      return 0;
>>  }
>>  #endif
>> @@ -367,6 +450,7 @@ static int assign_host_irq(struct kvm *kvm,
>>      default:
>>              r = -EINVAL;
>>      }
>> +    dev->host_irq_disabled = false;
>>  
>>      if (!r)
>>              dev->irq_requested_type |= host_irq_type;
>> @@ -468,6 +552,7 @@ static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
>>  {
>>      int r = -ENODEV;
>>      struct kvm_assigned_dev_kernel *match;
>> +    unsigned long irq_type;
>>  
>>      mutex_lock(&kvm->lock);
>>  
>> @@ -476,7 +561,9 @@ static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
>>      if (!match)
>>              goto out;
>>  
>> -    r = kvm_deassign_irq(kvm, match, assigned_irq->flags);
>> +    irq_type = assigned_irq->flags & (KVM_DEV_IRQ_HOST_MASK |
>> +                                      KVM_DEV_IRQ_GUEST_MASK);
>> +    r = kvm_deassign_irq(kvm, match, irq_type);
>>  out:
>>      mutex_unlock(&kvm->lock);
>>      return r;
>> @@ -609,6 +696,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>>      if (!match->pci_saved_state)
>>              printk(KERN_DEBUG "%s: Couldn't store %s saved state\n",
>>                     __func__, dev_name(&dev->dev));
>> +
>> +    if (!pci_intx_mask_supported(dev))
>> +            assigned_dev->flags &= ~KVM_DEV_ASSIGN_PCI_2_3;
>> +
>>      match->assigned_dev_id = assigned_dev->assigned_dev_id;
>>      match->host_segnr = assigned_dev->segnr;
>>      match->host_busnr = assigned_dev->busnr;
>> @@ -616,6 +707,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>>      match->flags = assigned_dev->flags;
>>      match->dev = dev;
>>      spin_lock_init(&match->intx_lock);
>> +    mutex_init(&match->intx_mask_lock);
>>      match->irq_source_id = -1;
>>      match->kvm = kvm;
>>      match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
>> @@ -761,6 +853,56 @@ msix_entry_out:
>>  }
>>  #endif
>>  
>> +static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm,
>> +            struct kvm_assigned_pci_dev *assigned_dev)
>> +{
>> +    int r = 0;
>> +    struct kvm_assigned_dev_kernel *match;
>> +
>> +    mutex_lock(&kvm->lock);
>> +
>> +    match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
>> +                                  assigned_dev->assigned_dev_id);
>> +    if (!match) {
>> +            r = -ENODEV;
>> +            goto out;
>> +    }
>> +
>> +    mutex_lock(&match->intx_mask_lock);
>> +
>> +    match->flags &= ~KVM_DEV_ASSIGN_MASK_INTX;
>> +    match->flags |= assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX;
>> +
>> +    if (match->irq_requested_type & KVM_DEV_IRQ_GUEST_INTX) {
>> +            if (assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX) {
>> +                    kvm_set_irq(match->kvm, match->irq_source_id,
>> +                                match->guest_irq, 0);
>> +                    /*
>> +                     * Masking at hardware-level is performed on demand,
>> +                     * i.e. when an IRQ actually arrives at the host.
>> +                     */
> 
> Is there any harm in doing this synchronous to the ioctl?  We're on a
> slow path here anyway since the mask is likely drive by a config space
> write.

Not sure, maybe locking. What would be the advantage of doing it
synchronously?

Thanks for the review,
Jan

signature.asc
Description: OpenPGP digital signature

Re: [PATCH] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices

Reply via email to