On 19/09/16 12:14, Alexander Graf wrote:
> We have 2 modes for dealing with interrupts in the ARM world. We can either
> handle them all using hardware acceleration through the vgic or we can emulate
> a gic in user space and only drive CPU IRQ pins from there.
> 
> Unfortunately, when driving IRQs from user space, we never tell user space
> about timer events that may result in interrupt line state changes, so we
> lose out on timer events if we run with user space gic emulation.
> 
> This patch fixes that by routing vtimer expiration events to user space.
> With this patch I can successfully run edk2 and Linux with user space gic
> emulation.
> 
> Signed-off-by: Alexander Graf <ag...@suse.de>
> 
> ---
> 
> v1 -> v2:
> 
>   - Add back curly brace that got lost
> 
> v2 -> v3:
> 
>   - Split into patch set
> 
> v3 -> v4:
> 
>   - Improve documentation
> ---
>  Documentation/virtual/kvm/api.txt |  30 ++++++++-
>  arch/arm/include/asm/kvm_host.h   |   3 +
>  arch/arm/kvm/arm.c                |  22 ++++---
>  arch/arm64/include/asm/kvm_host.h |   3 +
>  include/uapi/linux/kvm.h          |  14 +++++
>  virt/kvm/arm/arch_timer.c         | 125 
> +++++++++++++++++++++++++++-----------
>  6 files changed, 155 insertions(+), 42 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 23937e0..1c0bd86 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3202,9 +3202,14 @@ struct kvm_run {
>       /* in */
>       __u8 request_interrupt_window;
>  
> -Request that KVM_RUN return when it becomes possible to inject external
> +[x86] Request that KVM_RUN return when it becomes possible to inject external
>  interrupts into the guest.  Useful in conjunction with KVM_INTERRUPT.
>  
> +[arm*] Bits set to 1 in here mask IRQ lines that would otherwise potentially
> +trigger forever. These lines are available:
> +
> +    KVM_IRQWINDOW_VTIMER  -  Masks hw virtual timer irq while in guest
> +
>       __u8 padding1[7];
>  
>       /* out */
> @@ -3519,6 +3524,18 @@ Hyper-V SynIC state change. Notification is used to 
> remap SynIC
>  event/message pages and to enable/disable SynIC messages/events processing
>  in userspace.
>  
> +             /* KVM_EXIT_ARM_TIMER */
> +             struct {
> +                     __u8 timesource;
> +             } arm_timer;
> +
> +Indicates that a timer triggered that user space needs to handle and
> +potentially mask with vcpu->run->request_interrupt_window to allow the
> +guest to proceed. This only happens for timers that got enabled through
> +KVM_CAP_ARM_TIMER. The following time sources are available:
> +
> +    KVM_ARM_TIMER_VTIMER  - virtual cpu timer
> +
>               /* Fix the size of the union. */
>               char padding[256];
>       };
> @@ -3739,6 +3756,17 @@ Once this is done the KVM_REG_MIPS_VEC_* and 
> KVM_REG_MIPS_MSA_* registers can be
>  accessed, and the Config5.MSAEn bit is accessible via the KVM API and also 
> from
>  the guest.
>  
> +6.11 KVM_CAP_ARM_TIMER
> +
> +Architectures: arm, arm64
> +Target: vcpu
> +Parameters: args[0] contains a bitmap of timers to select (see 5.)
> +
> +This capability allows to route per-core timers into user space. When it's
> +enabled and no in-kernel interrupt controller is in use, the timers selected
> +by args[0] trigger KVM_EXIT_ARM_TIMER guest exits when they are pending,
> +unless masked by vcpu->run->request_interrupt_window (see 5.).
> +
>  7. Capabilities that can be enabled on VMs
>  ------------------------------------------
>  
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index de338d9..77d1f73 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -180,6 +180,9 @@ struct kvm_vcpu_arch {
>  
>       /* Detect first run of a vcpu */
>       bool has_run_once;
> +
> +     /* User space wants timer notifications */
> +     bool user_space_arm_timers;

Please move this to the timer structure.

>  };
>  
>  struct kvm_vm_stat {
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index c84b6ad..57bdb71 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -187,6 +187,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
> ext)
>       case KVM_CAP_ARM_PSCI_0_2:
>       case KVM_CAP_READONLY_MEM:
>       case KVM_CAP_MP_STATE:
> +     case KVM_CAP_ARM_TIMER:
>               r = 1;
>               break;
>       case KVM_CAP_COALESCED_MMIO:
> @@ -474,13 +475,7 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
>                       return ret;
>       }
>  
> -     /*
> -      * Enable the arch timers only if we have an in-kernel VGIC
> -      * and it has been properly initialized, since we cannot handle
> -      * interrupts from the virtual timer with a userspace gic.
> -      */
> -     if (irqchip_in_kernel(kvm) && vgic_initialized(kvm))
> -             ret = kvm_timer_enable(vcpu);
> +     ret = kvm_timer_enable(vcpu);
>  
>       return ret;
>  }
> @@ -601,6 +596,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, 
> struct kvm_run *run)
>                       run->exit_reason = KVM_EXIT_INTR;
>               }
>  
> +             if (kvm_check_request(KVM_REQ_PENDING_TIMER, vcpu)) {

Since this is a very unlikely event (in the grand scheme of things), how
about making this unlikely()?

> +                     /* Tell user space about the pending vtimer */
> +                     ret = 0;
> +                     run->exit_reason = KVM_EXIT_ARM_TIMER;
> +                     run->arm_timer.timesource = KVM_ARM_TIMER_VTIMER;
> +             }

More importantly: why does it have to be indirected by a
make_request/check_request, and not be handled as part of the
kvm_timer_sync() call? We do update the state there, and you could
directly find out whether an exit is required.

> +
>               if (ret <= 0 || need_new_vmid_gen(vcpu->kvm) ||
>                       vcpu->arch.power_off || vcpu->arch.pause) {
>                       local_irq_enable();
> @@ -887,6 +889,12 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu 
> *vcpu,
>               return -EINVAL;
>  
>       switch (cap->cap) {
> +     case KVM_CAP_ARM_TIMER:
> +             r = 0;
> +             if (cap->args[0] != KVM_ARM_TIMER_VTIMER)
> +                     return -EINVAL;
> +             vcpu->arch.user_space_arm_timers = true;
> +             break;
>       default:
>               r = -EINVAL;
>               break;
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 3eda975..3d01481 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -270,6 +270,9 @@ struct kvm_vcpu_arch {
>  
>       /* Detect first run of a vcpu */
>       bool has_run_once;
> +
> +     /* User space wants timer notifications */
> +     bool user_space_arm_timers;
>  };
>  
>  #define vcpu_gp_regs(v)              (&(v)->arch.ctxt.gp_regs)
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 300ef25..00f4948 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -205,6 +205,7 @@ struct kvm_hyperv_exit {
>  #define KVM_EXIT_S390_STSI        25
>  #define KVM_EXIT_IOAPIC_EOI       26
>  #define KVM_EXIT_HYPERV           27
> +#define KVM_EXIT_ARM_TIMER        28
>  
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -361,6 +362,10 @@ struct kvm_run {
>               } eoi;
>               /* KVM_EXIT_HYPERV */
>               struct kvm_hyperv_exit hyperv;
> +             /* KVM_EXIT_ARM_TIMER */
> +             struct {
> +                     __u8 timesource;
> +             } arm_timer;
>               /* Fix the size of the union. */
>               char padding[256];
>       };
> @@ -870,6 +875,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_S390_USER_INSTR0 130
>  #define KVM_CAP_MSI_DEVID 131
>  #define KVM_CAP_PPC_HTM 132
> +#define KVM_CAP_ARM_TIMER 133
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1327,4 +1333,12 @@ struct kvm_assigned_msix_entry {
>  #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
>  #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)
>  
> +/* Available with KVM_CAP_ARM_TIMER */
> +
> +/* Bits for run->request_interrupt_window */
> +#define KVM_IRQWINDOW_VTIMER         (1 << 0)
> +
> +/* Bits for run->arm_timer.timesource */
> +#define KVM_ARM_TIMER_VTIMER         (1 << 0)
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> index 4309b60..cbbb50dd 100644
> --- a/virt/kvm/arm/arch_timer.c
> +++ b/virt/kvm/arm/arch_timer.c
> @@ -170,16 +170,45 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, 
> bool new_level)
>  {
>       int ret;
>       struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
> +     struct kvm_run *run = vcpu->run;
>  
> -     BUG_ON(!vgic_initialized(vcpu->kvm));
> +     BUG_ON(irqchip_in_kernel(vcpu->kvm) && !vgic_initialized(vcpu->kvm));
>  
>       timer->active_cleared_last = false;
>       timer->irq.level = new_level;
> -     trace_kvm_timer_update_irq(vcpu->vcpu_id, timer->irq.irq,
> +     trace_kvm_timer_update_irq(vcpu->vcpu_id, host_vtimer_irq,
>                                  timer->irq.level);
> -     ret = kvm_vgic_inject_mapped_irq(vcpu->kvm, vcpu->vcpu_id,
> -                                      timer->irq.irq,
> -                                      timer->irq.level);
> +
> +     if (irqchip_in_kernel(vcpu->kvm) && vgic_initialized(vcpu->kvm)) {

Given how many times you repeat this idiom in this patch, you should
have a single condition that encapsulate it once and for all.

> +             /* Fire the timer in the VGIC */
> +
> +             ret = kvm_vgic_inject_mapped_irq(vcpu->kvm, vcpu->vcpu_id,
> +                                              timer->irq.irq,
> +                                              timer->irq.level);
> +     } else if (!vcpu->arch.user_space_arm_timers) {
> +             /* User space has not activated timer use */
> +             ret = 0;
> +     } else {
> +             /*
> +              * Set PENDING_TIMER so that user space can handle the event if
> +              *
> +              *   1) Level is high
> +              *   2) The vtimer is not suppressed by user space
> +              *   3) We are not in the timer trigger exit path
> +              */
> +             if (new_level &&
> +                 !(run->request_interrupt_window & KVM_IRQWINDOW_VTIMER) &&
> +                 (run->exit_reason != KVM_EXIT_ARM_TIMER)) {
> +                     /* KVM_REQ_PENDING_TIMER means vtimer triggered */
> +                     kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
> +             }
> +
> +             /* Force a new level high check on next entry */
> +             timer->irq.level = 0;

I think this could become a bit more simple if you follow the flow I
mentioned earlier involving kvm_timer_sync(). Also, I only see how you
flag the line as being high, but not how you lower it. Care to explain
that flow?

> +
> +             ret = 0;
> +     }
> +
>       WARN_ON(ret);
>  }
>  
> @@ -197,7 +226,8 @@ static int kvm_timer_update_state(struct kvm_vcpu *vcpu)
>        * because the guest would never see the interrupt.  Instead wait
>        * until we call this function from kvm_timer_flush_hwstate.
>        */
> -     if (!vgic_initialized(vcpu->kvm) || !timer->enabled)
> +     if ((irqchip_in_kernel(vcpu->kvm) && !vgic_initialized(vcpu->kvm)) ||
> +         !timer->enabled)
>               return -ENODEV;
>  
>       if (kvm_timer_should_fire(vcpu) != timer->irq.level)
> @@ -275,35 +305,57 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
>       * to ensure that hardware interrupts from the timer triggers a guest
>       * exit.
>       */
> -     phys_active = timer->irq.level ||
> -                     kvm_vgic_map_is_active(vcpu, timer->irq.irq);
> -
> -     /*
> -      * We want to avoid hitting the (re)distributor as much as
> -      * possible, as this is a potentially expensive MMIO access
> -      * (not to mention locks in the irq layer), and a solution for
> -      * this is to cache the "active" state in memory.
> -      *
> -      * Things to consider: we cannot cache an "active set" state,
> -      * because the HW can change this behind our back (it becomes
> -      * "clear" in the HW). We must then restrict the caching to
> -      * the "clear" state.
> -      *
> -      * The cache is invalidated on:
> -      * - vcpu put, indicating that the HW cannot be trusted to be
> -      *   in a sane state on the next vcpu load,
> -      * - any change in the interrupt state
> -      *
> -      * Usage conditions:
> -      * - cached value is "active clear"
> -      * - value to be programmed is "active clear"
> -      */
> -     if (timer->active_cleared_last && !phys_active)
> -             return;
> +     if (irqchip_in_kernel(vcpu->kvm) && vgic_initialized(vcpu->kvm)) {
> +             phys_active = timer->irq.level ||
> +                             kvm_vgic_map_is_active(vcpu, timer->irq.irq);
> +
> +             /*
> +              * We want to avoid hitting the (re)distributor as much as
> +              * possible, as this is a potentially expensive MMIO access
> +              * (not to mention locks in the irq layer), and a solution for
> +              * this is to cache the "active" state in memory.
> +              *
> +              * Things to consider: we cannot cache an "active set" state,
> +              * because the HW can change this behind our back (it becomes
> +              * "clear" in the HW). We must then restrict the caching to
> +              * the "clear" state.
> +              *
> +              * The cache is invalidated on:
> +              * - vcpu put, indicating that the HW cannot be trusted to be
> +              *   in a sane state on the next vcpu load,
> +              * - any change in the interrupt state
> +              *
> +              * Usage conditions:
> +              * - cached value is "active clear"
> +              * - value to be programmed is "active clear"
> +              */
> +             if (timer->active_cleared_last && !phys_active)
> +                     return;
> +
> +             ret = irq_set_irqchip_state(host_vtimer_irq,
> +                                         IRQCHIP_STATE_ACTIVE,
> +                                         phys_active);
> +     } else {
> +             /* User space tells us whether the timer is in active mode */
> +             phys_active = vcpu->run->request_interrupt_window &
> +                           KVM_IRQWINDOW_VTIMER;

No, this just says "mask the interrupt". It doesn't say anything about
the state of the timer. More importantly: you sample the shared page.
What guarantees that the information there is preserved? Is userspace
writing that bit each time the vcpu thread re-enters the kernel with
this interrupt being in flight?

> +
> +             /* However if the line is high, we exit anyway, so we want
> +              * to keep the IRQ masked */
> +             phys_active = phys_active || timer->irq.level;

Why would you force the interrupt to be masked as soon as the timer is
firing? If userspace hasn't masked it, I don't think you should paper
over it.

> +
> +             /*
> +              * So we can just explicitly mask or unmask the IRQ, gaining
> +              * more compatibility with oddball irq controllers.
> +              */
> +             if (phys_active)
> +                     disable_percpu_irq(host_vtimer_irq);
> +             else
> +                     enable_percpu_irq(host_vtimer_irq, 0);

Since you are now targeting random irqchips (as opposed to a GIC
specifically), what guarantees that the timer is a per-cpu IRQ?

> +
> +             ret = 0;
> +     }
>  
> -     ret = irq_set_irqchip_state(host_vtimer_irq,
> -                                 IRQCHIP_STATE_ACTIVE,
> -                                 phys_active);
>       WARN_ON(ret);
>  
>       timer->active_cleared_last = !phys_active;
> @@ -479,6 +531,10 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu)
>       if (timer->enabled)
>               return 0;
>  
> +     /* No need to route physical IRQs when we don't use the vgic */
> +     if (!irqchip_in_kernel(vcpu->kvm))
> +             goto no_vgic;
> +
>       /*
>        * Find the physical IRQ number corresponding to the host_vtimer_irq
>        */
> @@ -502,6 +558,7 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu)
>       if (ret)
>               return ret;
>  
> +no_vgic:
>  
>       /*
>        * There is a potential race here between VCPUs starting for the first
> 

Thanks,

        M.
-- 
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

Reply via email to