On Tue, Nov 27, 2012 at 03:38:05AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-26:
> > On Mon, Nov 26, 2012 at 12:29:54PM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-11-26:
> >>> On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
> >>>> Gleb Natapov wrote on 2012-11-25:
> >>>>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
> >>>>>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
> >>>>>> without any vmexit.
> >>>>>> 
> >>>>>> - When delivering a interrupt to guest, if target vcpu is running,
> >>>>>>   update Posted-interrupt requests bitmap and send a notification
> >>>>>>   event to the vcpu. Then the vcpu will handle this interrupt
> >>>>>>   automatically, without any software involvemnt.
> >>>>> Looks like you allocating one irq vector per vcpu per pcpu and then
> >>>>> migrate it or reallocate when vcpu move from one pcpu to another.
> >>>>> This is not scalable and migrating irq migration slows things down.
> >>>>> What's wrong with allocating one global vector for posted interrupt
> >>>>> during vmx initialization and use it for all vcpus?
> >>>> 
> >>>> Consider the following situation:
> >>>> If vcpu A is running when notification event which belong to vcpu B is
> > arrived,
> >>> since the vector match the vcpu A's notification vector, then this event
> >>> will be consumed by vcpu A(even it do nothing) and the interrupt cannot
> >>> be handled in time. The exact same situation is possible with your code.
> >>> vcpu B can be migrated from pcpu and vcpu A will take its place and will
> >>> be assigned the same vector as vcpu B. But I fail to see why is this a
> >> No, the on bit will be set to prevent notification event when vcpu B start
> > migration. And it only free the vector before it going to run in another 
> > pcpu.
> > There is a race. Sender check on bit, vcpu B migrate to another pcpu and
> > starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
> > A gets it.
> Yes, it do exist. But I think it should be ok even this happens.
> 
Then it is OK to use global PI vector. The race should be dealt with
anyway.

> >> 
> >>> problem. vcpu A will ignore PI since pir will be empty and vcpu B should
> >>> detect new event during next vmentry.
> >> Yes, but the next vmentry may happen long time later and interrupt cannot 
> >> be
> > serviced until next vmentry. In current way, it will cause vmexit and 
> > re-schedule
> > the vcpu B.
> > Vmentry will happen when scheduler will decide that vcpu can run. There
> I don't know how scheduler can know the vcpu can run in this case, can you 
> elaborate it? 
> I thought it may have problems with global vector in some cases(maybe I am 
> wrong, since I am not familiar with KVM scheduler):
> If target VCPU is in idle, and this notification event is consumed by other 
> VCPU, then how can scheduler know the vcpu is ready to run? Even there is a 
> way for scheduler to know, then when? Isn't it too late?
> If notification event arrived in hypervisor, then how the handler know which 
> VCPU the notification event belong to?
When vcpu is idle its thread sleeps inside host kernel (see
virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
you should call kvm_vcpu_kick(), but only after changing vcpu
state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
checks if vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
there which checks apic IRR, but not PIR, so it is not enough to set
bit in PIR and call kvm_vcpu_kick() to wake up vcpu.

> 
> > is no problem here. What you probably want to say is that vcpu may not be
> > aware of interrupt happening since it was migrated to different vcpu
> > just after PI IPI was sent and thus missed it. But than PIR interrupts
> > should be processed during vmentry on another pcpu:
> > 
> > Sender:                             Guest:
> > 
> > set pir
> > set on
> > if (vcpu in guest mode on pcpu1)
> >                                    vmexit on pcpu1
> >                                    vmentry on pcpu2
> >                                    process pir, deliver interrupt
> >     send PI IPI to pcpu1
> 
> >> 
> >>>> 
> >>>>> 
> >>>>>> +      if (!cfg) {
> >>>>>> +              free_irq_at(irq, NULL);
> >>>>>> +              return 0;
> >>>>>> +      }
> >>>>>> +
> >>>>>> +      raw_spin_lock_irqsave(&vector_lock, flags);
> >>>>>> +      if (!__assign_irq_vector(irq, cfg, mask))
> >>>>>> +              ret = irq;
> >>>>>> +      raw_spin_unlock_irqrestore(&vector_lock, flags);
> >>>>>> +
> >>>>>> +      if (ret) {
> >>>>>> +              irq_set_chip_data(irq, cfg);
> >>>>>> +              irq_clear_status_flags(irq, IRQ_NOREQUEST);
> >>>>>> +      } else {
> >>>>>> +              free_irq_at(irq, cfg);
> >>>>>> +      }
> >>>>>> +      return ret;
> >>>>>> +}
> >>>>> 
> >>>>> This function is mostly cut&paste of create_irq_nr().
> >>>> 
> >>>> Yes, this function allow to allocate vector from specified cpu.
> >>>> 
> >>> Does not justify code duplication.
> >> ok. will change it in next version.
> >> 
> > Please use single global vector in the next version.
> > 
> >>>>>> 
> >>>>>>        if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
> >>>>>>                apic->vid_enabled = true;
> >>>>>> +
> >>>>>> +      if (kvm_x86_ops->has_posted_interrupt(vcpu))
> >>>>>> +              apic->pi_enabled = true;
> >>>>>> +
> >>>>> This is global state, no need per apic variable.
> >> Even all vcpus use the same setting, but according to SDM, apicv really is 
> >> a per
> > apic variable.
> > It is not per vapic in our implementation and this is what is
> > important here.
> > 
> >> Anyway, if you think we should not put it here, where is the best place?
> > It is not needed, just use has_posted_interrupt(vcpu) instead.
> ok
>  
> >> 
> >>>>>> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
> >>> *vcpu,
> >>>>> int cpu)
> >>>>>>                struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
> >>>>>>                unsigned long sysenter_esp;
> >>>>>> +              if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>>>> +                      pi_set_on(to_vmx(vcpu)->pi);
> >>>>>> +
> >>>>> Why?
> >>>> 
> >>>> Here means the vcpu start migration. So we should prevent the
> >>>> notification event until migration end.
> >>>> 
> >>> You check for IN_GUEST_MODE while sending notification. Why is this not
> >> For interrupt from emulated device, it enough. But VT-d device doesn't know
> > the vcpu is migrating, so set the on bit to prevent the notification event 
> > when
> > target vcpu is migrating.
> > Why should VT-d device care about that? It sets bits in pir and sends
> > IPI. If vcpu is running it process pir immediately, if not it will do it
> > during next vmentry.
> We already know the vcpu is not running(it will run soon), we can set this 
> bit to prevent the unnecessary IPI.
We have IN_GUEST_MODE for that. And this is the wrong place to indicate
that vcpu is not running anyway. vcpu is not running immediately after
vmexit.

> 
> >> 
> >>> enough? Also why vmx_vcpu_load() call means that vcpu start migration?
> >> I think the follow check can ensure the vcpu is in migration, am I wrong?
> >> if (vmx->loaded_vmcs->cpu != cpu)
> > This code checks that this vcpu ran on that pcpu last time.
> Yes, migration starts more earlier than here. But I think it should be ok to 
> set ON bit here. Do you have any better idea?
>
If you want to prevent assigned device from sending IPI to non running
vcpu you should set the bit immediately after vmexit. For emulated
devices vcpu->mode should be used.

> >> {
> >>    if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>    pi_set_on(to_vmx(vcpu)->pi);
> >> }
> >> 
> >>>>>> +              kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
> >>>>>> +
> >>>>>>                kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);      
> >>>>>> local_irq_disable();
> >>>>>>                list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, 
> >>>>>> @@ -1582,6
> >>>>>>  +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> >>>>>>        vcpu->cpu = -1;                 kvm_cpu_vmxoff();       }
> >>>>>> +      if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>>>> +              pi_set_on(to_vmx(vcpu)->pi);
> >>>>> Why?
> >>>> 
> >>>> When vcpu schedule out, no need to send notification event to it, just 
> >>>> set
> > the
> >>> PIR and wakeup it is enough.
> >>> Same as above. When vcpu is scheduled out it will no be in IN_GUEST_MODE
> >> Right.
> >> 
> >>> mode. Also in this case we probably should set bit directly in IRR and 
> >>> leave
> >>> PIR alone.
> >>> From the view of hypervisor, IRR and PIR are same. For each vmentry, if 
> >>> PI is
> > enabled, the IRR equal to (IRR | PIR). So there is no difference to set IRR 
> > or PIR if
> > target vcpu is not running.
> > But there is a difference for KVM code. For instance
> > kvm_arch_vcpu_runnable() checks for interrupts in IRR, but not PIR.
> > Migration code does the same.
> Right. With PI, we need check the PIR too.
>  
> >> 
> >>> 
> >>>> 
> >>>>>>  }
> >>>>>>  
> >>>>>>  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
> >>>>>> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
> >>>>> vmcs_config *vmcs_conf)
> >>>>>>        u32 _vmexit_control = 0;
> >>>>>>        u32 _vmentry_control = 0;
> >>>>>> -      min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
> >>>>>> -      opt = PIN_BASED_VIRTUAL_NMIS;
> >>>>>> -      if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> >>>>>> -                              &_pin_based_exec_control) < 0)
> >>>>>> -              return -EIO;
> >>>>>> -
> >>>>>>        min = CPU_BASED_HLT_EXITING |
> >>>>>>  #ifdef CONFIG_X86_64
> >>>>>>              CPU_BASED_CR8_LOAD_EXITING |
> >>>>>> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct
> >>>>> vmcs_config *vmcs_conf)
> >>>>>>                                &_vmexit_control) < 0)
> >>>>>>                return -EIO;
> >>>>>> +      min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
> >>>>>> +      opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
> >>>>>> +      if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> >>>>>> +                              &_pin_based_exec_control) < 0)
> >>>>>> +              return -EIO;
> >>>>>> +
> >>>>>> +      if (!(_cpu_based_2nd_exec_control &
> >>>>>> +              SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) ||
> >>>>>> +              !(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT))
> >>>>>> +              _pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;
> >>>>>> +
> >>>>>>        min = 0;        opt = VM_ENTRY_LOAD_IA32_PAT;   if
> >>>>>>  (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, @@ -2715,6
> >>>>>>  +2783,9 @@ static __init int hardware_setup(void)     if
> >>>>>>  (!cpu_has_vmx_virtual_intr_delivery())                
> >>>>>> enable_apicv_vid = 0;
> >>>>>> +      if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
> >>>>> In nested guest x2apic may be enabled without irq remapping. Check for
> >>>>> irq remapping here.
> >>>> 
> >>>> There are no posted interrupt available in nested case. We don't need
> >>>> to check IR here.
> >>>> 
> >>> One day emulation will be added. If pre-request for PI is IR check
> >>> for IR.
> >> 
> >> 
> >>> BTW why IR is needed for PI. To deliver assigned devices interrupts
> >>> directly into a guest sure, but why is it required for delivering
> >>> interrupts from emulated devices or IPIs?
> >> Posted Interrupt support is Xeon only and these platforms will have 
> >> x2APIC. So,
> > Linux will enable x2APIC on these platforms. So we only want to enable PI 
> > when
> > x2apic is enabled and IR is required for x2apic.
> > The fact that x2APIC is available on all platform that support PI is
> > irrelevant. If one is not strictly required by the other by architecture
> > do not couple them.
> Right. We only want to simply the implementation of enable "ack intr on 
> exit". If IR enabled, then don't need to check the trig mode(all interrupts 
> are edge) when using self IPI to regenerate the interrupt.
With Avi's suggestion self IPI is not needed. Drop this dependency if it
is not architectural.

>  
> >> 
> >>>>> 
> >>>>>> +              enable_apicv_pi = 0;
> >>>>>> +
> >>>>>>        if (nested)             nested_vmx_setup_ctls_msrs(); @@ 
> >>>>>> -3881,6 +3952,93
> >>>>>>  @@ static void ept_set_mmio_spte_mask(void)
> >>>>>>        kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull); }
> >>>>>> +irqreturn_t pi_handler(int irq, void *data)
> >>>>>> +{
> >>>>>> +      struct vcpu_vmx *vmx = data;
> >>>>>> +
> >>>>>> +      kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
> >>>>>> +      kvm_vcpu_kick(&vmx->vcpu);
> >>>>>> +
> >>>>>> +      return IRQ_HANDLED;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
> >>>>>> +{
> >>>>>> +      return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
> >>>>>> +{
> >>>>>> +      int ret = 0;
> >>>>>> +      struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>>>> +
> >>>>>> +      if (!enable_apicv_pi)
> >>>>>> +              return ;
> >>>>>> +
> >>>>>> +      preempt_disable();
> >>>>>> +      local_irq_disable();
> >>>>>> +      if (!vmx->irq) {
> >>>>>> +              ret = arch_pi_alloc_irq(vmx);
> >>>>>> +              if (ret < 0) {
> >>>>>> +                      vmx->irq = -1;
> >>>>>> +                      goto out;
> >>>>>> +              }
> >>>>>> +              vmx->irq = ret;
> >>>>>> +
> >>>>>> +              ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
> >>>>>> +                                      "Posted Interrupt", vmx);
> >>>>>> +              if (ret) {
> >>>>>> +                      vmx->irq = -1;
> >>>>>> +                      goto out;
> >>>>>> +              }
> >>>>>> +
> >>>>>> +              ret = arch_pi_get_vector(vmx->irq);
> >>>>>> +      } else
> >>>>>> +              ret = arch_pi_migrate(vmx->irq, smp_processor_id());
> >>>>>> +
> >>>>>> +      if (ret < 0) {
> >>>>>> +              vmx->irq = -1;
> >>>>>> +              goto out;
> >>>>>> +      } else {
> >>>>>> +              vmx->vector = ret;
> >>>>>> +              vmcs_write16(POSTED_INTR_NV, vmx->vector);
> >>>>>> +              pi_clear_on(vmx->pi);
> >>>>>> +      }
> >>>>>> +out:
> >>>>>> +      local_irq_enable();
> >>>>>> +      preempt_enable();
> >>>>>> +      return ;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
> >>>>>> +              int vector)
> >>>>>> +{
> >>>>>> +      struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>>>> +
> >>>>>> +      if (unlikely(vmx->irq == -1))
> >>>>>> +              return 0;
> >>>>>> +
> >>>>>> +      if (vcpu->cpu == smp_processor_id()) {
> >>>>>> +              pi_set_on(vmx->pi);
> >>>>> Why? You clear this bit anyway in vmx_update_irq() during guest entry.
> >>>> Here means the target vcpu already in vmx non-root mode. Then it will
> >>> consume the interrupt on next vm entry and we don't need to send the
> >>> notification event from other cpu, just update PIR is enough.
> >>> I understand why you avoid sending PI IPI here, but you do not update
> >>> pir in this case either. You only set "on" bit here and set vector 
> >>> directly
> >>> in IRR in __apic_accept_irq() since vmx_send_nv() returns 0 in this case.
> >>> Interrupt is delivered from IRR on the next entry.
> >> As I mentioned, IRR is basically same as PIR.
> >> 
> > That does not explain why are you setting "on" without setting bit in pir.
> Right. Just set the PIR and return 1 is enough.
> 
> >>>> 
> >>>>>> +              return 0; +     } + +   pi_set_pir(vector, vmx->pi); +  
> >>>>>> if
> >>>>>> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
> >>>>>> +              apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), 
> >>>>>> vmx->vector);
> >>>>>> +              return 1; +     } +     return 0; +} + +static void 
> >>>>>> free_pi(struct
> >>>>>> vcpu_vmx *vmx) +{ +    if (enable_apicv_pi) { +                
> >>>>>> kfree(vmx->pi);
> >>>>>> +              arch_pi_free_irq(vmx->irq, vmx); +      } +} +
> >>>>>>  /*
> >>>>>>   * Sets up the vmcs for emulated real mode.
> >>>>>>   */
> >>>>>> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> > *vmx)
> >>>>>>        unsigned long a;
> >>>>>>  #endif
> >>>>>>        int i;
> >>>>>> +      u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
> >>>>>> 
> >>>>>>        /* I/O */       vmcs_write64(IO_BITMAP_A, 
> >>>>>> __pa(vmx_io_bitmap_a)); @@
> >>>>>>  -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> >>>>>>  *vmx)         vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
> >>>>>>  
> >>>>>>        /* Control */
> >>>>>> -      vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> >>>>>> -              vmcs_config.pin_based_exec_ctrl); +     if 
> >>>>>> (!enable_apicv_pi)
> >>>>>> +              pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; +
> >>>>>> +      vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
> >>>>>> 
> >>>>>>        vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> >>>>> vmx_exec_control(vmx));
> >>>>>> 
> >>>>>> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> >>> *vmx)
> >>>>>>                vmcs_write16(GUEST_INTR_STATUS, 0);
> >>>>>>        }
> >>>>>> +      if (enable_apicv_pi) {
> >>>>>> +              vmx->pi = kmalloc(sizeof(struct pi_desc),
> >>>>>> +                              GFP_KERNEL | __GFP_ZERO);
> >>>>>> +              vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi)));
> >>>>>> +      }
> >>>>>> +
> >>>>>>        if (ple_gap) {          vmcs_write32(PLE_GAP, ple_gap);
> >>>>>>                vmcs_write32(PLE_WINDOW, ple_window); @@ -6161,6 
> >>>>>> +6328,11 @@
> >>>>>>  static void vmx_update_irq(struct kvm_vcpu *vcpu)     if
> >>>>>>  (!enable_apicv_vid)           return ;
> >>>>>> +      if (enable_apicv_pi) {
> >>>>>> +              kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
> >>>>>> +              pi_clear_on(vmx->pi);
> >>>>> Why do you do that? Isn't VMX process posted interrupts on vmentry if
> >>>>> "on" bit is set?
> >>> Can you answer this question?
> >> No, vmentry do nothing for PI. Posted interrupt only happens when an
> > unmasked external interrupt arrived and the target vcpu is running. Beyond 
> > that,
> > cpu follow the old way.
> >> 
> > Now that totally contradicts what you wrote above! (unless I
> > misunderstood you, in which case clarify please)
> > 
> >   From the view of hypervisor, IRR and PIR are same. For each vmentry, if
> >   PI is enabled, the IRR equal to (IRR | PIR). So there is no difference
> >   to set IRR or PIR if target vcpu is not running.
> > --
> >                     Gleb.
> Sorry, maybe I mislead you. VMentry have nothing to do PI.
> What I mean" IRR and PIR are same" is because this patch will copy PIR to IRR 
> before each vmentry. So I think this two should be same in some levels. But 
> according your comments, it may wrong.
> 
OK. Thanks for clarification.

--
                        Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to