Re: [kvm-devel] KVM in-kernel APIC update

Avi Kivity Wed, 04 Apr 2007 00:41:36 -0700

Gregory Haskins wrote:
> My current thoughts are that we at least move the IOAPIC into the kernel as 
> well.  That will give sufficient control to generate ISA bus interrupts for 
> guests that understand APICs.  If we want to be able to generate ISA 
> interrupts for legacy guests which talk to the 8259s that will prove to be 
> insufficient.  The good news is that moving the 8259s down as well is 
> probably not a huge deal either, especially since I have already prepped the 
> usermode side.  Thoughts?
>


I would avoid moving down anything that's not strictly necessary.  If we 
want to keep the PIC in qemu, for example, just export the APIC-PIC 
interface to qemu.

I still don't have an opinion as to whether it is necessary; I'll need 
to study the details.  Xen pushes most of the platform into the 
hypervisor, but then Xen's communication path to qemu is much more 
expensive (involving the scheduler and a potential cpu switch) than 
kvm.  We'd need to balance possible performance improvements (I'd expect 
negligible) and interface simplification (possibly non-negligible) 
against further diverging from qemu.


> So heres a question for you guys out there.  What is the expected use of the 
> in-kernel APIC?  My interests lie in the ability to send IPIs for SMP, as 
> well as being able to inject asynchronous hypercall interrupts.  I assume 
> there are other reasons too, such as PV device interrupts, etc and I would 
> like to make sure I am seeing the big picture before making any bad design 
> decisions.  My question is, how do we expect the PV devices to look from a 
> bus perspective?  
>
> The current Bochs/QEMU system model paints a fairly simple ISA architecture 
> utilizing a single IOAPIC + dual 8259 setup.  Do we expect in-kernel injected 
> IRQs to follow the ISA model (e.g. either legacy or PCI interrupts only 
> limited to IRQ0-15) or do we want to expand on this?  The PCI hypercall 
> device introduced a while back would be an example of something ISA based.  
> Alternatives would be to utilize unused "pins" (such as IRQ16-23) on IOAPIC 
> #0, or introducing new an entirely new bus/IOAPICs just for KVM, etc.  
>   

There are two extreme models, which I think are both needed.  On one 
end, support for closed OSes (e.g. Windows) requires fairly strict 
conformance to the PCI model, which means going through the IOAPIC or 
PIC or however the interrupt lines are wired in qemu. This seems to 
indicate that an in-kernel IOAPIC is needed.  On the other end (Linux), 
a legacy-free and emulation-free device can just inject interrupts 
directly and use shared memory to ack interrupts and indicate their source.

> If the latter, we also need to decide what the resource conveyance model and 
> vector allocation policy should be.  For instance, do we publish said 
> resources formally in the MP/ACPI tables in Bochs?  Doing so would allow 
> MP/ACPI compliant OSs like linux to naturally route the IRQ.  Conversely, do 
> we do something more direct just like we do for KVM discovery via wrmsr?
>   

I think we can go the direct route for cooperative guests.

I also suggest doing the work in stages and measuring; that is, first 
push the local apic, determine what the remaining bottlenecks are and 
tackle them.  I'm pretty sure that Linux guests would only require the 
local apic but Windows (and older Linux kernels) might require more.

>  struct kvm_vcpu;
>  
> +struct kvm_irqinfo {
> +     int         vector;
> +     int         nmi;
> +};
> +
> +#define KVM_IRQFLAGS_NMI  (1 << 0)
> +#define KVM_IRQFLAGS_PEEK (1 << 1)
> +
> +struct kvm_irqdevice {
> +     int  (*pending)(struct kvm_irqdevice *this, int flags);
> +     int  (*read)(struct kvm_irqdevice *this, int flags, 
> +                  struct kvm_irqinfo *info);
>   

Aren't pending() and read() + PEEK overlapping?

> +     int  (*inject)(struct kvm_irqdevice *this, int irq, int flags);
> +     int  (*summary)(struct kvm_irqdevice *this, void *data);
> +     void (*destructor)(struct kvm_irqdevice *this);
> +
> +     void *private;
> +};
>   

Consider using container_of() to simulate C++ inheritance.  Messier but 
less indirections.   Also consider a kvm_irqdevice_operations structure.

> +
> +#define MAX_APIC_INT_VECTOR      256
> +
> +struct kvm_apic {
> +     u32                     status;
> +     u32                     vcpu_id;
> +     spinlock_t              lock;
> +     u32                     pcpu_lock_owner;
>   

Isn't this vcpu->cpu?

> +     atomic_t                timer_pending;
> +     u64                     apic_base_msr;
> +     unsigned long           base_address;
> +     u32                     timer_divide_count;
> +     struct hrtimer          apic_timer;
> +     int                     intr_pending_count[MAX_APIC_INT_VECTOR];
> +     ktime_t                 timer_last_update;
> +     struct {
> +             int deliver_mode;
> +             int source[6];
> +     } direct_intr;
> +     u32                     err_status;
> +     u32                     err_write_count;
> +     struct kvm_vcpu         *vcpu;
> +     struct page             *regs_page;
> +     void                    *regs;
> +     struct kvm_irqdevice    ext; /* Used for external/NMI interrupts */
> +};
> +
>  /*
>   * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
>   * 32-bit).  The kvm_mmu structure abstracts the details of the current mmu
> @@ -236,6 +281,11 @@ struct kvm_pio_request {
>       int rep;
>  };
>  
> +#define KVM_VCPU_INIT_SIPI_SIPI_STATE_NORM          0
> +#define KVM_VCPU_INIT_SIPI_SIPI_STATE_WAIT_SIPI     1
> +
> +#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
> +
>  struct kvm_vcpu {
>       struct kvm *kvm;
>       union {
> @@ -248,12 +298,9 @@ struct kvm_vcpu {
>       u64 host_tsc;
>       struct kvm_run *run;
>       int interrupt_window_open;
> -     unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */
> -#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
> -     unsigned long irq_pending[NR_IRQ_WORDS];
>       unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
>       unsigned long rip;      /* needs vcpu_load_rsp_rip() */
> -
> +     struct kvm_irqdevice irq_dev;
>       unsigned long cr0;
>       unsigned long cr2;
>       unsigned long cr3;
> @@ -261,10 +308,8 @@ struct kvm_vcpu {
>       struct page *para_state_page;
>       gpa_t hypercall_gpa;
>       unsigned long cr4;
> -     unsigned long cr8;
>   

I think we should keep cr8 here and accept the duplication.

>       u64 pdptrs[4]; /* pae */
>       u64 shadow_efer;
> -     u64 apic_base;
>       u64 ia32_misc_enable_msr;
>       int nmsrs;
>       struct vmx_msr_entry *guest_msrs;
> @@ -298,6 +343,11 @@ struct kvm_vcpu {
>       int sigset_active;
>       sigset_t sigset;
>  
> +     struct kvm_apic apic;
> +     wait_queue_head_t halt_wq;
> +     /* For AP startup */
> +     unsigned long init_sipi_sipi_state;
> +
>       struct {
>               int active;
>               u8 save_iopl;
> @@ -319,6 +369,23 @@ struct kvm_mem_alias {
>       gfn_t target_gfn;
>  };
>  
> +#define kvm_irq_pending(dev, flags)     (dev)->pending(dev, flags)
> +#define kvm_irq_read(dev, flags, info)  (dev)->read(dev, flags, info)
> +#define kvm_irq_inject(dev, irq, flags) (dev)->inject(dev, irq, flags)
> +#define kvm_irq_summary(dev, data)      (dev)->summary(dev, data)
>   

Static inlines please, no #defines.  In this case, I'd just call the 
functions directly.

> +
> +#define kvm_vcpu_irq_pending(vcpu, flags) \
> +   kvm_irq_pending(&vcpu->irq_dev, flags)
> +#define kvm_vcpu_irq_read(vcpu, flags, info) \
> +   kvm_irq_read(&vcpu->irq_dev, flags, info)
> +#define kvm_vcpu_irq_inject(vcpu, irq, flags) \
> +   kvm_irq_inject(&vcpu->irq_dev, irq, flags)
> +#define kvm_vcpu_irq_summary(vcpu, data)  \
> +   kvm_irq_summary(&vcpu->irq_dev, data)
> +
>   

Ditto.

>  
> +struct kvm_mmio_handler {
> +     unsigned long (*read)(struct kvm_vcpu *v,
> +                           unsigned long addr,
> +                           unsigned long length);
> +     void (*write)(struct kvm_vcpu *v,
> +                        unsigned long addr,
> +                        unsigned long length,
> +                        unsigned long val);
> +     int (*in_range)(struct kvm_vcpu *v, unsigned long addr);
> +};
>   

This would need to be split out into a separate patch.  Addresses are gpa_t.

> +
> +#define KVM_MMIO_HANDLER_NR_ARRAY_SIZE 1
> +static struct kvm_mmio_handler 
> *kvm_mmio_handlers[KVM_MMIO_HANDLER_NR_ARRAY_SIZE] =
> +{
> +    &apic_mmio_handler,
> +};
> +
>   

We'll need dynamic registration if we want kernel apic support to be 
optional.  Also, this needs to be per-vcpu as each vcpu's apic can be 
placed at a different address.


> @@ -1479,7 +1533,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
> u64 *pdata)
>               data = 3;
>               break;
>       case MSR_IA32_APICBASE:
> -             data = vcpu->apic_base;
> +             data = vcpu->apic.apic_base_msr;
> +             break;
> +     case MSR_IA32_TIME_STAMP_COUNTER:
> +             // FIXME
> +                //data = guest_read_tsc();
>               break;
>   

The tsc stuff looks spurious.

> +     case KVM_GET_OPTIONS: {
> +             r = -EFAULT;
> +             if (copy_to_user(&kvm->options, argp, sizeof(kvm->options)))
> +                 goto out;
> +             r = 0;
> +             break;
> +     }
> +     case KVM_SET_OPTIONS: {
> +             r = -EFAULT;
> +             if (copy_from_user(&kvm->options, argp, sizeof(kvm->options)))
> +                 goto out;
> +             r = 0;
> +             break;
> +     }
>   

This is difficult to expand over time.  I prefer a separate ioctl() for 
each option.

> +
> +struct bitarray {
> +    unsigned long summary; /* 1 per word in pending */
> +    unsigned long pending[NR_IRQ_WORDS];
> +};
> +
> +static int bitarray_pending(struct bitarray *this)
> +{
> +     return this->summary ? 1 : 0;   
> +}
> +
> +static int bitarray_findhighest(struct bitarray *this)
> +{
> +     if(!this->summary)
>   

Space after if, here and elsewhere.

> +             return 0;
> +
> +     int word_index = __ffs(this->summary);
> +     int bit_index = __ffs(this->pending[word_index]);
> +
> +     return word_index * BITS_PER_LONG + bit_index;  
> +}
> +
> +static void bitarray_set(struct bitarray *this, int nr)
> +{
> +     set_bit(nr, &this->pending);
> +     set_bit(nr / BITS_PER_LONG, &this->summary); 
>   

Since you need locking anyway, best to use the unlocked versions 
(__set_bit()).

> +typedef struct {
> +     struct bitarray irq_pending;
> +     struct bitarray nmi_pending;
>   

Why is nmi_pending a bitarray?

> @@ -108,20 +109,12 @@ static unsigned get_addr_size(struct kvm_vcpu *vcpu)
>  
>  static inline u8 pop_irq(struct kvm_vcpu *vcpu)
>  {
> -     int word_index = __ffs(vcpu->irq_summary);
> -     int bit_index = __ffs(vcpu->irq_pending[word_index]);
> -     int irq = word_index * BITS_PER_LONG + bit_index;
> -
> -     clear_bit(bit_index, &vcpu->irq_pending[word_index]);
> -     if (!vcpu->irq_pending[word_index])
> -             clear_bit(word_index, &vcpu->irq_summary);
> -     return irq;
> +     return kvm_vcpu_irq_read(vcpu, 0, NULL);
>  }
>  
>  static inline void push_irq(struct kvm_vcpu *vcpu, u8 irq)
>  {
> -     set_bit(irq, vcpu->irq_pending);
> -     set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
> +     kvm_vcpu_irq_inject(vcpu, irq, 0);
>  }
>   


It would be helpful to unify the vmx and svm irq code first (I can merge 
something like that immediately), so this patch becomes simpler.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel

Re: [kvm-devel] KVM in-kernel APIC update

Reply via email to