>>> On Wed, Apr 4, 2007 at 3:40 AM, in message <[EMAIL PROTECTED]>, Avi
Kivity <[EMAIL PROTECTED]> wrote:
>
> I would avoid moving down anything that's not strictly necessary.
Agreed.

>
> I still don't have an opinion as to whether it is necessary; I'll need
> to study the details.  Xen pushes most of the platform into the
> hypervisor, but then Xen's communication path to qemu is much more
> expensive (involving the scheduler and a potential cpu switch) than
> kvm.  We'd need to balance possible performance improvements (I'd expect
> negligible) and interface simplification (possibly non-negligible)
> against further diverging from qemu.

I am not going to advocate one way or the other here, but I will just lay
out how I saw this "full kernel emulation" happening so we are on the same
page.

What I found was that the entire QEMU interrupt system converges on
pic_send_irq(), regardless of whether it's a PCI or legacy ISA device.
This function also acts as a pin-forwarding mechanism, sending the IRQ
event to both the 8259s and IOAPIC #0.  The irq that is dispatched by
pic_send_irq() is still a raw "pin IRQ" reference, and must be translated
into an actual x86 vector by the 8259/IOAPIC (which is governed by how the
BIOS/OS programs them).

The act of raising the interrupt ends up synchronously calling
cpu_interrupt(), which currently seems to be a no-op for KVM.  Later, in
the pre_run stage, the system retrieves a pending vector (this is when the
irq/vector translation occurs) and injects it into the kernel with the
INTERRUPT ioctl.

What I was planning on doing was using the QEMU patch I provided to
intercept all pic_send_irq() calls and forward them directly to the kernel
via a new ioctl().  This ioctl would be directed at the VM fd, not the
VCPU, since it's a pure ISA global pin reference and won't know the
targeted vcpu until the 8259/IOAPIC perform their translation.

So that being said, I think the interface between userspace and kernel
would be no more complex than it is today.  I.e. we would just be passing
a single int via an ioctl.
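To illustrate the fan-out described above, here is a minimal userspace
sketch of the single convergence point that forwards a raw ISA pin event
to both interrupt controller models.  The sink functions are stand-ins
for the real 8259/IOAPIC models, and all names here are illustrative,
not actual QEMU/KVM identifiers:

```c
#include <assert.h>

#define ISA_PINS 16

static int pic_level[ISA_PINS];    /* last level seen by the 8259 pair */
static int ioapic_level[ISA_PINS]; /* last level seen by IOAPIC #0     */

/* Stand-ins for the controller models; the pin-to-vector translation
 * would happen inside these, driven by how the guest programmed them. */
static void pic_set_irq(int pin, int level)    { pic_level[pin] = level; }
static void ioapic_set_irq(int pin, int level) { ioapic_level[pin] = level; }

/* Analogue of the pic_send_irq() convergence point: the caller passes a
 * raw pin number, and the event is forwarded to both controllers. */
static void isa_irq_set(int pin, int level)
{
    pic_set_irq(pin, level);
    ioapic_set_irq(pin, level);
}
```

This is also why the proposed ioctl only needs a single int at the VM
level: the pin is global, and the target vcpu falls out of the
8259/IOAPIC translation later.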
The danger, as you have pointed out, is accepting the QEMU patch that I
submitted, which potentially diverges the pic code from QEMU upstream.
What this buys us, however, is that any code (kernel or userspace) would
be able to inject an ISA interrupt.  Using an ISA interrupt has the
advantage of already being represented in the ACPI/MP tables presented by
Bochs, and thus any compliant OS will automatically route the IRQ.

Where things do get potentially complicated with the interface is if we
go with a hybrid solution.  Leaving the LAPIC in kernel, but the
IOAPIC/8259 in userspace, requires a much wider interface so the IOAPIC
can deliver APIC-style messages to the kernel.  Also, IOAPIC EOI for
level-sensitive interrupts becomes more difficult and complex.

Putting the LAPIC + IOAPIC #0 in the kernel and leaving the 8259 outside
might actually work fairly well, but in-kernel ISA interrupts will only
work with OSs which enable the APIC.  This may or may not be an issue and
may be an acceptable tradeoff.

>
> There are two extreme models, which I think are both needed.  On one
> end, support for closed OSes (e.g. Windows) requires fairly strict
> conformance to the PCI model, which means going through the IOAPIC or
> PIC or however the interrupt lines are wired in qemu.  This seems to
> indicate that an in-kernel IOAPIC is needed.  On the other end (Linux),
> a legacy-free and emulation-free device can just inject interrupts
> directly and use shared memory to ack interrupts and indicate their
> source.

Well, sort of.  The problem as I see it in both cases is still IRQ
routing.  If you knew of a free vector to take (regardless of OS) you
could forgo the ACPI/MP declarations altogether, register the vector with
your "device" (e.g. via a hypercall, etc.), and have the device issue
direct LAPIC messages with the proper vector.  I think this would work
assuming the device used edge-triggered interrupts (which don't require
IOAPIC IRR registration or EOI).
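The "register a free vector with your device" idea above could look
something like the following sketch.  Everything here is hypothetical
(the allocator, the reserved range, the names); it only shows the
allocation side, with the device then injecting that vector directly as
an edge-triggered LAPIC message:

```c
#include <assert.h>

#define NR_VECTORS 256

/* One slot per x86 vector; nonzero means claimed. */
static unsigned char vector_used[NR_VECTORS];

/* Hand out a free vector above the legacy range.  In this sketch,
 * 0x00-0x2f is treated as reserved (exceptions plus the remapped 8259
 * IRQs); a real implementation would coordinate this with the guest,
 * e.g. via a hypercall. */
static int alloc_free_vector(void)
{
    for (int v = 0x30; v < NR_VECTORS; v++) {
        if (!vector_used[v]) {
            vector_used[v] = 1;
            return v;
        }
    }
    return -1; /* no free vector */
}
```

Since the delivery is edge-triggered, there is no IOAPIC IRR state or
EOI handshake to emulate; the LAPIC message with this vector is the
whole transaction.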
>
>> If the latter, we also need to decide what the resource conveyance
>> model and vector allocation policy should be.  For instance, do we
>> publish said resources formally in the MP/ACPI tables in Bochs?  Doing
>> so would allow MP/ACPI compliant OSs like linux to naturally route the
>> IRQ.  Conversely, do we do something more direct just like we do for
>> KVM discovery via wrmsr?
>
> I think we can go the direct route for cooperative guests.

As long as there is a way for the guest code to know about a free vector
to use, I think this should work just fine.

>
> I also suggest doing the work in stages and measuring; that is, first
> push the local apic, determine what the remaining bottlenecks are and
> tackle them.  I'm pretty sure that Linux guests would only require the
> local apic but Windows (and older Linux kernels) might require more.
>
>>  struct kvm_vcpu;
>>
>> +struct kvm_irqinfo {
>> +	int vector;
>> +	int nmi;
>> +};
>> +
>> +#define KVM_IRQFLAGS_NMI  (1 << 0)
>> +#define KVM_IRQFLAGS_PEEK (1 << 1)
>> +
>> +struct kvm_irqdevice {
>> +	int (*pending)(struct kvm_irqdevice *this, int flags);
>> +	int (*read)(struct kvm_irqdevice *this, int flags,
>> +		    struct kvm_irqinfo *info);
>
> Aren't pending() and read() + PEEK overlapping?

Excellent point.  I didn't introduce the PEEK/NMI support until after
this interface was already defined.  I will consolidate.

>
>> +	int (*inject)(struct kvm_irqdevice *this, int irq, int flags);
>> +	int (*summary)(struct kvm_irqdevice *this, void *data);
>> +	void (*destructor)(struct kvm_irqdevice *this);
>> +
>> +	void *private;
>> +};
>
> Consider using container_of() to simulate C++ inheritance.  Messier but
> less indirections.  Also consider a kvm_irqdevice_operations structure.

Ack.

>
>> +
>> +#define MAX_APIC_INT_VECTOR 256
>> +
>> +struct kvm_apic {
>> +	u32 status;
>> +	u32 vcpu_id;
>> +	spinlock_t lock;
>> +	u32 pcpu_lock_owner;
>
> Isn't this vcpu->cpu?
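For reference, the container_of() layout being suggested might look
roughly like this userspace sketch (the toy_pic device and all names
are illustrative, not part of the actual patch).  The concrete device
embeds the generic kvm_irqdevice and recovers itself with container_of(),
so the "void *private" pointer goes away, and the ops table is shared
across instances:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct kvm_irqdevice;

/* Shared operations table, per Avi's kvm_irqdevice_operations suggestion. */
struct kvm_irqdevice_ops {
    int (*pending)(struct kvm_irqdevice *dev, int flags);
};

struct kvm_irqdevice {
    const struct kvm_irqdevice_ops *ops;
};

/* A concrete device embeds the generic one instead of pointing at it. */
struct toy_pic {
    int raised;
    struct kvm_irqdevice dev;
};

static int toy_pic_pending(struct kvm_irqdevice *dev, int flags)
{
    /* Recover the enclosing object from the embedded member. */
    struct toy_pic *pic = container_of(dev, struct toy_pic, dev);
    (void)flags;
    return pic->raised;
}

static const struct kvm_irqdevice_ops toy_pic_ops = {
    .pending = toy_pic_pending,
};
```

The extra cast is the "messier" part, but each call goes through one
shared ops pointer instead of per-instance function pointers plus a
private-data indirection.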
With the exception of the kvm_irqdevice at the end, I inherited this
struct from Dor's branch, so I am not sure.  I will have to look into it
deeper.

>
>> +	atomic_t timer_pending;
>> +	u64 apic_base_msr;
>> +	unsigned long base_address;
>> +	u32 timer_divide_count;
>> +	struct hrtimer apic_timer;
>> +	int intr_pending_count[MAX_APIC_INT_VECTOR];
>> +	ktime_t timer_last_update;
>> +	struct {
>> +		int deliver_mode;
>> +		int source[6];
>> +	} direct_intr;
>> +	u32 err_status;
>> +	u32 err_write_count;
>> +	struct kvm_vcpu *vcpu;
>> +	struct page *regs_page;
>> +	void *regs;
>> +	struct kvm_irqdevice ext; /* Used for external/NMI interrupts */
>> +};
>> +
>>  /*
>>   * x86 supports 3 paging modes (4-level 64-bit, 3-level 64-bit, and 2-level
>>   * 32-bit).  The kvm_mmu structure abstracts the details of the current mmu
>> @@ -236,6 +281,11 @@ struct kvm_pio_request {
>>  	int rep;
>>  };
>>
>> +#define KVM_VCPU_INIT_SIPI_SIPI_STATE_NORM      0
>> +#define KVM_VCPU_INIT_SIPI_SIPI_STATE_WAIT_SIPI 1
>> +
>> +#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
>> +
>>  struct kvm_vcpu {
>>  	struct kvm *kvm;
>>  	union {
>> @@ -248,12 +298,9 @@ struct kvm_vcpu {
>>  	u64 host_tsc;
>>  	struct kvm_run *run;
>>  	int interrupt_window_open;
>> -	unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */
>> -#define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
>> -	unsigned long irq_pending[NR_IRQ_WORDS];
>>  	unsigned long regs[NR_VCPU_REGS]; /* for rsp: vcpu_load_rsp_rip() */
>>  	unsigned long rip; /* needs vcpu_load_rsp_rip() */
>> -
>> +	struct kvm_irqdevice irq_dev;
>>  	unsigned long cr0;
>>  	unsigned long cr2;
>>  	unsigned long cr3;
>> @@ -261,10 +308,8 @@ struct kvm_vcpu {
>>  	struct page *para_state_page;
>>  	gpa_t hypercall_gpa;
>>  	unsigned long cr4;
>> -	unsigned long cr8;
>
> I think we should keep cr8 here and accept the duplication.

Agreed.
The original branch had deleted this, but I really need to leave it in if
we expect to be able to support the ABI-compatible mode of
"apic = disabled".  I will restore these.

>
>>  	u64 pdptrs[4]; /* pae */
>>  	u64 shadow_efer;
>> -	u64 apic_base;
>>  	u64 ia32_misc_enable_msr;
>>  	int nmsrs;
>>  	struct vmx_msr_entry *guest_msrs;
>> @@ -298,6 +343,11 @@ struct kvm_vcpu {
>>  	int sigset_active;
>>  	sigset_t sigset;
>>
>> +	struct kvm_apic apic;
>> +	wait_queue_head_t halt_wq;
>> +	/* For AP startup */
>> +	unsigned long init_sipi_sipi_state;
>> +
>>  	struct {
>>  		int active;
>>  		u8 save_iopl;
>> @@ -319,6 +369,23 @@ struct kvm_mem_alias {
>>  	gfn_t target_gfn;
>>  };
>>
>> +#define kvm_irq_pending(dev, flags)     (dev)->pending(dev, flags)
>> +#define kvm_irq_read(dev, flags, info)  (dev)->read(dev, flags, info)
>> +#define kvm_irq_inject(dev, irq, flags) (dev)->inject(dev, irq, flags)
>> +#define kvm_irq_summary(dev, data)      (dev)->summary(dev, data)
>
> Static inlines please, no #defines.  In this case, I'd just call the
> functions directly.

Ack.

>
>> +
>> +#define kvm_vcpu_irq_pending(vcpu, flags) \
>> +	kvm_irq_pending(&vcpu->irq_dev, flags)
>> +#define kvm_vcpu_irq_read(vcpu, flags, info) \
>> +	kvm_irq_read(&vcpu->irq_dev, flags, info)
>> +#define kvm_vcpu_irq_inject(vcpu, irq, flags) \
>> +	kvm_irq_inject(&vcpu->irq_dev, irq, flags)
>> +#define kvm_vcpu_irq_summary(vcpu, data) \
>> +	kvm_irq_summary(&vcpu->irq_dev, data)
>
> Ditto.

Ack.

>
>> +struct kvm_mmio_handler {
>> +	unsigned long (*read)(struct kvm_vcpu *v,
>> +			      unsigned long addr,
>> +			      unsigned long length);
>> +	void (*write)(struct kvm_vcpu *v,
>> +		      unsigned long addr,
>> +		      unsigned long length,
>> +		      unsigned long val);
>> +	int (*in_range)(struct kvm_vcpu *v, unsigned long addr);
>> +};
>
> This would need to be split out into a separate patch.  Addresses are
> gpa_t.
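The static-inline replacement for the wrapper macros would be mechanical;
a minimal sketch (with kvm_irqdevice reduced to the one member exercised
here, and a stub device purely for illustration):

```c
#include <assert.h>

struct kvm_irqdevice {
    int (*pending)(struct kvm_irqdevice *this, int flags);
};

/* Static inline instead of a #define: type-checked arguments, no
 * double-evaluation of 'dev', and it shows up properly in debuggers. */
static inline int kvm_irq_pending(struct kvm_irqdevice *dev, int flags)
{
    return dev->pending(dev, flags);
}

/* Illustrative stub device used only to exercise the wrapper. */
static int stub_pending(struct kvm_irqdevice *dev, int flags)
{
    (void)dev; (void)flags;
    return 1;
}

static struct kvm_irqdevice stub_dev = { .pending = stub_pending };
```

Avi's stronger point stands, though: since the wrappers add nothing, the
callers could simply invoke dev->pending(dev, flags) directly.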
Ack.

>
>> +
>> +#define KVM_MMIO_HANDLER_NR_ARRAY_SIZE 1
>> +static struct kvm_mmio_handler *kvm_mmio_handlers[KVM_MMIO_HANDLER_NR_ARRAY_SIZE] =
>> +{
>> +	&apic_mmio_handler,
>> +};
>
> We'll need dynamic registration if we want kernel apic support to be
> optional.

Agreed.  This is another example of something I inherited, so it probably
wasn't written with the dynamic model in mind.  I will clean this up.

> Also, this needs to be per-vcpu as each vcpu's apic can be placed at a
> different address.

Note that this interface does actually work on a per-vcpu basis today, as
the in_range() mechanism considers the apicbase MSR from its vcpu
structure.  But that aside, I fully agree that this should be more
explicit.

Actually, I think it would be good to come up with a general MMIO/PIO
system that can work on both a per-VM and a per-VCPU basis, so we can get
coverage for both global and per-cpu resources.  For instance, if we move
the IOAPIC and i8259s, they should be associated with a VM-global address.

>
>> @@ -1479,7 +1533,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>>  		data = 3;
>>  		break;
>>  	case MSR_IA32_APICBASE:
>> -		data = vcpu->apic_base;
>> +		data = vcpu->apic.apic_base_msr;
>> +		break;
>> +	case MSR_IA32_TIME_STAMP_COUNTER:
>> +		// FIXME
>> +		//data = guest_read_tsc();
>>  		break;
>
> The tsc stuff looks spurious.

More inherited stuff, and I'm not sure what it was trying to do.  I will
investigate further.

>
>> +	case KVM_GET_OPTIONS: {
>> +		r = -EFAULT;
>> +		if (copy_to_user(&kvm->options, argp, sizeof(kvm->options)))
>> +			goto out;
>> +		r = 0;
>> +		break;
>> +	}
>> +	case KVM_SET_OPTIONS: {
>> +		r = -EFAULT;
>> +		if (copy_from_user(&kvm->options, argp, sizeof(kvm->options)))
>> +			goto out;
>> +		r = 0;
>> +		break;
>> +	}
>
> This is difficult to expand over time.  I prefer a separate ioctl() for
> each option.

Ack.
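A dynamic registration scheme along the lines discussed above might look
like this userspace sketch (a simple linked list; all names and the
example LAPIC range are illustrative).  A real version would hang one
list off the VM and one off each VCPU:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long gpa_t; /* guest physical address */

struct mmio_handler {
    int (*in_range)(struct mmio_handler *h, gpa_t addr);
    struct mmio_handler *next;
};

/* Registration list replacing the fixed-size static array. */
static struct mmio_handler *mmio_handlers;

static void mmio_register(struct mmio_handler *h)
{
    h->next = mmio_handlers;
    mmio_handlers = h;
}

/* Dispatch walks the list and asks each handler if it claims the
 * address, mirroring the existing in_range() mechanism. */
static struct mmio_handler *mmio_lookup(gpa_t addr)
{
    struct mmio_handler *h;

    for (h = mmio_handlers; h; h = h->next)
        if (h->in_range(h, addr))
            return h;
    return NULL;
}

/* Example handler covering the default LAPIC page at 0xfee00000. */
static int apic_in_range(struct mmio_handler *h, gpa_t addr)
{
    (void)h;
    return addr >= 0xfee00000UL && addr < 0xfee01000UL;
}

static struct mmio_handler apic_handler = { .in_range = apic_in_range };
```

With a per-vcpu list, the in_range() check can still consult the vcpu's
apicbase MSR, so relocated APICs keep working.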
>
>> +
>> +struct bitarray {
>> +	unsigned long summary; /* 1 per word in pending */
>> +	unsigned long pending[NR_IRQ_WORDS];
>> +};
>> +
>> +static int bitarray_pending(struct bitarray *this)
>> +{
>> +	return this->summary ? 1 : 0;
>> +}
>> +
>> +static int bitarray_findhighest(struct bitarray *this)
>> +{
>> +	if(!this->summary)
>
> Space after if, here and elsewhere.

Ack.

>
>> +		return 0;
>> +
>> +	int word_index = __ffs(this->summary);
>> +	int bit_index = __ffs(this->pending[word_index]);
>> +
>> +	return word_index * BITS_PER_LONG + bit_index;
>> +}
>> +
>> +static void bitarray_set(struct bitarray *this, int nr)
>> +{
>> +	set_bit(nr, &this->pending);
>> +	set_bit(nr / BITS_PER_LONG, &this->summary);
>
> Since you need locking anyway, best to use the unlocked versions
> (__set_bit()).

Ack.  Yes, locking needs to be added.  Votes for the appropriate
mechanism?  (e.g. spinlock, mutex, etc.?)

>
>> +typedef struct {
>> +	struct bitarray irq_pending;
>> +	struct bitarray nmi_pending;
>
> Why is nmi_pending a bitarray?

When I wrote this I thought any vector could be declared as an NMI.  I
just restudied the doc and I see this is wrong.  You are right,
nmi_pending doesn't need to be an array.  I will fix this.
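For what it's worth, the bitarray above can be exercised standalone;
here is a userspace sketch using __builtin_ffsl in place of the kernel's
__ffs (note the builtin is 1-based), with locking left to an outer lock
per Avi's point that the unlocked __set_bit() variants then suffice:

```c
#include <assert.h>

#define BITS_PER_LONG (8 * (int)sizeof(long))
#define NR_IRQ_WORDS  (256 / BITS_PER_LONG)

struct bitarray {
    unsigned long summary;               /* 1 bit per word in pending */
    unsigned long pending[NR_IRQ_WORDS];
};

/* Unlocked set, mirroring __set_bit(): mark the vector and its summary. */
static void bitarray_set(struct bitarray *b, int nr)
{
    b->pending[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
    b->summary |= 1UL << (nr / BITS_PER_LONG);
}

/* Pop the lowest pending vector, clearing the summary bit when its
 * word empties; returns -1 if nothing is pending. */
static int bitarray_pop(struct bitarray *b)
{
    int word, bit;

    if (!b->summary)
        return -1;

    word = __builtin_ffsl((long)b->summary) - 1;
    bit  = __builtin_ffsl((long)b->pending[word]) - 1;

    b->pending[word] &= ~(1UL << bit);
    if (!b->pending[word])
        b->summary &= ~(1UL << word);

    return word * BITS_PER_LONG + bit;
}
```

On the locking question: since injection can happen from interrupt-ish
contexts while the vcpu thread consumes, a spinlock seems like the
natural first choice, but that is just my leaning.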
>
>> @@ -108,20 +109,12 @@ static unsigned get_addr_size(struct kvm_vcpu *vcpu)
>>
>>  static inline u8 pop_irq(struct kvm_vcpu *vcpu)
>>  {
>> -	int word_index = __ffs(vcpu->irq_summary);
>> -	int bit_index = __ffs(vcpu->irq_pending[word_index]);
>> -	int irq = word_index * BITS_PER_LONG + bit_index;
>> -
>> -	clear_bit(bit_index, &vcpu->irq_pending[word_index]);
>> -	if (!vcpu->irq_pending[word_index])
>> -		clear_bit(word_index, &vcpu->irq_summary);
>> -	return irq;
>> +	return kvm_vcpu_irq_read(vcpu, 0, NULL);
>>  }
>>
>>  static inline void push_irq(struct kvm_vcpu *vcpu, u8 irq)
>>  {
>> -	set_bit(irq, vcpu->irq_pending);
>> -	set_bit(irq / BITS_PER_LONG, &vcpu->irq_summary);
>> +	kvm_vcpu_irq_inject(vcpu, irq, 0);
>>  }
>
> It would be helpful to unify the vmx and svm irq code first (I can merge
> something like that immediately), so this patch becomes simpler.

Well, I could just change any occurrence of pop_irq() to
kvm_vcpu_irq_read() and any occurrence of push_irq() to
kvm_vcpu_irq_inject(), but I don't get the impression that this is what
you were referring to.  Could you elaborate?

>
> --
> Do not meddle in the internals of kernels, for they are subtle and quick
> to panic.

_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel