On Sun, Mar 05, 2023 at 06:43:42PM +0000, David Woodhouse wrote: > From: David Woodhouse <d...@amazon.co.uk> > > A Linux guest will perform IRQ migration after the IRQ has happened, > updating the RTE to point to the new destination CPU and then unmasking > the interrupt. > > However, when the guest updates the RTE, ioapic_mem_write() calls > ioapic_service(), which redelivers the pending level interrupt via > kvm_set_irq(), *before* calling ioapic_update_kvm_routes() which sets > the new target CPU. > > Thus, the IRQ which is supposed to go to the new target CPU is instead > misdelivered to the previous target. An example where the guest kernel > is attempting to migrate from CPU#2 to CPU#0 shows: > > xenstore_read tx 0 path control/platform-feature-xs_reset_watches > ioapic_set_irq vector: 11 level: 1 > ioapic_set_remote_irr set remote irr for pin 11 > ioapic_service: trigger KVM IRQ 11 > [ 0.523627] The affinity mask was 0-3 and the handler is on 2 > ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26 > ioapic_update_kvm_routes: update KVM route for IRQ 11: fee02000 8021 > ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021 > xenstore_reset_watches > ioapic_set_irq vector: 11 level: 1 > ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x1c021 > [ 0.524569] ioapic_ack_level IRQ 11 moveit = 1 > ioapic_eoi_broadcast EOI broadcast for vector 33 > ioapic_clear_remote_irr clear remote irr for pin 11 vector 33 > ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26 > ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x18021 > [ 0.525235] ioapic_finish_move IRQ 11 calls irq_move_masked_irq() > [ 0.526147] irq_do_set_affinity for IRQ 11, 0 > [ 0.526732] ioapic_set_affinity for IRQ 11, 0 > [ 0.527330] ioapic_setup_msg_from_msi for IRQ11 target 0 > ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x27 > ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x27 size 0x4 val 0x0 > ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26 > ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021 > [ 0.527623] ioapic_set_affinity returns 0 > [ 0.527623] ioapic_finish_move IRQ 11 calls unmask_ioapic_irq() > ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26 > ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x8021 > ioapic_set_remote_irr set remote irr for pin 11 > ioapic_service: trigger KVM IRQ 11 > ioapic_update_kvm_routes: update KVM route for IRQ 11: fee00000 8021 > [ 0.529571] The affinity mask was 0 and the handler is on 2 > [ xenstore_watch path memory/target token FFFFFFFF92847D40 > > There are no other code paths in ioapic_mem_write() which need the KVM > IRQ routing table to be updated, so just shift the call from the end > of the function to happen right before the call to ioapic_service() > and thus deliver the re-enabled IRQ to the right place. > > Fixes: 15eafc2e602f "kvm: x86: add support for KVM_CAP_SPLIT_IRQCHIP" > Signed-off-by: David Woodhouse <d...@amazon.co.uk>
Reviewed-by: Peter Xu <pet...@redhat.com> > --- > Alternative fixes might have been just to remove the part in > ioapic_service() which delivers the IRQ via kvm_set_irq() because > surely delivering as MSI ought to work just fine anyway in all cases? > That code lacks a comment justifying its existence. Didn't check all details, but AFAIU there're still some different paths triggered so at least it looks still clean to use the path it's for. E.g., I think if someone traces kvm_set_irq() in kernel this specific irq triggered right after unmasking might seem to be missed misterously (but actually it was not). Thanks, > > Or maybe in the specific case shown in the above log, it would have > sufficed for ioapic_update_kvm_routes() to update the route *even* > when the IRQ is masked. It's not like it's actually going to get > triggered unless QEMU deliberately does so, anyway? But that only > works because the target CPU happens to be in the high word of the > RTE; if something in the *low* word (vector, perhaps) was changed > at the same time as the unmask, we'd still trigger with stale data. -- Peter Xu