apic overlaps PCI space. On real hardware it has higher priority, emulate this correctly.
This should addresses the following issue: > Subject: Re: [BUG] Guest OS hangs on boot when 64bit BAR present > (kvm-apic-msi resource conflict) > Sometime ago I reported an issue about guest OS hang when 64bit BAR present. > http://lists.gnu.org/archive/html/qemu-devel/2012-01/msg03189.html > http://lists.gnu.org/archive/html/qemu-devel/2012-12/msg00413.html > > Some more investigation has been done, so in this post I'll try to explain > why it happens and offer possible solutions: > > *When the issue happens* > The issue occurs on Linux guest OS if kernel version <2.6.36 > A Guest OS hangs on boot when a 64bit PCI BAR is present in a system (if we > use ivshmem driver for example) and occupies range within first > 4 GB. > > *How to reproduce* > I used the following qemu command to reproduce the case: > /usr/local/bin/qemu-system-x86_64 -M pc-1.3 -enable-kvm -m 2000 -smp > 1,sockets=1,cores=1,threads=1 -name Rh5332 -chardev > socket,id=charmonitor,path=/var/lib/libvirt/qemu/Rh5332.monitor,server,nowait > -mon chardev=charmonitor,id=monitor,mode=readline -rtc > base=utc -boot cd -drive > file=/home/akorolev/rh5332.img,if=none,id=drive-ide0-0-0,format=raw -device > ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -chardev > file,id=charserial0,path=/home/akorolev/serial.log -device > isa-serial,chardev=charserial0,id=serial0 -usb -vnc 127.0.0.1:0 -k en-us -vga > cirrus -device ivshmem,shm,size=32M-device > virtio-balloon-pci,id=balloon0 > > Tried different guests: Centos 5.8 64bit, RHEL 5.3 32bit, FC 12 64bit on all > machines hang occurs in 100% cases > > *Why it happens* > The issue basically comes from Linux PCI enumeration code. > > The OS enumerates 64BIT bars when device is enabled using the following > procedure. > 1. Write all FF's to lower half of 64bit BAR > 2. Write address back to lower half of 64bit BAR > 3. Write all FF's to higher half of 64bit BAR > 4. Write address back to higher half of 64bit BAR > > For qemu it means that qemu pci_default_write_config() recevies all FFs for > lower part of the 64bit BAR. > Then it applies the mask and converts the value to "All FF's - size + 1" > (FE000000 if size is 32MB). > > So for short period of time the range [0xFE000000 - 0xFFFFFFFF] will be > occupied by ivshmem resource. > For some reason it is lethal for further boot process. > > We have found that boot process screws up completely if kvm-apic-msi range is > overlapped even for short period of time. (We still don't > know why it happens, hope that the qemu maintainers can answer?) > > If we look at kvm-apic-msi memory region it is a non-overlapable memory > region with hardcoded address range [0xFEE00000 - 0xFEF00000]. > > Here is a log we collected from render_memory_regions: > > system overlap 0 pri 0 [0x0 - 0x7fffffffffffffff] > kvmvapic-rom overlap 1 pri 1000 [0xca000 - 0xcd000] > pc.ram overlap 0 pri 0 [0xca000 - 0xcd000] > ++ pc.ram [0xca000 - 0xcd000] is added to view > .................... > smram-region overlap 1 pri 1 [0xa0000 - 0xc0000] > pci overlap 0 pri 0 [0xa0000 - 0xc0000] > cirrus-lowmem-container overlap 1 pri 1 [0xa0000 - 0xc0000] > cirrus-low-memory overlap 0 pri 0 [0xa0000 - 0xc0000] > ++cirrus-low-memory [0xa0000 - 0xc0000] is added to view > kvm-ioapic overlap 0 pri 0 [0xfec00000 - 0xfec01000] > ++kvm-ioapic [0xfec00000 - 0xfec01000] is added to view > pci-hole64 overlap 0 pri 0 [0x100000000 - 0x4000000100000000] > pci overlap 0 pri 0 [0x100000000 - 0x4000000100000000] > pci-hole overlap 0 pri 0 [0x7d000000 - 0x100000000] > pci overlap 0 pri 0 [0x7d000000 - 0x100000000] > ivshmem-bar2-container overlap 1 pri 1 [0xfe000000 - 0x100000000] > ivshmem.bar2 overlap 0 pri 0 [0xfe000000 - 0x100000000] > ++ivshmem.bar2 [0xfe000000 - 0xfec00000] is added to view > ++ivshmem.bar2 [0xfec01000 - 0x100000000] is added to view > ivshmem-mmio overlap 1 pri 1 [0xfebf1000 - 0xfebf1100] > e1000-mmio overlap 1 pri 1 [0xfeba0000 - 0xfebc0000] > cirrus-mmio overlap 1 pri 1 [0xfebf0000 - 0xfebf1000] > cirrus-pci-bar0 overlap 1 pri 1 [0xfa000000 - 0xfc000000] > vga.vram overlap 1 pri 1 [0xfa000000 - 0xfa800000] > ++vga.vram [0xfa000000 - 0xfa800000] is added to view > cirrus-bitblt-mmio overlap 0 pri 0 [0xfb000000 - 0xfb400000] > ++cirrus-bitblt-mmio [0xfb000000 - 0xfb400000] is added to > view > cirrus-linear-io overlap 0 pri 0 [0xfa000000 - 0xfa800000] > pc.bios overlap 0 pri 0 [0xfffe0000 - 0x100000000] > ram-below-4g overlap 0 pri 0 [0x0 - 0x7d000000] > pc.ram overlap 0 pri 0 [0x0 - 0x7d000000] > ++pc.ram [0x0 - 0xa0000] is added to view > ++pc.ram [0x100000 - 0x7d000000] is added to view > kvm-apic-msi overlap 0 pri 0 [0xfee00000 - 0xfef00000] > > As you can see from log the kvm-apic-msi is enumarated last when range > [0xfee00000 - 0xfef00000] is already occupied by ivshmem.bar2 > [0xfec01000 - 0x100000000]. > > > *Possible solutions* > Solution 1. Probably the best would be adding the rule that regions which may > not be overlapped are added to view first (In in other words > regions which must not be overlapped have the highest priority). Please find > patch in the following message. > > Solution 2. Raise priority of kvm-apic-msi resource. This is a bit misleading > solution, as priority is only applicable for overlap-able > regions, but this region must not be overlapped. > > Solution 3. Fix the issue at PCI level. Track if the resource is 64bit and > apply changes if both parts of 64bit BAR are programmed. (It > appears that real PCI bus controllers are smart enough to track 64bit BAR > writes on PC, so qemu could do the same? Drawbacks are that > tracking PCI writes is bit cumbersome, and such tracking may appear to > somebody as a hack) --- The following patch is under test ATM. Alexey, does it address the crash for you? diff --git a/hw/pc.c b/hw/pc.c index 53cc173..745c89d 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -1154,7 +1154,8 @@ void ioapic_init_gsi(GSIState *gsi_state, const char *parent_name) } qdev_init_nofail(dev); d = SYS_BUS_DEVICE(dev); - sysbus_mmio_map(d, 0, 0xfec00000); + /* APIC overlaps the PCI window. */ + sysbus_mmio_map_overlap(d, 0, 0xfec00000, 1000); for (i = 0; i < IOAPIC_NUM_PINS; i++) { gsi_state->ioapic_irq[i] = qdev_get_gpio_in(dev, i); diff --git a/hw/sysbus.c b/hw/sysbus.c index 6d9d1df..40bf352 100644 --- a/hw/sysbus.c +++ b/hw/sysbus.c @@ -66,6 +66,26 @@ void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr) dev->mmio[n].memory); } +void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr, + unsigned priority) +{ + assert(n >= 0 && n < dev->num_mmio); + + if (dev->mmio[n].addr == addr) { + /* ??? region already mapped here. */ + return; + } + if (dev->mmio[n].addr != (hwaddr)-1) { + /* Unregister previous mapping. */ + memory_region_del_subregion(get_system_memory(), dev->mmio[n].memory); + } + dev->mmio[n].addr = addr; + memory_region_add_subregion_overlap(get_system_memory(), + addr, + dev->mmio[n].memory, + priority); +} + /* Request an IRQ source. The actual IRQ object may be populated later. */ void sysbus_init_irq(SysBusDevice *dev, qemu_irq *p) diff --git a/hw/sysbus.h b/hw/sysbus.h index a7fcded..2100bd7 100644 --- a/hw/sysbus.h +++ b/hw/sysbus.h @@ -56,6 +56,8 @@ void sysbus_init_ioports(SysBusDevice *dev, pio_addr_t ioport, pio_addr_t size); void sysbus_connect_irq(SysBusDevice *dev, int n, qemu_irq irq); void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr); +void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr, + unsigned priority); void sysbus_add_memory(SysBusDevice *dev, hwaddr addr, MemoryRegion *mem); void sysbus_add_memory_overlap(SysBusDevice *dev, hwaddr addr,