Downgraded the following packages on compute nodes (apt output captured after the downgrade, so the installed version is now 1:2.5+dfsg-5ubuntu10.5, with 10.6 listed as merely available again):

qemu/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-block-extra/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-kvm/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-system/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-system-common/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-system-misc/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-system-x86/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-user/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
qemu-utils/xenial-updates,xenial-security 1:2.5+dfsg-5ubuntu10.6 amd64 [upgradable from: 1:2.5+dfsg-5ubuntu10.5]
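For anyone hitting the same bug, a minimal sketch of how such a downgrade and pin can be done, assuming the 10.5 .debs have been fetched from the launchpad build page linked below (the file glob is illustrative, not the literal commands I ran):

compute1:~$ sudo dpkg -i ./qemu*10.5_amd64.deb
compute1:~$ sudo apt-mark hold qemu qemu-block-extra qemu-kvm qemu-system qemu-system-common qemu-system-misc qemu-system-x86 qemu-user qemu-utils
compute1:~$ dpkg-query -W qemu-system-x86
qemu-system-x86 1:2.5+dfsg-5ubuntu10.5

The apt-mark hold keeps apt from pulling 10.6 back in on the next upgrade run.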
Restarted libvirtd and nova-compute. Performed 16 live migrations of different instances between compute nodes - 0 failures. Since it previously took at most 5-6 attempts to trigger the problem, and only 2-3 in most cases, I think I can safely state that this issue is resolved.
Sean, thank you for pointing at exactly the right solution. Chris, thank you for providing me with a link to the packages.

2016-12-20 15:59 GMT+03:00 Chris Sarginson <[email protected]>:

> Hi Vladimir,
>
> The packages are available on launchpad here:
> https://launchpad.net/ubuntu/+source/qemu/1:2.5+dfsg-5ubuntu10.5/+build/10938755
>
> On Tue, 20 Dec 2016 at 12:49 Vladimir Prokofev <[email protected]> wrote:
>
>> Using
>> compute1:~$ dpkg-query -W qemu-system-x86
>> qemu-system-x86 1:2.5+dfsg-5ubuntu10.6
>> compute1:~$ qemu-system-x86_64 --version
>> QEMU emulator version 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.6), Copyright (c) 2003-2008 Fabrice Bellard
>>
>> So I guess you may be right. Now I have to confirm this.
>> Did you build your package from source? 1:2.5+dfsg-5ubuntu10.5 is no longer available in the official repository, and I can't find it anywhere except as source code.
>>
>> 2016-12-20 12:17 GMT+03:00 Sean Redmond <[email protected]>:
>>
>> What version of qemu are you running? I was hit by this bug[1], which seems to produce the same faults you are reporting, and had to downgrade the qemu packages to version '1:2.5+dfsg-5ubuntu10.5'.
>>
>> [1] https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
>>
>> On Tue, Dec 20, 2016 at 12:10 AM, Vladimir Prokofev <[email protected]> wrote:
>>
>> Hello Ops.
>>
>> I want to pick your brains on a live migration issue, because I'm kinda stuck at the moment.
>>
>> I'm running a small HA Ubuntu OpenStack cloud: 3 controllers (VMs on Ubuntu/KVM) behind an ha-proxy LB, with corosync/pacemaker managing the VIP, a Galera DB, 3 compute nodes (KVM hypervisor), 3 network nodes (VMs on Ubuntu/KVM), and a CEPH cluster (4 OSD nodes, 3 MON nodes (VMs on Ubuntu/KVM)).
>> Nova, cinder, and glance use CEPH as a backend. Neutron uses linux-bridge.
>>
>> We were running Ubuntu Trusty 14.04/Mitaka and decided to upgrade to Ubuntu Xenial 16.04/Newton. The issue appeared sometime after upgrading all Mitaka packages to their latest versions prior to the Xenial upgrade, and stayed even after the upgrade to Xenial/Newton. It's hard to tell for sure, as we didn't use LM for a while: it worked fine under Trusty/Mitaka, and broke under Xenial/latest-Mitaka.
>>
>> With a 20-30% chance, a live migration will fail. The instance pauses on the source node, but is not resumed on the target node. The target node destroys the instance, assuming LM failed, while on the source node the instance stays paused. On the source node no new messages appear in nova-compute.log, and commands such as "virsh list" produce no output (they don't even exit, for that matter). nova-compute can be restarted, but after the normal startup messages it once again produces no new log entries. Nova still considers the compute node up and running.
>> In the dashboard you can see that the instance now resides on the new host and is in shutoff state.
>>
>> If I restart libvirtd on the target or source node, the whole system "unfreezes". The source host unpauses the instance, and it is live again. But it now resides on the source node, while the database thinks it's on the target node and is shut down.
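>>
>> (A minimal sketch of that "unfreeze", assuming xenial's systemd unit name for libvirt; on some installs the unit is libvirt-bin rather than libvirtd:)
>>
>> compute1:~$ sudo systemctl restart libvirtd   # on the source or target node
>> compute1:~$ virsh list                        # responds again, instance shows as running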
>> Warning messages will appear on the source host:
>>
>> 2016-12-20 01:13:37.095 20025 WARNING nova.compute.manager [req-b3879539-989f-4075-9ef0-d23ef8868102 - - - - -] While synchronizing instance power states, found 3 instances in the database and 4 instances on the hypervisor.
>>
>> Currently I'm stopping/destroying/undefining the instance on the source node and launching it via standard OpenStack means, but this leads to an instance reboot.
>>
>> Last messages in nova-compute.log on the source node after LM start:
>> 2016-12-20 00:09:35.961 16834 INFO nova.virt.libvirt.migration [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Increasing downtime to 1251 ms after 0 sec elapsed time
>> 2016-12-20 00:09:36.127 16834 INFO nova.virt.libvirt.driver [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Migration running for 0 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0)
>> 2016-12-20 00:09:36.894 16834 INFO nova.compute.manager [req-3a2a828f-8b76-4a65-b49b-ea8d232a3de5 - - - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Paused (Lifecycle Event)
>> 2016-12-20 00:09:37.046 16834 INFO nova.compute.manager [req-3a2a828f-8b76-4a65-b49b-ea8d232a3de5 - - - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] During sync_power_state the instance has a pending task (migrating). Skip.
>> 2016-12-20 00:09:37.300 16834 INFO nova.virt.libvirt.driver [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Migration operation has completed
>> 2016-12-20 00:09:37.301 16834 INFO nova.compute.manager [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] _post_live_migration() is started..
>> 2016-12-20 00:09:39.718 16834 INFO os_vif [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] Successfully unplugged vif VIFBridge(active=False,address=fa:16:3e:e3:c4:c7,bridge_name='brq0b7e75ee-b6',has_traffic_filtering=True,id=2a141727-77a7-4b43-b4cd-21fb415a6a55,network=Network(0b7e75ee-b691-4851-9b29-08c3c62a96e0),plugin='linux_bridge',port_profile=<?>,preserve_on_delete=False,vif_name='tap2a141727-77')
>> 2016-12-20 00:09:39.777 16834 INFO nova.virt.libvirt.driver [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Deleting instance files /var/lib/nova/instances/cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53_del
>> 2016-12-20 00:09:39.778 16834 INFO nova.virt.libvirt.driver [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Deletion of /var/lib/nova/instances/cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53_del complete
>> 2016-12-20 00:10:33.644 16834 INFO nova.compute.resource_tracker [req-00c4c403-d039-4347-b7ad-3098834a891c - - - - -] Auditing locally available compute resources for node compute1.cloud1.local
>>
>> No new messages will appear until libvirtd is restarted.
>>
>> Messages during LM in nova-compute.log on the target node:
>> 2016-12-20 00:09:32.234 29438 INFO nova.virt.libvirt.driver [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] Instance launched has CPU info: {"vendor": "Intel", "model": "Nehalem", "arch": "x86_64", "features": ["pge", "clflush", "sep", "syscall", "vme", "dtes64", "msr", "vmx", "xtpr", "cmov", "ssse3", "est", "pat", "monitor", "smx", "pbe", "lm", "tsc", "nx", "fxsr", "tm", "sse4.1", "pae", "sse4.2", "acpi", "mmx", "cx8", "mce", "de", "rdtscp", "ht", "dca", "lahf_lm", "pdcm", "mca", "pdpe1gb", "apic", "sse", "pse", "ds", "invtsc", "pni", "tm2", "sse2", "ss", "ds_cpl", "pcid", "fpu", "cx16", "pse36", "mtrr", "popcnt"], "topology": {"cores": 6, "cells": 2, "threads": 2, "sockets": 1}}
>> 2016-12-20 00:09:34.476 29438 INFO os_vif [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] Successfully plugged vif VIFBridge(active=True,address=fa:16:3e:e3:c4:c7,bridge_name='brq0b7e75ee-b6',has_traffic_filtering=True,id=2a141727-77a7-4b43-b4cd-21fb415a6a55,network=Network(0b7e75ee-b691-4851-9b29-08c3c62a96e0),plugin='linux_bridge',port_profile=<?>,preserve_on_delete=False,vif_name='tap2a141727-77')
>> 2016-12-20 00:09:36.711 29438 INFO nova.compute.manager [req-75abcda4-701a-4b84-8f28-c11f6ad6f933 - - - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Started (Lifecycle Event)
>> 2016-12-20 00:09:39.777 29438 INFO nova.compute.manager [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Post operation of migration started
>> 2016-12-20 00:09:40.381 29438 INFO nova.network.neutronv2.api [req-51801b5e-b77a-4d76-ad87-176326ac910e 84498aa7e26443c4908d973f3e86d530 ecee1197f46e453dba25669554226ce5 - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Updating port 2a141727-77a7-4b43-b4cd-21fb415a6a55 with attributes {'binding:profile': {}, 'binding:host_id': 'compute3'}
>> 2016-12-20 00:09:52.589 29438 INFO nova.compute.manager [-] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] VM Stopped (Lifecycle Event)
>> 2016-12-20 00:09:52.747 29438 WARNING nova.compute.manager [req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 4, current VM power_state: 4
>> 2016-12-20 00:09:52.964 29438 INFO nova.compute.manager [req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance is already powered off in the hypervisor when stop is called.
>> 2016-12-20 00:09:53.039 29438 INFO nova.virt.libvirt.driver [req-e945fe0d-dea0-47ee-8b87-b88991b9796e - - - - -] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance already shutdown.
>> 2016-12-20 00:09:53.043 29438 INFO nova.virt.libvirt.driver [-] [instance: cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53] Instance destroyed successfully.
>> 2016-12-20 00:09:56.696 29438 INFO nova.compute.resource_tracker [req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Auditing locally available compute resources for node compute3.cloud1.local
>> 2016-12-20 00:09:57.295 29438 INFO nova.compute.resource_tracker [req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Total usable vcpus: 24, total allocated vcpus: 39
>> 2016-12-20 00:09:57.295 29438 INFO nova.compute.resource_tracker [req-53e5b7a1-aba9-4b26-a3fb-b9b293bf2f17 - - - - -] Final resource view: name=compute3.cloud1.local phys_ram=145060MB used_ram=94720MB phys_disk=14710GB used_disk=45GB total_vcpus=24 used_vcpus=39 pci_stats=[]
>>
>> The target node proceeds with normal operations.
>>
>> Nothing interesting appears in the libvirtd log or syslog on any compute node. I've tried enabling debug logging for libvirt, but it's unreadable to me; all I can figure out is that libvirt is still running and other instances are unaffected.
>>
>> So, my questions:
>> - Is there any workaround to sync the state of the DB and the compute node? I'd like at least to avoid an instance reboot, and for that I need some way to tell the DB that the instance is on the source node and running, without service interruption. Is it OK to update the database directly via SQL? (A sketch of what I mean follows at the end of this mail.)
>> - How can I troubleshoot this issue further? As I said, I'm kinda stuck and have no idea where to look next. If the issue appeared 100% of the time it would be a lot easier, but there's no pattern I can figure out: it happens randomly, with heavily loaded and idle instances alike.
>>
>> Any suggestions appreciated. I need a boot in the right direction.
>> I will provide any logs and/or config files on request.
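>>
>> (The kind of direct DB edit I'm asking about would look something like this - a hedged sketch only, assuming the standard nova 'instances' schema; the host/node values are illustrative for my case, and power_state 1 means RUNNING:)
>>
>> $ mysql -u nova -p nova -e "UPDATE instances SET host='compute1', node='compute1.cloud1.local', vm_state='active', power_state=1, task_state=NULL WHERE uuid='cd8cb1db-dca3-4b0f-a03e-c0befbbd7b53';"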
_______________________________________________
OpenStack-operators mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
