[Yahoo-eng-team] [Bug 2020552] [NEW] trunk_details missing sub port MAC addresses for LIST
Public bug reported:

When returning port details, trunk_details.sub_ports should contain:

* segmentation_id
* segmentation_type
* port_id
* mac_address

This is the case when GETting a single port, but when listing ports mac_address is missing.

In the following:

* Parent port: a47df912-1cba-458c-9bb9-00cd3d71b9e6
* Trunk: 70f314f8-5577-4b98-be9c-68bbe3791d7f
* Sub port: d11793a9-8862-4378-a1fe-045f04dad841

GET request:

    curl -s -H "X-Auth-Token: $OS_TOKEN" "https://rhos-d.infra.prod.upshift.rdu2.redhat.com:13696/v2.0/ports/a47df912-1cba-458c-9bb9-00cd3d71b9e6" | jq

    {
      "port": {
        ...
        "trunk_details": {
          "trunk_id": "70f314f8-5577-4b98-be9c-68bbe3791d7f",
          "sub_ports": [
            {
              "segmentation_id": 100,
              "segmentation_type": "vlan",
              "port_id": "d11793a9-8862-4378-a1fe-045f04dad841",
              "mac_address": "fa:16:3e:88:29:a0"
            }
          ]
        },
        ...
      }
    }

LIST request returning the same port:

    curl -s -H "X-Auth-Token: $OS_TOKEN" "https://rhos-d.infra.prod.upshift.rdu2.redhat.com:13696/v2.0/ports?id=a47df912-1cba-458c-9bb9-00cd3d71b9e6" | jq

    {
      "ports": [
        {
          ...
          "trunk_details": {
            "trunk_id": "70f314f8-5577-4b98-be9c-68bbe3791d7f",
            "sub_ports": [
              {
                "segmentation_id": 100,
                "segmentation_type": "vlan",
                "port_id": "d11793a9-8862-4378-a1fe-045f04dad841"
              }
            ]
          },
          ...
        }
      ]
    }

Note that mac_address is missing for the LIST request.

* Version: Little bit of guesswork going on here, but Nova reports a latest microversion of 2.79, which corresponds to Train.

** Affects: neutron
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/2020552
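For completeness, the same comparison can be scripted. Below is a small reproducer sketch using python-requests; the endpoint and token are placeholders, and only the parent port UUID comes from this report.

    # Reproducer sketch (not part of the original report): compare the
    # sub_port keys returned by GET vs LIST for the same parent port.
    import requests

    NEUTRON = "https://neutron.example.com:9696"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "<valid token>"}     # placeholder token
    PORT = "a47df912-1cba-458c-9bb9-00cd3d71b9e6"

    get_port = requests.get(
        f"{NEUTRON}/v2.0/ports/{PORT}", headers=HEADERS).json()["port"]
    list_port = requests.get(
        f"{NEUTRON}/v2.0/ports", params={"id": PORT},
        headers=HEADERS).json()["ports"][0]

    # GET includes mac_address in each sub_port; LIST does not.
    print(sorted(get_port["trunk_details"]["sub_ports"][0]))
    print(sorted(list_port["trunk_details"]["sub_ports"][0]))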
[Yahoo-eng-team] [Bug 1848666] [NEW] Race can cause instance to become ACTIVE after build error
Public bug reported:

Two functions used in error cleanup in _do_build_and_run_instance, _cleanup_allocated_networks and _set_instance_obj_error_state, call an unguarded instance.save(). The problem with this is that the instance object may have been in an unclean state before the build exception was raised. Calling instance.save() will persist this unclean error state in addition to whatever change was made during cleanup, which is not intended.

Specifically, in the case that a build races with a delete, the build can fail when we try to do an atomic save to set the vm_state to active, raising UnexpectedDeletingTaskStateError. However, the instance object still contains the unpersisted vm_state change along with other concomitant changes. These will all be persisted when _cleanup_allocated_networks calls instance.save(). This means that the instance.save(expected_task_state=SPAWNING) which correctly failed due to a race later succeeds accidentally during cleanup, resulting in an inconsistent instance state.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1848666
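The interaction is easier to see with a minimal toy model (illustrative names only, not Nova code): the guarded save refuses the update but leaves the change queued on the object, and the later unguarded save in cleanup persists it anyway.

    # Toy model of the failure mode described above; not Nova code.
    class GuardFailed(Exception):
        pass

    class FakeInstance:
        def __init__(self, db_row):
            self.db_row = db_row        # dict standing in for the DB row
            self.pending = {}           # unpersisted field changes

        def save(self, expected_task_state=None):
            if (expected_task_state is not None
                    and self.db_row['task_state'] != expected_task_state):
                # Atomic compare-and-swap refused; note that the pending
                # changes stay queued on the object.
                raise GuardFailed()
            self.db_row.update(self.pending)
            self.pending = {}

    db_row = {'vm_state': 'building', 'task_state': 'deleting'}
    inst = FakeInstance(db_row)

    inst.pending['vm_state'] = 'active'             # build flips to ACTIVE...
    try:
        inst.save(expected_task_state='spawning')   # ...but a delete raced us
    except GuardFailed:
        pass

    inst.pending['network_info'] = '[]'   # cleanup's own change
    inst.save()                           # unguarded save in cleanup
    print(db_row['vm_state'])             # 'active': stale change persisted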
[Yahoo-eng-team] [Bug 1774249] Re: update_available_resource will raise DiskNotFound after resize but before confirm
This is not fixed. We've just had a report where we appear to be hitting the race reported in review here: https://review.opendev.org/#/c/571410/7/nova/virt/libvirt/driver.py

** Changed in: nova
   Status: Fix Released => In Progress

** Changed in: nova/stein
   Status: Fix Committed => In Progress

** Changed in: nova/rocky
   Status: Fix Committed => In Progress

** Changed in: nova/queens
   Status: Fix Committed => In Progress

https://bugs.launchpad.net/bugs/1774249

Title: update_available_resource will raise DiskNotFound after resize but before confirm

Status in OpenStack Compute (nova): In Progress
Status in OpenStack Compute (nova) ocata series: Triaged
Status in OpenStack Compute (nova) pike series: Triaged
Status in OpenStack Compute (nova) queens series: In Progress
Status in OpenStack Compute (nova) rocky series: In Progress
Status in OpenStack Compute (nova) stein series: In Progress
[Yahoo-eng-team] [Bug 1840912] [NEW] libvirt calls aren't reliably using tpool.Proxy
Public bug reported:

A customer is hitting an issue with symptoms identical to bug 1045152 (from 2012). Specifically, we are frequently seeing the compute host being marked down. From log correlation, we can see that when this occurs the relevant compute is always in the middle of executing LibvirtDriver._get_disk_over_committed_size_total(). The reason for this appears to be a long-running libvirt call which is not using tpool.Proxy, and therefore blocks all other greenthreads during execution. We do not yet know why the libvirt call is slow, but we have identified the reason it is not using tpool.Proxy.

Because eventlet, we proxy libvirt calls at the point we create the libvirt connection in libvirt.Host._connect:

    return tpool.proxy_call(
        (libvirt.virDomain, libvirt.virConnect),
        libvirt.openAuth, uri, auth, flags)

This means: run libvirt.openAuth(uri, auth, flags) in a native thread. If the returned object is a libvirt.virDomain or libvirt.virConnect, wrap the returned object in a tpool.Proxy with the same autowrap rules.

There are 2 problems with this.

Firstly, the autowrap list is incomplete. At the very least we need to add libvirt.virNodeDevice, libvirt.virSecret, and libvirt.NWFilter to this list, as we use all of these objects in Nova. Currently none of our interactions with these objects are using the tpool proxy.

Secondly, and the specific root cause of this bug, it doesn't understand lists: https://github.com/eventlet/eventlet/blob/ca8dd0748a1985a409e9a9a517690f46e05cae99/eventlet/tpool.py#L149

In LibvirtDriver._get_disk_over_committed_size_total() we get a list of running libvirt domains with libvirt.Host.list_instance_domains, which calls virConnect.listAllDomains(). listAllDomains() returns a *list* of virDomain, which the above code in tpool doesn't match. Consequently, none of the subsequent virDomain calls use the tpool proxy, which starves all other greenthreads.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1840912
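The list behaviour is easy to demonstrate outside Nova with a small toy, assuming eventlet is installed; the classes below only stand in for libvirt's virConnect/virDomain.

    # Toy demonstration, not Nova code: tpool only autowraps a return value
    # whose own type is in the autowrap tuple, so a *list* of such objects
    # comes back unwrapped and later calls on its members bypass the proxy.
    from eventlet import tpool

    class Domain:                      # stands in for libvirt.virDomain
        def XMLDesc(self, flags=0):
            return '<domain/>'

    class Connection:                  # stands in for libvirt.virConnect
        def lookupByName(self, name):
            return Domain()            # single object: matched and wrapped

        def listAllDomains(self, flags=0):
            return [Domain()]          # list: not matched, members unwrapped

    conn = tpool.Proxy(Connection(), autowrap=(Connection, Domain))

    print(isinstance(conn.lookupByName('x'), tpool.Proxy))    # True
    print(isinstance(conn.listAllDomains()[0], tpool.Proxy))  # False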
[Yahoo-eng-team] [Bug 1836212] Re: libvirt: Failure to recover from failed detach
Yep. The actual error thrown was "Unable to detach from guest transient domain.", which is now "Unable to detach the device from the live config." in master. That RetryDecorator makes this function a whole lot harder to read, but with your explanation it seems that the detach was actually timing out, which is consistent with the underlying problem we eventually discovered.

Thanks! I'll close this out.

** Changed in: nova
   Status: New => Invalid

https://bugs.launchpad.net/bugs/1836212

Title: libvirt: Failure to recover from failed detach

Status in OpenStack Compute (nova): Invalid
[Yahoo-eng-team] [Bug 1836212] [NEW] libvirt: Failure to recover from failed detach
Public bug reported:

    1020162 ERROR root [req-46fbc6c8-de2c-4afb-9f24-9d75947c9a3c 9ccddbb72e2d42b6ab1a31ad48ea21fb 86bea4eb057b412a98402a1b7e1d9222 - - -] Original exception being dropped:
    ['Traceback (most recent call last):\n',
     '  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 390, in _try_detach_device\n    self.detach_device(conf, persistent=persistent, live=live)\n',
     '  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 467, in detach_device\n    self._domain.detachDeviceFlags(device_xml, flags=flags)\n',
     '  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit\n    result = proxy_call(self._autowrap, f, *args, **kwargs)\n',
     '  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call\n    rv = execute(f, *args, **kwargs)\n',
     '  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute\n    six.reraise(c, e, tb)\n',
     '  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker\n    rv = meth(*args, **kwargs)\n',
     '  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1194, in detachDeviceFlags\n    if ret == -1: raise libvirtError (\'virDomainDetachDeviceFlags() failed\', dom=self)\n',
     'libvirtError: invalid argument: no target device vdb\n']

This appears to happen because when we call detach_device_with_retry(live=True) we ultimately call detachDeviceFlags(flags=VIR_DOMAIN_AFFECT_CONFIG | VIR_DOMAIN_AFFECT_LIVE). 'no target device' is the error generated when libvirt failed to remove the device from CONFIG (persistent).

This can happen because detachDeviceFlags(flags=VIR_DOMAIN_AFFECT_CONFIG | VIR_DOMAIN_AFFECT_LIVE) will succeed and remove the device from the CONFIG domain as long as the LIVE domain removal was queued, even though this is an asynchronous operation. Consequently, a subsequent check for the device may return the device because it hasn't yet been (and may never be) removed from the LIVE domain, but it has been removed from the CONFIG domain. This will prevent libvirt from attempting to remove the device from the LIVE domain, and so the detach will never succeed.

** Affects: nova
   Importance: Undecided
   Status: New

** Bug watch added: Red Hat Bugzilla #1669225
   https://bugzilla.redhat.com/show_bug.cgi?id=1669225

https://bugs.launchpad.net/bugs/1836212
[Yahoo-eng-team] [Bug 1821373] [NEW] Most instance actions can be called concurrently
Public bug reported:

A customer reported that they were getting DB corruption if they called shelve twice in quick succession on the same instance.

This should be prevented by the guard in nova.API.shelve, which does:

    instance.task_state = task_states.SHELVING
    instance.save(expected_task_state=[None])

This is intended to act as a robust gate against 2 instance actions happening concurrently. The first will set the task state to SHELVING; the second will fail because the task state is not SHELVING. The comparison is done atomically in db.instance_update_and_get_original(), and should be race free.

However, instance.save() shortcuts if there is no update, and does not call db.instance_update_and_get_original(). Therefore this guard fails if we call the same operation twice:

    instance = get_instance()
    => Returned instance.task_state is None
    instance.task_state = task_states.SHELVING
    instance.save(expected_task_state=[None])
    => task_state was None, now SHELVING, updates = {'task_state': SHELVING}
    => db.instance_update_and_get_original() executes and succeeds

    instance = get_instance()
    => Returned instance.task_state is SHELVING
    instance.task_state = task_states.SHELVING
    instance.save(expected_task_state=[None])
    => task_state was SHELVING, still SHELVING, updates = {}
    => db.instance_update_and_get_original() does not execute, therefore doesn't raise the expected exception

This pattern is common to almost all instance actions in nova api. A quick scan suggests that all of the following actions are affected by this bug, and can therefore all potentially be executed multiple times concurrently for the same instance:

    restore
    force_stop
    start
    backup
    snapshot
    soft reboot
    hard reboot
    rebuild
    revert_resize
    resize
    shelve
    shelve_offload
    unshelve
    pause
    unpause
    suspend
    resume
    rescue
    unrescue
    set_admin_password
    live_migrate
    evacuate

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1821373
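The shortcut can be demonstrated with a self-contained toy model of the save path (illustrative names, not Nova code): when the in-memory value already matches what is being set there is nothing to update, so the guarded compare-and-swap never runs.

    # Toy model of the save() shortcut described above; not Nova code.
    SHELVING = 'shelving'

    class UnexpectedTaskState(Exception):
        pass

    class FakeInstance:
        def __init__(self, db_row):
            self.db_row = db_row                  # stands in for the DB row
            self.task_state = db_row['task_state']

        def save(self, expected_task_state):
            updates = {}
            if self.task_state != self.db_row['task_state']:
                updates['task_state'] = self.task_state
            if not updates:
                return                            # shortcut: guard never runs
            if self.db_row['task_state'] not in expected_task_state:
                raise UnexpectedTaskState(self.db_row['task_state'])
            self.db_row.update(updates)           # atomic compare-and-swap

    db_row = {'task_state': None}

    first = FakeInstance(db_row)
    first.task_state = SHELVING
    first.save(expected_task_state=[None])        # succeeds, persists SHELVING

    second = FakeInstance(db_row)                 # fetched after the update
    second.task_state = SHELVING                  # no net change
    second.save(expected_task_state=[None])       # returns instead of raising
    print('both shelve requests proceeded')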
[Yahoo-eng-team] [Bug 1804811] [NEW] DatabaseAtVersion fixture causes order-related failures in tests using Database fixture
Public bug reported:

The DatabaseAtVersion fixture starts the global TransactionContext, but doesn't set the guard to configure() used by the Database fixture. Consequently, if Database runs after DatabaseAtVersion in the same worker, the subsequent fixture will fail.

An example ordering which fails is:

    nova.tests.unit.db.test_sqlalchemy_migration.TestNewtonCellsCheck.test_upgrade_without_cell0
    nova.tests.unit.db.test_sqlalchemy_migration.TestNewtonCheck.test_pci_device_type_vf_not_migrated

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1804811
[Yahoo-eng-team] [Bug 1804652] [NEW] nova.db.sqlalchemy.migration.db_version is racy
Public bug reported:

db_version() attempts to initialise versioning if the db is not versioned. However, it doesn't consider concurrency, so we can get errors if multiple watchers try to get the db version before the db is initialised.

We are seeing this in practice during tripleo deployments, in a script which waits on multiple controller nodes for db sync to complete.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1804652
[Yahoo-eng-team] [Bug 1803961] [NEW] Nova doesn't call migrate_volume_completion after cinder volume migration
Public bug reported:

Originally reported in Red Hat Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1648931

Create a cinder volume, attach it to a nova instance, and migrate the volume to a different storage host:

    $ cinder create 1 --volume-type foo --name myvol
    $ nova volume-attach myinstance myvol
    $ cinder migrate myvol c-vol2

Everything seems to work correctly, but if we look at myinstance we see that it's now connected to a new volume, and the original volume is still present on the original storage host. This is because nova didn't call cinder's migrate_volume_completion. migrate_volume_completion would have deleted the original volume, and changed the volume id of the new volume to be the same as the original. The result would be that myinstance would appear to be connected to the same volume as before.

Note that there are 2 ways (that I'm aware of) to initiate a cinder volume migration: retype and migrate. AFAICT retype is *not* affected. In fact, I updated the relevant tempest test to try to trip it up and it didn't fail. However, an explicit migrate *is* affected. They are different top-level entry points in cinder, and set different state, which is what triggers the Nova bug.

This appears to be a regression which was introduced by https://review.openstack.org/#/c/456971/ :

    # Yes this is a tightly-coupled state check of what's going on inside
    # cinder, but we need this while we still support old (v1/v2) and
    # new style attachments (v3.44). Once we drop support for old style
    # attachments we could think about cleaning up the cinder-initiated
    # swap volume API flows.
    is_cinder_migration = (
        True if old_volume['status'] in ('retyping', 'migrating') else False)

There's a bug here because AFAICT cinder never sets status to 'migrating' during any operation: it sets migration_status to 'migrating' during both retype and migrate. During retype it sets status to 'retyping', but not during an explicit migrate.

** Affects: nova
   Importance: Undecided
   Assignee: Matthew Booth (mbooth-9)
   Status: In Progress

https://bugs.launchpad.net/bugs/1803961
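Based on the analysis above, a check that covered an explicit migrate as well as a retype would need to look at migration_status. A hedged sketch of that shape follows; the field names are the cinder volume fields discussed in the report, but this is not necessarily the change that merged.

    # Sketch only: detect a cinder-initiated move for both retype and an
    # explicit migrate, using migration_status rather than status alone.
    is_cinder_migration = (
        old_volume['status'] == 'retyping'
        or old_volume.get('migration_status') == 'migrating')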
[Yahoo-eng-team] [Bug 1794333] [NEW] Local delete emits only legacy start and end notifications
Public bug reported:

If the compute api service does a 'local delete', it only emits legacy notifications when the operation starts and ends. If the delete goes to a compute host, the compute host emits both legacy and versioned notifications. This is both inconsistent and a gap in versioned notifications.

It would appear that every caller of compute_utils.notify_about_instance_delete in compute.API fails to emit versioned notifications. I suggest that the best way to fix this will be to fix compute_utils.notify_about_instance_delete, but note that there's also a caller in compute.Manager which emits versioned notifications explicitly.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1794333
[Yahoo-eng-team] [Bug 1780973] [NEW] Failure during live migration leaves BDM with incorrect connection_info
Public bug reported:

_rollback_live_migration doesn't restore connection_info.

** Affects: nova
   Importance: Undecided
   Assignee: Matthew Booth (mbooth-9)
   Status: In Progress

https://bugs.launchpad.net/bugs/1780973
[Yahoo-eng-team] [Bug 1778206] [NEW] Compute leaks volume attachments if we fail in driver.pre_live_migration
Public bug reported:

ComputeManager.pre_live_migration fails to clean up volume attachments if the call to driver.pre_live_migration() fails. There's a try block in there to clean up attachments, but its scope isn't large enough. The result is a volume in a perpetual attaching state.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1778206
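A sketch of the fix shape implied above: widen the try block so that a failure inside driver.pre_live_migration() also rolls back the attachments that were just created. The method names and the driver call signature here are assumptions based on the report, not the merged change.

    from oslo_utils import excutils

    def pre_live_migration(self, context, instance, bdms, migrate_data):
        attachment_ids = []
        try:
            # ... create the new-style volume attachments on the destination,
            #     appending each new id to attachment_ids ...
            # block_device_info, network_info and disk_info are computed
            # earlier in the real method (omitted here).

            # The driver call is now *inside* the same try block, so a failure
            # here also triggers the attachment cleanup below.
            migrate_data = self.driver.pre_live_migration(
                context, instance, block_device_info, network_info,
                disk_info, migrate_data)
        except Exception:
            with excutils.save_and_reraise_exception():
                for attachment_id in attachment_ids:
                    self.volume_api.attachment_delete(context, attachment_id)
        return migrate_data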
[Yahoo-eng-team] [Bug 1777475] Re: Undercloud vm in state error after update of the undercloud.
The Nova fix should be to not call plug_vifs at all during ironic driver initialization. It probably isn't necessary for 'non-local' hypervisors in general, so guessing also Power, Hyper-V, and VMware.

** Also affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1777475

Title: Undercloud vm in state error after update of the undercloud.

Status in OpenStack Compute (nova): New
Status in tripleo: In Progress

Bug description:

Hi, after an update of the undercloud, the undercloud vm is in error:

    [stack@undercloud-0 ~]$ openstack server list
    +--------------------------------------+--------------+--------+------------------------+----------------+------------+
    | ID                                   | Name         | Status | Networks               | Image          | Flavor     |
    +--------------------------------------+--------------+--------+------------------------+----------------+------------+
    | 9f80c38a-9f33-4a18-88e0-b89776e62150 | compute-0    | ERROR  | ctlplane=192.168.24.18 | overcloud-full | compute    |
    | e87efe17-b939-4df2-af0c-8e2effd58c95 | controller-1 | ERROR  | ctlplane=192.168.24.9  | overcloud-full | controller |
    | 5a3ea20c-75e8-49fe-90b6-edad01fc0a48 | controller-2 | ERROR  | ctlplane=192.168.24.13 | overcloud-full | controller |
    | ba0f26e7-ec2c-4e61-be8e-05edf00ce78a | controller-0 | ERROR  | ctlplane=192.168.24.8  | overcloud-full | controller |
    +--------------------------------------+--------------+--------+------------------------+----------------+------------+

Originally found starting there: https://bugzilla.redhat.com/show_bug.cgi?id=1590297#c14

It boils down to an ordering issue between openstack-ironic-conductor and openstack-nova-compute. A simple reproducer is:

    sudo systemctl stop openstack-ironic-conductor
    sudo systemctl restart openstack-nova-compute

on the undercloud.
[Yahoo-eng-team] [Bug 1775418] [NEW] Swap volume of multiattached volume will corrupt data
Public bug reported:

We currently permit the following:

    Create multiattach volumes a and b
    Create servers 1 and 2
    Attach volume a to servers 1 and 2
    swap_volume(server 1, volume a, volume b)

In fact, we have a tempest test which tests exactly this sequence: api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach

The problem is that writes from server 2 during the copy operation on server 1 will continue to hit the underlying storage, but as server 1 doesn't know about them they won't be reflected on the copy on volume b. This will lead to an inconsistent copy, and therefore data corruption on volume b.

Also, this whole flow makes no sense for a multiattached volume, because even if we managed a consistent copy all we've achieved is forking our data between the 2 volumes. The purpose of this call is to allow the operator to move volumes. We need a fundamentally different approach for multiattached volumes.

In the short term we should at least prevent data corruption by preventing swap volume of a multiattached volume. This would also cause the above tempest test to fail, but as I don't believe it's possible to implement the test safely, this would be correct.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1775418
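The short-term mitigation suggested above amounts to a guard in the swap-volume API path. A minimal sketch follows, assuming the cinder volume dict exposes the standard 'multiattach' field; the exception class is purely illustrative and not Nova's real one.

    class MultiattachSwapVolumeNotSupported(Exception):
        """Illustrative only; Nova's real exception class may differ."""

    def check_swap_volume_allowed(old_volume):
        # Refuse the operation outright if the source volume is multiattach.
        if old_volume.get('multiattach'):
            raise MultiattachSwapVolumeNotSupported(old_volume['id'])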
[Yahoo-eng-team] [Bug 1774252] [NEW] Resize confirm fails if nova-compute is restarted after resize
Public bug reported:

Originally reported in RH bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1584315

Reproduced on OSP12 (Pike).

After resizing an instance but before confirm, update_available_resource will fail on the source compute due to bug 1774249. If nova compute is restarted at this point before the resize is confirmed, the update_available_resource periodic task will never have succeeded, and therefore ResourceTracker's compute_nodes dict will not be populated at all.

When confirm calls _delete_allocation_after_move() it will fail with ComputeHostNotFound because there is no entry for the current node in ResourceTracker. The error looks like:

    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [req-4f7d5d63-fc05-46ed-b505-41050d889752 09abbd4893bb45eea8fb1d5e40635339 d4483d13a6ef41b2ae575ddbd0c59141 - default default] [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] Setting instance vm_state to ERROR: ComputeHostNotFound: Compute host compute-1.localdomain could not be found.
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] Traceback (most recent call last):
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7445, in _error_out_instance_on_exception
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]     yield
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3757, in _confirm_resize
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]     migration.source_node)
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3790, in _delete_allocation_after_move
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]     cn_uuid = rt.get_node_uuid(nodename)
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 155, in get_node_uuid
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]     raise exception.ComputeHostNotFound(host=nodename)
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] ComputeHostNotFound: Compute host compute-1.localdomain could not be found.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1774252
[Yahoo-eng-team] [Bug 1774249] [NEW] update_available_resource will raise DiskNotFound after resize but before confirm
Public bug reported:

Originally reported in RH Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1584315

Tested on OSP12 (Pike), but appears to be still present on master. Should only occur if nova compute is configured to use local file instance storage.

    Create instance A on compute X
    Resize instance A to compute Y
      Domain is powered off
      /var/lib/nova/instances/<instance uuid> renamed to <instance uuid>_resize on X
      Domain is *not* undefined

    On compute X:
      update_available_resource runs as a periodic task
      First action is to update self
      rt calls driver.get_available_resource()
      ...calls _get_disk_over_committed_size_total
      ...iterates over all defined domains, including the ones whose disks we renamed
      ...fails because a referenced disk no longer exists

Results in errors in nova-compute.log:

    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager [req-bd52371f-c6ec-4a83-9584-c00c5377acd8 - - - - -] Error updating resources for node compute-0.localdomain.: DiskNotFound: No disk at /var/lib/nova/instances/f3ed9015-3984-43f4-b4a5-c2898052b47d/disk
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager Traceback (most recent call last):
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6695, in update_available_resource_for_node
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 641, in update_available_resource
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5892, in get_available_resource
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     disk_over_committed = self._get_disk_over_committed_size_total()
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7393, in _get_disk_over_committed_size_total
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     config, block_device_info)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7301, in _get_instance_disk_info_from_config
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     dk_size = disk_api.get_allocated_disk_size(path)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 156, in get_allocated_disk_size
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     return images.qemu_img_info(path).disk_size
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     raise exception.DiskNotFound(location=path)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/f3ed9015-3984-43f4-b4a5-c2898052b47d/disk

And resource tracker is no longer updated. We can find lots of these in the gate.

Note that change Icec2769bf42455853cbe686fb30fda73df791b25 nearly mitigates this, but doesn't, because task_state is not set while the instance is awaiting confirm.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1774249
[Yahoo-eng-team] [Bug 1767363] [NEW] Deleting 2 instances with a common multi-attached volume can leave the volume attached
Public bug reported:

CAVEAT: The following is only from code inspection. I have not reproduced the issue.

During instance delete, we call:

    driver.cleanup():
      foreach volume:
        _disconnect_volume():
          if _should_disconnect_target():
            disconnect_volume()

There is no volume-specific or global locking around _disconnect_volume that I can see in this call graph.

_should_disconnect_target() is intended to check for multi-attached volumes on a single host, to prevent a volume being disconnected while it is still in use by another instance. It does:

    volume = cinder->get_volume()
    connection_count = count of volume.attachments where instance is on this host

As there is no locking between the above operation and the subsequent disconnect_volume(), 2 simultaneous calls to _disconnect_volume() can both return False from _should_disconnect_target(). Not only this, but as this involves both a slow call out to cinder and a db lookup, this is likely to be easily hit in practice, for example by an orchestration tool mass-deleting instances.

Also note that there are many call paths which call _disconnect_volume() apart from cleanup(), so there are likely numerous other potential interactions here.

The result would be that all attachments are deleted, but the volume remains attached to the host.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1767363
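The timing window is a classic check-then-act race. The toy below (not Nova code, no real cinder involved) shows how two concurrent deletes can both observe "another attachment exists on this host" and both skip the disconnect:

    # Toy model of the race described above: both "deletes" read the
    # attachment list before either has removed its own attachment, so both
    # conclude the other still needs the target and skip the disconnect.
    attachments = {'instance-1', 'instance-2'}   # both on this host
    host_connected = True

    def should_disconnect(my_instance):
        # Stands in for _should_disconnect_target(): are there any *other*
        # attachments on this host?
        return len(attachments - {my_instance}) == 0

    # Interleaving: both checks happen before either attachment is deleted.
    decisions = {i: should_disconnect(i) for i in sorted(attachments)}

    for instance, disconnect in decisions.items():
        attachments.discard(instance)       # attachment deleted
        if disconnect:
            host_connected = False          # disconnect_volume()

    print(attachments)       # set(): all attachments gone
    print(host_connected)    # True: the host is still connected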
[Yahoo-eng-team] [Bug 1754716] [NEW] Disconnect volume on live migration source fails if initialize_connection doesn't return identical output
Public bug reported: During live migration we update bdm.connection_info for attached volumes in pre_live_migration to reflect the new connection on the destination node. This means that after migration completes we no longer have a reference to the original connection_info to do the detach on the source host, so we have to re-fetch it with a second call to initialize_connection before calling disconnect. Unfortunately the cinder driver interface does not strictly require that multiple calls to initialize_connection will return consistent results. Although they normally do in practice, there is at least one cinder driver (delliscsi) which doesn't. This results in a failure to disconnect on the source host post migration. ** Affects: nova Importance: Undecided Assignee: Matthew Booth (mbooth-9) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1754716 Title: Disconnect volume on live migration source fails if initialize_connection doesn't return identical output Status in OpenStack Compute (nova): In Progress Bug description: During live migration we update bdm.connection_info for attached volumes in pre_live_migration to reflect the new connection on the destination node. This means that after migration completes we no longer have a reference to the original connection_info to do the detach on the source host, so we have to re-fetch it with a second call to initialize_connection before calling disconnect. Unfortunately the cinder driver interface does not strictly require that multiple calls to initialize_connection will return consistent results. Although they normally do in practice, there is at least one cinder driver (delliscsi) which doesn't. This results in a failure to disconnect on the source host post migration. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1754716/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
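A simplified sketch of the alternative implied by bug 1754716 above: keep the connection_info that was actually used to attach on the source, rather than re-calling initialize_connection after the migration and hoping the result matches. All names here are invented for illustration.

    def live_migrate_volume(bdm, dest_connection_info, migrate, disconnect):
        # bdm: dict-like block device mapping with a 'connection_info' entry.
        source_connection_info = bdm['connection_info']

        # pre_live_migration overwrites this with the destination's
        # connection details for use after the migration completes...
        bdm['connection_info'] = dest_connection_info
        migrate()

        # ...so detach on the source using the saved original, not a fresh
        # (and possibly different) initialize_connection result.
        disconnect(source_connection_info)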
[Yahoo-eng-team] [Bug 1744079] [NEW] disk over-commit still not correctly calculated during live migration
Public bug reported: Change I8a705114d47384fcd00955d4a4f204072fed57c2 (written by me... sigh) addressed a bug which prevented live migration to a target host with overcommitted disk when made with microversion <2.25. It achieved this, but the fix is still not correct. We now do: if disk_over_commit: disk_available_gb = dst_compute_info['local_gb'] Unfortunately local_gb is *total* disk, not available disk. We actually want free_disk_gb. Fun fact: due to the way we calculate this for filesystems, without taking into account reserved space, this can also be negative. The test we're currently running is: could we fit this guest's allocated disks on the target if the target disk was empty. This is at least better than it was before, as we don't spuriously fail early. In fact, we're effectively disabling a test which is disabled for microversion >=2.25 anyway. IOW we should fix it, but it's probably not a high priority. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1744079 Title: disk over-commit still not correctly calculated during live migration Status in OpenStack Compute (nova): New Bug description: Change I8a705114d47384fcd00955d4a4f204072fed57c2 (written by me... sigh) addressed a bug which prevented live migration to a target host with overcommitted disk when made with microversion <2.25. It achieved this, but the fix is still not correct. We now do: if disk_over_commit: disk_available_gb = dst_compute_info['local_gb'] Unfortunately local_gb is *total* disk, not available disk. We actually want free_disk_gb. Fun fact: due to the way we calculate this for filesystems, without taking into account reserved space, this can also be negative. The test we're currently running is: could we fit this guest's allocated disks on the target if the target disk was empty. This is at least better than it was before, as we don't spuriously fail early. In fact, we're effectively disabling a test which is disabled for microversion >=2.25 anyway. IOW we should fix it, but it's probably not a high priority. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1744079/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
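A sketch of the check bug 1744079 above says was intended, with free_disk_gb substituted for local_gb. The 'free_disk_gb' key comes from the report itself; the 'disk_available_least' key in the else branch is assumed from the existing non-over-commit code, so treat this as illustrative rather than the merged fix.

    def disk_available_gb_for_check(dst_compute_info, disk_over_commit):
        if disk_over_commit:
            # free_disk_gb is the space actually available on the target;
            # local_gb is the total disk size, so using it only tests whether
            # the guest would fit on an empty disk.
            return dst_compute_info['free_disk_gb']
        # Non-over-commit branch assumed unchanged.
        return dst_compute_info['disk_available_least']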
[Yahoo-eng-team] [Bug 1719362] [NEW] libvirt: Data corruptor live migrating BFV instance with config disk
Public bug reported: When live migrating a BFV instance with a config disk, the API currently requires block migration to be specified due to the local storage requirement. This doesn't make sense on a number of levels. Before calling migrateToURI3() in this case, the libvirt driver filters out all disks which it shouldn't migrate, which is both of them: the config drive because it's read-only and we already copied it with scp, and the root disk because it's a volume. It calls migrateToURI3() with an empty migrate_disks in params, and VIR_MIGRATE_NON_SHARED_INC in flags (because block-migration). There's a quirk in the behaviour of the libvirt python bindings here: it doesn't distinguish between an empty migrate_disks list, and no migrate_disks list. Both use the default behaviour of "block migrate all writable disks". This will include the attached root volume. As the root volume is simultaneously attached to both ends of the migration, one of which is running the guest, this is a data corruptor. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1719362 Title: libvirt: Data corruptor live migrating BFV instance with config disk Status in OpenStack Compute (nova): New Bug description: When live migrating a BFV instance with a config disk, the API currently requires block migration to be specified due to the local storage requirement. This doesn't make sense on a number of levels. Before calling migrateToURI3() in this case, the libvirt driver filters out all disks which it shouldn't migrate, which is both of them: the config drive because it's read-only and we already copied it with scp, and the root disk because it's a volume. It calls migrateToURI3() with an empty migrate_disks in params, and VIR_MIGRATE_NON_SHARED_INC in flags (because block-migration). There's a quirk in the behaviour of the libvirt python bindings here: it doesn't distinguish between an empty migrate_disks list, and no migrate_disks list. Both use the default behaviour of "block migrate all writable disks". This will include the attached root volume. As the root volume is simultaneously attached to both ends of the migration, one of which is running the guest, this is a data corruptor. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1719362/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
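A sketch of the quirk described in bug 1719362 above and one way to avoid tripping over it. The libvirt-python call and flag are real; the surrounding logic is illustrative only and is not the Nova fix, and whether dropping VIR_MIGRATE_NON_SHARED_INC is the right answer is exactly the design question the report raises.

    import libvirt

    def start_migration(dom, dest_uri, migrate_disks, flags, params=None):
        params = dict(params or {})
        if migrate_disks:
            params['migrate_disks'] = list(migrate_disks)
        else:
            # An empty 'migrate_disks' list is treated by the bindings the
            # same as omitting it: "block migrate every writable disk",
            # which here would include the attached root volume. If there is
            # nothing left to block migrate, don't request incremental block
            # migration at all.
            flags &= ~libvirt.VIR_MIGRATE_NON_SHARED_INC
        return dom.migrateToURI3(dest_uri, params, flags)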
[Yahoo-eng-team] [Bug 1718439] Re: Apparent lack of locking in conductor logs
After some brief discussion in #openstack-nova I've moved this to oslo.log. The issue here appears to be that we spawn multiple separate conductor processes writing to the same nova-conductor.log file. We don't want to stop doing this, as it would break people. It seems that by default python logging uses thread locks rather than external locks: https://docs.python.org/2/library/multiprocessing.html#logging Suggest the fix might be to explicitly use multiprocessing.get_logger(), or at least provide an option to do this when we know it's required. ** Project changed: nova => oslo.log -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1718439 Title: Apparent lack of locking in conductor logs Status in oslo.log: New Bug description: I'm looking at conductor logs generated by a customer running RH OSP 10 (Newton). The logs appear to be corrupt in a manner I'd expect to see if 2 processes were writing to the same log file simultaneously. For example: === 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.689 120626 ERROR nova.s2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db Traceback (most recent call last): === Notice how a new log starts part way through the second line above. This also results in log entries in the wrong sort order: === 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 88, in Connect 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return Connection(*args, **kwargs) 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 657, in __init__ === Note how the first 2 lines are after the last 2 by timestamp, as presumably the last 2 are a continuation of a previous log entry. This confounds merge sorting of log files, which is exceptionally useful. We also see truncated lines with no timestamp which aren't a continuation of the previous line: === 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db elf._execute_and_instances(context) === I strongly suspect this is because multiple conductors are running in separate processes, and are therefore not benefiting from the thread safety of python's logging. To manage notifications about this bug go to: https://bugs.launchpad.net/oslo.log/+bug/1718439/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
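A minimal, hypothetical illustration of serialising writes to a shared log file across worker processes, separate from the multiprocessing.get_logger() suggestion above. It assumes the workers are forked from a parent that creates the lock first, which matches the multi-process conductor setup described here; it is not an oslo.log patch.

    import logging
    import multiprocessing

    # Created before the workers fork, so every process shares the same lock.
    _LOG_LOCK = multiprocessing.Lock()

    class LockedFileHandler(logging.FileHandler):
        """Serialise emits to one log file across forked worker processes."""

        def emit(self, record):
            with _LOG_LOCK:
                super(LockedFileHandler, self).emit(record)

    # Usage sketch: attach in place of a plain FileHandler, e.g.
    # logging.getLogger().addHandler(LockedFileHandler('nova-conductor.log'))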
[Yahoo-eng-team] [Bug 1718439] [NEW] Apparent lack of locking in conductor logs
Public bug reported: I'm looking at conductor logs generated by a customer running RH OSP 10 (Newton). The logs appear to be corrupt in a manner I'd expect to see if 2 processes were writing to the same log file simultaneously. For example: === 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.689 120626 ERROR nova.s2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db Traceback (most recent call last): === Notice how a new log starts part way through the second line above. This also results in log entries in the wrong sort order: === 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 88, in Connect 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return Connection(*args, **kwargs) 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 657, in __init__ === Note how the first 2 lines are after the last 2 by timestamp, as presumably the last 2 are a continuation of a previous log entry. This confounds merge sorting of log files, which is exceptionally useful. We also see truncated lines with no timestamp which aren't a continuation of the previous line: === 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db elf._execute_and_instances(context) === I strongly suspect this is because multiple conductors are running in separate processes, and are therefore not benefiting from the thread safety of python's logging. ** Affects: nova Importance: Undecided Status: New ** Summary changed: - Apparent lack of locking in logger + Apparent lack of locking in conductor logs -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1718439 Title: Apparent lack of locking in conductor logs Status in OpenStack Compute (nova): New Bug description: I'm looking at conductor logs generated by a customer running RH OSP 10 (Newton). The logs appear to be corrupt in a manner I'd expect to see if 2 processes were writing to the same log file simultaneously. For example: === 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.689 120626 ERROR nova.s2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db Traceback (most recent call last): === Notice how a new log starts part way through the second line above. 
This also results in log entries in the wrong sort order: === 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 88, in Connect 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return Connection(*args, **kwargs) 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 657, in __init__ === Note how the first 2 lines are after the last 2 by timestamp, as presumably the last 2 are a continuation of a previous log entry. This confounds merge sorting of log files, which is exceptionally useful. We also see truncated lines with no timestamp which aren't a continuation of the previous line: === 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db elf._execute_and_instances(context) === I strongly suspect this is because multiple conductors are running in separate processes, and are therefore not benefiting from the thread safety of python's logging. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1718439/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1688228] [NEW] Failure in resize_instance after cast to finish_resize still sets instance error state
Public bug reported: This is from code inspection only. ComputeManager.resize_instance does: with self._error_out_instance_on_exception(context, instance, quotas=quotas): ...stuff... self.compute_rpcapi.finish_resize(context, instance, migration, image, disk_info, migration.dest_compute, reservations=quotas.reservations) ... Responsibility for the instance has now been punted to the destination, but... self._notify_about_instance_usage(context, instance, "resize.end", network_info=network_info) compute_utils.notify_about_instance_action(context, instance, self.host, action=fields.NotificationAction.RESIZE, phase=fields.NotificationPhase.END) self.instance_events.clear_events_for_instance(instance) The problem is that a failure in anything after the cast to finish_resize will cause the instance to be put in an error state and its quotas rolled back. This would not be correct, as any error here would be purely ephemeral. The resize operation will continue on the destination regardless, so this would almost certainly result in an inconsistent state. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1688228 Title: Failure in resize_instance after cast to finish_resize still sets instance error state Status in OpenStack Compute (nova): New Bug description: This is from code inspection only. ComputeManager.resize_instance does: with self._error_out_instance_on_exception(context, instance, quotas=quotas): ...stuff... self.compute_rpcapi.finish_resize(context, instance, migration, image, disk_info, migration.dest_compute, reservations=quotas.reservations) ... Responsibility for the instance has now been punted to the destination, but... self._notify_about_instance_usage(context, instance, "resize.end", network_info=network_info) compute_utils.notify_about_instance_action(context, instance, self.host, action=fields.NotificationAction.RESIZE, phase=fields.NotificationPhase.END) self.instance_events.clear_events_for_instance(instance) The problem is that a failure in anything after the cast to finish_resize will cause the instance to be put in an error state and its quotas rolled back. This would not be correct, as any error here would be purely ephemeral. The resize operation will continue on the destination regardless, so this would almost certainly result in an inconsistent state. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1688228/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1686703] [NEW] Error in finish_migration results in image deletion on source with no copy
Public bug reported: ML post describing the issue here: http://lists.openstack.org/pipermail/openstack-dev/2017-April/115989.html User was resizing an instance whose glance image had been deleted. An ssh failure occurred in finish_migration, which runs on the destination, attempting to copy the image out of the image cache on the source. This left the instance and migration in an error state on the destination, but with no copy of the image on the destination. Cache manager later ran on the source and expired the image from the image cache there, leaving no remaining copies. At this point the user's instance was unrecoverable. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1686703 Title: Error in finish_migration results in image deletion on source with no copy Status in OpenStack Compute (nova): New Bug description: ML post describing the issue here: http://lists.openstack.org/pipermail/openstack-dev/2017-April/115989.html User was resizing an instance whose glance image had been deleted. An ssh failure occurred in finish_migration, which runs on the destination, attempting to copy the image out of the image cache on the source. This left the instance and migration in an error state on the destination, but with no copy of the image on the destination. Cache manager later ran on the source and expired the image from the image cache there, leaving no remaining copies. At this point the user's instance was unrecoverable. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1686703/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1669844] [NEW] Host failure shortly after image download can result in data corruption
Public bug reported: GlanceImageServiceV2.download() ensures its downloaded file is closed before releasing for use by an external qemu process, but it doesn't do an fdatasync(). This means that the downloaded file may be temporarily in the host kernel's cache rather than on disk, which means there is a short window in which a host crash will lose the contents of the backing file, despite it being in use by a running instance. Disclaimer: I'm not personally able to reproduce this, but it looks sane and our QE team is reliably hitting it. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1669844 Title: Host failure shortly after image download can result in data corruption Status in OpenStack Compute (nova): New Bug description: GlanceImageServiceV2.download() ensures its downloaded file is closed before releasing for use by an external qemu process, but it doesn't do an fdatasync(). This means that the downloaded file may be temporarily in the host kernel's cache rather than on disk, which means there is a short window in which a host crash will lose the contents of the backing file, despite it being in use by a running instance. Disclaimer: I'm not personally able to reproduce this, but it looks sane and our QE team is reliably hitting it. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1669844/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
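A sketch of the missing step described in bug 1669844 above: flush the downloaded image to stable storage before anything else treats it as a usable backing file. The function and parameter names are invented; os.fdatasync() is the point.

    import os

    def write_image(chunks, path):
        with open(path, 'wb') as f:
            for chunk in chunks:
                f.write(chunk)
            # Push data to disk before close, so a host crash shortly after
            # the download cannot leave a running instance with an empty or
            # truncated backing file.
            f.flush()
            os.fdatasync(f.fileno())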
[Yahoo-eng-team] [Bug 1669400] [NEW] delete_instance_metadata and update_instance_metadata are permitted during an ongoing task
Public bug reported: Note: this is exclusively from code inspection. delete_instance_metadata and update_instance_metadata in ComputeManager are both guarded by: @check_instance_state(vm_state=[vm_states.ACTIVE, vm_states.PAUSED, vm_states.SUSPENDED, vm_states.STOPPED], task_state=None) The problem is the task_state=None which, despite appearances, actually explicitly disables the task_state check, i.e. it does not explicitly check that task_state is None. This was introduced in change I70212879 and does not appear to have been deliberate. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1669400 Title: delete_instance_metadata and update_instance_metadata are permitted during an ongoing task Status in OpenStack Compute (nova): New Bug description: Note: this is exclusively from code inspection. delete_instance_metadata and update_instance_metadata in ComputeManager are both guarded by: @check_instance_state(vm_state=[vm_states.ACTIVE, vm_states.PAUSED, vm_states.SUSPENDED, vm_states.STOPPED], task_state=None) The problem is the task_state=None which, despite appearances, actually explicitly disables the task_state check, i.e. it does not explicitly check that task_state is None. This was introduced in change I70212879 and does not appear to have been deliberate. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1669400/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
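A stand-in decorator showing the distinction bug 1669400 above relies on: passing task_state=None means "do not check the task state at all", whereas a tuple such as (None,) means "the instance must not have a task in progress". This mimics the semantics described in the report; it is not the real nova.compute.api decorator.

    def check_instance_state(vm_state=None, task_state=(None,)):
        # Simplified stand-in for illustration only.
        def decorator(fn):
            def wrapper(instance, *args, **kwargs):
                if vm_state is not None and instance['vm_state'] not in vm_state:
                    raise RuntimeError('wrong vm_state: %s' % instance['vm_state'])
                # task_state=None disables this check entirely, which is what
                # the metadata calls above end up doing; task_state=(None,)
                # would actually require "no task in progress".
                if task_state is not None and (
                        instance['task_state'] not in task_state):
                    raise RuntimeError('instance busy: %s' % instance['task_state'])
                return fn(instance, *args, **kwargs)
            return wrapper
        return decorator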
[Yahoo-eng-team] [Bug 1662483] [NEW] detach_volume races with delete
Public bug reported: If a client does: nova volume-detach foo vol nova delete foo Assuming the volume-detach takes a moment, which it normally does, the delete will race with it and also attempt to detach the same volume. It's possible there are no side effects from this other than untidy log messages, but this is difficult to prove. I found this looking through CI logs. Note that volume-detach can also race with other instance operations, including itself. I'm almost certain that if you poke hard enough you'll find some combination that breaks things badly. ** Affects: nova Importance: Undecided Assignee: Matthew Booth (mbooth-9) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1662483 Title: detach_volume races with delete Status in OpenStack Compute (nova): In Progress Bug description: If a client does: nova volume-detach foo vol nova delete foo Assuming the volume-detach takes a moment, which it normally does, the delete will race with it and also attempt to detach the same volume. It's possible there are no side effects from this other than untidy log messages, but this is difficult to prove. I found this looking through CI logs. Note that volume-detach can also race with other instance operations, including itself. I'm almost certain that if you poke hard enough you'll find some combination that breaks things badly. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1662483/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1648109] [NEW] Libvirt LVM storage backend doesn't initialise filesystems of ephemeral disks
Public bug reported: N.B. This is from code inspection only. When creating an LVM-backed instance with an ephemeral disk, the ephemeral disk will not be initialised with the requested filesystem. This is because Image.cache() wraps the _create_ephemeral callback in fetch_func_sync, which will not call _create_ephemeral if the target already exists. Because the Lvm backend must create the disk first, this is never called. ** Affects: nova Importance: Undecided Assignee: Matthew Booth (mbooth-9) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1648109 Title: Libvirt LVM storage backend doesn't initialise filesystems of ephemeral disks Status in OpenStack Compute (nova): In Progress Bug description: N.B. This is from code inspection only. When creating an LVM-backed instance with an ephemeral disk, the ephemeral disk will not be initialised with the requested filesystem. This is because Image.cache() wraps the _create_ephemeral callback in fetch_func_sync, which will not call _create_ephemeral if the target already exists. Because the Lvm backend must create the disk first, this is never called. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1648109/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1623497] [NEW] Booting Ceph instance using Ceph glance doesn't resize root disk to flavor size
Public bug reported: This bug is purely from code inspection; I haven't replicated it on a running system. Change I46b5658efafe558dd6b28c9910fb8fde830adec0 added a resize check that the backing file exists before checking its size. Unfortunately we forgot that Rbd overrides get_disk_size(path), and ignores the path argument, which means it would previously not have failed even when the given path didn't exist. Additionally, the callback function passed to cache() by driver will also ignore its path argument, and therefore not write to the image cache, when cloning to a ceph instance from a ceph glance store (see the section starting if backend.SUPPORTS_CLONE in driver._create_and_inject_local_root). Consequently, when creating a ceph instance using a ceph glance store: 1. 'base' will not exist in the image cache 2. get_disk_size(base) will return the correct value anyway We broke this with change I46b5658efafe558dd6b28c9910fb8fde830adec0. ** Affects: nova Importance: Undecided Status: New ** Tags: newton-rc-potential ** Tags added: newton-rc-potential -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1623497 Title: Booting Ceph instance using Ceph glance doesn't resize root disk to flavor size Status in OpenStack Compute (nova): New Bug description: This bug is purely from code inspection; I haven't replicated it on a running system. Change I46b5658efafe558dd6b28c9910fb8fde830adec0 added a resize check that the backing file exists before checking its size. Unfortunately we forgot that Rbd overrides get_disk_size(path), and ignores the path argument, which means it would previously not have failed even when the given path didn't exist. Additionally, the callback function passed to cache() by driver will also ignore its path argument, and therefore not write to the image cache, when cloning to a ceph instance from a ceph glance store (see the section starting if backend.SUPPORTS_CLONE in driver._create_and_inject_local_root). Consequently, when creating a ceph instance using a ceph glance store: 1. 'base' will not exist in the image cache 2. get_disk_size(base) will return the correct value anyway We broke this with change I46b5658efafe558dd6b28c9910fb8fde830adec0. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1623497/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1619606] [NEW] snapshot_volume_backed races, could result in data corruption
Public bug reported: snapshot_volume_backed() in compute.API does not set a task_state during execution. However, in essence it does: if vm_state == ACTIVE: quiesce() snapshot() if vm_state == ACTIVE: unquiesce() There is no exclusion here, though, which means a user could do: quiesce() quiesce() snapshot() snapshot() unquiesce()--snapshot() now running after unquiesce -> corruption unquiesce() or: suspend() snapshot() NO QUIESCE (we're suspended) snapshot() resume() --snapshot() now running after resume -> corruption Same goes for stop/start. Note that snapshot_volume_backed() is a separate top-level entry point from snapshot(). snapshot() does not suffer from this problem, because it atomically sets the task state to IMAGE_SNAPSHOT_PENDING when running, which prevents the user from performing a concurrent operation on the instance. I suggest that snapshot_volume_backed() should do the same. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1619606 Title: snapshot_volume_backed races, could result in data corruption Status in OpenStack Compute (nova): New Bug description: snapshot_volume_backed() in compute.API does not set a task_state during execution. However, in essence it does: if vm_state == ACTIVE: quiesce() snapshot() if vm_state == ACTIVE: unquiesce() There is no exclusion here, though, which means a user could do: quiesce() quiesce() snapshot() snapshot() unquiesce()--snapshot() now running after unquiesce -> corruption unquiesce() or: suspend() snapshot() NO QUIESCE (we're suspended) snapshot() resume() --snapshot() now running after resume -> corruption Same goes for stop/start. Note that snapshot_volume_backed() is a separate top-level entry point from snapshot(). snapshot() does not suffer from this problem, because it atomically sets the task state to IMAGE_SNAPSHOT_PENDING when running, which prevents the user from performing a concurrent operation on the instance. I suggest that snapshot_volume_backed() should do the same. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1619606/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
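A toy illustration of the guard suggested at the end of bug 1619606 above: refuse to start unless the task state can move from "nothing in progress" to a pending state, mirroring what snapshot() does with IMAGE_SNAPSHOT_PENDING. In Nova the atomicity comes from a compare-and-update on the instance's database row; a thread lock and a dict stand in for that here, and the names are invented.

    import threading

    _state_lock = threading.Lock()

    def begin_snapshot(instance):
        # instance: dict-like with a 'task_state' entry.
        with _state_lock:
            if instance['task_state'] is not None:
                raise RuntimeError('instance busy: %s' % instance['task_state'])
            instance['task_state'] = 'image_snapshot_pending'

    def end_snapshot(instance):
        with _state_lock:
            instance['task_state'] = None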
[Yahoo-eng-team] [Bug 1597754] [NEW] Unable to boot instance using UML
Public bug reported: CAVEAT: This is from code inspection only. Change I931421ea moved the following snippet of code: if CONF.libvirt.virt_type == 'uml': libvirt_utils.chown(image('disk').path, 'root') from the bottom of _create_image to the top. The problem is, the new location is before the creation of the root disk. This means that on initial creation we will run libvirt_utils.chown on a path which hasn't been created yet, which will cause an exception. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1597754 Title: Unable to boot instance using UML Status in OpenStack Compute (nova): New Bug description: CAVEAT: This is from code inspection only. Change I931421ea moved the following snippet of code: if CONF.libvirt.virt_type == 'uml': libvirt_utils.chown(image('disk').path, 'root') from the bottom of _create_image to the top. The problem is, the new location is before the creation of the root disk. This means that on initial creation we will run libvirt_utils.chown on a path which hasn't been created yet, which will cause an exception. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1597754/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1594377] [NEW] resize does not resize ephemeral disks
Public bug reported: Nova resize does not resize ephemeral disks. I have tested this with the default qcow2 backend, but I expect it to be true for all backends. I have created 2 flavors: | OS-FLV-DISABLED:disabled | False | | OS-FLV-EXT-DATA:ephemeral | 1 | | disk | 1 | | extra_specs| {} | | id | test-1 | | name | test-1 | | os-flavor-access:is_public | True | | ram| 256| | rxtx_factor| 1.0| | swap | 1 | | vcpus | 1 | and: | OS-FLV-DISABLED:disabled | False | | OS-FLV-EXT-DATA:ephemeral | 2 | | disk | 2 | | extra_specs| {} | | id | test-2 | | name | test-2 | | os-flavor-access:is_public | True | | ram| 512| | rxtx_factor| 1.0| | swap | 2 | | vcpus | 2 | I boot an instance with flavor test-1 with: $ nova boot --flavor test-1 --image cirros foo It creates instance directory 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c containing (amongst non-disk files) disk, disk.eph0, disk.swap, and disk.config. disk.config is not relevant here. I check the sizes of each of these disks: instances]$ for disk in disk disk.eph0 disk.swap; do qemu-img info 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/$disk; done image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk file format: qcow2 virtual size: 1.0G (1073741824 bytes) disk size: 10M cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/1ba6fbdbe52377ff7e075c3317a48205ac6c28c4 Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk.eph0 file format: qcow2 virtual size: 1.0G (1073741824 bytes) disk size: 324K cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/ephemeral_1_40d1d2c Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk.swap file format: qcow2 virtual size: 1.0M (1048576 bytes) disk size: 196K cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/swap_1 Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false I resize foo with: $ nova resize foo test-2 --poll I check the sizes again: instances]$ for disk in disk disk.eph0 disk.swap; do qemu-img info 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/$disk; done image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk file format: qcow2 virtual size: 2.0G (2147483648 bytes) disk size: 26M cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/1ba6fbdbe52377ff7e075c3317a48205ac6c28c4 Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk.eph0 file format: qcow2 virtual size: 1.0G (1073741824 bytes) disk size: 384K cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/ephemeral_1_40d1d2c Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk.swap file format: qcow2 virtual size: 2.0M (2097152 bytes) disk size: 196K cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/swap_2 Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false Note that the root and swap disks have been resized, but the ephemeral disk has not. This is caused by 2 bugs. Firstly, there is some code in finish_migration in the libvirt driver which purports to resize disks. 
This code is actually a no-op, because disk resizing has already been done by _create_image, which called cache() with the correct size, and therefore did the resizing. However, as noted in a comment, the no-op code would not have covered our ephemeral disk anyway, as it only loops over 'disk.local', which is the legacy disk naming. Secondly, _create_image does not iterate over ephemeral disks at all when called by finish_migration, because finish_migration explicitly passes block_device_info=None. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1594377 Title: resize does not resize ephemeral disks Status in OpenStack Compute (nova): New Bug description: Nova resize does not resize ephemeral disks. I have tested this with the default qcow2 backend, but I expect it to be true for all backends. I have created 2 flavors: | OS-FLV-DISABLED:disabled | Fal
[Yahoo-eng-team] [Bug 1593155] [NEW] over_committed_disk_size is wrong for sparse flat files
Public bug reported: The libvirt driver creates flat disks as sparse by default. However, it always returns over_committed_disk_size=0 for flat disks in _get_instance_disk_info(). This incorrect data ends up being reported to the scheduler in the libvirt driver's get_available_resource() via _get_disk_over_committed_size_total(). _get_instance_disk_info() should use allocated blocks, not file size, when calculating over_committed_disk_size for flat disks. ** Affects: nova Importance: Undecided Status: New ** Tags: libvirt low-hanging-fruit -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1593155 Title: over_committed_disk_size is wrong for sparse flat files Status in OpenStack Compute (nova): New Bug description: The libvirt driver creates flat disks as sparse by default. However, it always returns over_committed_disk_size=0 for flat disks in _get_instance_disk_info(). This incorrect data ends up being reported to the scheduler in the libvirt driver's get_available_resource() via _get_disk_over_committed_size_total(). _get_instance_disk_info() should use allocated blocks, not file size, when calculating over_committed_disk_size for flat disks. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1593155/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
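A sketch of the calculation bug 1593155 above asks for: compare the virtual size against blocks actually allocated on disk rather than against the file length, which for a sparse flat file equals the virtual size. Names are illustrative.

    import os

    def over_committed_bytes(path, virt_disk_size):
        # st_blocks is in 512-byte units, so this is the space the sparse
        # file really occupies on the filesystem.
        allocated = os.stat(path).st_blocks * 512
        return max(0, virt_disk_size - allocated)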
[Yahoo-eng-team] [Bug 1590693] Re: libvirt's use of driver.get_instance_disk_info() is generally problematic
This was intended to be a low hanging fruit bug, but it doesn't meet the criteria. Closing, as it has no other purpose. ** Changed in: nova Status: Incomplete => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1590693 Title: libvirt's use of driver.get_instance_disk_info() is generally problematic Status in OpenStack Compute (nova): Invalid Bug description: The nova.virt.driver 'interface' defines a get_instance_disk_info method, which is called by compute manager to get disk info during live migration to get the source hypervisor's internal representation of disk info and pass it directly to the target hypervisor over rpc. To compute manager this is an opaque blob of stuff which only the driver understands, which is presumably why json was chosen. There are a couple of problems with it. This is a useful method within the libvirt driver, which uses it fairly liberally. However, the method returns a json blob. Every use of it internal to the libvirt driver first json encodes it in get_instance_disk_info, then immediately decodes it again, which is inefficient... except 2 uses of it in migrate_disk_and_power_off and check_can_live_migrate_source, which don't decode it and assume it's a dict. These are both broken, which presumably means something relating to migration of volume-backed instances is broken. The libvirt driver should not use this internally. We can have a wrapper method to do the json encoding for compute manager, and internally use the unencoded data directly. Secondly, we're passing an unversioned blob of data over rpc. We should probably turn this data into a versioned object. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1590693/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1590693] [NEW] libvirt's use of driver.get_instance_disk_info() is generally problematic
Public bug reported: The nova.virt.driver 'interface' defines a get_instance_disk_info method, which is called by compute manager to get disk info during live migration to get the source hypervisor's internal representation of disk info and pass it directly to the target hypervisor over rpc. To compute manager this is an opaque blob of stuff which only the driver understands, which is presumably why json was chosen. There are a couple of problems with it. This is a useful method within the libvirt driver, which uses it fairly liberally. However, the method returns a json blob. Every use of it internal to the libvirt driver first json encodes it in get_instance_disk_info, then immediately decodes it again, which is inefficient... except 2 uses of it in migrate_disk_and_power_off and check_can_live_migrate_source, which don't decode it and assume it's a dict. These are both broken, which presumably means something relating to migration of volume-backed instances is broken. The libvirt driver should not use this internally. We can have a wrapper method to do the json encoding for compute manager, and internally use the unencoded data data directly. Secondly, we're passing an unversioned blob of data over rpc. We should probably turn this data into a versioned object. ** Affects: nova Importance: Undecided Status: New ** Tags: low-hanging-fruit ** Description changed: The nova.virt.driver 'interface' defines a get_instance_disk_info method, which is called by compute manager to get disk info during live migration to get the source hypervisor's internal representation of disk info and pass it directly to the target hypervisor over rpc. To compute manager this is an opaque blob of stuff which only the driver understands, which is presumably why json was chosen. There are a couple of problems with it. This is a useful method within the libvirt driver, which uses it fairly liberally. However, the method returns a json blob. Every use of it internal to the libvirt driver first json encodes it in get_instance_disk_info, then immediately decodes it again, which is - efficient. Except 2 uses of it in migrate_disk_and_power_off and + inefficient... except 2 uses of it in migrate_disk_and_power_off and check_can_live_migrate_source, which don't decode it and assume it's a dict. These are both broken, which presumably means something relating to migration of volume-backed instances is broken. The libvirt driver should not use this internally. We can have a wrapper method to do the json encoding for compute manager, and internally use the unencoded data data directly. Secondly, we're passing an unversioned blob of data over rpc. We should probably turn this data into a versioned object. -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1590693 Title: libvirt's use of driver.get_instance_disk_info() is generally problematic Status in OpenStack Compute (nova): New Bug description: The nova.virt.driver 'interface' defines a get_instance_disk_info method, which is called by compute manager to get disk info during live migration to get the source hypervisor's internal representation of disk info and pass it directly to the target hypervisor over rpc. To compute manager this is an opaque blob of stuff which only the driver understands, which is presumably why json was chosen. There are a couple of problems with it. 
This is a useful method within the libvirt driver, which uses it fairly liberally. However, the method returns a json blob. Every use of it internal to the libvirt driver first json encodes it in get_instance_disk_info, then immediately decodes it again, which is inefficient... except 2 uses of it in migrate_disk_and_power_off and check_can_live_migrate_source, which don't decode it and assume it's a dict. These are both broken, which presumably means something relating to migration of volume-backed instances is broken. The libvirt driver should not use this internally. We can have a wrapper method to do the json encoding for compute manager, and internally use the unencoded data directly. Secondly, we're passing an unversioned blob of data over rpc. We should probably turn this data into a versioned object. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1590693/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
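A sketch of the split suggested in bug 1590693 above: an internal method returning plain Python structures for use inside the driver, and a thin wrapper that JSON-encodes only at the compute manager / RPC boundary. The field names are placeholders, not the driver's actual schema.

    import json

    def _get_instance_disk_info(disks):
        # Internal form: a plain list of dicts, usable directly by driver code.
        return [{'path': d['path'], 'virt_disk_size': d['size']} for d in disks]

    def get_instance_disk_info(disks):
        # External form: the opaque JSON blob the compute manager passes over
        # RPC. Only this wrapper should encode.
        return json.dumps(_get_instance_disk_info(disks))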
[Yahoo-eng-team] [Bug 1581382] [NEW] nova migration-list --status returns no results
Public bug reported: 'nova migration-list --status <status>' returns no results. On further investigation, this is because this status is passed down to db.migration_get_all_by_filters() as unicode, which doesn't handle it correctly. ** Affects: nova Importance: Undecided Assignee: Matthew Booth (mbooth-9) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1581382 Title: nova migration-list --status returns no results Status in OpenStack Compute (nova): In Progress Bug description: 'nova migration-list --status <status>' returns no results. On further investigation, this is because this status is passed down to db.migration_get_all_by_filters() as unicode, which doesn't handle it correctly. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1581382/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1548884] [NEW] libvirt driver converts config drives to qcow2 during resize/migrate
Public bug reported: In finish_migration(), after resize the driver does: if info['type'] == 'raw' and CONF.use_cow_images: self._disk_raw_to_qcow2(info['path']) This ensures that if use_cow_images is set to True, all raw disks will be converted to qcow2. This includes config disks, which isn't the intention here. A second part of this bug is that config disks are then subsequently overwritten, which also doesn't seem to be intentional. This is why this hasn't previously come to light. It is currently just very efficient: we copy the config disk, convert it to qcow2, then overwrite it with a new one. We should stop after the original copy. This code was added here: https://review.openstack.org/#/c/78626/ . I have read the change, the bug it related to, spoken to the original author, and one of the core reviewers. None of us could work out why the above code was there. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1548884 Title: libvirt driver converts config drives to qcow2 during resize/migrate Status in OpenStack Compute (nova): New Bug description: In finish_migration(), after resize the driver does: if info['type'] == 'raw' and CONF.use_cow_images: self._disk_raw_to_qcow2(info['path']) This ensures that if use_cow_images is set to True, all raw disks will be converted to qcow2. This includes config disks, which isn't the intention here. A second part of this bug is that config disks are then subsequently overwritten, which also doesn't seem to be intentional. This is why this hasn't previously come to light. It is currently just very efficient: we copy the config disk, convert it to qcow2, then overwrite it with a new one. We should stop after the original copy. This code was added here: https://review.openstack.org/#/c/78626/ . I have read the change, the bug it related to, spoken to the original author, and one of the core reviewers. None of us could work out why the above code was there. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1548884/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
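One hedged reading of the first half of bug 1548884 above is that the conversion loop should simply leave config drives alone; the sketch below shows that, with the surrounding driver code reduced to a plain function and invented names. Whether skipping the config drive, or removing the conversion entirely as the second half of the report implies, is the right fix is not settled here.

    def convert_raw_disks(disk_info, use_cow_images, raw_to_qcow2):
        # disk_info: list of dicts with 'type' and 'path';
        # raw_to_qcow2: callable performing the conversion.
        for info in disk_info:
            if not (use_cow_images and info['type'] == 'raw'):
                continue
            if info['path'].endswith('disk.config'):
                # Config drives are intentionally raw; converting them (and
                # then regenerating them) is wasted work at best.
                continue
            raw_to_qcow2(info['path'])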
[Yahoo-eng-team] [Bug 1547577] [NEW] ephemeral and swap disks on a single compute share UUIDs
Public bug reported: The libvirt driver caches the output of mkfs and mkswap in the image cache. One consequence of this is that all ephemeral disks of a particular size and format on a single compute will have the same UUID. The same applies to swap disks. These identifiers are intended to be universally unique, but they are not. This is unlikely to be an issue in practice for ephemeral disks, as they will never be shared; however, it is a wart. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1547577 Title: ephemeral and swap disks on a single compute share UUIDs Status in OpenStack Compute (nova): New Bug description: The libvirt driver caches the output of mkfs and mkswap in the image cache. One consequence of this is that all ephemeral disks of a particular size and format on a single compute will have the same UUID. The same applies to swap disks. These identifiers are intended to be universally unique, but they are not. This is unlikely to be an issue in practice for ephemeral disks, as they will never be shared; however, it is a wart. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1547577/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1547582] [NEW] Block migrating an ephemeral or swap disk can result in filesystem corruption when using qcow2
Public bug reported: The libvirt driver uses common backing files for ephemeral and swap disks. These are generated on the local compute host by running mkfs or mkswap as appropriate. The output of these files for a particular size and format is stored in the image cache on the compute host which ran it. When all things are equal, 2 runs of mkfs or mkswap are guaranteed never to produce identical output, because at the very least they have different uuids. When you also consider the potential for different patch levels on different compute hosts, the potential for other differences is also significant. When block migrating an ephemeral disk, the libvirt driver copies the 'overlay' qcow2 from source to dest. Assuming that some other instance on dest also has a similar ephemeral disk, the backing file will already exist on dest. However, it is guaranteed not to be the same as the disk's original backing file for the reasons above. If this works currently, it is either by luck, or because the tiny amount of metadata originally written by mkfs or mkswap is likely to have been overwritten if it has been in use for any amount of time. The libvirt driver should not cache the output of mkfs and mkswap. The space and performance benefits are negligible, but it introduces the potential for data corruption. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1547582 Title: Block migrating an ephemeral or swap disk can result in filesystem corruption when using qcow2 Status in OpenStack Compute (nova): New Bug description: The libvirt driver uses common backing files for ephemeral and swap disks. These are generated on the local compute host by running mkfs or mkswap as appropriate. The output of these files for a particular size and format is stored in the image cache on the compute host which ran it. When all things are equal, 2 runs of mkfs or mkswap are guaranteed never to produce identical output, because at the very least they have different uuids. When you also consider the potential for different patch levels on different compute hosts, the potential for other differences is also significant. When block migrating an ephemeral disk, the libvirt driver copies the 'overlay' qcow2 from source to dest. Assuming that some other instance on dest also has a similar ephemeral disk, the backing file will already exist on dest. However, it is guaranteed not to be the same as the disk's original backing file for the reasons above. If this works currently, it is either by luck, or because the tiny amount of metadata originally written by mkfs or mkswap is likely to have been overwritten if it has been in use for any amount of time. The libvirt driver should not cache the output of mkfs and mkswap. The space and performance benefits are negligible, but it introduces the potential for data corruption. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1547582/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1543181] [NEW] Raw and qcow2 disks are never preallocated on systems with newer util-linux
Public bug reported: imagebackend.Image._can_fallocate tests if fallocate works by running the following command: fallocate -n -l 1 <path>.fallocate_test where <path> exists, but <path>.fallocate_test does not. This command line is copied from the code which actually fallocates a disk. However, while this works on systems with an older version of util-linux, such as RHEL 7, it does not work on systems with a newer version of util-linux, such as Fedora 23. The result of this is that this test will always fail, and preallocation with fallocate will be erroneously disabled. On RHEL 7, which has util-linux-2.23.2-26.el7.x86_64 on my system: $ fallocate -n -l 1 foo $ ls -lh foo -rw-r--r--. 1 mbooth mbooth 0 Feb 8 15:33 foo $ du -sh foo 4.0K foo On Fedora 23, which has util-linux-2.27.1-2.fc23.x86_64 on my system: $ fallocate -n -l 1 foo fallocate: cannot open foo: No such file or directory The F23 behaviour actually makes sense. From the fallocate man page: -n, --keep-size Do not modify the apparent length of the file. This doesn't make any sense if the file doesn't exist. That is, the -n option makes sense when preallocating an existing disk image, but not when testing if fallocate works on a given filesystem and the test file doesn't already exist. You could also reasonably argue that util-linux probably shouldn't be breaking an interface like this, even when misused. However, that's a separate discussion. We shouldn't be misusing it. ** Affects: nova Importance: Undecided Assignee: Matthew Booth (mbooth-9) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1543181 Title: Raw and qcow2 disks are never preallocated on systems with newer util-linux Status in OpenStack Compute (nova): In Progress Bug description: imagebackend.Image._can_fallocate tests if fallocate works by running the following command: fallocate -n -l 1 <path>.fallocate_test where <path> exists, but <path>.fallocate_test does not. This command line is copied from the code which actually fallocates a disk. However, while this works on systems with an older version of util-linux, such as RHEL 7, it does not work on systems with a newer version of util-linux, such as Fedora 23. The result of this is that this test will always fail, and preallocation with fallocate will be erroneously disabled. On RHEL 7, which has util-linux-2.23.2-26.el7.x86_64 on my system: $ fallocate -n -l 1 foo $ ls -lh foo -rw-r--r--. 1 mbooth mbooth 0 Feb 8 15:33 foo $ du -sh foo 4.0K foo On Fedora 23, which has util-linux-2.27.1-2.fc23.x86_64 on my system: $ fallocate -n -l 1 foo fallocate: cannot open foo: No such file or directory The F23 behaviour actually makes sense. From the fallocate man page: -n, --keep-size Do not modify the apparent length of the file. This doesn't make any sense if the file doesn't exist. That is, the -n option makes sense when preallocating an existing disk image, but not when testing if fallocate works on a given filesystem and the test file doesn't already exist. You could also reasonably argue that util-linux probably shouldn't be breaking an interface like this, even when misused. However, that's a separate discussion. We shouldn't be misusing it.
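A minimal sketch of a probe that avoids the misuse by creating the test file before invoking fallocate -n (illustrative only; not the actual Nova fix):

    import os
    import subprocess

    def can_fallocate(base_dir):
        # Probe whether fallocate works on the filesystem holding base_dir.
        test_file = os.path.join(base_dir, '.fallocate_test')
        try:
            # Create the file first: -n (--keep-size) is only meaningful
            # for a file that already exists.
            open(test_file, 'a').close()
            subprocess.check_call(['fallocate', '-n', '-l', '1', test_file])
            return True
        except (OSError, subprocess.CalledProcessError):
            return False
        finally:
            if os.path.exists(test_file):
                os.unlink(test_file)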
To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1543181/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1392527] Re: [OSSA 2015-017] Deleting instance while resize instance is running leads to unuseable compute nodes (CVE-2015-3280)
** Changed in: nova Status: Fix Released => New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1392527 Title: [OSSA 2015-017] Deleting instance while resize instance is running leads to unuseable compute nodes (CVE-2015-3280) Status in OpenStack Compute (nova): New Status in OpenStack Compute (nova) juno series: In Progress Status in OpenStack Compute (nova) kilo series: Fix Committed Status in OpenStack Security Advisory: Fix Committed Bug description: Steps to reproduce: 1) Create a new instance, waiting until its status goes to the ACTIVE state 2) Call resize API 3) Delete the instance immediately after the task_state is “resize_migrated” or vm_state is “resized” 4) Repeat 1 through 3 in a loop I have kept the attached program running for 4 hours; all instances created are deleted (nova list returns an empty list), but I noticed instance directories with the name "<instance_uuid>_resize" are not deleted from the instance path of the compute nodes (mainly from the source compute nodes where the instance was running before resize). If I keep this program running for a couple more hours (depending on the number of compute nodes), then it completely uses the entire disk of the compute nodes (based on the disk_allocation_ratio parameter value). Later, nova scheduler doesn’t select these compute nodes for launching new vms and starts reporting error "No valid hosts found". Note: Even the periodic tasks don't clean up these orphan instance directories from the instance path. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1392527/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1462957] [NEW] VMware driver cannot report non-contiguous resources to the scheduler
Public bug reported: A VMware hypervisor can have various types of non-contiguous resource. This includes: * CPUs and memory, assuming a cluster has more than 1 member. * Storage space, if a (VMware) host has more than 1 datastore. Focussing on the latter, if a host has 5 datastores, each with 50GB of free space, we currently report the largest contiguous free space to the scheduler: 50GB. This means that the scheduler knows it can allocate an instance with a 50GB block device, but until the host stats are updated it will not allow subsequent instances to be scheduled there. We could alternatively report 250GB of free space, but would risk the scheduler repeatedly sending us a request for an instance with a 100GB block device, which we cannot fulfil. Without the ability to represent non-contiguous resources we are left choosing between 2 suboptimal options. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1462957 Title: VMware driver cannot report non-contiguous resources to the scheduler Status in OpenStack Compute (Nova): New Bug description: A VMware hypervisor can have various types of non-contiguous resource. This includes: * CPUs and memory, assuming a cluster has more than 1 member. * Storage space, if a (VMware) host has more than 1 datastore. Focussing on the latter, if a host has 5 datastores, each with 50GB of free space, we currently report the largest contiguous free space to the scheduler: 50GB. This means that the scheduler knows it can allocate an instance with a 50GB block device, but until the host stats are updated it will not allow subsequent instances to be scheduled there. We could alternatively report 250GB of free space, but would risk the scheduler repeatedly sending us a request for an instance with a 100GB block device, which we cannot fulfil. Without the ability to represent non-contiguous resources we are left choosing between 2 suboptimal options. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1462957/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
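For illustration, the two aggregates the driver can currently choose between (numbers from the example above):

    datastore_free_gb = [50, 50, 50, 50, 50]     # five datastores, 50GB free each

    largest_contiguous = max(datastore_free_gb)  # 50GB: the host looks nearly full,
                                                 # so it is under-packed
    total_free = sum(datastore_free_gb)          # 250GB: the scheduler may send a
                                                 # 100GB request that no single
                                                 # datastore can actually hold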
[Yahoo-eng-team] [Bug 1430223] [NEW] Live migration with ceph fails to cleanup instance directory on failure
Public bug reported: When doing a live migration of an instance using ceph for shared storage, if the migration fails then the instance directory will not be cleaned up on the destination host. The next attempt to do the live migration will fail with DestinationDiskExists, but will cleanup the directory. A simple way to test this is to setup a working system which allows a ceph instance to be live migrated, then delete the relevant ceph secret from libvirt on one of the hosts. Live migration to that host will fail, triggering this bug. ** Affects: nova Importance: Undecided Status: New ** Description changed: When doing a live migration of an instance using ceph for shared storage, if the migration fails then the instance directory will not be cleaned up on the destination host. The next attempt to do the live migration will fail with DestinationDiskExists, but will cleanup the directory. + + A simple way to test this is to setup a working system which allows a + ceph instance to be live migrated, then delete the relevant ceph secret + from libvirt on one of the hosts. Live migration to that host will fail, + triggering this bug. -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1430223 Title: Live migration with ceph fails to cleanup instance directory on failure Status in OpenStack Compute (Nova): New Bug description: When doing a live migration of an instance using ceph for shared storage, if the migration fails then the instance directory will not be cleaned up on the destination host. The next attempt to do the live migration will fail with DestinationDiskExists, but will cleanup the directory. A simple way to test this is to setup a working system which allows a ceph instance to be live migrated, then delete the relevant ceph secret from libvirt on one of the hosts. Live migration to that host will fail, triggering this bug. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1430223/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1416000] Re: VMware: write error lost while transferring volume
** Also affects: oslo.vmware Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1416000 Title: VMware: write error lost while transferring volume Status in Cinder: New Status in OpenStack Compute (Nova): New Status in Oslo VMware library for OpenStack projects: New Bug description: I'm running the following command: cinder create --image-id a24f216f-9746-418e-97f9-aebd7fa0e25f 1 The write side of the data transfer (a VMwareHTTPWriteFile object) returns an error in write() which I haven't debugged yet. However, this error is never reported to the user, although it does show up in the logs. The effect is that the transfer sits in the 'downloading' state until the 7200 second timeout, when it reports the timeout. The reason is that the code which waits on transfer completion (in start_transfer) does: try: # Wait on the read and write events to signal their end read_event.wait() write_event.wait() except (timeout.Timeout, Exception) as exc: ... That is, it waits for the read thread to signal completion via read_event before checking write_event. However, because write_thread has died, read_thread is blocking and will never signal completion. You can demonstrate this by swapping the order. If you wait for the write first it will die immediately, which is what you want. However, that's not right either because now you're missing read errors. Ideally this code needs to be able to notice an error at either end and stop immediately. To manage notifications about this bug go to: https://bugs.launchpad.net/cinder/+bug/1416000/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
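One hedged sketch of the kind of change suggested in the last paragraph: instead of waiting on the reader and then the writer, have both sides report to a shared queue so a failure at either end surfaces immediately. This is illustrative stdlib Python, not the actual oslo.vmware code:

    import queue
    import threading

    def start_transfer(read_fn, write_fn, timeout=7200):
        # Both sides report completion (or an exception) on a shared queue,
        # so an error at either end is noticed immediately instead of only
        # after the other side finishes.
        results = queue.Queue()

        def run(fn, name):
            try:
                fn()
                results.put((name, None))
            except Exception as exc:  # broad for illustration only
                results.put((name, exc))

        for name, fn in (('read', read_fn), ('write', write_fn)):
            threading.Thread(target=run, args=(fn, name), daemon=True).start()

        for _ in range(2):
            name, error = results.get(timeout=timeout)
            if error is not None:
                raise error  # report the failing side to the caller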
[Yahoo-eng-team] [Bug 1412436] [NEW] Race in instance_create with security_group_destroy
Public bug reported: There is a race in instance_create between fetching security groups (returned by _security_group_get_by_names) and adding them to the instance. We have no guarantee that they have not been deleted in the meantime. The result is currently that the SecurityGroupInstanceAssociation is created, pointing to the deleted SecurityGroup. This is different to the result of deleting the SecurityGroup afterwards, when both SecurityGroupInstanceAssociation and SecurityGroup are marked deleted. It is also different to the result of deleting the SecurityGroup before, which is to raise an error. While this intermediate state doesn't appear to cause an immediate problem, I feel it would be likely to result in unexpected behaviour at some point in the future, probably during a datamodel upgrade. My preference would be to cause it to fail, as that feels intuitively to me to be the most useful response to the end user (they have just requested an instance with a security group, but the returned instance already does not have that security group). However, either behaviour would be correct IMO. I suspect the failure behaviour would be harder to achieve in practice. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1412436 Title: Race in instance_create with security_group_destroy Status in OpenStack Compute (Nova): New Bug description: There is a race in instance_create between fetching security groups (returned by _security_group_get_by_names) and adding them to the instance. We have no guarantee that they have not been deleted in the meantime. The result is currently that the SecurityGroupInstanceAssociation is created, pointing to the deleted SecurityGroup. This is different to the result of deleting the SecurityGroup afterwards, when both SecurityGroupInstanceAssociation and SecurityGroup are marked deleted. It is also different to the result of deleting the SecurityGroup before, which is to raise an error. While this intermediate state doesn't appear to cause an immediate problem, I feel it would be likely to result in unexpected behaviour at some point in the future, probably during a datamodel upgrade. My preference would be to cause it to fail, as that feels intuitively to me to be the most useful response to the end user (they have just requested an instance with a security group, but the returned instance already does not have that security group). However, either behaviour would be correct IMO. I suspect the failure behaviour would be harder to achieve in practice. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1412436/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1409024] [NEW] DNSDomain.register_for_zone races
Public bug reported: 2 simultaneous calls to DNSDomain.register_for_zone or DNSDomain.register_for_project will race. The winner is undefined. Consequently, the caller has no way of knowing if the DNSDomain is appropriately registered following a call. register_for_zone or register_for_project will not currently generate an error in this case. I can think of 2 ways to resolve this: 1. Assert that only an unregistered domain can be registered. Attempting to register a registered domain is an error. This would be a semantic change to the existing APIs. 2. Create new APIs which additionally take the expected current registration, and fail if it is not as expected. Deprecate the existing APIs. I favour the former. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1409024 Title: DNSDomain.register_for_zone races Status in OpenStack Compute (Nova): New Bug description: 2 simultaneous calls to DNSDomain.register_for_zone or DNSDomain.register_for_project will race. The winner is undefined. Consequently, the caller has no way of knowing if the DNSDomain is appropriately registered following a call. register_for_zone or register_for_project will not currently generate an error in this case. I can think of 2 ways to resolve this: 1. Assert that only an unregistered domain can be registered. Attempting to register a registered domain is an error. This would be a semantic change to the existing APIs. 2. Create new APIs which additionally take the expected current registration, and fail if it is not as expected. Deprecate the existing APIs. I favour the former. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1409024/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1388095] [NEW] VMware fake driver returns invalid search results due to incorrect use of lstrip()
Public bug reported: _search_ds in the fake driver does: path = file.lstrip(dname).split('/') The intention is to remove a prefix of dname from the beginning of file, but this actually removes all instances of all characters in dname from the left of file. ** Affects: nova Importance: Undecided Status: New ** Tags: vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1388095 Title: VMware fake driver returns invalid search results due to incorrect use of lstrip() Status in OpenStack Compute (Nova): New Bug description: _search_ds in the fake driver does: path = file.lstrip(dname).split('/') The intention is to remove a prefix of dname from the beginning of file, but this actually removes all instances of all characters in dname from the left of file. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1388095/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
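To illustrate the difference (the dname and path values here are made up; str.lstrip() treats its argument as a set of characters, not as a prefix):

    dname = 'fake_ds/'
    # Looks correct when the next character happens not to occur in dname:
    assert 'fake_ds/image.vmdk'.lstrip(dname) == 'image.vmdk'
    # ...but lstrip() removes *any* leading characters found in dname:
    assert 'fake_ds/disk.vmdk'.lstrip(dname) == 'isk.vmdk'

    # Prefix removal needs an explicit check (or str.removeprefix() on
    # Python 3.9+):
    def strip_prefix(s, prefix):
        return s[len(prefix):] if s.startswith(prefix) else s

    assert strip_prefix('fake_ds/disk.vmdk', dname) == 'disk.vmdk'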
[Yahoo-eng-team] [Bug 1384309] [NEW] VMware: New permission required: Extension.Register
Public bug reported: Change I1046576c448704841ae8e1800b8390e947b0d457 uses ExtensionManager.RegisterExtension, which requires the additional permission Extension.Register on the vSphere server. Unfortunately we missed the DocImpact in review. This needs to be added to the relevant docs. The impact of not having this permission is that n-cpu fails to start with the error: WebFault: Server raised fault: 'Permission to perform this operation was denied.' ** Affects: nova Importance: Undecided Status: New ** Tags: documentation vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1384309 Title: VMware: New permission required: Extension.Register Status in OpenStack Compute (Nova): New Bug description: Change I1046576c448704841ae8e1800b8390e947b0d457 uses ExtensionManager.RegisterExtension, which requires the additional permission Extension.Register on the vSphere server. Unfortunately we missed the DocImpact in review. This needs to be added to the relevant docs. The impact of not having this permission is that n-cpu fails to start with the error: WebFault: Server raised fault: 'Permission to perform this operation was denied.' To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1384309/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1381061] [NEW] VMware: ESX hosts must not be externally routable
Public bug reported: Change I70fd7d3ee06040d6ce49d93a4becd9cbfdd71f78 removed passwords from VNC hosts. This change is fine because we proxy the VNC connection and do access control at the proxy, but it assumes that ESX hosts are not externally routable. In a non-OpenStack VMware deployment, accessing a VM's console requires the end user to have a direct connection to an ESX host. This leads me to believe that many VMware administrators may leave ESX hosts externally routable if not specifically directed otherwise. The above change makes a design decision which requires ESX hosts not to be externally routable. There may also be other reasons. We need to ensure that this is very clearly documented. This may already be documented, btw, but I don't know how our documentation is organised, and would prefer that somebody more familiar with it assures themselves that this has been given appropriate weight. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1381061 Title: VMware: ESX hosts must not be externally routable Status in OpenStack Compute (Nova): New Bug description: Change I70fd7d3ee06040d6ce49d93a4becd9cbfdd71f78 removed passwords from VNC hosts. This change is fine because we proxy the VNC connection and do access control at the proxy, but it assumes that ESX hosts are not externally routable. In a non-OpenStack VMware deployment, accessing a VM's console requires the end user to have a direct connection to an ESX host. This leads me to believe that many VMware administrators may leave ESX hosts externally routable if not specifically directed otherwise. The above change makes a design decision which requires ESX hosts not to be externally routable. There may also be other reasons. We need to ensure that this is very clearly documented. This may already be documented, btw, but I don't know how our documentation is organised, and would prefer that somebody more familiar with it assures themselves that this has been given appropriate weight. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1381061/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1375688] [NEW] test failure in ShelveComputeManagerTestCase.test_unshelve
Public bug reported: Full logs here: http://logs.openstack.org/02/124402/3/check/gate-nova- python26/1d3512b/ Seen: 2014-09-26 15:20:46.795 | ExpectedMethodCallsError: Verify: Expected methods never called: 2014-09-26 15:20:46.796 | 0. _notify_about_instance_usage.__call__(, Instance(access_ip_v4=None,access_ip_v6=None,architecture='x86_64',auto_disk_config=False,availability_zone=None,cell_name=None,cleaned=False,config_drive=None,created_at=2014-09-26T15:09:38Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,disable_terminate=False,display_description=None,display_name=None,ephemeral_gb=0,ephemeral_key_uuid=None,fault=,host='fake-mini',hostname=None,id=1,image_ref='fake-image-ref',info_cache=,instance_type_id=2,kernel_id=None,key_data=None,key_name=None,launch_index=None,launched_at=2014-09-26T15:09:39Z,launched_on=None,locked=False,locked_by=None,memory_mb=0,metadata={},node='fakenode1',numa_topology=,os_type='Linux',pci_devices=,power_state=123,progress=None,project_id='fake',ramdisk_id=None,reservation_id='r-fakeres',root_device_name=None,root_gb=0,scheduled_at=None,security_gro ups=,shutdown_terminate=False,system_metadata={instance_type_ephemeral_gb='0',instance_type_flavorid='1',instance_type_id='2',instance_type_memory_mb='512',instance_type_name='m1.tiny',instance_type_root_gb='1',instance_type_rxtx_factor='1.0',instance_type_swap='0',instance_type_vcpu_weight=None,instance_type_vcpus='1'},task_state=None,terminated_at=None,updated_at=2014-09-26T15:09:38Z,user_data=None,user_id='fake',uuid=cb73da32-e73e-4f52-a332-f66e9752ac9d,vcpus=0,vm_mode=None,vm_state='active'), 'unshelve.end') -> None and: 2014-09-26 15:20:46.800 | UnexpectedMethodCallError: Unexpected method call instance_update_and_get_original.__call__(, 'cb73da32-e73e-4f52-a332-f66e9752ac9d', {'vm_state': u'active', 'expected_task_state': 'spawning', 'key_data': None, 'host': u'fake-mini', 'image_ref': u'fake-image-ref', 'power_state': 123, 'auto_disk_config': False, 'task_state': None, 'launched_at': datetime.datetime(2014, 9, 26, 15, 9, 39, 224533, tzinfo=)}, columns_to_join=['metadata', 'system_metadata'], update_cells=False) -> None My initial reaction is that the mox error messages don't contain enough information to diagnose the problem, or at least they certainly don't make it obvious to the uninitiated, due to the missing expected values. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1375688 Title: test failure in ShelveComputeManagerTestCase.test_unshelve Status in OpenStack Compute (Nova): New Bug description: Full logs here: http://logs.openstack.org/02/124402/3/check/gate-nova- python26/1d3512b/ Seen: 2014-09-26 15:20:46.795 | ExpectedMethodCallsError: Verify: Expected methods never called: 2014-09-26 15:20:46.796 | 0. 
_notify_about_instance_usage.__call__(, Instance(access_ip_v4=None,access_ip_v6=None,architecture='x86_64',auto_disk_config=False,availability_zone=None,cell_name=None,cleaned=False,config_drive=None,created_at=2014-09-26T15:09:38Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,disable_terminate=False,display_description=None,display_name=None,ephemeral_gb=0,ephemeral_key_uuid=None,fault=,host='fake-mini',hostname=None,id=1,image_ref='fake-image-ref',info_cache=,instance_type_id=2,kernel_id=None,key_data=None,key_name=None,launch_index=None,launched_at=2014-09-26T15:09:39Z,launched_on=None,locked=False,locked_by=None,memory_mb=0,metadata={},node='fakenode1',numa_topology=,os_type='Linux',pci_devices=,power_state=123,progress=None,project_id='fake',ramdisk_id=None,reservation_id='r-fakeres',root_device_name=None,root_gb=0,scheduled_at=None,security_g roups=,shutdown_terminate=False,system_metadata={instance_type_ephemeral_gb='0',instance_type_flavorid='1',instance_type_id='2',instance_type_memory_mb='512',instance_type_name='m1.tiny',instance_type_root_gb='1',instance_type_rxtx_factor='1.0',instance_type_swap='0',instance_type_vcpu_weight=None,instance_type_vcpus='1'},task_state=None,terminated_at=None,updated_at=2014-09-26T15:09:38Z,user_data=None,user_id='fake',uuid=cb73da32-e73e-4f52-a332-f66e9752ac9d,vcpus=0,vm_mode=None,vm_state='active'), 'unshelve.end') -> None and: 2014-09-26 15:20:46.800 | UnexpectedMethodCallError: Unexpected method call instance_update_and_get_original.__call__(, 'cb73da32-e73e-4f52-a332-f66e9752ac9d', {'vm_state': u'active', 'expected_task_state': 'spawning', 'key_data': None, 'host': u'fake-mini', 'image_ref': u'fake-image-ref', 'power_state': 123, 'auto_disk_config': False, 'task_state': None, 'launched_at': datetime.datetime(2014, 9, 26, 15, 9, 39, 224533, tzinfo=)}, columns_to_join=['metadata', 'system_metadata'], update_cells=F
[Yahoo-eng-team] [Bug 1372369] [NEW] Blockdev reports 'No such device or address'
Public bug reported: Tempest failure: http://logs.openstack.org/57/122757/1/check/check- tempest-dsvm-neutron-full/a08fb08/ 2014-09-19 18:48:47.388 | 2014-09-19 18:15:35,926 6578 INFO [tempest.common.rest_client] Request (DeleteServersTestJSON:test_delete_server_while_in_verify_resize_state): 500 DELETE http://127.0.0.1:8774/v2/9959855b406d4563a6174eb27f11450e/servers/0a451fd1-72d8-4849-8b47-2095986f9cd4 60.155s 2014-09-19 18:48:47.389 | }}} Due to: 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5671, in _get_instance_disk_info 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task dk_size = lvm.get_volume_size(path) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/lvm.py", line 157, in get_volume_size 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task run_as_root=True) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/utils.py", line 53, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task return utils.execute(*args, **kwargs) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/utils.py", line 163, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task return processutils.execute(*cmd, **kwargs) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/openstack/common/processutils.py", line 203, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task cmd=sanitized_cmd) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task ProcessExecutionError: Unexpected error while running command. 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task Command: sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-eef2d948-c15b-4525-b477-4ca2b194b8ae-lun-1 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task Exit code: 1 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task Stdout: u'' 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task Stderr: u'blockdev: cannot open /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-eef2d948-c15b-4525-b477-4ca2b194b8ae-lun-1: No such device or address\n' 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). 
https://bugs.launchpad.net/bugs/1372369 Title: Blockdev reports 'No such device or address' Status in OpenStack Compute (Nova): New Bug description: Tempest failure: http://logs.openstack.org/57/122757/1/check/check- tempest-dsvm-neutron-full/a08fb08/ 2014-09-19 18:48:47.388 | 2014-09-19 18:15:35,926 6578 INFO [tempest.common.rest_client] Request (DeleteServersTestJSON:test_delete_server_while_in_verify_resize_state): 500 DELETE http://127.0.0.1:8774/v2/9959855b406d4563a6174eb27f11450e/servers/0a451fd1-72d8-4849-8b47-2095986f9cd4 60.155s 2014-09-19 18:48:47.389 | }}} Due to: 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5671, in _get_instance_disk_info 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task dk_size = lvm.get_volume_size(path) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/lvm.py", line 157, in get_volume_size 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task run_as_root=True) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/utils.py", line 53, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task return utils.execute(*args, **kwargs) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/utils.py", line 163, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task return processutils.execute(*cmd, **kwargs) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/openstack/common/processutils.py", line 203, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task cmd=sanitized_cmd) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task ProcessExecutionError: Unexpected error while running command. 2014-09-19 18:15:38.432 3076
[Yahoo-eng-team] [Bug 1365031] [NEW] VMware fake session doesn't detect implicitly created directory
Public bug reported: The VMware fake session keeps an internal list of created files and directories. Directories can be created implicitly, e.g. the parent directories created by MakeDirectory(createParentDirectories=True), but the fake session will not recognise these. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1365031 Title: VMware fake session doesn't detect implicitly created directory Status in OpenStack Compute (Nova): New Bug description: The VMware fake session keeps an internal list of created files and directories. Directories can be created implicitly, e.g. the parent directories created by MakeDirectory(createParentDirectories=True), but the fake session will not recognise these. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1365031/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
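A minimal sketch of what detecting implicit creation could look like in a fake datastore (illustrative only, not the structure of Nova's fake.py): register every ancestor of a newly added path as an existing directory.

    import posixpath

    class FakeDatastore(object):
        """Track directories created implicitly when a path is added."""

        def __init__(self):
            self.files = set()
            self.directories = set()

        def add_file(self, path):
            self.files.add(path)
            # Register every ancestor directory as existing, mirroring what
            # MakeDirectory(createParentDirectories=True) does for real.
            parent = posixpath.dirname(path)
            while parent and parent not in self.directories:
                self.directories.add(parent)
                parent = posixpath.dirname(parent)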
[Yahoo-eng-team] [Bug 1364849] [NEW] VMware driver doesn't return typed console
Public bug reported: Change I8f6a857b88659ee30b4aa1a25ac52d7e01156a68 added typed consoles, and updated drivers to use them. However, when it touched the VMware driver, it modified get_vnc_console in VMwareVMOps, but not in VMwareVCVMOps, which is the one which is actually used. Incidentally, VMwareVMOps has now been removed, so this type of confusion should not happen again. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1364849 Title: VMware driver doesn't return typed console Status in OpenStack Compute (Nova): New Bug description: Change I8f6a857b88659ee30b4aa1a25ac52d7e01156a68 added typed consoles, and updated drivers to use them. However, when it touched the VMware driver, it modified get_vnc_console in VMwareVMOps, but not in VMwareVCVMOps, which is the one which is actually used. Incidentally, VMwareVMOps has now been removed, so this type of confusion should not happen again. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1364849/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1357428] [NEW] DBDeadlock in gate test
Public bug reported: gate test failed with: DBDeadlock: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE image_properties SET updated_at=%s, deleted_at=%s, deleted=%s WHERE image_properties.image_id = %s AND image_properties.deleted = false' (datetime.datetime(2014, 8, 15, 13, 42, 36, 164537), datetime.datetime(2014, 8, 15, 13, 42, 36, 144848), 1, '62832243-7165-4493-bacc-7801640cc718') Above from: http://logs.openstack.org/28/114528/1/check/check-tempest-dsvm- full/176f0f2/logs/screen-g-reg.txt.gz Full logs: http://logs.openstack.org/28/114528/1/check/check-tempest-dsvm- full/176f0f2/ ** Affects: glance Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to Glance. https://bugs.launchpad.net/bugs/1357428 Title: DBDeadlock in gate test Status in OpenStack Image Registry and Delivery Service (Glance): New Bug description: gate test failed with: DBDeadlock: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE image_properties SET updated_at=%s, deleted_at=%s, deleted=%s WHERE image_properties.image_id = %s AND image_properties.deleted = false' (datetime.datetime(2014, 8, 15, 13, 42, 36, 164537), datetime.datetime(2014, 8, 15, 13, 42, 36, 144848), 1, '62832243-7165-4493-bacc-7801640cc718') Above from: http://logs.openstack.org/28/114528/1/check/check-tempest-dsvm- full/176f0f2/logs/screen-g-reg.txt.gz Full logs: http://logs.openstack.org/28/114528/1/check/check-tempest-dsvm- full/176f0f2/ To manage notifications about this bug go to: https://bugs.launchpad.net/glance/+bug/1357428/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1357263] [NEW] Unhelpful error message when attempting to boot a guest with an invalid guestId
Public bug reported: When booting a VMware instance from an image, guestId is taken from the vmware_ostype property in glance. If this value is invalid, spawn() will fail with the error message: VMwareDriverException: A specified parameter was not correct. As there are many parameters to CreateVM_Task, this error message does not help us narrow down the offending one. Unfortunately this error message is all that vSphere provides us, so we can't do better by relying on vSphere alone. As this is a user-editable parameter, we should try harder to provide an indication of what the error might be. We can do this by validating the field ourselves. As there is no way I'm aware of to extract a canonical list of valid guestIds from a running vSphere host, I think we're left embedding our own list and validating against it. This is not ideal, because: 1. We will need to update our list for every ESX release 2. A simple list will not take account of the ESX version we're running against (i.e. we may have a list for 5.5, but be running against 5.1, which doesn't support everything on our list) Consequently, to maintain a loose coupling we should validate the field, but only warn for values we don't recognise. vSphere will continue to return its non-specific error message, but there will be an additional indication of what the root cause might be in the logs. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1357263 Title: Unhelpful error message when attempting to boot a guest with an invalid guestId Status in OpenStack Compute (Nova): New Bug description: When booting a VMware instance from an image, guestId is taken from the vmware_ostype property in glance. If this value is invalid, spawn() will fail with the error message: VMwareDriverException: A specified parameter was not correct. As there are many parameters to CreateVM_Task, this error message does not help us narrow down the offending one. Unfortunately this error message is all that vSphere provides us, so we can't do better by relying on vSphere alone. As this is a user-editable parameter, we should try harder to provide an indication of what the error might be. We can do this by validating the field ourselves. As there is no way I'm aware of to extract a canonical list of valid guestIds from a running vSphere host, I think we're left embedding our own list and validating against it. This is not ideal, because: 1. We will need to update our list for every ESX release 2. A simple list will not take account of the ESX version we're running against (i.e. we may have a list for 5.5, but be running against 5.1, which doesn't support everything on our list) Consequently, to maintain a loose coupling we should validate the field, but only warn for values we don't recognise. vSphere will continue to return its non-specific error message, but there will be an additional indication of what the root cause might be in the logs. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1357263/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
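A sketch of the suggested soft validation (the guestId values shown are a tiny illustrative subset, not a canonical list, and the function name is made up):

    import logging

    LOG = logging.getLogger(__name__)

    KNOWN_GUEST_IDS = {'otherGuest', 'otherGuest64', 'rhel6_64Guest',
                       'ubuntu64Guest', 'windows8Server64Guest'}

    def check_guest_id(guest_id):
        # Warn rather than fail, so an out-of-date list never blocks a
        # boot that vSphere itself would accept.
        if guest_id not in KNOWN_GUEST_IDS:
            LOG.warning("vmware_ostype '%s' is not a recognised guestId; if "
                        "CreateVM_Task fails with 'A specified parameter was "
                        "not correct', check this value first", guest_id)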
[Yahoo-eng-team] [Bug 1355928] [NEW] Deadlock in reservation commit
Public bug reported: Details in http://logs.openstack.org/46/104146/15/check/check-tempest- dsvm-full/d235389/, specifically in n-cond logs: 2014-08-12 14:58:57.099 ERROR nova.quota [req-7efe48be-f5b4-4343-898a-5b4b32694530 AggregatesAdminTestJSON-719157131 AggregatesAdminTestJSON-1908648657] Failed to commit reservations [u'5bdde344-b26f-4e0a-9aa7-d91d775b6df0', u'5f757426-8f4e-454f-aedb-1186771f85fd', u'819aeaf6-9faf-4da5-a16d-ce1c571c4975'] 2014-08-12 14:58:57.099 21994 TRACE nova.quota Traceback (most recent call last): 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/quota.py", line 1326, in commit 2014-08-12 14:58:57.099 21994 TRACE nova.quota user_id=user_id) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/quota.py", line 569, in commit 2014-08-12 14:58:57.099 21994 TRACE nova.quota user_id=user_id) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/db/api.py", line 1148, in reservation_commit 2014-08-12 14:58:57.099 21994 TRACE nova.quota user_id=user_id) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 167, in wrapper 2014-08-12 14:58:57.099 21994 TRACE nova.quota return f(*args, **kwargs) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 205, in wrapped 2014-08-12 14:58:57.099 21994 TRACE nova.quota return f(*args, **kwargs) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 3302, in reservation_commit 2014-08-12 14:58:57.099 21994 TRACE nova.quota for reservation in reservation_query.all(): 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2241, in all 2014-08-12 14:58:57.099 21994 TRACE nova.quota return list(self) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2353, in __iter__ 2014-08-12 14:58:57.099 21994 TRACE nova.quota return self._execute_and_instances(context) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2368, in _execute_and_instances 2014-08-12 14:58:57.099 21994 TRACE nova.quota result = conn.execute(querycontext.statement, self._params) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 662, in execute 2014-08-12 14:58:57.099 21994 TRACE nova.quota params) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 761, in _execute_clauseelement 2014-08-12 14:58:57.099 21994 TRACE nova.quota compiled_sql, distilled_params 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 874, in _execute_context 2014-08-12 14:58:57.099 21994 TRACE nova.quota context) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1024, in _handle_dbapi_exception 2014-08-12 14:58:57.099 21994 TRACE nova.quota exc_info 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 196, in raise_from_cause 2014-08-12 14:58:57.099 21994 TRACE nova.quota reraise(type(exception), exception, tb=exc_tb) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 867, in 
_execute_context 2014-08-12 14:58:57.099 21994 TRACE nova.quota context) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 324, in do_execute 2014-08-12 14:58:57.099 21994 TRACE nova.quota cursor.execute(statement, parameters) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute 2014-08-12 14:58:57.099 21994 TRACE nova.quota self.errorhandler(self, exc, value) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler 2014-08-12 14:58:57.099 21994 TRACE nova.quota raise errorclass, errorvalue 2014-08-12 14:58:57.099 21994 TRACE nova.quota OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'SELECT reservations.created_at AS reservations_created_at, reservations.updated_at AS reservations_updated_at, reservations.deleted_at AS reservations_deleted_at, reservations.deleted AS reservations_deleted, reservations.id AS reservations_id, reservations.uuid AS reservations_uuid, reservations.usage_id AS reservations_usage_id, reservations.project_id AS reservations_project_id, reservations.user
[Yahoo-eng-team] [Bug 1354403] [NEW] Numerous config options ignored due to CONF used in import context
Public bug reported: In general[1] it is incorrect to use the value of a config variable at import time, because although the config variable may have been registered, its value will not have been loaded. The result will always be the default value, regardless of the contents of the relevant config file. I did a quick scan of Nova, and found the following instances of config variables being used in import context: nova/api/openstack/common.py:limited() nova/api/openstack/common.py:get_limit_and_marker() nova/compute/manager.py:_heal_instance_info_cache() nova/compute/manager.py:_poll_shelved_instances() nova/compute/manager.py:_poll_bandwidth_usage() nova/compute/manager.py:_poll_volume_usage() nova/compute/manager.py:_sync_power_states() nova/compute/manager.py:_cleanup_running_deleted_instances() nova/compute/manager.py:_run_image_cache_manager_pass() nova/compute/manager.py:_run_pending_deletes() nova/network/manager.py:_periodic_update_dns() nova/scheduler/manager.py:_run_periodic_tasks() Consequently, it appears that the given values of the following config variables are being ignored: osapi_max_limit heal_instance_info_cache_interval shelved_poll_interval bandwidth_poll_interval volume_usage_poll_interval sync_power_state_interval running_deleted_instance_poll_interval image_cache_manager_interval instance_delete_interval dns_update_periodic_interval scheduler_driver_task_period [1] This doesn't apply to drivers, which are loaded dynamically after the config has been loaded. However, relying on that seems even nastier. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1354403 Title: Numerous config options ignored due to CONF used in import context Status in OpenStack Compute (Nova): New Bug description: In general[1] it is incorrect to use the value of a config variable at import time, because although the config variable may have been registered, its value will not have been loaded. The result will always be the default value, regardless of the contents of the relevant config file. I did a quick scan of Nova, and found the following instances of config variables being used in import context: nova/api/openstack/common.py:limited() nova/api/openstack/common.py:get_limit_and_marker() nova/compute/manager.py:_heal_instance_info_cache() nova/compute/manager.py:_poll_shelved_instances() nova/compute/manager.py:_poll_bandwidth_usage() nova/compute/manager.py:_poll_volume_usage() nova/compute/manager.py:_sync_power_states() nova/compute/manager.py:_cleanup_running_deleted_instances() nova/compute/manager.py:_run_image_cache_manager_pass() nova/compute/manager.py:_run_pending_deletes() nova/network/manager.py:_periodic_update_dns() nova/scheduler/manager.py:_run_periodic_tasks() Consequently, it appears that the given values of the following config variables are being ignored: osapi_max_limit heal_instance_info_cache_interval shelved_poll_interval bandwidth_poll_interval volume_usage_poll_interval sync_power_state_interval running_deleted_instance_poll_interval image_cache_manager_interval instance_delete_interval dns_update_periodic_interval scheduler_driver_task_period [1] This doesn't apply to drivers, which are loaded dynamically after the config has been loaded. However, relying on that seems even nastier. 
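A stripped-down illustration of the pattern (plain oslo.config, without the periodic task decorator plumbing Nova actually uses; the option name is made up):

    from oslo_config import cfg

    CONF = cfg.CONF
    CONF.register_opt(cfg.IntOpt('poll_interval', default=60))

    # Broken: evaluated at import time, before CONF() has parsed any config
    # files, so this is always 60 regardless of what nova.conf says.
    POLL_INTERVAL = CONF.poll_interval

    def run_periodic_tasks():
        # Working: read the option at call time, after configuration has
        # been loaded with CONF(sys.argv[1:], project='nova').
        interval = CONF.poll_interval
        return interval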
To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1354403/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1342055] [NEW] Suspending and restoring a rescued instance restores it to ACTIVE rather than RESCUED
Public bug reported: If you suspend a rescued instance, resume returns it to the ACTIVE state rather than the RESCUED state. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1342055 Title: Suspending and restoring a rescued instance restores it to ACTIVE rather than RESCUED Status in OpenStack Compute (Nova): New Bug description: If you suspend a rescued instance, resume returns it to the ACTIVE state rather than the RESCUED state. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1342055/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1337798] [NEW] VMware: snapshot operation copies a live image
Public bug reported: N.B. This is based purely on code inspection. A reasonable resolution would be to point out that I've misunderstood something and it's actually fine. I'm filing this bug because it's potentially a subtle data corruptor, and I'd like more eyes on it. The snapshot code in vmwareapi/vmops.py does: 1. snapshot 2. copy disk image to vmware 3. delete snapshot 4. copy disk image from vmware to glance I think the problem is in step 2. I don't see how it's copying the snapshot it just created rather than the live disk image. i.e. I don't think step 2 is copying the snapshot it created in step 1. It's possible that there's some subtlety to do with path names here, but in that case it could still do with a comment. If it is in fact copying the live image, it would normally work. However, this would potentially be a subtle data corruptor. For example, consider that a file's data was towards the beginning of a disk, but its metadata was towards the end of the disk. If the VM guest creates the file during the copy operation, it copies the metadata at the end of the disk, but misses the contents at the beginning of the disk. ** Affects: nova Importance: Undecided Status: New ** Tags: vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1337798 Title: VMware: snapshot operation copies a live image Status in OpenStack Compute (Nova): New Bug description: N.B. This is based purely on code inspection. A reasonable resolution would be to point out that I've misunderstood something and it's actually fine. I'm filing this bug because it's potentially a subtle data corruptor, and I'd like more eyes on it. The snapshot code in vmwareapi/vmops.py does: 1. snapshot 2. copy disk image to vmware 3. delete snapshot 4. copy disk image from vmware to glance I think the problem is in step 2. I don't see how it's copying the snapshot it just created rather than the live disk image. i.e. I don't think step 2 is copying the snapshot it created in step 1. It's possible that there's some subtlety to do with path names here, but in that case it could still do with a comment. If it is in fact copying the live image, it would normally work. However, this would potentially be a subtle data corruptor. For example, consider that a file's data was towards the beginning of a disk, but its metadata was towards the end of the disk. If the VM guest creates the file during the copy operation, it copies the metadata at the end of the disk, but misses the contents at the beginning of the disk. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1337798/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1333587] [NEW] VMware: ExtendVirtualDisk_Task fails due to locked file
Public bug reported: Extending a disk during spawn races, which can result in failure. It is possible to hit this bug by launching a large number of instances of an image which isn't already cached, simultaneously. Some of them will race to extend the cached image, ultimately resulting in an error such as: 2014-06-17 10:49:26.006 9177 WARNING nova.virt.vmwareapi.driver [-] Task [ExtendVirtualDisk_Task] value = "task-12073" _type = "Task" } status: error Unable to access file [datastore1] 172.16.0.13_base/326153d2-1226-415a-a194-2ca47ac3c48b/326153d2-1226-415a-a194-2ca47ac3c48b.1.vmdk since it is locked ** Affects: nova Importance: Undecided Status: New ** Tags: vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1333587 Title: VMware: ExtendVirtualDisk_Task fails due to locked file Status in OpenStack Compute (Nova): New Bug description: Extending a disk during spawn races, which can result in failure. It is possible to hit this bug by launching a large number of instances of an image which isn't already cached, simultaneously. Some of them will race to extend the cached image, ultimately resulting in an error such as: 2014-06-17 10:49:26.006 9177 WARNING nova.virt.vmwareapi.driver [-] Task [ExtendVirtualDisk_Task] value = "task-12073" _type = "Task" } status: error Unable to access file [datastore1] 172.16.0.13_base/326153d2-1226-415a-a194-2ca47ac3c48b/326153d2-1226-415a-a194-2ca47ac3c48b.1.vmdk since it is locked To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1333587/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
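One way to serialise the racing extends on a single compute host is an advisory lock per cached image. A rough sketch using stdlib fcntl (it does not help when several hosts share the datastore, and it is not the actual Nova fix):

    import fcntl

    def extend_cached_image(image_path, new_size_gb, do_extend):
        # Hold an exclusive lock for the duration of the extend so only
        # one of the racing spawns issues ExtendVirtualDisk_Task at a time.
        with open(image_path + '.lock', 'w') as lock_file:
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                do_extend(image_path, new_size_gb)
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)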
[Yahoo-eng-team] [Bug 1333232] [NEW] Gate failure: autodoc: failed to import module X
Public bug reported: Spurious gate failure: http://logs.openstack.org/65/99065/4/check/gate- nova-docs/af27af8/console.html Logs are full of: 2014-06-23 09:55:32.057 | /home/jenkins/workspace/gate-nova-docs/doc/source/devref/api.rst:39: WARNING: autodoc: failed to import module u'nova.api.cloud'; the following exception was raised: 2014-06-23 09:55:32.057 | Traceback (most recent call last): 2014-06-23 09:55:32.057 | File "/home/jenkins/workspace/gate-nova-docs/.tox/venv/local/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 335, in import_object 2014-06-23 09:55:32.057 | __import__(self.modname) 2014-06-23 09:55:32.057 | ImportError: No module named cloud 2014-06-23 09:55:32.057 | /home/jenkins/workspace/gate-nova-docs/doc/source/devref/api.rst:66: WARNING: autodoc: failed to import module u'nova.api.openstack.backup_schedules'; the following exception was raised: 2014-06-23 09:55:32.057 | Traceback (most recent call last): 2014-06-23 09:55:32.057 | File "/home/jenkins/workspace/gate-nova-docs/.tox/venv/local/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 335, in import_object 2014-06-23 09:55:32.057 | __import__(self.modname) 2014-06-23 09:55:32.058 | ImportError: No module named backup_schedules ** Affects: nova Importance: Undecided Status: New ** Tags: ci -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1333232 Title: Gate failure: autodoc: failed to import module X Status in OpenStack Compute (Nova): New Bug description: Spurious gate failure: http://logs.openstack.org/65/99065/4/check /gate-nova-docs/af27af8/console.html Logs are full of: 2014-06-23 09:55:32.057 | /home/jenkins/workspace/gate-nova-docs/doc/source/devref/api.rst:39: WARNING: autodoc: failed to import module u'nova.api.cloud'; the following exception was raised: 2014-06-23 09:55:32.057 | Traceback (most recent call last): 2014-06-23 09:55:32.057 | File "/home/jenkins/workspace/gate-nova-docs/.tox/venv/local/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 335, in import_object 2014-06-23 09:55:32.057 | __import__(self.modname) 2014-06-23 09:55:32.057 | ImportError: No module named cloud 2014-06-23 09:55:32.057 | /home/jenkins/workspace/gate-nova-docs/doc/source/devref/api.rst:66: WARNING: autodoc: failed to import module u'nova.api.openstack.backup_schedules'; the following exception was raised: 2014-06-23 09:55:32.057 | Traceback (most recent call last): 2014-06-23 09:55:32.057 | File "/home/jenkins/workspace/gate-nova-docs/.tox/venv/local/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 335, in import_object 2014-06-23 09:55:32.057 | __import__(self.modname) 2014-06-23 09:55:32.058 | ImportError: No module named backup_schedules To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1333232/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1328539] [NEW] Fixed IP allocation doesn't clean up properly on failure
Public bug reported: If fixed IP allocation fails, for example because nova's network interfaces got renamed after a reboot, nova will loop continuously trying, and failing, to create a new instance. For every attempted spawn the instance will end up with an additional fixed IP allocated to it. This is because the code is associating the IP, but not disassociating it if the function fails. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1328539 Title: Fixed IP allocation doesn't clean up properly on failure Status in OpenStack Compute (Nova): New Bug description: If fixed IP allocation fails, for example because nova's network interfaces got renamed after a reboot, nova will loop continuously trying, and failing, to create a new instance. For every attempted spawn the instance will end up with an additional fixed IP allocated to it. This is because the code is associating the IP, but not disassociating it if the function fails. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1328539/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
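[Editorial note] The report says the IP is associated but never disassociated when the rest of allocation fails. The sketch below is a minimal illustration of the intended rollback pattern only; the helpers are hypothetical stand-ins for the real network-manager and DB API calls.

# Minimal rollback sketch, assuming hypothetical helpers in place of the
# real nova network-manager / DB API calls.
def allocate_fixed_ip(context, instance_uuid, network):
    address = fixed_ip_associate(context, instance_uuid, network)
    try:
        _setup_network_on_host(context, instance_uuid, network, address)
    except Exception:
        # Undo the association so a retried spawn does not leak another IP.
        fixed_ip_disassociate(context, address)
        raise
    return address


def fixed_ip_associate(context, instance_uuid, network):
    # Placeholder for the DB call that reserves an address for the instance.
    raise NotImplementedError()


def fixed_ip_disassociate(context, address):
    # Placeholder for the DB call that releases the address.
    raise NotImplementedError()


def _setup_network_on_host(context, instance_uuid, network, address):
    # Placeholder for the host-side setup that can fail (e.g. after network
    # interfaces are renamed on reboot), triggering the rollback above.
    raise NotImplementedError()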
[Yahoo-eng-team] [Bug 1324036] [NEW] Can't add authenticated iscsi volume to a vmware instance
Public bug reported: The VMware driver doesn't pass volume authentication information to the hba when attaching an iscsi volume. Consequently, adding an iscsi volume which requires authentication will always fail. ** Affects: nova Importance: Undecided Status: New ** Tags: vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1324036 Title: Can't add authenticated iscsi volume to a vmware instance Status in OpenStack Compute (Nova): New Bug description: The VMware driver doesn't pass volume authentication information to the hba when attaching an iscsi volume. Consequently, adding an iscsi volume which requires authentication will always fail. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1324036/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
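[Editorial note] For illustration only, the sketch below forwards the CHAP credentials that Cinder places in connection_info['data'] (auth_method, auth_username, auth_password). The _configure_hba_chap helper is hypothetical and stands in for whatever vSphere call actually applies the credentials to the host's iSCSI HBA; the point is simply that the driver must forward these fields rather than drop them.

# Illustrative sketch, not the actual vmwareapi volume code.
def get_chap_credentials(connection_info):
    data = connection_info['data']
    if data.get('auth_method', '').upper() != 'CHAP':
        return None
    return {
        'chap_username': data['auth_username'],
        'chap_secret': data['auth_password'],
    }


def attach_iscsi_volume(session, connection_info, host_hba):
    creds = get_chap_credentials(connection_info)
    if creds:
        _configure_hba_chap(session, host_hba, **creds)
    # ... continue with target discovery and attaching the disk ...


def _configure_hba_chap(session, host_hba, chap_username, chap_secret):
    # Placeholder for the vSphere API call that updates the HBA's
    # authentication properties.
    raise NotImplementedError()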
[Yahoo-eng-team] [Bug 1297375] [NEW] All nova apis relying on Instance.save(expected_*_state) for safety contain a race condition
Public bug reported:

Take, for example, resize_instance(). In manager.py, we assert that the instance is in RESIZE_PREP state with:

    instance.save(expected_task_state=task_states.RESIZE_PREP)

This should mean that the first resize will succeed, and any subsequent one will fail. However, the underlying db implementation does not lock the instance during the update, and therefore doesn't guarantee this.

Specifically, _instance_update() in db/sqlalchemy/api.py starts a session, and reads task_state from the instance. However, it does not use a 'select ... for update', meaning the row is not locked. 2 concurrent calls to this method can both read the same state, then race to the update. The last writer will win. Without 'select ... for update', the db transaction is only ensuring that all writes are atomic, not reads with dependent writes.

SQLAlchemy seems to support select ... for update, as do MySQL and PostgreSQL, although MySQL will fall back to whole table locks for non-InnoDB tables, which would likely be a significant performance hit.

** Affects: nova
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1297375

Title:
  All nova apis relying on Instance.save(expected_*_state) for safety contain a race condition

Status in OpenStack Compute (Nova):
  New

Bug description:
  Take, for example, resize_instance(). In manager.py, we assert that the instance is in RESIZE_PREP state with:

      instance.save(expected_task_state=task_states.RESIZE_PREP)

  This should mean that the first resize will succeed, and any subsequent one will fail. However, the underlying db implementation does not lock the instance during the update, and therefore doesn't guarantee this.

  Specifically, _instance_update() in db/sqlalchemy/api.py starts a session, and reads task_state from the instance. However, it does not use a 'select ... for update', meaning the row is not locked. 2 concurrent calls to this method can both read the same state, then race to the update. The last writer will win. Without 'select ... for update', the db transaction is only ensuring that all writes are atomic, not reads with dependent writes.

  SQLAlchemy seems to support select ... for update, as do MySQL and PostgreSQL, although MySQL will fall back to whole table locks for non-InnoDB tables, which would likely be a significant performance hit.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1297375/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
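[Editorial note] Two possible shapes of an atomic expected-state check are sketched below: a SELECT ... FOR UPDATE read, and a single compare-and-swap UPDATE whose rowcount reveals a lost race. This is an illustrative SQLAlchemy sketch, not nova's actual DB layer; the Instance model and the session wiring are assumed.

# Illustrative SQLAlchemy sketch. `session` is assumed to be a configured ORM
# session and `Instance` a mapped model with `uuid` and `task_state` columns.
class UnexpectedTaskStateError(Exception):
    pass


def save_locked(session, Instance, uuid, expected_task_state, updates):
    # Option 1: SELECT ... FOR UPDATE keeps the row locked between reading
    # task_state and writing the update, so a concurrent caller blocks.
    with session.begin():
        inst = (session.query(Instance)
                .filter_by(uuid=uuid)
                .with_for_update()
                .one())
        if inst.task_state != expected_task_state:
            raise UnexpectedTaskStateError(inst.task_state)
        for key, value in updates.items():
            setattr(inst, key, value)


def save_compare_and_swap(session, Instance, uuid, expected_task_state, updates):
    # Option 2: a single UPDATE ... WHERE task_state = <expected>; losing the
    # race shows up as rowcount == 0 rather than as a silently lost write.
    with session.begin():
        rows = (session.query(Instance)
                .filter_by(uuid=uuid, task_state=expected_task_state)
                .update(updates, synchronize_session=False))
    if rows == 0:
        raise UnexpectedTaskStateError(expected_task_state)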
[Yahoo-eng-team] [Bug 1290455] [NEW] libvirt inject_data assumes instance with kernel_id doesn't contain a partition table
Public bug reported: libvirt/driver.py passes partition=None to disk.inject_data() for any instance with kernel_id set. partition=None means that inject_data will attempt to mount the whole image, i.e. assuming there is no partition table. While this may be true for EC2, it is not safe to assume that Xen images don't contain partition tables. This should check something more directly related to the disk image. In fact, ideally it would leave it up to libguestfs to work it out, as libguestfs is very good at this. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1290455 Title: libvirt inject_data assumes instance with kernel_id doesn't contain a partition table Status in OpenStack Compute (Nova): New Bug description: libvirt/driver.py passes partition=None to disk.inject_data() for any instance with kernel_id set. partition=None means that inject_data will attempt to mount the whole image, i.e. assuming there is no partition table. While this may be true for EC2, it is not safe to assume that Xen images don't contain partition tables. This should check something more directly related to the disk image. In fact, ideally it would leave it up to libguestfs to work it out, as libguestfs is very good at this. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1290455/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
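[Editorial note] As an illustration of "leaving it up to libguestfs", the sketch below uses the python-guestfs bindings to discover the root filesystem instead of assuming partition=None. It is not the nova disk API, just the inspection pattern the report suggests.

# Illustrative sketch; requires the python-guestfs bindings.
import guestfs


def find_root_device(image_path):
    g = guestfs.GuestFS(python_return_dict=True)
    g.add_drive_opts(image_path, readonly=1)
    g.launch()
    try:
        # inspect_os() finds the root filesystem whether or not the image has
        # a partition table, so the caller does not have to guess.
        roots = g.inspect_os()
        if roots:
            return roots[0]
        # Fall back to the first partition, or the whole device for a
        # partitionless image.
        partitions = g.list_partitions()
        return partitions[0] if partitions else '/dev/sda'
    finally:
        g.shutdown()
        g.close()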
[Yahoo-eng-team] [Bug 1275773] [NEW] VMware session not logged out on VMwareAPISession garbage collection
Public bug reported: A bug in VMwareAPISession.__del__() prevents the session being logged out when the session object is garbage collected. ** Affects: nova Importance: Medium Status: New ** Tags: havana-backport-potential vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1275773 Title: VMware session not logged out on VMwareAPISession garbage collection Status in OpenStack Compute (Nova): New Bug description: A bug in VMwareAPISession.__del__() prevents the session being logged out when the session object is garbage collected. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1275773/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
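[Editorial note] A generic sketch of the safe-cleanup shape (explicit logout plus a __del__ that never raises) follows; the class and the _logout() helper are hypothetical and do not mirror the actual VMwareAPISession internals.

# Generic illustration only, not the driver code.
class ApiSession(object):
    def __init__(self):
        self._logged_in = True

    def logout(self):
        if self._logged_in:
            self._logout()
            self._logged_in = False

    def __del__(self):
        # __del__ may run during interpreter teardown, when module globals can
        # already be gone, so never let an exception escape from here.
        try:
            self.logout()
        except Exception:
            pass

    def _logout(self):
        # Placeholder for the real Logout API call.
        pass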
[Yahoo-eng-team] [Bug 1271966] [NEW] Not possible to spawn vmware instance with multiple disks
Public bug reported:

The behaviour of spawn() in the vmwareapi driver with regard to images and block device mappings is currently as follows:

* If there are any block device mappings, images are ignored
* If there are any block device mappings, the last becomes the root device and all others are ignored

This means that, for example, the following scenarios are not possible:

1. Spawn an instance with a root device from an image, and a secondary volume
2. Spawn an instance with a volume as a root device, and a secondary volume

The behaviour of the libvirt driver is as follows:

* If there is an image, it will be the root device unless there is also a block device mapping for the root device
* All block device mappings are used
* If there are multiple block device mappings for the same device, the last one is used

The vmwareapi driver's behaviour is surprising, and should be modified to follow the libvirt driver.

** Affects: nova
   Importance: Undecided
   Status: New

** Tags: vmware

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1271966

Title:
  Not possible to spawn vmware instance with multiple disks

Status in OpenStack Compute (Nova):
  New

Bug description:
  The behaviour of spawn() in the vmwareapi driver with regard to images and block device mappings is currently as follows:

  * If there are any block device mappings, images are ignored
  * If there are any block device mappings, the last becomes the root device and all others are ignored

  This means that, for example, the following scenarios are not possible:

  1. Spawn an instance with a root device from an image, and a secondary volume
  2. Spawn an instance with a volume as a root device, and a secondary volume

  The behaviour of the libvirt driver is as follows:

  * If there is an image, it will be the root device unless there is also a block device mapping for the root device
  * All block device mappings are used
  * If there are multiple block device mappings for the same device, the last one is used

  The vmwareapi driver's behaviour is surprising, and should be modified to follow the libvirt driver.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1271966/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
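[Editorial note] To make the requested libvirt-style behaviour concrete, here is a hedged sketch using simplified dict-based block device mappings; the field names ('device_name', 'image') are illustrative rather than the real nova BDM schema.

# Sketch of the libvirt-style selection described in the report above.
def resolve_disks(image_ref, block_device_mappings, root_device_name='/dev/sda'):
    # If several mappings target the same device, the last one wins.
    by_device = {}
    for bdm in block_device_mappings:
        by_device[bdm['device_name']] = bdm

    disks = []
    if image_ref and root_device_name not in by_device:
        # The root device comes from the image unless a mapping overrides it.
        disks.append({'device_name': root_device_name, 'image': image_ref})

    # Every surviving block device mapping is attached, not just the last one.
    disks.extend(by_device.values())
    return disks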