[Yahoo-eng-team] [Bug 2020552] [NEW] trunk_details missing sub port MAC addresses for LIST
Public bug reported:

When returning port details, trunk_details.sub_ports should contain:

* segmentation_id
* segmentation_type
* port_id
* mac_address

This is the case when GETting a single port, but when listing ports mac_address is missing.

In the following:

* Parent port: a47df912-1cba-458c-9bb9-00cd3d71b9e6
* Trunk: 70f314f8-5577-4b98-be9c-68bbe3791d7f
* Sub port: d11793a9-8862-4378-a1fe-045f04dad841

GET request:

    curl -s -H "X-Auth-Token: $OS_TOKEN" "https://rhos-d.infra.prod.upshift.rdu2.redhat.com:13696/v2.0/ports/a47df912-1cba-458c-9bb9-00cd3d71b9e6" | jq

    {
      "port": {
        ...
        "trunk_details": {
          "trunk_id": "70f314f8-5577-4b98-be9c-68bbe3791d7f",
          "sub_ports": [
            {
              "segmentation_id": 100,
              "segmentation_type": "vlan",
              "port_id": "d11793a9-8862-4378-a1fe-045f04dad841",
              "mac_address": "fa:16:3e:88:29:a0"
            }
          ]
        },
        ...
      }
    }

LIST request returning the same port:

    curl -s -H "X-Auth-Token: $OS_TOKEN" "https://rhos-d.infra.prod.upshift.rdu2.redhat.com:13696/v2.0/ports?id=a47df912-1cba-458c-9bb9-00cd3d71b9e6" | jq

    {
      "ports": [
        {
          ...
          "trunk_details": {
            "trunk_id": "70f314f8-5577-4b98-be9c-68bbe3791d7f",
            "sub_ports": [
              {
                "segmentation_id": 100,
                "segmentation_type": "vlan",
                "port_id": "d11793a9-8862-4378-a1fe-045f04dad841"
              }
            ]
          },
          ...
        }
      ]
    }

Note that mac_address is missing for the LIST request.

* Version: Little bit of guesswork going on here, but Nova reports a latest microversion of 2.79, which corresponds to Train.

** Affects: neutron
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/2020552
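For completeness, the same comparison can be scripted. Below is a small reproducer sketch using python-requests; the endpoint and token are placeholders, and only the parent port UUID comes from this report.

    # Reproducer sketch (not part of the original report): compare the
    # sub_port keys returned by GET vs LIST for the same parent port.
    import requests

    NEUTRON = "https://neutron.example.com:9696"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "<valid token>"}     # placeholder token
    PORT = "a47df912-1cba-458c-9bb9-00cd3d71b9e6"

    get_port = requests.get(
        f"{NEUTRON}/v2.0/ports/{PORT}", headers=HEADERS).json()["port"]
    list_port = requests.get(
        f"{NEUTRON}/v2.0/ports", params={"id": PORT},
        headers=HEADERS).json()["ports"][0]

    # GET includes mac_address in each sub_port; LIST does not.
    print(sorted(get_port["trunk_details"]["sub_ports"][0]))
    print(sorted(list_port["trunk_details"]["sub_ports"][0]))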
[Yahoo-eng-team] [Bug 1848666] [NEW] Race can cause instance to become ACTIVE after build error
Public bug reported:

Two functions used in error cleanup in _do_build_and_run_instance, _cleanup_allocated_networks and _set_instance_obj_error_state, call an unguarded instance.save(). The problem with this is that the instance object may have been in an unclean state before the build exception was raised. Calling instance.save() will persist this unclean error state in addition to whatever change was made during cleanup, which is not intended.

Specifically, in the case that a build races with a delete, the build can fail when we try to do an atomic save to set the vm_state to active, raising UnexpectedDeletingTaskStateError. However, the instance object still contains the unpersisted vm_state change along with other concomitant changes. These will all be persisted when _cleanup_allocated_networks calls instance.save(). This means that the instance.save(expected_task_state=SPAWNING) which correctly failed due to a race later succeeds accidentally during cleanup, resulting in an inconsistent instance state.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1848666
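The interaction is easier to see with a minimal toy model (illustrative names only, not Nova code): the guarded save refuses the update but leaves the change queued on the object, and the later unguarded save in cleanup persists it anyway.

    # Toy model of the failure mode described above; not Nova code.
    class GuardFailed(Exception):
        pass

    class FakeInstance:
        def __init__(self, db_row):
            self.db_row = db_row        # dict standing in for the DB row
            self.pending = {}           # unpersisted field changes

        def save(self, expected_task_state=None):
            if (expected_task_state is not None
                    and self.db_row['task_state'] != expected_task_state):
                # Atomic compare-and-swap refused; note that the pending
                # changes stay queued on the object.
                raise GuardFailed()
            self.db_row.update(self.pending)
            self.pending = {}

    db_row = {'vm_state': 'building', 'task_state': 'deleting'}
    inst = FakeInstance(db_row)

    inst.pending['vm_state'] = 'active'             # build flips to ACTIVE...
    try:
        inst.save(expected_task_state='spawning')   # ...but a delete raced us
    except GuardFailed:
        pass

    inst.pending['network_info'] = '[]'   # cleanup's own change
    inst.save()                           # unguarded save in cleanup
    print(db_row['vm_state'])             # 'active': stale change persisted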
[Yahoo-eng-team] [Bug 1774249] Re: update_available_resource will raise DiskNotFound after resize but before confirm
This is not fixed. We've just had a report where we appear to be hitting the race reported in review here: https://review.opendev.org/#/c/571410/7/nova/virt/libvirt/driver.py

** Changed in: nova
   Status: Fix Released => In Progress

** Changed in: nova/stein
   Status: Fix Committed => In Progress

** Changed in: nova/rocky
   Status: Fix Committed => In Progress

** Changed in: nova/queens
   Status: Fix Committed => In Progress

https://bugs.launchpad.net/bugs/1774249

Title: update_available_resource will raise DiskNotFound after resize but before confirm

Status in OpenStack Compute (nova): In Progress
Status in OpenStack Compute (nova) ocata series: Triaged
Status in OpenStack Compute (nova) pike series: Triaged
Status in OpenStack Compute (nova) queens series: In Progress
Status in OpenStack Compute (nova) rocky series: In Progress
Status in OpenStack Compute (nova) stein series: In Progress
[Yahoo-eng-team] [Bug 1840912] [NEW] libvirt calls aren't reliably using tpool.Proxy
Public bug reported:

A customer is hitting an issue with symptoms identical to bug 1045152 (from 2012). Specifically, we are frequently seeing the compute host being marked down. From log correlation, we can see that when this occurs the relevant compute is always in the middle of executing LibvirtDriver._get_disk_over_committed_size_total(). The reason for this appears to be a long-running libvirt call which is not using tpool.Proxy, and therefore blocks all other greenthreads during execution. We do not yet know why the libvirt call is slow, but we have identified the reason it is not using tpool.Proxy.

Because eventlet, we proxy libvirt calls at the point we create the libvirt connection in libvirt.Host._connect:

    return tpool.proxy_call(
        (libvirt.virDomain, libvirt.virConnect),
        libvirt.openAuth, uri, auth, flags)

This means: run libvirt.openAuth(uri, auth, flags) in a native thread. If the returned object is a libvirt.virDomain or libvirt.virConnect, wrap the returned object in a tpool.Proxy with the same autowrap rules.

There are 2 problems with this.

Firstly, the autowrap list is incomplete. At the very least we need to add libvirt.virNodeDevice, libvirt.virSecret, and libvirt.NWFilter to this list, as we use all of these objects in Nova. Currently none of our interactions with these objects are using the tpool proxy.

Secondly, and the specific root cause of this bug, it doesn't understand lists: https://github.com/eventlet/eventlet/blob/ca8dd0748a1985a409e9a9a517690f46e05cae99/eventlet/tpool.py#L149

In LibvirtDriver._get_disk_over_committed_size_total() we get a list of running libvirt domains with libvirt.Host.list_instance_domains, which calls virConnect.listAllDomains(). listAllDomains() returns a *list* of virDomain, which the above code in tpool doesn't match. Consequently, none of the subsequent virDomain calls use the tpool proxy, which starves all other greenthreads.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1840912
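The list behaviour is easy to demonstrate outside Nova with a small toy, assuming eventlet is installed; the classes below only stand in for libvirt's virConnect/virDomain.

    # Toy demonstration, not Nova code: tpool only autowraps a return value
    # whose own type is in the autowrap tuple, so a *list* of such objects
    # comes back unwrapped and later calls on its members bypass the proxy.
    from eventlet import tpool

    class Domain:                      # stands in for libvirt.virDomain
        def XMLDesc(self, flags=0):
            return '<domain/>'

    class Connection:                  # stands in for libvirt.virConnect
        def lookupByName(self, name):
            return Domain()            # single object: matched and wrapped

        def listAllDomains(self, flags=0):
            return [Domain()]          # list: not matched, members unwrapped

    conn = tpool.Proxy(Connection(), autowrap=(Connection, Domain))

    print(isinstance(conn.lookupByName('x'), tpool.Proxy))    # True
    print(isinstance(conn.listAllDomains()[0], tpool.Proxy))  # False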
[Yahoo-eng-team] [Bug 1836212] Re: libvirt: Failure to recover from failed detach
Yep. The actual error thrown was "Unable to detach from guest transient domain.", which is now "Unable to detach the device from the live config." in master. That RetryDecorator makes this function a whole lot harder to read, but with your explanation it seems that the detach was actually timing out, which is consistent with the underlying problem we eventually discovered.

Thanks! I'll close this out.

** Changed in: nova
   Status: New => Invalid

https://bugs.launchpad.net/bugs/1836212

Title: libvirt: Failure to recover from failed detach

Status in OpenStack Compute (nova): Invalid
[Yahoo-eng-team] [Bug 1836212] [NEW] libvirt: Failure to recover from failed detach
Public bug reported:

    1020162 ERROR root [req-46fbc6c8-de2c-4afb-9f24-9d75947c9a3c 9ccddbb72e2d42b6ab1a31ad48ea21fb 86bea4eb057b412a98402a1b7e1d9222 - - -] Original exception being dropped:
    ['Traceback (most recent call last):\n',
     '  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 390, in _try_detach_device\n    self.detach_device(conf, persistent=persistent, live=live)\n',
     '  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 467, in detach_device\n    self._domain.detachDeviceFlags(device_xml, flags=flags)\n',
     '  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit\n    result = proxy_call(self._autowrap, f, *args, **kwargs)\n',
     '  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call\n    rv = execute(f, *args, **kwargs)\n',
     '  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute\n    six.reraise(c, e, tb)\n',
     '  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker\n    rv = meth(*args, **kwargs)\n',
     '  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1194, in detachDeviceFlags\n    if ret == -1: raise libvirtError (\'virDomainDetachDeviceFlags() failed\', dom=self)\n',
     'libvirtError: invalid argument: no target device vdb\n']

This appears to happen because when we call detach_device_with_retry(live=True) we ultimately call detachDeviceFlags(flags=VIR_DOMAIN_AFFECT_CONFIG | VIR_DOMAIN_AFFECT_LIVE). 'no target device' is the error generated when libvirt failed to remove the device from CONFIG (persistent).

This can happen because detachDeviceFlags(flags=VIR_DOMAIN_AFFECT_CONFIG | VIR_DOMAIN_AFFECT_LIVE) will succeed and remove the device from the CONFIG domain as long as the LIVE domain removal was queued, even though this is an asynchronous operation. Consequently, a subsequent check for the device may return the device because it hasn't yet been (and may never be) removed from the LIVE domain, but it has been removed from the CONFIG domain. This will prevent libvirt from attempting to remove the device from the LIVE domain, and so the detach will never succeed.

** Affects: nova
   Importance: Undecided
   Status: New

** Bug watch added: Red Hat Bugzilla #1669225
   https://bugzilla.redhat.com/show_bug.cgi?id=1669225

https://bugs.launchpad.net/bugs/1836212
[Yahoo-eng-team] [Bug 1821373] [NEW] Most instance actions can be called concurrently
Public bug reported:

A customer reported that they were getting DB corruption if they called shelve twice in quick succession on the same instance.

This should be prevented by the guard in nova.API.shelve, which does:

    instance.task_state = task_states.SHELVING
    instance.save(expected_task_state=[None])

This is intended to act as a robust gate against 2 instance actions happening concurrently. The first will set the task state to SHELVING; the second will fail because the task state is not SHELVING. The comparison is done atomically in db.instance_update_and_get_original(), and should be race free.

However, instance.save() shortcuts if there is no update, and does not call db.instance_update_and_get_original(). Therefore this guard fails if we call the same operation twice:

    instance = get_instance()
    => Returned instance.task_state is None
    instance.task_state = task_states.SHELVING
    instance.save(expected_task_state=[None])
    => task_state was None, now SHELVING, updates = {'task_state': SHELVING}
    => db.instance_update_and_get_original() executes and succeeds

    instance = get_instance()
    => Returned instance.task_state is SHELVING
    instance.task_state = task_states.SHELVING
    instance.save(expected_task_state=[None])
    => task_state was SHELVING, still SHELVING, updates = {}
    => db.instance_update_and_get_original() does not execute, therefore doesn't raise the expected exception

This pattern is common to almost all instance actions in nova api. A quick scan suggests that all of the following actions are affected by this bug, and can therefore all potentially be executed multiple times concurrently for the same instance:

    restore
    force_stop
    start
    backup
    snapshot
    soft reboot
    hard reboot
    rebuild
    revert_resize
    resize
    shelve
    shelve_offload
    unshelve
    pause
    unpause
    suspend
    resume
    rescue
    unrescue
    set_admin_password
    live_migrate
    evacuate

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1821373
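The shortcut can be demonstrated with a self-contained toy model of the save path (illustrative names, not Nova code): when the in-memory value already matches what is being set there is nothing to update, so the guarded compare-and-swap never runs.

    # Toy model of the save() shortcut described above; not Nova code.
    SHELVING = 'shelving'

    class UnexpectedTaskState(Exception):
        pass

    class FakeInstance:
        def __init__(self, db_row):
            self.db_row = db_row                  # stands in for the DB row
            self.task_state = db_row['task_state']

        def save(self, expected_task_state):
            updates = {}
            if self.task_state != self.db_row['task_state']:
                updates['task_state'] = self.task_state
            if not updates:
                return                            # shortcut: guard never runs
            if self.db_row['task_state'] not in expected_task_state:
                raise UnexpectedTaskState(self.db_row['task_state'])
            self.db_row.update(updates)           # atomic compare-and-swap

    db_row = {'task_state': None}

    first = FakeInstance(db_row)
    first.task_state = SHELVING
    first.save(expected_task_state=[None])        # succeeds, persists SHELVING

    second = FakeInstance(db_row)                 # fetched after the update
    second.task_state = SHELVING                  # no net change
    second.save(expected_task_state=[None])       # returns instead of raising
    print('both shelve requests proceeded')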
[Yahoo-eng-team] [Bug 1804811] [NEW] DatabaseAtVersion fixture causes order-related failures in tests using Database fixture
Public bug reported:

The DatabaseAtVersion fixture starts the global TransactionContext, but doesn't set the guard to configure() used by the Database fixture. Consequently, if Database runs after DatabaseAtVersion in the same worker, the subsequent fixture will fail.

An example ordering which fails is:

    nova.tests.unit.db.test_sqlalchemy_migration.TestNewtonCellsCheck.test_upgrade_without_cell0
    nova.tests.unit.db.test_sqlalchemy_migration.TestNewtonCheck.test_pci_device_type_vf_not_migrated

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1804811
[Yahoo-eng-team] [Bug 1804652] [NEW] nova.db.sqlalchemy.migration.db_version is racy
Public bug reported:

db_version() attempts to initialise versioning if the db is not versioned. However, it doesn't consider concurrency, so we can get errors if multiple watchers try to get the db version before the db is initialised.

We are seeing this in practice during tripleo deployments, in a script which waits on multiple controller nodes for db sync to complete.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1804652
[Yahoo-eng-team] [Bug 1803961] [NEW] Nova doesn't call migrate_volume_completion after cinder volume migration
Public bug reported:

Originally reported in Red Hat Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1648931

Create a cinder volume, attach it to a nova instance, and migrate the volume to a different storage host:

    $ cinder create 1 --volume-type foo --name myvol
    $ nova volume-attach myinstance myvol
    $ cinder migrate myvol c-vol2

Everything seems to work correctly, but if we look at myinstance we see that it's now connected to a new volume, and the original volume is still present on the original storage host. This is because nova didn't call cinder's migrate_volume_completion. migrate_volume_completion would have deleted the original volume, and changed the volume id of the new volume to be the same as the original. The result would be that myinstance would appear to be connected to the same volume as before.

Note that there are 2 ways (that I'm aware of) to initiate a cinder volume migration: retype and migrate. AFAICT retype is *not* affected. In fact, I updated the relevant tempest test to try to trip it up and it didn't fail. However, an explicit migrate *is* affected. They are different top-level entry points in cinder, and set different state, which is what triggers the Nova bug.

This appears to be a regression which was introduced by https://review.openstack.org/#/c/456971/ :

    # Yes this is a tightly-coupled state check of what's going on inside
    # cinder, but we need this while we still support old (v1/v2) and
    # new style attachments (v3.44). Once we drop support for old style
    # attachments we could think about cleaning up the cinder-initiated
    # swap volume API flows.
    is_cinder_migration = (
        True if old_volume['status'] in ('retyping', 'migrating') else False)

There's a bug here because AFAICT cinder never sets status to 'migrating' during any operation: it sets migration_status to 'migrating' during both retype and migrate. During retype it sets status to 'retyping', but not during an explicit migrate.

** Affects: nova
   Importance: Undecided
   Assignee: Matthew Booth (mbooth-9)
   Status: In Progress

https://bugs.launchpad.net/bugs/1803961
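Based on the analysis above, a check that covered an explicit migrate as well as a retype would need to look at migration_status. A hedged sketch of that shape follows; the field names are the cinder volume fields discussed in the report, but this is not necessarily the change that merged.

    # Sketch only: detect a cinder-initiated move for both retype and an
    # explicit migrate, using migration_status rather than status alone.
    is_cinder_migration = (
        old_volume['status'] == 'retyping'
        or old_volume.get('migration_status') == 'migrating')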
[Yahoo-eng-team] [Bug 1794333] [NEW] Local delete emits only legacy start and end notifications
Public bug reported:

If the compute api service does a 'local delete', it only emits legacy notifications when the operation starts and ends. If the delete goes to a compute host, the compute host emits both legacy and versioned notifications. This is both inconsistent and a gap in versioned notifications.

It would appear that every caller of compute_utils.notify_about_instance_delete in compute.API fails to emit versioned notifications. I suggest that the best way to fix this will be to fix compute_utils.notify_about_instance_delete, but note that there's also a caller in compute.Manager which emits versioned notifications explicitly.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1794333
[Yahoo-eng-team] [Bug 1780973] [NEW] Failure during live migration leaves BDM with incorrect connection_info
Public bug reported:

_rollback_live_migration doesn't restore connection_info.

** Affects: nova
   Importance: Undecided
   Assignee: Matthew Booth (mbooth-9)
   Status: In Progress

https://bugs.launchpad.net/bugs/1780973
[Yahoo-eng-team] [Bug 1778206] [NEW] Compute leaks volume attachments if we fail in driver.pre_live_migration
Public bug reported:

ComputeManager.pre_live_migration fails to clean up volume attachments if the call to driver.pre_live_migration() fails. There's a try block in there to clean up attachments, but its scope isn't large enough. The result is a volume in a perpetual attaching state.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1778206
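A sketch of the fix shape implied above: widen the try block so that a failure inside driver.pre_live_migration() also rolls back the attachments that were just created. The method names and the driver call signature here are assumptions based on the report, not the merged change.

    from oslo_utils import excutils

    def pre_live_migration(self, context, instance, bdms, migrate_data):
        attachment_ids = []
        try:
            # ... create the new-style volume attachments on the destination,
            #     appending each new id to attachment_ids ...
            # block_device_info, network_info and disk_info are computed
            # earlier in the real method (omitted here).

            # The driver call is now *inside* the same try block, so a failure
            # here also triggers the attachment cleanup below.
            migrate_data = self.driver.pre_live_migration(
                context, instance, block_device_info, network_info,
                disk_info, migrate_data)
        except Exception:
            with excutils.save_and_reraise_exception():
                for attachment_id in attachment_ids:
                    self.volume_api.attachment_delete(context, attachment_id)
        return migrate_data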
[Yahoo-eng-team] [Bug 1777475] Re: Undercloud vm in state error after update of the undercloud.
The Nova fix should be to not call plug_vifs at all during ironic driver initialization. It probably isn't necessary for 'non-local' hypervisors in general, so guessing also Power, Hyper-V, and VMware.

** Also affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1777475

Title: Undercloud vm in state error after update of the undercloud.

Status in OpenStack Compute (nova): New
Status in tripleo: In Progress

Bug description:

Hi, after an update of the undercloud, the undercloud vm is in error:

    [stack@undercloud-0 ~]$ openstack server list
    +--------------------------------------+--------------+--------+------------------------+----------------+------------+
    | ID                                   | Name         | Status | Networks               | Image          | Flavor     |
    +--------------------------------------+--------------+--------+------------------------+----------------+------------+
    | 9f80c38a-9f33-4a18-88e0-b89776e62150 | compute-0    | ERROR  | ctlplane=192.168.24.18 | overcloud-full | compute    |
    | e87efe17-b939-4df2-af0c-8e2effd58c95 | controller-1 | ERROR  | ctlplane=192.168.24.9  | overcloud-full | controller |
    | 5a3ea20c-75e8-49fe-90b6-edad01fc0a48 | controller-2 | ERROR  | ctlplane=192.168.24.13 | overcloud-full | controller |
    | ba0f26e7-ec2c-4e61-be8e-05edf00ce78a | controller-0 | ERROR  | ctlplane=192.168.24.8  | overcloud-full | controller |
    +--------------------------------------+--------------+--------+------------------------+----------------+------------+

Originally found starting there: https://bugzilla.redhat.com/show_bug.cgi?id=1590297#c14

It boils down to an ordering issue between openstack-ironic-conductor and openstack-nova-compute. A simple reproducer is:

    sudo systemctl stop openstack-ironic-conductor
    sudo systemctl restart openstack-nova-compute

on the undercloud.
[Yahoo-eng-team] [Bug 1775418] [NEW] Swap volume of multiattached volume will corrupt data
Public bug reported:

We currently permit the following:

    Create multiattach volumes a and b
    Create servers 1 and 2
    Attach volume a to servers 1 and 2
    swap_volume(server 1, volume a, volume b)

In fact, we have a tempest test which tests exactly this sequence: api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach

The problem is that writes from server 2 during the copy operation on server 1 will continue to hit the underlying storage, but as server 1 doesn't know about them they won't be reflected on the copy on volume b. This will lead to an inconsistent copy, and therefore data corruption on volume b.

Also, this whole flow makes no sense for a multiattached volume, because even if we managed a consistent copy all we've achieved is forking our data between the 2 volumes. The purpose of this call is to allow the operator to move volumes. We need a fundamentally different approach for multiattached volumes.

In the short term we should at least prevent data corruption by preventing swap volume of a multiattached volume. This would also cause the above tempest test to fail, but as I don't believe it's possible to implement the test safely, this would be correct.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1775418
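The short-term mitigation suggested above amounts to a guard in the swap-volume API path. A minimal sketch follows, assuming the cinder volume dict exposes the standard 'multiattach' field; the exception class is purely illustrative and not Nova's real one.

    class MultiattachSwapVolumeNotSupported(Exception):
        """Illustrative only; Nova's real exception class may differ."""

    def check_swap_volume_allowed(old_volume):
        # Refuse the operation outright if the source volume is multiattach.
        if old_volume.get('multiattach'):
            raise MultiattachSwapVolumeNotSupported(old_volume['id'])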
[Yahoo-eng-team] [Bug 1774252] [NEW] Resize confirm fails if nova-compute is restarted after resize
Public bug reported:

Originally reported in RH bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1584315

Reproduced on OSP12 (Pike).

After resizing an instance but before confirm, update_available_resource will fail on the source compute due to bug 1774249. If nova compute is restarted at this point before the resize is confirmed, the update_available_resource periodic task will never have succeeded, and therefore ResourceTracker's compute_nodes dict will not be populated at all.

When confirm calls _delete_allocation_after_move() it will fail with ComputeHostNotFound because there is no entry for the current node in ResourceTracker. The error looks like:

    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [req-4f7d5d63-fc05-46ed-b505-41050d889752 09abbd4893bb45eea8fb1d5e40635339 d4483d13a6ef41b2ae575ddbd0c59141 - default default] [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] Setting instance vm_state to ERROR: ComputeHostNotFound: Compute host compute-1.localdomain could not be found.
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] Traceback (most recent call last):
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7445, in _error_out_instance_on_exception
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]     yield
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3757, in _confirm_resize
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]     migration.source_node)
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3790, in _delete_allocation_after_move
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]     cn_uuid = rt.get_node_uuid(nodename)
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 155, in get_node_uuid
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0]     raise exception.ComputeHostNotFound(host=nodename)
    2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] ComputeHostNotFound: Compute host compute-1.localdomain could not be found.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1774252
[Yahoo-eng-team] [Bug 1774249] [NEW] update_available_resource will raise DiskNotFound after resize but before confirm
Public bug reported:

Originally reported in RH Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1584315

Tested on OSP12 (Pike), but appears to be still present on master. Should only occur if nova compute is configured to use local file instance storage.

    Create instance A on compute X
    Resize instance A to compute Y
      Domain is powered off
      /var/lib/nova/instances/<instance uuid> renamed to <instance uuid>_resize on X
      Domain is *not* undefined

    On compute X:
      update_available_resource runs as a periodic task
      First action is to update self
      rt calls driver.get_available_resource()
      ...calls _get_disk_over_committed_size_total
      ...iterates over all defined domains, including the ones whose disks we renamed
      ...fails because a referenced disk no longer exists

Results in errors in nova-compute.log:

    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager [req-bd52371f-c6ec-4a83-9584-c00c5377acd8 - - - - -] Error updating resources for node compute-0.localdomain.: DiskNotFound: No disk at /var/lib/nova/instances/f3ed9015-3984-43f4-b4a5-c2898052b47d/disk
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager Traceback (most recent call last):
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6695, in update_available_resource_for_node
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 641, in update_available_resource
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     resources = self.driver.get_available_resource(nodename)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 5892, in get_available_resource
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     disk_over_committed = self._get_disk_over_committed_size_total()
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7393, in _get_disk_over_committed_size_total
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     config, block_device_info)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7301, in _get_instance_disk_info_from_config
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     dk_size = disk_api.get_allocated_disk_size(path)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/disk/api.py", line 156, in get_allocated_disk_size
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     return images.qemu_img_info(path).disk_size
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 57, in qemu_img_info
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager     raise exception.DiskNotFound(location=path)
    2018-05-30 02:17:08.647 1 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/f3ed9015-3984-43f4-b4a5-c2898052b47d/disk

And resource tracker is no longer updated. We can find lots of these in the gate.

Note that change Icec2769bf42455853cbe686fb30fda73df791b25 nearly mitigates this, but doesn't, because task_state is not set while the instance is awaiting confirm.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1774249
[Yahoo-eng-team] [Bug 1767363] [NEW] Deleting 2 instances with a common multi-attached volume can leave the volume attached
Public bug reported:

CAVEAT: The following is only from code inspection. I have not reproduced the issue.

During instance delete, we call:

    driver.cleanup():
      foreach volume:
        _disconnect_volume():
          if _should_disconnect_target():
            disconnect_volume()

There is no volume-specific or global locking around _disconnect_volume that I can see in this call graph.

_should_disconnect_target() is intended to check for multi-attached volumes on a single host, to prevent a volume being disconnected while it is still in use by another instance. It does:

    volume = cinder->get_volume()
    connection_count = count of volume.attachments where instance is on this host

As there is no locking between the above operation and the subsequent disconnect_volume(), 2 simultaneous calls to _disconnect_volume() can both return False from _should_disconnect_target(). Not only this, but as this involves both a slow call out to cinder and a db lookup, this is likely to be easily hit in practice, for example by an orchestration tool mass-deleting instances.

Also note that there are many call paths which call _disconnect_volume() apart from cleanup(), so there are likely numerous other potential interactions here.

The result would be that all attachments are deleted, but the volume remains attached to the host.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1767363
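The timing window is a classic check-then-act race. The toy below (not Nova code, no real cinder involved) shows how two concurrent deletes can both observe "another attachment exists on this host" and both skip the disconnect:

    # Toy model of the race described above: both "deletes" read the
    # attachment list before either has removed its own attachment, so both
    # conclude the other still needs the target and skip the disconnect.
    attachments = {'instance-1', 'instance-2'}   # both on this host
    host_connected = True

    def should_disconnect(my_instance):
        # Stands in for _should_disconnect_target(): are there any *other*
        # attachments on this host?
        return len(attachments - {my_instance}) == 0

    # Interleaving: both checks happen before either attachment is deleted.
    decisions = {i: should_disconnect(i) for i in sorted(attachments)}

    for instance, disconnect in decisions.items():
        attachments.discard(instance)       # attachment deleted
        if disconnect:
            host_connected = False          # disconnect_volume()

    print(attachments)       # set(): all attachments gone
    print(host_connected)    # True: the host is still connected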
[Yahoo-eng-team] [Bug 1754716] [NEW] Disconnect volume on live migration source fails if initialize_connection doesn't return identical output
Public bug reported: During live migration we update bdm.connection_info for attached volumes in pre_live_migration to reflect the new connection on the destination node. This means that after migration completes we no longer have a reference to the original connection_info to do the detach on the source host, so we have to re-fetch it with a second call to initialize_connection before calling disconnect. Unfortunately the cinder driver interface does not strictly require that multiple calls to initialize_connection will return consistent results. Although they normally do in practice, there is at least one cinder driver (delliscsi) which doesn't. This results in a failure to disconnect on the source host post migration. ** Affects: nova Importance: Undecided Assignee: Matthew Booth (mbooth-9) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1754716 Title: Disconnect volume on live migration source fails if initialize_connection doesn't return identical output Status in OpenStack Compute (nova): In Progress Bug description: During live migration we update bdm.connection_info for attached volumes in pre_live_migration to reflect the new connection on the destination node. This means that after migration completes we no longer have a reference to the original connection_info to do the detach on the source host, so we have to re-fetch it with a second call to initialize_connection before calling disconnect. Unfortunately the cinder driver interface does not strictly require that multiple calls to initialize_connection will return consistent results. Although they normally do in practice, there is at least one cinder driver (delliscsi) which doesn't. This results in a failure to disconnect on the source host post migration. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1754716/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
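A simplified sketch of the alternative implied by bug 1754716 above: keep the connection_info that was actually used to attach on the source, rather than re-calling initialize_connection after the migration and hoping the result matches. All names here are invented for illustration.

    def live_migrate_volume(bdm, dest_connection_info, migrate, disconnect):
        # bdm: dict-like block device mapping with a 'connection_info' entry.
        source_connection_info = bdm['connection_info']

        # pre_live_migration overwrites this with the destination's
        # connection details for use after the migration completes...
        bdm['connection_info'] = dest_connection_info
        migrate()

        # ...so detach on the source using the saved original, not a fresh
        # (and possibly different) initialize_connection result.
        disconnect(source_connection_info)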
[Yahoo-eng-team] [Bug 1744079] [NEW] disk over-commit still not correctly calculated during live migration
Public bug reported: Change I8a705114d47384fcd00955d4a4f204072fed57c2 (written by me... sigh) addressed a bug which prevented live migration to a target host with overcommitted disk when made with microversion <2.25. It achieved this, but the fix is still not correct. We now do: if disk_over_commit: disk_available_gb = dst_compute_info['local_gb'] Unfortunately local_gb is *total* disk, not available disk. We actually want free_disk_gb. Fun fact: due to the way we calculate this for filesystems, without taking into account reserved space, this can also be negative. The test we're currently running is: could we fit this guest's allocated disks on the target if the target disk was empty. This is at least better than it was before, as we don't spuriously fail early. In fact, we're effectively disabling a test which is disabled for microversion >=2.25 anyway. IOW we should fix it, but it's probably not a high priority. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1744079 Title: disk over-commit still not correctly calculated during live migration Status in OpenStack Compute (nova): New Bug description: Change I8a705114d47384fcd00955d4a4f204072fed57c2 (written by me... sigh) addressed a bug which prevented live migration to a target host with overcommitted disk when made with microversion <2.25. It achieved this, but the fix is still not correct. We now do: if disk_over_commit: disk_available_gb = dst_compute_info['local_gb'] Unfortunately local_gb is *total* disk, not available disk. We actually want free_disk_gb. Fun fact: due to the way we calculate this for filesystems, without taking into account reserved space, this can also be negative. The test we're currently running is: could we fit this guest's allocated disks on the target if the target disk was empty. This is at least better than it was before, as we don't spuriously fail early. In fact, we're effectively disabling a test which is disabled for microversion >=2.25 anyway. IOW we should fix it, but it's probably not a high priority. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1744079/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
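A sketch of the check bug 1744079 above says was intended, with free_disk_gb substituted for local_gb. The 'free_disk_gb' key comes from the report itself; the 'disk_available_least' key in the else branch is assumed from the existing non-over-commit code, so treat this as illustrative rather than the merged fix.

    def disk_available_gb_for_check(dst_compute_info, disk_over_commit):
        if disk_over_commit:
            # free_disk_gb is the space actually available on the target;
            # local_gb is the total disk size, so using it only tests whether
            # the guest would fit on an empty disk.
            return dst_compute_info['free_disk_gb']
        # Non-over-commit branch assumed unchanged.
        return dst_compute_info['disk_available_least']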
[Yahoo-eng-team] [Bug 1719362] [NEW] libvirt: Data corruptor live migrating BFV instance with config disk
Public bug reported: When live migrating a BFV instance with a config disk, the API currently requires block migration to be specified due to the local storage requirement. This doesn't make sense on a number of levels. Before calling migrateToURI3() in this case, the libvirt driver filters out all disks which it shouldn't migrate, which is both of them: the config drive because it's read-only and we already copied it with scp, and the root disk because it's a volume. It calls migrateToURI3() with an empty migrate_disks in params, and VIR_MIGRATE_NON_SHARED_INC in flags (because block-migration). There's a quirk in the behaviour of the libvirt python bindings here: it doesn't distinguish between an empty migrate_disks list, and no migrate_disks list. Both use the default behaviour of "block migrate all writable disks". This will include the attached root volume. As the root volume is simultaneously attached to both ends of the migration, one of which is running the guest, this is a data corruptor. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1719362 Title: libvirt: Data corruptor live migrating BFV instance with config disk Status in OpenStack Compute (nova): New Bug description: When live migrating a BFV instance with a config disk, the API currently requires block migration to be specified due to the local storage requirement. This doesn't make sense on a number of levels. Before calling migrateToURI3() in this case, the libvirt driver filters out all disks which it shouldn't migrate, which is both of them: the config drive because it's read-only and we already copied it with scp, and the root disk because it's a volume. It calls migrateToURI3() with an empty migrate_disks in params, and VIR_MIGRATE_NON_SHARED_INC in flags (because block-migration). There's a quirk in the behaviour of the libvirt python bindings here: it doesn't distinguish between an empty migrate_disks list, and no migrate_disks list. Both use the default behaviour of "block migrate all writable disks". This will include the attached root volume. As the root volume is simultaneously attached to both ends of the migration, one of which is running the guest, this is a data corruptor. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1719362/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
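A sketch of the quirk described in bug 1719362 above and one way to avoid tripping over it. The libvirt-python call and flag are real; the surrounding logic is illustrative only and is not the Nova fix, and whether dropping VIR_MIGRATE_NON_SHARED_INC is the right answer is exactly the design question the report raises.

    import libvirt

    def start_migration(dom, dest_uri, migrate_disks, flags, params=None):
        params = dict(params or {})
        if migrate_disks:
            params['migrate_disks'] = list(migrate_disks)
        else:
            # An empty 'migrate_disks' list is treated by the bindings the
            # same as omitting it: "block migrate every writable disk",
            # which here would include the attached root volume. If there is
            # nothing left to block migrate, don't request incremental block
            # migration at all.
            flags &= ~libvirt.VIR_MIGRATE_NON_SHARED_INC
        return dom.migrateToURI3(dest_uri, params, flags)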
[Yahoo-eng-team] [Bug 1718439] Re: Apparent lack of locking in conductor logs
After some brief discussion in #openstack-nova I've moved this to oslo.log. The issue here appears to be that we spawn multiple separate conductor processes writing to the same nova-conductor.log file. We don't want to stop doing this, as it would break people. It seems that by default python logging uses thread locks rather than external locks: https://docs.python.org/2/library/multiprocessing.html#logging Suggest the fix might be to explicitly use multiprocessing.get_logger(), or at least provide an option to do this when we know it's required. ** Project changed: nova => oslo.log -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1718439 Title: Apparent lack of locking in conductor logs Status in oslo.log: New Bug description: I'm looking at conductor logs generated by a customer running RH OSP 10 (Newton). The logs appear to be corrupt in a manner I'd expect to see if 2 processes were writing to the same log file simultaneously. For example: === 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.689 120626 ERROR nova.s2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db Traceback (most recent call last): === Notice how a new log starts part way through the second line above. This also results in log entries in the wrong sort order: === 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 88, in Connect 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return Connection(*args, **kwargs) 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 657, in __init__ === Note how the first 2 lines are after the last 2 by timestamp, as presumably the last 2 are a continuation of a previous log entry. This confounds merge sorting of log files, which is exceptionally useful. We also see truncated lines with no timestamp which aren't a continuation of the previous line: === 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db elf._execute_and_instances(context) === I strongly suspect this is because multiple conductors are running in separate processes, and are therefore not benefiting from the thread safety of python's logging. To manage notifications about this bug go to: https://bugs.launchpad.net/oslo.log/+bug/1718439/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
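A minimal, hypothetical illustration of serialising writes to a shared log file across worker processes, separate from the multiprocessing.get_logger() suggestion above. It assumes the workers are forked from a parent that creates the lock first, which matches the multi-process conductor setup described here; it is not an oslo.log patch.

    import logging
    import multiprocessing

    # Created before the workers fork, so every process shares the same lock.
    _LOG_LOCK = multiprocessing.Lock()

    class LockedFileHandler(logging.FileHandler):
        """Serialise emits to one log file across forked worker processes."""

        def emit(self, record):
            with _LOG_LOCK:
                super(LockedFileHandler, self).emit(record)

    # Usage sketch: attach in place of a plain FileHandler, e.g.
    # logging.getLogger().addHandler(LockedFileHandler('nova-conductor.log'))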
[Yahoo-eng-team] [Bug 1718439] [NEW] Apparent lack of locking in conductor logs
Public bug reported: I'm looking at conductor logs generated by a customer running RH OSP 10 (Newton). The logs appear to be corrupt in a manner I'd expect to see if 2 processes were writing to the same log file simultaneously. For example: === 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.689 120626 ERROR nova.s2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db Traceback (most recent call last): === Notice how a new log starts part way through the second line above. This also results in log entries in the wrong sort order: === 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 88, in Connect 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return Connection(*args, **kwargs) 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 657, in __init__ === Note how the first 2 lines are after the last 2 by timestamp, as presumably the last 2 are a continuation of a previous log entry. This confounds merge sorting of log files, which is exceptionally useful. We also see truncated lines with no timestamp which aren't a continuation of the previous line: === 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db elf._execute_and_instances(context) === I strongly suspect this is because multiple conductors are running in separate processes, and are therefore not benefiting from the thread safety of python's logging. ** Affects: nova Importance: Undecided Status: New ** Summary changed: - Apparent lack of locking in logger + Apparent lack of locking in conductor logs -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1718439 Title: Apparent lack of locking in conductor logs Status in OpenStack Compute (nova): New Bug description: I'm looking at conductor logs generated by a customer running RH OSP 10 (Newton). The logs appear to be corrupt in a manner I'd expect to see if 2 processes were writing to the same log file simultaneously. For example: === 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.689 120626 ERROR nova.s2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db [-] Unexpected error while reporting service status 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db Traceback (most recent call last): === Notice how a new log starts part way through the second line above. 
This also results in log entries in the wrong sort order: === 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams) 2017-09-14 15:54:39.690 120562 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 88, in Connect 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db return Connection(*args, **kwargs) 2017-09-14 15:54:39.689 120626 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 657, in __init__ === Note how the first 2 lines are after the last 2 by timestamp, as presumably the last 2 are a continuation of a previous log entry. This confounds merge sorting of log files, which is exceptionally useful. We also see truncated lines with no timestamp which aren't a continuation of the previous line: === 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] 2017-09-14 15:54:39.690 120607 ERROR nova.servicegroup.drivers.db elf._execute_and_instances(context) === I strongly suspect this is because multiple conductors are running in separate processes, and are therefore not benefiting from the thread safety of python's logging. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1718439/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1688228] [NEW] Failure in resize_instance after cast to finish_resize still sets instance error state
Public bug reported: This is from code inspection only. ComputeManager.resize_instance does: with self._error_out_instance_on_exception(context, instance, quotas=quotas): ...stuff... self.compute_rpcapi.finish_resize(context, instance, migration, image, disk_info, migration.dest_compute, reservations=quotas.reservations) ... Responsibility for the instance has now been punted to the destination, but... self._notify_about_instance_usage(context, instance, "resize.end", network_info=network_info) compute_utils.notify_about_instance_action(context, instance, self.host, action=fields.NotificationAction.RESIZE, phase=fields.NotificationPhase.END) self.instance_events.clear_events_for_instance(instance) The problem is that a failure in anything after the cast to finish_resize will cause the instance to be put in an error state and its quotas rolled back. This would not be correct, as any error here would be purely ephemeral. The resize operation will continue on the destination regardless, so this would almost certainly result in an inconsistent state. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1688228 Title: Failure in resize_instance after cast to finish_resize still sets instance error state Status in OpenStack Compute (nova): New Bug description: This is from code inspection only. ComputeManager.resize_instance does: with self._error_out_instance_on_exception(context, instance, quotas=quotas): ...stuff... self.compute_rpcapi.finish_resize(context, instance, migration, image, disk_info, migration.dest_compute, reservations=quotas.reservations) ... Responsibility for the instance has now been punted to the destination, but... self._notify_about_instance_usage(context, instance, "resize.end", network_info=network_info) compute_utils.notify_about_instance_action(context, instance, self.host, action=fields.NotificationAction.RESIZE, phase=fields.NotificationPhase.END) self.instance_events.clear_events_for_instance(instance) The problem is that a failure in anything after the cast to finish_resize will cause the instance to be put in an error state and its quotas rolled back. This would not be correct, as any error here would be purely ephemeral. The resize operation will continue on the destination regardless, so this would almost certainly result in an inconsistent state. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1688228/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1686703] [NEW] Error in finish_migration results in image deletion on source with no copy
Public bug reported: ML post describing the issue here: http://lists.openstack.org/pipermail/openstack-dev/2017-April/115989.html User was resizing an instance whose glance image had been deleted. An ssh failure occurred in finish_migration, which runs on the destination, attempting to copy the image out of the image cache on the source. This left the instance and migration in an error state on the destination, but with no copy of the image on the destination. Cache manager later ran on the source and expired the image from the image cache there, leaving no remaining copies. At this point the user's instance was unrecoverable. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1686703 Title: Error in finish_migration results in image deletion on source with no copy Status in OpenStack Compute (nova): New Bug description: ML post describing the issue here: http://lists.openstack.org/pipermail/openstack-dev/2017-April/115989.html User was resizing an instance whose glance image had been deleted. An ssh failure occurred in finish_migration, which runs on the destination, attempting to copy the image out of the image cache on the source. This left the instance and migration in an error state on the destination, but with no copy of the image on the destination. Cache manager later ran on the source and expired the image from the image cache there, leaving no remaining copies. At this point the user's instance was unrecoverable. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1686703/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1669844] [NEW] Host failure shortly after image download can result in data corruption
Public bug reported: GlanceImageServiceV2.download() ensures its downloaded file is closed before releasing for use by an external qemu process, but it doesn't do an fdatasync(). This means that the downloaded file may be temporarily in the host kernel's cache rather than on disk, which means there is a short window in which a host crash will lose the contents of the backing file, despite it being in use by a running instance. Disclaimer: I'm not personally able to reproduce this, but it looks sane and our QE team is reliably hitting it. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1669844 Title: Host failure shortly after image download can result in data corruption Status in OpenStack Compute (nova): New Bug description: GlanceImageServiceV2.download() ensures its downloaded file is closed before releasing for use by an external qemu process, but it doesn't do an fdatasync(). This means that the downloaded file may be temporarily in the host kernel's cache rather than on disk, which means there is a short window in which a host crash will lose the contents of the backing file, despite it being in use by a running instance. Disclaimer: I'm not personally able to reproduce this, but it looks sane and our QE team is reliably hitting it. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1669844/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
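A sketch of the missing step described in bug 1669844 above: flush the downloaded image to stable storage before anything else treats it as a usable backing file. The function and parameter names are invented; os.fdatasync() is the point.

    import os

    def write_image(chunks, path):
        with open(path, 'wb') as f:
            for chunk in chunks:
                f.write(chunk)
            # Push data to disk before close, so a host crash shortly after
            # the download cannot leave a running instance with an empty or
            # truncated backing file.
            f.flush()
            os.fdatasync(f.fileno())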
[Yahoo-eng-team] [Bug 1669400] [NEW] delete_instance_metadata and update_instance_metadata are permitted during an ongoing task
Public bug reported: Note: this is exclusively from code inspection. delete_instance_metadata and update_instance_metadata in ComputeManager are both guarded by: @check_instance_state(vm_state=[vm_states.ACTIVE, vm_states.PAUSED, vm_states.SUSPENDED, vm_states.STOPPED], task_state=None) The problem is the task_state=None which, despite appearances, actually explicitly disables the task_state check, i.e. it does not explicitly check that task_state is None. This was introduced in change I70212879 and does not appear to have been deliberate. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1669400 Title: delete_instance_metadata and update_instance_metadata are permitted during an ongoing task Status in OpenStack Compute (nova): New Bug description: Note: this is exclusively from code inspection. delete_instance_metadata and update_instance_metadata in ComputeManager are both guarded by: @check_instance_state(vm_state=[vm_states.ACTIVE, vm_states.PAUSED, vm_states.SUSPENDED, vm_states.STOPPED], task_state=None) The problem is the task_state=None which, despite appearances, actually explicitly disables the task_state check, i.e. it does not explicitly check that task_state is None. This was introduced in change I70212879 and does not appear to have been deliberate. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1669400/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
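A stand-in decorator showing the distinction bug 1669400 above relies on: passing task_state=None means "do not check the task state at all", whereas a tuple such as (None,) means "the instance must not have a task in progress". This mimics the semantics described in the report; it is not the real nova.compute.api decorator.

    def check_instance_state(vm_state=None, task_state=(None,)):
        # Simplified stand-in for illustration only.
        def decorator(fn):
            def wrapper(instance, *args, **kwargs):
                if vm_state is not None and instance['vm_state'] not in vm_state:
                    raise RuntimeError('wrong vm_state: %s' % instance['vm_state'])
                # task_state=None disables this check entirely, which is what
                # the metadata calls above end up doing; task_state=(None,)
                # would actually require "no task in progress".
                if task_state is not None and (
                        instance['task_state'] not in task_state):
                    raise RuntimeError('instance busy: %s' % instance['task_state'])
                return fn(instance, *args, **kwargs)
            return wrapper
        return decorator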
[Yahoo-eng-team] [Bug 1662483] [NEW] detach_volume races with delete
Public bug reported: If a client does: nova volume-detach foo vol nova delete foo Assuming the volume-detach takes a moment, which it normally does, the delete will race with it and also attempt to detach the same volume. It's possible there are no side effects from this other than untidy log messages, but this is difficult to prove. I found this looking through CI logs. Note that volume-detach can also race with other instance operations, including itself. I'm almost certain that if you poke hard enough you'll find some combination that breaks things badly. ** Affects: nova Importance: Undecided Assignee: Matthew Booth (mbooth-9) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1662483 Title: detach_volume races with delete Status in OpenStack Compute (nova): In Progress Bug description: If a client does: nova volume-detach foo vol nova delete foo Assuming the volume-detach takes a moment, which it normally does, the delete will race with it and also attempt to detach the same volume. It's possible there are no side effects from this other than untidy log messages, but this is difficult to prove. I found this looking through CI logs. Note that volume-detach can also race with other instance operations, including itself. I'm almost certain that if you poke hard enough you'll find some combination that breaks things badly. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1662483/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1648109] [NEW] Libvirt LVM storage backend doesn't initialise filesystems of ephemeral disks
Public bug reported: N.B. This is from code inspection only. When creating an LVM-backed instance with an ephemeral disk, the ephemeral disk will not be initialised with the requested filesystem. This is because Image.cache() wraps the _create_ephemeral callback in fetch_func_sync, which will not call _create_ephemeral if the target already exists. Because the Lvm backend must create the disk first, this is never called. ** Affects: nova Importance: Undecided Assignee: Matthew Booth (mbooth-9) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1648109 Title: Libvirt LVM storage backend doesn't initialise filesystems of ephemeral disks Status in OpenStack Compute (nova): In Progress Bug description: N.B. This is from code inspection only. When creating an LVM-backed instance with an ephemeral disk, the ephemeral disk will not be initialised with the requested filesystem. This is because Image.cache() wraps the _create_ephemeral callback in fetch_func_sync, which will not call _create_ephemeral if the target already exists. Because the Lvm backend must create the disk first, this is never called. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1648109/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1623497] [NEW] Booting Ceph instance using Ceph glance doesn't resize root disk to flavor size
Public bug reported: This bug is purely from code inspection; I haven't replicated it on a running system. Change I46b5658efafe558dd6b28c9910fb8fde830adec0 added a resize check that the backing file exists before checking its size. Unfortunately we forgot that Rbd overrides get_disk_size(path), and ignores the path argument, which means it would previously not have failed even when the given path didn't exist. Additionally, the callback function passed to cache() by driver will also ignore its path argument, and therefore not write to the image cache, when cloning to a ceph instance from a ceph glance store (see the section starting if backend.SUPPORTS_CLONE in driver._create_and_inject_local_root). Consequently, when creating a ceph instance using a ceph glance store: 1. 'base' will not exist in the image cache 2. get_disk_size(base) will return the correct value anyway We broke this with change I46b5658efafe558dd6b28c9910fb8fde830adec0. ** Affects: nova Importance: Undecided Status: New ** Tags: newton-rc-potential ** Tags added: newton-rc-potential -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1623497 Title: Booting Ceph instance using Ceph glance doesn't resize root disk to flavor size Status in OpenStack Compute (nova): New Bug description: This bug is purely from code inspection; I haven't replicated it on a running system. Change I46b5658efafe558dd6b28c9910fb8fde830adec0 added a resize check that the backing file exists before checking its size. Unfortunately we forgot that Rbd overrides get_disk_size(path), and ignores the path argument, which means it would previously not have failed even when the given path didn't exist. Additionally, the callback function passed to cache() by driver will also ignore its path argument, and therefore not write to the image cache, when cloning to a ceph instance from a ceph glance store (see the section starting if backend.SUPPORTS_CLONE in driver._create_and_inject_local_root). Consequently, when creating a ceph instance using a ceph glance store: 1. 'base' will not exist in the image cache 2. get_disk_size(base) will return the correct value anyway We broke this with change I46b5658efafe558dd6b28c9910fb8fde830adec0. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1623497/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1619606] [NEW] snapshot_volume_backed races, could result in data corruption
Public bug reported: snapshot_volume_backed() in compute.API does not set a task_state during execution. However, in essence it does: if vm_state == ACTIVE: quiesce() snapshot() if vm_state == ACTIVE: unquiesce() There is no exclusion here, though, which means a user could do: quiesce() quiesce() snapshot() snapshot() unquiesce()--snapshot() now running after unquiesce -> corruption unquiesce() or: suspend() snapshot() NO QUIESCE (we're suspended) snapshot() resume() --snapshot() now running after resume -> corruption Same goes for stop/start. Note that snapshot_volume_backed() is a separate top-level entry point from snapshot(). snapshot() does not suffer from this problem, because it atomically sets the task state to IMAGE_SNAPSHOT_PENDING when running, which prevents the user from performing a concurrent operation on the instance. I suggest that snapshot_volume_backed() should do the same. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1619606 Title: snapshot_volume_backed races, could result in data corruption Status in OpenStack Compute (nova): New Bug description: snapshot_volume_backed() in compute.API does not set a task_state during execution. However, in essence it does: if vm_state == ACTIVE: quiesce() snapshot() if vm_state == ACTIVE: unquiesce() There is no exclusion here, though, which means a user could do: quiesce() quiesce() snapshot() snapshot() unquiesce()--snapshot() now running after unquiesce -> corruption unquiesce() or: suspend() snapshot() NO QUIESCE (we're suspended) snapshot() resume() --snapshot() now running after resume -> corruption Same goes for stop/start. Note that snapshot_volume_backed() is a separate top-level entry point from snapshot(). snapshot() does not suffer from this problem, because it atomically sets the task state to IMAGE_SNAPSHOT_PENDING when running, which prevents the user from performing a concurrent operation on the instance. I suggest that snapshot_volume_backed() should do the same. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1619606/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
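A toy illustration of the guard suggested at the end of bug 1619606 above: refuse to start unless the task state can move from "nothing in progress" to a pending state, mirroring what snapshot() does with IMAGE_SNAPSHOT_PENDING. In Nova the atomicity comes from a compare-and-update on the instance's database row; a thread lock and a dict stand in for that here, and the names are invented.

    import threading

    _state_lock = threading.Lock()

    def begin_snapshot(instance):
        # instance: dict-like with a 'task_state' entry.
        with _state_lock:
            if instance['task_state'] is not None:
                raise RuntimeError('instance busy: %s' % instance['task_state'])
            instance['task_state'] = 'image_snapshot_pending'

    def end_snapshot(instance):
        with _state_lock:
            instance['task_state'] = None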
[Yahoo-eng-team] [Bug 1597754] [NEW] Unable to boot instance using UML
Public bug reported: CAVEAT: This is from code inspection only. Change I931421ea moved the following snippet of code: if CONF.libvirt.virt_type == 'uml': libvirt_utils.chown(image('disk').path, 'root') from the bottom of _create_image to the top. The problem is, the new location is before the creation of the root disk. This means that on initial creation we will run libvirt_utils.chown on a path which hasn't been created yet, which will cause an exception. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1597754 Title: Unable to boot instance using UML Status in OpenStack Compute (nova): New Bug description: CAVEAT: This is from code inspection only. Change I931421ea moved the following snippet of code: if CONF.libvirt.virt_type == 'uml': libvirt_utils.chown(image('disk').path, 'root') from the bottom of _create_image to the top. The problem is, the new location is before the creation of the root disk. This means that on initial creation we will run libvirt_utils.chown on a path which hasn't been created yet, which will cause an exception. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1597754/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1594377] [NEW] resize does not resize ephemeral disks
Public bug reported: Nova resize does not resize ephemeral disks. I have tested this with the default qcow2 backend, but I expect it to be true for all backends. I have created 2 flavors: | OS-FLV-DISABLED:disabled | False | | OS-FLV-EXT-DATA:ephemeral | 1 | | disk | 1 | | extra_specs| {} | | id | test-1 | | name | test-1 | | os-flavor-access:is_public | True | | ram| 256| | rxtx_factor| 1.0| | swap | 1 | | vcpus | 1 | and: | OS-FLV-DISABLED:disabled | False | | OS-FLV-EXT-DATA:ephemeral | 2 | | disk | 2 | | extra_specs| {} | | id | test-2 | | name | test-2 | | os-flavor-access:is_public | True | | ram| 512| | rxtx_factor| 1.0| | swap | 2 | | vcpus | 2 | I boot an instance with flavor test-1 with: $ nova boot --flavor test-1 --image cirros foo It creates instance directory 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c containing (amongst non-disk files) disk, disk.eph0, disk.swap, and disk.config. disk.config is not relevant here. I check the sizes of each of these disks: instances]$ for disk in disk disk.eph0 disk.swap; do qemu-img info 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/$disk; done image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk file format: qcow2 virtual size: 1.0G (1073741824 bytes) disk size: 10M cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/1ba6fbdbe52377ff7e075c3317a48205ac6c28c4 Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk.eph0 file format: qcow2 virtual size: 1.0G (1073741824 bytes) disk size: 324K cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/ephemeral_1_40d1d2c Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk.swap file format: qcow2 virtual size: 1.0M (1048576 bytes) disk size: 196K cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/swap_1 Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false I resize foo with: $ nova resize foo test-2 --poll I check the sizes again: instances]$ for disk in disk disk.eph0 disk.swap; do qemu-img info 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/$disk; done image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk file format: qcow2 virtual size: 2.0G (2147483648 bytes) disk size: 26M cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/1ba6fbdbe52377ff7e075c3317a48205ac6c28c4 Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk.eph0 file format: qcow2 virtual size: 1.0G (1073741824 bytes) disk size: 384K cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/ephemeral_1_40d1d2c Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false image: 3fab0565-2eb1-4fd9-933b-4e1d80b1b18c/disk.swap file format: qcow2 virtual size: 2.0M (2097152 bytes) disk size: 196K cluster_size: 65536 backing file: /home/mbooth/data/nova/instances/_base/swap_2 Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false Note that the root and swap disks have been resized, but the ephemeral disk has not. This is caused by 2 bugs. Firstly, there is some code in finish_migration in the libvirt driver which purports to resize disks. 
This code is actually a no-op, because disk resizing has already been done by _create_image, which called cache() with the correct size, and therefore did the resizing. However, as noted in a comment, the no-op code would not have covered our ephemeral disk anyway, as it only loops over 'disk.local', which is the legacy disk naming. Secondly, _create_image does not iterate over ephemeral disks at all when called by finish_migration, because finish_migration explicitly passes block_device_info=None. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1594377 Title: resize does not resize ephemeral disks Status in OpenStack Compute (nova): New Bug description: Nova resize does not resize ephemeral disks. I have tested this with the default qcow2 backend, but I expect it to be true for all backends. I have created 2 flavors: | OS-FLV-DISABLED:disabled | Fal
[Yahoo-eng-team] [Bug 1593155] [NEW] over_committed_disk_size is wrong for sparse flat files
Public bug reported: The libvirt driver creates flat disks as sparse by default. However, it always returns over_committed_disk_size=0 for flat disks in _get_instance_disk_info(). This incorrect data ends up being reported to the scheduler in the libvirt driver's get_available_resource() via _get_disk_over_committed_size_total(). _get_instance_disk_info() should use allocated blocks, not file size, when calculating over_committed_disk_size for flat disks. ** Affects: nova Importance: Undecided Status: New ** Tags: libvirt low-hanging-fruit -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1593155 Title: over_committed_disk_size is wrong for sparse flat files Status in OpenStack Compute (nova): New Bug description: The libvirt driver creates flat disks as sparse by default. However, it always returns over_committed_disk_size=0 for flat disks in _get_instance_disk_info(). This incorrect data ends up being reported to the scheduler in the libvirt driver's get_available_resource() via _get_disk_over_committed_size_total(). _get_instance_disk_info() should use allocated blocks, not file size, when calculating over_committed_disk_size for flat disks. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1593155/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
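A sketch of the calculation bug 1593155 above asks for: compare the virtual size against blocks actually allocated on disk rather than against the file length, which for a sparse flat file equals the virtual size. Names are illustrative.

    import os

    def over_committed_bytes(path, virt_disk_size):
        # st_blocks is in 512-byte units, so this is the space the sparse
        # file really occupies on the filesystem.
        allocated = os.stat(path).st_blocks * 512
        return max(0, virt_disk_size - allocated)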
[Yahoo-eng-team] [Bug 1590693] Re: libvirt's use of driver.get_instance_disk_info() is generally problematic
This was intended to be a low hanging fruit bug, but it doesn't meet the criteria. Closing, as it has no other purpose. ** Changed in: nova Status: Incomplete => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1590693 Title: libvirt's use of driver.get_instance_disk_info() is generally problematic Status in OpenStack Compute (nova): Invalid Bug description: The nova.virt.driver 'interface' defines a get_instance_disk_info method, which is called by compute manager to get disk info during live migration to get the source hypervisor's internal representation of disk info and pass it directly to the target hypervisor over rpc. To compute manager this is an opaque blob of stuff which only the driver understands, which is presumably why json was chosen. There are a couple of problems with it. This is a useful method within the libvirt driver, which uses it fairly liberally. However, the method returns a json blob. Every use of it internal to the libvirt driver first json encodes it in get_instance_disk_info, then immediately decodes it again, which is inefficient... except 2 uses of it in migrate_disk_and_power_off and check_can_live_migrate_source, which don't decode it and assume it's a dict. These are both broken, which presumably means something relating to migration of volume-backed instances is broken. The libvirt driver should not use this internally. We can have a wrapper method to do the json encoding for compute manager, and internally use the unencoded data directly. Secondly, we're passing an unversioned blob of data over rpc. We should probably turn this data into a versioned object. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1590693/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1590693] [NEW] libvirt's use of driver.get_instance_disk_info() is generally problematic
Public bug reported: The nova.virt.driver 'interface' defines a get_instance_disk_info method, which is called by compute manager to get disk info during live migration to get the source hypervisor's internal representation of disk info and pass it directly to the target hypervisor over rpc. To compute manager this is an opaque blob of stuff which only the driver understands, which is presumably why json was chosen. There are a couple of problems with it. This is a useful method within the libvirt driver, which uses it fairly liberally. However, the method returns a json blob. Every use of it internal to the libvirt driver first json encodes it in get_instance_disk_info, then immediately decodes it again, which is inefficient... except 2 uses of it in migrate_disk_and_power_off and check_can_live_migrate_source, which don't decode it and assume it's a dict. These are both broken, which presumably means something relating to migration of volume-backed instances is broken. The libvirt driver should not use this internally. We can have a wrapper method to do the json encoding for compute manager, and internally use the unencoded data data directly. Secondly, we're passing an unversioned blob of data over rpc. We should probably turn this data into a versioned object. ** Affects: nova Importance: Undecided Status: New ** Tags: low-hanging-fruit ** Description changed: The nova.virt.driver 'interface' defines a get_instance_disk_info method, which is called by compute manager to get disk info during live migration to get the source hypervisor's internal representation of disk info and pass it directly to the target hypervisor over rpc. To compute manager this is an opaque blob of stuff which only the driver understands, which is presumably why json was chosen. There are a couple of problems with it. This is a useful method within the libvirt driver, which uses it fairly liberally. However, the method returns a json blob. Every use of it internal to the libvirt driver first json encodes it in get_instance_disk_info, then immediately decodes it again, which is - efficient. Except 2 uses of it in migrate_disk_and_power_off and + inefficient... except 2 uses of it in migrate_disk_and_power_off and check_can_live_migrate_source, which don't decode it and assume it's a dict. These are both broken, which presumably means something relating to migration of volume-backed instances is broken. The libvirt driver should not use this internally. We can have a wrapper method to do the json encoding for compute manager, and internally use the unencoded data data directly. Secondly, we're passing an unversioned blob of data over rpc. We should probably turn this data into a versioned object. -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1590693 Title: libvirt's use of driver.get_instance_disk_info() is generally problematic Status in OpenStack Compute (nova): New Bug description: The nova.virt.driver 'interface' defines a get_instance_disk_info method, which is called by compute manager to get disk info during live migration to get the source hypervisor's internal representation of disk info and pass it directly to the target hypervisor over rpc. To compute manager this is an opaque blob of stuff which only the driver understands, which is presumably why json was chosen. There are a couple of problems with it. 
This is a useful method within the libvirt driver, which uses it fairly liberally. However, the method returns a json blob. Every use of it internal to the libvirt driver first json encodes it in get_instance_disk_info, then immediately decodes it again, which is inefficient... except 2 uses of it in migrate_disk_and_power_off and check_can_live_migrate_source, which don't decode it and assume it's a dict. These are both broken, which presumably means something relating to migration of volume-backed instances is broken. The libvirt driver should not use this internally. We can have a wrapper method to do the json encoding for compute manager, and internally use the unencoded data directly. Secondly, we're passing an unversioned blob of data over rpc. We should probably turn this data into a versioned object. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1590693/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
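A sketch of the split suggested in bug 1590693 above: an internal method returning plain Python structures for use inside the driver, and a thin wrapper that JSON-encodes only at the compute manager / RPC boundary. The field names are placeholders, not the driver's actual schema.

    import json

    def _get_instance_disk_info(disks):
        # Internal form: a plain list of dicts, usable directly by driver code.
        return [{'path': d['path'], 'virt_disk_size': d['size']} for d in disks]

    def get_instance_disk_info(disks):
        # External form: the opaque JSON blob the compute manager passes over
        # RPC. Only this wrapper should encode.
        return json.dumps(_get_instance_disk_info(disks))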
[Yahoo-eng-team] [Bug 1581382] [NEW] nova migration-list --status returns no results
Public bug reported: 'nova migration-list --status <status>' returns no results. On further investigation, this is because this status is passed down to db.migration_get_all_by_filters() as unicode, which doesn't handle it correctly. ** Affects: nova Importance: Undecided Assignee: Matthew Booth (mbooth-9) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1581382 Title: nova migration-list --status returns no results Status in OpenStack Compute (nova): In Progress Bug description: 'nova migration-list --status <status>' returns no results. On further investigation, this is because this status is passed down to db.migration_get_all_by_filters() as unicode, which doesn't handle it correctly. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1581382/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1548884] [NEW] libvirt driver converts config drives to qcow2 during resize/migrate
Public bug reported: In finish_migration(), after resize the driver does: if info['type'] == 'raw' and CONF.use_cow_images: self._disk_raw_to_qcow2(info['path']) This ensures that if use_cow_images is set to True, all raw disks will be converted to qcow2. This includes config disks, which isn't the intention here. A second part of this bug is that config disks are then subsequently overwritten, which also doesn't seem to be intentional. This is why this hasn't previously come to light. It is currently just very efficient: we copy the config disk, convert it to qcow2, then overwrite it with a new one. We should stop after the original copy. This code was added here: https://review.openstack.org/#/c/78626/ . I have read the change, the bug it related to, spoken to the original author, and one of the core reviewers. None of us could work out why the above code was there. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1548884 Title: libvirt driver converts config drives to qcow2 during resize/migrate Status in OpenStack Compute (nova): New Bug description: In finish_migration(), after resize the driver does: if info['type'] == 'raw' and CONF.use_cow_images: self._disk_raw_to_qcow2(info['path']) This ensures that if use_cow_images is set to True, all raw disks will be converted to qcow2. This includes config disks, which isn't the intention here. A second part of this bug is that config disks are then subsequently overwritten, which also doesn't seem to be intentional. This is why this hasn't previously come to light. It is currently just very efficient: we copy the config disk, convert it to qcow2, then overwrite it with a new one. We should stop after the original copy. This code was added here: https://review.openstack.org/#/c/78626/ . I have read the change, the bug it related to, spoken to the original author, and one of the core reviewers. None of us could work out why the above code was there. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1548884/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
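One hedged reading of the first half of bug 1548884 above is that the conversion loop should simply leave config drives alone; the sketch below shows that, with the surrounding driver code reduced to a plain function and invented names. Whether skipping the config drive, or removing the conversion entirely as the second half of the report implies, is the right fix is not settled here.

    def convert_raw_disks(disk_info, use_cow_images, raw_to_qcow2):
        # disk_info: list of dicts with 'type' and 'path';
        # raw_to_qcow2: callable performing the conversion.
        for info in disk_info:
            if not (use_cow_images and info['type'] == 'raw'):
                continue
            if info['path'].endswith('disk.config'):
                # Config drives are intentionally raw; converting them (and
                # then regenerating them) is wasted work at best.
                continue
            raw_to_qcow2(info['path'])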
[Yahoo-eng-team] [Bug 1547577] [NEW] ephemeral and swap disks on a single compute share UUIDs
Public bug reported: The libvirt driver caches the output of mkfs and mkswap in the image cache. One consequence of this is that all ephemeral disks of a particular size and format on a single compute will have the same UUID. The same applies to swap disks. These identifiers are intended to be universally unique, but they are not. This is unlikely to be an issue in practice for ephemeral disks, as they will never be shared; however, it is a wart. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1547577 Title: ephemeral and swap disks on a single compute share UUIDs Status in OpenStack Compute (nova): New Bug description: The libvirt driver caches the output of mkfs and mkswap in the image cache. One consequence of this is that all ephemeral disks of a particular size and format on a single compute will have the same UUID. The same applies to swap disks. These identifiers are intended to be universally unique, but they are not. This is unlikely to be an issue in practice for ephemeral disks, as they will never be shared; however, it is a wart. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1547577/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1547582] [NEW] Block migrating an ephemeral or swap disk can result in filesystem corruption when using qcow2
Public bug reported: The libvirt driver uses common backing files for ephemeral and swap disks. These are generated on the local compute host by running mkfs or mkswap as appropriate. The output of these files for a particular size and format is stored in the image cache on the compute host which ran it. When all things are equal, 2 runs of mkfs or mkswap are guaranteed never to produce identical output, because at the very least they have different uuids. When you also consider the potential for different patch levels on different compute hosts, the potential for other differences is also significant. When block migrating an ephemeral disk, the libvirt driver copies the 'overlay' qcow2 from source to dest. Assuming that some other instance on dest also has a similar ephemeral disk, the backing file will already exist on dest. However, it is guaranteed not to be the same as the disk's original backing file for the reasons above. If this works currently, it is either by luck, or because the tiny amount of metadata originally written by mkfs or mkswap is likely to have been overwritten if it has been in use for any amount of time. The libvirt driver should not cache the output of mkfs and mkswap. The space and performance benefits are negligible, but it introduces the potential for data corruption. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1547582 Title: Block migrating an ephemeral or swap disk can result in filesystem corruption when using qcow2 Status in OpenStack Compute (nova): New Bug description: The libvirt driver uses common backing files for ephemeral and swap disks. These are generated on the local compute host by running mkfs or mkswap as appropriate. The output of these files for a particular size and format is stored in the image cache on the compute host which ran it. When all things are equal, 2 runs of mkfs or mkswap are guaranteed never to produce identical output, because at the very least they have different uuids. When you also consider the potential for different patch levels on different compute hosts, the potential for other differences is also significant. When block migrating an ephemeral disk, the libvirt driver copies the 'overlay' qcow2 from source to dest. Assuming that some other instance on dest also has a similar ephemeral disk, the backing file will already exist on dest. However, it is guaranteed not to be the same as the disk's original backing file for the reasons above. If this works currently, it is either by luck, or because the tiny amount of metadata originally written by mkfs or mkswap is likely to have been overwritten if it has been in use for any amount of time. The libvirt driver should not cache the output of mkfs and mkswap. The space and performance benefits are negligible, but it introduces the potential for data corruption. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1547582/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1543181] [NEW] Raw and qcow2 disks are never preallocated on systems with newer util-linux
Public bug reported: imagebackend.Image._can_fallocate tests if fallocate works by running the following command: fallocate -n -l 1 <path>.fallocate_test where <path> exists, but <path>.fallocate_test does not. This command line is copied from the code which actually fallocates a disk. However, while this works on systems with an older version of util-linux, such as RHEL 7, it does not work on systems with a newer version of util-linux, such as Fedora 23. The result of this is that this test will always fail, and preallocation with fallocate will be erroneously disabled. On RHEL 7, which has util-linux-2.23.2-26.el7.x86_64 on my system: $ fallocate -n -l 1 foo $ ls -lh foo -rw-r--r--. 1 mbooth mbooth 0 Feb 8 15:33 foo $ du -sh foo 4.0K foo On Fedora 23, which has util-linux-2.27.1-2.fc23.x86_64 on my system: $ fallocate -n -l 1 foo fallocate: cannot open foo: No such file or directory The F23 behaviour actually makes sense. From the fallocate man page: -n, --keep-size Do not modify the apparent length of the file. This doesn't make any sense if the file doesn't exist. That is, the -n option makes sense when preallocating an existing disk image, but not when testing if fallocate works on a given filesystem and the test file doesn't already exist. You could also reasonably argue that util-linux probably shouldn't be breaking an interface like this, even when misused. However, that's a separate discussion. We shouldn't be misusing it. ** Affects: nova Importance: Undecided Assignee: Matthew Booth (mbooth-9) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1543181 Title: Raw and qcow2 disks are never preallocated on systems with newer util-linux Status in OpenStack Compute (nova): In Progress Bug description: imagebackend.Image._can_fallocate tests if fallocate works by running the following command: fallocate -n -l 1 <path>.fallocate_test where <path> exists, but <path>.fallocate_test does not. This command line is copied from the code which actually fallocates a disk. However, while this works on systems with an older version of util-linux, such as RHEL 7, it does not work on systems with a newer version of util-linux, such as Fedora 23. The result of this is that this test will always fail, and preallocation with fallocate will be erroneously disabled. On RHEL 7, which has util-linux-2.23.2-26.el7.x86_64 on my system: $ fallocate -n -l 1 foo $ ls -lh foo -rw-r--r--. 1 mbooth mbooth 0 Feb 8 15:33 foo $ du -sh foo 4.0K foo On Fedora 23, which has util-linux-2.27.1-2.fc23.x86_64 on my system: $ fallocate -n -l 1 foo fallocate: cannot open foo: No such file or directory The F23 behaviour actually makes sense. From the fallocate man page: -n, --keep-size Do not modify the apparent length of the file. This doesn't make any sense if the file doesn't exist. That is, the -n option makes sense when preallocating an existing disk image, but not when testing if fallocate works on a given filesystem and the test file doesn't already exist. You could also reasonably argue that util-linux probably shouldn't be breaking an interface like this, even when misused. However, that's a separate discussion. We shouldn't be misusing it.
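A minimal sketch of a probe that avoids the misuse by creating the test file before invoking fallocate -n (illustrative only; not the actual Nova fix):

    import os
    import subprocess

    def can_fallocate(base_dir):
        # Probe whether fallocate works on the filesystem holding base_dir.
        test_file = os.path.join(base_dir, '.fallocate_test')
        try:
            # Create the file first: -n (--keep-size) is only meaningful
            # for a file that already exists.
            open(test_file, 'a').close()
            subprocess.check_call(['fallocate', '-n', '-l', '1', test_file])
            return True
        except (OSError, subprocess.CalledProcessError):
            return False
        finally:
            if os.path.exists(test_file):
                os.unlink(test_file)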
To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1543181/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1392527] Re: [OSSA 2015-017] Deleting instance while resize instance is running leads to unuseable compute nodes (CVE-2015-3280)
** Changed in: nova Status: Fix Released => New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1392527 Title: [OSSA 2015-017] Deleting instance while resize instance is running leads to unuseable compute nodes (CVE-2015-3280) Status in OpenStack Compute (nova): New Status in OpenStack Compute (nova) juno series: In Progress Status in OpenStack Compute (nova) kilo series: Fix Committed Status in OpenStack Security Advisory: Fix Committed Bug description: Steps to reproduce: 1) Create a new instance, waiting until its status goes to the ACTIVE state 2) Call resize API 3) Delete the instance immediately after the task_state is “resize_migrated” or vm_state is “resized” 4) Repeat 1 through 3 in a loop I have kept the attached program running for 4 hours; all instances created are deleted (nova list returns an empty list), but I noticed instance directories with the name "<instance_uuid>_resize" are not deleted from the instance path of the compute nodes (mainly from the source compute nodes where the instance was running before resize). If I keep this program running for a couple more hours (depending on the number of compute nodes), then it completely uses the entire disk of the compute nodes (based on the disk_allocation_ratio parameter value). Later, nova scheduler doesn’t select these compute nodes for launching new vms and starts reporting error "No valid hosts found". Note: Even the periodic tasks don't clean up these orphan instance directories from the instance path. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1392527/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1462957] [NEW] VMware driver cannot report non-contiguous resources to the scheduler
Public bug reported: A VMware hypervisor can have various types of non-contiguous resource. This includes: * CPUs and memory, assuming a cluster has more than 1 member. * Storage space, if a (VMware) host has more than 1 datastore. Focussing on the latter, if a host has 5 datastores, each with 50GB of free space, we currently report the largest contiguous free space to the scheduler: 50GB. This means that the scheduler knows it can allocate an instance with a 50GB block device, but until the host stats are updated it will not allow subsequent instances to be scheduled there. We could alternatively report 250GB of free space, but would risk the scheduler repeatedly sending us a request for an instance with a 100GB block device, which we cannot fulfil. Without the ability to represent non-contiguous resources we are left choosing between 2 suboptimal options. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1462957 Title: VMware driver cannot report non-contiguous resources to the scheduler Status in OpenStack Compute (Nova): New Bug description: A VMware hypervisor can have various types of non-contiguous resource. This includes: * CPUs and memory, assuming a cluster has more than 1 member. * Storage space, if a (VMware) host has more than 1 datastore. Focussing on the latter, if a host has 5 datastores, each with 50GB of free space, we currently report the largest contiguous free space to the scheduler: 50GB. This means that the scheduler knows it can allocate an instance with a 50GB block device, but until the host stats are updated it will not allow subsequent instances to be scheduled there. We could alternatively report 250GB of free space, but would risk the scheduler repeatedly sending us a request for an instance with a 100GB block device, which we cannot fulfil. Without the ability to represent non-contiguous resources we are left choosing between 2 suboptimal options. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1462957/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
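For illustration, the two aggregates the driver can currently choose between (numbers from the example above):

    datastore_free_gb = [50, 50, 50, 50, 50]     # five datastores, 50GB free each

    largest_contiguous = max(datastore_free_gb)  # 50GB: the host looks nearly full,
                                                 # so it is under-packed
    total_free = sum(datastore_free_gb)          # 250GB: the scheduler may send a
                                                 # 100GB request that no single
                                                 # datastore can actually hold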
[Yahoo-eng-team] [Bug 1430223] [NEW] Live migration with ceph fails to cleanup instance directory on failure
Public bug reported: When doing a live migration of an instance using ceph for shared storage, if the migration fails then the instance directory will not be cleaned up on the destination host. The next attempt to do the live migration will fail with DestinationDiskExists, but will cleanup the directory. A simple way to test this is to setup a working system which allows a ceph instance to be live migrated, then delete the relevant ceph secret from libvirt on one of the hosts. Live migration to that host will fail, triggering this bug. ** Affects: nova Importance: Undecided Status: New ** Description changed: When doing a live migration of an instance using ceph for shared storage, if the migration fails then the instance directory will not be cleaned up on the destination host. The next attempt to do the live migration will fail with DestinationDiskExists, but will cleanup the directory. + + A simple way to test this is to setup a working system which allows a + ceph instance to be live migrated, then delete the relevant ceph secret + from libvirt on one of the hosts. Live migration to that host will fail, + triggering this bug. -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1430223 Title: Live migration with ceph fails to cleanup instance directory on failure Status in OpenStack Compute (Nova): New Bug description: When doing a live migration of an instance using ceph for shared storage, if the migration fails then the instance directory will not be cleaned up on the destination host. The next attempt to do the live migration will fail with DestinationDiskExists, but will cleanup the directory. A simple way to test this is to setup a working system which allows a ceph instance to be live migrated, then delete the relevant ceph secret from libvirt on one of the hosts. Live migration to that host will fail, triggering this bug. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1430223/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1416000] Re: VMware: write error lost while transferring volume
** Also affects: oslo.vmware Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1416000 Title: VMware: write error lost while transferring volume Status in Cinder: New Status in OpenStack Compute (Nova): New Status in Oslo VMware library for OpenStack projects: New Bug description: I'm running the following command: cinder create --image-id a24f216f-9746-418e-97f9-aebd7fa0e25f 1 The write side of the data transfer (a VMwareHTTPWriteFile object) returns an error in write() which I haven't debugged yet. However, this error is never reported to the user, although it does show up in the logs. The effect is that the transfer sits in the 'downloading' state until the 7200 second timeout, when it reports the timeout. The reason is that the code which waits on transfer completion (in start_transfer) does: try: # Wait on the read and write events to signal their end read_event.wait() write_event.wait() except (timeout.Timeout, Exception) as exc: ... That is, it waits for the read thread to signal completion via read_event before checking write_event. However, because write_thread has died, read_thread is blocking and will never signal completion. You can demonstrate this by swapping the order. If you wait for the write first it will die immediately, which is what you want. However, that's not right either because now you're missing read errors. Ideally this code needs to be able to notice an error at either end and stop immediately. To manage notifications about this bug go to: https://bugs.launchpad.net/cinder/+bug/1416000/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
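One hedged sketch of the kind of change suggested in the last paragraph: instead of waiting on the reader and then the writer, have both sides report to a shared queue so a failure at either end surfaces immediately. This is illustrative stdlib Python, not the actual oslo.vmware code:

    import queue
    import threading

    def start_transfer(read_fn, write_fn, timeout=7200):
        # Both sides report completion (or an exception) on a shared queue,
        # so an error at either end is noticed immediately instead of only
        # after the other side finishes.
        results = queue.Queue()

        def run(fn, name):
            try:
                fn()
                results.put((name, None))
            except Exception as exc:  # broad for illustration only
                results.put((name, exc))

        for name, fn in (('read', read_fn), ('write', write_fn)):
            threading.Thread(target=run, args=(fn, name), daemon=True).start()

        for _ in range(2):
            name, error = results.get(timeout=timeout)
            if error is not None:
                raise error  # report the failing side to the caller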
[Yahoo-eng-team] [Bug 1412436] [NEW] Race in instance_create with security_group_destroy
Public bug reported: There is a race in instance_create between fetching security groups (returned by _security_group_get_by_names) and adding them to the instance. We have no guarantee that they have not been deleted in the meantime. The result is currently that the SecurityGroupInstanceAssociation is created, pointing to the deleted SecurityGroup. This is different to the result of deleting the SecurityGroup afterwards, when both SecurityGroupInstanceAssociation and SecurityGroup are marked deleted. It is also different to the result of deleting the SecurityGroup before, which is to raise an error. While this intermediate state doesn't appear to cause an immediate problem, I feel it would be likely to result in unexpected behaviour at some point in the future, probably during a datamodel upgrade. My preference would be to cause it to fail, as that feels intuitively to me to be the most useful response to the end user (they have just requested an instance with a security group, but the returned instance already does not have that security group). However, either behaviour would be correct IMO. I suspect the failure behaviour would be harder to achieve in practice. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1412436 Title: Race in instance_create with security_group_destroy Status in OpenStack Compute (Nova): New Bug description: There is a race in instance_create between fetching security groups (returned by _security_group_get_by_names) and adding them to the instance. We have no guarantee that they have not been deleted in the meantime. The result is currently that the SecurityGroupInstanceAssociation is created, pointing to the deleted SecurityGroup. This is different to the result of deleting the SecurityGroup afterwards, when both SecurityGroupInstanceAssociation and SecurityGroup are marked deleted. It is also different to the result of deleting the SecurityGroup before, which is to raise an error. While this intermediate state doesn't appear to cause an immediate problem, I feel it would be likely to result in unexpected behaviour at some point in the future, probably during a datamodel upgrade. My preference would be to cause it to fail, as that feels intuitively to me to be the most useful response to the end user (they have just requested an instance with a security group, but the returned instance already does not have that security group). However, either behaviour would be correct IMO. I suspect the failure behaviour would be harder to achieve in practice. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1412436/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1409024] [NEW] DNSDomain.register_for_zone races
Public bug reported: 2 simultaneous calls to DNSDomain.register_for_zone or DNSDomain.register_for_project will race. The winner is undefined. Consequently, the caller has no way of knowing if the DNSDomain is appropriately registered following a call. register_for_zone or register_for_project will not currently generate an error in this case. I can think of 2 ways to resolve this: 1. Assert that only an unregistered domain can be registered. Attempting to register a registered domain is an error. This would be a semantic change to the existing APIs. 2. Create new APIs which additionally take the expected current registration, and fail if it is not as expected. Deprecate the existing APIs. I favour the former. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1409024 Title: DNSDomain.register_for_zone races Status in OpenStack Compute (Nova): New Bug description: 2 simultaneous calls to DNSDomain.register_for_zone or DNSDomain.register_for_project will race. The winner is undefined. Consequently, the caller has no way of knowing if the DNSDomain is appropriately registered following a call. register_for_zone or register_for_project will not currently generate an error in this case. I can think of 2 ways to resolve this: 1. Assert that only an unregistered domain can be registered. Attempting to register a registered domain is an error. This would be a semantic change to the existing APIs. 2. Create new APIs which additionally take the expected current registration, and fail if it is not as expected. Deprecate the existing APIs. I favour the former. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1409024/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1388095] [NEW] VMware fake driver returns invalid search results due to incorrect use of lstrip()
Public bug reported: _search_ds in the fake driver does: path = file.lstrip(dname).split('/') The intention is to remove a prefix of dname from the beginning of file, but this actually removes all instances of all characters in dname from the left of file. ** Affects: nova Importance: Undecided Status: New ** Tags: vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1388095 Title: VMware fake driver returns invalid search results due to incorrect use of lstrip() Status in OpenStack Compute (Nova): New Bug description: _search_ds in the fake driver does: path = file.lstrip(dname).split('/') The intention is to remove a prefix of dname from the beginning of file, but this actually removes all instances of all characters in dname from the left of file. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1388095/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
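To illustrate the difference (the dname and path values here are made up; str.lstrip() treats its argument as a set of characters, not as a prefix):

    dname = 'fake_ds/'
    # Looks correct when the next character happens not to occur in dname:
    assert 'fake_ds/image.vmdk'.lstrip(dname) == 'image.vmdk'
    # ...but lstrip() removes *any* leading characters found in dname:
    assert 'fake_ds/disk.vmdk'.lstrip(dname) == 'isk.vmdk'

    # Prefix removal needs an explicit check (or str.removeprefix() on
    # Python 3.9+):
    def strip_prefix(s, prefix):
        return s[len(prefix):] if s.startswith(prefix) else s

    assert strip_prefix('fake_ds/disk.vmdk', dname) == 'disk.vmdk'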
[Yahoo-eng-team] [Bug 1384309] [NEW] VMware: New permission required: Extension.Register
Public bug reported: Change I1046576c448704841ae8e1800b8390e947b0d457 uses ExtensionManager.RegisterExtension, which requires the additional permission Extension.Register on the vSphere server. Unfortunately we missed the DocImpact in review. This needs to be added to the relevant docs. The impact of not having this permission is that n-cpu fails to start with the error: WebFault: Server raised fault: 'Permission to perform this operation was denied.' ** Affects: nova Importance: Undecided Status: New ** Tags: documentation vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1384309 Title: VMware: New permission required: Extension.Register Status in OpenStack Compute (Nova): New Bug description: Change I1046576c448704841ae8e1800b8390e947b0d457 uses ExtensionManager.RegisterExtension, which requires the additional permission Extension.Register on the vSphere server. Unfortunately we missed the DocImpact in review. This needs to be added to the relevant docs. The impact of not having this permission is that n-cpu fails to start with the error: WebFault: Server raised fault: 'Permission to perform this operation was denied.' To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1384309/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1381061] [NEW] VMware: ESX hosts must not be externally routable
Public bug reported: Change I70fd7d3ee06040d6ce49d93a4becd9cbfdd71f78 removed passwords from VNC hosts. This change is fine because we proxy the VNC connection and do access control at the proxy, but it assumes that ESX hosts are not externally routable. In a non-OpenStack VMware deployment, accessing a VM's console requires the end user to have a direct connection to an ESX host. This leads me to believe that many VMware administrators may leave ESX hosts externally routable if not specifically directed otherwise. The above change makes a design decision which requires ESX hosts not to be externally routable. There may also be other reasons. We need to ensure that this is very clearly documented. This may already be documented, btw, but I don't know how our documentation is organised, and would prefer that somebody more familiar with it assures themselves that this has been given appropriate weight. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1381061 Title: VMware: ESX hosts must not be externally routable Status in OpenStack Compute (Nova): New Bug description: Change I70fd7d3ee06040d6ce49d93a4becd9cbfdd71f78 removed passwords from VNC hosts. This change is fine because we proxy the VNC connection and do access control at the proxy, but it assumes that ESX hosts are not externally routable. In a non-OpenStack VMware deployment, accessing a VM's console requires the end user to have a direct connection to an ESX host. This leads me to believe that many VMware administrators may leave ESX hosts externally routable if not specifically directed otherwise. The above change makes a design decision which requires ESX hosts not to be externally routable. There may also be other reasons. We need to ensure that this is very clearly documented. This may already be documented, btw, but I don't know how our documentation is organised, and would prefer that somebody more familiar with it assures themselves that this has been given appropriate weight. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1381061/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1375688] [NEW] test failure in ShelveComputeManagerTestCase.test_unshelve
Public bug reported: Full logs here: http://logs.openstack.org/02/124402/3/check/gate-nova- python26/1d3512b/ Seen: 2014-09-26 15:20:46.795 | ExpectedMethodCallsError: Verify: Expected methods never called: 2014-09-26 15:20:46.796 | 0. _notify_about_instance_usage.__call__(, Instance(access_ip_v4=None,access_ip_v6=None,architecture='x86_64',auto_disk_config=False,availability_zone=None,cell_name=None,cleaned=False,config_drive=None,created_at=2014-09-26T15:09:38Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,disable_terminate=False,display_description=None,display_name=None,ephemeral_gb=0,ephemeral_key_uuid=None,fault=,host='fake-mini',hostname=None,id=1,image_ref='fake-image-ref',info_cache=,instance_type_id=2,kernel_id=None,key_data=None,key_name=None,launch_index=None,launched_at=2014-09-26T15:09:39Z,launched_on=None,locked=False,locked_by=None,memory_mb=0,metadata={},node='fakenode1',numa_topology=,os_type='Linux',pci_devices=,power_state=123,progress=None,project_id='fake',ramdisk_id=None,reservation_id='r-fakeres',root_device_name=None,root_gb=0,scheduled_at=None,security_gro ups=,shutdown_terminate=False,system_metadata={instance_type_ephemeral_gb='0',instance_type_flavorid='1',instance_type_id='2',instance_type_memory_mb='512',instance_type_name='m1.tiny',instance_type_root_gb='1',instance_type_rxtx_factor='1.0',instance_type_swap='0',instance_type_vcpu_weight=None,instance_type_vcpus='1'},task_state=None,terminated_at=None,updated_at=2014-09-26T15:09:38Z,user_data=None,user_id='fake',uuid=cb73da32-e73e-4f52-a332-f66e9752ac9d,vcpus=0,vm_mode=None,vm_state='active'), 'unshelve.end') -> None and: 2014-09-26 15:20:46.800 | UnexpectedMethodCallError: Unexpected method call instance_update_and_get_original.__call__(, 'cb73da32-e73e-4f52-a332-f66e9752ac9d', {'vm_state': u'active', 'expected_task_state': 'spawning', 'key_data': None, 'host': u'fake-mini', 'image_ref': u'fake-image-ref', 'power_state': 123, 'auto_disk_config': False, 'task_state': None, 'launched_at': datetime.datetime(2014, 9, 26, 15, 9, 39, 224533, tzinfo=)}, columns_to_join=['metadata', 'system_metadata'], update_cells=False) -> None My initial reaction is that the mox error messages don't contain enough information to diagnose the problem, or at least they certainly don't make it obvious to the uninitiated, due to the missing expected values. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1375688 Title: test failure in ShelveComputeManagerTestCase.test_unshelve Status in OpenStack Compute (Nova): New Bug description: Full logs here: http://logs.openstack.org/02/124402/3/check/gate-nova- python26/1d3512b/ Seen: 2014-09-26 15:20:46.795 | ExpectedMethodCallsError: Verify: Expected methods never called: 2014-09-26 15:20:46.796 | 0. 
_notify_about_instance_usage.__call__(, Instance(access_ip_v4=None,access_ip_v6=None,architecture='x86_64',auto_disk_config=False,availability_zone=None,cell_name=None,cleaned=False,config_drive=None,created_at=2014-09-26T15:09:38Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,disable_terminate=False,display_description=None,display_name=None,ephemeral_gb=0,ephemeral_key_uuid=None,fault=,host='fake-mini',hostname=None,id=1,image_ref='fake-image-ref',info_cache=,instance_type_id=2,kernel_id=None,key_data=None,key_name=None,launch_index=None,launched_at=2014-09-26T15:09:39Z,launched_on=None,locked=False,locked_by=None,memory_mb=0,metadata={},node='fakenode1',numa_topology=,os_type='Linux',pci_devices=,power_state=123,progress=None,project_id='fake',ramdisk_id=None,reservation_id='r-fakeres',root_device_name=None,root_gb=0,scheduled_at=None,security_g roups=,shutdown_terminate=False,system_metadata={instance_type_ephemeral_gb='0',instance_type_flavorid='1',instance_type_id='2',instance_type_memory_mb='512',instance_type_name='m1.tiny',instance_type_root_gb='1',instance_type_rxtx_factor='1.0',instance_type_swap='0',instance_type_vcpu_weight=None,instance_type_vcpus='1'},task_state=None,terminated_at=None,updated_at=2014-09-26T15:09:38Z,user_data=None,user_id='fake',uuid=cb73da32-e73e-4f52-a332-f66e9752ac9d,vcpus=0,vm_mode=None,vm_state='active'), 'unshelve.end') -> None and: 2014-09-26 15:20:46.800 | UnexpectedMethodCallError: Unexpected method call instance_update_and_get_original.__call__(, 'cb73da32-e73e-4f52-a332-f66e9752ac9d', {'vm_state': u'active', 'expected_task_state': 'spawning', 'key_data': None, 'host': u'fake-mini', 'image_ref': u'fake-image-ref', 'power_state': 123, 'auto_disk_config': False, 'task_state': None, 'launched_at': datetime.datetime(2014, 9, 26, 15, 9, 39, 224533, tzinfo=)}, columns_to_join=['metadata', 'system_metadata'], update_cells=F
[Yahoo-eng-team] [Bug 1372369] [NEW] Blockdev reports 'No such device or address'
Public bug reported: Tempest failure: http://logs.openstack.org/57/122757/1/check/check- tempest-dsvm-neutron-full/a08fb08/ 2014-09-19 18:48:47.388 | 2014-09-19 18:15:35,926 6578 INFO [tempest.common.rest_client] Request (DeleteServersTestJSON:test_delete_server_while_in_verify_resize_state): 500 DELETE http://127.0.0.1:8774/v2/9959855b406d4563a6174eb27f11450e/servers/0a451fd1-72d8-4849-8b47-2095986f9cd4 60.155s 2014-09-19 18:48:47.389 | }}} Due to: 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5671, in _get_instance_disk_info 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task dk_size = lvm.get_volume_size(path) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/lvm.py", line 157, in get_volume_size 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task run_as_root=True) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/utils.py", line 53, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task return utils.execute(*args, **kwargs) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/utils.py", line 163, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task return processutils.execute(*cmd, **kwargs) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/openstack/common/processutils.py", line 203, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task cmd=sanitized_cmd) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task ProcessExecutionError: Unexpected error while running command. 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task Command: sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-eef2d948-c15b-4525-b477-4ca2b194b8ae-lun-1 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task Exit code: 1 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task Stdout: u'' 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task Stderr: u'blockdev: cannot open /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-eef2d948-c15b-4525-b477-4ca2b194b8ae-lun-1: No such device or address\n' 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). 
https://bugs.launchpad.net/bugs/1372369 Title: Blockdev reports 'No such device or address' Status in OpenStack Compute (Nova): New Bug description: Tempest failure: http://logs.openstack.org/57/122757/1/check/check- tempest-dsvm-neutron-full/a08fb08/ 2014-09-19 18:48:47.388 | 2014-09-19 18:15:35,926 6578 INFO [tempest.common.rest_client] Request (DeleteServersTestJSON:test_delete_server_while_in_verify_resize_state): 500 DELETE http://127.0.0.1:8774/v2/9959855b406d4563a6174eb27f11450e/servers/0a451fd1-72d8-4849-8b47-2095986f9cd4 60.155s 2014-09-19 18:48:47.389 | }}} Due to: 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5671, in _get_instance_disk_info 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task dk_size = lvm.get_volume_size(path) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/lvm.py", line 157, in get_volume_size 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task run_as_root=True) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/virt/libvirt/utils.py", line 53, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task return utils.execute(*args, **kwargs) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/utils.py", line 163, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task return processutils.execute(*cmd, **kwargs) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task File "/opt/stack/new/nova/nova/openstack/common/processutils.py", line 203, in execute 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task cmd=sanitized_cmd) 2014-09-19 18:15:38.432 3076 TRACE nova.openstack.common.periodic_task ProcessExecutionError: Unexpected error while running command. 2014-09-19 18:15:38.432 3076
[Yahoo-eng-team] [Bug 1365031] [NEW] VMware fake session doesn't detect implicitly created directory
Public bug reported: The VMware fake session keeps an internal list of created files and directories. Directories can be created implicitly, e.g. the parent directories created by MakeDirectory(createParentDirectories=True), but the fake session will not recognise these. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1365031 Title: VMware fake session doesn't detect implicitly created directory Status in OpenStack Compute (Nova): New Bug description: The VMware fake session keeps an internal list of created files and directories. Directories can be created implicitly, e.g. the parent directories created by MakeDirectory(createParentDirectories=True), but the fake session will not recognise these. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1365031/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
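A minimal sketch of what detecting implicit creation could look like in a fake datastore (illustrative only, not the structure of Nova's fake.py): register every ancestor of a newly added path as an existing directory.

    import posixpath

    class FakeDatastore(object):
        """Track directories created implicitly when a path is added."""

        def __init__(self):
            self.files = set()
            self.directories = set()

        def add_file(self, path):
            self.files.add(path)
            # Register every ancestor directory as existing, mirroring what
            # MakeDirectory(createParentDirectories=True) does for real.
            parent = posixpath.dirname(path)
            while parent and parent not in self.directories:
                self.directories.add(parent)
                parent = posixpath.dirname(parent)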
[Yahoo-eng-team] [Bug 1364849] [NEW] VMware driver doesn't return typed console
Public bug reported: Change I8f6a857b88659ee30b4aa1a25ac52d7e01156a68 added typed consoles, and updated drivers to use them. However, when it touched the VMware driver, it modified get_vnc_console in VMwareVMOps, but not in VMwareVCVMOps, which is the one which is actually used. Incidentally, VMwareVMOps has now been removed, so this type of confusion should not happen again. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1364849 Title: VMware driver doesn't return typed console Status in OpenStack Compute (Nova): New Bug description: Change I8f6a857b88659ee30b4aa1a25ac52d7e01156a68 added typed consoles, and updated drivers to use them. However, when it touched the VMware driver, it modified get_vnc_console in VMwareVMOps, but not in VMwareVCVMOps, which is the one which is actually used. Incidentally, VMwareVMOps has now been removed, so this type of confusion should not happen again. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1364849/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1357428] [NEW] DBDeadlock in gate test
Public bug reported: gate test failed with: DBDeadlock: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE image_properties SET updated_at=%s, deleted_at=%s, deleted=%s WHERE image_properties.image_id = %s AND image_properties.deleted = false' (datetime.datetime(2014, 8, 15, 13, 42, 36, 164537), datetime.datetime(2014, 8, 15, 13, 42, 36, 144848), 1, '62832243-7165-4493-bacc-7801640cc718') Above from: http://logs.openstack.org/28/114528/1/check/check-tempest-dsvm- full/176f0f2/logs/screen-g-reg.txt.gz Full logs: http://logs.openstack.org/28/114528/1/check/check-tempest-dsvm- full/176f0f2/ ** Affects: glance Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to Glance. https://bugs.launchpad.net/bugs/1357428 Title: DBDeadlock in gate test Status in OpenStack Image Registry and Delivery Service (Glance): New Bug description: gate test failed with: DBDeadlock: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE image_properties SET updated_at=%s, deleted_at=%s, deleted=%s WHERE image_properties.image_id = %s AND image_properties.deleted = false' (datetime.datetime(2014, 8, 15, 13, 42, 36, 164537), datetime.datetime(2014, 8, 15, 13, 42, 36, 144848), 1, '62832243-7165-4493-bacc-7801640cc718') Above from: http://logs.openstack.org/28/114528/1/check/check-tempest-dsvm- full/176f0f2/logs/screen-g-reg.txt.gz Full logs: http://logs.openstack.org/28/114528/1/check/check-tempest-dsvm- full/176f0f2/ To manage notifications about this bug go to: https://bugs.launchpad.net/glance/+bug/1357428/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1357263] [NEW] Unhelpful error message when attempting to boot a guest with an invalid guestId
Public bug reported: When booting a VMware instance from an image, guestId is taken from the vmware_ostype property in glance. If this value is invalid, spawn() will fail with the error message: VMwareDriverException: A specified parameter was not correct. As there are many parameters to CreateVM_Task, this error message does not help us narrow down the offending one. Unfortunately this error message is all that vSphere provides us, so we can't do better by relying on vSphere alone. As this is a user-editable parameter, we should try harder to provide an indication of what the error might be. We can do this by validating the field ourselves. As there is no way I'm aware of to extract a canonical list of valid guestIds from a running vSphere host, I think we're left embedding our own list and validating against it. This is not ideal, because: 1. We will need to update our list for every ESX release 2. A simple list will not take account of the ESX version we're running against (i.e. we may have a list for 5.5, but be running against 5.1, which doesn't support everything on our list) Consequently, to maintain a loose coupling we should validate the field, but only warn for values we don't recognise. vSphere will continue to return its non-specific error message, but there will be an additional indication of what the root cause might be in the logs. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1357263 Title: Unhelpful error message when attempting to boot a guest with an invalid guestId Status in OpenStack Compute (Nova): New Bug description: When booting a VMware instance from an image, guestId is taken from the vmware_ostype property in glance. If this value is invalid, spawn() will fail with the error message: VMwareDriverException: A specified parameter was not correct. As there are many parameters to CreateVM_Task, this error message does not help us narrow down the offending one. Unfortunately this error message is all that vSphere provides us, so we can't do better by relying on vSphere alone. As this is a user-editable parameter, we should try harder to provide an indication of what the error might be. We can do this by validating the field ourselves. As there is no way I'm aware of to extract a canonical list of valid guestIds from a running vSphere host, I think we're left embedding our own list and validating against it. This is not ideal, because: 1. We will need to update our list for every ESX release 2. A simple list will not take account of the ESX version we're running against (i.e. we may have a list for 5.5, but be running against 5.1, which doesn't support everything on our list) Consequently, to maintain a loose coupling we should validate the field, but only warn for values we don't recognise. vSphere will continue to return its non-specific error message, but there will be an additional indication of what the root cause might be in the logs. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1357263/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
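A sketch of the suggested soft validation (the guestId values shown are a tiny illustrative subset, not a canonical list, and the function name is made up):

    import logging

    LOG = logging.getLogger(__name__)

    KNOWN_GUEST_IDS = {'otherGuest', 'otherGuest64', 'rhel6_64Guest',
                       'ubuntu64Guest', 'windows8Server64Guest'}

    def check_guest_id(guest_id):
        # Warn rather than fail, so an out-of-date list never blocks a
        # boot that vSphere itself would accept.
        if guest_id not in KNOWN_GUEST_IDS:
            LOG.warning("vmware_ostype '%s' is not a recognised guestId; if "
                        "CreateVM_Task fails with 'A specified parameter was "
                        "not correct', check this value first", guest_id)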
[Yahoo-eng-team] [Bug 1355928] [NEW] Deadlock in reservation commit
Public bug reported: Details in http://logs.openstack.org/46/104146/15/check/check-tempest- dsvm-full/d235389/, specifically in n-cond logs: 2014-08-12 14:58:57.099 ERROR nova.quota [req-7efe48be-f5b4-4343-898a-5b4b32694530 AggregatesAdminTestJSON-719157131 AggregatesAdminTestJSON-1908648657] Failed to commit reservations [u'5bdde344-b26f-4e0a-9aa7-d91d775b6df0', u'5f757426-8f4e-454f-aedb-1186771f85fd', u'819aeaf6-9faf-4da5-a16d-ce1c571c4975'] 2014-08-12 14:58:57.099 21994 TRACE nova.quota Traceback (most recent call last): 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/quota.py", line 1326, in commit 2014-08-12 14:58:57.099 21994 TRACE nova.quota user_id=user_id) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/quota.py", line 569, in commit 2014-08-12 14:58:57.099 21994 TRACE nova.quota user_id=user_id) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/db/api.py", line 1148, in reservation_commit 2014-08-12 14:58:57.099 21994 TRACE nova.quota user_id=user_id) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 167, in wrapper 2014-08-12 14:58:57.099 21994 TRACE nova.quota return f(*args, **kwargs) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 205, in wrapped 2014-08-12 14:58:57.099 21994 TRACE nova.quota return f(*args, **kwargs) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 3302, in reservation_commit 2014-08-12 14:58:57.099 21994 TRACE nova.quota for reservation in reservation_query.all(): 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2241, in all 2014-08-12 14:58:57.099 21994 TRACE nova.quota return list(self) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2353, in __iter__ 2014-08-12 14:58:57.099 21994 TRACE nova.quota return self._execute_and_instances(context) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2368, in _execute_and_instances 2014-08-12 14:58:57.099 21994 TRACE nova.quota result = conn.execute(querycontext.statement, self._params) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 662, in execute 2014-08-12 14:58:57.099 21994 TRACE nova.quota params) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 761, in _execute_clauseelement 2014-08-12 14:58:57.099 21994 TRACE nova.quota compiled_sql, distilled_params 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 874, in _execute_context 2014-08-12 14:58:57.099 21994 TRACE nova.quota context) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1024, in _handle_dbapi_exception 2014-08-12 14:58:57.099 21994 TRACE nova.quota exc_info 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 196, in raise_from_cause 2014-08-12 14:58:57.099 21994 TRACE nova.quota reraise(type(exception), exception, tb=exc_tb) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 867, in 
_execute_context 2014-08-12 14:58:57.099 21994 TRACE nova.quota context) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 324, in do_execute 2014-08-12 14:58:57.099 21994 TRACE nova.quota cursor.execute(statement, parameters) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute 2014-08-12 14:58:57.099 21994 TRACE nova.quota self.errorhandler(self, exc, value) 2014-08-12 14:58:57.099 21994 TRACE nova.quota File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler 2014-08-12 14:58:57.099 21994 TRACE nova.quota raise errorclass, errorvalue 2014-08-12 14:58:57.099 21994 TRACE nova.quota OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'SELECT reservations.created_at AS reservations_created_at, reservations.updated_at AS reservations_updated_at, reservations.deleted_at AS reservations_deleted_at, reservations.deleted AS reservations_deleted, reservations.id AS reservations_id, reservations.uuid AS reservations_uuid, reservations.usage_id AS reservations_usage_id, reservations.project_id AS reservations_project_id, reservations.user
[Yahoo-eng-team] [Bug 1354403] [NEW] Numerous config options ignored due to CONF used in import context
Public bug reported: In general[1] it is incorrect to use the value of a config variable at import time, because although the config variable may have been registered, its value will not have been loaded. The result will always be the default value, regardless of the contents of the relevant config file. I did a quick scan of Nova, and found the following instances of config variables being used in import context: nova/api/openstack/common.py:limited() nova/api/openstack/common.py:get_limit_and_marker() nova/compute/manager.py:_heal_instance_info_cache() nova/compute/manager.py:_poll_shelved_instances() nova/compute/manager.py:_poll_bandwidth_usage() nova/compute/manager.py:_poll_volume_usage() nova/compute/manager.py:_sync_power_states() nova/compute/manager.py:_cleanup_running_deleted_instances() nova/compute/manager.py:_run_image_cache_manager_pass() nova/compute/manager.py:_run_pending_deletes() nova/network/manager.py:_periodic_update_dns() nova/scheduler/manager.py:_run_periodic_tasks() Consequently, it appears that the given values of the following config variables are being ignored: osapi_max_limit heal_instance_info_cache_interval shelved_poll_interval bandwidth_poll_interval volume_usage_poll_interval sync_power_state_interval running_deleted_instance_poll_interval image_cache_manager_interval instance_delete_interval dns_update_periodic_interval scheduler_driver_task_period [1] This doesn't apply to drivers, which are loaded dynamically after the config has been loaded. However, relying on that seems even nastier. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1354403 Title: Numerous config options ignored due to CONF used in import context Status in OpenStack Compute (Nova): New Bug description: In general[1] it is incorrect to use the value of a config variable at import time, because although the config variable may have been registered, its value will not have been loaded. The result will always be the default value, regardless of the contents of the relevant config file. I did a quick scan of Nova, and found the following instances of config variables being used in import context: nova/api/openstack/common.py:limited() nova/api/openstack/common.py:get_limit_and_marker() nova/compute/manager.py:_heal_instance_info_cache() nova/compute/manager.py:_poll_shelved_instances() nova/compute/manager.py:_poll_bandwidth_usage() nova/compute/manager.py:_poll_volume_usage() nova/compute/manager.py:_sync_power_states() nova/compute/manager.py:_cleanup_running_deleted_instances() nova/compute/manager.py:_run_image_cache_manager_pass() nova/compute/manager.py:_run_pending_deletes() nova/network/manager.py:_periodic_update_dns() nova/scheduler/manager.py:_run_periodic_tasks() Consequently, it appears that the given values of the following config variables are being ignored: osapi_max_limit heal_instance_info_cache_interval shelved_poll_interval bandwidth_poll_interval volume_usage_poll_interval sync_power_state_interval running_deleted_instance_poll_interval image_cache_manager_interval instance_delete_interval dns_update_periodic_interval scheduler_driver_task_period [1] This doesn't apply to drivers, which are loaded dynamically after the config has been loaded. However, relying on that seems even nastier. 
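A stripped-down illustration of the pattern (plain oslo.config, without the periodic task decorator plumbing Nova actually uses; the option name is made up):

    from oslo_config import cfg

    CONF = cfg.CONF
    CONF.register_opt(cfg.IntOpt('poll_interval', default=60))

    # Broken: evaluated at import time, before CONF() has parsed any config
    # files, so this is always 60 regardless of what nova.conf says.
    POLL_INTERVAL = CONF.poll_interval

    def run_periodic_tasks():
        # Working: read the option at call time, after configuration has
        # been loaded with CONF(sys.argv[1:], project='nova').
        interval = CONF.poll_interval
        return interval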
To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1354403/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1342055] [NEW] Suspending and restoring a rescued instance restores it to ACTIVE rather than RESCUED
Public bug reported: If you suspend a rescued instance, resume returns it to the ACTIVE state rather than the RESCUED state. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1342055 Title: Suspending and restoring a rescued instance restores it to ACTIVE rather than RESCUED Status in OpenStack Compute (Nova): New Bug description: If you suspend a rescued instance, resume returns it to the ACTIVE state rather than the RESCUED state. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1342055/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1337798] [NEW] VMware: snapshot operation copies a live image
Public bug reported: N.B. This is based purely on code inspection. A reasonable resolution would be to point out that I've misunderstood something and it's actually fine. I'm filing this bug because it's potentially a subtle data corruptor, and I'd like more eyes on it. The snapshot code in vmwareapi/vmops.py does: 1. snapshot 2. copy disk image to vmware 3. delete snapshot 4. copy disk image from vmware to glance I think the problem is in step 2. I don't see how it's copying the snapshot it just created rather than the live disk image. i.e. I don't think step 2 is copying the snapshot it created in step 1. It's possible that there's some subtlety to do with path names here, but in that case it could still do with a comment. If it is in fact copying the live image, it would normally work. However, this would potentially be a subtle data corruptor. For example, consider that a file's data was towards the beginning of a disk, but its metadata was towards the end of the disk. If the VM guest creates the file during the copy operation, it copies the metadata at the end of the disk, but misses the contents at the beginning of the disk. ** Affects: nova Importance: Undecided Status: New ** Tags: vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1337798 Title: VMware: snapshot operation copies a live image Status in OpenStack Compute (Nova): New Bug description: N.B. This is based purely on code inspection. A reasonable resolution would be to point out that I've misunderstood something and it's actually fine. I'm filing this bug because it's potentially a subtle data corruptor, and I'd like more eyes on it. The snapshot code in vmwareapi/vmops.py does: 1. snapshot 2. copy disk image to vmware 3. delete snapshot 4. copy disk image from vmware to glance I think the problem is in step 2. I don't see how it's copying the snapshot it just created rather than the live disk image. i.e. I don't think step 2 is copying the snapshot it created in step 1. It's possible that there's some subtlety to do with path names here, but in that case it could still do with a comment. If it is in fact copying the live image, it would normally work. However, this would potentially be a subtle data corruptor. For example, consider that a file's data was towards the beginning of a disk, but its metadata was towards the end of the disk. If the VM guest creates the file during the copy operation, it copies the metadata at the end of the disk, but misses the contents at the beginning of the disk. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1337798/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1333587] [NEW] VMware: ExtendVirtualDisk_Task fails due to locked file
Public bug reported: Extending a disk during spawn races, which can result in failure. It is possible to hit this bug by launching a large number of instances of an image which isn't already cached, simultaneously. Some of them will race to extend the cached image, ultimately resulting in an error such as: 2014-06-17 10:49:26.006 9177 WARNING nova.virt.vmwareapi.driver [-] Task [ExtendVirtualDisk_Task] value = "task-12073" _type = "Task" } status: error Unable to access file [datastore1] 172.16.0.13_base/326153d2-1226-415a-a194-2ca47ac3c48b/326153d2-1226-415a-a194-2ca47ac3c48b.1.vmdk since it is locked ** Affects: nova Importance: Undecided Status: New ** Tags: vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1333587 Title: VMware: ExtendVirtualDisk_Task fails due to locked file Status in OpenStack Compute (Nova): New Bug description: Extending a disk during spawn races, which can result in failure. It is possible to hit this bug by launching a large number of instances of an image which isn't already cached, simultaneously. Some of them will race to extend the cached image, ultimately resulting in an error such as: 2014-06-17 10:49:26.006 9177 WARNING nova.virt.vmwareapi.driver [-] Task [ExtendVirtualDisk_Task] value = "task-12073" _type = "Task" } status: error Unable to access file [datastore1] 172.16.0.13_base/326153d2-1226-415a-a194-2ca47ac3c48b/326153d2-1226-415a-a194-2ca47ac3c48b.1.vmdk since it is locked To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1333587/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
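One way to serialise the racing extends on a single compute host is an advisory lock per cached image. A rough sketch using stdlib fcntl (it does not help when several hosts share the datastore, and it is not the actual Nova fix):

    import fcntl

    def extend_cached_image(image_path, new_size_gb, do_extend):
        # Hold an exclusive lock for the duration of the extend so only
        # one of the racing spawns issues ExtendVirtualDisk_Task at a time.
        with open(image_path + '.lock', 'w') as lock_file:
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                do_extend(image_path, new_size_gb)
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)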
[Yahoo-eng-team] [Bug 1333232] [NEW] Gate failure: autodoc: failed to import module X
Public bug reported: Spurious gate failure: http://logs.openstack.org/65/99065/4/check/gate- nova-docs/af27af8/console.html Logs are full of: 2014-06-23 09:55:32.057 | /home/jenkins/workspace/gate-nova-docs/doc/source/devref/api.rst:39: WARNING: autodoc: failed to import module u'nova.api.cloud'; the following exception was raised: 2014-06-23 09:55:32.057 | Traceback (most recent call last): 2014-06-23 09:55:32.057 | File "/home/jenkins/workspace/gate-nova-docs/.tox/venv/local/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 335, in import_object 2014-06-23 09:55:32.057 | __import__(self.modname) 2014-06-23 09:55:32.057 | ImportError: No module named cloud 2014-06-23 09:55:32.057 | /home/jenkins/workspace/gate-nova-docs/doc/source/devref/api.rst:66: WARNING: autodoc: failed to import module u'nova.api.openstack.backup_schedules'; the following exception was raised: 2014-06-23 09:55:32.057 | Traceback (most recent call last): 2014-06-23 09:55:32.057 | File "/home/jenkins/workspace/gate-nova-docs/.tox/venv/local/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 335, in import_object 2014-06-23 09:55:32.057 | __import__(self.modname) 2014-06-23 09:55:32.058 | ImportError: No module named backup_schedules ** Affects: nova Importance: Undecided Status: New ** Tags: ci -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1333232 Title: Gate failure: autodoc: failed to import module X Status in OpenStack Compute (Nova): New Bug description: Spurious gate failure: http://logs.openstack.org/65/99065/4/check /gate-nova-docs/af27af8/console.html Logs are full of: 2014-06-23 09:55:32.057 | /home/jenkins/workspace/gate-nova-docs/doc/source/devref/api.rst:39: WARNING: autodoc: failed to import module u'nova.api.cloud'; the following exception was raised: 2014-06-23 09:55:32.057 | Traceback (most recent call last): 2014-06-23 09:55:32.057 | File "/home/jenkins/workspace/gate-nova-docs/.tox/venv/local/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 335, in import_object 2014-06-23 09:55:32.057 | __import__(self.modname) 2014-06-23 09:55:32.057 | ImportError: No module named cloud 2014-06-23 09:55:32.057 | /home/jenkins/workspace/gate-nova-docs/doc/source/devref/api.rst:66: WARNING: autodoc: failed to import module u'nova.api.openstack.backup_schedules'; the following exception was raised: 2014-06-23 09:55:32.057 | Traceback (most recent call last): 2014-06-23 09:55:32.057 | File "/home/jenkins/workspace/gate-nova-docs/.tox/venv/local/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 335, in import_object 2014-06-23 09:55:32.057 | __import__(self.modname) 2014-06-23 09:55:32.058 | ImportError: No module named backup_schedules To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1333232/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1328539] [NEW] Fixed IP allocation doesn't clean up properly on failure
Public bug reported: If fixed IP allocation fails, for example because nova's network interfaces got renamed after a reboot, nova will loop continuously trying, and failing, to create a new instance. For every attempted spawn the instance will end up with an additional fixed IP allocated to it. This is because the code is associating the IP, but not disassociating it if the function fails. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1328539 Title: Fixed IP allocation doesn't clean up properly on failure Status in OpenStack Compute (Nova): New Bug description: If fixed IP allocation fails, for example because nova's network interfaces got renamed after a reboot, nova will loop continuously trying, and failing, to create a new instance. For every attempted spawn the instance will end up with an additional fixed IP allocated to it. This is because the code is associating the IP, but not disassociating it if the function fails. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1328539/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
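[Editorial note] The report says the IP is associated but never disassociated when the rest of allocation fails. The sketch below is a minimal illustration of the intended rollback pattern only; the helpers are hypothetical stand-ins for the real network-manager and DB API calls.

# Minimal rollback sketch, assuming hypothetical helpers in place of the
# real nova network-manager / DB API calls.
def allocate_fixed_ip(context, instance_uuid, network):
    address = fixed_ip_associate(context, instance_uuid, network)
    try:
        _setup_network_on_host(context, instance_uuid, network, address)
    except Exception:
        # Undo the association so a retried spawn does not leak another IP.
        fixed_ip_disassociate(context, address)
        raise
    return address


def fixed_ip_associate(context, instance_uuid, network):
    # Placeholder for the DB call that reserves an address for the instance.
    raise NotImplementedError()


def fixed_ip_disassociate(context, address):
    # Placeholder for the DB call that releases the address.
    raise NotImplementedError()


def _setup_network_on_host(context, instance_uuid, network, address):
    # Placeholder for the host-side setup that can fail (e.g. after network
    # interfaces are renamed on reboot), triggering the rollback above.
    raise NotImplementedError()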
[Yahoo-eng-team] [Bug 1324036] [NEW] Can't add authenticated iscsi volume to a vmware instance
Public bug reported: The VMware driver doesn't pass volume authentication information to the hba when attaching an iscsi volume. Consequently, adding an iscsi volume which requires authentication will always fail. ** Affects: nova Importance: Undecided Status: New ** Tags: vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1324036 Title: Can't add authenticated iscsi volume to a vmware instance Status in OpenStack Compute (Nova): New Bug description: The VMware driver doesn't pass volume authentication information to the hba when attaching an iscsi volume. Consequently, adding an iscsi volume which requires authentication will always fail. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1324036/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
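[Editorial note] For illustration only, the sketch below forwards the CHAP credentials that Cinder places in connection_info['data'] (auth_method, auth_username, auth_password). The _configure_hba_chap helper is hypothetical and stands in for whatever vSphere call actually applies the credentials to the host's iSCSI HBA; the point is simply that the driver must forward these fields rather than drop them.

# Illustrative sketch, not the actual vmwareapi volume code.
def get_chap_credentials(connection_info):
    data = connection_info['data']
    if data.get('auth_method', '').upper() != 'CHAP':
        return None
    return {
        'chap_username': data['auth_username'],
        'chap_secret': data['auth_password'],
    }


def attach_iscsi_volume(session, connection_info, host_hba):
    creds = get_chap_credentials(connection_info)
    if creds:
        _configure_hba_chap(session, host_hba, **creds)
    # ... continue with target discovery and attaching the disk ...


def _configure_hba_chap(session, host_hba, chap_username, chap_secret):
    # Placeholder for the vSphere API call that updates the HBA's
    # authentication properties.
    raise NotImplementedError()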
[Yahoo-eng-team] [Bug 1297375] [NEW] All nova apis relying on Instance.save(expected_*_state) for safety contain a race condition
Public bug reported:

Take, for example, resize_instance(). In manager.py, we assert that the instance is in RESIZE_PREP state with:

    instance.save(expected_task_state=task_states.RESIZE_PREP)

This should mean that the first resize will succeed, and any subsequent one will fail. However, the underlying db implementation does not lock the instance during the update, and therefore doesn't guarantee this.

Specifically, _instance_update() in db/sqlalchemy/api.py starts a session, and reads task_state from the instance. However, it does not use a 'select ... for update', meaning the row is not locked. 2 concurrent calls to this method can both read the same state, then race to the update. The last writer will win. Without 'select ... for update', the db transaction is only ensuring that all writes are atomic, not reads with dependent writes.

SQLAlchemy seems to support select ... for update, as do MySQL and PostgreSQL, although MySQL will fall back to whole table locks for non-InnoDB tables, which would likely be a significant performance hit.

** Affects: nova
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1297375

Title:
  All nova apis relying on Instance.save(expected_*_state) for safety contain a race condition

Status in OpenStack Compute (Nova):
  New

Bug description:
  Take, for example, resize_instance(). In manager.py, we assert that the instance is in RESIZE_PREP state with:

      instance.save(expected_task_state=task_states.RESIZE_PREP)

  This should mean that the first resize will succeed, and any subsequent one will fail. However, the underlying db implementation does not lock the instance during the update, and therefore doesn't guarantee this.

  Specifically, _instance_update() in db/sqlalchemy/api.py starts a session, and reads task_state from the instance. However, it does not use a 'select ... for update', meaning the row is not locked. 2 concurrent calls to this method can both read the same state, then race to the update. The last writer will win. Without 'select ... for update', the db transaction is only ensuring that all writes are atomic, not reads with dependent writes.

  SQLAlchemy seems to support select ... for update, as do MySQL and PostgreSQL, although MySQL will fall back to whole table locks for non-InnoDB tables, which would likely be a significant performance hit.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1297375/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
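[Editorial note] Two possible shapes of an atomic expected-state check are sketched below: a SELECT ... FOR UPDATE read, and a single compare-and-swap UPDATE whose rowcount reveals a lost race. This is an illustrative SQLAlchemy sketch, not nova's actual DB layer; the Instance model and the session wiring are assumed.

# Illustrative SQLAlchemy sketch. `session` is assumed to be a configured ORM
# session and `Instance` a mapped model with `uuid` and `task_state` columns.
class UnexpectedTaskStateError(Exception):
    pass


def save_locked(session, Instance, uuid, expected_task_state, updates):
    # Option 1: SELECT ... FOR UPDATE keeps the row locked between reading
    # task_state and writing the update, so a concurrent caller blocks.
    with session.begin():
        inst = (session.query(Instance)
                .filter_by(uuid=uuid)
                .with_for_update()
                .one())
        if inst.task_state != expected_task_state:
            raise UnexpectedTaskStateError(inst.task_state)
        for key, value in updates.items():
            setattr(inst, key, value)


def save_compare_and_swap(session, Instance, uuid, expected_task_state, updates):
    # Option 2: a single UPDATE ... WHERE task_state = <expected>; losing the
    # race shows up as rowcount == 0 rather than as a silently lost write.
    with session.begin():
        rows = (session.query(Instance)
                .filter_by(uuid=uuid, task_state=expected_task_state)
                .update(updates, synchronize_session=False))
    if rows == 0:
        raise UnexpectedTaskStateError(expected_task_state)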
[Yahoo-eng-team] [Bug 1290455] [NEW] libvirt inject_data assumes instance with kernel_id doesn't contain a partition table
Public bug reported: libvirt/driver.py passes partition=None to disk.inject_data() for any instance with kernel_id set. partition=None means that inject_data will attempt to mount the whole image, i.e. assuming there is no partition table. While this may be true for EC2, it is not safe to assume that Xen images don't contain partition tables. This should check something more directly related to the disk image. In fact, ideally it would leave it up to libguestfs to work it out, as libguestfs is very good at this. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1290455 Title: libvirt inject_data assumes instance with kernel_id doesn't contain a partition table Status in OpenStack Compute (Nova): New Bug description: libvirt/driver.py passes partition=None to disk.inject_data() for any instance with kernel_id set. partition=None means that inject_data will attempt to mount the whole image, i.e. assuming there is no partition table. While this may be true for EC2, it is not safe to assume that Xen images don't contain partition tables. This should check something more directly related to the disk image. In fact, ideally it would leave it up to libguestfs to work it out, as libguestfs is very good at this. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1290455/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
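[Editorial note] As an illustration of "leaving it up to libguestfs", the sketch below uses the python-guestfs bindings to discover the root filesystem instead of assuming partition=None. It is not the nova disk API, just the inspection pattern the report suggests.

# Illustrative sketch; requires the python-guestfs bindings.
import guestfs


def find_root_device(image_path):
    g = guestfs.GuestFS(python_return_dict=True)
    g.add_drive_opts(image_path, readonly=1)
    g.launch()
    try:
        # inspect_os() finds the root filesystem whether or not the image has
        # a partition table, so the caller does not have to guess.
        roots = g.inspect_os()
        if roots:
            return roots[0]
        # Fall back to the first partition, or the whole device for a
        # partitionless image.
        partitions = g.list_partitions()
        return partitions[0] if partitions else '/dev/sda'
    finally:
        g.shutdown()
        g.close()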
[Yahoo-eng-team] [Bug 1275773] [NEW] VMware session not logged out on VMwareAPISession garbage collection
Public bug reported: A bug in VMwareAPISession.__del__() prevents the session being logged out when the session object is garbage collected. ** Affects: nova Importance: Medium Status: New ** Tags: havana-backport-potential vmware -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1275773 Title: VMware session not logged out on VMwareAPISession garbage collection Status in OpenStack Compute (Nova): New Bug description: A bug in VMwareAPISession.__del__() prevents the session being logged out when the session object is garbage collected. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1275773/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
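[Editorial note] A generic sketch of the safe-cleanup shape (explicit logout plus a __del__ that never raises) follows; the class and the _logout() helper are hypothetical and do not mirror the actual VMwareAPISession internals.

# Generic illustration only, not the driver code.
class ApiSession(object):
    def __init__(self):
        self._logged_in = True

    def logout(self):
        if self._logged_in:
            self._logout()
            self._logged_in = False

    def __del__(self):
        # __del__ may run during interpreter teardown, when module globals can
        # already be gone, so never let an exception escape from here.
        try:
            self.logout()
        except Exception:
            pass

    def _logout(self):
        # Placeholder for the real Logout API call.
        pass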
[Yahoo-eng-team] [Bug 1271966] [NEW] Not possible to spawn vmware instance with multiple disks
Public bug reported:

The behaviour of spawn() in the vmwareapi driver with regard to images and block device mappings is currently as follows:

* If there are any block device mappings, images are ignored
* If there are any block device mappings, the last becomes the root device and all others are ignored

This means that, for example, the following scenarios are not possible:

1. Spawn an instance with a root device from an image, and a secondary volume
2. Spawn an instance with a volume as a root device, and a secondary volume

The behaviour of the libvirt driver is as follows:

* If there is an image, it will be the root device unless there is also a block device mapping for the root device
* All block device mappings are used
* If there are multiple block device mappings for the same device, the last one is used

The vmwareapi driver's behaviour is surprising, and should be modified to follow the libvirt driver.

** Affects: nova
   Importance: Undecided
   Status: New

** Tags: vmware

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1271966

Title:
  Not possible to spawn vmware instance with multiple disks

Status in OpenStack Compute (Nova):
  New

Bug description:
  The behaviour of spawn() in the vmwareapi driver with regard to images and block device mappings is currently as follows:

  * If there are any block device mappings, images are ignored
  * If there are any block device mappings, the last becomes the root device and all others are ignored

  This means that, for example, the following scenarios are not possible:

  1. Spawn an instance with a root device from an image, and a secondary volume
  2. Spawn an instance with a volume as a root device, and a secondary volume

  The behaviour of the libvirt driver is as follows:

  * If there is an image, it will be the root device unless there is also a block device mapping for the root device
  * All block device mappings are used
  * If there are multiple block device mappings for the same device, the last one is used

  The vmwareapi driver's behaviour is surprising, and should be modified to follow the libvirt driver.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1271966/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
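[Editorial note] To make the requested libvirt-style behaviour concrete, here is a hedged sketch using simplified dict-based block device mappings; the field names ('device_name', 'image') are illustrative rather than the real nova BDM schema.

# Sketch of the libvirt-style selection described in the report above.
def resolve_disks(image_ref, block_device_mappings, root_device_name='/dev/sda'):
    # If several mappings target the same device, the last one wins.
    by_device = {}
    for bdm in block_device_mappings:
        by_device[bdm['device_name']] = bdm

    disks = []
    if image_ref and root_device_name not in by_device:
        # The root device comes from the image unless a mapping overrides it.
        disks.append({'device_name': root_device_name, 'image': image_ref})

    # Every surviving block device mapping is attached, not just the last one.
    disks.extend(by_device.values())
    return disks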