Re: [openstack-dev] [nova] bug 1334398 and libvirt live snapshot support

Matt Riedemann Wed, 14 Jan 2015 14:05:23 -0800


On 12/8/2014 3:12 PM, Jeremy Stanley wrote:

On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:

As Dan Berrangé noted, it's nearly impossible to reproduce this issue
independently outside of OpenStack Gating environment. I brought this up
at the recently concluded KVM Forum earlier this October. To debug this
any further, one of the QEMU block layer developers asked if we can get
QEMU instance running on Gate run under `gdb` (IIRC, danpb suggested
this too, previously) to get further tracing details.


We document thoroughly how to reproduce the environments we use for
testing OpenStack. There's nothing rarified about "a Gate run" that
anyone with access to a public cloud provider would be unable to
reproduce, save being able to run it over and over enough times to
expose less frequent failures.

FWIW, I myself couldn't reproduce it independently via libvirt
alone or via QMP (QEMU Machine Protocol) commands.

Dan's workaround ("enable it permanently, except for under the
gate") sounds sensible to me.

[...]

I'm dubious of this as it basically says "we know this breaks
sometimes, so we're going to stop testing that it works at all and
possibly let it get even more broken, but you should be safe to rely
on it anyway."

The QA team tries very hard to make our integration testing
environment as closely as possible mimic real-world deployment
configurations. If these sorts of bugs emerge more often because of,
for example, resource constraints in the test environment then it
should be entirely likely they'd also be seen in production with the
same frequency if run on similarly constrained equipment. And as
we've observed in the past, any code path we stop testing quickly
accumulates new bugs that go unnoticed until they impact someone's
production environment at 3am.

Bringing this back up since Jesse Keating in IRC was asking about this again today. Sounds like we've heard from a few people that are running this in labs without problems, maybe they are patching libvirt/qemu, I don't know, but we have other things that we know have broken parts and that's why they run on the experimental queue, e.g. cells, nova + ceph/rbd. We also know we're a bit busted in the ec2 API right now with the latest boto release (2.35.1), so we have a cap on that.

These issues are being worked, but regarding this particular way that we've disabled the function (with a version cap in the code), someone has to go in and patch that out, which kind of sucks if they could have just used a config option to enable it at their own risk.

That's why I'm proposing something like an [experimental] group. We could put this into the [workarounds] group but this isn't really a workaround for anything so that doesn't really make sense to me.

I'd personally be OK with putting it into the [libvirt] group with a warning in the config option help and code that this isn't currently tested in the gate so we aren't sure it's going to work, which we've done for cells and some of the virt drivers, e.g. libvirt on non-x86_64/QEMU systems.
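
As a rough illustration of the opt-in idea, here is a minimal sketch of gating the feature on a config flag instead of a hard-coded version cap. Everything named here is an assumption for illustration only: the option name `enable_live_snapshot`, the helper `live_snapshot_allowed`, and the placeholder minimum version are hypothetical, and the standard library's `configparser` stands in for nova's real configuration handling.

```python
import configparser

# Illustrative config fragment. The option name is hypothetical, and the
# comment mirrors the kind of warning the option help would carry.
SAMPLE_CONF = """
[libvirt]
# Not currently tested in the gate; enable at your own risk.
enable_live_snapshot = true
"""


def live_snapshot_allowed(conf_text, libvirt_version, min_version=(1, 0, 0)):
    """Return True only if the operator opted in and libvirt is new enough.

    min_version is a placeholder value, not nova's real requirement.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Default to disabled when the section or option is absent, so the
    # untested code path is never taken unless explicitly requested.
    enabled = parser.getboolean(
        "libvirt", "enable_live_snapshot", fallback=False)
    return enabled and libvirt_version >= min_version


if __name__ == "__main__":
    print(live_snapshot_allowed(SAMPLE_CONF, (1, 2, 2)))
    print(live_snapshot_allowed("[libvirt]\n", (1, 2, 2)))
```

With a default of off, deployments that never set the flag behave the same as with the version cap, while anyone willing to take the risk can turn it on without patching the code.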


--

Thanks,

Matt Riedemann


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
