On Mon, Dec 08, 2014 at 09:12:24PM +0000, Jeremy Stanley wrote:
> On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
> > As Dan Berrangé noted, it's nearly impossible to reproduce this
> > issue independently outside of the OpenStack gating environment. I
> > brought this up at the recently concluded KVM Forum earlier this
> > October. To debug this any further, one of the QEMU block layer
> > developers asked if we can get the QEMU instance running on the
> > gate run under `gdb` (IIRC, danpb suggested this too, previously)
> > to get further tracing details.
>
> We document thoroughly how to reproduce the environments we use for
> testing OpenStack.
Yep, documentation is appreciated.

> There's nothing rarified about "a Gate run" that anyone with access
> to a public cloud provider would be unable to reproduce, save being
> able to run it over and over enough times to expose less frequent
> failures.

Sure. To be fair, this was actually tried. At the risk of
over-discussing the topic, allow me to provide a bit more context,
quoting Dan's email from an old thread ("Thoughts on the patch test
failure rate and moving forward", Jul 23, 2014) here for convenience:

  "In some of the harder gate bugs I've looked at (especially the
  infamous 'live snapshot' timeout bug), it has been damn hard to
  actually figure out what's wrong. AFAIK, no one has ever been able
  to reproduce it outside of the gate infrastructure. I've even gone
  as far as setting up identical Ubuntu VMs to the ones used in the
  gate on a local cloud, and running the tempest tests multiple
  times, but still can't reproduce what happens on the gate machines
  themselves :-( As such we're relying on code inspection and the
  collected log messages to try and figure out what might be wrong.

  The gate collects a lot of info and publishes it, but in this case
  I have found the published logs to be insufficient -- I needed to
  get the more verbose libvirtd.log file. devstack has the ability
  to turn this on via an environment variable, but it is disabled by
  default because it would add 3% to the total size of logs
  collected per gate job. There's no way for me to get that
  environment variable for devstack turned on for a specific review
  I want to test with. In the end I uploaded a change to nova which
  abused rootwrap to elevate privileges, install extra deb packages,
  reconfigure libvirtd logging and restart the libvirtd daemon.

  https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
  https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py

  My next attack is to build a custom QEMU binary and hack nova
  further so that it can download my custom QEMU binary from a
  website onto the gate machine and run the test with it. Failing
  that, I'm going to be hacking things to try to attach to QEMU in
  the gate with GDB and get stack traces. Anything is doable thanks
  to rootwrap giving us a way to elevate privileges from Nova, but
  it is a somewhat tedious approach."

  http://lists.openstack.org/pipermail/openstack-dev/2014-July/041148.html

To add to the above: as the bug report shows, in one of its many
invocations the above issue _was_ reproduced once, albeit with
questionable reproducibility (details in the bug). So it's not that
what you're suggesting was never tried. But from the above, you can
clearly see what kind of convoluted methods one has to resort to.

One concrete point from the above: it'd be very useful to have an
environment variable that can be toggled to run libvirt/QEMU under
`gdb` for $REVIEW. (Sure, it's a patch that needs to be worked on;
rough sketches of both the libvirt logging knobs Dan mentions and
such a gdb toggle follow below. . .)

[. . .]

> The QA team tries very hard to make our integration testing
> environment as closely as possible mimic real-world deployment
> configurations. If these sorts of bugs emerge more often because
> of, for example, resource constraints in the test environment then
> it should be entirely likely they'd also be seen in production
> with the same frequency if run on similarly constrained equipment.
> And as we've observed in the past, any code path we stop testing
> quickly accumulates new bugs that go unnoticed until they impact
> someone's production environment at 3am.
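As an aside, for anyone wanting the more verbose libvirtd.log Dan
refers to above: whatever the devstack switch is called, what it
ultimately has to do is flip the logging knobs in libvirtd.conf and
restart the daemon. A minimal sketch -- log_filters and log_outputs
are genuine libvirtd.conf settings, but the particular filter list
and paths here are only an illustration, and I'm deliberately not
guessing at the devstack variable name:

    # Sketch: turn up libvirtd logging, then restart the daemon.
    sudo tee -a /etc/libvirt/libvirtd.conf <<'EOF'
    log_filters="1:libvirt 1:qemu 1:util"
    log_outputs="1:file:/var/log/libvirt/libvirtd.log"
    EOF
    # Service name varies by distro: 'libvirt-bin' on the Ubuntu
    # images the gate uses, 'libvirtd' on Fedora/RHEL.
    sudo service libvirt-bin restart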
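And to make that "concrete point" less hand-wavy: below is a rough,
untested sketch of what such a toggle could look like -- a wrapper
that devstack could drop in place of the QEMU binary. Everything
here (the variable name, the paths, the log file) is made up for
illustration; it is not an existing devstack mechanism:

    #!/bin/sh
    # Hypothetical wrapper installed as /usr/bin/qemu-system-x86_64,
    # with the real emulator moved aside beforehand.
    # RUN_QEMU_UNDER_GDB is an imaginary toggle, not an existing
    # devstack variable.
    REAL=/usr/bin/qemu-system-x86_64.real
    if [ "${RUN_QEMU_UNDER_GDB:-0}" = "1" ]; then
        # Batch mode: run QEMU under gdb; on a crash, dump
        # backtraces from all threads into a log that the gate
        # could then collect alongside the other job logs.
        exec gdb -batch -ex run -ex 'thread apply all bt full' \
            --args "$REAL" "$@" >>/var/log/qemu-gdb.log 2>&1
    else
        exec "$REAL" "$@"
    fi

The other obvious shape for this is attaching to an already-running
instance, e.g.:

    gdb -batch -p "$(pgrep -n -f qemu-system)" \
        -ex 'thread apply all bt'

which is roughly the "attach to QEMU in the gate with GDB" route Dan
describes above.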
Returning to your broader point: I realize you're saying this should
not be taken lightly -- I hope the context provided in this email
demonstrates that it isn't.

PS: FWIW, I do enable this codepath in my test environments (sure,
they're not *representative*), but I'm yet to reproduce the bug.

--
/kashyap