Clark is right; trove does detect and try to use kvm where possible. The performance gain has been well worth the change (IMHO).
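
Roughly speaking, the detection amounts to checking whether /dev/kvm is present and usable, and falling back to plain qemu emulation otherwise. Below is a minimal Python sketch of that idea; the function names and the FORCE_QEMU override are illustrative assumptions, not devstack's or trove's actual code.

#!/usr/bin/env python3
"""Minimal sketch of "use kvm where possible, otherwise qemu".

Illustrative only: the names here are assumptions, not the real
devstack/trove detection code."""

import os


def nested_kvm_usable(device="/dev/kvm"):
    # libvirt can only launch kvm-accelerated guests if /dev/kvm exists
    # and is readable/writable by the user running the hypervisor.
    return os.path.exists(device) and os.access(device, os.R_OK | os.W_OK)


def pick_virt_type():
    # Hypothetical override knob, mirroring the idea of defaulting to
    # qemu even when kvm looks available.
    if os.environ.get("FORCE_QEMU", "").lower() in ("1", "true", "yes"):
        return "qemu"
    return "kvm" if nested_kvm_usable() else "qemu"


if __name__ == "__main__":
    print(pick_virt_type())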
-amrith

On Jan 17, 2017 6:53 PM, "Clark Boylan" <[email protected]> wrote:
> On Tue, Jan 17, 2017, at 03:41 PM, Jay Faulkner wrote:
> > Hi all,
> >
> > Back in late October, Vasyl wrote support for devstack to auto-detect
> > and, when possible, use kvm to power Ironic gate jobs
> > (0036d83b330d98e64d656b156001dd2209ab1903). This has lowered job run
> > time when it works, but has caused failures — how many? It's hard to
> > quantify, as the log messages that show the error don't appear to be
> > indexed by Elasticsearch. It's something seen often enough that the
> > issue has become a permanent staple on our gate whiteboard, and it
> > doesn't appear to be decreasing in quantity.
> >
> > I pushed up a patch, https://review.openstack.org/#/c/421581, which
> > keeps the auto-detection behavior but defaults devstack to use qemu
> > emulation instead of kvm.
> >
> > I have two questions:
> > 1) Is there any way I'm not aware of to quantify the number of
> > failures this is causing? The key log message, "KVM: entry failed,
> > hardware error 0x0", shows up in logs/libvirt/qemu/node-*.txt.gz.
> > 2) Are these failures avoidable or fixable in any way?
> >
> > IMO, if we can't fix these failures, we have to make a change and
> > avoid using nested KVM altogether. Lower reliability for our jobs is
> > not worth a small decrease in job run time.
>
> Part of the problem with nested KVM failures is that in many cases they
> destroy the test nodes in unrecoverable ways. In that case you don't
> get any logs, and zuul will restart the job for you. I think that
> graphite will capture this as a job that resulted in a Null/None status
> though (rather than SUCCESS/FAILURE).
>
> As for collecting info when you do get logs, we don't index the libvirt
> instance logs currently, and I am not sure we want to. We already
> struggle to keep up with the existing set of logs when we are busy.
> Instead, we might have job cleanup do a quick grep for known nested-virt
> problem indicators and then log that to the console log, which will be
> indexed.
>
> I think trove has also seen kernel-panic-type errors in syslog that we
> hypothesized were a result of using nested virt.
>
> The infra team explicitly attempts to force qemu instead of kvm on jobs
> using devstack-gate for these reasons. We know it doesn't work reliably
> and not all clouds support it. Unfortunately, my understanding of the
> situation is that the base hypervisor CPU and kernel, the second-level
> hypervisor kernel, and the nested guest kernel all come into play here,
> and there can be nasty interactions between them causing a variety of
> problems.
>
> Put another way:
>
> 2017-01-14T00:42:00 <mnaser> if we're talking nested kvm
> 2017-01-14T00:42:04 <mnaser> it's kindof a nightmare
>
> from
> http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-01-14.log
>
> Clark
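
To make the cleanup-time grep Clark describes concrete, here is a minimal sketch: it sweeps the gzipped libvirt instance logs for known nested-virt failure signatures and prints any hits to stdout so they end up in the indexed console log. Only the "KVM: entry failed" message and the log path come from this thread; the script itself, its name, and the INDICATORS list are assumptions for illustration, not an existing job step.

#!/usr/bin/env python3
"""Sketch of a job-cleanup sweep for nested-virt failure indicators.

Greps the gzipped libvirt instance logs for known failure signatures and
echoes any hits to stdout so they land in the console log, which is
indexed. Illustrative only; this is not an existing gate job step."""

import glob
import gzip

# Only the first signature comes from this thread; others would be added
# as they are identified (e.g. from the gate whiteboard).
INDICATORS = [
    "KVM: entry failed, hardware error 0x0",
]

# Log location mentioned in the thread.
LOG_GLOB = "logs/libvirt/qemu/node-*.txt.gz"


def main():
    hits = 0
    for path in sorted(glob.glob(LOG_GLOB)):
        with gzip.open(path, "rt", errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if any(ind in line for ind in INDICATORS):
                    # Echo to stdout so the (indexed) console log picks it up.
                    print("NESTED-VIRT-INDICATOR %s:%d: %s"
                          % (path, lineno, line.rstrip()))
                    hits += 1
    print("NESTED-VIRT-INDICATOR total hits: %d" % hits)


if __name__ == "__main__":
    main()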
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
