Hi, all
We have a newly installed OpenStack cluster. Every VM in the cluster randomly
freezes (for about 2~3 seconds) every few minutes. When a VM freezes, pings
from outside (a physical machine or another VM) show a latency of a few
seconds or time out (vs. <1 ms in the normal situation). Processes on the VM
that do not use the network (e.g. writing the date and time to a file each
second) also stop working during those freezing periods (so the file has no
lines for those seconds).
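For reference, here is a minimal sketch of that kind of in-guest check. It is
not the exact script we ran; the names find_stalls and watch and the
0.5-second slack are only illustrative assumptions. It assumes the guest clock
keeps advancing while the vcpus are stalled, so a late wakeup shows up as a gap.

```python
import time


def find_stalls(samples, interval=1.0, slack=0.5):
    """Return (previous_sample, gap_seconds) for each gap longer than interval + slack.

    `samples` is a list of timestamps in seconds, one per expected tick.
    """
    stalls = []
    for prev, cur in zip(samples, samples[1:]):
        gap = cur - prev
        if gap > interval + slack:
            stalls.append((prev, gap))
    return stalls


def watch(interval=1.0, slack=0.5):
    """Sleep in a loop and print a line whenever a wakeup arrives much too late."""
    last = time.monotonic()
    while True:
        time.sleep(interval)
        now = time.monotonic()
        if now - last > interval + slack:
            late = now - last - interval
            print(f"{time.strftime('%H:%M:%S')} woke up {late:.2f}s late")
        last = now
```

Calling watch() inside a guest prints one line per suspected freeze, and
find_stalls() can be applied afterwards to timestamps read back from a log
file, assuming one numeric timestamp per line.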
On the compute node that runs the VM, we found high CPU usage (often near or
over 100%) of the qemu process running the VM when it freezes. But inside the
VM, CPU utilization remains low all the time. This suggests that the CPU time
is either going to qemu for some busy work rather than to its vcpu threads, or
going to the vcpu threads without them entering guest mode during that time.
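One way to tell those two cases apart might be to sample per-thread CPU
counters of the qemu process on the host. The sketch below is only an
illustration under assumptions: the helper names parse_stat, thread_times and
delta are made up, and it relies on the field layout documented in proc(5),
where utime is field 14, stime is field 15 and guest_time is field 43 (all in
clock ticks, with guest time also counted in utime).

```python
import os


def parse_stat(line):
    """Parse one /proc/<pid>/task/<tid>/stat line into a name and tick counters."""
    # The thread name sits in parentheses and may itself contain spaces or ')',
    # so split on the last ')' before reading the numeric fields.
    head, _, tail = line.rpartition(")")
    name = head.split("(", 1)[1]
    fields = tail.split()
    # proc(5) numbers fields from 1; `fields` starts at field 3 (state),
    # so field n is found at index n - 3.
    return {
        "name": name,
        "utime": int(fields[11]),       # field 14: user ticks (includes guest time)
        "stime": int(fields[12]),       # field 15: kernel ticks
        "guest_time": int(fields[40]),  # field 43: ticks spent running a vcpu
    }


def thread_times(pid):
    """Return {tid: counters} for every thread of the given process."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                result[int(tid)] = parse_stat(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listdir() and open()
    return result


def delta(before, after):
    """Per-thread tick differences between two thread_times() snapshots."""
    out = {}
    for tid, now in after.items():
        old = before.get(tid)
        if old is None:
            continue  # thread did not exist in the first snapshot
        out[tid] = {k: now[k] - old[k] for k in ("utime", "stime", "guest_time")}
        out[tid]["name"] = now["name"]
    return out
```

Taking two thread_times() snapshots of the qemu PID about a second apart
across a freeze and passing them to delta() shows which threads consumed the
CPU. A thread whose stime grows while its guest_time stays flat would be
spending its time in the host kernel rather than in guest mode.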
We have a very similar OpenStack setup using the same versions of
OpenStack/qemu/kvm/host OS/guest OS, which does not have such freezes. The
only obvious difference is that the "randomly freezing" compute nodes are
Huawei RH 2288H V2, and the "good" ones are some Dell servers (I can get that
info if it is important). The CPUs are Xeon E5-2560 and Xeon 5405
respectively, with the former having more advanced virtualization support
(VT-d and EPT).
The host OS is Ubuntu 14.04 LTS (kernel 3.13.0-32-generic), and the qemu
version is 2.0. The guest OS does not seem to matter (it happens on the few
different guest OSes we have tried).
We have only a rough idea that it is related to some scheduling problem on the
host leading to starvation of the vcpu threads. Other freezing problems
reported online were solved by disabling kvm-clock, but we tried that and it
did not help.
We lack a diagnostic method to identify the root cause. Could anyone suggest
where we should start? Any "suspected fixes" are also welcome.
We