On 10/12/17 03:26, Landry Breuil wrote:
On Sat, Dec 09, 2017 at 04:33:28PM +0100, Juan Francisco Cantero Hurtado wrote:
On Thu, Dec 07, 2017 at 02:27:29PM +0100, Landry Breuil wrote:
On Thu, Dec 07, 2017 at 11:52:46AM +0100, Martin Pieuchot wrote:
On 07/12/17(Thu) 08:34, Landry Breuil wrote:
Hi,

i've been having kvm VMs running 6.2 hardlocking/deadlocking since a
while, all those running on proxmox 5.1 using linux 4.13.8 & qemu-kvm
2.9.1. There were hardlocks upon reboot which were 'solved' by disabling
x2apic emulation in kvm (args: -cpu=kvm64,-x2apic) or giving the host
cpu flags to the vm (args: -cpu host) but there still remains deadlocks
during normal operation.

I'm now running a kernel with MP_LOCKDEBUG, so i'm collecting traces in
the vain hope that it might help someone interested in locking issues.
Here's the latest one:

Let me add that when you had x2apic enabled the kernel 'froze' inside
x2apic_readreg, trace below:

   ddb{0}> tr
   x2apic_readreg(10) at x2apic_readreg+0xf
   lapic_delay(ffff800022136900) at lapic_delay+0x5c
   rtcput(ffff800022136960) at rtcput+0x65
   resettodr() at resettodr+0x1d6
   perform_resettodr(ffffffff81769b29) at perform_resettodr+0x9
   taskq_thread(0) at taskq_thread+0x67
   end trace frame: 0x0, count: -6

What you're seeing with a MP_LOCKDEBUG kernel is just a symptom.  A CPU
enters DDB because another one is 'frozen' while holding the
KERNEL_LOCK().  What's interesting is that in both case the frozen CPU
is trying to execute apic related code:
   - x2apic_readreg
   - lapic_delay

I believe this issue should be reported to KVM developers as well.

*very* interestingly, i had a new lock, running bsd.sp.. So i think that
rules out openbsd mp.

ddb> tr
i82489_readreg(0) at i82489_readreg+0xd
lapic_delay(ffffffff81a84090) at lapic_delay+0x5c
rtcget(ffffffff81a84090) at rtcget+0x1a
resettodr() at resettodr+0x3a
perform_resettodr(ffffffff81659e99) at perform_resettodr+0x9
taskq_thread(0) at taskq_thread+0x57
end trace frame: 0x0, count: -6

Try running with "-machine q35". It changes the emulated machine to
a modern platform.

Right, i suppose that matches https://wiki.qemu.org/Features/Q35,
interesting. Will definitely try, mailed the kvm mailing list but got no
feedback so far.

I've been seeing this for a while too, on VMs that are already run with -machine q35 and -cpu host. I was blaming my (still unfinished) pvclock code, but now I can fairly easily trigger it on single cpu VMs without that, mostly by running kernel compiles in a loop in a couple of different guests. I'm using a Fedora 25 (4.10.15-200.fc25.x86_64) kernel.

Adding some debug output to lapic_delay, it appears the KVM virtualized lapic counter hits zero and doesn't reset, so the lapic_delay loop in the guest never terminates. KVM has several different ways it can provide the lapic counter and I'm not sure which one I'm using yet.

I just tried making lapic_delay give up after a million zero reads, and it seems to recover after a minute or so. I'll leave it running to see if it happens again.

Reply via email to