On Tue, Dec 12, 2017 at 09:11:40PM +1000, Jonathan Matthew wrote:
> On Mon, Dec 11, 2017 at 09:34:00AM +0100, Landry Breuil wrote:
> > On Mon, Dec 11, 2017 at 06:21:01PM +1000, Jonathan Matthew wrote:
> > > On 10/12/17 03:26, Landry Breuil wrote:
> > > > On Sat, Dec 09, 2017 at 04:33:28PM +0100, Juan Francisco Cantero Hurtado wrote:
> > > > > On Thu, Dec 07, 2017 at 02:27:29PM +0100, Landry Breuil wrote:
> > > > > > On Thu, Dec 07, 2017 at 11:52:46AM +0100, Martin Pieuchot wrote:
> > > > > > > On 07/12/17(Thu) 08:34, Landry Breuil wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > I've had KVM VMs running 6.2 hardlocking/deadlocking for a while
> > > > > > > > now, all of them running on proxmox 5.1 using linux 4.13.8 &
> > > > > > > > qemu-kvm 2.9.1. There were hardlocks upon reboot which were
> > > > > > > > 'solved' by disabling x2apic emulation in kvm (args:
> > > > > > > > -cpu=kvm64,-x2apic) or by giving the host cpu flags to the vm
> > > > > > > > (args: -cpu host), but deadlocks still occur during normal
> > > > > > > > operation.
> > > > > > > > 
> > > > > > > > I'm now running a kernel with MP_LOCKDEBUG, so I'm collecting
> > > > > > > > traces in the vain hope that they might help someone interested
> > > > > > > > in locking issues.  Here's the latest one:
> > > > > > > 
> > > > > > > Let me add that when you had x2apic enabled the kernel 'froze'
> > > > > > > inside x2apic_readreg, trace below:
> > > > > > > 
> > > > > > >    ddb{0}> tr
> > > > > > >    x2apic_readreg(10) at x2apic_readreg+0xf
> > > > > > >    lapic_delay(ffff800022136900) at lapic_delay+0x5c
> > > > > > >    rtcput(ffff800022136960) at rtcput+0x65
> > > > > > >    resettodr() at resettodr+0x1d6
> > > > > > >    perform_resettodr(ffffffff81769b29) at perform_resettodr+0x9
> > > > > > >    taskq_thread(0) at taskq_thread+0x67
> > > > > > >    end trace frame: 0x0, count: -6
> > > > > > > 
> > > > > > > What you're seeing with an MP_LOCKDEBUG kernel is just a symptom.
> > > > > > > A CPU enters DDB because another one is 'frozen' while holding the
> > > > > > > KERNEL_LOCK().  What's interesting is that in both cases the
> > > > > > > frozen CPU is trying to execute apic-related code:
> > > > > > >    - x2apic_readreg
> > > > > > >    - lapic_delay
> > > > > > > 
> > > > > > > I believe this issue should be reported to KVM developers as well.
> > > > > > 
> > > > > > *Very* interestingly, I hit a new lockup while running bsd.sp, so I
> > > > > > think that rules out the OpenBSD MP kernel.
> > > > > > 
> > > > > > ddb> tr
> > > > > > i82489_readreg(0) at i82489_readreg+0xd
> > > > > > lapic_delay(ffffffff81a84090) at lapic_delay+0x5c
> > > > > > rtcget(ffffffff81a84090) at rtcget+0x1a
> > > > > > resettodr() at resettodr+0x3a
> > > > > > perform_resettodr(ffffffff81659e99) at perform_resettodr+0x9
> > > > > > taskq_thread(0) at taskq_thread+0x57
> > > > > > end trace frame: 0x0, count: -6
> > > > > 
> > > > > Try running with "-machine q35". It changes the emulated machine to
> > > > > a modern platform.
> > > > 
> > > > Right, I suppose that matches https://wiki.qemu.org/Features/Q35,
> > > > interesting. I'll definitely try it; I mailed the kvm mailing list but
> > > > have had no feedback so far.
> > > 
> > > I've been seeing this for a while too, on VMs that already run with
> > > -machine q35 and -cpu host.  I was blaming my (still unfinished)
> > > pvclock code, but now I can fairly easily trigger it on single-cpu VMs
> > > without that, mostly by running kernel compiles in a loop in a couple
> > > of different guests.  I'm using a Fedora 25 (4.10.15-200.fc25.x86_64)
> > > kernel.
> > > 
> > > Adding some debug output to lapic_delay, it appears the KVM-virtualized
> > > lapic counter hits zero and doesn't reset, so the lapic_delay loop in
> > > the guest never terminates.  KVM has several different ways it can
> > > provide the lapic counter and I'm not sure which one I'm using yet.
> > > 
> > > I just tried making lapic_delay give up after a million zero reads, and it
> > > seems to recover after a minute or so.  I'll leave it running to see if it
> > > happens again.
> > 
> > Hah, interesting. So I know I can try q35 just for the sake of emulating
> > a newer hardware platform, but that won't fix this issue.
> > 
> > Upstream I see this thread/patchset: https://lkml.org/lkml/2017/9/28/773
> > but I'm not sure it's the same issue.
> > https://patchwork.kernel.org/patch/9036161/ might be related too?
> > 
> > When you say you modified lapic_delay, I guess that's on the guest side?
> 
> Yeah, I haven't changed anything on the host side yet.
> 
> I have a sort-of-working theory now.  It looks like polling the lapic
> counter in a tight loop starves whatever bit of kvm is responsible for
> resetting the counter.  If the counter hits 0 and then we enter
> lapic_delay before kvm gets a chance to reset it, it'll never get reset
> and lapic_delay will never finish.
> 
> The gross hack below makes lapic_delay issue a hlt instruction when it
> looks like the lapic counter is stuck, which seems to be enough for kvm
> to reset it.  To make it easier to hit, I also reduced RESETTODR_PERIOD
> (in sys/kern/kern_time.c) to 30, since it always seems to be the RTC
> read/write operations that get stuck.  Maybe there aren't any other
> delay()s happening.
> 
> Next I guess I'm going to dig through the kvm code to see how the lapic
> counter reset works.

So, for me at least, adding 'kvm-intel.preemption_timer=0' to the linux kernel
commandline (on the host) fixes this.  This option disables kvm's use of the
vmx preemption timer to schedule lapic counter events.  I think there's some
problem in how kvm calculates the tsc deadline, or the relationship between
that and the time value it uses for lapic counter reads, but I'm not sure what
exactly.
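
For anyone wanting to make that persistent without editing the bootloader commandline, the equivalent module option can go in a modprobe configuration file on the host; the path below is the usual one on Linux distros but may differ, so treat this as a sketch:

```
# /etc/modprobe.d/kvm-intel.conf (on the KVM host)
options kvm_intel preemption_timer=0
```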
