Josh Rickmar writes:
> On Sun, May 09, 2021 at 01:50:58PM +0000, Dave Voutila wrote:
>>
>> Mike Larkin writes:
>>
>> > On Sat, May 08, 2021 at 08:14:35AM -0400, Dave Voutila wrote:
>> >>
>> >> Josh Rickmar writes:
>> >>
>> >> > On Fri, May 07, 2021 at 04:19:18PM -0400, Dave Voutila wrote:
>> >> >>
>> >> >> Josh Rickmar writes:
>> >> >>
>> >> >> >>Synopsis: vmm protection fault trap
>> >> >> >>Category: vmm
>> >> >> >>Environment:
>> >> >> > System : OpenBSD 6.9
>> >> >> > Details : OpenBSD 6.9-current (GENERIC.MP) #6: Thu May 6
>> >> >> > 10:16:53 MDT 2021
>> >> >> >
>> >> >> > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>> >> >> >
>> >> >> > Architecture: OpenBSD.amd64
>> >> >> > Machine : amd64
>> >> >> >>Description:
>> >> >> >
>> >> >> > My nixos vm is causing the host kernel to crash (after cold boot)
>> >> >> > with
>> >> >> > 'protection fault trap, code=0'. The guest is running Linux 5.11.14
>> >> >> > (guest dmesg included after the host dmesg below). I've also
>> >> >> > attached
>> >> >> > a screenshot of ddb showing the backtrace and registers.
>> >> >> >
>> >> >> >>How-To-Repeat:
>> >> >> >
>> >> >> > The crash can be reliably triggered by doing heavy disk IO on the vm.
>> >> >> > Upgrading the VM actually got the nixos install wedged during an
>> >> >> > initial crash, and attempting to repair it with "nix-build -A system
>> >> >> > '<nixpkgs/nixos>' --repair" is reliably repeating the crash.
>> >> >>
>> >> >> Any chance you've experienced this with a non-NixOS guest? I can't
>> >> >> reproduce this error on my Ryzen5 Pro host.
>> >> >>
>> >> >>
>> >> I've reproduced this locally with the help of abieber@. Seems I just
>> >> need to boot a nixos iso (nixos-21.05pre287333.63586475587-x86_64) and
>> >> try installing a package like git into the ramdisk:
>> >>
>> >> # nix-env -f '<nixpkgs>' -iA git
>> >>
>> >> I still haven't triggered this without nixos, but at least I can
>> >> reproduce it locally now. :-)
>> >>
>> >> -dv
>> >>
>> >
>> > robert@ reported this same bug a long time ago and I could never reproduce
>> > it.
>> >
>> > I'll see if it repros against my R415 using these instructions.
>> >
>> > -ml
>>
>> So far I haven't managed to trigger it using this diff. I don't know
>> why, but maybe the guest is mucking with the GDTR? I checked our logic
>> vs. netbsd nvmm's...as well as our acpi resume handling...and that's all
>> I can think of to explain it.

<snip>

>
> I was able to repair my nix store with this diff (twice, first time on
> a derived qcow2 image for testing).

Updated diff below after working with mlarkin@ on identifying the root
cause. We were being overly fancy tracking which CPU we were on, leading
to a rare edge case where the gdt we use for deriving the input to ltr
caused the #GP. (Which explains why my previous attempt of using lgdtq
prevented ltrw from barfing.)

Josh & abieber@, can you give this a quick test please?


Index: sys/arch/amd64/amd64/vmm.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.280
diff -u -p -r1.280 vmm.c
--- sys/arch/amd64/amd64/vmm.c	6 Apr 2021 00:19:58 -0000	1.280
+++ sys/arch/amd64/amd64/vmm.c	11 May 2021 16:44:46 -0000
@@ -6970,15 +6970,14 @@ vmm_handle_cpuid(struct vcpu *vcpu)
 int
 vcpu_run_svm(struct vcpu *vcpu, struct vm_run_params *vrp)
 {
-	int ret = 0, resume;
+	int ret = 0;
 	struct region_descriptor gdt;
-	struct cpu_info *ci;
+	struct cpu_info *ci = NULL;
 	uint64_t exit_reason;
 	struct schedstate_percpu *spc;
 	uint16_t irq;
 	struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va;
 
-	resume = 0;
 	irq = vrp->vrp_irq;
 
 	/*
@@ -7000,7 +6999,7 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
 
 	while (ret == 0) {
 		vmm_update_pvclock(vcpu);
-		if (!resume) {
+		if (ci != curcpu()) {
 			/*
 			 * We are launching for the first time, or we are
 			 * resuming from a different pcpu, so we need to
@@ -7106,8 +7105,6 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
 
 		/* If we exited successfully ... */
 		if (ret == 0) {
-			resume = 1;
-
 			vcpu->vc_gueststate.vg_rflags = vmcb->v_rflags;
 
 			/*
@@ -7149,7 +7146,6 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
 			/* Check if we should yield - don't hog the cpu */
 			spc = &ci->ci_schedstate;
 			if (spc->spc_schedflags & SPCF_SHOULDYIELD) {
-				resume = 0;
 				yield();
 			}
 		}
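
For anyone following along without the tree handy, here is a rough
userland sketch of the pattern the diff moves to: rather than keeping a
separate "resume" flag in sync by hand, compare the cached CPU pointer
against the CPU we are on right now and redo the per-CPU setup whenever
they differ. This is not the vmm code. struct cpu, struct run_state,
prepare_for_entry() and count_setups() are made-up names for
illustration, and the "setup" is just a counter standing in for whatever
the body of the if block does.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's per-CPU structure. */
struct cpu {
	int id;
};

/* Hypothetical run state: the CPU whose per-CPU setup is in effect. */
struct run_state {
	const struct cpu *loaded_on;	/* NULL until the first setup */
	int setups;			/* number of times setup ran */
};

/*
 * Redo the per-CPU setup only when the cached CPU differs from the CPU
 * we are on now. A NULL cached pointer never matches a real CPU, so the
 * first call always performs the setup.
 * Returns 1 if setup was performed, 0 if it was skipped.
 */
int
prepare_for_entry(struct run_state *rs, const struct cpu *now)
{
	if (rs->loaded_on == now)
		return 0;
	rs->loaded_on = now;	/* stand-in for the real per-CPU setup */
	rs->setups++;
	return 1;
}

/*
 * Replay a trace of "which CPU did each loop iteration run on" and
 * report how many times the setup had to be redone.
 */
int
count_setups(const struct cpu *const *trace, size_t n)
{
	struct run_state rs = { NULL, 0 };
	size_t i;

	for (i = 0; i < n; i++)
		prepare_for_entry(&rs, trace[i]);
	return rs.setups;
}
```

With a trace of A, A, B, B, A this redoes the setup three times: once at
launch and once for each migration. The point of the sketch is only that
the pointer comparison makes the decision from the current state on
every pass, so there is no flag that can go stale.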