On Tue, May 11, 2021 at 12:50:26PM -0400, Dave Voutila wrote:
>
> Josh Rickmar writes:
>
> > On Sun, May 09, 2021 at 01:50:58PM +0000, Dave Voutila wrote:
> >>
> >> Mike Larkin writes:
> >>
> >> > On Sat, May 08, 2021 at 08:14:35AM -0400, Dave Voutila wrote:
> >> >>
> >> >> Josh Rickmar writes:
> >> >>
> >> >> > On Fri, May 07, 2021 at 04:19:18PM -0400, Dave Voutila wrote:
> >> >> >>
> >> >> >> Josh Rickmar writes:
> >> >> >>
> >> >> >> >>Synopsis: vmm protection fault trap
> >> >> >> >>Category: vmm
> >> >> >> >>Environment:
> >> >> >> > System : OpenBSD 6.9
> >> >> >> > Details : OpenBSD 6.9-current (GENERIC.MP) #6: Thu May 6 10:16:53 MDT 2021
> >> >> >> >
> >> >> >> > [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> >> >> >> >
> >> >> >> > Architecture: OpenBSD.amd64
> >> >> >> > Machine : amd64
> >> >> >> >>Description:
> >> >> >> >
> >> >> >> > My nixos vm is causing the host kernel to crash (after cold boot) with
> >> >> >> > 'protection fault trap, code=0'. The guest is running Linux 5.11.14
> >> >> >> > (guest dmesg included after the host dmesg below). I've also attached
> >> >> >> > a screenshot of ddb showing the backtrace and registers.
> >> >> >> >
> >> >> >> >>How-To-Repeat:
> >> >> >> >
> >> >> >> > The crash can be reliably triggered by doing heavy disk IO on the vm.
> >> >> >> > Upgrading the VM actually got the nixos install wedged during an
> >> >> >> > initial crash, and attempting to repair it with
> >> >> >> > "nix-build -A system '<nixpkgs/nixos>' --repair" is reliably
> >> >> >> > repeating the crash.
> >> >> >>
> >> >> >> Any chance you've experienced this with a non-NixOS guest? I can't
> >> >> >> reproduce this error on my Ryzen5 Pro host.
> >> >> >>
> >> >>
> >> >> I've reproduced this locally with the help of abieber@.
> >> >> Seems I just need to boot a nixos iso
> >> >> (nixos-21.05pre287333.63586475587-x86_64) and try installing a
> >> >> package like git into the ramdisk:
> >> >>
> >> >> # nix-env -f '<nixpkgs>' -iA git
> >> >>
> >> >> I still haven't triggered this without nixos, but at least I can
> >> >> reproduce it locally now. :-)
> >> >>
> >> >> -dv
> >> >>
> >> >
> >> > robert@ reported this same bug a long time ago and I could never
> >> > reproduce it.
> >> >
> >> > I'll see if it repros against my R415 using these instructions.
> >> >
> >> > -ml
> >>
> >> So far I haven't managed to trigger it using this diff. I don't know
> >> why, but maybe the guest is mucking with the GDTR? I checked our logic
> >> vs. netbsd nvmm's...as well as our acpi resume handling...and that's all
> >> I can think of to explain it.
> <snip>
> >
> > I was able to repair my nix store with this diff (twice, first time on
> > a derived qcow2 image for testing).
>
> Updated diff below after working with mlarkin@ on identifying the root
> cause. We were being overly fancy tracking which CPU we were on leading
> to a rare edgecase where the gdt we use for deriving the input to ltr
> caused the #GP. (Which explains why my previous attempt of using lgdtq
> prevented ltrw from barfing.)
>
> Josh & abieber@, can you give this a quick test please?
>
>
> Index: sys/arch/amd64/amd64/vmm.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.280
> diff -u -p -r1.280 vmm.c
> --- sys/arch/amd64/amd64/vmm.c	6 Apr 2021 00:19:58 -0000	1.280
> +++ sys/arch/amd64/amd64/vmm.c	11 May 2021 16:44:46 -0000
> @@ -6970,15 +6970,14 @@ vmm_handle_cpuid(struct vcpu *vcpu)
>  int
>  vcpu_run_svm(struct vcpu *vcpu, struct vm_run_params *vrp)
>  {
> -	int ret = 0, resume;
> +	int ret = 0;
>  	struct region_descriptor gdt;
> -	struct cpu_info *ci;
> +	struct cpu_info *ci = NULL;
>  	uint64_t exit_reason;
>  	struct schedstate_percpu *spc;
>  	uint16_t irq;
>  	struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va;
>
> -	resume = 0;
>  	irq = vrp->vrp_irq;
>
>  	/*
> @@ -7000,7 +6999,7 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
>
>  	while (ret == 0) {
>  		vmm_update_pvclock(vcpu);
> -		if (!resume) {
> +		if (ci != curcpu()) {
>  			/*
>  			 * We are launching for the first time, or we are
>  			 * resuming from a different pcpu, so we need to
> @@ -7106,8 +7105,6 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
>
>  		/* If we exited successfully ... */
>  		if (ret == 0) {
> -			resume = 1;
> -
>  			vcpu->vc_gueststate.vg_rflags = vmcb->v_rflags;
>
>  			/*
> @@ -7149,7 +7146,6 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
>  		/* Check if we should yield - don't hog the cpu */
>  		spc = &ci->ci_schedstate;
>  		if (spc->spc_schedflags & SPCF_SHOULDYIELD) {
> -			resume = 0;
>  			yield();
>  		}
>  	}

This also fixes the crash for me. Tested by installing git and electron into the ramdisk with abieber@'s iso as well as temporarily installing these on my real nixos vm.
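
For anyone skimming the thread later, here is a rough userland model of why the quoted diff swaps the "resume" flag for a "ci != curcpu()" check. It is only a sketch, not kernel code: the function names, the cpu_at[] and yielded_before[] arrays, and the idea that the thread can change CPUs at points other than the explicit yield() are all my assumptions for illustration. It counts loop iterations that would run with per-CPU state loaded for the wrong CPU.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model (names are illustrative, not from vmm.c).
 * cpu_at[i]         - CPU the thread happens to be on for iteration i
 * yielded_before[i] - nonzero if the loop called yield() just before i
 * Returns how many iterations ran with per-CPU state loaded for a
 * different CPU than the one the thread was actually on.
 */
size_t
stale_runs_with_flag(const int *cpu_at, const int *yielded_before, size_t n)
{
	int resume = 0;		/* old scheme: "state is already loaded" flag */
	int loaded_cpu = -1;	/* CPU whose state was last loaded */
	size_t stale = 0;

	for (size_t i = 0; i < n; i++) {
		assert(cpu_at[i] >= 0);
		if (yielded_before[i])
			resume = 0;	/* only an explicit yield clears it */
		if (!resume)
			loaded_cpu = cpu_at[i];	/* reload per-CPU state */
		if (loaded_cpu != cpu_at[i])
			stale++;	/* moved CPUs without a reload */
		resume = 1;
	}
	return stale;
}

/* New scheme: reload whenever the remembered CPU differs from the current one. */
size_t
stale_runs_with_compare(const int *cpu_at, size_t n)
{
	int loaded_cpu = -1;	/* plays the role of ci = NULL */
	size_t stale = 0;

	for (size_t i = 0; i < n; i++) {
		assert(cpu_at[i] >= 0);
		if (loaded_cpu != cpu_at[i])
			loaded_cpu = cpu_at[i];	/* like ci != curcpu() */
		if (loaded_cpu != cpu_at[i])
			stale++;	/* cannot happen with this check */
	}
	return stale;
}
```

Feeding both functions a CPU sequence that changes without a matching yield makes the flag version report stale iterations while the compare version reports none, which matches my reading of the root cause described in the quoted message.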
