Re: vmm protection fault trap

Josh Rickmar Tue, 11 May 2021 10:52:10 -0700

On Tue, May 11, 2021 at 12:50:26PM -0400, Dave Voutila wrote:
> 
> Josh Rickmar writes:
> 
> > On Sun, May 09, 2021 at 01:50:58PM +0000, Dave Voutila wrote:
> >>
> >> Mike Larkin writes:
> >>
> >> > On Sat, May 08, 2021 at 08:14:35AM -0400, Dave Voutila wrote:
> >> >>
> >> >> Josh Rickmar writes:
> >> >>
> >> >> > On Fri, May 07, 2021 at 04:19:18PM -0400, Dave Voutila wrote:
> >> >> >>
> >> >> >> Josh Rickmar writes:
> >> >> >>
> >> >> >> >>Synopsis:  vmm protection fault trap
> >> >> >> >>Category:  vmm
> >> >> >> >>Environment:
> >> >> >> >    System      : OpenBSD 6.9
> >> >> >> >    Details     : OpenBSD 6.9-current (GENERIC.MP) #6: Thu May  6 10:16:53 MDT 2021
> >> >> >> >                     [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> >> >> >> >
> >> >> >> >    Architecture: OpenBSD.amd64
> >> >> >> >    Machine     : amd64
> >> >> >> >>Description:
> >> >> >> >
> >> >> >> > My nixos vm is causing the host kernel to crash (after cold boot) with
> >> >> >> > 'protection fault trap, code=0'.  The guest is running Linux 5.11.14
> >> >> >> > (guest dmesg included after the host dmesg below).  I've also attached
> >> >> >> > a screenshot of ddb showing the backtrace and registers.
> >> >> >> >
> >> >> >> >>How-To-Repeat:
> >> >> >> >
> >> >> >> > The crash can be reliably triggered by doing heavy disk IO on the vm.
> >> >> >> > Upgrading the VM actually got the nixos install wedged during an
> >> >> >> > initial crash, and attempting to repair it with "nix-build -A system
> >> >> >> > '<nixpkgs/nixos>' --repair" is reliably repeating the crash.
> >> >> >>
> >> >> >> Any chance you've experienced this with a non-NixOS guest? I can't
> >> >> >> reproduce this error on my Ryzen5 Pro host.
> >> >> >>
> >> >>
> >> >> I've reproduced this locally with the help of abieber@. Seems I just
> >> >> need to boot a nixos iso (nixos-21.05pre287333.63586475587-x86_64) and
> >> >> try installing a package like git into the ramdisk:
> >> >>
> >> >>   # nix-env -f '<nixpkgs>' -iA git
> >> >>
> >> >> I still haven't triggered this without nixos, but at least I can
> >> >> reproduce it locally now. :-)
> >> >>
> >> >> -dv
> >> >>
> >> >
> >> > robert@ reported this same bug a long time ago and I could never reproduce it.
> >> >
> >> > I'll see if it repros against my R415 using these instructions.
> >> >
> >> > -ml
> >>
> >> So far I haven't managed to trigger it using this diff. I don't know
> >> why, but maybe the guest is mucking with the GDTR? I checked our logic
> >> vs. netbsd nvmm's...as well as our acpi resume handling...and that's all
> >> I can think of to explain it.
> <snip>
> >
> > I was able to repair my nix store with this diff (twice, first time on
> > a derived qcow2 image for testing).
> 
> Updated diff below after working with mlarkin@ on identifying the root
> cause. We were being overly fancy tracking which CPU we were on, leading
> to a rare edge case where the gdt we use for deriving the input to ltr
> caused the #GP. (Which explains why my previous attempt of using lgdtq
> prevented ltrw from barfing.)
> 
> Josh & abieber@, can you give this a quick test please?
> 
> 
> Index: sys/arch/amd64/amd64/vmm.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.280
> diff -u -p -r1.280 vmm.c
> --- sys/arch/amd64/amd64/vmm.c        6 Apr 2021 00:19:58 -0000       1.280
> +++ sys/arch/amd64/amd64/vmm.c        11 May 2021 16:44:46 -0000
> @@ -6970,15 +6970,14 @@ vmm_handle_cpuid(struct vcpu *vcpu)
>  int
>  vcpu_run_svm(struct vcpu *vcpu, struct vm_run_params *vrp)
>  {
> -     int ret = 0, resume;
> +     int ret = 0;
>       struct region_descriptor gdt;
> -     struct cpu_info *ci;
> +     struct cpu_info *ci = NULL;
>       uint64_t exit_reason;
>       struct schedstate_percpu *spc;
>       uint16_t irq;
>       struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va;
> 
> -     resume = 0;
>       irq = vrp->vrp_irq;
> 
>       /*
> @@ -7000,7 +6999,7 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
> 
>       while (ret == 0) {
>               vmm_update_pvclock(vcpu);
> -             if (!resume) {
> +             if (ci != curcpu()) {
>                       /*
>                        * We are launching for the first time, or we are
>                        * resuming from a different pcpu, so we need to
> @@ -7106,8 +7105,6 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
> 
>               /* If we exited successfully ... */
>               if (ret == 0) {
> -                     resume = 1;
> -
>                       vcpu->vc_gueststate.vg_rflags = vmcb->v_rflags;
> 
>                       /*
> @@ -7149,7 +7146,6 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
>                       /* Check if we should yield - don't hog the cpu */
>                       spc = &ci->ci_schedstate;
>                       if (spc->spc_schedflags & SPCF_SHOULDYIELD) {
> -                             resume = 0;
>                               yield();
>                       }
>               }


This also fixes the crash for me.  Tested by installing git and
electron into the ramdisk with abieber@'s iso as well as temporarily
installing these on my real nixos vm.
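
For anyone skimming the diff later, here is a rough userland sketch of how
I read the change: instead of a separate "resume" flag that has to be
cleared at every place the thread might change CPUs, remember the CPU whose
state was last loaded and compare it against the current one on every pass.
Everything in it is made up for illustration (struct result, run_with_flag,
run_with_pointer, the cpu_at/yielded arrays, and the cut-down struct
cpu_info); it is not the vmm code, and it simply assumes that a migration
can happen without the flag being cleared, since the thread does not spell
out the exact edge case.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU 2

/* Hypothetical stand-in for the kernel's struct cpu_info. */
struct cpu_info { int id; };
static struct cpu_info cpus[NCPU] = { { 0 }, { 1 } };

struct result {
	size_t reloads;	/* times per-CPU state was (re)loaded */
	size_t stale;	/* passes that ran with another CPU's state */
};

/*
 * Old shape: a flag that is only cleared after an explicit yield.
 * cpu_at[i] is the CPU we run on during pass i; yielded[i] is nonzero
 * if an explicit yield happened just before pass i.
 */
struct result
run_with_flag(const int *cpu_at, const int *yielded, size_t n)
{
	struct result r = { 0, 0 };
	const struct cpu_info *loaded = NULL;
	int resume = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct cpu_info *cur;

		assert(cpu_at[i] >= 0 && cpu_at[i] < NCPU);
		cur = &cpus[cpu_at[i]];

		if (yielded[i])
			resume = 0;
		if (!resume) {
			loaded = cur;	/* reload per-CPU state */
			r.reloads++;
		}
		if (loaded != cur)
			r.stale++;	/* migrated, but the flag missed it */
		resume = 1;
	}
	return r;
}

/* New shape: compare the remembered CPU with the current one each pass. */
struct result
run_with_pointer(const int *cpu_at, size_t n)
{
	struct result r = { 0, 0 };
	const struct cpu_info *ci = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct cpu_info *cur;

		assert(cpu_at[i] >= 0 && cpu_at[i] < NCPU);
		cur = &cpus[cpu_at[i]];

		if (ci != cur) {
			ci = cur;	/* first pass or different CPU: reload */
			r.reloads++;
		}
		if (ci != cur)
			r.stale++;	/* cannot happen by construction */
	}
	return r;
}
```

With a schedule such as cpu_at = {0, 0, 1, 1, 0} and a single yield before
the last pass, the flag version runs two passes on CPU 1 with CPU 0's state
still loaded, while the pointer version reloads on every change of CPU.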
