Josh Rickmar writes:

> On Sun, May 09, 2021 at 01:50:58PM +0000, Dave Voutila wrote:
>>
>> Mike Larkin writes:
>>
>> > On Sat, May 08, 2021 at 08:14:35AM -0400, Dave Voutila wrote:
>> >>
>> >> Josh Rickmar writes:
>> >>
>> >> > On Fri, May 07, 2021 at 04:19:18PM -0400, Dave Voutila wrote:
>> >> >>
>> >> >> Josh Rickmar writes:
>> >> >>
>> >> >> >>Synopsis:    vmm protection fault trap
>> >> >> >>Category:    vmm
>> >> >> >>Environment:
>> >> >> >      System      : OpenBSD 6.9
>> >> >> >      Details     : OpenBSD 6.9-current (GENERIC.MP) #6: Thu May  6 10:16:53 MDT 2021
>> >> >> >                       dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>> >> >> >
>> >> >> >      Architecture: OpenBSD.amd64
>> >> >> >      Machine     : amd64
>> >> >> >>Description:
>> >> >> >
>> >> >> > My nixos vm is causing the host kernel to crash (after cold boot) with
>> >> >> > 'protection fault trap, code=0'.  The guest is running Linux 5.11.14
>> >> >> > (guest dmesg included after the host dmesg below).  I've also attached
>> >> >> > a screenshot of ddb showing the backtrace and registers.
>> >> >> >
>> >> >> >>How-To-Repeat:
>> >> >> >
>> >> >> > The crash can be reliably triggered by doing heavy disk IO on the vm.
>> >> >> > Upgrading the VM actually got the nixos install wedged during an
>> >> >> > initial crash, and attempting to repair it with "nix-build -A system
>> >> >> > '<nixpkgs/nixos>' --repair" is reliably repeating the crash.
>> >> >>
>> >> >> Any chance you've experienced this with a non-NixOS guest? I can't
>> >> >> reproduce this error on my Ryzen5 Pro host.
>> >> >>
>> >>
>> >> I've reproduced this locally with the help of abieber@. Seems I just
>> >> need to boot a nixos iso (nixos-21.05pre287333.63586475587-x86_64) and
>> >> try installing a package like git into the ramdisk:
>> >>
>> >>   # nix-env -f '<nixpkgs>' -iA git
>> >>
>> >> I still haven't triggered this without nixos, but at least I can
>> >> reproduce it locally now. :-)
>> >>
>> >> -dv
>> >>
>> >
>> > robert@ reported this same bug a long time ago and I could never reproduce it.
>> >
>> > I'll see if it repros against my R415 using these instructions.
>> >
>> > -ml
>>
>> So far I haven't managed to trigger it using this diff. I don't know
>> why, but maybe the guest is mucking with the GDTR? I checked our logic
>> vs. netbsd nvmm's...as well as our acpi resume handling...and that's all
>> I can think of to explain it.
<snip>
>
> I was able to repair my nix store with this diff (twice, first time on
> a derived qcow2 image for testing).

Updated diff below after working with mlarkin@ on identifying the root
cause. We were being overly fancy in tracking which CPU we were on, which
led to a rare edge case where the gdt we use for deriving the input to ltr
caused the #GP. (That explains why my previous attempt of using lgdtq
prevented ltrw from barfing.)
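
For anyone following along, here is a toy userland model of the two
tracking schemes. It is only a sketch of how I read the diff, not kernel
code: struct fake_cpu, count_stale_old() and count_stale_new() are
hypothetical names, and the "schedule" array stands in for whichever
physical CPU the vcpu thread happens to be on at each loop iteration. It
assumes the old flag is only cleared on the explicit yield path, so a
migration that happens any other way goes unnoticed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cpu_info; only its identity matters. */
struct fake_cpu {
	int id;
};

/*
 * Old scheme (assumed): a "resume" flag is set after every successful
 * exit, so per-CPU host state is only reloaded on the first iteration.
 * Returns how many iterations ran with state loaded for another CPU.
 */
static int
count_stale_old(const struct fake_cpu *const *sched, size_t n)
{
	const struct fake_cpu *loaded = NULL;
	int resume = 0, stale = 0;
	size_t i;

	assert(sched != NULL);
	for (i = 0; i < n; i++) {
		if (!resume)
			loaded = sched[i];	/* reload per-CPU state */
		if (loaded != sched[i])
			stale++;		/* running with stale state */
		resume = 1;			/* "exited successfully" */
	}
	return stale;
}

/*
 * New scheme: remember the CPU we last loaded state for and compare it
 * with the current one on every iteration.
 */
static int
count_stale_new(const struct fake_cpu *const *sched, size_t n)
{
	const struct fake_cpu *ci = NULL;
	int stale = 0;
	size_t i;

	assert(sched != NULL);
	for (i = 0; i < n; i++) {
		if (ci != sched[i])
			ci = sched[i];		/* reload per-CPU state */
		if (ci != sched[i])
			stale++;		/* never reached by design */
	}
	return stale;
}
```

Feeding both functions a schedule that hops from one CPU to another and
back shows the old flag missing the move while the pointer comparison
catches it every time.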

Josh & abieber@, can you give this a quick test please?


Index: sys/arch/amd64/amd64/vmm.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.280
diff -u -p -r1.280 vmm.c
--- sys/arch/amd64/amd64/vmm.c  6 Apr 2021 00:19:58 -0000       1.280
+++ sys/arch/amd64/amd64/vmm.c  11 May 2021 16:44:46 -0000
@@ -6970,15 +6970,14 @@ vmm_handle_cpuid(struct vcpu *vcpu)
 int
 vcpu_run_svm(struct vcpu *vcpu, struct vm_run_params *vrp)
 {
-       int ret = 0, resume;
+       int ret = 0;
        struct region_descriptor gdt;
-       struct cpu_info *ci;
+       struct cpu_info *ci = NULL;
        uint64_t exit_reason;
        struct schedstate_percpu *spc;
        uint16_t irq;
        struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va;

-       resume = 0;
        irq = vrp->vrp_irq;

        /*
@@ -7000,7 +6999,7 @@ vcpu_run_svm(struct vcpu *vcpu, struct v

        while (ret == 0) {
                vmm_update_pvclock(vcpu);
-               if (!resume) {
+               if (ci != curcpu()) {
                        /*
                         * We are launching for the first time, or we are
                         * resuming from a different pcpu, so we need to
@@ -7106,8 +7105,6 @@ vcpu_run_svm(struct vcpu *vcpu, struct v

                /* If we exited successfully ... */
                if (ret == 0) {
-                       resume = 1;
-
                        vcpu->vc_gueststate.vg_rflags = vmcb->v_rflags;

                        /*
@@ -7149,7 +7146,6 @@ vcpu_run_svm(struct vcpu *vcpu, struct v
                        /* Check if we should yield - don't hog the cpu */
                        spc = &ci->ci_schedstate;
                        if (spc->spc_schedflags & SPCF_SHOULDYIELD) {
-                               resume = 0;
                                yield();
                        }
                }
