The GPCs unload when we finalize the context, which happens before the
process leaves the core at all.  The VMCS lingers, and my intent was to
remove it from the core when the process leaves the core.

However, abandon_core() was the wrong spot for that.  abandon_core() is
more of a "make sure we get out of whatever process context was running on
this core".  There's no guarantee that a process is even loaded at that
time.  What I really wanted was to drop the VMCS (if it was there at all)
when the core is no longer designated as belonging to the proc.

The distinction between these two is related to the difference between
cur_proc and owning_proc.  owning_proc is the process who should be running
on that core, specifically running owning_vcoreid with cur_ctx.  In
contrast, cur_proc (a.k.a. 'current') is the process whose address space is
loaded, for whatever reason.  If the process is running in userspace,
cur_proc == owning_proc.  However, kthreads that work on syscalls for a
process can run on cores that aren't owned by the proc.  Likewise, the
kernel can temporarily enter a process's address space, which usually
involves setting cur_proc.  e.g. send_event(), switch_to(), etc.

Given all this, here was the bug that caused this.  VMs would occasionally
die with an Invalid Opcode trap in the kernel.  I noticed this when they
were killed from ssh (ctrl-c), especially when the VM/VMM was busy.  I
could deterministically recreate it by having the guest spin and sending a
ctrl-c (kill from bash also worked, but kill -9 did not).

The invalid opcode happened during __invept().  From looking at the hexdump
of the arguments, the eptp was 0.  That only gets cleared in __proc_free(),
which made me suspect a refcnt problem.  The ref was indeed 0, so I thought
there was a problem with someone dropping a ref or not upping enough.

It turns out that all refs were accounted for, but the problem was
triggered by kthreads restarting and decreffing the proc before we called
abandon_core().  Specifically:
- kill sends a POSIX signal, the process calls sys_proc_destroy()
- proc_destroy() wakes the parent and sends itself __death, all via KMSGs
  to core_id().
- By the time we get to __death, we're down to about two references:
  owning_proc and cur_proc.  (I'm ignoring the FDs here - there were a
bunch of alarm FDs and syscalls that had refs).
- __death clears_owning_proc, dropping the owning_proc ref.
- Normally, you'd think we'd call abandon_core soon, but first we have the
  __launch_kthread() KMSG, which is restarting our parent's wait syscall
- When restarting that kthread, we switch cur_proc from the VMM (the one
  that is dying, and has only one ref) to the parent.  In doing so, we drop
the final ref for the VMM, which triggers __proc_free() and clears the
eptp.
- Once the parent's syscall is done, we prepare to idle and abandon_core().
  This abandon call isn't clearing the VMM's context, it's clearing the
parents.  abandon() doesn't really care what is running there.
- At this point, __abandon_core() tries to clear the VMCS.  It had never
  been cleared, but it's process (including the GPC!) had been freed.
Yikes!
- The reason the GPC->proc refcnt was zero wasn't because someone messed up
  the refcnts, it's because we didn't clear the GPC before dropping all the
refs.  That GPC->proc ref is internal (aka weak, uncounted).  The ref that
keeps the GPC alive is owning_proc (at least, now it is, after the fix).

Signed-off-by: Barret Rhoden <[email protected]>
---
 kern/arch/riscv/process.c | 4 ++++
 kern/arch/x86/process64.c | 6 +++++-
 kern/include/process.h    | 1 +
 kern/src/kthread.c        | 8 +++++++-
 kern/src/process.c        | 2 ++
 5 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/kern/arch/riscv/process.c b/kern/arch/riscv/process.c
index 7b72d2e01d46..47d0f5ddfddd 100644
--- a/kern/arch/riscv/process.c
+++ b/kern/arch/riscv/process.c
@@ -65,3 +65,7 @@ void __abandon_core(void)
        proc_decref(pcpui->cur_proc);
        pcpui->cur_proc = 0;
 }
+
+void __clear_owning_proc(uint32_t coreid)
+{
+}
diff --git a/kern/arch/x86/process64.c b/kern/arch/x86/process64.c
index 810d453bf680..9ad7ba5cde8d 100644
--- a/kern/arch/x86/process64.c
+++ b/kern/arch/x86/process64.c
@@ -341,7 +341,11 @@ void __abandon_core(void)
        struct per_cpu_info *pcpui = &per_cpu_info[core_id()];
 
        lcr3(boot_cr3);
-       vmx_clear_vmcs();
        proc_decref(pcpui->cur_proc);
        pcpui->cur_proc = 0;
 }
+
+void __clear_owning_proc(uint32_t coreid)
+{
+       vmx_clear_vmcs();
+}
diff --git a/kern/include/process.h b/kern/include/process.h
index 1ccab3b5600f..b2d7bf14562f 100644
--- a/kern/include/process.h
+++ b/kern/include/process.h
@@ -175,6 +175,7 @@ void proc_init_ctx(struct user_context *ctx, uint32_t 
vcoreid, uintptr_t entryp,
                    uintptr_t stack_top, uintptr_t tls_desc);
 void proc_secure_ctx(struct user_context *ctx);
 void __abandon_core(void);
+void __clear_owning_proc(uint32_t coreid);
 
 /* Degubbing */
 void print_allpids(void);
diff --git a/kern/src/kthread.c b/kern/src/kthread.c
index 518f37ab4095..59db13c66c78 100644
--- a/kern/src/kthread.c
+++ b/kern/src/kthread.c
@@ -154,7 +154,13 @@ void restart_kthread(struct kthread *kthread)
                        proc_decref(kthread->proc);
                        kthread->proc = 0;
                } else {
-                       /* Load our page tables before potentially decreffing 
cur_proc */
+                       /* Load our page tables before potentially decreffing 
cur_proc.
+                        *
+                        * We don't need to do an EPT flush here.  The EPT is 
flushed and
+                        * managed in sync with the VMCS.  We won't run a 
different VM (and
+                        * thus *need* a different EPT) without first removing 
the old GPC,
+                        * which ultimately will result in a flushed EPT (on 
x86, this
+                        * actually happens when we clear_owning_proc()). */
                        lcr3(kthread->proc->env_cr3);
                        /* Might have to clear out an existing current.  If 
they need to be
                         * set later (like in restartcore), it'll be done on 
demand. */
diff --git a/kern/src/process.c b/kern/src/process.c
index 43a18c0199eb..5ef5ee3a5640 100644
--- a/kern/src/process.c
+++ b/kern/src/process.c
@@ -1798,6 +1798,8 @@ void clear_owning_proc(uint32_t coreid)
 {
        struct per_cpu_info *pcpui = &per_cpu_info[coreid];
        struct proc *p = pcpui->owning_proc;
+
+       __clear_owning_proc(coreid);
        pcpui->owning_proc = 0;
        pcpui->owning_vcoreid = 0xdeadbeef;
        pcpui->cur_ctx = 0;                     /* catch bugs for now (may go 
away) */
-- 
2.11.0.483.g087da7b7c-goog

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to