1) I've been making more progress and mods to plex86.
   Still not at a point where it's good to release code,
   since I've rethought some things.  I moved the SBE
   code to the monitor space.  Ultimately, I want to
   migrate as much as possible to the monitor space, to
   eliminate context switches to/from the host.  Plex86
   should end up pretty fast.

2) Am taking first vacation in eons for a little over
   a week.  Will be back at plex86 after resting my
   virtualized self thereafter.

3) Someone contacted me recently with an interesting request.
   They are looking at plex86 from an Application Service Provider
   (ASP) angle, and wondering if eventually we could put in
   a small amount of instrumentation in the VM, so they
   know how many resources each VM is consuming.  (Billability)
   I can't think of any reason not to, since it can be compiled
   out with #ifdefs, and I want to add certain instrumentation
   capabilities into plex86 so I can monitor its performance
   to tweak it out anyways.

4) Here's some more info I wrote up, for those of you who
   are interested in the VM internals.

Talk soon,
-Kevin


Handling descriptor caches vs descriptor table entries
======================================================

In order to virtualize/emulate certain instructions, we need
to know what values are contained in the segment descriptor
caches.  On a real processor, once the invisible segment descriptor
cache is loaded with a descriptor, we can't look into the
cache to see what values are being used.

There are situations where values in the segment descriptor caches
differ from those in a descriptor table.  In such situations,
we cannot count on reading descriptor values from the guest
descriptor tables (i.e. the GDT) for use by our virtualization.
Following is a list of situations where such an inconsistency can occur:

  - execution of the LGDT instruction
  - write to a page containing a descriptor table
  - write to CR3 (PDBR) register
  - change of a page dir entry spanning a descriptor table
  - change of a page table entry spanning a descriptor table

When we are employing SBE, this does not pose a problem.  As
we have the flexibility to virtualize any instructions we want,
we can choose to virtualize instructions which load and store
the segment registers, and catch the other events listed above.
Then during the emulation of those instructions, we can maintain
the shadow cache values for each segment register.
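
To make the bookkeeping concrete, here is a minimal sketch in C of
shadow cache tracking under SBE.  It is not plex86 code: every name
in it (shadow_seg_t, shadow_load, shadow_entry_written, and so on) is
made up for illustration, only the GDT is modeled, and table-wide
events such as LGDT or a CR3 write are handled conservatively by
marking every register as possibly out of sync.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register numbering, for illustration only. */
enum { SEG_ES, SEG_CS, SEG_SS, SEG_DS, SEG_FS, SEG_GS, SEG_COUNT };

/* Shadow copy of one segment register's hidden descriptor cache. */
typedef struct {
    uint16_t selector;
    uint32_t base;
    uint32_t limit;
    bool     in_sync;   /* true while the cache matches the table entry */
} shadow_seg_t;

typedef struct {
    shadow_seg_t seg[SEG_COUNT];
} shadow_cpu_t;

/* Called while emulating a segment register load: the cache has just
 * been filled from the table, so the two agree by construction. */
static void shadow_load(shadow_cpu_t *cpu, int reg, uint16_t selector,
                        uint32_t base, uint32_t limit)
{
    assert(reg >= 0 && reg < SEG_COUNT);
    cpu->seg[reg].selector = selector;
    cpu->seg[reg].base     = base;
    cpu->seg[reg].limit    = limit;
    cpu->seg[reg].in_sync  = true;
}

/* Called on a caught write to GDT entry `index`: every register that
 * was loaded from that entry may now differ from the table. */
static void shadow_entry_written(shadow_cpu_t *cpu, uint16_t index)
{
    for (int r = 0; r < SEG_COUNT; r++) {
        uint16_t sel = cpu->seg[r].selector;
        bool from_gdt = (sel & 0x4) == 0;        /* TI bit clear */
        if (from_gdt && (sel >> 3) == index)
            cpu->seg[r].in_sync = false;
    }
}

/* Called on LGDT, a CR3 write, or a page dir/table change spanning
 * the table: conservatively treat every register as out of sync. */
static void shadow_table_changed(shadow_cpu_t *cpu)
{
    for (int r = 0; r < SEG_COUNT; r++)
        cpu->seg[r].in_sync = false;
}

/* True when no cache/table inconsistency is known to exist. */
static bool shadow_all_in_sync(const shadow_cpu_t *cpu)
{
    for (int r = 0; r < SEG_COUNT; r++)
        if (!cpu->seg[r].in_sync)
            return false;
    return true;
}
```

A monitor built this way could consult shadow_all_in_sync() as one
input when deciding whether a segment register needs special care.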

However, under the right circumstances it is desirable to let guest
application code (ring3) execute natively without SBE intervention.
Of course, while executing in such a fashion, the guest code
is free to reload the segment registers at will.  This means
that we have no way to track what values are in the segment
register shadow caches.  Provided that the descriptor table entry
corresponding to a particular segment register has not changed
since execution of the code without SBE intervention began, this
is not a problem.  Using the segment selector, we can simply look
up equivalent values in the guest descriptor table.  And more
importantly, we are able to save/restore the guest segment registers
in the monitor exception or hardware interrupt redirect handler,
merely by pushing/popping segment selectors.
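
For reference, the lookup by selector can be sketched as below.  The
bit layout is the standard IA-32 one (selector index in bits 3..15,
TI in bit 2, and the usual 8-byte descriptor format), but the type
and function names are made up for this example, the table is
assumed to be already readable as an array of 64-bit entries, and
LDT selectors are simply rejected rather than handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of a descriptor (hypothetical type for this sketch). */
typedef struct {
    uint32_t base;
    uint32_t limit;    /* effective limit in bytes, granularity applied */
    uint8_t  access;   /* P, DPL, S and type bits */
} seg_desc_t;

/* Decode one raw 8-byte descriptor, read as a little-endian 64-bit value. */
static seg_desc_t decode_descriptor(uint64_t raw)
{
    seg_desc_t d;
    uint32_t limit = (uint32_t)(raw & 0xFFFF)             /* limit 15:0  */
                   | (uint32_t)((raw >> 32) & 0xF0000);   /* limit 19:16 */

    d.base   = (uint32_t)((raw >> 16) & 0xFFFFFF)         /* base 23:0   */
             | (uint32_t)((raw >> 32) & 0xFF000000u);     /* base 31:24  */
    d.access = (uint8_t)((raw >> 40) & 0xFF);

    if ((raw >> 55) & 1)                   /* G bit: limit in 4K units */
        limit = (limit << 12) | 0xFFF;
    d.limit = limit;
    return d;
}

/* Look up the descriptor named by `selector` in a guest GDT image.
 * Returns false for LDT selectors (TI set) and out-of-range indexes. */
static bool lookup_descriptor(const uint64_t *gdt, size_t entries,
                              uint16_t selector, seg_desc_t *out)
{
    size_t index = selector >> 3;          /* strip RPL and TI bits */

    assert(out != NULL);
    if ((selector & 0x4) || index >= entries)
        return false;
    *out = decode_descriptor(gdt[index]);
    return true;
}
```

Note this only yields the right answer while the table entry is
unchanged since the register was loaded, which is exactly the
condition stated above.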

But, a problem arises when we are executing code without SBE
intervention, and when there is an inconsistency between a segment
register's descriptor cache and the corresponding descriptor table entry.
Consider the following two scenarios (I'm not claiming either
represents good OS architecture):

  1) Ring3 guest code invokes ring0 guest code.  The ring0 code
     writes to the descriptor table entry that, say, GS was loaded
     from, or reloads the GDT with a copy of itself located somewhere
     else.  (Hey, this is just a scenario)  Then the ring0 code
     performs an IRET back to the ring3 code, without reloading
     GS.  In this case, GS should retain the descriptor cache values
     it held before the initial transition from ring 3 to 0, which
     are now different from the ones in the descriptor table.

     Keep in mind that the monitor is intervening in these transitions.
     After each intervention, we must return from the monitor to
     execution of the guest by restoring the segment registers from
     selector values popped off the stack.  (then issuing an IRET)
     So the issue becomes how to return to guest execution
     without SBE such that the segment descriptor caches,
     segment selectors and monitor descriptor tables all hold the
     values the guest expects.  The selectors have to be as expected
     since the ring3 code may look at them.  The segment descriptor
     caches have to be loaded with (virtualized) values which are
     different from what is in the descriptor table.  We could
     achieve this by manipulating the monitor descriptor tables
     before the final IRET back to the guest code:

       mon_gdt[GS_index] = GS.cache
       reload GS
       mon_gdt[GS_index] = virtualize(guest_gdt[GS_index])
       ...
       IRET

     Now the cache will be correct and the descriptor table will
     have the newly changed values.  But how do we keep track of
     when such a segment register is reloaded from the descriptor
     table (and thus not inconsistent anymore) if we let the natural
     segment loading facilities work without intervention?  We need
     to know if there exists a condition of inconsistency between
     the cache and descriptor tables, for the return from the next
     monitor intervention (or hardware interrupt).

     A first solution would be to detect such a condition, and
     add it to the list of requirements before executing
     any guest ring3 code without SBE.  Only when all requirements
     are met further down the execution path, would ring3 code
     be allowed to execute without SBE.

     A second solution is to modify the sequence above so we
     load the descriptor cache with an appropriate value, but
     mark the descriptor table entry thereafter as inaccessible.
     Accesses to the table by instructions will cause an exception,
     at which point the monitor can emulate the instruction.

       mon_gdt[GS_index] = GS.cache
       reload GS
       mon_gdt[GS_index] = inaccessible
       ...
       IRET

     Now instructions like "MOV GS, AX" will be virtualized.  Since
     we can determine when there are no longer any conditions of
     inconsistency between segment descriptor cache and descriptor
     table entry, the monitor can re-evaluate when to return to
     execution of ring3 guest code by way of the first sequence.
     Thereafter, the native CPU segment register loading mechanisms will
     work without intervention.  It's worth pointing out that
     multiple segment registers may use the same table entry.
     For instance, DS/ES/FS/GS may all use the same entry.  So
     instructions which reload any of them will get virtualized as
     a side effect.

     Also note that, for this case, we don't have to worry about
     the CS and SS segment registers.  That is because the only
     time we will run guest code without SBE is when we transition
     to guest ring3.  This always involves a guest operation like IRET
     (or task switch) that reloads CS and SS anyways, so there isn't
     potential for an inconsistency.

  2) This one is really horrible and should never be encountered! :^)
     If the guest OS exported a page which spanned a descriptor table
     to some ring3 guest code with user-level write permissions,
     the ring3 guest code could conceivably modify the CS or SS
     descriptors.  Since the monitor virtualizes the page tables,
     it would still receive a page fault, but would have to pass
     the write through since it is what was requested.  (A panic
     message might be more useful)  The problem would be returning
     back from the monitor after this event.  After such modification,
     there is then no way to allow the ring3 guest code to run without
     SBE.  Since the monitor issues an IRET instruction to return
     to execution of the guest code, the reloading of CS and SS is taken
     care of by the processor, so we cannot play the tricks above.
     Thus, our only option is to use SBE until the inconsistency no
     longer exists.
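
As a sanity check of the second solution in scenario 1, here is a toy
C model of the sequence.  It only simulates the data flow; it is not
monitor code.  All names are invented, "inaccessible" is modeled as a
cleared present bit (one plausible way to make a segment load fault,
and an assumption on my part), virtualize() is modeled as a plain
copy, and only GS and the one faulting table entry are tracked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GDT_ENTRIES 8

typedef struct {
    uint32_t base;
    uint32_t limit;
    bool     present;
} desc_t;

typedef struct {
    desc_t   guest_gdt[GDT_ENTRIES]; /* what the guest OS wrote         */
    desc_t   mon_gdt[GDT_ENTRIES];   /* what the real CPU actually uses */
    desc_t   gs_cache;               /* models the hidden GS cache      */
    uint16_t gs_selector;
    bool     gs_stale;               /* cache differs from guest table  */
} vm_model_t;

/* Models a native "MOV GS, reg": returns false where the real CPU
 * would fault because the entry is marked not present. */
static bool native_load_gs(vm_model_t *vm, uint16_t selector)
{
    unsigned idx = selector >> 3;
    if (idx >= GDT_ENTRIES || !vm->mon_gdt[idx].present)
        return false;
    vm->gs_cache    = vm->mon_gdt[idx];
    vm->gs_selector = selector;
    return true;
}

/* Second sequence: prime the cache with the old values, then make the
 * table entry inaccessible before returning to the guest. */
static void resume_guest_with_stale_gs(vm_model_t *vm, desc_t saved_cache)
{
    unsigned idx = vm->gs_selector >> 3;
    assert(idx < GDT_ENTRIES && saved_cache.present);
    vm->mon_gdt[idx] = saved_cache;          /* mon_gdt[GS_index] = GS.cache */
    native_load_gs(vm, vm->gs_selector);     /* reload GS                    */
    vm->mon_gdt[idx].present = false;        /* entry now inaccessible       */
    vm->gs_stale = true;
}

/* Monitor fault handler: emulate the guest's GS load from the guest
 * table, after which cache and table agree again. */
static bool emulate_load_gs(vm_model_t *vm, uint16_t selector)
{
    unsigned idx = selector >> 3;
    if (idx >= GDT_ENTRIES || !vm->guest_gdt[idx].present)
        return false;                        /* would be reflected to guest */
    vm->mon_gdt[idx] = vm->guest_gdt[idx];   /* virtualize() as a plain copy */
    native_load_gs(vm, selector);
    vm->gs_stale = false;
    return true;
}
```

Once emulate_load_gs() has cleared gs_stale, a later return to the
guest can go back to the first sequence and native loads of that
entry succeed without faulting.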
