1) I've been making more progress and mods to plex86.
Still not at a point where it's good to release code,
since I've rethought some things. I moved the SBE
code to the monitor space. Ultimately, I want to
migrate as much as possible to the monitor space, to
eliminate context switches to/from the host. Plex86
should end up pretty fast.
2) Am taking first vacation in eons for a little over
a week. Will be back at plex86 after resting my
virtualized self thereafter.
3) Someone contacted me recently with an interesting request.
They are looking at plex86 from an Application Service Provider
(ASP) angle, and wondering if eventually we could put
a small amount of instrumentation into the VM, so they
know how many resources each VM is consuming. (Billability)
I can't think of any reason not to, since it can be compiled
out with #ifdefs, and I want to add certain instrumentation
capabilities to plex86 anyway, so I can monitor its performance
and tune it.
4) Here's some more info I wrote up, for those of you who
are interested in the VM internals.
Talk soon,
-Kevin
Handling descriptor caches vs descriptor table entries
======================================================
In order to virtualize/emulate certain instructions, we need
to know what values are contained in the segment descriptor
caches. On a real processor, once the invisible segment descriptor
cache is loaded with a descriptor, we can't look into the
cache to see what values are being used.
There are situations where values in the segment descriptor caches
differ from those in a descriptor table. In such situations,
we can not count on reading descriptor values from the guest
descriptor tables (ie the GDT) for use by our virtualization.
Following is a list of situations where such an inconsistency can occur:
- execution of the LGDT instruction
- write to a page containing a descriptor table
- write to CR3 (PDBR) register
- change of a page dir entry spanning a descriptor table
- change of a page table entry spanning a descriptor table
When we are employing SBE, this does not pose a problem. As
we have the flexibility to virtualize any instructions we want,
we can choose to virtualize instructions which load and store
the segment registers, and catch the other events listed above.
Then during the emulation of those instructions, we can maintain
the shadow cache values for each segment register.
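To make this concrete, here's a sketch (in C, with names of my own invention, not actual plex86 structures) of what a shadow of one segment register's invisible descriptor cache could look like, and how it might be filled in when the monitor emulates a segment load from a raw 8-byte descriptor:

```c
#include <stdint.h>

/* Hypothetical shadow of one segment register's invisible
 * descriptor cache; field names are illustrative only. */
typedef struct {
    uint16_t selector;   /* visible selector value       */
    uint32_t base;       /* cached base address          */
    uint32_t limit;      /* cached limit, in bytes       */
    uint8_t  ar;         /* cached access-rights byte    */
} shadow_seg_t;

/* Decode a raw 8-byte descriptor table entry into the shadow
 * cache, following the standard IA-32 descriptor layout. */
static void load_shadow(shadow_seg_t *s, uint16_t sel, const uint8_t d[8])
{
    s->selector = sel;
    s->base  = d[2] | (d[3] << 8) | (d[4] << 16) | ((uint32_t)d[7] << 24);
    s->limit = d[0] | (d[1] << 8) | ((uint32_t)(d[6] & 0x0f) << 16);
    if (d[6] & 0x80)                      /* G bit: 4K granularity */
        s->limit = (s->limit << 12) | 0xfff;
    s->ar = d[5];                         /* P/DPL/S/type byte     */
}
```

The monitor would run this whenever it emulates an instruction that loads a segment register, so the shadow always mirrors what the real hidden cache would hold.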
However, under the right circumstances, it is desirable to let guest
application code (ring3) execute natively without SBE intervention.
Of course, while executing in such a fashion, the guest code
is free to reload the segment registers at will. This means
that we have no way to know what values are in the segment
register shadow caches. Provided that the descriptor table entry
corresponding to a particular segment register has not changed
since execution without SBE intervention began, this
is not a problem. Using the segment selector, we can simply look
up equivalent values in the guest descriptor table. And more
importantly, we are able to save/restore the guest segment registers
in the monitor exception or hardware interrupt redirect handler,
merely by pushing/popping segment selectors.
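The lookup itself is mechanical. A selector is index:13 | TI:1 | RPL:2, so given a selector we can index straight into the guest descriptor table. A sketch (helper names are mine, not plex86's):

```c
#include <stdint.h>
#include <stddef.h>

/* Pull apart the fields of a segment selector. */
static unsigned sel_index(uint16_t sel) { return sel >> 3; }
static unsigned sel_ti(uint16_t sel)    { return (sel >> 2) & 1; } /* 0=GDT, 1=LDT */
static unsigned sel_rpl(uint16_t sel)   { return sel & 3; }

/* Fetch the 8-byte descriptor a GDT selector refers to, or NULL
 * for a null, LDT, or out-of-range selector. */
static const uint64_t *guest_descriptor(const uint64_t *gdt,
                                        unsigned entries, uint16_t sel)
{
    unsigned idx = sel_index(sel);
    if (sel_ti(sel) || idx == 0 || idx >= entries)
        return NULL;
    return &gdt[idx];
}
```

As long as the table entry hasn't changed underneath us, this gives the same values the processor's hidden cache holds.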
But a problem arises when we are executing code without SBE
intervention, and when there is an inconsistency between a segment
register's descriptor cache and the corresponding descriptor table entry.
Consider the following two scenarios (I'm not claiming either
represents good OS architecture):
1) Ring3 guest code invokes ring0 guest code. The ring0 code
writes to the descriptor table entry that, say, GS was loaded
from, or reloads the GDT with a copy of itself located somewhere
else. (Hey, this is just a scenario.) Then the ring0 code
performs an IRET back to the ring3 code, without reloading
GS. In this case, GS should retain the descriptor cache values
it held before the initial transition from ring 3 to 0, which
are now different from the ones in the descriptor table.
Keep in mind that the monitor is intervening in these transitions.
After each intervention, we must return from the monitor to
execution of the guest by restoring the segment registers from
selector values popped off the stack. (then issuing an IRET)
So the issue becomes: how do we return to guest execution
without SBE, with the values in the segment descriptor caches,
segment selectors, and monitor descriptor tables such that execution
operates as expected? The selectors have to be as expected
since the ring3 code may look at them. The segment descriptor
caches have to be loaded with (virtualized) values which are
different than what is in the descriptor table. We could
achieve this by manipulating the monitor descriptor tables
before the final IRET back to the guest code:
  mon_gdt[GS_index] = GS.cache
  reload GS
  mon_gdt[GS_index] = virtualize(guest_gdt[GS_index])
  ...
  IRET
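The "mon_gdt[GS_index] = GS.cache" step amounts to packing the cached base/limit/access-rights values back into a raw 8-byte descriptor. Here's a sketch of that encoding (again, illustrative C, not actual plex86 code):

```c
#include <stdint.h>

/* Pack base/limit/access-rights into an 8-byte IA-32 descriptor,
 * so cached values can be placed momentarily into the monitor GDT
 * before the segment register is reloaded.  Illustrative only. */
static uint64_t encode_descriptor(uint32_t base, uint32_t limit, uint8_t ar)
{
    uint8_t flags = 0x4;            /* D/B=1 (32-bit segment)       */
    if (limit > 0xfffff) {          /* too big: use 4K granularity  */
        limit >>= 12;
        flags |= 0x8;               /* G bit                        */
    }
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xffff);                      /* limit 15:0  */
    d |= (uint64_t)(base & 0xffffff) << 16;               /* base 23:0   */
    d |= (uint64_t)ar << 40;                              /* access byte */
    d |= (uint64_t)((flags << 4) | ((limit >> 16) & 0xf)) << 48;
    d |= (uint64_t)(base >> 24) << 56;                    /* base 31:24  */
    return d;
}
```

After the reload, the entry can be overwritten with the virtualized copy of the guest's entry, and the hidden cache keeps the old values, which is exactly the inconsistency we wanted to reproduce.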
Now the cache will be correct and the descriptor table will
have the newly changed values. But how do we keep track of
when such a segment register is reloaded from the descriptor
table (and thus no longer inconsistent) if we let the natural
segment loading facilities work without intervention? We need
to know if there exists a condition of inconsistency between
the cache and descriptor tables, for the return from the next
monitor intervention (or hardware interrupt).
A first solution would be to detect such a condition, and
add it to the list of requirements before executing
any guest ring3 code without SBE. Only when all requirements
are met further down the execution path, would ring3 code
be allowed to execute without SBE.
A second solution is to modify the sequence above so we
load the descriptor cache with an appropriate value, but
mark the descriptor table entry thereafter as inaccessible.
Accesses to the table by instructions will cause an exception,
at which point the monitor can emulate the instruction.
  mon_gdt[GS_index] = GS.cache
  reload GS
  mon_gdt[GS_index] = inaccessible
  ...
  IRET
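One plausible way to make the entry "inaccessible" is simply to clear the present (P) bit, bit 47 of the 8-byte descriptor, so that any later segment load through the entry faults into the monitor. A sketch, assuming that approach:

```c
#include <stdint.h>

/* Clear the P (present) bit so a later segment load through this
 * entry raises a fault the monitor can intercept, at which point
 * the load is emulated.  Illustrative only. */
static uint64_t mark_not_present(uint64_t desc)
{
    return desc & ~(1ULL << 47);
}

static int is_present(uint64_t desc)
{
    return (int)((desc >> 47) & 1);
}
```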
Now instructions like "MOV GS, AX" will be virtualized. Since
we can determine when there are no longer any conditions of
inconsistency between segment descriptor cache and descriptor
table entry, the monitor can re-evaluate when to return to
execution of ring3 guest code by way of the first sequence.
Thereafter, the native CPU segment register loading mechanisms will
work without intervention. It's worth pointing out that
multiple segment registers may use the same table entry.
For instance, DS/ES/FS/GS may all use the same entry. So
instructions which reload them will get virtualized as
a side effect.
Also note that, for this case, we don't have to worry about
the CS and SS segment registers. That is because the only
time we will run guest code without SBE is when we transition
to guest ring3. This always involves a guest operation like IRET
(or task switch) that reloads CS and SS anyway, so there isn't
potential for an inconsistency.
2) This one is really horrible and should never be encountered! :^)
If the guest OS exported a page with user-level write permissions
to some ring3 guest code, and that page spanned a descriptor table,
the ring3 guest code could conceivably modify the CS or SS
descriptors. Since the monitor virtualizes the page tables,
it would still receive a page fault, but would have to pass
the write through since it is what was requested. (A panic
message might be more useful) The problem would be returning
back from the monitor after this event. After such modification,
there is then no way to allow the ring3 guest code to run without
SBE. Since the monitor issues an IRET instruction to return back
to execution of the guest code, the reloading of CS and SS is taken
care of by the processor, so we can not play the tricks above.
Thus, our only option is to use SBE until the inconsistency no
longer exists.
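The page-fault intercept described above boils down to one range check: does the guest's write overlap the region covered by the guest GDT (base and limit as loaded by LGDT)? A sketch, with hypothetical names:

```c
#include <stdint.h>

/* Does a guest write of 'len' bytes at 'addr' overlap the range
 * covered by the guest GDT?  gdt_limit is the byte limit as
 * loaded by LGDT (inclusive).  If the write hits the table, the
 * monitor must fall back to SBE until the descriptor caches are
 * known to be consistent again.  Names are hypothetical. */
static int write_hits_gdt(uint32_t addr, uint32_t len,
                          uint32_t gdt_base, uint32_t gdt_limit)
{
    uint32_t gdt_end = gdt_base + gdt_limit;   /* inclusive end */
    return addr <= gdt_end && addr + len - 1 >= gdt_base;
}
```

The same check would apply to any descriptor table the monitor is shadowing, not just the GDT.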