Hey all,
I have been thinking about the way we currently perform virtualisation, and
I think that by changing some of the framework we can get some dramatic
speedups. This is a pretty big job which I'd like to work on, so perhaps
Kevin and I can tackle it together sometime when ring3 code runs natively
(that is, if it IS a good idea :)).
Assumptions:
- Most code will run natively on the processor in ring3 and require no
virtualisation whatsoever.
- There may be ring3 code which contains natively virtualisable instructions,
but there will generally be a very limited amount of it, if any.
- Code running in the higher rings will generally require a lot of (non-native)
virtualisation. There will generally be a very limited amount of code running
in ring0 (compared to ring3 code).
Natively virtualisable code is code that will neatly generate an exception
if we push it to ring3. This is the basis of our virtualisation scheme.
However, it fails because x86 is so braindead, so we had to introduce SBE
for ring0 code.
The basis for the optimisation is the realisation that once you have to SBE
anyway, there is absolutely no reason anymore to execute ring0 code in
ring3.
We can replace all dangerous instructions and run ring0 code in ring0, with
the monitor.
The advantage is clear: currently we replace dangerous instructions with int3.
This generates an exception on every such instruction, which means two mode
switches and a lot of state reconstruction. If we run SBE'd code in monitor
space, the int3 can be replaced by a direct call to the appropriate emulation
routine. This tight binding to the monitor should eliminate a lot of
overhead. The only modeswitches now occur when the guest generates a "virtual"
modeswitch.
Well, almost --- there may also be ring3 code which causes exceptions in plex
which it wouldn't on the real processor. For instance, ring3 code may use I/O
permission bitmaps or it may access MMIO. We also win by SBEing such code and
running it in ring0. Note that SBEing MMIO is one part of the solution to the
I/O performance problem (a much faster solution than the one used by DOSEMU).
So basically plex would have to start executing SBE'd code in ring0 when
either (1) the guest switches to supervisor mode, or (2) when a piece of ring3
code enters an I/O intensive region. The rest of the ring3 code is executed
natively in ring3 and should run at native speed, because it contains no
instructions to be virtualised. This will keep the required size of the prescan
cache down as well (I assume that almost all OSes have a small working set of
pages that need virtualisation).
And now for something completely different: the method above eliminates most
of the mode switching overhead. An even bigger overhead, however, which is a
problem only with I/O, is context switching between the monitor and the host.
Though having hardware emulation in the host (which is still a good thing, IMO)
makes context switching a necessity, I am of the opinion that it should be
possible to drastically cut down on the number of context switches.
This can be done by caching I/O operations (I/O ports or MMIO). Currently every
I/O operation incurs a context switch. However, most consecutive I/O outputs
(inputs are of course a different matter, though we can cut down here too) are
independent, and there is often no reason not to cache them up and send them to
the host as a batch. For instance,
program      right now       maybe possible
out ...      ctx switch      cache
out ...      ctx switch      cache
out ...      ctx switch      cache
out ...      ctx switch      cache
in  ...      ctx switch      ctx switch, flush I/O cache
However, some outputs and inputs are dependent on each other (or maybe on
timing). So I suggest making an interface to the monitor through which plugins
can specify which ports are cacheable. By default all outports are cached;
non-cacheable ports are marked by the plugins and incur a context switch.
There is a way of caching some inports as well, though I believe this will not
lead to as big a performance enhancement as caching outports (but we can try
it). Only some inports are cacheable. A plugin may specify a set value for an
inport, certainly if it is a status port. The value is then loaded into the
monitor. The next time the inport is read, the monitor looks in its inport
table; if a value has been set then that value is returned, instead of context
switching. However, this requires us to be careful, because inport values will
usually be dependent on outports.
For instance: a device has a status port and a command port. The status port
had a value set in the monitor during the previous context switch. The guest
writes to the command port and then reads the status port; however, the status
port would return the wrong value, because the status change command is still
pending in the outport cache. The solution is to couple outport caching to
inport caching: an inport would conditionally return a value from the cache or
context switch, depending on whether a specific outport is currently cached.
Perhaps this scheme sounds a bit too complicated to be worth it, though?
It gets sticky when there is a dependency between I/O ports and MMIO
(DMA, VGA latch). This needs to be thought about.
Tell me what you think.
-- Ramon