Ramon van Handel wrote:
> 
> >Your understanding and observation is correct.  The trade-off of
> >that option is that we have to trash the whole cache for
> >a vcode page if we find the dirty bit set.
> 
> Okay, I still don't think I understood correctly.  Is Jens correct, that we
> map the scanned page in the I-cache while we map the original page in the
> D-cache ?  I didn't realise that, I thought we used split-I/D in order to
> *trap* writes to the code page.  If we do this, then there is no problem.

Currently, we load the I-cache with the linear address of the
modified (virtualized) code page.  The D-cache gets loaded with a
value (PTE.P==0) that causes any access to get trapped.  If we
loaded it with PTE.P==1 and PTE.RW==0 instead, we could let data
reads work, but if for some weird reason the processor decided to
dump the code TLB entry, it would revert to running the real,
unmodified code when it reloaded the TLB entry.  Since we
virtualize any system-like instructions, my guess is the processor
won't dump the current I-TLB entry.  Imagine the performance if
that happened in a big loop.  We could probably have an option for
people who don't care but want reads to the current code page to
happen more quickly.
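
To make that concrete, here is a rough C sketch of the split I/D
TLB load described above; monitor_pte, scanned_page_phys, invlpg()
and execute_probe() are placeholder names, not actual plex86
interfaces:

  #include <stdint.h>

  #define PTE_P  0x001  /* present */

  typedef uint32_t pte_t;

  extern pte_t *monitor_pte;          /* PTE covering the guest code page */
  extern uint32_t scanned_page_phys;  /* phys addr of the scanned copy */
  extern void invlpg(uint32_t lin_addr);
  extern void execute_probe(uint32_t lin_addr);

  void load_split_tlb(uint32_t lin_addr)
  {
      /* 1. Point the PTE at the scanned (virtualized) copy, present. */
      *monitor_pte = scanned_page_phys | PTE_P;
      invlpg(lin_addr);               /* drop any stale entry */

      /* 2. Execute something on the page (e.g. a planted RET) so the
         processor caches this translation in the I-TLB. */
      execute_probe(lin_addr);

      /* 3. Mark the PTE not-present WITHOUT flushing the TLB.
         Fetches keep hitting the cached I-TLB entry (scanned code);
         data accesses walk the tables, see P==0, and trap to the
         monitor.  The missing invlpg here is the whole trick. */
      *monitor_pte = 0;
  }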

So sure, we can detect writes to the code page.  Reads too, for
the current code page.  The issue I raised at the beginning of this
thread is what to do with writes to _other_ pages which have
virtualized code in them, since for those we have a choice of
options.
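
FWIW, the dirty-bit option quoted at the top might look something
like this; vcode_cache_invalidate() and the page_index bookkeeping
are made-up names for illustration:

  #include <stdint.h>

  #define PTE_D  0x040  /* dirty: set by the CPU on the first write */

  extern void vcode_cache_invalidate(unsigned page_index);

  /* Returns 1 if the scanned copy of this vcode page is still usable. */
  int vcode_page_still_valid(uint32_t *pte, unsigned page_index)
  {
      if (*pte & PTE_D) {
          /* The guest wrote to the page since we scanned it, so the
             scanned copy may be stale: trash it and rescan later. */
          vcode_cache_invalidate(page_index);
          *pte &= ~PTE_D;  /* rearm detection for next time */
          return 0;
      }
      return 1;
  }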


> I'm starting to doubt in the practicability of this method.  Trapping on
> page boundaries is slow... and even if we use a "mixed" case, there is no
> way we can separate out page-boundary transfers to pages in the two modes (or
> is there ?)  So we'd get an overall performance degrade.  Does this really weigh
> up to the performance gain we get, if there is a lot of data writing to the
> code page ?  Hopefully, this will be a rarity anyway.

I'm not sure I understand the questions fully.  But, FWIW, we can
load the I-cache with multiple entries if we want to view multiple
pages as a group.  We'd have to prescan the pages as a group as
well, meaning we can allow branches within the group, but not out
of it.
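
As a sketch of that prescan rule (group_base, group_pages and
virtualize_branch() are invented names, and the group is assumed
not to wrap the 32-bit space):

  #include <stdint.h>

  extern void virtualize_branch(uint32_t target);

  /* A direct branch is safe iff its target lands inside the group. */
  static int branch_stays_in_group(uint32_t target, uint32_t group_base,
                                   unsigned group_pages)
  {
      uint32_t limit = group_base + group_pages * 4096;
      return (target >= group_base) && (target < limit);
  }

  void prescan_branch(uint32_t target, uint32_t group_base,
                      unsigned group_pages)
  {
      if (!branch_stays_in_group(target, group_base, group_pages))
          virtualize_branch(target);  /* e.g. rewrite into a trap */
      /* else: leave it alone; both ends are scanned code */
  }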

System code that does naughty things is not going to execute fast
in plex86.  CPU-intensive guest code that can run without SBE
(scan-before-execute) will run at near-native speed.  Overall
system performance will land somewhere in the middle.

WRT using the TLB trick, the idea is that it allows us to use the
full extent of the requested guest data segments as-is.  The other
way is to create a virtualized code segment/EIP such that the
scanned code executes there, and let the data segments access the
normal address space.  The issue with this is where to place the
scanned code in the linear address space when the data segments
span the full 32-bit address range.  We'd have to do some segment
shortening, and then emulate accesses to the shortened-away area
when they occur.  It can get complicated figuring out where to
place this small code range, depending on whether the segments
have some weird overlapping scheme.  Though, of course, you would
want to favor the DS, since it is the segment code uses most.
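
Something like the following is what I mean by clipping a limit.
This is illustration only: it assumes seg_base < hole_base and a
byte-granular limit, and a real version would have to deal with the
4K granularity bit:

  #include <stdint.h>

  /* Clip a data segment's limit so its last addressable byte sits
     just below the hole at hole_base where the scanned code (or
     the monitor) lives.  Accesses past the clipped limit take #GP
     and get emulated. */
  uint32_t shorten_limit(uint32_t seg_base, uint32_t seg_limit,
                         uint32_t hole_base)
  {
      if (seg_base + seg_limit < hole_base)
          return seg_limit;  /* segment already ends below the hole */

      return hole_base - seg_base - 1;
  }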

Incidentally, this shortening method is also a way to push guest
ring0 code down to monitor ring1.  If you place the monitor in a
hole made by the 'shortening' of guest segments, even guest kernel
code pushed to ring1 (which can access supervisor pages in the
monitor) won't be able to touch the monitor, since the segmentation
checks will trap out first.
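
The reason that works is that on x86 the segment-limit check
happens before any page-table walk.  Schematically (illustration
only, not plex86 code):

  #include <stdint.h>

  /* Order of checks on a guest-ring-1 data access through a
     shortened segment: offsets past the clipped limit take #GP(0)
     before paging is ever consulted, so the monitor's supervisor
     pages are unreachable even from ring 1. */
  int access_allowed(uint32_t offset, uint32_t clipped_limit)
  {
      if (offset > clipped_limit)
          return 0;  /* #GP: segmentation traps out first */
      return 1;      /* within limit; paging checks come next */
  }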

-Kevin
