There's not likely anything in the guts of vx32 that hasn't been done
before. What's new is that we managed to package it up in a way
that runs on a variety of out-of-the-box OSes, on any x86, with
neither kernel modifications nor special privileges.
That portability is key to being able to deploy interesting apps,
like 9vx.
> Here's a bigger question, now that I've read the paper and briefly
> scanned the code. Do you have some thoughts on the long term ability
> of vx32 to get close to unity performance on a system (like Plan 9)
> with a high rate of context switches between file server processes
> (you allude to this cost in the paper). It's an ideal terminal right
> now. I don't see a need to use drawterm any more.
>
> But running fossil and venti, it's got a ways to go in terms of
> performance (i.e. mk clean in /sys/src/9/pc takes ~60 seconds).
I have spent approximately no time at all measuring 9vx
performance other than the numbers in the paper, and even
those were just what it was the first time I measured, not
something I tuned for. I've been much more focused on
correctness and functionality than speed. That should be
encouraging, because there's probably a lot of room for
improvement.
Creating new processes and context switching are definitely slow.
One of the slowest parts of the kernel build for me is the line
rc ../port/mksystab > ../port/systab.h
which invokes a sam script that does
,x/SYS[A-Z0-9_]+,/ | tr A-Z a-z
which forks and execs tr many times. There are three potential
sources of slowdown that I can think of right now:
* context switches, which involve a lot of mmap/munmap
and trigger potentially many page faults
* floating point: apps that use floating point are probably flushing
the vx32 translation cache more often than they need to.
* the kprocdev framework. all i/o into devip, devfs, and devdraw
is marshalled and handed off to a kproc running in a different
pthread, so that blocking i/o won't block the cpu0 pthread,
which is the only one that can run vx32. this means that
all i/o gets copied one extra time inside the kernel.
All of these could be improved, but you'd have to profile 9vx to
figure out where the time is going first.
You can reduce the effect of context switches by maintaining more
than one user address space and by keeping track of which processes
have address spaces that differ only in their stack segment.
You can rework the way vx32 signals "no floating point" exceptions
so that they don't require flushing the translation cache.
I/o done into a kernel buffer, as in demand paging, doesn't need to
be copied during the kprocdev switch, but currently it is.
If the i/o doesn't span a page boundary and an appropriate fault-free
physical mapping is known, kprocdev could use the kernel's physical
mapping of that page instead of doing a copy.
But again, you'd have to profile to figure out which of these is
worth doing, if any. I don't have any sense of where the time is
going. User-level profilers are going to be difficult to use, because
9vx wants to handle SIGVTALRM itself. You should be able to do
pretty well with oprofile on Linux, or maybe dtrace on OS X.
That would also have the benefit of telling you how much kernel
effort 9vx is inducing.
I hope that people will do this. I have very little time to put into
this for the rest of the summer, but I'm always happy to explain
things and process patches.
> At this point, the fastest virtualization system for kernel mk on my
> x60 is still xen, at 12 seconds. I had expected lguest to beat that,
> but it never has. There are claims that kvm is running at close to
> unity, but that's probably for linux -- I have not tested kvm lately
> with plan 9.
Notice that vx32 itself is not on my list above. I think that
there are plenty of things 9vx is doing inefficiently that
dwarf the potential 1.8x performance hit in raw x86 execution
speed. Also, inner loops tend to run close to 1.0 already.
Once everyone's x86 processors have hardware support
for virtualization and the operating systems allow arbitrary
user code to get at it, maybe it would make sense to let vx32
take advantage of that instead, but right now it's not a
priority for me.
> Also, opteron. lguest on opteron should be ready soonish. But vx32 is
> still a highly desirable alternative. Do you have thoughts on how to
> sandbox on opteron, where you don't have the segment registers? Could
> you use mctl to sandbox and then filter mmap system calls from the
> sandboxed code to make sure the sandbox can not be escaped?
Vx32 already runs on x86-64 hosts, but it can only run
x86-32 code. I don't see any reasonable way around that
limitation right now, but it also doesn't bother me.
Maybe in a few years kvm would be an answer.
Right now, you can build 9vx with -m32 and get a binary
that will work on x86-64, assuming you already have a
32-bit libX11 or you give up graphics. Plan 9, like many systems,
assumes that kernel pointers and user pointers are the same size.
A native x86-64 version of 9vx that ran 32-bit x86 user code
would be possible, but you'd have to remove that assumption
from the code. That probably wouldn't be as bad as it sounds:
I removed the assumption that user 0 = kernel 0 already, and it
only took a day. Also, while doing that I made sure that the
kernel never has a C pointer holding a user address (always a
ulong instead), and all the translations between kernel and user
pointers now mention either uzero or uvalidaddr. So they should
be easy to find.
Again I hope that people will do this, but it won't be me
any time soon.
Russ