On 12/11/15 18:15, Gabriel Parmer wrote:
Antti,

Thanks again for the responses!  I hope we aren't absolutely boring
everyone else.

If someone on this list gets bored by useful technical discussion, they have the simple options of either not reading the mails which bore them or unsubscribing entirely.

There are a few factors here: 1. device unpredictability, 2. driver
unpredictability, and 3. workload unpredictability.  We can do nothing
about 1.  For 2., many drivers have relatively constrained control flows,
with bounded loops (explicitly bounded, or implicitly bounded by
assumptions we can make about the hardware); many drivers would satisfy
this.  There might be something about NetBSD's structure that I don't know
about that complicates this.  3. is likely the one that you're referring
to.  For a network device, can you put a bound on the arrival rate?  A lot
of work has been done on this in the past to minimize the impact of
workloads that go beyond what is expected, but in the end, you have to make
an assumption somewhere that either the interrupts are rate-limited (i.e.
by HW), or that you can disable the interrupt for the device to control the
rate (as in NAPI in Linux).
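(For reference, the NAPI-style pattern mentioned above looks roughly like
the sketch below.  This is a generic illustration with made-up names, not
actual Linux or NetBSD code.)

struct nic;                             /* hypothetical driver state */

void nic_intr_disable(struct nic *);    /* mask the device interrupt */
void nic_intr_enable(struct nic *);     /* unmask the device interrupt */
int  nic_rx_pending(struct nic *);      /* packets waiting? */
void *nic_rx_next(struct nic *);        /* dequeue the next packet */
void process_packet(void *);            /* hand a packet to the stack */
void schedule_poll(struct nic *);       /* arrange for nic_poll() to run */

#define POLL_BUDGET 64

void
nic_intr(struct nic *sc)
{

        /* in the interrupt handler, just mask and defer */
        nic_intr_disable(sc);
        schedule_poll(sc);
}

void
nic_poll(struct nic *sc)
{
        int n = 0;

        /* process at most POLL_BUDGET packets per invocation */
        while (n < POLL_BUDGET && nic_rx_pending(sc)) {
                process_packet(nic_rx_next(sc));
                n++;
        }

        if (nic_rx_pending(sc))
                schedule_poll(sc);      /* still busy: keep polling */
        else
                nic_intr_enable(sc);    /* drained: back to interrupts */
}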

It's not just the driver itself. The driver is part of the system, and a system resource the driver wants to access might be busy; determining the interrelations of the various resources is the killer. Also, a driver alone rarely does anything useful; it's the cooperation of various drivers that gets the job done.

Interrupts are not a problem. As you probably know, a rump kernel is "virtually" non-preemptible and knows nothing about interrupts, so you can mask interrupts in your underlying system exactly when/how you please (as long as your underlying system can handle it, of course). If you need something to run when there is a thread inside the rump kernel, you can define multiple virtual cores in a rump kernel, preempt the underlying scheduler, and run high-priority tasks on dedicated virtual cores. Priority inversion may still apply, but handling priority inversion is up to the underlying scheduler, not the rump kernel.
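(For the hosted case, the virtual core count is a boot-time knob:
rump_init() consults the RUMP_NCPU environment variable.  A minimal sketch
of that; how bmk/Rumprun would expose the equivalent may differ.)

#include <rump/rump.h>

#include <stdlib.h>

int
main(void)
{

        /* request four virtual cores before bootstrapping */
        setenv("RUMP_NCPU", "4", 1);
        if (rump_init() != 0)
                return 1;
        /* threads inside the rump kernel can now run on up to
         * four virtual CPUs concurrently */
        return 0;
}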

But, yea, I think the way to proceed is to experiment with some workloads, see what happens, and determine the next actions based on that.

You can't really do zero-copy without an extensive rework.  System call
driver code is generally something like this:

int
fun(int *userptr)
{
        int v;
        int error;

        /* copy the argument from user space into kernel memory */
        error = copyin(userptr, &v, sizeof(v));
        if (error)
                return error;
        do_stuff(v);
        return 0;
}


We have an efficient framework for using shared memory sanely.  For apps
that directly use our interfaces (as opposed to POSIX read/write), the
copyin is avoided.  It is not uncommon for the copyin to be performed deep
in the system call path (e.g. to amortize the cost of the network packet
checksum with the copy).  I don't have a good answer for those specific
situations.  However, for system calls of the form of your pseudocode, we
can avoid the copy.  The copyout example is similar.
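(And the copyout direction, for completeness; same caveats as the copyin
example, with compute_stuff() as a placeholder:)

int
fun_out(int *userptr)
{
        int v;

        v = compute_stuff();
        /* copy the result from kernel memory out to user space */
        return copyout(&v, userptr, sizeof(v));
}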

The contents from userptr must be in v for do_stuff() to work. Since the copyins and outs are generally sprinkled inline in the drivers at arbitrary points, you really need to go over and modify every single call path ... which is what I'd call extensive rework (not to mention error-prone).
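(To make "sprinkled inline" concrete: in NetBSD drivers the user copy is
frequently buried in a helper such as uiomove(), called from deep within
the read/write path.  A contrived sketch, with fill_buffer() standing in
for the device-specific part:)

size_t fill_buffer(dev_t, void *, size_t);      /* hypothetical HW read */

int
dev_read(dev_t dev, struct uio *uio, int flags)
{
        char buf[64];
        size_t n;

        n = fill_buffer(dev, buf, sizeof(buf));
        /* the actual user copy happens here, inside uiomove() */
        return uiomove(buf, n, uio);
}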

That said, I'm getting the feeling I'm misunderstanding what you're saying.

But as long as you can peek and poke the client's memory from the server,
you can at least eliminate the RTT, which is the big cost (I handwaved
around the subject in fig 3.37 of the book).


There are two factors that make the data copy matter.  1. As the RTT
decreases, the copy matters more.  A 600-cycle RTT is quite low; RTTs over
sockets are very slow in comparison.  2. Your measurements were for
64-byte data payloads.  For system calls similar to read/write, that is a
low value.  A real cost of copying is often in the cache lines that were
evicted due to doubling the data's footprint.  That only shows up when
you're running a real workload.  Long story short, there's a long history
of zero copy helping performance (esp. in networking).  If the syscall RTT
is high, or the amount of data copied is small, then the impact of copying
is minimal.  Remove either or both of those constraints, and the copy
becomes more prominent.
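(Rough numbers to illustrate, not measurements: a 64-byte payload is a
single cache line, which is noise next to even a 600-cycle RTT.  Copy an
8 KiB buffer instead and you touch 128 lines at the source plus another
128 at the destination; on a core with a 32 KiB L1D, i.e. 512 lines,
that's half the cache consumed by one syscall's worth of copying.)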

Agreed, but you're working on a different premise. As I indicated above, avoiding the copy entirely is not possible unless the driver knows how to handle it, so it was not something that was possible for me to consider at all.

If we change the subject from "eliminate most-to-all copyin/out" to "provide better networking performance", the discussion takes an entirely different path. It might be possible to use the page loaning codepath in the socket code to avoid the data copy, though it would not preserve socket semantics -- there is no virtual memory in a rump kernel, so you can't handle the case where the caller modifies a shared page which is queued for transport. That said, it might be worth a shot if you're willing to bend the constraints.

The work put into RK to decompose the system into different functional
chunks that can each be included or not (with considerations for
dependencies) is great.  One use of that is to make unikernels with just
the right level of functionality.  However, I think the exciting step is to
have multiple RKs in the system, perhaps each with differently configured
modules, that communicate to accomplish the system's goals.  This enables
the system to trade isolation and performance and in many ways generalizes
virtualization.

I understand that without context, this sounds like a lot of work for not
much payoff.  However, combine that with our facilities for
failing/recovering components, and harnessing parallelism around component
barriers, and it makes enough sense that I'm excited about it.

That sounds like how you'd use rump kernels to build a multiserver microkernel system. There's probably a twist or two along the way, but unless you want to go overboard with modularization, it should be pretty smooth sailing.

That said, recovery is going to be somewhat challenging. I tried it for file system servers in 2008(?), realized I couldn't hack it up in one evening, and then it went into the "maybe look some day" pile. However, for some applications it's actually trivial. For example, you can just kill, restart and reconfigure a networking stack from underneath firefox without any recovery code. Apart from current transfers dying (which they sometimes do anyway), the user will never notice a thing. But try doing the same for vanilla ssh, and the user is guaranteed to notice.

If you can come up with a patch for bmk which
1) solves your problem
2) incurs no runtime penalty for the current case
3) does not look like civet coffee before the cleanup process

I see no reason not to integrate such a patch.


It is not smart, beautiful, or interesting.  We assume limits on the number
of threads in the system, and fast access to the current thread id.  Given
this, a simple lookup in a "TLS" array by thread id yields what we want.  I
can't imagine this would pass the smell test.
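(For concreteness, the scheme presumably looks something like the sketch
below; MAX_THREADS and curr_thd_id() are illustrative names, not actual
Composite interfaces.)

#define MAX_THREADS 256                 /* assumed system-wide limit */

unsigned int curr_thd_id(void);         /* assumed fast primitive */

static void *tls_slot[MAX_THREADS];

static inline void *
tls_get(void)
{

        return tls_slot[curr_thd_id()];
}

static inline void
tls_set(void *p)
{

        tls_slot[curr_thd_id()] = p;
}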

You misunderstood. I would not integrate your version, so I don't care what it looks like. However, I'm not against integrating a compile-time indirection. IOW, the aim is to minimize you having to do patch roll-forward when you pull a newer version of Rumprun. A compile-time indirection is just a semantic patch, really ... ;)
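(Such a compile-time indirection might look roughly like this;
BMK_TLS_EXTERNAL and bmk_tls_get() are hypothetical names, not existing
bmk interfaces:)

#ifdef BMK_TLS_EXTERNAL
/* the platform supplies its own lookup, e.g. the array scheme above */
void *bmk_tls_get(void);
#else
/* default: whatever libbmk_core currently does, unchanged, so the
 * common case pays nothing at runtime */
static inline void *
bmk_tls_get(void)
{

        return bmk_tls_current();       /* placeholder for current code */
}
#endif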

Some context from our side.  Our current implementation plan is as follows:
1. Create a layer below bmk which removes the lowest-level functions, and
implements them using Composite primitives (context switch, initial memory
image allocation, interrupts).  We're testing this now, and it has been in
place for the past few months.


So more or less rumprun/platform/$x?  (not exactly, but just trying to
locate the ballpark)


Exactly.

So, essentially, apart from the context switch code, which is currently platform-independent (in libbmk_core) and would need some sort of toggle, Composite would be another platform. So instead of hw|xen in build-rr, we'd have composite|hw|xen. I guess that could be somehow pluggable, too.

3. Get PCI working in the system by implementing the rumpuser interface for
PCI.


... which reminds me that I need to separate the DMA/bus hypercalls from the PCI
components for the benefit of ARM-based systems which want to do device
access but don't have PCI.


This is the area I'm most worried about.  I haven't spent more than a
couple of hours looking into the PCI hypercall stuff.  I hope we can
provide an implementation of it, and PCI will "just work".  I have little
faith in that happening.

Can't say for sure about your case, but that's more or less what happened for me ;)

Robert Millan was able to create a working implementation for Hurd, so we now have four known working implementations of the interface (Rumprun-{hw,xen}, Linux pci-uio-generic and Hurd). It's at least a four-trick pony.
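(For reference, the interface boils down to a handful of rumpcomp_pci_*()
hypercalls.  The two config space ones are sketched below from memory, so
check rumpcomp_user.h for the authoritative prototypes; the cos_*() calls
stand in for whatever Composite primitives end up backing them.)

unsigned int cos_pci_confread(unsigned, unsigned, unsigned, int);
void cos_pci_confwrite(unsigned, unsigned, unsigned, int, unsigned int);

int
rumpcomp_pci_confread(unsigned bus, unsigned dev, unsigned fun,
        int reg, unsigned int *value)
{

        *value = cos_pci_confread(bus, dev, fun, reg);
        return 0;
}

int
rumpcomp_pci_confwrite(unsigned bus, unsigned dev, unsigned fun,
        int reg, unsigned int value)
{

        cos_pci_confwrite(bus, dev, fun, reg, value);
        return 0;
}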

Also notable: once you move below the syscall drivers, you can't support
remote clients.


...or we just need to write our own support for them (e.g. similar to the
current cookie support).  That is the current plan.  We'll need to modify
sysproxy so much to do what we need anyway (in the previous steps of our
plan), that I'm waving my hands and hoping this won't be too much of an
issue.  I'm likely missing a truck-load of issues here as we haven't looked
into it in too much detail.

Well, you need to write/maintain your own drivers too, because the copyin/copyout calls are not [guaranteed to be] there. But, I think you already hinted that you're not afraid of making non-upstreamable changes to code, so it might be a non-issue for you if you're just interested in a few cases. However, in terms of effort and complexity, I wouldn't put modifying sysproxy even in the same universe as modifying all of the drivers you need.
