On 12/11/15 18:15, Gabriel Parmer wrote:
Antti,

Thanks again for the responses!  I hope we aren't absolutely boring
everyone else.

If someone on this list gets bored by useful technical discussion, they have the simple options of either not reading the mails which bore them or unsubscribing entirely.

There are a few factors here: 1. device unpredictability, 2. driver
unpredictability, and 3. workload unpredictability.  We can do nothing
about 1.  For 2., many drivers have relatively constrained control flows,
with bounded loops (explicitly bounded, or implicitly bounded by
assumptions we can make about the hardware); many drivers would satisfy
this.  There might be something about NetBSD's structure that I don't know
about that complicates this.  3. is likely the one that you're referring
to.  For a network device, can you put a bound on the arrival rate?  A lot
of work has been done on this in the past to minimize the impact of
workloads that go beyond what is expected, but in the end, you have to make
an assumption somewhere that either the interrupts are rate-limited (i.e.
by HW), or that you can disable the interrupt for the device to control the
rate (as in NAPI in Linux).
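(For reference, the NAPI-style pattern mentioned above looks roughly like
the sketch below.  This is a generic illustration with made-up names, not
actual Linux or NetBSD code.)

struct nic;                             /* hypothetical driver state */

void nic_intr_disable(struct nic *);    /* mask the device interrupt */
void nic_intr_enable(struct nic *);     /* unmask the device interrupt */
int  nic_rx_pending(struct nic *);      /* packets waiting? */
void *nic_rx_next(struct nic *);        /* dequeue the next packet */
void process_packet(void *);            /* hand a packet to the stack */
void schedule_poll(struct nic *);       /* arrange for nic_poll() to run */

#define POLL_BUDGET 64

void
nic_intr(struct nic *sc)
{

        /* in the interrupt handler, just mask and defer */
        nic_intr_disable(sc);
        schedule_poll(sc);
}

void
nic_poll(struct nic *sc)
{
        int n = 0;

        /* process at most POLL_BUDGET packets per invocation */
        while (n < POLL_BUDGET && nic_rx_pending(sc)) {
                process_packet(nic_rx_next(sc));
                n++;
        }

        if (nic_rx_pending(sc))
                schedule_poll(sc);      /* still busy: keep polling */
        else
                nic_intr_enable(sc);    /* drained: back to interrupts */
}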

It's not just the driver itself. The driver is part of the system, and a system resource the driver wants to access might be busy; determining the interrelations of the various resources is the killer. Also, a driver alone rarely does anything useful; it's the cooperation of various drivers that gets the job done.

Interrupts are not a problem. As you probably know, a rump kernel is "virtually" non-preemptible and knows nothing about interrupts, so you can mask interrupts in your underlying system exactly when/how you please (as long as your underlying system can handle it, of course). If you need something to run when there is a thread inside the rump kernel, you can define multiple virtual cores in a rump kernel, preempt the underlying scheduler, and run high-priority tasks on dedicated virtual cores. Priority inversion may still apply, but handling priority inversion is up to the underlying scheduler, not the rump kernel.
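(For the hosted case, the virtual core count is a boot-time knob:
rump_init() consults the RUMP_NCPU environment variable.  A minimal sketch
of that; how bmk/Rumprun would expose the equivalent may differ.)

#include <rump/rump.h>

#include <stdlib.h>

int
main(void)
{

        /* request four virtual cores before bootstrapping */
        setenv("RUMP_NCPU", "4", 1);
        if (rump_init() != 0)
                return 1;
        /* threads inside the rump kernel can now run on up to
         * four virtual CPUs concurrently */
        return 0;
}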

But, yea, I think the way to proceed is to experiment with some workloads, see what happens, and determine the next actions based on that.

You can't really do zero-copy without an extensive rework.  System call
driver code is generally something like this:

int
fun(int *userptr)
{
        int v;
        int error;

        /* copy the argument from user space into kernel memory */
        error = copyin(userptr, &v, sizeof(v));
        if (error)
                return error;
        do_stuff(v);
        return 0;
}


We have an efficient framework for using shared memory sanely.  For apps
that directly use our interfaces (as opposed to POSIX read/write), the
copyin is avoided.  It is not uncommon for the copyin to be performed deep
in the system call path (e.g. to amortize the cost of the network packet
checksum with the copy).  I don't have a good answer for those specific
situations.  However, for system calls of the form of your pseudocode, we
can avoid the copy.  The copyout example is similar.
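(And the copyout direction, for completeness; same caveats as the copyin
example, with compute_stuff() as a placeholder:)

int
fun_out(int *userptr)
{
        int v;

        v = compute_stuff();
        /* copy the result from kernel memory out to user space */
        return copyout(&v, userptr, sizeof(v));
}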

The contents from userptr must be in v for do_stuff() to work. Since the copyins and outs are generally sprinkled inline in the drivers at arbitrary points, you really need to go over and modify every single call path ... which is what I'd call extensive rework (not to mention error-prone).
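(To make "sprinkled inline" concrete: in NetBSD drivers the user copy is
frequently buried in a helper such as uiomove(), called from deep within
the read/write path.  A contrived sketch, with fill_buffer() standing in
for the device-specific part:)

size_t fill_buffer(dev_t, void *, size_t);      /* hypothetical HW read */

int
dev_read(dev_t dev, struct uio *uio, int flags)
{
        char buf[64];
        size_t n;

        n = fill_buffer(dev, buf, sizeof(buf));
        /* the actual user copy happens here, inside uiomove() */
        return uiomove(buf, n, uio);
}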

That said, I'm getting the feeling I'm misunderstanding what you're saying.

But as long as you can peek and poke the client's memory from the server,
you can at least eliminate the RTT, which is the big cost (I handwaved
around the subject in fig 3.37 of the book).


There are two factors that make the data copy matter.  1. As the RTT
decreases, the copy matters more.  A 600-cycle RTT is quite low; RTTs over
sockets are very slow in comparison.  2. Your measurements were for
64-byte data payloads.  For system calls similar to read/write, that is a
low value.  A real cost of copying is often in the cache lines that were
evicted due to doubling the data's footprint.  That only shows up when
you're running a real workload.  Long story short, there's a long history
of zero copy helping performance (esp. in networking).  If the syscall RTT
is high, or the amount of data copied is small, then the impact of copying
is minimal.  Remove either or both of those constraints, and the copy
becomes more prominent.
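(Rough numbers to illustrate, not measurements: a 64-byte payload is a
single cache line, which is noise next to even a 600-cycle RTT.  Copy an
8 KiB buffer instead and you touch 128 lines at the source plus another
128 at the destination; on a core with a 32 KiB L1D, i.e. 512 lines,
that's half the cache consumed by one syscall's worth of copying.)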

Agreed, but you're working on a different premise. As I indicated above, avoiding the copy entirely is not possible unless the driver knows how to handle it, so it was not something that was possible for me to consider at all.

If we change the subject from "eliminate most-to-all copyin/out" to "provide better networking performance", the discussion takes an entirely different path. It might be possible to use the page loaning codepath in the socket code to avoid the data copy, though it would not preserve socket semantics -- there is no virtual memory in a rump kernel, so you can't handle the case where the caller modifies a shared page which is queued for transport. That said, it might be worth a shot if you're willing to bend the constraints.

The work put into RK to decompose the system into different functional
chunks that can each be included or not (with considerations for
dependencies) is great.  One use of that is to make unikernels with just
the right level of functionality.  However, I think the exciting step is to
have multiple RKs in the system, perhaps each with differently configured
modules, that communicate to accomplish the system's goals.  This enables
the system to trade isolation and performance and in many ways generalizes
virtualization.

I understand that without context, this sounds like a lot of work for not
much payoff.  However, combine that with our facilities for
failing/recovering components, and harnessing parallelism around component
barriers, and it makes enough sense that I'm excited about it.

That sounds like how you'd use rump kernels to build a multiserver microkernel system. There's probably a twist or two along the way, but unless you want to go overboard with modularization, it should be pretty smooth sailing.

That said, recovery is going to be somewhat challenging. I tried it for file system servers in 2008(?), realized I couldn't hack it up in one evening, and then it went into the "maybe look some day" pile. However, for some applications it's actually trivial. For example, you can just kill, restart and reconfigure a networking stack from underneath firefox without any recovery code. Apart from current transfers dying (which they sometimes do anyway), the user will never notice a thing. But try doing the same for vanilla ssh, and the user is guaranteed to notice.

If you can come up with a patch for bmk which
1) solves your problem
2) incurs no runtime penalty for the current case
3) does not look like civet coffee before the cleanup process

I see no reason not to integrate such a patch.


It is not smart, beautiful, or interesting.  We assume limits on the number
of threads in the system, and fast access to the current thread id.  Given
this, a simple lookup in a "TLS" array by thread id yields what we want.  I
can't imagine this would pass the smell test.
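(For concreteness, the scheme presumably looks something like the sketch
below; MAX_THREADS and curr_thd_id() are illustrative names, not actual
Composite interfaces.)

#define MAX_THREADS 256                 /* assumed system-wide limit */

unsigned int curr_thd_id(void);         /* assumed fast primitive */

static void *tls_slot[MAX_THREADS];

static inline void *
tls_get(void)
{

        return tls_slot[curr_thd_id()];
}

static inline void
tls_set(void *p)
{

        tls_slot[curr_thd_id()] = p;
}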

You misunderstood. I would not integrate your version, so I don't care what it looks like. However, I'm not against integrating a compile-time indirection. IOW, the aim is to minimize you having to do patch roll-forward when you pull a newer version of Rumprun. A compile-time indirection is just a semantic patch, really ... ;)
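(Such a compile-time indirection might look roughly like this;
BMK_TLS_EXTERNAL and bmk_tls_get() are hypothetical names, not existing
bmk interfaces:)

#ifdef BMK_TLS_EXTERNAL
/* the platform supplies its own lookup, e.g. the array scheme above */
void *bmk_tls_get(void);
#else
/* default: whatever libbmk_core currently does, unchanged, so the
 * common case pays nothing at runtime */
static inline void *
bmk_tls_get(void)
{

        return bmk_tls_current();       /* placeholder for current code */
}
#endif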

Some context from our side.  Our current implementation plan is as follows:
1. Create a layer below bmk which removes the lowest-level functions, and
implements them using Composite primitives (context switch, initial memory
image allocation, interrupts).  We're testing this now, and it has been in
place for the past few months.


So more or less rumprun/platform/$x?  (not exactly, but just trying to
locate the ballpark)


Exactly.

So, essentially, apart from the context switch code, which is currently platform-independent (in libbmk_core) and would need some sort of toggle, Composite would be another platform. So instead of hw|xen in build-rr, we'd have composite|hw|xen. I guess that could be somehow pluggable, too.

3. Get PCI working in the system by implementing the rumpuser interface for
PCI.


... which reminds me that I need to separate the DMA/bus hypercalls from the PCI
components for the benefit of ARM-based systems which want to do device
access but don't have PCI.


This is the area I'm most worried about.  I haven't spent more than a
couple of hours looking into the PCI hypercall stuff.  I hope we can
provide an implementation of it, and PCI will "just work".  I have little
faith in that happening.

Can't say for sure about your case, but that's more or less what happened for me ;)

Robert Millan was able to create a working implementation for Hurd, so we now have four known working implementations of the interface (Rumprun-{hw,xen}, Linux pci-uio-generic and Hurd). It's at least a four-trick pony.
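(For reference, the interface boils down to a handful of rumpcomp_pci_*()
hypercalls.  The two config space ones are sketched below from memory, so
check rumpcomp_user.h for the authoritative prototypes; the cos_*() calls
stand in for whatever Composite primitives end up backing them.)

unsigned int cos_pci_confread(unsigned, unsigned, unsigned, int);
void cos_pci_confwrite(unsigned, unsigned, unsigned, int, unsigned int);

int
rumpcomp_pci_confread(unsigned bus, unsigned dev, unsigned fun,
        int reg, unsigned int *value)
{

        *value = cos_pci_confread(bus, dev, fun, reg);
        return 0;
}

int
rumpcomp_pci_confwrite(unsigned bus, unsigned dev, unsigned fun,
        int reg, unsigned int value)
{

        cos_pci_confwrite(bus, dev, fun, reg, value);
        return 0;
}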

Also notable: once you move below the syscall drivers, you can't support
remote clients.


...or we just need to write our own support for them (e.g. similar to the
current cookie support).  That is the current plan.  We'll need to modify
sysproxy so much to do what we need anyway (in the previous steps of our
plan), that I'm waving my hands and hoping this won't be too much of an
issue.  I'm likely missing a truck-load of issues here as we haven't looked
into it in too much detail.

Well, you need to write/maintain your own drivers too, because the copyin/copyout calls are not [guaranteed to be] there. But, I think you already hinted that you're not afraid of making non-upstreamable changes to code, so it might be a non-issue for you if you're just interested in a few cases. However, in terms of effort and complexity, I wouldn't put modifying sysproxy even in the same universe as modifying all of the drivers you need.
