Antti,

Thanks again for the responses!  I hope we aren't absolutely boring
everyone else.

> I don't think you can say anything conclusive about even just the device
> drivers.  It's of course easy to see what happens for some set of events,
> and even make it deterministic, but that, as I'm sure you know, won't tell
> you anything about the situation when various substances hit the fan.


There are a few factors here: 1. device unpredictability, 2. driver
unpredictability, and 3. workload unpredictability.  We can do nothing
about 1.  For 2., many drivers have relatively constrained control
flows: if the loops are bounded (explicitly, or implicitly by
assumptions we can make about the hardware), execution is bounded as
well, and many drivers would satisfy this.  There might be something
about NetBSD's structure that I don't know about that complicates
this.  3. is likely the one you're referring to.  For a network
device, can you put a bound on the arrival rate?  A lot of work has
been done in the past to minimize the impact of workloads that go
beyond what is expected, but in the end you have to assume somewhere
that either the interrupts are rate-limited (e.g. by HW), or that you
can disable the device's interrupt to control the rate (as in NAPI in
Linux; see the sketch below).
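
To make that assumption concrete, here is roughly the discipline I
have in mind; this is only a sketch, and the dev_* and schedule_poll
names are hypothetical, not NetBSD or rumprun code:

/* Mask the device's interrupt in the handler, then drain packets
 * from a bounded polling loop, NAPI-style. */
void
nic_intr(struct nic *dev)
{
        dev_mask_intr(dev);           /* no more interrupts until unmasked */
        schedule_poll(dev);           /* defer work to a schedulable context */
}

void
nic_poll(struct nic *dev)
{
        int budget = 64;              /* explicit bound on work per poll */

        while (budget-- > 0 && dev_rx_pending(dev))
                process_packet(dev_rx_next(dev));

        if (dev_rx_pending(dev))
                schedule_poll(dev);   /* still busy: keep polling */
        else
                dev_unmask_intr(dev); /* quiescent: back to interrupts */
}

This bounds the interrupt rate regardless of the arrival rate; what
remains is queueing, which the scheduler can at least account for.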

Don't take my statements as saying "this is easy."  It is hard, and we
likely won't be able to make aviation-level guarantees.  I do wonder
how strong a statement we can make about device execution.


> You can't really do zero-copy without an extensive rework.  System call
> driver code is generally something like this:
>
> fun(int *userptr)
> {
>         int v;
>
>         copyin(userptr, &v, sizeof(v));
>         do_stuff(v);
> }
>

We have an efficient framework for using shared memory sanely.  For
apps that directly use our interfaces (as opposed to POSIX
read/write), the copyin is avoided.  It is not uncommon for the copyin
to be performed deep in the system call path (e.g. to amortize the
cost of the network packet checksum with the copy), and I don't have a
good answer for those specific situations.  However, for system calls
of the form of your pseudocode, we can avoid the copy (a sketch
follows).  The copyout case is similar.
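
For concreteness, a minimal sketch of the shared-memory variant of
your example; shm_validate() and the shared region are placeholders
for our actual mechanisms:

int
fun_shared(int *sharedptr)
{
        int v;

        /* The pointer must lie within the region the client shares
         * with us; reading once into a local also avoids double-fetch
         * races on memory the client can still write. */
        if (!shm_validate(sharedptr, sizeof(*sharedptr)))
                return -1;
        v = *sharedptr;               /* in place: no copyin, no copy */
        do_stuff(v);
        return 0;
}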


> But as long as you can peek and poke the client's memory from the server,
> you can at least eliminate the RTT, which is the big cost. (handwaved
> around the subject in fig 3.37 of book)


There are two factors that make the data copy matter.  1. As the RTT
decreases, the copy matters more; a 600-cycle RTT is quite low, and
RTTs over sockets are very slow in comparison.  2. Your measurements
were for 64-byte data payloads, which is a low value for system calls
similar to read/write.  A real cost of copying is often in the cache
lines evicted by doubling the data's footprint (e.g. copying a 64KiB
buffer touches 2048 64-byte lines rather than 1024), and that only
shows up when you're running a real workload.  Long story short,
there's a long history of zero copy helping performance (esp. in
networking).  If the syscall RTT is high, or the amount of data copied
is small, then the impact of copying is minimal.  Remove either or
both of those constraints, and the copy becomes more prominent.

Only time and code will tell how big an impact this zero copy will have.
But results in Composite show that especially when you start using
fine-grained components with the corresponding increase in IPC/copy
frequency, these things start to matter.  I'll be happy to be wrong if it
doesn't end up that way with RK ;-)


>>> I like "4".
>>
>> Me too.  I see breaking the RK functionality into separate, isolated
>> components as a logical conclusion.
>
> Can you expand on that thought?


The work put into RK to decompose the system into different functional
chunks that can each be included or not (with considerations for
dependencies) is great.  One use of that is to make unikernels with
just the right level of functionality.  However, I think the exciting
step is to have multiple RKs in the system, perhaps each with
differently configured modules, that communicate to accomplish the
system's goals.  This enables the system to trade off isolation
against performance, and in many ways it generalizes virtualization.

I understand that without context, this sounds like a lot of work for not
much payoff.  However, combine that with our facilities for
failing/recovering components, and harnessing parallelism around component
barriers, and it makes enough sense that I'm excited about it.


>> I work in the research domain, so I'm fine with hacking applications to use
>> our own support for TLS.  That said, we've moved on to both 1. manually
>> changing the offending code that uses __thread to make progress, while
>> also
>> 2. trying to hack in to Composite sufficient, but not complete, support
>> for
>> using %gs.  1. is motivated by the fact that we'd like to not wait till 2.
>> is done while still making progress on booting the system.
>>
>
> If you can come up with a patch for bmk which
> 1) solves your problem
> 2) incurs no runtime penalty for the current case
> 3) does not look like civet coffee before the cleanup process
>
> I see no reason not to integrate such a patch.


It is not smart, beautiful, or interesting.  We assume a limit on the
number of threads in the system, and fast access to the current thread
id.  Given those, a simple lookup in a "TLS" array indexed by thread
id yields what we want (roughly the shape sketched below).  I can't
imagine this would pass the smell test.
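
For the record, it is roughly this shape (MAX_THREADS and thd_id()
are made-up names; the real code would use Composite's fast
current-thread id):

#define MAX_THREADS 64                /* assumed system-wide thread limit */

extern unsigned int thd_id(void);     /* hypothetical: current thread id */

static void *tls_area[MAX_THREADS];   /* per-thread storage, sans %gs */

static inline void *
tls_get(void)
{
        return tls_area[thd_id()];    /* assumes thd_id() < MAX_THREADS */
}

static inline void
tls_set(void *area)
{
        tls_area[thd_id()] = area;
}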


>> Some context from our side.  Our current implementation plan is as follows:
>> 1. Create a layer below bmk which removes the lowest-level functions, and
>> implements them using Composite primitives (context switch, initial memory
>> image allocation, interrupts).  We're testing this now, and it has been in
>> place for the past few months.
>>
>
> So more or less rumprun/platform/$x?  (not exactly, but just trying to
> locate the ballpark)


Exactly.
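
For anyone else following along, the hooks in question are roughly of
this shape (illustrative prototypes, not our actual interface):

/* The layer below bmk, implemented on Composite primitives. */
void  plat_cpu_switch(struct thread *prev, struct thread *next);
void *plat_mem_bootregion(unsigned long sz);  /* initial memory image */
int   plat_intr_attach(int irq, void (*handler)(void *), void *arg);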


>> 2. Get the rumprun unikernel booting with trivial POSIX test programs and
>> no real devices.  We're currently at the point where libc pthread data is
>> being initialized.
>>
>
> Hmm, libpthread also uses TLS.


Right ;-)

>> 3. Get PCI working in the system by implementing the rumpuser interface for
>> PCI.
>>
>
> .. which reminds me I need to separate the DMA/bus hypercalls from the PCI
> components for the benefit of ARM-based system which want to do device
> access but don't have PCI.
>

This is the area I'm most worried about.  I haven't spent more than a
couple of hours looking into the PCI hypercall stuff.  I hope we can
provide an implementation of it and PCI will "just work", but I have
little faith in that happening.
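
To give a flavor of the lowest layer involved, config-space access via
port I/O looks something like this; cos_outl()/cos_inl() stand in for
whatever port-I/O primitive Composite exposes, and these are not the
actual rumpuser PCI prototypes:

#define PCI_CONF_ADDR 0xcf8           /* config mechanism #1 ports */
#define PCI_CONF_DATA 0xcfc

static unsigned int
pci_conf_read(unsigned bus, unsigned dev, unsigned fun, unsigned reg)
{
        unsigned int addr = (1u << 31) | (bus << 16) | (dev << 11) |
                            (fun << 8) | (reg & 0xfc);

        cos_outl(PCI_CONF_ADDR, addr);
        return cos_inl(PCI_CONF_DATA);
}

The harder parts look like interrupt routing and DMA-able memory,
which is where I expect the "just work" hope to break down.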

>> 9. Look into directly communicating with the rump kernel (avoiding libc as
>> in your stack "2" above), and see if we can support a. zero-copy
>> communication, and b. direct communication with the lower layers of the
>> rump kernel (i.e. talk almost directly to drivers).
>>
>
> Notably, once you plug into something below the syscall API, you're no
> longer in the realm of backward-compatible, stable interfaces -- something
> to keep in mind.
>

Absolutely.  We already have a libc (first dietlibc, then musl libc)
that is hooked into a non-POSIX interface.  We're comfortable going
directly to the non-POSIX stuff.  Only applications that need the
performance would do so, and they are often willing to use more exotic
interfaces.


> If you look at libp2k, you'll see one way of plugging in to some
> lower-level interfaces.  I'm not saying it's the best way, not saying I can
> immediately think of a better way either, but at least it's one way.  There
> are a bunch of ad-hoc solutions for bypassing the syscall layer, but no
> real general solution which automatically takes care of prototypes and data
> types.  Some years ago I tried to get the NetBSD interface definitions
> written in xml (because it was the s-expressions du jour back then) and .h
> autogenerated from that, but people who completely missed the point kept
> screaming "ewww, xml".  Oh well.
>

Thanks for the pointer.  Well-defined interfaces would be a huge help,
but I understand the difficulty of getting those into an existing,
large code-base.  I don't see this as something we will get to anytime
soon, so I'm sure I'll troll the list in the future when we reach that
point.


> Also notable: once you move below the syscall drivers, you can't support
> remote clients.


...or we just need to write our own support for them (e.g. similar to
the current cookie support).  That is the current plan.  We'll need to
modify sysproxy so much anyway (in the previous steps of our plan)
that I'm waving my hands and hoping this won't be too much of an
issue.  I'm likely missing a truck-load of issues here, as we haven't
looked into it in much detail.

Thanks again!

Best
Gabe
