Hi Antti,

> The problem there is that file system implementations are very 
> intertwined with the vfs and block/page cache interface semantics.  For 
> example, blocks must be written in correct order to retain file system 
> consistency, and some blocks must not be written before other are. 
> Furthermore, there's no global right way to do it, and e.g. ext2 and 
> journalled FFS have slightly different kinks.

that is very interesting. I can understand that the order of write
operations must be maintained when writing back the cache. But I would
love to learn about further considerations. It just occurred to me that
it might be a good idea to incorporate such consistency information into
Genode's block-session interface. Can you recommend specific learning
material on this consistency issue?

> Notably, though, the very early versions of rump kernels ca 2007 (called 
> "RUMP" back then) did run with just a shim version of block/page cache. 
>   This was because rump kernels were only used for running kernel file 
> system drivers in userspace servers.  Since caching was already done in 
> the host kernel, it was unnecessary in "RUMP".  Things worked, 
> superficially, but it wasn't correct in the sense that a crash could 
> curdle the file system.
> 
> Simply, block/page caches (and furthermore some bits of the VM) are a 
> part of the current implementations of the file system drivers -- you 
> even noted something like this in your description of the Linux file 
> system porting efforts.  What we should be able to do fairly easily, 
> though, is to figure out how to make those layers work as write-through 
> as possible, so you could have control of the actual caching in a 
> separate component at/under the block device layer.  If that approach 
> could solve a reasonable chunk of the issue for you, maybe we can 
> discuss it more in a separate thread or e.g. at FOSDEM.

That sounds very good.

Let me give you the rationale behind my request: In Genode, we have to
approach caching in a way that is very different from monolithic
kernels. A monolithic kernel has a global view of the used memory, the
slack memory, and it has the system-global notion of files. On Genode,
there is no single component with such an all-encompassing view.

Right now, when using Rump kernels with the cache included, we have to
assign a certain amount of memory to rump-fs. But inevitably, the
amount is either too low (a lot of memory remains unused in the system)
or too high (the memory is needed somewhere else). Whatever value we
pick, it is wrong. This issue is amplified by the fact that we use
multiple rump-fs instances - one for each file system that we use.

To allow a block cache to utilize slack memory but still release this
memory for other purposes as soon as the system comes under pressure,
Genode supports a "ballooning" protocol that allows components to
cooperatively adapt their memory usage at runtime. E.g., a cache would
try to request additional memory up to the point where such requests
are denied. So slack memory gets sucked up by the cache. Once the
memory is needed for other purposes, the cache cooperatively responds
to so-called "yield requests" by evicting the cached blocks and handing
back the underlying backing store.

I see two approaches to employing the ballooning mechanism with
rump-fs. First, we could make rump-fs itself support ballooning. But I
honestly don't know what this would take. The second approach would be
to separate the cache from rump-fs. Rump-fs would merely translate
file-system operations to block-level operations, which it already
does. It would leave the caching to a separate cache component that
sits between rump-fs and the block device. We already have a simple
version of such a cache component (using LRU as the eviction strategy),
which is implemented in less than 1000 lines of code. The low
complexity is good because this component must be trusted to release
resources on request. A small program can be validated more easily than
a complex component like rump-fs.
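
For illustration, the core of such an LRU read cache could look as
follows. The Block_device struct is a made-up stand-in for a block
session, not our actual component:

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

/* stand-in for the block-device back end (hypothetical interface) */
struct Block_device
{
    int reads = 0;
    std::string read(long block_nr)
    {
        ++reads;   /* count device accesses so cache hits are visible */
        return "data-" + std::to_string(block_nr);
    }
};

/* minimal read cache with LRU eviction, sitting between a client
 * (e.g., rump-fs) and the block device */
class Lru_cache
{
    Block_device &_device;
    std::size_t const _capacity;
    std::list<std::pair<long, std::string> > _lru;  /* front = newest */
    std::unordered_map<long, decltype(_lru)::iterator> _index;

  public:

    Lru_cache(Block_device &d, std::size_t capacity)
    : _device(d), _capacity(capacity) { }

    std::string read(long block_nr)
    {
        auto it = _index.find(block_nr);
        if (it != _index.end()) {
            /* hit: move entry to the front, no device access */
            _lru.splice(_lru.begin(), _lru, it->second);
            return it->second->second;
        }
        if (_lru.size() == _capacity) {
            /* miss at capacity: evict the least-recently-used block */
            _index.erase(_lru.back().first);
            _lru.pop_back();
        }
        _lru.emplace_front(block_nr, _device.read(block_nr));
        _index[block_nr] = _lru.begin();
        return _lru.front().second;
    }
};
```

A write path with the ordering constraints you describe is of course
the hard part, which is exactly why the write-through idea below is so
appealing.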

Your write-through suggestion looks promising. We could then dimension
the rump_fs cache to be quite small and use the separate cache
component.

>> Second, since the Rump
>> kernel spawns one host thread for each NetBSD kernel thread, the
>> component's footprint with respect to the usage of threads is much
>> higher than it could be. But those things do not at all diminish the
>> value that Rump kernels provide to us.
> 
> There's no mandate to create a host thread per se.  You need to create a 
> separately schedulable entity with a stack and thread-local storage, but 
> if you choose to implement that multiplexed on top of a single host 
> thread, that's fine.

Does that mean that Rump kernels do not rely on preemptive threading?
If yes, would user-level thread scheduling (e.g., based on setjmp,
longjmp) do? What keeps you from doing this by default? It would make
Rump kernels behave deterministically across all host platforms and
possibly simplify the hypercall interface. Wouldn't that be desirable?
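
To make concrete what I mean by user-level scheduling, here is a
minimal sketch of two contexts multiplexed cooperatively on one host
thread. It uses POSIX ucontext rather than raw setjmp/longjmp for
clarity, and it is of course not rump-kernel code - just the general
pattern, where every blocking point yields explicitly:

```cpp
#include <cassert>
#include <ucontext.h>
#include <vector>

/* two cooperative contexts on a single host thread; "kernel threads"
 * could be scheduled the same way if each blocking operation yields */
static ucontext_t main_ctx, worker_ctx;
static std::vector<int> trace;

static void worker()
{
    trace.push_back(1);
    swapcontext(&worker_ctx, &main_ctx);  /* cooperative yield */
    trace.push_back(3);
}

static void run_demo()
{
    static char stack[64 * 1024];         /* private worker stack */
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp   = stack;
    worker_ctx.uc_stack.ss_size = sizeof(stack);
    worker_ctx.uc_link          = &main_ctx; /* resume here at exit */
    makecontext(&worker_ctx, worker, 0);

    trace.push_back(0);
    swapcontext(&main_ctx, &worker_ctx);  /* run worker until yield */
    trace.push_back(2);
    swapcontext(&main_ctx, &worker_ctx);  /* resume to completion */
}
```

The interleaving is fully deterministic - the scheduler decides exactly
when each context runs, with no preemption involved.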

> Due to the way the NetBSD kernel is structured,
> they do need to be independently runnable, i.e. t1 must be able to run 
> even if t2 is blocking.  Otherwise you'll run into deadlocks ... very 
> non-obvious deadlocks ... everywhere.  Notably, the non-kernel threads 
> accessing the rump kernel must be under the same scheduler architecture 
> as the kernel threads.  Otherwise, say a non-kernel thread calls the 
> rump kernel, and blocks.  Also assume that unblocking that thread 
> depends on a kernel thread running.  The scheduler must have knowledge 
> that it needs to run the kernel thread.  The same story applies when the 
> kernel thread has finished running and the user thread should run again.
> 
> A thread itself should be reasonably cheap (~10kB memory including max. 
> stack consumption).  So, the 10-or-so kernel threads will set you back 
> ~100k, a non-trivial amount if you run hundreds of rump kernel servers, 
> but probably not the major cost.  If you can be more specific about what 
> the "thread footprint" is, I can at least think of how to reduce it to 
> an acceptable level.

I should have been more specific. With "footprint" I subsumed multiple
things:

* Stacks (obviously)

* Kernel memory (on kernels with kernel threads, this is one additional
  context in the kernel + the kernel object that represents the
  thread). Unfortunately, on most L4 kernels, kernel memory is strictly
  bounded. So we are a bit cautious about kernel-memory usage.

* Kernel interactions needed for each blocking operation such as
  blocking on a contended lock

* Resources in Genode's core component. E.g., when running on NOVA,
  core needs to allocate a pager thread for each user thread. This
  pager thread consumes one thread context (which is 1 MiB of virtual
  memory) within core. Consequently, on 32-bit platforms, core's
  virtual memory suddenly becomes a scarce resource. But this is an
  issue we are working on.

Also consider that we are using multiple rump-fs instances. So the
10-or-so threads become N*10-or-so threads. ;-)

In general, we try to avoid using multiple threads these days, except
in two cases: where the workload is to be distributed over multiple
cores, or where different code paths should be schedulable
independently (think of low-latency IRQ handlers). Neither case applies
to file systems.

In all other cases where threads had traditionally been used, we
prefer to model components as state machines that respond to
asynchronous events and incoming RPC requests (similar to a select
loop). I agree that this can produce deadlocks. But on the other hand,
the use of threads is even worse because, when not properly
synchronized, they become prone to race conditions, which may
eventually result in silent memory corruption. When debugging, I vastly
prefer a deterministic deadlock (where I can look at the backtrace to
spot the problem) over a sporadic memory-corruption issue (where I can
spot a symptom but rarely the cause).
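
As a toy example of the style (all states and event names made up),
such a component boils down to a single dispatch function that is never
entered concurrently:

```cpp
#include <cassert>
#include <string>

/* a component modeled as a state machine driven by one event loop;
 * no locks are needed because only one event is processed at a time */
enum class State { Idle, Reading, Done };

struct Component
{
    State       state = State::Idle;
    std::string result;

    /* single entry point for all activity (RPC requests, signals) */
    void handle(std::string const &event)
    {
        switch (state) {
        case State::Idle:
            if (event == "request") state = State::Reading;
            break;
        case State::Reading:
            if (event == "data_ready") {
                result = "reply";
                state  = State::Done;
            }
            break;
        case State::Done:
            break;
        }
    }
};
```

Each blocking operation of the traditional threaded version becomes an
explicit state transition, which is more work to write but trivially
free of data races.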

We are successfully applying this single-threaded approach to all new
components and are in the process of reworking all existing components
to get rid of threads. Even for the Wifi stack and the Linux TCP/IP
stack, we execute all Linux kernel threads on a single Genode thread.

My slight objection regarding Rump kernels' reliance on host threads
stems from this line of thinking. But as I mentioned in my previous
email, it is not a problem that holds us back! Please do not feel
pressured by me.

Cheers
Norman

-- 
Dr.-Ing. Norman Feske
Genode Labs

http://www.genode-labs.com · http://genode.org

Genode Labs GmbH · Amtsgericht Dresden · HRB 28424 · Sitz Dresden
Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth

_______________________________________________
rumpkernel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rumpkernel-users