Hi Antti,

> The problem there is that file system implementations are very
> intertwined with the vfs and block/page cache interface semantics. For
> example, blocks must be written in correct order to retain file system
> consistency, and some blocks must not be written before other are.
> Furthermore, there's no global right way to do it, and e.g. ext2 and
> journalled FFS have slightly different kinks.
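To check my understanding of the ordering constraint, here is a toy model of a write-back queue that flushes dirty blocks in dependency order. This is purely illustrative (my own invented names and structure, not how NetBSD actually implements it):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Toy model of ordered write-back: each dirty block may depend on
// other blocks that must reach the disk first (e.g., an inode block
// before the block that references it). Illustrative only.
struct WriteBackQueue
{
    std::map<int, std::set<int>> deps;   // block -> blocks to write first
    std::set<int>                dirty;

    void mark_dirty(int block, std::set<int> before = {})
    {
        dirty.insert(block);
        deps[block] = before;
    }

    // Flush in dependency order via repeated scans. A real cache
    // would use explicit dependency lists; this sketch also does
    // not handle cyclic dependencies.
    std::vector<int> flush()
    {
        std::vector<int> written;
        while (!dirty.empty()) {
            for (auto it = dirty.begin(); it != dirty.end(); ) {
                bool ready = true;
                for (int d : deps[*it])
                    if (dirty.count(d)) { ready = false; break; }
                if (ready) {
                    written.push_back(*it);
                    it = dirty.erase(it);
                } else {
                    ++it;
                }
            }
        }
        return written;
    }
};
```

Of course, the real mechanisms (journalling, soft updates) are far more involved, which is exactly why I would like pointers to learning material.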
That is very interesting. I can understand that the order of write operations must be maintained when writing back the cache. But I would love to learn about further considerations. It just occurred to me that it might be a good idea to incorporate such consistency information into Genode's block-session interface. Can you recommend specific learning material on this consistency issue?

> Notably, though, the very early versions of rump kernels ca 2007 (called
> "RUMP" back then) did run with just a shim version of block/page cache.
> This was because rump kernels were only used for running kernel file
> system drivers in userspace servers. Since caching was already done in
> the host kernel, it was unnecessary in "RUMP". Things worked,
> superficially, but it wasn't correct in the sense that a crash could
> curdle the file system.
>
> Simply, block/page caches (and furthermore some bits of the VM) are a
> part of the current implementations of the file system drivers -- you
> even noted something like this in your description of the Linux file
> system porting efforts. What we should be able to do fairly easily,
> though, is to figure out how to make those layers work as write-through
> as possible, so you could have control of the actual caching in a
> separate component at/under the block device layer. If that approach
> could solve a reasonable chunk of the issue for you, maybe we can
> discuss it more in a separate thread or e.g. at FOSDEM.

That sounds very good. Let me give you the rationale behind my request: In Genode, we have to approach caching in a way that is very different from monolithic kernels. A monolithic kernel has a global view of the used memory and the slack memory, and it has a system-global notion of files. On Genode, there is no single component with such an all-encompassing view. Right now, when using Rump kernels including the cache, we have to assign a certain amount of memory to rump-fs.
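To illustrate the kind of separate cache component you allude to, here is a rough sketch of an LRU block cache with a shrink operation that a cooperative memory-yielding protocol could invoke. The interface is hypothetical and heavily simplified, not our actual implementation:

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>

// Sketch of an LRU block cache that can cooperate with a ballooning
// protocol: it caches up to a current limit and evicts
// least-recently-used blocks when asked to yield memory.
// Hypothetical interface, counting entries instead of bytes.
class BlockCache
{
    struct Entry { long block; std::string data; };

    std::list<Entry> _lru;   // front = most recently used
    std::unordered_map<long, std::list<Entry>::iterator> _map;
    size_t _limit;           // current balloon size (in entries)

    void _evict_to_limit()
    {
        while (_lru.size() > _limit) {
            _map.erase(_lru.back().block);
            _lru.pop_back();
        }
    }

public:
    BlockCache(size_t initial_limit) : _limit(initial_limit) { }

    void insert(long block, std::string data)
    {
        auto it = _map.find(block);
        if (it != _map.end()) _lru.erase(it->second);
        _lru.push_front({block, std::move(data)});
        _map[block] = _lru.begin();
        _evict_to_limit();
    }

    bool lookup(long block, std::string &out)
    {
        auto it = _map.find(block);
        if (it == _map.end()) return false;
        _lru.splice(_lru.begin(), _lru, it->second);   // touch entry
        out = it->second->data;
        return true;
    }

    // Respond to a yield request: shrink the balloon and evict.
    void yield(size_t new_limit) { _limit = new_limit; _evict_to_limit(); }

    size_t size() const { return _lru.size(); }
};
```

The point of the sketch is merely that such a policy fits in very little code, which matters because the component must be trusted to actually give memory back.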
But inevitably, the amount is either too low (there remains a lot of unused memory in the system) or too high (the memory is needed somewhere else). Whatever value we pick, it is wrong. This issue gets amplified by the fact that we use multiple rump-fs instances - one for each file system that we use.

To allow a block cache to utilize slack memory but still be able to use this memory otherwise as soon as the system is under pressure, Genode supports a "ballooning" protocol that allows components to cooperatively adapt their memory usage during runtime. E.g., a cache would try to request additional memory up to the point where such requests are denied. So slack memory gets sucked up by the cache. Once the memory is needed for other purposes, the cache cooperatively responds to so-called "yield requests" by evicting cached blocks and handing back the underlying backing store.

I see two approaches to employing the ballooning mechanism with rump-fs. First, we could make rump-fs itself support ballooning. But I honestly don't know what this would take. The second approach would be to separate the cache from rump-fs. Rump-fs would merely translate file-system operations to block-level operations, which it already does, and leave the caching to a separate cache component that sits between rump-fs and the block device. We already have a simple version of such a cache component (using LRU as eviction strategy), which is implemented in less than 1000 lines of code. The low complexity is good because this component must be trusted to release resources on request. A small program can be validated more easily than a complex component like rump-fs. Your write-through suggestion looks promising. So we could dimension the rump_fs cache to be quite small and use the separate cache component.

>> Second, since the Rump
>> kernel spawns one host thread for each NetBSD kernel thread, the
>> component's footprint with respect to the usage of threads is much
>> higher than it could be.
>> But those things do not at all diminish the
>> value that Rump kernels provide to us.

> There's no mandate to create a host thread per se. You need to create a
> separately schedulable entity with a stack and thread-local storage, but
> if you choose to implement that multiplexed on top of a single host
> thread, that's fine.

Does that mean that Rump kernels do not rely on preemptive threading? If so, would user-level thread scheduling (e.g., based on setjmp/longjmp) do? What keeps you from doing this by default? This would make Rump kernels behave deterministically across all host platforms and possibly simplify the hypercall interface. Wouldn't that be desirable?

> Due to the way the NetBSD kernel is structured,
> they do need to be independently runnable, i.e. t1 must be able to run
> even if t2 is blocking. Otherwise you'll run into deadlocks ... very
> non-obvious deadlocks ... everywhere. Notably, the non-kernel threads
> accessing the rump kernel must be under the same scheduler architecture
> as the kernel threads. Otherwise, say a non-kernel thread calls the
> rump kernel, and blocks. Also assume that unblocking that thread
> depends on a kernel thread running. The scheduler must have knowledge
> that it needs to run the kernel thread. The same story applies when the
> kernel thread has finished running and the user thread should run again.
>
> A thread itself should be reasonably cheap (~10kB memory including max.
> stack consumption). So, the 10-or-so kernel threads will set you back
> ~100k, a non-trivial amount if you run hundreds of rump kernel servers,
> but probably not the major cost. If you can be more specific about what
> the "thread footprint" is, I can at least think of how to reduce it to
> an acceptable level.

I should have been more specific.
With "footprint" I subsumed multiple things:

* Stacks (obviously)

* Kernel memory (on kernels with kernel threads, this is one additional
  context in the kernel + the kernel object that represents the thread).
  Unfortunately, on most L4 kernels, kernel memory is strictly bounded.
  So we are a bit cautious about kernel-memory usage.

* Kernel interactions needed for each blocking operation, such as
  blocking on a contended lock

* Resources in Genode's core component. I.e., when running on NOVA, core
  needs to allocate a pager thread for each user thread. This pager
  thread consumes one thread context (which is 1 MiB of virtual memory)
  within core. Consequently, on 32-bit platforms, core's virtual memory
  suddenly becomes a scarce resource. But this is an issue we are
  working on.

Also consider that we are using multiple rump-fs instances. So the 10-or-so threads become N*10-or-so threads. ;-)

In general, we try to avoid using multiple threads these days except for two cases: where the workload is to be distributed over multiple cores, or where different code paths should be schedulable independently (think of low-latency IRQ handlers). Both cases certainly do not apply to file systems. In all other cases where threads had traditionally been used, we prefer to model components as state machines that respond to asynchronous events and incoming RPC requests (similar to a select loop). I agree that this can produce deadlocks. But on the other hand, the use of threads is even worse because, when not properly synchronized, they become prone to race conditions, which may eventually result in silent memory corruption. When debugging, I vastly prefer a deterministic deadlock (where I can look at the backtrace to spot the problem) over a sporadic memory-corruption issue (where I can spot a symptom but rarely the cause). We are successfully applying this single-threaded approach to all new components and are in the process of reworking all existing components to get rid of threads.
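The state-machine style can be sketched as a single-threaded event loop; the names below are hypothetical and greatly simplified, not Genode's actual signal/RPC API:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Minimal sketch of the state-machine style: one component, one
// thread, events dispatched from a queue instead of blocking threads.
struct EventLoop
{
    std::queue<std::function<void()>> pending;

    void post(std::function<void()> handler)
    {
        pending.push(std::move(handler));
    }

    // Dispatch until no events remain; all handlers run on this one
    // thread, so no locking is needed within the component.
    void run()
    {
        while (!pending.empty()) {
            auto handler = std::move(pending.front());
            pending.pop();
            handler();
        }
    }
};

// A component modeled as a state machine reacting to I/O completions
struct FsComponent
{
    enum State { IDLE, READ_SUBMITTED, DONE } state = IDLE;
    std::vector<int> result;

    void submit_read(EventLoop &loop, int block)
    {
        state = READ_SUBMITTED;
        // In a real component, the block-device driver would post this
        // completion event asynchronously.
        loop.post([this, block] { on_read_completed(block); });
    }

    void on_read_completed(int block)
    {
        result.push_back(block);
        state = DONE;
    }
};
```

Instead of a thread blocking inside `submit_read`, the component records its state and resumes in `on_read_completed` when the completion event arrives.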
Even for the Wifi stack and the Linux TCP/IP stack, we execute all Linux kernel threads on a single Genode thread.

My slight objection regarding Rump kernels' reliance on host threads stems from this line of thinking. But as I mentioned in my previous email, it is not a problem that keeps us back! Please do not feel pressured by me.

Cheers
Norman

-- 
Dr.-Ing. Norman Feske
Genode Labs

http://www.genode-labs.com · http://genode.org

Genode Labs GmbH · Amtsgericht Dresden · HRB 28424 · Sitz Dresden
Geschäftsführer: Dr.-Ing. Norman Feske, Christian Helmuth

_______________________________________________
rumpkernel-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rumpkernel-users
