|
http://lwn.net/Articles/249080/

For the first time in a few years, virtualization was not on the agenda at the 2007 kernel summit. The related field of containers, however, was deemed worth talking about. The virtualization problem has been mostly solved, at least at the kernel level, but there is still a lot of work to do in the containers area.

Paul Menage talked about the process containers patch, which has recently been rebranded "control groups." The control groups API is currently being used by the CFS scheduler, cpusets, and the memory controller code. Work in progress includes rlimits and an interface to the process freezer used by the suspend/resume code. Controlling the freezer via control groups allows user space to freeze specific groups of processes, which, in turn, is very useful when implementing checkpointing and live migration. In particular, with control groups, it will be possible to freeze an entire group of processes in an atomic way.

Control groups have very little overhead when not in use. There is an approximately 1% hit on the fork() and exit() calls when control groups are being used.

The control groups code is managed by way of a virtual filesystem. This filesystem is a user-space API which must be managed carefully; there needs to be consistency across the various controllers which can work with control groups. To that end, parts of this interface are being pushed into generic code when possible. One other issue is the use of control groups within containers. It would be nice if a containerized system could manage control groups for processes within the container, but that is not yet implemented.

Eric Biederman talked about the container situation in general. Implementing containers requires the creation of container-specific namespaces for all of the global resources found on the system. Namespaces for UTS names (hostname and domain name), SYSV interprocess communication primitives, and users are in the mainline now. There is a process ID namespace patch in -mm which is getting close. Network namespaces are in development now. Resources which still need to have namespaces created for them include system time (important to keep time from moving backward when containers are migrated from one system to another) and devices. Each namespace which is created requires an option to the clone() system call to say whether it should be shared or not. It seems that there may not be enough clone bits to go around; how that problem will be solved is not clear.

So, how close are we to having a working container solution? It is still somewhat distant, says Eric. But, when it's done, the support for containers in Linux will be more general and more capable than the options which are available now. It is, he says, a more general solution than OpenVZ, and, unlike Solaris Zones, it will have network namespaces. An important milestone will be the incorporation of PID namespaces, which will make it possible to start actually playing with Linux containers. That code should, with luck, be merged before too long, though it is proving to be a bit of a challenge: kernel code has process IDs hidden away in a number of unexpected places.

Stay tuned; perhaps, by the next kernel summit, containers will be considered to be a solved problem as well.
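
The concern about running out of clone bits is easier to see with the flag word written out. What follows is a minimal sketch, not kernel code: the flag values are copied from Linux's <linux/sched.h>, while the dictionary and the helper names clone_flags() and shared() are invented here for illustration. Every namespace consumes one bit of the single flags word passed to clone(), and the low byte of that word is already reserved for the child's exit signal.

```python
# Illustrative sketch only. Flag values are taken from <linux/sched.h>;
# NAMESPACE_FLAGS, clone_flags() and shared() are hypothetical helpers.

CSIGNAL = 0x000000FF  # low byte of the clone() flags word: exit signal

NAMESPACE_FLAGS = {
    "mount": 0x00020000,  # CLONE_NEWNS
    "uts":   0x04000000,  # CLONE_NEWUTS
    "ipc":   0x08000000,  # CLONE_NEWIPC
    "user":  0x10000000,  # CLONE_NEWUSER
    "pid":   0x20000000,  # CLONE_NEWPID (still in -mm per the article)
    "net":   0x40000000,  # CLONE_NEWNET (still in development per the article)
}


def clone_flags(*private):
    """Return the flag mask asking for a private copy of each named namespace."""
    mask = 0
    for name in private:
        mask |= NAMESPACE_FLAGS[name]
    return mask


def shared(mask):
    """List the namespaces a child created with `mask` would share with its parent."""
    return sorted(name for name, bit in NAMESPACE_FLAGS.items() if not (mask & bit))


if __name__ == "__main__":
    mask = clone_flags("uts", "ipc")
    print(hex(mask), "still shared:", shared(mask))
```

A namespace whose bit is left clear stays shared with the parent, so each new kind of namespace permanently claims one more bit of a word that also has to carry all of the other clone() options.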

KS2007: Containers
Posted Sep 10, 2007 16:59 UTC (Mon) by kolyshkin (subscriber, #34342)

By the way, slides used for this session are available here.

"An important milestone will be the incorporation of PID namespaces, which will make it possible to start actually playing with Linux containers. That code should, with luck, be merged before too long"

(Most of) the PID namespaces code is already in the -mm tree.

"It is, he says, a more general solution than OpenVZ"

Yes, in the sense that one can use only parts of the container functionality (like having only a PID namespace, or a network namespace) -- which makes sense in some situations. Currently, the OpenVZ kernel lets you use only some parts separately (like beancounters, or the fair CPU scheduler), and this is only from the kernel side -- user-level tools can only deal with "full scale" containers. On the other hand, checkpointing is only possible when a container is a closed object, so "half-containers" cannot be checkpointed.

"So, how close are we to having a working container solution?"

A big part here is resource management. The memory controller that is now in -mm is just the very beginning -- there is a whole lot more than RSS and page cache (on the other hand, Pavel Emelyanov has already sent a kernel memory controller patchset as an RFC). Group-based CFQ scheduling is not yet merged AFAIK. Group I/O scheduling (based on Jens Axboe's CFQ) will probably be sent for review soon, but scheduling delayed writes requires some dirty page tracking mechanism that only exists in OpenVZ for now (described in Pavel's paper); a discussion of how to implement that for mainstream has not even started.

In the end, there are a lot of issues to be solved, but given the latest progress, most of the functionality could be there in a year or so, so I more or less agree with your optimistic forecast. :) When containers are ready, we can start work on checkpointing.

What is a network namespace?
Posted Sep 11, 2007 13:33 UTC (Tue) by cajal (guest, #4167)

I'm puzzled by this quote: "unlike Solaris Zones, it will have network namespaces." What is a network namespace?

What is a network namespace?
Posted Sep 12, 2007 3:13 UTC (Wed) by zdzichu (subscriber, #17118)

It's the ability to have different network stacks running side by side. It's network stack virtualization. And, contrary to the comment above, it's available in Solaris 10u4 and OpenSolaris. It's nicknamed project Crossbow.

What is a network namespace?
Posted Sep 12, 2007 8:42 UTC (Wed) by ebiederm (subscriber, #35028)

Odd. I don't think I actually made that comment.

KS2007: Containers
Posted Sep 12, 2007 8:41 UTC (Wed) by ebiederm (subscriber, #35028)

When I claimed the current kernel infrastructure is more general than vserver and OpenVZ, what I meant is that we have to support the entire kernel and everything it can do, and do it with code that can pass a code review by the kernel community. Ensuring that every architecture and subarchitecture will work, and that every weird kernel subsystem will work, appears to me to be more than the out-of-tree projects have tackled.

Doing this with namespaces decomposes the problem so we can ...

The question asked of me is how long until we have in-kernel support that ...

If you only need a subset of that functionality (like a lot of projects) ...

Having the additional resource management seems to be a big part of the ...

For global resources there are two approaches that a designer can choose ...

What little I know of Solaris Zones is that they grew out of efforts to ...

As for the question of what network namespaces are: they are a way to ...

Eric

KS2007: Containers
Posted Sep 12, 2007 14:51 UTC (Wed) by kolyshkin (subscriber, #34342)

Gerrit Huizenga's coverage of the same containers session is here: http://gh-linux.blogspot.com/2007/09/linux-kernel-summit-... |
