Checkpoint/restart tries to head towards the mainline

[LWN subscriber-only content]

By Jake Edge
February 25, 2009

In kernel development, there is always tension between the needs of a new feature versus the needs of the kernel as a whole. Projects generally want to get their code merged as early as possible, for a variety of reasons, while the rest of the kernel community needs to be comfortable that the feature is sensible, desirable, and, perhaps most importantly, maintainable. The current push for inclusion of a feature to checkpoint and restart processes highlights this tension.

In late January, Oren Laadan posted the latest version of his kernel-based checkpoint and restart code with the notation: "Aiming for -mm". There are many possible uses for checkpoints, but it is an extremely complex problem. Laadan's current version is quite minimal, implementing only a fairly small subset of the features envisioned, but he would like to get the kind of review and testing that goes along with pushing it towards the mainline.

After two weeks without much in the way of comments, another proponent, Dave Hansen asked what, if anything, was holding the patchset back from -mm inclusion. Andrew Morton replied that he had raised some concerns which were "inconclusively waffled at" a few months back. Morton's opinion carries a fair amount of weight—not least because he runs the targeted tree. He is looking to the future and trying to ensure that the patches make sense:

I am concerned that this implementation is a bit of a toy, and that we don't know what a sufficiently complete implementation will look like. There is a risk that if we merge the toy we either:

a) end up having to merge unacceptably-expensive-to-maintain code to make it a non-toy or

b) decide not to merge the unacceptably-expensive-to-maintain code, leaving us with a toy or

c) simply cannot work out how to implement the missing functionality.

Morton asked for answers to several questions regarding what features are available in the current implementation, as well as information on what needs to be added. He also asked for indications that Laadan and Hansen had some thoughts on the design for required, but not yet implemented, features. In short, he wants to avoid any of the scenarios he outlined. In response to further questions from Ingo Molnar, Hansen outlined some of the shortcomings of the current implementation:

Right now, it is good for very little. An app has to basically be either specifically designed to work, or be pretty puny in its capabilities. Any fds that are open can only be restored if a simple open();lseek(); would have been sufficient to get it back into a good state. The process must be single-threaded. Shared memory, hugetlbfs, VM_NONLINEAR are not supported.

Hansen also had a more detailed answer to Morton's questions, which showed a lot of work still to be done. The current code only works for x86 architectures, for example, and only for basic file types, essentially just pipes and regular files. He likened the progress of checkpoint/restart to that of kernel scalability; it is a work in progress, not something that will ever be complete:

We intend to make core kernel functionality checkpointable first. We'll move outwards from there as we (and our users) deem things important, but we'll certainly never be done.

One of the main concerns is not that there is a lot still to be done, but that there may be lurking problems that either don't have solutions or can only be solved by very intrusive kernel changes. Matt Mackall looked at Hansen's list of additional features needing to be implemented and summed up the worries this way:

I think the real questions is: where are the dragons hiding? Some of these are known to be hard. And some of them are critical [for] checkpointing typical applications. If you have plans or theories for implementing all of the above, then great. But this list doesn't really give any sense of whether we should be scared of what lurks behind those doors.

There is, however, a free out-of-tree implementation of checkpoint/restart in the OpenVZ project. OpenVZ is a virtualization scheme using its own implementation of containers—different from that in more recent kernels—that supports checkpointing and migrating those containers. But it is a large patch, which Morton looked at several years ago and concluded that it would not be welcome in the mainline. Hansen sees OpenVZ as a useful example, but "with all the input from the OpenVZ folks and at least three other projects, I bet we can come up with something better".

An incremental approach to implementing checkpoints is reasonable, but Morton is concerned that by merging the current patches, the kernel developers will be committed to merging something that looks a lot like—and is as intrusive as—the OpenVZ patches. Molnar is more upbeat: he sees it as an important feature without "many long-term dragons". He does see one potential problem area in the incremental approach, though:

There is _one_ interim runtime cost: the "can we checkpoint or not" decision that the kernel has to make while the feature is not complete.

That, if this feature takes off, is just a short-term worry - as basically everything will be checkpointable in the long run.

That is one of the technical issues still to be resolved with the current patchset: how does a process programmatically determine whether it is able to be checkpointed? If the process has performed some action while running on a kernel that does not support checkpointing the state caused by that action, there is a need to be able to decide that. Molnar suggested overloading the LSM security checks such that performing those actions sets a one-way "not checkpointable" flag as appropriate. That flag could be checked by the process or by some other program that was interested. Overloading the LSM hooks is not completely uncontroversial, but it does hook the kernel in many of the right places—adding an additional call to those same places for checkpointing is not likely to fly.

There was also some question about whether the "not checkpointable" flag needs to be a one-way flag, as it could be cleared once the process has returned to a state that is able to be checkpointed. Molnar argued that the one-way flag is desirable: "uncheckpointable functionality should be as painful as possible, to make sure it's getting fixed". Users who run into problems checkpointing their applications will then apply pressure to get the requisite state added to checkpoints. As a starting point, Hansen has posted a patch that would add a one-way flag based on the kinds of files a process had opened.

Checkpoints are a useful feature that could be used for migrating processes to different machines, protecting long-running processes against kernel crashes or upgrades, system hibernation, and more. It is a difficult problem that may never really be completely finished and it touches a lot of core kernel code. For these reasons, caution is certainly justified, but one gets the sense that some kind checkpoint/restart feature will eventually make its way into the mainline. Whether it is Laadan's version, something derived from OpenVZ, or some other mechanism entirely remains to be seen.

BLCR?

Posted Feb 26, 2009 4:23 UTC (Thu) by mattdm (subscriber, #18) [Link]

What about? Berkeley Lab Checkpoint/Restart (aka BLCR)? It's currently implemented as an out-of-tree kernel module, and seems to be in active development. Is there anything useful to be gained there?

BLCR?

Posted Feb 26, 2009 6:43 UTC (Thu) by ntl (subscriber, #40518) [Link]

>From a review of the FAQ, it looks kinda hacky: BLCR uses LD_PRELOAD tricks, and its build process requires access to the kernel's System.map, which implies that it uses unexported symbols.