On Thu, Jul 31, 2008 at 11:09:54AM -0400, Oren Laadan wrote:
>
>
> Louis Rilling wrote:
>> On Wed, Jul 30, 2008 at 06:20:32PM -0400, Oren Laadan wrote:
>>>
>>> Serge E. Hallyn wrote:
>>>> Quoting Oren Laadan ([EMAIL PROTECTED]):
>>>>> +int do_checkpoint(struct cr_ctx *ctx)
>>>>> +{
>>>>> +        int ret;
>>>>> +
>>>>> +        /* FIX: need to test whether container is checkpointable */
>>>>> +
>>>>> +        ret = cr_write_hdr(ctx);
>>>>> +        if (!ret)
>>>>> +                ret = cr_write_task(ctx, current);
>>>>> +        if (!ret)
>>>>> +                ret = cr_write_tail(ctx);
>>>>> +
>>>>> +        /* on success, return (unique) checkpoint identifier */
>>>>> +        if (!ret)
>>>>> +                ret = ctx->crid;
>>>> Does this crid have a purpose?
>>> yes, at least three; all are for the future, but important to set the
>>> meaning of the return value of the syscall already now. The "crid" is
>>> the CR-identifier that identifies the checkpoint. Every checkpoint is
>>> assigned a unique number (using an atomic counter).
>>>
>>> 1) if a checkpoint is taken and kept in memory (instead of to a file) then
>>> this will be the identifier with which the restart (or cleanup) would refer
>>> to the (in memory) checkpoint image
>>>
>>> 2) to reduce downtime of the checkpoint, data will be aggregated on the
>>> checkpoint context, as well as references to (cow-ed) pages. This data can
>>> persist between calls to sys_checkpoint(), and the 'crid', again, will be
>>> used to identify the (in-memory-to-be-dumped-to-storage) context.
>>>
>>> 3) for incremental checkpoint (where a successive checkpoint will only
>>> save what has changed since the previous checkpoint) there will be a need
>>> to identify the previous checkpoints (to be able to know where to take
>>> data from during restart). Again, a 'crid' is handy.
>>>
>>> [in fact, for the 3rd use, it will make sense to write that number as
>>> part of the checkpoint image header]
>>>
>>> Note that by doing so, a process that checkpoints itself (in its own
>>> context), can use code that is similar to the logic of fork():
>>>
>>>     ...
>>>     crid = checkpoint(...);
>>>     switch (crid) {
>>>     case -1:
>>>             perror("checkpoint failed");
>>>             break;
>>>     default:
>>>             fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
>>>             /* proceed with execution after checkpoint */
>>>             ...
>>>             break;
>>>     case 0:
>>>             fprintf(stderr, "returned after restart\n");
>>>             /* proceed with action required following a restart */
>>>             ...
>>>             break;
>>>     }
>>>     ...
>>
>> If I understand correctly, this crid can live for quite a long time. So many of
>> them could be generated while some container would accumulate incremental
>> checkpoints on, say crid 5, and possibly crid 5 could be reused for another
>> unrelated checkpoint during that time. This brings the issue of allocating crids
>> reliably (using something like a pidmap for instance). Moreover, if such ids are
>> exposed to userspace, we need to remember which ones are allocated across
>> reboots and migrations.
>>
>> I'm afraid that this becomes too complex...
>
> And I'm afraid I didn't explain myself well. So let me rephrase:
>
> CRIDs are always _local_ to a specific node. The local CRID counter is
> bumped (atomically) with each checkpoint attempt. The main use case is
> for when the checkpoint is kept in memory either shortly (until it is
> written back to disk) or for a longer time (use-cases that want to keep
> it there). It only remains valid as long as the checkpoint image is
> still in memory and has not been committed to storage/network. Think
> of it as a way to identify the operation instance.
>
> So they can live quite a long time, but only as long as the original
> node is still alive and the checkpoint is still kept in memory.
> They are meaningless across reboots and migrations. I don't think a wrap
> around is a concern, but we can use 64 bit if that is the case.
>
> Finally, the incremental checkpoint use-case: imagine a container that
> is checkpointed regularly every minute. The first checkpoint will be
> a full checkpoint, say CRID=1. The second will be incremental with
> respect to the first, with CRID=2, and so on for the third and the fourth.
> Userspace could use these CRIDs to name the image files (for example,
> app.img.CRID). Assume that we decide (big "if") that the convention is
> that the last part of the filename must be the CRID, and if we decide
> (another big "if") to save the CRID as part of the checkpoint image --
> the part that describes the "incremental nature" of a new checkpoint.
> (That part would specify where to get state that wasn't really saved
> in the new checkpoint but instead can be retrieved from older ones).
> If that was the case, then the logic in the kernel would be fairly
> [simple] to find (and access) the actual files that hold the data. Note that
> in this case the CRIDs are guaranteed to be unique per series of
> incremental checkpoints, and an incremental checkpoint is meaningless
> across reboots (and we can require that across migration too).
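
For illustration only, the node-local CRID semantics described in the quoted text (an atomically bumped counter, an identifier that stays valid only while the checkpoint image is still in memory) could be modelled in plain userspace C roughly as below. This is a sketch, not code from the patch: `struct ckpt_ctx`, `crid_alloc()`, `crid_lookup()`, `crid_writeback()` and `MAX_PENDING` are all hypothetical names, and the table is deliberately left unlocked to keep the example short.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 8  /* hypothetical cap on in-memory checkpoints */

/* Hypothetical stand-in for an in-memory checkpoint context. */
struct ckpt_ctx {
    int64_t crid;      /* 0 means the slot is free */
    int container;     /* hypothetical container id */
};

/* Node-local counter; 64 bits so that wrap-around is not a concern. */
static _Atomic int64_t crid_counter;
static struct ckpt_ctx pending[MAX_PENDING];

/* Bump the counter on every attempt, then remember the context.
 * Returns the new CRID (>= 1), or -1 if the table is full.
 * Note: the table itself is not locked in this sketch. */
static int64_t crid_alloc(int container)
{
    int64_t crid = atomic_fetch_add(&crid_counter, 1) + 1;

    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (pending[i].crid == 0) {
            pending[i].crid = crid;
            pending[i].container = container;
            return crid;
        }
    }
    return -1;  /* attempt failed; the CRID value is consumed anyway */
}

/* A CRID resolves only while its checkpoint is still in memory. */
static struct ckpt_ctx *crid_lookup(int64_t crid)
{
    if (crid <= 0)
        return NULL;
    for (size_t i = 0; i < MAX_PENDING; i++)
        if (pending[i].crid == crid)
            return &pending[i];
    return NULL;
}

/* Committing the image to storage ends the CRID's validity. */
static int crid_writeback(int64_t crid)
{
    struct ckpt_ctx *ctx = crid_lookup(crid);

    if (!ctx)
        return -1;
    ctx->crid = 0;  /* slot is free again; the number is never reused */
    return 0;
}
```

In this model two checkpoints of the same container get distinct CRIDs, and once `crid_writeback()` has run, the old CRID no longer resolves to anything.
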

Letting the kernel guess where to find the missing data of an incremental
checkpoint seems a bit hazardous indeed. What about just appending incremental
checkpoints to the last full checkpoint file?

>
> We probably don't want to use something like a pid to identify the
> checkpoint (while in memory), because we may have multiple checkpoints
> in memory at a time (of the same container).

Agreed.

>
>>
>> It would be way easier if the only (kernel-level) references to a checkpoint
>> were pointers to its context. Ideally, the only reference would live in a
>> 'struct container' and would be easily updated at restart-time.
>
> Consider the following scenario of calls from user-space (which is
> how I envision the checkpoint optimized for minimal downtime, in the
> future):
>
> 1)  while (syscall_to_do_precopy)             <- do precopy until ready to
>             if (too_long_already)             <- checkpoint or too long
>                     break;
>
> 2)  freeze_container();
>
> 3)  crid = checkpoint(.., .., CR_CKPT_LAZY);  <- checkpoint container
>                                               <- don't commit to disk
>                                               <- (minimize downtime)
>
> 4)  unfreeze_container();                     <- now can unfreeze container
>                                               <- already as soon as possible
>
> 5)  ckpt_writeback(crid, fd);                 <- container is back running. we
>                                               <- can commit data to storage or
>                                               <- network in the background.
>
> #2 and #4 are done with freezer_cgroup()
>
> #1, #3 and #5 must be syscalls
>
> More specifically, syscall #5 must be able to refer to the result of syscall #3
> (that is the CRID !). It is possible that another syscall #3 occurs, on the same
> container, between steps 4 and 5 ... but then that checkpoint will be assigned
> another, unique CRID.

Hm, assuming that, as proposed above, incremental checkpoints are stored in the
same file as the ancestor full checkpoint, why not simply give fd as argument in
#5? I'd expect that the kernel would associate the file descriptor to the
checkpoint until it is finalized (written back, sent over the wire, etc.). Maybe
I'm still missing something...

>
>> My $0.02 ...
>
> Thanks... American or Canadian ? ;)

Since I only have Canadian citizenship, you can easily guess ;)

Thanks for your patient explanations!

Louis

-- 
Dr Louis Rilling                        Kerlabs
Skype: louis.rilling                    Batiment Germanium
Phone: (+33|0) 6 80 89 08 23            80 avenue des Buttes de Coesmes
http://www.kerlabs.com/                 35700 Rennes
_______________________________________________
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

_______________________________________________
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel