[Devel] Re: [PATCH] cgroups: handle failure of cgroup_populate_dir() at mount/remount
On Wed, 27 May 2009 11:24:22 +0800 Li Zefan l...@cn.fujitsu.com wrote:

KAMEZAWA Hiroyuki wrote:
On Wed, 27 May 2009 09:07:31 +0800 Li Zefan l...@cn.fujitsu.com wrote:
Paul Menage wrote:
On Fri, May 22, 2009 at 1:25 AM, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

Hm, shouldn't we allow noprefix to be effective only against cpuset? I think it's just for backward-compatibility of cpuset. (I don't like the option at all.)

Yes, exposing the noprefix option externally was one of the mistakes I made when developing cgroups. It seems to me really unlikely that anyone is using noprefix for

And noprefix is not documented in cgroups.txt, so I guess not many people know this option. Even libcgroup doesn't handle it.

anything other than implicitly when mounting the cpuset filesystem. So I'd be inclined to just forbid it if we're mounting more than just the cpuset subsystem. A bit of a nasty abstraction violation, but it makes more sense overall. The only problem is that someone *might* be using it - do we have any way to determine how, and how big do they have to be before we care?

I think we can never know..

How about this method?
- add noprefix to the to-be-removed list.
- add "WARNING: noprefix option will be removed in 2.6.32 (or 2.6.31)" now
- remove noprefix in 2.6.31-rc or later

I don't see how we can remove noprefix while preserving compatibility with old cpuset..

As Paul Menage said, we can allow noprefix to be used only if we mount just the cpuset subsystem:

I have no objection.
-Kame

(pseudo code)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -886,6 +886,11 @@ static int parse_cgroupfs_options(char *data,
 		}
 	}
 
+
+	if (test_bit(ROOT_NOPREFIX, &opts->flags) &&
+	    (opts->subsys_bits & ~cpuset_subsys_id) != 0)
+		return -EINVAL;
+
 	/* We can't have an empty hierarchy */
 	if (!opts->subsys_bits)
 		return -EINVAL;

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
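The restriction agreed on in this thread (noprefix is accepted only when cpuset is the sole subsystem being mounted) can be sketched in plain user-space C. This is a minimal sketch under stated assumptions, not the kernel code: `struct opts_sketch`, `check_noprefix()`, `NOPREFIX_FLAG` and the subsystem ids are hypothetical stand-ins, and the sketch assumes `subsys_bits` carries one bit per subsystem id, so the cpuset bit is written as `1UL << CPUSET_SUBSYS_ID`.

```c
#include <errno.h>

/* Hypothetical stand-ins; the real definitions live in the kernel. */
enum { CPUSET_SUBSYS_ID = 0, CPU_SUBSYS_ID = 1, MEMORY_SUBSYS_ID = 2 };
#define NOPREFIX_FLAG 0x1UL

struct opts_sketch {
	unsigned long subsys_bits;	/* one bit per requested subsystem */
	unsigned long flags;		/* mount flags such as noprefix */
};

/* Returns 0 if the option combination is acceptable, -EINVAL otherwise. */
static int check_noprefix(const struct opts_sketch *opts)
{
	unsigned long cpuset_bit = 1UL << CPUSET_SUBSYS_ID;

	/* noprefix is tolerated only when nothing but cpuset is requested */
	if ((opts->flags & NOPREFIX_FLAG) &&
	    (opts->subsys_bits & ~cpuset_bit) != 0)
		return -EINVAL;

	/* an empty hierarchy is rejected, as in the quoted context lines */
	if (!opts->subsys_bits)
		return -EINVAL;

	return 0;
}
```

With these assumptions, cpuset plus noprefix passes, cpuset plus cpu plus noprefix is rejected, and any combination without noprefix is unaffected.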
[Devel] Re: [PATCH] io-controller: Add io group reference handling for request
Hi Andrea and Vivek,

Ryo Tsuruta r...@valinux.co.jp wrote:

Hi Andrea and Vivek,

From: Andrea Righi righi.and...@gmail.com
Subject: Re: [PATCH] io-controller: Add io group reference handling for request
Date: Mon, 18 May 2009 16:39:23 +0200

On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
Vivek Goyal wrote:

...
 }
@@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
 /*
  * Find the io group bio belongs to.
  * If create is set, io group is created if it is not already present.
+ * If curr is set, io group is information is searched for current
+ * task and not with the help of bio.
+ *
+ * FIXME: Can we assume that if bio is NULL then lookup group for current
+ * task and not create extra function parameter ?
  *
- * Note: There is a narrow window of race where a group is being freed
- * by cgroup deletion path and some rq has slipped through in this group.
- * Fix it.
  */
-struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
-					int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
+					int create, int curr)

Hi Vivek, IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL, get iog from bio, otherwise get it from current task.

Consider also that get_cgroup_from_bio() is much slower than task_cgroup() and needs to lock/unlock_page_cgroup() in get_blkio_cgroup_id(), while task_cgroup() is rcu protected.

True.

BTW another optimization could be to use the blkio-cgroup functionality only for dirty pages and cut out some blkio_set_owner(). For all the other cases IO always occurs in the same context of the current task, and you can use task_cgroup().

Yes, maybe in some cases we can avoid setting page owner. I will get to it once I have got functionality going well. In the meantime if you have a patch for it, it will be great.

However, this is true only for page cache pages, for IO generated by anonymous pages (swap) you still need the page tracking functionality both for reads and writes.

Right now I am assuming that all the sync IO will belong to the task submitting the bio, hence use task_cgroup() for that. Only for async IO, I am trying to use page tracking functionality to determine the owner. Look at elv_bio_sync(bio). You seem to be saying that there are cases where even for sync IO, we can't use the submitting task's context and need to rely on page tracking functionality?

I think that there are some kernel threads (e.g., dm-crypt, LVM and md devices) which actually submit IOs instead of the tasks which originate the IOs. When IOs are submitted from such kernel threads, we can't use the submitting task's context to determine to which cgroup the IO belongs.

In case of getting a page (read) from swap, will it not happen in the context of the process who will take a page fault and initiate the swap read?

No, for example in read_swap_cache_async():

@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*

This is a read, but the current task is not always the owner of this swap cache page, because it's a readahead operation.

But will this readahead be not initiated in the context of the task taking the page fault?

	handle_pte_fault()
	  do_swap_page()
	    swapin_readahead()
	      read_swap_cache_async()

If yes, then swap reads issued will still be in the context of process and we should be fine?

Right. I was trying to say that the current task may swap-in also pages belonging to a different task, so from a certain point of view it's not so fair to charge the current task for the whole activity. But ok, I think it's a minor issue.

Anyway, this is a minor corner case I think. And probably it
[Devel] Re: [PATCH] io-controller: Add io group reference handling for request
On Wed, May 27, 2009 at 03:56:31PM +0900, Ryo Tsuruta wrote:

I think that only putting the hook in try_to_unmap() doesn't work correctly, because IOs will be charged to reclaiming processes or kswapd. These IOs should be charged to processes which cause memory pressure. Consider the following case:

(1) There are two processes, Proc-A and Proc-B.
(2) Proc-A maps a large file into many pages by mmap() and writes a lot of data to the file.
(3) After (2), Proc-B tries to get a page, but there are no available pages because Proc-A has used them.
(4) The kernel starts to reclaim pages and calls try_to_unmap() to unmap a page which is owned by Proc-A; blkio_cgroup_set_owner() then sets Proc-B's ID on the page because the task's context is Proc-B.
(5) After (4), the kernel writes the page out to disk. This IO is charged to Proc-B.

In the above case, I think that the IO should be charged to Proc-A, because the IO is caused by Proc-A's memory pressure. I think we should also consider the case without memory and swap isolation.

mmmh.. even if they're strictly related I think we're mixing two different problems in this way: memory pressure control and IO control. It seems you're proposing something like the badness() for OOM conditions to charge swap IO depending on how bad a cgroup is in terms of memory consumption. I don't think this is the right way to proceed, also because we already have the memory and swap control.

-Andrea
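The Proc-A/Proc-B scenario in steps (1) to (5) can be modelled with a toy sketch. Everything here is hypothetical (`struct page_sketch`, the integer owner ids, the function names); it only illustrates how the point at which the owner is recorded decides who gets charged for the writeback, and is not a model of the actual blkio-cgroup code.

```c
/* Hypothetical model: a page remembers which cgroup id is charged for its IO. */
struct page_sketch {
	int owner_id;	/* 0 means "no owner recorded yet" */
	int dirty;
};

/* Owner recorded when the page is dirtied by the writing task. */
static void dirty_page(struct page_sketch *pg, int current_id)
{
	pg->dirty = 1;
	if (pg->owner_id == 0)
		pg->owner_id = current_id;
}

/* Owner (re)recorded at reclaim time, i.e. in the reclaiming task's context. */
static void reclaim_set_owner(struct page_sketch *pg, int current_id)
{
	pg->owner_id = current_id;
}

/* Writeback charges whichever id is recorded on the page. */
static int writeback_charged_to(const struct page_sketch *pg)
{
	return pg->owner_id;
}
```

If Proc-A (id 1) dirties the page and Proc-B (id 2) later triggers reclaim, calling `reclaim_set_owner()` moves the charge to id 2, which is the outcome Ryo objects to; leaving the owner as recorded at dirtying time keeps the charge on id 1.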
[Devel] Re: [PATCH 0/8] a start to credentials c/r
Quoting Casey Schaufler (ca...@schaufler-ca.com):
Serge E. Hallyn wrote:

Following is the next version of the credentials c/r patchset, on top of the c/r patchset at git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

It implements checkpoint and restart of user, user namespaces, groups, supplementary groups, and struct cred.

There is a question as to what to do about LSM data at restart. Right now I'm ignoring it, which means that prepare_creds() should ensure that the restart tasks get the context of the task calling sys_restart(). I suspect the right thing to do is to add two new LSM hooks, one which checks current's authorization to restart from the checkpoint file,

How would that work? Based on information in the file? You have to assume that some number of checkpoint files have been hand written by Elbonian ne'er do wells.

Not based on information in the file, but based on the credentials of the task which created the file, and whether an unprivileged task could have hand-edited the file before feeding it to sys_restart(). So some example decisions in terms of selinux contexts might be,

1. a task of user_u may restart a file of type user_u if the checkpointed context is user_u
2. a task of user_u may NOT restart a file of type user_u if the checkpointed context is root_u
3. a task of root_u may restart a file of type root_u if the checkpointed context is user_u

Uh, so yes, based on info in the file as well :) Except of course the LSM would just be fed the checkpointed context and the checkpoint file context (and can deduce current's context).

and one which determines the task->cred->security field based upon any of:

1. current_security() of the task calling sys_restart()
2. the task->cred->security checkpointed in the ckpt file
3. the ->security of the checkpoint file

For Smack the correct behavior would be:

1. for sys_restart() callers without CAP_MAC_ADMIN
2. for sys_restart() callers with CAP_MAC_ADMIN
3. never

That makes sense, and is basically analogous (if I'm thinking right) to how I'm doing capabilities. So the first (authorization hook) for smack would just always return TRUE? I can hook that up right now...

sys_restart() callers running with CAP_MAC_ADMIN would have to be very very careful about the files they restart. But that's nothing new in the MAC world.

Yup. Mind you eventually I expect a setup where some privileged program is asked (by privileged or unprivileged tasks) to create a checkpoint and ask the TPM to sign it. No unprivileged program can sign an image directly, so then a restart of a task with privilege can be restricted to anything with a valid signature. In that case, it may be safe to have the checkpointed task's credentials completely restored, including LSM labels.

But that's a ways off.

thanks,
-serge
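The Smack behaviour Casey describes (use the caller's label without CAP_MAC_ADMIN, the checkpointed label with it, and never the checkpoint file's label) can be sketched as a small selection function. This is only an illustration of the rule as stated in the thread: `struct restart_labels` and `smack_restart_label()` are hypothetical names, not an existing LSM hook, and labels are modelled as plain strings.

```c
#include <stdbool.h>

/* Hypothetical inputs to a label-choosing hook; all names are illustrative. */
struct restart_labels {
	const char *caller;		/* current_security() of the sys_restart() caller */
	const char *checkpointed;	/* task->cred->security saved in the image */
	const char *file;		/* ->security of the checkpoint file */
};

/*
 * Rule from the thread: option 1 for callers without CAP_MAC_ADMIN,
 * option 2 for callers with it, option 3 (the file's label) never.
 */
static const char *smack_restart_label(const struct restart_labels *l,
				       bool has_cap_mac_admin)
{
	if (has_cap_mac_admin)
		return l->checkpointed;
	return l->caller;
}
```

The `file` member is deliberately never returned; it is kept in the structure only to make the "never" case visible.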
[Devel] [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:39 -0700
Subject: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 kernel/pid.c |   43 ++++++++++++++++++++++++++++---------------
 1 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index b2e5f78..c0aaebe 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,31 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -1;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -134,21 +159,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		if (alloc_pidmap_page(map))
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
-- 
1.5.2.5
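The factored-out helper follows an "allocate first, then install unless someone else won the race" pattern. A user-space sketch of the same idea is shown here; note that it swaps the patch's spinlock for a C11 compare-and-swap, so it illustrates the pattern rather than the kernel implementation. `struct map_sketch`, `alloc_map_page()` and `PAGE_BYTES` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define PAGE_BYTES 4096

struct map_sketch {
	_Atomic(void *) page;	/* NULL until a page has been installed */
};

/* Returns 0 once map->page is set, -1 if no page could be provided. */
static int alloc_map_page(struct map_sketch *map)
{
	void *expected = NULL;
	void *page;

	/* Fast path: a page is already installed. */
	if (atomic_load(&map->page))
		return 0;

	page = calloc(1, PAGE_BYTES);
	if (!page)
		/* Allocation failed, but another thread may have installed one. */
		return atomic_load(&map->page) ? 0 : -1;

	/* Install our page only if the slot is still empty. */
	if (!atomic_compare_exchange_strong(&map->page, &expected, page))
		free(page);	/* someone raced with us installing it */

	return 0;
}
```

As in the patch, the loser of the race frees its own allocation and both callers end up using the single installed page.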
[Devel] [PATCH 3/7] [PATCH] Add target_pid parameter to alloc_pidmap()
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:41 -0700
Subject: [PATCH 3/7] [PATCH] Add target_pid parameter to alloc_pidmap()

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 kernel/pid.c |   28 ++++++++++++++++++++++++++--
 1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index fd72ad9..93406c6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,12 +147,36 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	if (pid >= pid_max)
+		return -EINVAL;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+
+	if (alloc_pidmap_page(map))
+		return -ENOMEM;
+
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+
+	atomic_dec(&map->nr_free);
+	return pid;
+}
+
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
 	int rc = -EAGAIN;
 
+	if (target_pid)
+		return set_pidmap(pid_ns, target_pid);
+
 	pid = last + 1;
 	if (pid >= pid_max)
 		pid = RESERVED_PIDS;
@@ -269,7 +293,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.5.2.5
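What set_pidmap() does (claim one specific pid in a bitmap, failing if it is out of range or already taken) can be tried out in user space with a simplified sketch. The names `struct pidmap_sketch`, `set_pid_sketch()` and `PID_MAX_SKETCH` are hypothetical, the bitmap is a flat array rather than per-page maps, and the bit update is not atomic, unlike test_and_set_bit() in the patch.

```c
#include <errno.h>
#include <limits.h>

#define PID_MAX_SKETCH 1024
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct pidmap_sketch {
	unsigned long bits[PID_MAX_SKETCH / (sizeof(unsigned long) * CHAR_BIT)];
	int nr_free;
};

/* Claim exactly `pid`; returns the pid on success or a negative errno. */
static int set_pid_sketch(struct pidmap_sketch *map, int pid)
{
	unsigned long mask;

	if (pid < 0 || pid >= PID_MAX_SKETCH)
		return -EINVAL;

	mask = 1UL << (pid % BITS_PER_WORD);

	/* Single-threaded stand-in for test_and_set_bit(). */
	if (map->bits[pid / BITS_PER_WORD] & mask)
		return -EBUSY;
	map->bits[pid / BITS_PER_WORD] |= mask;

	map->nr_free--;
	return pid;
}
```

Requesting the same pid twice yields -EBUSY the second time, which is the error the clone_with_pids() description later in this series relies on.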
[Devel] [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:43 -0700
Subject: [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()

The new parameter will be used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 kernel/fork.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d2d69d3..373411e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
-- 
1.5.2.5
[Devel] [PATCH 4/7] [PATCH] Add target_pids parameter to alloc_pid()
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:42 -0700
Subject: [PATCH 4/7] [PATCH] Add target_pids parameter to alloc_pid()

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index f8411a8..d2d69d3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -954,6 +954,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1119,7 +1120,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 93406c6..4b2373a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -279,13 +279,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	int tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
@@ -293,7 +294,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp, 0);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = alloc_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.5.2.5
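The per-level loop added to alloc_pid() picks `target_pids[i]` for namespace level `i` when a target array is given, and falls back to normal allocation when the entry (or the whole array) is zero/NULL. A toy user-space sketch of that selection logic follows. `struct ns_sketch`, `pick_pid()` and `assign_pids()` are hypothetical, the "allocator" is just a counter, and the sketch omits the busy and range checks that the real alloc_pidmap() path performs.

```c
#include <stddef.h>

/* Hypothetical per-namespace allocator state: only a "next pid" counter. */
struct ns_sketch {
	int next_pid;
};

/* A target of 0 means "let the namespace choose", as in the patch. */
static int pick_pid(struct ns_sketch *ns, int target)
{
	return target ? target : ns->next_pid++;
}

/*
 * Fill numbers[i] for levels level..0. ns_chain[i] is the namespace at
 * depth i, with index 0 standing in for the initial pid namespace.
 */
static void assign_pids(struct ns_sketch *ns_chain, int level,
			const int *target_pids, int *numbers)
{
	int i;

	for (i = level; i >= 0; i--) {
		int tpid = 0;

		if (target_pids)
			tpid = target_pids[i];

		numbers[i] = pick_pid(&ns_chain[i], tpid);
	}
}
```

With targets `{ 0, 77, 99 }` and three levels, level 0 receives whatever the counter hands out while levels 1 and 2 receive 77 and 99.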
[Devel] [PATCH 7/7] [PATCH] Define clone_with_pids syscall
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:45 -0700
Subject: [PATCH 7/7] [PATCH] Define clone_with_pids syscall

clone_with_pids() is the same as clone(), except that it takes a 'target_pid_set' parameter which lets the caller choose a specific pid number for the child process in each of the child process's pid namespaces. This system call would be needed to implement Checkpoint/Restart (i.e. after a checkpoint, restart a process with its original pids).

Call clone_with_pids as follows:

	pid_t pids[] = { 0, 77, 99 };
	struct target_pid_set pid_set;

	pid_set.num_pids = sizeof(pids) / sizeof(int);
	pid_set.target_pids = pids;

	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel continues to assign a pid for the process in that namespace. In the above example, pids[0] is 0, meaning the kernel will assign the next available pid to the process in init_pid_ns. But the kernel will assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either 77 or 99 is taken, the system call fails with -EBUSY.

If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces, the system call fails with -EINVAL.

It's mostly an exploratory patch seeking feedback on the interface.

NOTE: Compared to clone(), clone_with_pids() needs to pass in two more pieces of information:

	- number of pids in the set
	- user buffer containing the list of pids.

But since clone() already takes 5 parameters, use a 'struct target_pid_set'.

TODO:
	- Gently tested.
	- May need additional sanity checks in check_target_pids()
	- Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in the namespace is either 1 or 0).

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/x86/include/asm/syscalls.h    |    1 +
 arch/x86/include/asm/unistd_32.h   |    1 +
 arch/x86/kernel/entry_32.S         |    1 +
 arch/x86/kernel/process_32.c       |   94 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/linux/types.h              |    5 ++
 6 files changed, 103 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 7043408..1fdc149 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -31,6 +31,7 @@ asmlinkage int sys_get_thread_area(struct user_desc __user *);
 /* kernel/process_32.c */
 int sys_fork(struct pt_regs *);
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
 int sys_vfork(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..90f906f 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_clone_with_pids	335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c929add..ee92b0d 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -707,6 +707,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 76f8f84..65b27a8 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -445,6 +445,100 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+static int check_target_pids(unsigned long clone_flags,
+			struct target_pid_set *pid_setp)
+{
+	/*
+	 * CLONE_NEWPID implies pid == 1
+	 *
+	 * TODO: Maybe this should be more fine-grained (i.e would we want
+	 *	 to have a container-init have a specific pid in ancestor
+	 *	 namespaces ?)
+	 */
+	if (clone_flags & CLONE_NEWPID)
+		return -EINVAL;
+
+	/* number of pids must match current nesting level of pid ns */
+	if (pid_setp->num_pids > task_pid(current)->level + 1)
+		return -EINVAL;
+
+	/* TODO: More sanity checks ? */
+
+	return 0;
+}
+
+static pid_t *copy_target_pids(unsigned long clone_flags, void __user *upid_setp)
+{
+	int rc;
+	int size;
+	pid_t __user *utarget_pids;
+	pid_t *target_pids;
+	struct target_pid_set pid_set;
+
+	if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set)))
+		return ERR_PTR(-EFAULT);
+
+	size = pid_set.num_pids *
[Devel] [PATCH 6/7] [PATCH] Define do_fork_with_pids()
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:44 -0700
Subject: [PATCH 6/7] [PATCH] Define do_fork_with_pids()

do_fork_with_pids() is same as do_fork(), except that it takes an additional, target_pids, parameter. This parameter, currently unused, specifies the target_pids of the process in each of its pid namespaces.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 include/linux/sched.h |    1 +
 kernel/fork.c         |   17 ++++++++++++++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..2173df1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1995,6 +1995,7 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t *target_pids);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index 373411e..912d008 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1340,17 +1340,17 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      pid_t *target_pids)
 {
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1448,6 +1448,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.5.2.5
[Devel] Re: [PATCH 0/8] a start to credentials c/r
Serge E. Hallyn wrote:
Quoting Casey Schaufler (ca...@schaufler-ca.com):
Serge E. Hallyn wrote:

Following is the next version of the credentials c/r patchset, on top of the c/r patchset at git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

It implements checkpoint and restart of user, user namespaces, groups, supplementary groups, and struct cred.

There is a question as to what to do about LSM data at restart. Right now I'm ignoring it, which means that prepare_creds() should ensure that the restart tasks get the context of the task calling sys_restart(). I suspect the right thing to do is to add two new LSM hooks, one which checks current's authorization to restart from the checkpoint file,

How would that work? Based on information in the file? You have to assume that some number of checkpoint files have been hand written by Elbonian ne'er do wells.

Not based on information in the file, but based on the credentials of the task which created the file, and whether an unprivileged task could have hand-edited the file before feeding it to sys_restart(). So some example decisions in terms of selinux contexts might be,

1. a task of user_u may restart a file of type user_u if the checkpointed context is user_u
2. a task of user_u may NOT restart a file of type user_u if the checkpointed context is root_u
3. a task of root_u may restart a file of type root_u if the checkpointed context is user_u

Uh, so yes, based on info in the file as well :) Except of course the LSM would just be fed the checkpointed context and the checkpoint file context (and can deduce current's context).

And SELinux can do whatever calculations it likes based on the three contexts and the loaded policy. Are you at all concerned about the possibility that the policy may have changed? I can envision scenarios in which it would be impossible for a process to gain a particular context under current policy, but that a checkpointed process may have stored away.

and one which determines the task->cred->security field based upon any of:

1. current_security() of the task calling sys_restart()
2. the task->cred->security checkpointed in the ckpt file
3. the ->security of the checkpoint file

For Smack the correct behavior would be:

1. for sys_restart() callers without CAP_MAC_ADMIN
2. for sys_restart() callers with CAP_MAC_ADMIN
3. never

That makes sense, and is basically analogous (if I'm thinking right) to how I'm doing capabilities. So the first (authorization hook) for smack would just always return TRUE?

I suggest that it needs to check for a valid Smack label. Even though they're just text strings they do have limitations, including size (> 0 < 24) and character set. A call to smk_import() is the right way to do it, as it also makes sure the label is in the internal list. If smk_import() returns NULL something's amiss.

I can hook that up right now...

I bet you could do it even with the call to smk_import. (smiley here)

sys_restart() callers running with CAP_MAC_ADMIN would have to be very very careful about the files they restart. But that's nothing new in the MAC world.

Yup. Mind you eventually I expect a setup where some privileged program is asked (by privileged or unprivileged tasks) to create a checkpoint and ask the TPM to sign it. No unprivileged program can sign an image directly, so then a restart of a task with privilege can be restricted to anything with a valid signature. In that case, it may be safe to have the checkpointed task's credentials completely restored, including LSM labels.

All of the current LSMs share the property that the access control rules (SELinux policy, Smack access rules, TOMOYO policy) may change between the time of checkpoint and the time of restart. If I had a silver bullet answer to the concerns that raises I'd pass it along, but as I don't I'll stick to the answer I have for Smack (the rules of the moment are those that matter, and the architecture of Smack supports that) and leave the other LSMs to their own devices.

But that's a ways off.

It does look like a bit of work. Thank you.

thanks,
-serge
[Devel] Re: [PATCH 18/38] C/R: core stuff
On Fri, May 22, 2009 at 08:55:12AM +0400, Alexey Dobriyan wrote:

+static int task_struct_restorer(void *_tsk_ctx)
+{
+	struct task_struct_restore_context *tsk_ctx = _tsk_ctx;
+	struct kstate_image_task_struct *i = tsk_ctx->i;
+	struct kstate_context *ctx = tsk_ctx->ctx;
+	/* In the name of symmetry. */
+	struct task_struct *tsk = current, *real_parent;
+	int rv;
+
+	pr_debug("%s: ENTER tsk %p/%s\n", __func__, tsk, tsk->comm);
+
+	write_lock_irq(&tasklist_lock);
+	real_parent = ctx->init_tsk->nsproxy->pid_ns->child_reaper;
+	tsk->real_parent = tsk->parent = real_parent;
+	list_move_tail(&tsk->sibling, &tsk->real_parent->sibling);
	                                                  ^^^
+	write_unlock_irq(&tasklist_lock);

Eek, what a stupid bug here

commit 2c4b5f5d606a1892b702d95a0e4d29f207685381
Author: Alexey Dobriyan adobri...@gmail.com
Date: Wed May 27 20:21:59 2009 +0400

    C/R: fix stupid bug in reparenting

    Child process should be added to ->children list of course

    Signed-off-by: Alexey Dobriyan adobri...@gmail.com

diff --git a/kernel/kstate/kstate-task.c b/kernel/kstate/kstate-task.c
index 9ed5a19..6df7d25 100644
--- a/kernel/kstate/kstate-task.c
+++ b/kernel/kstate/kstate-task.c
@@ -490,7 +490,7 @@ static int task_struct_restorer(void *_tsk_ctx)
 		real_parent = tmp->o_obj;
 	}
 	tsk->real_parent = tsk->parent = real_parent;
-	list_move_tail(&tsk->sibling, &tsk->real_parent->sibling);
+	list_move_tail(&tsk->sibling, &tsk->real_parent->children);
 	write_unlock_irq(&tasklist_lock);
 
 	rv = restore_mm(ctx, i->ref_mm);
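The fix hinges on the difference between a task's `sibling` node (its link in the parent's list) and the parent's `children` list head. The sketch below re-creates that relationship with a tiny hand-written circular list. It is a stand-alone illustration, not the kernel's list.h: `struct list_node`, `struct task_sketch` and the helper names are hypothetical.

```c
/* Minimal circular doubly-linked list, written out for this sketch. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *n)
{
	n->next = n->prev = n;
}

static void list_del_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_move_tail_node(struct list_node *n, struct list_node *head)
{
	list_del_node(n);
	list_add_tail_node(n, head);
}

struct task_sketch {
	struct list_node children;	/* head of this task's list of children */
	struct list_node sibling;	/* this task's link in its parent's list */
};

static void task_init(struct task_sketch *t)
{
	list_init(&t->children);
	list_init(&t->sibling);
}

/* The corrected operation: link the child's sibling node into parent->children. */
static void reparent(struct task_sketch *tsk, struct task_sketch *new_parent)
{
	list_move_tail_node(&tsk->sibling, &new_parent->children);
}

static int count_children(const struct task_sketch *t)
{
	const struct list_node *n;
	int count = 0;

	for (n = t->children.next; n != &t->children; n = n->next)
		count++;
	return count;
}
```

Moving the node onto the parent's `sibling` link instead would splice the child into the list the parent itself sits on, so walking `parent->children` would never find it.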
[Devel] [PATCH 1/1] cr: safely restore a task's securebits
As I started doing c/r of LSM credentials, I realized I wasn't handling
securebits yet with the current patchset. This is on top of the set I sent
out yesterday, and if I'm sending out another full patchset then I'll
integrate this back into the previous patches.

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 include/linux/capability.h     |    2 +
 include/linux/checkpoint_hdr.h |    1 +
 kernel/cred.c                  |    4 +++
 security/commoncap.c           |   52 +++-
 4 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 572b5a0..b3853ca 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -540,6 +540,8 @@ extern void checkpoint_save_cap(__u64 *dest, kernel_cap_t src);
 struct cred;
 extern int checkpoint_restore_cap(__u64 e, __u64 i, __u64 p, __u64 x,
 				  struct cred *cred);
+extern void checkpoint_save_securebits(unsigned *, unsigned);
+extern int checkpoint_restore_securebits(unsigned, struct cred *);
 
 /**
  * has_capability - Determine if a task has a superior capability available
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0bad447..7f65964 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -192,6 +192,7 @@ struct ckpt_hdr_cred {
 	__u32 version; /* especially since capability sets might grow */
 	__u32 uid, suid, euid, fsuid;
 	__u32 gid, sgid, egid, fsgid;
+	__u32 securebits;
 	__u64 cap_i, cap_p, cap_e;
 	__u64 cap_x; /* bounding set ('X') */
 	__s32 user_ref;
diff --git a/kernel/cred.c b/kernel/cred.c
index c05192e..fe2941d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -739,6 +739,7 @@ int checkpoint_write_cred(struct ckpt_ctx *ctx, const struct cred *cred)
 	checkpoint_save_cap(&h->cap_p, cred->cap_permitted);
 	checkpoint_save_cap(&h->cap_e, cred->cap_effective);
 	checkpoint_save_cap(&h->cap_x, cred->cap_bset);
+	checkpoint_save_securebits(&h->securebits, cred->securebits);
 
 	h->user_ref = user_ref;
 	h->groupinfo_ref = groupinfo_ref;
@@ -811,6 +812,9 @@ struct cred *restore_read_cred(struct ckpt_ctx *ctx)
 				     cred);
 	if (ret)
 		goto err_putcred;
+	ret = checkpoint_restore_securebits(h->securebits, cred);
+	if (ret)
+		goto err_putcred;
 
 	ckpt_hdr_put(ctx, h);
 	return cred;
diff --git a/security/commoncap.c b/security/commoncap.c
index beac025..31ecd3d 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -804,6 +804,29 @@ int cap_task_setnice (struct task_struct *p, int nice)
 }
 #endif
 
+int cap_set_securebits(struct cred *new, unsigned securebits)
+{
+	if ((((new->securebits & SECURE_ALL_LOCKS) >> 1)
+	     & (new->securebits ^ securebits))				/*[1]*/
+	    || ((new->securebits & SECURE_ALL_LOCKS & ~securebits))	/*[2]*/
+	    || (securebits & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS))	/*[3]*/
+	    || (cap_capable(current, current_cred(), CAP_SETPCAP,
+			    SECURITY_CAP_AUDIT) != 0)			/*[4]*/
+		/*
+		 * [1] no changing of bits that are locked
+		 * [2] no unlocking of locks
+		 * [3] no setting of unsupported bits
+		 * [4] doing anything requires privilege (go read about
+		 *     the "sendmail capabilities bug")
+		 */
+	    )
+		/* cannot change a locked bit */
+		return -EPERM;
+	new->securebits = securebits;
+	return 0;
+}
+
+
 /**
  * cap_task_prctl - Implement process control functions for this security module
  * @option: The process control function requested
@@ -861,24 +884,9 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 	 * capability-based-privilege environment.
 	 */
 	case PR_SET_SECUREBITS:
-		error = -EPERM;
-		if ((((new->securebits & SECURE_ALL_LOCKS) >> 1)
-		     & (new->securebits ^ arg2))			/*[1]*/
-		    || ((new->securebits & SECURE_ALL_LOCKS & ~arg2))	/*[2]*/
-		    || (arg2 & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS))	/*[3]*/
-		    || (cap_capable(current, current_cred(), CAP_SETPCAP,
-				    SECURITY_CAP_AUDIT) != 0)		/*[4]*/
-			/*
-			 * [1] no changing of bits that are locked
-			 * [2] no unlocking of locks
-			 * [3] no setting of unsupported bits
-			 * [4] doing anything requires privilege (go read about
-			 *     the "sendmail capabilities bug")
-			 */
-		    )
-			/* cannot change a locked
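The four-part test in cap_set_securebits() is dense, so here is a small userspace sketch of the same bit logic. It assumes the layout in which each flag sits on an even bit and its lock on the odd bit directly above it (so the lock mask is the flag mask shifted left by one); the `SK_` names, the mask values, and the `privileged` parameter standing in for the CAP_SETPCAP check are assumptions made for this illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed layout: flags on bits 0, 2, 4; each lock on the bit above. */
#define SK_ALL_BITS	0x15u
#define SK_ALL_LOCKS	(SK_ALL_BITS << 1)

/*
 * Try to replace *cur with want. 'privileged' is a stand-in for the
 * capability check done in the real code.
 */
static int sk_set_securebits(unsigned *cur, unsigned want, int privileged)
{
	assert(cur != NULL);

	if ((((*cur & SK_ALL_LOCKS) >> 1) & (*cur ^ want))	/* [1] locked bit would change */
	    || (*cur & SK_ALL_LOCKS & ~want)			/* [2] a lock would be dropped */
	    || (want & ~(SK_ALL_LOCKS | SK_ALL_BITS))		/* [3] unsupported bit requested */
	    || !privileged)					/* [4] caller lacks privilege */
		return -EPERM;

	*cur = want;
	return 0;
}
```

Shifting the lock mask right by one lines each lock up with the flag it protects, so ANDing with the XOR of old and new values catches exactly the locked flags that would change.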
[Devel] Re: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page
On Wed, 2009-05-27 at 08:42 -0700, Sukadev Bhattiprolu wrote:
> From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> Date: Mon, 4 May 2009 01:17:39 -0700
> Subject: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page
>
> Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> ---
>  kernel/pid.c |   43 ---
>  1 files changed, 28 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index b2e5f78..c0aaebe 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -122,6 +122,31 @@ static void free_pidmap(struct upid *upid)
>  	atomic_inc(&map->nr_free);
>  }
>
> +static int alloc_pidmap_page(struct pidmap *map)
> +{
> +	void *page;
> +
> +	if (likely(map->page))
> +		return 0;
> +
> +	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +
> +	/*
> +	 * Free the page if someone raced with us installing it:
> +	 */
> +	spin_lock_irq(&pidmap_lock);
> +	if (map->page)
> +		kfree(page);
> +	else
> +		map->page = page;
> +	spin_unlock_irq(&pidmap_lock);
> +
> +	if (unlikely(!map->page))
> +		return -1;
> +

-ENOMEM, please

Otherwise looks fine. Please at least add some minimal patch description
about what you're doing and why, though.

-- Dave
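The allocate-outside-the-lock, install-under-the-lock pattern in the quoted helper can be sketched in userspace as follows, with the review comment applied (return -ENOMEM rather than -1). This is only an illustration under stated assumptions: `struct sk_slot`, the `atomic_flag` spinlock, and the function names are hypothetical, and `calloc()` stands in for `kzalloc()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pidmap entry that lazily gets a page. */
struct sk_slot {
	_Atomic(void *) page;
};

/* Tiny spinlock built on atomic_flag, standing in for pidmap_lock. */
static atomic_flag sk_slot_lock = ATOMIC_FLAG_INIT;

static void sk_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&sk_slot_lock,
						 memory_order_acquire))
		; /* spin until the holder clears the flag */
}

static void sk_unlock(void)
{
	atomic_flag_clear_explicit(&sk_slot_lock, memory_order_release);
}

static int sk_slot_alloc_page(struct sk_slot *s, size_t size)
{
	void *page;

	assert(s != NULL);

	/* Fast path: somebody already installed a page. */
	if (atomic_load(&s->page))
		return 0;

	/* Allocate before taking the lock so the lock is held only briefly. */
	page = calloc(1, size);

	/* Free our page if someone raced with us installing theirs. */
	sk_lock();
	if (atomic_load(&s->page))
		free(page);
	else
		atomic_store(&s->page, page);
	sk_unlock();

	/* Still empty means our allocation failed and nobody else filled it. */
	if (!atomic_load(&s->page))
		return -ENOMEM;
	return 0;
}
```

Note that a failed allocation is not an error if a racing caller installed a page in the meantime, which is why the final check looks at the slot and not at the local pointer.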
[Devel] Re: [PATCH 2/7] [PATCH] Have alloc_pidmap() return actual error code
On Wed, 2009-05-27 at 08:42 -0700, Sukadev Bhattiprolu wrote:
> From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> Date: Mon, 4 May 2009 01:17:40 -0700
> Subject: [PATCH 2/7] [PATCH] Have alloc_pidmap() return actual error code
>
> alloc_pidmap() can fail either because all pid numbers are in use or
> we can't allocate memory. With support for setting a specific pid
> number, alloc_pidmap() would also fail if either the given pid number
> is invalid or in use.
>
> Rather than have caller assume -ENOMEM, have alloc_pidmap() return
> the actual error.
>
> Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> ---
>  kernel/fork.c |    5 +++--
>  kernel/pid.c  |    9 ++---
>  2 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b9e2edd..f8411a8 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1119,10 +1119,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  		goto bad_fork_cleanup_io;
>
>  	if (pid != &init_struct_pid) {
> -		retval = -ENOMEM;
>  		pid = alloc_pid(p->nsproxy->pid_ns);
> -		if (!pid)
> +		if (IS_ERR(pid)) {
> +			retval = PTR_ERR(pid);
>  			goto bad_fork_cleanup_io;
> +		}
>
>  		if (clone_flags & CLONE_NEWPID) {
>  			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
> diff --git a/kernel/pid.c b/kernel/pid.c
> index c0aaebe..fd72ad9 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -151,6 +151,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  {
>  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
>  	struct pidmap *map;
> +	int rc = -EAGAIN;
>
>  	pid = last + 1;
>  	if (pid >= pid_max)
> @@ -159,8 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
>  	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
>  	for (i = 0; i <= max_scan; ++i) {
> -		if (alloc_pidmap_page(map))
> +		if (alloc_pidmap_page(map)) {
> +			rc = -ENOMEM;
>  			break;
> +		}

OK, pet peeve time:

	rc = alloc_pidmap_page(map);
	if (rc)
		break;

It saves the bracket and saves a line of assignment, *and* it clarifies
program flow.
-- Dave
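As a side note on the patch under review, the idea of returning the actual error through a pointer-returning allocator can be sketched in userspace like this. The `sk_` helpers below only imitate the kernel's error-pointer convention (small negative errno values encoded in the top of the address range); they are hypothetical names written for this note, and the tiny id allocator is invented purely to have two distinct failure causes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *sk_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno value from an error pointer. */
static inline long sk_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if p lies in the last SK_MAX_ERRNO bytes of the address space. */
static inline int sk_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-SK_MAX_ERRNO);
}

struct sk_pid { int nr; };

/* Hypothetical id allocator: -EAGAIN once all ids below max are used. */
static int sk_alloc_nr(int *next, int max)
{
	if (*next >= max)
		return -EAGAIN;
	return (*next)++;
}

/* Returns a valid object or an error pointer, never NULL. */
static struct sk_pid *sk_alloc_pid(int *next, int max)
{
	struct sk_pid *pid;
	int nr = sk_alloc_nr(next, max);

	if (nr < 0)
		return sk_err_ptr(nr);

	pid = malloc(sizeof(*pid));
	if (!pid)
		return sk_err_ptr(-ENOMEM);
	pid->nr = nr;
	return pid;
}

/* The caller forwards whatever error the allocator reported. */
static int sk_fork(int *next, int max, struct sk_pid **out)
{
	struct sk_pid *pid = sk_alloc_pid(next, max);

	assert(out != NULL);
	if (sk_is_err(pid))
		return (int)sk_ptr_err(pid);
	*out = pid;
	return 0;
}
```

The caller no longer needs to guess a cause: running out of ids surfaces as -EAGAIN and a failed allocation as -ENOMEM.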
[Devel] [RFC v16][PATCH 02/43] c/r: make file_pos_read/write() public
These two are used in the next patch when calling vfs_read/write()

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 fs/read_write.c    |   10 --
 include/linux/fs.h |   10 ++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 9d1e76b..ed63ea3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 
 EXPORT_SYMBOL(vfs_write);
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
-	file->f_pos = pos;
-}
-
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
 	struct file *file;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b534e5..9c4348a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1546,6 +1546,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 			      struct iovec *fast_pointer,
 			      struct iovec **ret_pointer);
 
+static inline loff_t file_pos_read(struct file *file)
+{
+	return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+	file->f_pos = pos;
+}
+
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-- 
1.6.0.4
[Devel] [RFC v16][PATCH 00/43] Kernel based checkpoint/restart
Application checkpoint/restart (c/r) is the ability to save the state
of a running application so that it can later resume its execution
from the time at which it was checkpointed, on the same or a different
machine.

Here is another round of the c/r patchset. The patches are reordered
to reduce size and for easier review, and the code is more stable.
See the changelog below for details. Hey, it even includes renaming
of functions and files ...

Most importantly, it's a working proof-of-concept and has been tested
with v2.6.30-rc7. And while not everything is supported, it provides
a glimpse at _how_ things are done.

For more information, check out Documentation/checkpoint/*.txt

Q: How useful is this code as it stands in real-world usage?
A: Right now, the application can be single- or multi-process. Open
   files are supported: regular files and directories on ext[234],
   pipes, and /dev/{null,zero,random,urandom}. All sorts of shared
   memory work. sysv IPC also works (except for semaphore undo). The
   restart does not yet preserve the original pid(s), but patches are
   already circulating. Definitely already suitable for many types of
   batch jobs. (Note: it is assumed that the fs view is available at
   restart).

Q: What can it checkpoint and restart ?
A: A (single threaded) process can checkpoint itself, aka "self"
   checkpoint, if it calls the new system calls. Otherwise, for an
   "external" checkpoint, the caller must first freeze the target
   process(es). One can either checkpoint an entire container (and we
   make a best effort to ensure that the result is self-contained), or
   merely a subtree of a process hierarchy.

Q: What about namespaces ?
A: Currently, UTS and IPC namespaces are restored. They demonstrate
   how namespaces are handled. More to come.

Q: What additional work needs to be done to it?
A: Fill in the gory details following the examples so far. The short
   term plan is: restore pids, complete work on threads, zombies,
   signals, and more file types.

Q: How can I try it ?
A: This one can actually be used for simple batch jobs (pipes, too),
   a whole container or just a subtree of tasks. Try it:

   create the freezer cgroup:
     $ mount -t cgroup -ofreezer freezer /freezer
     $ mkdir /freezer/0

   run the test, freeze it:
     $ test/multitask &
     [1] 2754
     $ for i in `pidof multitask`; do echo $i > /freezer/0/tasks; done
     $ echo FROZEN > /freezer/0/freezer.state

   checkpoint:
     $ ./ckpt 2754 > ckpt.out

   restart:
     $ ./mktree < ckpt.out

   voila :)

To do all this, you'll need:

The git tree tracking v14, branch 'ckpt-v14' (and past versions):
  git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

Restarting multiple processes requires the 'mktree' userspace tool
with the matching branch (v14):
  git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

Oren.


Changelog:

[2009-May-27] v16
  - Privilege checks for IPC checkpoint
  - Fix error string generation during checkpoint
  - Use kzalloc for header allocation
  - Restart blocks are arch-independent
  - Redo pipe c/r using splice
  - Fixes to s390 arch
  - Remove powerpc arch (temporary)
  - Explicitly restore ->nsproxy
  - All objects in image are preceded by 'struct ckpt_hdr'
  - Fix leaks detection (and leaks)
  - Reorder of patchset
  - Misc bugs and compilation fixes

[2009-Apr-12] v15
  - Minor fixes

[2009-Apr-28] v14
  - Tested against kernel v2.6.30-rc3 on x86_32.
  - Refactor files checkpoint to use f_ops (file operations)
  - Refactor mm/vma to use vma_ops
  - Explicitly handle VDSO vma (and require compat mode)
  - Added code to c/r restart-blocks (restart timeout related syscalls)
  - Added code to c/r namespaces: uts, ipc (with Dan Smith)
  - Added code to c/r sysvipc (shm, msg, sem)
  - Support for VM_CLONE shared memory
  - Added resource leak detection for whole-container checkpoint
  - Added sysctl gauge to allow unprivileged restart/checkpoint
  - Improve and simplify the code and logic of shared objects
  - Rework image format: shared objects appear prior to their use
  - Merge checkpoint and restart functionality into same files
  - Massive renaming of functions: prefix "ckpt_" for generics,
    "checkpoint_" for checkpoint, and "restore_" for restart.
  - Report checkpoint errors as a valid (string record) in the output
  - Merged PPC architecture (by Nathan Lynch),
  - Requires updates to userspace tools too.
  - Misc nits and bug fixes

[2009-Mar-31] v14-rc2
  - Change along Dave's suggestion to use f_ops->checkpoint() for files
  - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
  - Merge support for PPC arch (Nathan Lynch)
  - Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
  - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
  - Check whether calls to cr_hbuf_get() succeed or fail.
  - Fixes to pipe c/r code
  - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
  - Refuse non-self
[Devel] Re: [PATCH] io-controller: Add io group reference handling for request
On Wed, May 27, 2009 at 03:56:31PM +0900, Ryo Tsuruta wrote: Hi Andrea and Vivek, Ryo Tsuruta r...@valinux.co.jp wrote: Hi Andrea and Vivek, From: Andrea Righi righi.and...@gmail.com Subject: Re: [PATCH] io-controller: Add io group reference handling for request Date: Mon, 18 May 2009 16:39:23 +0200 On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote: On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote: On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote: On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote: On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote: Vivek Goyal wrote: ... } @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru /* * Find the io group bio belongs to. * If create is set, io group is created if it is not already present. + * If curr is set, io group is information is searched for current + * task and not with the help of bio. + * + * FIXME: Can we assume that if bio is NULL then lookup group for current + * task and not create extra function parameter ? * - * Note: There is a narrow window of race where a group is being freed - * by cgroup deletion path and some rq has slipped through in this group. - * Fix it. */ -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio, - int create) +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio, + int create, int curr) Hi Vivek, IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL, get iog from bio, otherwise get it from current task. Consider also that get_cgroup_from_bio() is much more slow than task_cgroup() and need to lock/unlock_page_cgroup() in get_blkio_cgroup_id(), while task_cgroup() is rcu protected. True. BTW another optimization could be to use the blkio-cgroup functionality only for dirty pages and cut out some blkio_set_owner(). For all the other cases IO always occurs in the same context of the current task, and you can use task_cgroup(). 
Yes, may be in some cases we can avoid setting page owner. I will get to it once I have got functionality going well. In the mean time if you have a patch for it, it will be great. However, this is true only for page cache pages, for IO generated by anonymous pages (swap) you still need the page tracking functionality both for reads and writes. Right now I am assuming that all the sync IO will belong to task submitting the bio hence use task_cgroup() for that. Only for async IO, I am trying to use page tracking functionality to determine the owner. Look at elv_bio_sync(bio). You seem to be saying that there are cases where even for sync IO, we can't use submitting task's context and need to rely on page tracking functionlity? I think that there are some kernel threads (e.g., dm-crypt, LVM and md devices) which actually submit IOs instead of tasks which originate the IOs. When IOs are submitted from such kernel threads, we can't use submitting task's context to determine to which cgroup the IO belongs. In case of getting page (read) from swap, will it not happen in the context of process who will take a page fault and initiate the swap read? No, for example in read_swap_cache_async(): @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, */ __set_page_locked(new_page); SetPageSwapBacked(new_page); + blkio_cgroup_set_owner(new_page, current-mm); err = add_to_swap_cache(new_page, entry, gfp_mask GFP_KERNEL); if (likely(!err)) { /* This is a read, but the current task is not always the owner of this swap cache page, because it's a readahead operation. But will this readahead be not initiated in the context of the task taking the page fault? handle_pte_fault() do_swap_page() swapin_readahead() read_swap_cache_async() If yes, then swap reads issued will still be in the context of process and we should be fine? Right. I was trying to say that the current task may swap-in also
[Devel] [RFC v16][PATCH 35/43] c/r (ipc): export interface from ipc/shm.c to delete ipc shm
Export shmctl_down() which will be used in the next patch during
restart to delete an ipc shm (the shm is mapped already, so it won't
be lost).

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 include/linux/shm.h |    4 
 ipc/shm.c           |    4 ++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/shm.h b/include/linux/shm.h
index eca6235..ec36e99 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -118,6 +118,10 @@ static inline int is_file_shm_hugepages(struct file *file)
 }
 #endif
 
+struct ipc_namespace;
+extern int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+		       struct shmid_ds __user *buf, int version);
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_SHM_H_ */
diff --git a/ipc/shm.c b/ipc/shm.c
index 7dd5f0c..8aba22f 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -598,8 +598,8 @@ static void shm_get_stat(struct ipc_namespace *ns, unsigned long *rss,
  * to be held in write mode.
  * NOTE: no locks must be held, the rw_mutex is taken inside this function.
  */
-static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
-		       struct shmid_ds __user *buf, int version)
+int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+		struct shmid_ds __user *buf, int version)
 {
 	struct kern_ipc_perm *ipcp;
 	struct shmid64_ds shmid64;
-- 
1.6.0.4
[Devel] [RFC v16][PATCH 33/43] c/r (ipc): helpers to save and restore kern_ipc_perm structures
Add the helpers to save and restore the contents of 'struct kern_ipc_perm'. Add header structures for ipc state. Put place-holders to save and restore ipc state. TODO: This patch does _not_ address the issues of users/groups and the related security issues. For now, it saves the old user/group of ipc objects, but does not restore them during restart. Signed-off-by: Oren Laadan or...@cs.columbia.edu --- include/linux/checkpoint.h |7 +++- include/linux/checkpoint_hdr.h | 29 ++ ipc/Makefile |1 + ipc/checkpoint.c | 81 ipc/util.h |8 5 files changed, 125 insertions(+), 1 deletions(-) diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index 5a42399..9a7517f 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -10,6 +10,10 @@ * distribution for more details. */ +#include linux/sched.h +#include linux/nsproxy.h +#include linux/ipc_namespace.h + #include linux/checkpoint_types.h #include linux/checkpoint_hdr.h #include asm/checkpoint_hdr.h @@ -157,8 +161,9 @@ extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode); #define CKPT_DFILE 0x10/* files and filesystem */ #define CKPT_DMEM 0x20/* memory state */ #define CKPT_DPAGE 0x40/* memory pages */ +#define CKPT_DIPC 0x80/* sysvipc */ -#define CKPT_DDEFAULT 0x37/* default debug level */ +#define CKPT_DDEFAULT 0xb7/* default debug level */ #ifndef CKPT_DFLAG #define CKPT_DFLAG 0x0 /* nothing */ diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 44a48dc..05769f4 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -70,6 +70,11 @@ enum { CKPT_HDR_PGARR, CKPT_HDR_MM_CONTEXT, + CKPT_HDR_IPC = 501, + CKPT_HDR_IPC_SHM, + CKPT_HDR_IPC_MSG, + CKPT_HDR_IPC_SEM, + CKPT_HDR_TAIL = 9001, CKPT_HDR_ERROR = , @@ -299,4 +304,28 @@ struct ckpt_hdr_pgarr { } __attribute__((aligned(8))); +/* ipc commons */ +struct ckpt_hdr_ipc_perms { + __s32 id; + __u32 key; + __u32 uid; + __u32 gid; + __u32 cuid; + __u32 cgid; + __u32 
mode; + __u32 _padding; + __u64 seq; +} __attribute__((aligned(8))); + + +#define CKPT_TST_OVERFLOW_16(a, b) \ + ((sizeof(a) sizeof(b)) ((a) SHORT_MAX)) + +#define CKPT_TST_OVERFLOW_32(a, b) \ + ((sizeof(a) sizeof(b)) ((a) INT_MAX)) + +#define CKPT_TST_OVERFLOW_64(a, b) \ + ((sizeof(a) sizeof(b)) ((a) LONG_MAX)) + + #endif /* _CHECKPOINT_CKPT_HDR_H_ */ diff --git a/ipc/Makefile b/ipc/Makefile index 4e1955e..aa6c8dd 100644 --- a/ipc/Makefile +++ b/ipc/Makefile @@ -9,4 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y) obj-$(CONFIG_IPC_NS) += namespace.o obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o +obj-$(CONFIG_CHECKPOINT) += checkpoint.o diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c new file mode 100644 index 000..b7b48b0 --- /dev/null +++ b/ipc/checkpoint.c @@ -0,0 +1,81 @@ +/* + * Checkpoint logic and helpers + * + * Copyright (C) 2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DIPC + +#include linux/ipc.h +#include linux/msg.h +#include linux/sched.h +#include linux/ipc_namespace.h +#include linux/checkpoint.h +#include linux/checkpoint_hdr.h + +#include util.h + +int checkpoint_ipcns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns) +{ + return 0; +} + +int restore_ipcns(struct ckpt_ctx *ctx) +{ + return 0; +} + +int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h, + struct kern_ipc_perm *perm) +{ + if (ipcperms(perm, S_IROTH)) + return -EACCES; + + h-id = perm-id; + h-key = perm-key; + h-uid = perm-uid; + h-gid = perm-gid; + h-cuid = perm-cuid; + h-cgid = perm-cgid; + h-mode = perm-mode S_IRWXUGO; + h-seq = perm-seq; + + return 0; +} + +int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h, + struct kern_ipc_perm *perm) +{ + if (h-id 0) + return -EINVAL; + if (CKPT_TST_OVERFLOW_16(h-uid, perm-uid) || + CKPT_TST_OVERFLOW_16(h-gid, perm-gid) || + CKPT_TST_OVERFLOW_16(h-cuid, perm-cuid) || + CKPT_TST_OVERFLOW_16(h-cgid, perm-cgid) || + CKPT_TST_OVERFLOW_16(h-mode, perm-mode)) + return -EINVAL; + if (h-seq = USHORT_MAX) + return -EINVAL; + if (h-mode ~S_IRWXUGO) +
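The CKPT_TST_OVERFLOW_* macros above lost their operators in this archive; read as "source type is wider than the destination AND the value exceeds the limit", the narrowing check can be sketched in userspace as below. Everything here is an assumption made for illustration: the `SK_` names, the two-field structs, and the use of `SHRT_MAX`/`USHRT_MAX` from `<limits.h>` in place of the kernel's constants. It is not the patch's restore logic.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Nonzero when 'a' has a wider type than 'b' and its value is above
 * the conservative 16-bit limit used by this sketch.
 */
#define SK_TST_OVERFLOW_16(a, b) \
	((sizeof(a) > sizeof(b)) && ((a) > SHRT_MAX))

#define SK_MODE_MASK 0777u	/* stands in for S_IRWXUGO */

/* Wide, fixed-size fields as they would appear in a checkpoint image. */
struct sk_image_perms {
	uint32_t mode;
	uint64_t seq;
};

/* Narrower live fields (hypothetical widths). */
struct sk_live_perms {
	uint16_t mode;
	uint16_t seq;
};

/* Validate every image field before narrowing it into the live struct. */
static int sk_load_perms(const struct sk_image_perms *h,
			 struct sk_live_perms *p)
{
	assert(h != NULL && p != NULL);

	if (SK_TST_OVERFLOW_16(h->mode, p->mode))
		return -EINVAL;
	if (h->seq >= USHRT_MAX)
		return -EINVAL;
	if (h->mode & ~SK_MODE_MASK)
		return -EINVAL;

	p->mode = (uint16_t)h->mode;
	p->seq = (uint16_t)h->seq;
	return 0;
}
```

The point of checking before the cast is that an image file is untrusted input at restart time: a silently truncated value could otherwise turn into a different, valid-looking one.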
[Devel] [RFC v16][PATCH 37/43] c/r (ipc): make 'struct msg_msgseg' visible in ipc/util.h
Move the definition of 'struct msg_msgseg' and constants DATALEN_* to
ipc/util.h, where they are visible to ipc/ckpt_msg.c

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 ipc/msg.c     |    3 +--
 ipc/msgutil.c |    8 
 ipc/util.h    |   10 ++
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 1db7c45..1d5d087 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -72,7 +72,6 @@ struct msg_sender {
 
 #define msg_unlock(msq)		ipc_unlock(&(msq)->q_perm)
 
-static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
 static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
@@ -278,7 +277,7 @@ static void expunge_all(struct msg_queue *msq, int res)
  * msg_ids.rw_mutex (writer) and the spinlock for this message queue are held
  * before freeque() is called. msg_ids.rw_mutex remains locked on exit.
  */
-static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct list_head *tmp;
 	struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index f095ee2..e119243 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -36,14 +36,6 @@ struct ipc_namespace init_ipc_ns = {
 
 atomic_t nr_ipc_ns = ATOMIC_INIT(1);
 
-struct msg_msgseg {
-	struct msg_msgseg* next;
-	/* the next part of the message follows immediately */
-};
-
-#define DATALEN_MSG	(PAGE_SIZE-sizeof(struct msg_msg))
-#define DATALEN_SEG	(PAGE_SIZE-sizeof(struct msg_msgseg))
-
 struct msg_msg *load_msg(const void __user *src, int len)
 {
 	struct msg_msg *msg;
diff --git a/ipc/util.h b/ipc/util.h
index 5a6373f..db067b0 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -140,6 +140,14 @@ extern void free_msg(struct msg_msg *msg);
 extern struct msg_msg *load_msg(const void __user *src, int len);
 extern int store_msg(void __user *dest, struct msg_msg *msg, int len);
 
+struct msg_msgseg {
+	struct msg_msgseg *next;
+	/* the next part of the message follows immediately */
+};
+
+#define DATALEN_MSG	(PAGE_SIZE-sizeof(struct msg_msg))
+#define DATALEN_SEG	(PAGE_SIZE-sizeof(struct msg_msgseg))
+
 extern void recompute_msgmni(struct ipc_namespace *);
 
 static inline int ipc_buildid(int id, int seq)
@@ -175,6 +183,8 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 
 /* for checkpoint/restart */
 extern int do_shmget(key_t key, size_t size, int shmflg, int req_id);
+extern int do_msgget(key_t key, int msgflg, int req_id);
+extern void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 extern void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
-- 
1.6.0.4
[Devel] [RFC v16][PATCH 32/43] c/r (ipc): allow allocation of a desired ipc identifier
During restart, we need to allocate ipc objects with the same
identifiers as recorded during checkpoint. Modify the allocation
code to allow an in-kernel caller to request a specific ipc identifier.
The system call interface remains unchanged.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 ipc/msg.c  |   17 -
 ipc/sem.c  |   17 -
 ipc/shm.c  |   19 +--
 ipc/util.c |   42 +-
 ipc/util.h |   12 +---
 5 files changed, 75 insertions(+), 32 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 2ceab7f..1db7c45 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -73,7 +73,7 @@ struct msg_sender {
 #define msg_unlock(msq)		ipc_unlock(&(msq)->q_perm)
 
 static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
-static int newque(struct ipc_namespace *, struct ipc_params *);
+static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
 #endif
@@ -174,10 +174,12 @@ static inline void msg_rmid(struct ipc_namespace *ns, struct msg_queue *s)
  * newque - Create a new msg queue
  * @ns: namespace
  * @params: ptr to the structure that contains the key and msgflg
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with msg_ids.rw_mutex held (writer)
  */
-static int newque(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newque(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	struct msg_queue *msq;
 	int id, retval;
@@ -201,7 +203,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
 	/*
 	 * ipc_addid() locks msq
 	 */
-	id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
+	id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni, req_id);
 	if (id < 0) {
 		security_msg_queue_free(msq);
 		ipc_rcu_putref(msq);
@@ -309,7 +311,7 @@ static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
 	return security_msg_queue_associate(msq, msgflg);
 }
 
-SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+int do_msgget(key_t key, int msgflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops msg_ops;
@@ -324,7 +326,12 @@ SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
 	msg_params.key = key;
 	msg_params.flg = msgflg;
 
-	return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
+	return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params, req_id);
+}
+
+SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+{
+	return do_msgget(key, msgflg, -1);
 }
 
 static inline unsigned long
diff --git a/ipc/sem.c b/ipc/sem.c
index 16a2189..207dbbb 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -92,7 +92,7 @@
 #define sem_unlock(sma)		ipc_unlock(&(sma)->sem_perm)
 #define sem_checkid(sma, semid)	ipc_checkid(&sma->sem_perm, semid)
 
-static int newary(struct ipc_namespace *, struct ipc_params *);
+static int newary(struct ipc_namespace *, struct ipc_params *, int);
 static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
@@ -227,11 +227,13 @@ static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
  * newary - Create a new semaphore set
  * @ns: namespace
  * @params: ptr to the structure that contains key, semflg and nsems
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with sem_ids.rw_mutex held (as a writer)
  */
 
-static int newary(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newary(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
 	int id;
 	int retval;
@@ -263,7 +265,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 		return retval;
 	}
 
-	id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
+	id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni, req_id);
 	if (id < 0) {
 		security_sem_free(sma);
 		ipc_rcu_putref(sma);
@@ -308,7 +310,7 @@ static inline int sem_more_checks(struct kern_ipc_perm *ipcp,
 	return 0;
 }
 
-SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+int do_semget(key_t key, int nsems, int semflg, int req_id)
 {
 	struct ipc_namespace *ns;
 	struct ipc_ops sem_ops;
@@ -327,7 +329,12 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
 	sem_params.flg = semflg;
 	sem_params.u.nsems = nsems;
 
-	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
+	return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params, req_id);
+}
+
+SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+{
+	return do_semget(key, nsems, semflg, -1);
 }
 
 /*
diff --git a/ipc/shm.c b/ipc/shm.c
index faa46da..7dd5f0c 100644
--- a/ipc/shm.c
+++
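The shape of this change (an internal entry point that takes a requested id, plus a thin system-call wrapper that always passes -1) can be sketched in userspace as below. The fixed-size table, the `sk_` names, and the choice of -EBUSY when the requested id is taken are all assumptions for illustration; the archive truncates the ipc/util.c part of the patch, so this does not claim to match what ipc_addid() actually does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SK_NSLOTS 8

/* Hypothetical id table: used[i] is nonzero when id i is taken. */
struct sk_ids {
	unsigned char used[SK_NSLOTS];
};

/*
 * Allocate an id. req_id == -1 means "don't care": take the lowest
 * free id. Otherwise hand out exactly req_id or fail.
 */
static int sk_addid(struct sk_ids *ids, int req_id)
{
	int i;

	assert(ids != NULL);

	if (req_id >= 0) {
		if (req_id >= SK_NSLOTS)
			return -EINVAL;
		if (ids->used[req_id])
			return -EBUSY;	/* assumption: error choice is ours */
		ids->used[req_id] = 1;
		return req_id;
	}

	for (i = 0; i < SK_NSLOTS; i++) {
		if (!ids->used[i]) {
			ids->used[i] = 1;
			return i;
		}
	}
	return -ENOSPC;
}

/* In-kernel style entry point: restart code may request an id. */
static int sk_do_get(struct sk_ids *ids, int req_id)
{
	return sk_addid(ids, req_id);
}

/* "System call" wrapper: the external interface never requests an id. */
static int sk_sys_get(struct sk_ids *ids)
{
	return sk_do_get(ids, -1);
}
```

Keeping the extra parameter out of the wrapper is what lets the user-visible interface stay unchanged while restart code gets the identifiers recorded at checkpoint time.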
[Devel] [RFC v16][PATCH 12/43] c/r: add generic '-checkpoint()' f_op to simple devices
* /dev/null
* /dev/zero
* /dev/random
* /dev/urandom

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 drivers/char/mem.c    |    2 ++
 drivers/char/random.c |    2 ++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 8f05c38..bfde41f 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -797,6 +797,7 @@ static const struct file_operations null_fops = {
 	.read		= read_null,
 	.write		= write_null,
 	.splice_write	= splice_write_null,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 #ifdef CONFIG_DEVPORT
@@ -813,6 +814,7 @@ static const struct file_operations zero_fops = {
 	.read		= read_zero,
 	.write		= write_zero,
 	.mmap		= mmap_zero,
+	.checkpoint	= generic_file_checkpoint,
 };
 
 /*
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 8c74448..211ca70 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1164,6 +1164,7 @@ const struct file_operations random_fops = {
 	.poll  = random_poll,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
+	.checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations urandom_fops = {
@@ -1171,6 +1172,7 @@ const struct file_operations urandom_fops = {
 	.write = random_write,
 	.unlocked_ioctl = random_ioctl,
 	.fasync = random_fasync,
+	.checkpoint = generic_file_checkpoint,
 };
 
 /***
-- 
1.6.0.4
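For context, an optional per-file-type hook in an operations table can be sketched in userspace as follows. This is a toy model only: the `sk_` structs, the record format written by the generic hook, and the -ENOSYS result for file types without a hook are assumptions made for this note, not what the c/r code does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

struct sk_file;

/* Operations table with an optional checkpoint hook. */
struct sk_file_ops {
	int (*checkpoint)(const struct sk_file *f, char *out, size_t len);
};

struct sk_file {
	const char *name;
	long pos;
	const struct sk_file_ops *ops;
};

/* Generic hook: record just the name and position of the file. */
static int sk_generic_checkpoint(const struct sk_file *f, char *out,
				 size_t len)
{
	int n = snprintf(out, len, "%s@%ld", f->name, f->pos);

	if (n < 0 || (size_t)n >= len)
		return -ENOSPC;	/* record did not fit */
	return 0;
}

/* A simple device opts in by pointing its hook at the generic helper. */
static const struct sk_file_ops sk_null_ops = {
	.checkpoint = sk_generic_checkpoint,
};

/* A file type that has not been taught to checkpoint leaves it NULL. */
static const struct sk_file_ops sk_plain_ops = {
	.checkpoint = NULL,
};

/* Dispatcher: refuse files whose type provides no hook. */
static int sk_checkpoint_file(const struct sk_file *f, char *out, size_t len)
{
	assert(f != NULL && out != NULL);

	if (!f->ops || !f->ops->checkpoint)
		return -ENOSYS;	/* assumption: error choice is ours */
	return f->ops->checkpoint(f, out, len);
}
```

Opting in per operations table means unsupported file types fail the checkpoint loudly by default, and each device is enabled with a one-line change like the ones in this patch.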
[Devel] [RFC v16][PATCH 41/43] c/r: (s390): expose a constant for the number of words (CRs)
We need to use this value in the checkpoint/restart code and would like to have a constant instead of a magic '3'.

Changelog:
  Mar 30:
    . Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
  Mar 03:
    . Picked up additional use of magic '3' in ptrace.h

Signed-off-by: Dan Smith da...@us.ibm.com
---
 arch/s390/Kconfig |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2eca5fe..bf62cad 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
 config GENERIC_CLOCKEVENTS
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if 64BIT
+
 config GENERIC_BUG
 	bool
 	depends on BUG
-- 
1.6.0.4
[Devel] [RFC v16][PATCH 27/43] c/r: support for open pipes
A pipe is a double-headed inode with a buffer attached to it. We checkpoint the pipe buffer only once, as soon as we hit one side of the pipe, regardless of whether it is the read end or the write end.

To checkpoint a file descriptor that refers to a pipe (either end), we first look up the inode in the hash table:

If not found, it is the first encounter of this pipe. Besides the file descriptor, we also (a) save the pipe data, and (b) register the pipe inode in the hash.

If found, it is the second encounter of this pipe, namely, as we hit the other end of the same pipe.

In both cases we write the pipe-objref of the inode.

To restore, create a new pipe and thus have two file pointers (read end and write end). We only use one of them, depending on which side was checkpointed first. We register the file pointer of the other end in the hash table, with the pipe_objref given for this pipe from the checkpoint, to be used later when the other end arrives. At this point we also restore the contents of the pipe buffers.

To save the pipe buffer, given a source pipe, use do_tee() to clone its contents into a temporary 'struct pipe_inode_info', and then use do_splice_from() to transfer it directly to the checkpoint image file.

To restore the pipe buffer, with a freshly allocated target pipe, use do_splice_to() to splice the data directly between the checkpoint image file and the pipe.
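The first-encounter/second-encounter rule above can be sketched in user-space C. This is only an illustration under stated assumptions, not the code from the patch: `struct objhash`, `obj_lookup_add()` and `checkpoint_pipe_end()` are hypothetical names, and a flat table stands in for the checkpoint hash that ckpt_obj_lookup_add() consults.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 64

/* Hypothetical stand-in for the checkpoint object hash: a flat table
 * mapping a pointer (e.g. a pipe inode) to a small integer objref. */
struct objhash {
	const void *ptr[MAX_OBJS];
	int count;
};

/* Return the objref for ptr, adding it if it was not seen before.
 * *first is set to 1 on the first encounter and to 0 otherwise.
 * Returns -1 if the table is full. */
static int obj_lookup_add(struct objhash *h, const void *ptr, int *first)
{
	int i;

	for (i = 0; i < h->count; i++) {
		if (h->ptr[i] == ptr) {
			*first = 0;
			return i + 1;	/* objrefs start at 1 */
		}
	}
	if (h->count == MAX_OBJS)
		return -1;
	h->ptr[h->count++] = ptr;
	*first = 1;
	return h->count;
}

/* Checkpoint one end of a pipe. The objref is always recorded; the
 * return value says whether the buffer must be saved now (1, first
 * encounter) or was already saved via the other end (0). */
static int checkpoint_pipe_end(struct objhash *h, const void *inode,
			       int *objref)
{
	int first;

	*objref = obj_lookup_add(h, inode, &first);
	if (*objref < 0)
		return -1;
	return first;
}
```

Both ends of one pipe share an inode, so they map to the same objref, and only the first call reports that the buffer contents still need to be written.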
Signed-off-by: Oren Laadan or...@cs.columbia.edu --- checkpoint/files.c |7 ++ fs/pipe.c | 173 fs/splice.c| 10 +- include/linux/checkpoint_hdr.h | 12 +++ include/linux/pipe_fs_i.h |6 ++ include/linux/splice.h |9 ++ 6 files changed, 212 insertions(+), 5 deletions(-) diff --git a/checkpoint/files.c b/checkpoint/files.c index d7583d3..b264e40 100644 --- a/checkpoint/files.c +++ b/checkpoint/files.c @@ -17,6 +17,7 @@ #include linux/file.h #include linux/fdtable.h #include linux/fsnotify.h +#include linux/pipe_fs_i.h #include linux/syscalls.h #include linux/checkpoint.h #include linux/checkpoint_hdr.h @@ -433,6 +434,12 @@ static struct restore_file_ops restore_file_ops[] = { .file_type = CKPT_FILE_GENERIC, .restore = generic_file_restore, }, + /* pipes */ + { + .file_name = PIPE, + .file_type = CKPT_FILE_PIPE, + .restore = pipe_file_restore, + }, }; static struct file *do_restore_file(struct ckpt_ctx *ctx) diff --git a/fs/pipe.c b/fs/pipe.c index 13414ec..d0aba56 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -13,6 +13,7 @@ #include linux/fs.h #include linux/mount.h #include linux/pipe_fs_i.h +#include linux/splice.h #include linux/uio.h #include linux/highmem.h #include linux/pagemap.h @@ -22,6 +23,9 @@ #include asm/uaccess.h #include asm/ioctls.h +#include linux/checkpoint.h +#include linux/checkpoint_hdr.h + /* * We use a start+len construction, which provides full use of the * allocated memory. 
@@ -795,6 +799,172 @@ pipe_rdwr_open(struct inode *inode, struct file *filp) return 0; } +#ifdef CONFIG_CHECKPOINT +static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode) +{ + struct ckpt_hdr_file_pipe_state *h; + struct pipe_inode_info *pipe; + int len, ret = -ENOMEM; + + pipe = alloc_pipe_info(NULL); + if (!pipe) + return ret; + + pipe-readers = 1; /* bluff link_pipe() below */ + len = link_pipe(inode-i_pipe, pipe, INT_MAX, SPLICE_F_NONBLOCK); + if (len == -EAGAIN) + len = 0; + if (len 0) { + ret = len; + goto out; + } + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_PIPE); + if (!h) + goto out; + h-pipe_len = len; + ret = ckpt_write_obj(ctx, h-h); + ckpt_hdr_put(ctx, h); + if (ret 0) + goto out; + + ret = do_splice_from(pipe, ctx-file, ctx-file-f_pos, len, 0); + if (ret 0) + goto out; + if (ret != len) + ret = -EPIPE; /* can occur due to an error in target file */ + out: + __free_pipe_info(pipe); + return ret; +} + +static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file) +{ + struct ckpt_hdr_file_pipe *h; + struct inode *inode = file-f_dentry-d_inode; + int objref, first, ret; + + objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, first); + if (objref 0) + return objref; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE); + if (!h) + return -ENOMEM; + + h-common.f_type = CKPT_FILE_PIPE; + h-pipe_objref = objref; + + ret = checkpoint_file_common(ctx, file, h-common); + if (ret 0) + goto out; + ret = ckpt_write_obj(ctx, h-common.h); + if (ret 0) + goto out; + + if
[Devel] [RFC v16][PATCH 23/43] c/r: restart multiple processes
Restarting of multiple processes expects all restarting tasks to call sys_restart(). Once inside the system call, each task will restart itself in the same order in which the tasks were saved. The internals of the syscall will take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it assumes that all tasks are created in some way and then invoke the restart syscall. You can use the userspace mktree.c program to do that.

The init task (*) has a special role: it allocates the restart context (ctx), and coordinates the operation. In particular, it first waits until all participating tasks enter the kernel, and provides them the common restart context. Once everyone is ready, it begins to restart itself.

In contrast, the other tasks enter the kernel, locate the init task (*) and grab its restart context, and then wait for their turn to restore.

When a task (init or not) completes its restart, it hands the control over to the next in line, by waking that task.

An array of pids (the one saved during the checkpoint) is used to synchronize the operation. The first task in the array is the init task (*). The restart context (ctx) maintains a current position in the array, which indicates which task is currently active. Once the currently active task completes its own restart, it increments that position and wakes up the next task.

Restart assumes that userspace provides meaningful data, otherwise it's garbage-in-garbage-out. In this case, the syscall may block indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it will not allow a user to do more than is otherwise permitted by the usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
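The hand-over scheme described above (a saved pid array plus a current position) can be sketched as a small single-threaded model. It is an illustration only: `struct restart_ctx` here is a hypothetical, cut-down context, and `my_turn()` is merely the condition a task would sleep on, where the kernel code uses wait_event_interruptible().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified restart context: the pid array saved at
 * checkpoint time plus the index of the task whose turn it is. */
struct restart_ctx {
	const int *pids_arr;
	int nr_pids;
	int active;	/* index into pids_arr */
};

/* pid of the task allowed to restore now, or -1 once all are done */
static int active_pid(const struct restart_ctx *ctx)
{
	if (ctx->active >= ctx->nr_pids)
		return -1;
	return ctx->pids_arr[ctx->active];
}

/* Condition a waiting task would sleep on until it becomes true. */
static int my_turn(const struct restart_ctx *ctx, int pid)
{
	return active_pid(ctx) == pid;
}

/* Called by a task that has finished restoring itself: advance the
 * position so the next task may run. Returns 1 when the whole
 * restart is complete, 0 otherwise. */
static int restore_next_task(struct restart_ctx *ctx)
{
	assert(ctx->active < ctx->nr_pids);
	ctx->active++;
	return ctx->active == ctx->nr_pids;
}
```

In this model the first array entry plays the role of the init task; every other task sees `my_turn()` become true only after all earlier entries have called `restore_next_task()`.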
Add support for multiple processes (*) For containers, restart should be called inside a fresh container by the init task of that container. However, it is also possible to restart applications not necessarily inside a container, and without restoring the original pids of the processes (that is, provided that the application can tolerate such behavior). This is useful to allow multi-process restart of tasks not isolated inside a container, and also for debugging. Changelog[v14]: - Revert change to pr_debug(), back to ckpt_debug() - Discard field 'h.parent' - Check whether calls to ckpt_hbuf_get() fail Changelog[v13]: - Clear root_task-checkpoint_ctx regardless of error condition - Remove unused argument 'ctx' from do_restore_task() prototype - Remove unused member 'pids_err' from 'struct ckpt_ctx' Changelog[v12]: - Replace obsolete ckpt_debug() with pr_debug() Signed-off-by: Oren Laadan or...@cs.columbia.edu --- checkpoint/restart.c | 242 -- checkpoint/sys.c | 27 - include/linux/checkpoint.h |3 + include/linux/checkpoint_types.h | 17 +++- include/linux/sched.h|4 + 5 files changed, 277 insertions(+), 16 deletions(-) diff --git a/checkpoint/restart.c b/checkpoint/restart.c index 8b8229e..5e68835 100644 --- a/checkpoint/restart.c +++ b/checkpoint/restart.c @@ -13,6 +13,7 @@ #include linux/version.h #include linux/sched.h +#include linux/wait.h #include linux/file.h #include linux/magic.h #include linux/utsname.h @@ -353,12 +354,233 @@ static int restore_read_tail(struct ckpt_ctx *ctx) return ret; } +/* restore_read_tree - read the tasks tree into the checkpoint context */ +static int restore_read_tree(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_tree *h; + int size, ret; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TREE); + if (IS_ERR(h)) + return PTR_ERR(h); + + ret = -EINVAL; + if (h-nr_tasks 0) + goto out; + + ctx-nr_pids = h-nr_tasks; + size = sizeof(*ctx-pids_arr) * ctx-nr_pids; + if (size 0) /* overflow ? 
*/ + goto out; + + ctx-pids_arr = kmalloc(size, GFP_KERNEL); + if (!ctx-pids_arr) { + ret = -ENOMEM; + goto out; + } + ret = _ckpt_read_buffer(ctx, ctx-pids_arr, size); + out: + ckpt_hdr_put(ctx, h); + return ret; +} + +static inline pid_t active_pid(struct ckpt_ctx *ctx) +{ + return ctx-pids_arr[ctx-active_pid].vpid; +} + +static int restore_wait_task(struct ckpt_ctx *ctx) +{ + pid_t pid = task_pid_vnr(current); + + ckpt_debug(pid %d waiting\n, pid); + return wait_event_interruptible(ctx-waitq, active_pid(ctx) == pid); +} + +static int restore_next_task(struct ckpt_ctx *ctx) +{ + struct task_struct *task; + + ctx-active_pid++; + + ckpt_debug(active_pid %d of %d\n, ctx-active_pid, ctx-nr_pids); + if (ctx-active_pid == ctx-nr_pids) { + complete(ctx-complete); +
[Devel] [RFC v16][PATCH 25/43] tee: don't return 0 when another task drains/fills a pipe
This patch is a modified version of Max Kellerman's patch that fixes a race in do_tee() (see http://patchwork.kernel.org/patch/21040). It differs in that it refactors link_pipe() so that the following patch (which adds support for splice() between pipes, also based on a patch by Max Kellerman) can better share code.

Below is Max's original description:

--

Cite from the tee() manual page: "A return value of 0 means that there was no data to transfer, and it would not make sense to block, because there are no writers connected to the write end of the pipe"

There is however a race condition in the tee() implementation, which violates this definition:

- do_tee() ensures that ipipe is readable and opipe is writable by calling link_ipipe_prep() and link_opipe_prep()
- these two functions unlock the pipe after they have waited
- during this unlocked phase, there is a short window where other tasks may drain the input pipe or fill the output pipe
- do_tee() now calls link_pipe(), which re-locks both pipes
- link_pipe() sees that it is unable to read (i >= ipipe->nrbufs || opipe->nrbufs >= PIPE_BUFFERS) and breaks from the loop
- link_pipe() returns 0

Although there may be writers connected to the input pipe, tee() now returns 0, and the caller (spuriously) assumes this is the end of the stream.

This patch wraps the link_[io]pipe_prep() invocation in a loop within link_pipe(), and loops until the result is reliable.
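The fix follows a general "wait unlocked, lock, re-check, retry" pattern, which can be modeled in plain C. The sketch below is an assumption-laden toy, not the kernel code: `struct toy_pipe`, `prep_and_lock()` and the `demo_*` helpers are hypothetical, locking is reduced to a flag, and the `race` callback merely simulates what another task might do during the unlocked window.

```c
#include <assert.h>

#define PIPE_BUFFERS 16

/* Hypothetical toy pipe: only the fields the readiness check needs. */
struct toy_pipe {
	int nrbufs;	/* buffers currently queued */
	int writers;	/* writers attached to the pipe */
	int locked;
};

static void toy_lock(struct toy_pipe *p)   { p->locked = 1; }
static void toy_unlock(struct toy_pipe *p) { p->locked = 0; }

/* The readiness test applied once both pipes are locked. */
static int tee_possible(const struct toy_pipe *in, const struct toy_pipe *out)
{
	return (in->nrbufs > 0 || in->writers == 0) &&
	       out->nrbufs < PIPE_BUFFERS;
}

/*
 * prep() stands in for the two prep calls: it waits, unlocked, until
 * the pipes look ready and returns nonzero on error. race() stands in
 * for other tasks running in the unlocked window. On success (0) both
 * pipes are returned locked; *tries counts the loop iterations.
 */
static int prep_and_lock(struct toy_pipe *in, struct toy_pipe *out,
			 int (*prep)(struct toy_pipe *, struct toy_pipe *),
			 void (*race)(struct toy_pipe *, struct toy_pipe *),
			 int *tries)
{
	for (*tries = 1; ; (*tries)++) {
		int ret = prep(in, out);

		if (ret)
			return ret;
		if (race)
			race(in, out);	/* the unlocked window */
		toy_lock(in);
		toy_lock(out);
		if (tee_possible(in, out))
			return 0;	/* still valid: keep the locks */
		toy_unlock(in);		/* state changed: wait again */
		toy_unlock(out);
	}
}

/* Demo helper: pretend a writer refilled the input pipe while we slept. */
static int demo_prep(struct toy_pipe *in, struct toy_pipe *out)
{
	(void)out;
	if (in->nrbufs == 0)
		in->nrbufs = 1;
	return 0;
}

/* Demo helper: another reader drains the input pipe, but only once. */
static int drained_once;
static void demo_race(struct toy_pipe *in, struct toy_pipe *out)
{
	(void)out;
	if (!drained_once) {
		drained_once = 1;
		in->nrbufs = 0;
	}
}
```

With `demo_race` draining the input once, the first pass fails the locked re-check and the loop goes around a second time instead of reporting a spurious end of stream.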
-- Signed-off-by: Oren Laadan or...@cs.columbia.edu --- fs/splice.c | 80 +-- 1 files changed, 61 insertions(+), 19 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 666953d..92dd63c 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1586,6 +1586,59 @@ static int link_opipe_prep(struct pipe_inode_info *pipe, unsigned int flags) return ret; } +/** + * link_pipe_prep - make sure there's readable data and writable room + * @ipipe: the input pipe + * @opipe: the output pipe + * @flags: splice modifier flags + * + * Wrap the link_[io]pipe_prep() invocation in a loop until the result + * is reliable. + * + * Expects pipes to be unlocked, and on success returns them locked. + */ +static int link_pipe_prep(struct pipe_inode_info *ipipe, + struct pipe_inode_info *opipe, + unsigned int flags) +{ + int ret; + + while (1) { + /* wait for ipipe to become ready to read */ + ret = link_ipipe_prep(ipipe, flags); + if (ret) + return ret; + + /* wait for opipe to become ready to write */ + ret = link_opipe_prep(opipe, flags); + if (ret) + return ret; + + /* +* Potential ABBA deadlock, work around it by ordering +* lock grabbing by inode address. Otherwise two +* different processes could deadlock (one doing tee +* from A - B, the other from B - A). +*/ + pipe_double_lock(ipipe, opipe); + + /* see if the tee() is still possible */ + if ((ipipe-nrbufs 0 || ipipe-writers == 0) + opipe-nrbufs PIPE_BUFFERS) + /* yes, it is - keep the locks and end this + loop */ + break; + + /* no - someone has drained ipipe or has filled opipe + between link_[io]pipe_pre()'s lock and our lock. + Drop both locks and wait again. */ + pipe_unlock(ipipe); + pipe_unlock(opipe); + } + + return 0; +} + /* * Link contents of ipipe to opipe. 
*/ @@ -1594,14 +1647,13 @@ static int link_pipe(struct pipe_inode_info *ipipe, size_t len, unsigned int flags) { struct pipe_buffer *ibuf, *obuf; - int ret = 0, i = 0, nbuf; + int ret, i = 0, nbuf; - /* -* Potential ABBA deadlock, work around it by ordering lock -* grabbing by pipe info address. Otherwise two different processes -* could deadlock (one doing tee from A - B, the other from B - A). -*/ - pipe_double_lock(ipipe, opipe); + ret = link_pipe_prep(ipipe, opipe, flags); + if (ret 0) + return ret; + + /* pipes are now locked */ do { if (!opipe-readers) { @@ -1685,18 +1737,8 @@ static long do_tee(struct file *in, struct file *out, size_t len, * Duplicate the contents of ipipe to opipe without actually * copying the data. */ - if (ipipe opipe ipipe != opipe) { - /* -* Keep going, unless we encounter an error. The ipipe/opipe -* ordering doesn't really matter. -*/ - ret = link_ipipe_prep(ipipe, flags); - if (!ret) { - ret =
[Devel] Re: [PATCH 3/7] [PATCH] Add target_pid parameter to alloc_pidmap()
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com):
> From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> Date: Mon, 4 May 2009 01:17:41 -0700
> Subject: [PATCH 3/7] [PATCH] Add target_pid parameter to alloc_pidmap()
>
> Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

How about #define TARGET_PID_UNSPECIFIED 0 or something to pass to alloc_pidmap() from alloc_pid()? Up to you

More importantly:

> ---
>  kernel/pid.c |   28 ++++++++++++++++++++++++++--
>  1 files changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index fd72ad9..93406c6 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -147,12 +147,36 @@ static int alloc_pidmap_page(struct pidmap *map)
>  	return 0;
>  }
>
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
> +{
> +	int offset;
> +	struct pidmap *map;
> +
> +	if (pid >= pid_max)
> +		return -EINVAL;

what about pid < 0?

> +
> +	offset = pid & BITS_PER_PAGE_MASK;
> +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> +
> +	if (alloc_pidmap_page(map))
> +		return -ENOMEM;
> +
> +	if (test_and_set_bit(offset, map->page))
> +		return -EBUSY;
> +
> +	atomic_dec(&map->nr_free);
> +	return pid;
> +}
> +
> +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
>  {
>  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
>  	struct pidmap *map;
>  	int rc = -EAGAIN;
>
> +	if (target_pid)
> +		return set_pidmap(pid_ns, target_pid);
> +
>  	pid = last + 1;
>  	if (pid >= pid_max)
>  		pid = RESERVED_PIDS;
> @@ -269,7 +293,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>
>  	tmp = ns;
>  	for (i = ns->level; i >= 0; i--) {
> -		nr = alloc_pidmap(tmp);
> +		nr = alloc_pidmap(tmp, 0);
>  		if (nr < 0)
>  			goto out_free;
>
> --
> 1.5.2.5
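The range question raised in the review can be illustrated with a toy bitmap allocator that claims one specific pid. This is a sketch under assumptions, not the kernel implementation: `struct toy_pidmap`, `toy_set_pidmap()` and `TOY_PID_MAX` are hypothetical, the bitmap is a plain array rather than per-page maps, the bit update is not atomic, and rejecting pid 0 here is a choice made for the example (in the patch, 0 means "no target" and never reaches set_pidmap()).

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

#define TOY_PID_MAX 128
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical toy pid bitmap; the kernel spreads this over pages. */
struct toy_pidmap {
	unsigned long bits[(TOY_PID_MAX + BITS_PER_WORD - 1) / BITS_PER_WORD];
};

/* Claim a specific pid. Returns the pid on success, -EINVAL if it is
 * out of range (zero, negative, or >= TOY_PID_MAX), and -EBUSY if the
 * pid is already taken. */
static int toy_set_pidmap(struct toy_pidmap *map, int pid)
{
	unsigned long mask;
	unsigned long *word;

	/* Check the lower bound too, not only the upper one. */
	if (pid <= 0 || pid >= TOY_PID_MAX)
		return -EINVAL;

	word = &map->bits[pid / BITS_PER_WORD];
	mask = 1UL << (pid % BITS_PER_WORD);
	if (*word & mask)
		return -EBUSY;
	*word |= mask;	/* not atomic: single-threaded sketch only */
	return pid;
}
```

Without the lower-bound check, a negative pid would index outside the bitmap, which is the concern behind the question about pid < 0.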
[Devel] Re: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com): From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Date: Mon, 4 May 2009 01:17:39 -0700 Subject: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Acked-by: Serge Hallyn se...@us.ibm.com --- kernel/pid.c | 43 --- 1 files changed, 28 insertions(+), 15 deletions(-) diff --git a/kernel/pid.c b/kernel/pid.c index b2e5f78..c0aaebe 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -122,6 +122,31 @@ static void free_pidmap(struct upid *upid) atomic_inc(map-nr_free); } +static int alloc_pidmap_page(struct pidmap *map) +{ + void *page; + + if (likely(map-page)) + return 0; + + page = kzalloc(PAGE_SIZE, GFP_KERNEL); + + /* + * Free the page if someone raced with us installing it: + */ + spin_lock_irq(pidmap_lock); + if (map-page) + kfree(page); + else + map-page = page; + spin_unlock_irq(pidmap_lock); + + if (unlikely(!map-page)) + return -1; + + return 0; +} + static int alloc_pidmap(struct pid_namespace *pid_ns) { int i, offset, max_scan, pid, last = pid_ns-last_pid; @@ -134,21 +159,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) map = pid_ns-pidmap[pid/BITS_PER_PAGE]; max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset; for (i = 0; i = max_scan; ++i) { - if (unlikely(!map-page)) { - void *page = kzalloc(PAGE_SIZE, GFP_KERNEL); - /* - * Free the page if someone raced with us - * installing it: - */ - spin_lock_irq(pidmap_lock); - if (map-page) - kfree(page); - else - map-page = page; - spin_unlock_irq(pidmap_lock); - if (unlikely(!map-page)) - break; - } + if (alloc_pidmap_page(map)) + break; + if (likely(atomic_read(map-nr_free))) { do { if (!test_and_set_bit(offset, map-page)) { -- 1.5.2.5 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com): From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Date: Mon, 4 May 2009 01:17:43 -0700 Subject: [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process() The new parameter will be used in a follow-on patch when clone_with_pids() is implemented. Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Acked-by: Serge Hallyn se...@us.ibm.com --- kernel/fork.c |7 --- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index d2d69d3..373411e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long clone_flags, unsigned long stack_size, int __user *child_tidptr, struct pid *pid, + pid_t *target_pids, int trace) { int retval; struct task_struct *p; int cgroup_callbacks_done = 0; - pid_t *target_pids = NULL; if ((clone_flags (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) return ERR_PTR(-EINVAL); @@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu) struct pt_regs regs; task = copy_process(CLONE_VM, 0, idle_regs(regs), 0, NULL, - init_struct_pid, 0); + init_struct_pid, NULL, 0); if (!IS_ERR(task)) init_idle(task, cpu); @@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags, struct task_struct *p; int trace = 0; long nr; + pid_t *target_pids = NULL; /* * Do some preliminary argument and permissions checking before we @@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags, trace = tracehook_prepare_clone(clone_flags); p = copy_process(clone_flags, stack_start, regs, stack_size, - child_tidptr, NULL, trace); + child_tidptr, NULL, target_pids, trace); /* * Do this prior waking up the new thread - the thread pointer * might get invalid after that point, if the thread exits quickly. 
-- 1.5.2.5 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 4/7] [PATCH] Add target_pids parameter to alloc_pid()
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com): From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Date: Mon, 4 May 2009 01:17:42 -0700 Subject: [PATCH 4/7] [PATCH] Add target_pids parameter to alloc_pid() This parameter is currently NULL, but will be used in a follow-on patch. Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Acked-by: Serge Hallyn se...@us.ibm.com --- include/linux/pid.h |2 +- kernel/fork.c |3 ++- kernel/pid.c|9 +++-- 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/include/linux/pid.h b/include/linux/pid.h index 49f1c2f..914185d 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr); extern struct pid *find_ge_pid(int nr, struct pid_namespace *); int next_pidmap(struct pid_namespace *pid_ns, int last); -extern struct pid *alloc_pid(struct pid_namespace *ns); +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids); extern void free_pid(struct pid *pid); /* diff --git a/kernel/fork.c b/kernel/fork.c index f8411a8..d2d69d3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -954,6 +954,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, int retval; struct task_struct *p; int cgroup_callbacks_done = 0; + pid_t *target_pids = NULL; if ((clone_flags (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) return ERR_PTR(-EINVAL); @@ -1119,7 +1120,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, goto bad_fork_cleanup_io; if (pid != init_struct_pid) { - pid = alloc_pid(p-nsproxy-pid_ns); + pid = alloc_pid(p-nsproxy-pid_ns, target_pids); if (IS_ERR(pid)) { retval = PTR_ERR(pid); goto bad_fork_cleanup_io; diff --git a/kernel/pid.c b/kernel/pid.c index 93406c6..4b2373a 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -279,13 +279,14 @@ void free_pid(struct pid *pid) call_rcu(pid-rcu, delayed_put_pid); } -struct pid *alloc_pid(struct pid_namespace *ns) +struct pid *alloc_pid(struct pid_namespace *ns, 
pid_t *target_pids) { struct pid *pid; enum pid_type type; int i, nr; struct pid_namespace *tmp; struct upid *upid; + int tpid; pid = kmem_cache_alloc(ns-pid_cachep, GFP_KERNEL); if (!pid) @@ -293,7 +294,11 @@ struct pid *alloc_pid(struct pid_namespace *ns) tmp = ns; for (i = ns-level; i = 0; i--) { - nr = alloc_pidmap(tmp, 0); + tpid = 0; + if (target_pids) + tpid = target_pids[i]; + + nr = alloc_pidmap(tmp, tpid); if (nr 0) goto out_free; -- 1.5.2.5 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 6/7] [PATCH] Define do_fork_with_pids()
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com): From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Date: Mon, 4 May 2009 01:17:44 -0700 Subject: [PATCH 6/7] [PATCH] Define do_fork_with_pids() do_fork_with_pids() is same as do_fork(), except that it takes an additional, target_pids, parameter. This parameter, currently unused, specifies the target_pids of the process in each of its pid namespaces. Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Acked-by: Serge Hallyn se...@us.ibm.com --- include/linux/sched.h |1 + kernel/fork.c | 17 ++--- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index b4c38bc..2173df1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1995,6 +1995,7 @@ extern int disallow_signal(int); extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t *target_pids); struct task_struct *fork_idle(int); extern void set_task_comm(struct task_struct *tsk, char *from); diff --git a/kernel/fork.c b/kernel/fork.c index 373411e..912d008 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1340,17 +1340,17 @@ struct task_struct * __cpuinit fork_idle(int cpu) * It copies the process, and if successful kick-starts * it and waits for it to finish using the VM if required. 
*/ -long do_fork(unsigned long clone_flags, +long do_fork_with_pids(unsigned long clone_flags, unsigned long stack_start, struct pt_regs *regs, unsigned long stack_size, int __user *parent_tidptr, - int __user *child_tidptr) + int __user *child_tidptr, + pid_t *target_pids) { struct task_struct *p; int trace = 0; long nr; - pid_t *target_pids = NULL; /* * Do some preliminary argument and permissions checking before we @@ -1448,6 +1448,17 @@ long do_fork(unsigned long clone_flags, return nr; } +long do_fork(unsigned long clone_flags, + unsigned long stack_start, + struct pt_regs *regs, + unsigned long stack_size, + int __user *parent_tidptr, + int __user *child_tidptr) +{ + return do_fork_with_pids(clone_flags, stack_start, regs, stack_size, + parent_tidptr, child_tidptr, NULL); +} + #ifndef ARCH_MIN_MMSTRUCT_ALIGN #define ARCH_MIN_MMSTRUCT_ALIGN 0 #endif -- 1.5.2.5 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [RFC v16][PATCH 39/43] c/r (ipc): export interface from ipc/sem.c to cleanup ipc sem
Export freeary() which will be used in the next patch during restart to cleanup an ipc sem.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 ipc/sem.c  |    3 +--
 ipc/util.h |    1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 207dbbb..c60076e 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -93,7 +93,6 @@
 #define sem_checkid(sma, semid)	ipc_checkid(&sma->sem_perm, semid)
 
 static int newary(struct ipc_namespace *, struct ipc_params *, int);
-static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #endif
@@ -521,7 +520,7 @@ static void free_un(struct rcu_head *head)
  * as a writer and the spinlock for this semaphore set hold. sem_ids.rw_mutex
  * remains locked on exit.
  */
-static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
 	struct sem_undo *un, *tu;
 	struct sem_queue *q, *tq;
diff --git a/ipc/util.h b/ipc/util.h
index 2a05fb3..347ffb2 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -185,6 +185,7 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 extern int do_shmget(key_t key, size_t size, int shmflg, int req_id);
 extern int do_msgget(key_t key, int msgflg, int req_id);
 extern void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+extern void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 extern void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
-- 
1.6.0.4
[Devel] [RFC v16][PATCH 16/43] c/r: export shmem_getpage() to support shared memory
Export functionality to retrieve specific pages from shared memory given an inode in shmem-fs; this will be used in the next two patches to provide support for c/r of shared memory. mm/shmem.c: - shmem_getpage() and 'enum sgp_type' moved to linux/mm.h Signed-off-by: Oren Laadan or...@cs.columbia.edu --- include/linux/mm.h | 11 +++ mm/shmem.c | 15 ++- 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ae70b50..53e916a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -330,6 +330,17 @@ void put_pages_list(struct list_head *pages); void split_page(struct page *page, unsigned int order); +/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */ +enum sgp_type { + SGP_READ, /* don't exceed i_size, don't allocate page */ + SGP_CACHE, /* don't exceed i_size, may allocate page */ + SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */ + SGP_WRITE, /* may exceed i_size, may allocate page */ +}; + +extern int shmem_getpage(struct inode *inode, unsigned long idx, +struct page **pagep, enum sgp_type sgp, int *type); + /* * Compound pages have a destructor function. Provide a * prototype for that function and accessor functions. 
diff --git a/mm/shmem.c b/mm/shmem.c index b25f95c..f260336 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -99,14 +99,6 @@ static struct vfsmount *shm_mnt; /* Pretend that each entry is of this size in directory's i_size */ #define BOGO_DIRENT_SIZE 20 -/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */ -enum sgp_type { - SGP_READ, /* don't exceed i_size, don't allocate page */ - SGP_CACHE, /* don't exceed i_size, may allocate page */ - SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */ - SGP_WRITE, /* may exceed i_size, may allocate page */ -}; - #ifdef CONFIG_TMPFS static unsigned long shmem_default_max_blocks(void) { @@ -119,9 +111,6 @@ static unsigned long shmem_default_max_inodes(void) } #endif -static int shmem_getpage(struct inode *inode, unsigned long idx, -struct page **pagep, enum sgp_type sgp, int *type); - static inline struct page *shmem_dir_alloc(gfp_t gfp_mask) { /* @@ -1202,8 +1191,8 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) * vm. If we swap it in we mark it dirty since we also free the swap * entry since a page cannot live in both the swap and page cache */ -static int shmem_getpage(struct inode *inode, unsigned long idx, - struct page **pagep, enum sgp_type sgp, int *type) +int shmem_getpage(struct inode *inode, unsigned long idx, + struct page **pagep, enum sgp_type sgp, int *type) { struct address_space *mapping = inode-i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); -- 1.6.0.4 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [RFC v16][PATCH 08/43] c/r: introduce '->checkpoint()' method in 'struct file_operations'
While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.

This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single generic
f_op entry.

Also introduce vfs_fcntl() so that it can be called from restart (see
patch adding restart of files).

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 fs/fcntl.c                       |   21 +
 include/linux/checkpoint_types.h |    2 ++
 include/linux/fs.h               |    6 ++
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 1ad7031..17020a9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -337,6 +337,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	return err;
 }
 
+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+	int err;
+
+	err = security_file_fcntl(filp, cmd, arg);
+	if (err)
+		goto out;
+	err = do_fcntl(fd, cmd, arg, filp);
+ out:
+	return err;
+}
+
 SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 {
 	struct file *filp;
@@ -346,14 +358,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 	if (!filp)
 		goto out;
 
-	err = security_file_fcntl(filp, cmd, arg);
-	if (err) {
-		fput(filp);
-		return err;
-	}
-
-	err = do_fcntl(fd, cmd, arg, filp);
-
+	err = vfs_fcntl(fd, cmd, arg, filp);
 	fput(filp);
 out:
 	return err;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index c1032fa..9c14034 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -15,6 +15,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/sched.h>
+
 struct ckpt_ctx {
 	int crid;		/* unique checkpoint id */
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9c4348a..60d9229 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -8,6 +8,7 @@
 
 #include <linux/limits.h>
 #include <linux/ioctl.h>
+#include <linux/checkpoint_types.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -1082,6 +1083,8 @@ struct file_lock {
 
 #include <linux/fcntl.h>
 
+extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
+
 extern void send_sigio(struct fown_struct *fown, int fd, int band);
 
 /* fs/sync.c */
@@ -1508,6 +1511,7 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*setlease)(struct file *, long, struct file_lock **);
+	int (*checkpoint)(struct ckpt_ctx *, struct file *);
 };
 
 struct inode_operations {
@@ -2306,6 +2310,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#define generic_file_checkpoint NULL
+
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
 extern int vfs_stat(char __user *, struct kstat *);
--
1.6.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 7/7] [PATCH] Define clone_with_pids syscall
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com):
> From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> Date: Mon, 4 May 2009 01:17:45 -0700
> Subject: [PATCH 7/7] [PATCH] Define clone_with_pids syscall
>
> clone_with_pids() is the same as clone(), except that it takes a
> 'target_pid_set' parameter which lets the caller choose a specific pid
> number for the child process in each of the child process's pid
> namespaces. This system call would be needed to implement
> Checkpoint/Restart (i.e. after a checkpoint, restart a process with
> its original pids).

I think you should point out here that CAP_SYS_ADMIN is needed to use
the syscall, so unprivileged tasks can't use this to try to play games
with /var/run/*.pid.

...

> Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

Acked-by: Serge Hallyn se...@us.ibm.com

-serge
[Devel] [RFC v16][PATCH 03/43] c/r: create syscalls: sys_checkpoint, sys_restart
Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a pid, a file descriptor (for the image file) and
flags as arguments. The pid identifies the top-most (root) task in the
process tree, e.g. the container init: for sys_checkpoint the first
argument identifies the pid of the target container/subtree; for
sys_restart it will identify the pid of the restarting root task.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous to
a fork() call.

We don't use copy_from_user()/copy_to_user() because that requires
holding the entire image in user space, and does not make sense for
restart. Also, we don't use a pipe, pseudo-fs file and the like, because
they work by generating data on demand as the user pulls it (unless the
entire image is buffered in the kernel) and would require more complex
logic. They would also significantly complicate a checkpoint that
includes self.
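The fork()-like return convention described above can be sketched from the caller's side. This is only an illustration under stated assumptions: the enum and function names are hypothetical, and the mapping (negative for an error, zero when execution resumes after a restart, positive for a checkpoint identifier) is assumed from the fork() analogy rather than taken from the patch.

```c
/*
 * Hypothetical classification of a sys_checkpoint()-style return value.
 * Names and the exact mapping are assumptions for illustration only.
 */
enum ckpt_outcome {
	CKPT_OUTCOME_ERROR,		/* negative: the call failed */
	CKPT_OUTCOME_RESTARTED,		/* zero: resuming after a restart */
	CKPT_OUTCOME_CHECKPOINTED	/* positive: value is a checkpoint id */
};

enum ckpt_outcome classify_checkpoint_return(long ret)
{
	if (ret < 0)
		return CKPT_OUTCOME_ERROR;
	if (ret == 0)
		return CKPT_OUTCOME_RESTARTED;
	return CKPT_OUTCOME_CHECKPOINTED;
}
```

A task that checkpoints itself would branch on this result much as a program branches on the result of fork(): one path continues after the checkpoint, the other runs only in the restarted copy.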
Changelog[v16]:
  - Change sys_restart() first argument to be 'pid_t pid'
Changelog[v14]:
  - Change CONFIG_CHEKCPOINT_RESTART to CONFIG_CHECKPOINT (Ingo)
  - Remove line 'def_bool n' (default is already 'n')
  - Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Changelog[v5]:
  - Config is 'def_bool n' by default

Signed-off-by: Oren Laadan or...@cs.columbia.edu
Acked-by: Serge Hallyn se...@us.ibm.com
Signed-off-by: Dave Hansen d...@linux.vnet.ibm.com
---
 arch/x86/Kconfig                   |    4 +++
 arch/x86/include/asm/unistd_32.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 checkpoint/Kconfig                 |   14
 checkpoint/Makefile                |    5
 checkpoint/sys.c                   |   41
 include/linux/syscalls.h           |    2 +
 init/Kconfig                       |    2 +
 kernel/sys_ni.c                    |    4 +++
 9 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a6efe0a..2891a26 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -77,6 +77,10 @@ config STACKTRACE_SUPPORT
 config HAVE_LATENCYTOP_SUPPORT
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if X86_32
+
 config FAST_CMPXCHG_LOCAL
 	bool
 	default y
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..48557e1 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,8 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_checkpoint		335
+#define __NR_restart		336
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index ff5c873..e70b7ee 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,5 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_preadv
 	.long sys_pwritev
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..1761b0a
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,14 @@
+# Architectures should define CHECKPOINT_SUPPORT when they have
+# implemented the hooks for processor state etc. needed by the
+# core checkpoint/restart code.
+
+config CHECKPOINT
+	bool "Enable checkpoint/restart (EXPERIMENTAL)"
+	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..8a32c6f
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..9d4caff
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
[Devel] [RFC v16][PATCH 24/43] c/r: detect resource leaks for whole-container checkpoint
Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
checkpoint, return an error code if the actual objects' counts are
higher, indicating leaks (references to the objects from a task not
being checkpointed). Of course, by this time most of the checkpoint
image has been written out to disk, so this is purely advisory. But
then, it's probably naive to argue that anything more than an advisory
'this went wrong' error code is useful.

The comparison of the objhash user counts to object refcounts as a
basis for checking for leaks comes from Alexey's OpenVZ-based c/r
patchset.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/checkpoint.c    |    8
 checkpoint/objhash.c       |   82 ++--
 include/linux/checkpoint.h |    1 +
 3 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 92f219e..b70adf4 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -578,6 +578,14 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 	if (ret < 0)
 		goto out;
 
+	if (!(ctx->uflags & CHECKPOINT_SUBTREE)) {
+		/* verify that all objects are contained (no leaks) */
+		if (!ckpt_obj_contained(ctx)) {
+			ret = -EBUSY;
+			goto out;
+		}
+	}
+
 	/* on success, return (unique) checkpoint identifier */
 	ctx->crid = atomic_inc_return(&ctx_count);
 	ret = ctx->crid;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index ff9388d..e481911 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -27,19 +27,23 @@ struct ckpt_obj_ops {
 	enum obj_type obj_type;
 	void (*ref_drop)(void *ptr);
 	int (*ref_grab)(void *ptr);
+	int (*ref_users)(void *ptr);
 	int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
 	void *(*restore)(struct ckpt_ctx *ctx);
 };
 
 struct ckpt_obj {
+	int users;
 	int objref;
 	void *ptr;
 	struct ckpt_obj_ops *ops;
 	struct hlist_node hash;
+	struct hlist_node next;
 };
 
 struct ckpt_obj_hash {
 	struct hlist_head *head;
+	struct hlist_head list;
 	int next_free_objref;
 };
 
@@ -53,7 +57,7 @@ void *restore_bad(struct ckpt_ctx *ctx)
 	return ERR_PTR(-EINVAL);
 }
 
-/* helper grab/drop functions: */
+/* helper grab/drop/users functions */
 
 static void obj_no_drop(void *ptr)
 {
@@ -86,6 +90,11 @@ static void obj_file_table_drop(void *ptr)
 	put_files_struct((struct files_struct *) ptr);
 }
 
+static int obj_file_table_users(void *ptr)
+{
+	return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
 static int obj_file_grab(void *ptr)
 {
 	get_file((struct file *) ptr);
@@ -97,6 +106,11 @@ static void obj_file_drop(void *ptr)
 	fput((struct file *) ptr);
 }
 
+static int obj_file_users(void *ptr)
+{
+	return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static int obj_mm_grab(void *ptr)
 {
 	atomic_inc(&((struct mm_struct *) ptr)->mm_users);
@@ -108,6 +122,11 @@ static void obj_mm_drop(void *ptr)
 	mmput((struct mm_struct *) ptr);
 }
 
+static int obj_mm_users(void *ptr)
+{
+	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -131,6 +150,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.obj_type = CKPT_OBJ_FILE_TABLE,
 		.ref_drop = obj_file_table_drop,
 		.ref_grab = obj_file_table_grab,
+		.ref_users = obj_file_table_users,
 		.checkpoint = checkpoint_file_table,
 		.restore = restore_file_table,
 	},
@@ -140,6 +160,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.obj_type = CKPT_OBJ_FILE,
 		.ref_drop = obj_file_drop,
 		.ref_grab = obj_file_grab,
+		.ref_users = obj_file_users,
 		.checkpoint = checkpoint_file,
 		.restore = restore_file,
 	},
@@ -149,6 +170,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.obj_type = CKPT_OBJ_MM,
 		.ref_drop = obj_mm_drop,
 		.ref_grab = obj_mm_grab,
+		.ref_users = obj_mm_users,
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
@@ -201,6 +223,7 @@ int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)
 
 	obj_hash->head = head;
 	obj_hash->next_free_objref = 1;
+	INIT_HLIST_HEAD(&obj_hash->list);
 
 	ctx->obj_hash = obj_hash;
 	return 0;
@@ -259,6 +282,7 @@ static int obj_new(struct ckpt_ctx *ctx, void *ptr, int objref,
 
 	obj->ptr = ptr;
 	obj->ops = ops;
+	obj->users = 2;  /* extra reference that objhash itself takes */
 
 	if (objref) {
 		/* use @obj->objref to index (restart) */
@@ -271,10 +295,12 @@
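The leak check described in this patch boils down to comparing, for every tracked object, the number of users the checkpoint has accounted for against the object's real reference count. A minimal user-space sketch of that comparison follows; the struct and function names are hypothetical stand-ins, not the objhash code itself, and 'users' is assumed to already include the extra reference that the hash takes.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for one objhash entry. */
struct tracked_obj {
	int users;	/* references accounted for by the checkpoint,
			 * including the hash's own extra reference */
	int refcount;	/* the object's actual reference count */
};

/*
 * Return 1 if every object is fully contained, or 0 if some object has
 * more real references than the checkpoint accounted for, which would
 * indicate a reference from a task outside the checkpointed set.
 */
int objs_contained(const struct tracked_obj *objs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (objs[i].refcount > objs[i].users)
			return 0;
	}
	return 1;
}
```

As the description notes, a failed check can only be advisory, because most of the image has already been written by the time the counts are compared.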
[Devel] Re: [PATCH 6/8] cr: checkpoint and restore task credentials
On Tue, May 26, 2009 at 12:33:54PM -0500, Serge E. Hallyn wrote:
> +struct ckpt_hdr_cred {
> +	struct ckpt_hdr h;
> +	__u32 version; /* especially since capability sets might grow */

Oh, no. Image version should be incremented.

> +	__u32 uid, suid, euid, fsuid;
> +	__u32 gid, sgid, egid, fsgid;
> +	__u64 cap_i, cap_p, cap_e;
> +	__u64 cap_x;  /* bounding set ('X') */
> +	__s32 user_ref;
> +	__s32 groupinfo_ref;
> +	__u32 padding;
> +} __attribute__((aligned(8)));
> +
> +struct ckpt_hdr_groupinfo {
> +	struct ckpt_hdr h;
> +	__u32 ngroups;
> +	/*
> +	 * This is followed by ngroups __u32s
> +	 */
> +	__u32 groups[0];
> +} __attribute__((aligned(8)));

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1871,6 +1871,12 @@ static inline struct user_struct *get_uid(struct user_struct *u)
>  extern void free_uid(struct user_struct *);
>  extern void release_uids(struct user_namespace *ns);
>
> +#ifdef CONFIG_CHECKPOINT
> +struct ckpt_ctx;
> +int checkpoint_write_user(struct ckpt_ctx *, struct user_struct *);
> +struct user_struct *restore_read_user(struct ckpt_ctx *);
> +#endif

I'll rip credential stuff from sched.h, better not add more.

> --- a/kernel/groups.c
> +++ b/kernel/groups.c
> @@ -287,3 +288,58 @@ int in_egroup_p(gid_t grp)
>  }
>
>  EXPORT_SYMBOL(in_egroup_p);
> +
> +#ifdef CONFIG_CHECKPOINT
> +int checkpoint_write_groupinfo(struct ckpt_ctx *ctx, struct group_info *g)
> +{
> +	int ret, i, size;
> +	struct ckpt_hdr_groupinfo *h;
> +
> +	size = sizeof(*h) + g->ngroups * sizeof(__u32);
> +	h = ckpt_hdr_get_type(ctx, size, CKPT_HDR_GROUPINFO);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	h->ngroups = g->ngroups;
> +	for (i = 0; i < g->ngroups; i++)
> +		h->groups[i] = GROUP_AT(g, i);
> +
> +	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
> +	ckpt_hdr_put(ctx, h);
> +
> +	return ret;
> +}
> +
> +/*
> + * TODO - switch to reading in blocks, and only return an
> + * error for truly obscene # groups (like 1)
> + */
> +#define CKPT_MAXGROUPS 100
> +#define MAX_GROUPINFO_SIZE (sizeof(*h)+CKPT_MAXGROUPS*sizeof(gid_t))
> +struct group_info *restore_read_groupinfo(struct ckpt_ctx *ctx)
> +{
> +	struct group_info *g;
> +	struct ckpt_hdr_groupinfo *h;
> +	int i;
> +
> +	h = ckpt_read_buf_type(ctx, MAX_GROUPINFO_SIZE, CKPT_HDR_GROUPINFO);
> +	if (IS_ERR(h))
> +		return ERR_PTR(PTR_ERR(h));
> +	if (h->ngroups > CKPT_MAXGROUPS) {
> +		g = ERR_PTR(-EINVAL);
> +		goto out;
> +	}
> +	g = groups_alloc(h->ngroups);
> +	if (!g) {
> +		g = ERR_PTR(-ENOMEM);
> +		goto out;
> +	}
> +	for (i = 0; i < h->ngroups; i++)
> +		GROUP_AT(g, i) = h->groups[i];
> +
> +out:
> +	ckpt_hdr_put(ctx, h);
> +	return g;
> +}

No checks that groups in image are a) sorted, b) ->ngroups is compatible
with object image.
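The two checks requested in the review (the group list is sorted, and ngroups agrees with the size of the object actually read from the image) could be sketched as below. This is a stand-alone illustration, not code from either patch set: the function name, the payload_len parameter and the choice of "non-decreasing" as the meaning of "sorted" are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAXGROUPS 100	/* mirrors CKPT_MAXGROUPS in the quoted patch */

/*
 * Validate a group list read from an image (hypothetical helper).
 * payload_len is the number of bytes that followed the fixed header.
 * Returns 1 if the record looks sane, 0 otherwise.
 */
int groupinfo_image_ok(uint32_t ngroups, size_t payload_len,
		       const uint32_t *groups)
{
	uint32_t i;

	/* refuse absurdly large lists before touching the array */
	if (ngroups > SKETCH_MAXGROUPS)
		return 0;

	/* b) ngroups must match the amount of data in the object */
	if (payload_len != (size_t)ngroups * sizeof(uint32_t))
		return 0;

	/* a) entries must be sorted (assumed: non-decreasing order) */
	for (i = 1; i < ngroups; i++) {
		if (groups[i - 1] > groups[i])
			return 0;
	}
	return 1;
}
```

Doing the length check before the loop keeps a corrupt ngroups value from causing reads beyond the buffer that was actually filled from the image.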
[Devel] Re: [PATCH 0/8] a start to credentials c/r
Quoting Casey Schaufler (ca...@schaufler-ca.com):
Serge E. Hallyn wrote:
Quoting Casey Schaufler (ca...@schaufler-ca.com):
...

Uh, so yes, based on info in the file as well :)

Except of course the LSM would just be fed the checkpointed context and
the checkpoint file context (and can deduce current's context). And
SELinux can do whatever calculations it likes based on the three
contexts and the loaded policy.

Are you at all concerned about the possibility that the policy may have
changed? I can envision scenarios in which it would be impossible for a
process to gain a particular context under current policy, but that a
checkpointed process may have stored away.

Good point. But on the other hand, if the program were running the whole
time, instead of being checkpointed and restarted, then the running
program wouldn't be relabeled when the policy changed, right?

Now if the domain becomes invalid, then presumably the restart would
fail. But if the (source_domain,entry_type)->new_domain set changes from
(root_t,x_entry_t)->x_t to (root_t,x_entry_t)->y_t, a task running as
x_t won't be relabeled to y_t. So I don't think restarting a task which
is checkpointed as x_t, under the x_t domain, is wrong.

and one which determines the task->cred->security field based upon any
of:
  1. current_security() of the task calling sys_restart()
  2. the task->cred->security checkpointed in the ckpt file
  3. the ->security of the checkpoint file

For Smack the correct behavior would be:
  1. for sys_restart() callers without CAP_MAC_ADMIN
  2. for sys_restart() callers with CAP_MAC_ADMIN
  3. never

That makes sense, and is basically analogous (if I'm thinking right) to
how I'm doing capabilities. So the first (authorization hook) for smack
would just always return TRUE?

I suggest that it needs to check for a valid Smack label. Even though
they're just text strings they do have limitations, including size
(> 0, < 24) and character set. A call to smk_import() is the right way
to do it, as it also makes sure the label is in the internal list. If
smk_import() returns NULL something's amiss.

Ok, thanks.

-serge
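To make the "valid Smack label" idea concrete, here is a rough user-space sketch of the kind of sanity check being discussed. It is not a substitute for smk_import(): the function name is hypothetical, the length bound is taken from the size limit mentioned above, and the character-set rule (printable, non-space ASCII with no '/') is an assumption made for illustration; whatever smk_import() enforces remains authoritative.

```c
#include <string.h>

#define SKETCH_SMK_MAXLEN 23	/* "size > 0 and < 24", per the discussion */

/*
 * Hypothetical plausibility check for a label string read from a
 * checkpoint image. Returns 1 if the label looks acceptable, else 0.
 */
int smack_label_plausible(const char *label)
{
	size_t len, i;

	if (label == NULL)
		return 0;

	len = strlen(label);
	if (len == 0 || len > SKETCH_SMK_MAXLEN)
		return 0;

	for (i = 0; i < len; i++) {
		unsigned char c = (unsigned char)label[i];

		/* assumed rule: printable ASCII, no spaces, no '/' */
		if (c <= ' ' || c > '~' || c == '/')
			return 0;
	}
	return 1;
}
```

A restart path would still have to hand the string to the LSM, since only the LSM can confirm that the label is known to its internal list.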
[Devel] Re: [RFC v16][PATCH 41/43] c/r: (s390): expose a constant for the number of words (CRs)
On Wed, May 27, 2009 at 01:33:07PM -0400, Oren Laadan wrote:
> We need to use this value in the checkpoint/restart code and would like
> to have a constant instead of a magic '3'.
>
> Changelog:
>   Mar 30:
>     . Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
>   Mar 03:
>     . Picked up additional use of magic '3' in ptrace.h
>
> Signed-off-by: Dan Smith da...@us.ibm.com
> ---
>  arch/s390/Kconfig |    4
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index 2eca5fe..bf62cad 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
>  config GENERIC_CLOCKEVENTS
>  	def_bool y
>
> +config CHECKPOINT_SUPPORT
> +	bool
> +	default y if 64BIT
> +
>  config GENERIC_BUG
>  	bool
>  	depends on BUG

Changelog and content aren't compatible.
[Devel] Re: [PATCH 18/38] C/R: core stuff
On Tue, 2009-05-26 at 23:35 +0400, Alexey Dobriyan wrote:
> The other part, is that I looked at Oren patchset, found quite a lot of
> suspicious, broken and unclean places and decided that it'd be faster
> to start from scratch because sending patches will overhaul like 85% of
> the code.

I know the feeling. I get sudden urges to rewrite the whole patch set,
but I'm working on getting past them too. :)

As long as we have two patch sets, *nobody* is going to get their
patches in, that's virtually guaranteed. Just look at the poor I/O
controller.

The OpenVZ users are almost certainly the most important container and
c/r users out there today. Meeting their needs with whatever we come up
with should be a top priority and I know I'm counting on you to help us
do that. But, I'm having a really hard time culling the OpenVZ user
needs from your patch set.

If we really need to rewrite 85% of Oren's stuff to meet the OpenVZ
needs, then by all means let's do it. I'm even willing to help you. But,
I honestly don't know what you need. Can we talk about specifics?

-- Dave
[Devel] [RFC v16][PATCH 26/43] splice: added support for pipe-to-pipe splice()
This patch is a modified version of Max Kellerman's patch that allows
splice() between pipes (see http://patchwork.kernel.org/patch/21042).
By refactoring link_pipe(), do_tee() and do_splice_pipes() shrink
considerably.

Below is Max's original description:
--
This patch enables the splice() system call to copy buffers from one
pipe to another. This obvious and trivial use case for splice() was
not supported until now.

It reuses the functions link_ipipe_prep() and link_opipe_prep() from
the tee() system call implementation.
--

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 fs/splice.c |  203 ---
 1 files changed, 166 insertions(+), 37 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 92dd63c..96e0d58 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -903,13 +903,95 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
 EXPORT_SYMBOL(generic_splice_sendpage);
 
 /*
+ * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
+ * location, so checking ->i_pipe is not enough to verify that this is a
+ * pipe.
+ */
+static inline struct pipe_inode_info *pipe_info(struct inode *inode)
+{
+	if (S_ISFIFO(inode->i_mode))
+		return inode->i_pipe;
+
+	return NULL;
+}
+
+static int link_pipe_prep(struct pipe_inode_info *ipipe,
+			  struct pipe_inode_info *opipe,
+			  unsigned int flags);
+static long do_link_pipe(struct pipe_inode_info *ipipe,
+			 struct pipe_inode_info *opipe,
+			 size_t len, unsigned int flags, int move);
+
+/**
+ * Splice pages from one pipe to another.
+ *
+ * @ipipe the input pipe
+ * @opipe the output pipe
+ * @len the maximum number of bytes to move
+ * @flags splice modifier flags
+ */
+static long do_splice_pipes(struct pipe_inode_info *ipipe,
+			    struct pipe_inode_info *opipe,
+			    size_t len, unsigned int flags)
+{
+	int do_wakeup;
+	long ret;
+
+	if (ipipe == opipe)
+		return -EINVAL;
+
+	ret = link_pipe_prep(ipipe, opipe, flags);
+	if (ret < 0)
+		return ret;
+
+	/* both pipes are now locked */
+
+	do_wakeup = ipipe->nrbufs;
+	ret = do_link_pipe(ipipe, opipe, len, flags, 1);
+	do_wakeup -= ipipe->nrbufs;
+
+	pipe_unlock(ipipe);
+	pipe_unlock(opipe);
+
+	if (do_wakeup) {
+		/* at least one buffer was removed from the
+		   input pipe: wake up potential writers */
+		smp_mb();
+		if (waitqueue_active(&ipipe->wait))
+			wake_up_interruptible(&ipipe->wait);
+		kill_fasync(&ipipe->fasync_writers, SIGIO, POLL_OUT);
+	}
+
+	/*
+	 * If we put data in the output pipe, wakeup any potential
+	 * readers.
+	 */
+	if (ret > 0) {
+		smp_mb();
+		if (waitqueue_active(&opipe->wait))
+			wake_up_interruptible(&opipe->wait);
+		kill_fasync(&opipe->fasync_readers, SIGIO, POLL_IN);
+	}
+
+	return ret;
+}
+
+/*
  * Attempt to initiate a splice from pipe to file.
  */
 static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
 			   loff_t *ppos, size_t len, unsigned int flags)
 {
+	struct pipe_inode_info *opipe;
 	int ret;
 
+	opipe = pipe_info(out->f_dentry->d_inode);
+	if (opipe) {
+		if (unlikely(!(out->f_mode & FMODE_WRITE)))
+			return -EBADF;
+		return do_splice_pipes(pipe, opipe, len, flags);
+	}
+
 	if (unlikely(!out->f_op || !out->f_op->splice_write))
 		return -EINVAL;
 
@@ -933,8 +1015,16 @@ static long do_splice_to(struct file *in, loff_t *ppos,
 			 struct pipe_inode_info *pipe, size_t len,
 			 unsigned int flags)
 {
+	struct pipe_inode_info *ipipe;
 	int ret;
 
+	ipipe = pipe_info(in->f_dentry->d_inode);
+	if (ipipe) {
+		if (unlikely(!(in->f_mode & FMODE_READ)))
+			return -EBADF;
+		return do_splice_pipes(ipipe, pipe, len, flags);
+	}
+
 	if (unlikely(!in->f_op || !in->f_op->splice_read))
 		return -EINVAL;
 
@@ -1113,19 +1203,6 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 }
 
 /*
- * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
- * location, so checking ->i_pipe is not enough to verify that this is a
- * pipe.
- */
-static inline struct pipe_inode_info *pipe_info(struct inode *inode)
-{
-	if (S_ISFIFO(inode->i_mode))
-		return inode->i_pipe;
-
-	return NULL;
-}
-
-/*
  * Determine where to splice to/from.
  */
 static long do_splice(struct file *in, loff_t __user *off_in,
@@ -1140,7 +1217,10 @@ static long do_splice(struct file *in, loff_t __user
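Seen from user space, the behaviour this patch is after is simply that splice() accepts a pipe on both the input and the output side. The sketch below shows that call pattern; it assumes a Linux system whose kernel supports pipe-to-pipe splice, and the helper name splice_through_pipes() is made up for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for splice() */
#endif
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write msg into one pipe, splice it into a second pipe, and read it
 * back from there into out. Returns the number of bytes read from the
 * second pipe, or -1 on any failure (including an empty msg).
 */
ssize_t splice_through_pipes(const char *msg, char *out, size_t outlen)
{
	int a[2] = { -1, -1 }, b[2] = { -1, -1 };
	size_t len = strlen(msg);
	ssize_t n = -1;

	if (pipe(a) < 0)
		return -1;
	if (pipe(b) < 0)
		goto out;

	if (write(a[1], msg, len) != (ssize_t)len)
		goto out;

	/* read end of pipe a -> write end of pipe b, no user-space copy */
	n = splice(a[0], NULL, b[1], NULL, len, 0);
	if (n <= 0) {
		n = -1;
		goto out;
	}

	n = read(b[0], out, (size_t)n < outlen ? (size_t)n : outlen);
out:
	if (a[0] >= 0) close(a[0]);
	if (a[1] >= 0) close(a[1]);
	if (b[0] >= 0) close(b[0]);
	if (b[1] >= 0) close(b[1]);
	return n;
}
```

On a kernel without pipe-to-pipe support the splice() call would be expected to fail, in which case the helper returns -1.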
[Devel] [RFC v16][PATCH 01/43] c/r: extend arch_setup_additional_pages()
From: Alexey Dobriyan adobri...@gmail.com

Add a 'start' argument to request that the vDSO be mapped at a specific
place, and fail the operation if it cannot be. This is useful for
restart(2) to ensure that the memory layout is restored exactly as
needed.

Signed-off-by: Alexey Dobriyan adobri...@gmail.com
Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 arch/powerpc/include/asm/elf.h     |    1 +
 arch/powerpc/kernel/vdso.c         |   11 ++-
 arch/s390/include/asm/elf.h        |    2 +-
 arch/s390/kernel/vdso.c            |   13 -
 arch/sh/include/asm/elf.h          |    1 +
 arch/sh/kernel/vsyscall/vsyscall.c |    2 +-
 arch/x86/include/asm/elf.h         |    3 ++-
 arch/x86/vdso/vdso32-setup.c       |    9 +++--
 arch/x86/vdso/vma.c                |    9 +++--
 fs/binfmt_elf.c                    |    2 +-
 10 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index d6b4a12..3946e01 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -271,6 +271,7 @@ extern int ucache_bsize;
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 #define VDSO_AUX_ENT(a,b) NEW_AUX_ENT(a,b);
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index ad06d5c..48beff6 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -184,7 +184,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	struct page **vdso_pagelist;
@@ -211,6 +212,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_base = VDSO32_MBASE;
 #endif
 
+	/* in case restart(2) mandates a specific location */
+	if (start)
+		vdso_base = start;
+
 	current->mm->context.vdso_base = 0;
 
 	/* vDSO has a problem and was disabled, just don't enable it for the
@@ -234,6 +239,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto fail_mmapsem;
 	}
 
+	/* for restart(2), double check that we got we asked for */
+	if (start && vdso_base != start)
+		goto fail_mmapsem;
+
 	/*
 	 * our vma flags don't have VM_WRITE so by default, the process isn't
 	 * allowed to write those pages.
diff --git a/arch/s390/include/asm/elf.h b/arch/s390/include/asm/elf.h
index 74d0bbb..54235bc 100644
--- a/arch/s390/include/asm/elf.h
+++ b/arch/s390/include/asm/elf.h
@@ -205,6 +205,6 @@ do { \
 struct linux_binprm;
 
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
-int arch_setup_additional_pages(struct linux_binprm *, int);
+int arch_setup_additional_pages(struct linux_binprm *, unsigned long, int);
 
 #endif
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 89b2e7f..bab43b3 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -182,7 +182,8 @@ static void vdso_init_cr5(void)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	struct page **vdso_pagelist;
@@ -213,6 +214,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_pages = vdso32_pages;
 #endif
 
+	/* in case restart(2) mandates a specific location */
+	if (start)
+		vdso_base = start;
+
 	/*
 	 * vDSO has a problem and was disabled, just don't enable it for
 	 * the process
@@ -235,6 +240,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto out_up;
 	}
 
+	/* for restart(2), double check that we got we asked for */
+	if (start && vdso_base != start) {
+		rc = -EINVAL;
+		goto out_up;
+	}
+
 	/*
 	 * our vma flags don't have VM_WRITE so by default, the process
 	 * isn't allowed to write those pages.
diff --git a/arch/sh/include/asm/elf.h b/arch/sh/include/asm/elf.h
index ccb1d93..6c27b1f 100644
--- a/arch/sh/include/asm/elf.h
+++ b/arch/sh/include/asm/elf.h
@@ -202,6 +202,7 @@ do {
[Devel] [RFC v16][PATCH 34/43] c/r: save and restore ipc namespace basics
Save and restore the common state (parameters) of the ipc namespace.
Also add logic to iterate through the objects of sysvipc shared memory,
message queues and semaphores. The logic to save and restore the state
of these objects will be added in the next few patches.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/process.c           |    4 -
 include/linux/checkpoint.h     |    5 +-
 include/linux/checkpoint_hdr.h |   22 +
 ipc/checkpoint.c               |  203 ++--
 4 files changed, 220 insertions(+), 14 deletions(-)

diff --git a/checkpoint/process.c b/checkpoint/process.c
index eff3d76..b604a85 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -121,10 +121,8 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 
 	if (ns_flags & CLONE_NEWUTS)
 		ret = checkpoint_uts_ns(ctx, nsproxy->uts_ns);
-#if 0
 	if (!ret && (ns_flags & CLONE_NEWIPC))
 		ret = checkpoint_ipc_ns(ctx, nsproxy->ipc_ns);
-#endif
 
 	/* FIX: Write other namespaces here */
 	return ret;
@@ -472,10 +470,8 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	ckpt_debug("uts ns: %d\n", ret);
 	if (ret < 0)
 		goto out;
-#if 0
 	ret = restore_ipc_ns(ctx, h->ipc_objref, h->flags);
 	ckpt_debug("ipc ns: %d\n", ret);
-#endif
 
 	/* FIX: add more namespaces here */
  out:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 9a7517f..d5498bc 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -85,7 +85,6 @@ extern int restore_restart_block(struct ckpt_ctx *ctx);
 extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_ns(struct ckpt_ctx *ctx);
 
-#if 0
 /* ipc-ns */
 #ifdef CONFIG_SYSVIPC
 extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx,
@@ -98,7 +97,9 @@ static inline int checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 static inline int restore_ipc_ns(struct ckpt_ctx *ctx)
 { return 0; }
 #endif /* CONFIG_SYSVIPC */
-#endif
+
+extern int checkpoint_ipcns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns);
+extern int restore_ipcns(struct ckpt_ctx *ctx);
 
 /* file table */
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 05769f4..406b5d6 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -305,6 +305,28 @@ struct ckpt_hdr_pgarr {
 
 /* ipc commons */
 
+struct ckpt_hdr_ipcns {
+	struct ckpt_hdr h;
+	__u64 shm_ctlmax;
+	__u64 shm_ctlall;
+	__s32 shm_ctlmni;
+
+	__s32 msg_ctlmax;
+	__s32 msg_ctlmnb;
+	__s32 msg_ctlmni;
+
+	__s32 sem_ctl_msl;
+	__s32 sem_ctl_mns;
+	__s32 sem_ctl_opm;
+	__s32 sem_ctl_mni;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc {
+	struct ckpt_hdr h;
+	__u32 ipc_type;
+	__u32 ipc_count;
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_ipc_perms {
 	__s32 id;
 	__u32 key;
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index b7b48b0..436be5e 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -20,15 +20,12 @@
 
 #include "util.h"
 
-int checkpoint_ipcns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
-{
-	return 0;
-}
+/* for ckpt_debug */
+static char *ipc_ind_to_str[] = { "sem", "msg", "shm" };
 
-int restore_ipcns(struct ckpt_ctx *ctx)
-{
-	return 0;
-}
+/**
+ * Checkpoint
+ */
 
 int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 			      struct kern_ipc_perm *perm)
@@ -48,6 +45,82 @@ int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
 	return 0;
 }
 
+static int checkpoint_ipc_any(struct ckpt_ctx *ctx,
+			      struct ipc_namespace *ipc_ns,
+			      int ipc_ind, int ipc_type,
+			      int (*func)(int id, void *p, void *data))
+{
+	struct ckpt_hdr_ipc *h;
+	struct ipc_ids *ipc_ids = &ipc_ns->ids[ipc_ind];
+	int ret = -ENOMEM;
+
+	down_read(&ipc_ids->rw_mutex);
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+	if (!h)
+		goto out;
+
+	h->ipc_type = ipc_type;
+	h->ipc_count = ipc_ids->in_use;
+	ckpt_debug("ipc-%s count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ret = idr_for_each(&ipc_ids->ipcs_idr, func, ctx);
+	ckpt_debug("ipc-%s ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+ out:
+	up_read(&ipc_ids->rw_mutex);
+	return ret;
+}
+
+int checkpoint_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
+{
+	struct ckpt_hdr_ipcns *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+	if (!h)
[Devel] [RFC v16][PATCH 31/43] deferqueue: generic queue to defer work
Add an interface to postpone an action until the end of the entire
checkpoint or restart operation. This is useful when, during the scan of
tasks, an operation cannot be performed in place, to avoid the need for
a second scan.

One use case is when restoring an ipc shared memory region that has been
deleted (but is still attached): during restart it needs to be created,
attached and then deleted. However, creation and attachment are
performed in distinct locations, so deletion cannot be performed on the
spot. Instead, this work (delete) is deferred until later. (This example
is in one of the following patches.)

This interface allows chronic procrastination in the kernel:

deferqueue_create(void):
    Allocates and returns a new deferqueue.

deferqueue_run(deferqueue):
    Executes all the pending works in the queue. Returns the number
    of works executed, or an error upon the first error reported by
    a deferred work.

deferqueue_add(deferqueue, data, size, func, dtor):
    Enqueue a deferred work. @func is the callback function to do
    the work, which will be called with @data as an argument. @size
    tells the size of data. @dtor is a destructor callback that is
    invoked for deferred works remaining in the queue when the queue
    is destroyed. NOTE: for a given deferred work, @dtor is _not_
    called if @func was already called (regardless of the return
    value of the latter).

deferqueue_destroy(deferqueue):
    Free the deferqueue and any queued items while invoking the
    @dtor callback for each queued item.

Why aren't we using the existing kernel workqueue mechanism? We need to
defer the work until the end of the operation: not earlier, since we
need other things to be in place; not later, so as not to block waiting
for it. However, the workqueue schedules the work for 'some time later'.
Also, the kernel workqueue may run in any task context, but we often
require that an operation be run in the context of some specific
restarting task (e.g., restoring IPC state of a certain ipc_ns).
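The semantics listed above can be illustrated with a small user-space sketch. It is not the kernel implementation: the dq_* names are hypothetical, a plain singly linked list stands in for list_head, there is no locking, FIFO execution order is an assumption, and the two demo callbacks exist only to make the behaviour observable.

```c
#include <stdlib.h>
#include <string.h>

typedef int (*dq_func_t)(void *);

struct dq_entry {
	struct dq_entry *next;
	dq_func_t func;		/* the deferred work */
	dq_func_t dtor;		/* called only if func never ran */
	char data[];		/* private copy of the caller's data */
};

struct dq_head {
	struct dq_entry *first;
	struct dq_entry **tail;
};

struct dq_head *dq_create(void)
{
	struct dq_head *h = malloc(sizeof(*h));

	if (h) {
		h->first = NULL;
		h->tail = &h->first;
	}
	return h;
}

/* Queue a work item; the data is copied, so the caller may reuse it. */
int dq_add(struct dq_head *h, const void *data, size_t size,
	   dq_func_t func, dq_func_t dtor)
{
	struct dq_entry *e = malloc(sizeof(*e) + size);

	if (!e)
		return -1;
	e->next = NULL;
	e->func = func;
	e->dtor = dtor;
	if (size)
		memcpy(e->data, data, size);
	*h->tail = e;
	h->tail = &e->next;
	return 0;
}

/* Run queued works in order; return how many ran, or the first error. */
int dq_run(struct dq_head *h)
{
	int nr = 0;

	while (h->first) {
		struct dq_entry *e = h->first;
		int ret;

		h->first = e->next;
		if (!h->first)
			h->tail = &h->first;
		ret = e->func(e->data);
		free(e);	/* func was called, so dtor is skipped */
		if (ret < 0)
			return ret;
		nr++;
	}
	return nr;
}

/* Free the queue, calling dtor for every work that never ran. */
void dq_destroy(struct dq_head *h)
{
	while (h->first) {
		struct dq_entry *e = h->first;

		h->first = e->next;
		if (e->dtor)
			e->dtor(e->data);
		free(e);
	}
	free(h);
}

/* Demo callbacks: sum queued ints, and count dropped works. */
int dq_demo_total;
int dq_demo_dropped;

int dq_demo_add(void *data)
{
	int v;

	memcpy(&v, data, sizeof(v));
	dq_demo_total += v;
	return 0;
}

int dq_demo_drop(void *data)
{
	(void)data;
	dq_demo_dropped++;
	return 0;
}
```

Typical use would be to call dq_add() wherever an action has to wait, call dq_run() once at the end of the operation, and call dq_destroy() on every exit path so that works which never ran are still cleaned up through their destructor.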
Instead, this mechanism is a simple way for the c/r operation as a whole, and later a task in particular, to defer some action until later (but not arbitrarily later) _in the restore_ operation. Signed-off-by: Oren Laadan or...@cs.columbia.edu --- checkpoint/Kconfig |5 ++ include/linux/deferqueue.h | 58 +++ kernel/Makefile|1 + kernel/deferqueue.c| 109 4 files changed, 173 insertions(+), 0 deletions(-) diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig index 1761b0a..53ed6fa 100644 --- a/checkpoint/Kconfig +++ b/checkpoint/Kconfig @@ -2,9 +2,14 @@ # implemented the hooks for processor state etc. needed by the # core checkpoint/restart code. +config DEFERQUEUE + bool + default n + config CHECKPOINT bool Enable checkpoint/restart (EXPERIMENTAL) depends on CHECKPOINT_SUPPORT EXPERIMENTAL + select DEFERQUEUE help Application checkpoint/restart is the ability to save the state of a running application so that it can later resume diff --git a/include/linux/deferqueue.h b/include/linux/deferqueue.h new file mode 100644 index 000..2eb58cf --- /dev/null +++ b/include/linux/deferqueue.h @@ -0,0 +1,58 @@ +/* + * deferqueue.h --- deferred work queue handling for Linux. + */ + +#ifndef _LINUX_DEFERQUEUE_H +#define _LINUX_DEFERQUEUE_H + +#include linux/list.h +#include linux/slab.h +#include linux/spinlock.h + +/* + * This interface allows chronic procrastination in the kernel: + * + * deferqueue_create(void): + * Allocates and returns a new deferqueue. + * + * deferqueue_run(deferqueue): + * Executes all the pending works in the queue. Returns the number + * of works executed, or an error upon the first error reported by + * a deferred work. + * + * deferqueue_add(deferqueue, data, size, func, dtor): + * Enqueue a deferred work. @function is the callback function to + * do the work, which will be called with @data as an argument. + * @size tells the size of data. 
@dtor is a destructor callback + * that is invoked for deferred works remaining in the queue when + * the queue is destroyed. NOTE: for a given deferred work, @dtor + * is _not_ called if @func was already called (regardless of the + * return value of the latter). + * + * deferqueue_destroy(deferqueue): + * Free the deferqueue and any queued items while invoking the + * @dtor callback for each queued item. + */ + + +typedef int (*deferqueue_func_t)(void *); + +struct deferqueue_entry { + deferqueue_func_t function; + deferqueue_func_t destructor; + struct list_head list; + char data[0]; +}; + +struct deferqueue_head { + spinlock_t lock; + struct list_head list; +}; + +struct deferqueue_head *deferqueue_create(void); +void deferqueue_destroy(struct deferqueue_head *head); +int
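The semantics described in the header comment can be tried out in user space. What follows is only a rough sketch, not the code from this patch: the dq_* names are made up, malloc() stands in for kernel allocation, there is no locking, and FIFO execution order is an assumption. It does follow the documented rules that run returns the count or the first error, and that @dtor is skipped for any work whose @func already ran.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int (*dq_func_t)(void *);

struct dq_entry {
	dq_func_t function;
	dq_func_t destructor;
	struct dq_entry *next;
	char data[];                    /* private copy of the caller's data */
};

struct dq_head {
	struct dq_entry *first;
	struct dq_entry **last;         /* where the next entry gets linked */
};

struct dq_head *dq_create(void)
{
	struct dq_head *h = malloc(sizeof(*h));
	if (h) {
		h->first = NULL;
		h->last = &h->first;
	}
	return h;
}

int dq_add(struct dq_head *h, const void *data, size_t size,
	   dq_func_t func, dq_func_t dtor)
{
	struct dq_entry *e = malloc(sizeof(*e) + size);
	if (!e)
		return -1;
	e->function = func;
	e->destructor = dtor;
	e->next = NULL;
	if (size)
		memcpy(e->data, data, size);
	*h->last = e;                   /* append (FIFO order is assumed) */
	h->last = &e->next;
	return 0;
}

/* Returns the number of works executed, or the first error reported. */
int dq_run(struct dq_head *h)
{
	int nr = 0;
	while (h->first) {
		struct dq_entry *e = h->first;
		int ret;
		h->first = e->next;
		if (!h->first)
			h->last = &h->first;
		ret = e->function(e->data);
		free(e);                /* no dtor once func has been called */
		if (ret < 0)
			return ret;
		nr++;
	}
	return nr;
}

/* Frees the queue; dtor runs only for works that never executed. */
void dq_destroy(struct dq_head *h)
{
	while (h->first) {
		struct dq_entry *e = h->first;
		h->first = e->next;
		if (e->destructor)
			e->destructor(e->data);
		free(e);
	}
	free(h);
}

static int total, dtor_calls;

static int add_to_total(void *p)
{
	int v;
	memcpy(&v, p, sizeof(v));
	total += v;
	return 0;
}

static int count_dtor(void *p)
{
	(void)p;
	dtor_calls++;
	return 0;
}

/* Run two works, then leave a third one queued for destroy to clean up. */
int dq_demo(void)
{
	struct dq_head *h = dq_create();
	int a = 2, b = 3, ran;
	assert(h != NULL);
	dq_add(h, &a, sizeof(a), add_to_total, count_dtor);
	dq_add(h, &b, sizeof(b), add_to_total, count_dtor);
	ran = dq_run(h);
	dq_add(h, &a, sizeof(a), add_to_total, count_dtor);
	dq_destroy(h);
	return ran;
}
```

In dq_demo() the first two works execute and are freed without their destructor; only the third, still queued at destroy time, has count_dtor() invoked.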
[Devel] Re: [PATCH 1/1] cr: nsproxy: fix refcounting
thanks, applied.

Serge E. Hallyn wrote:
[This is the fix for the bug I was trying to nail down earlier today]

If more than one restarted task is to share a checkpointed nsproxy, then we must inc the count on the nsproxy for each new task, as switch_task_namespaces() does not do that for us.

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 checkpoint/process.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/checkpoint/process.c b/checkpoint/process.c
index fa166cd..52d2a9c 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -603,8 +603,10 @@ static int restore_ns_obj(struct ckpt_ctx *ctx, int ns_objref)
 	if (IS_ERR(nsproxy))
 		return PTR_ERR(nsproxy);
 
-	if (nsproxy != task_nsproxy(current))
+	if (nsproxy != task_nsproxy(current)) {
+		get_nsproxy(nsproxy);
 		switch_task_namespaces(current, nsproxy);
+	}
 
 	return 0;
 }
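To see why the extra get is needed, here is a small user-space model of the refcounting. It is a sketch under assumptions: struct ns, get_ns(), put_ns() and switch_ns() are invented stand-ins, and switch_ns() is modeled as installing the new pointer without taking a reference while dropping the task's reference on the old one. The initial count of 1 is assumed to belong to whoever holds the restored object (for instance the object hash).

```c
#include <assert.h>
#include <stddef.h>

struct ns { int count; };               /* stand-in for struct nsproxy */
struct task { struct ns *ns; };         /* stand-in for task_struct */

static void get_ns(struct ns *ns) { ns->count++; }
static void put_ns(struct ns *ns) { ns->count--; }

/* Installs new_ns WITHOUT taking a reference; drops the old one. */
static void switch_ns(struct task *t, struct ns *new_ns)
{
	struct ns *old = t->ns;

	t->ns = new_ns;
	if (old)
		put_ns(old);
}

/* Mirrors the fixed restore path: one reference per task that switches. */
static void restore_ns(struct task *t, struct ns *ns)
{
	if (ns != t->ns) {
		get_ns(ns);
		switch_ns(t, ns);
	}
}

int ns_demo(void)
{
	struct ns shared = { 1 };       /* assumed: 1 ref held by the creator */
	struct task a = { NULL }, b = { NULL };

	restore_ns(&a, &shared);
	restore_ns(&b, &shared);
	assert(a.ns == &shared && b.ns == &shared);
	return shared.count;            /* creator + task a + task b */
}
```

Without the get_ns() call the count would stay at 1 while three holders exist, so the first put would free an object that two tasks still use.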
[Devel] [RFC v16][PATCH 40/43] c/r: support semaphore sysv-ipc
Checkpoint of sysvipc semaphores is performed by iterating through all sem objects and dumping the contents of each one. The semaphore array of each sem is dumped with that object. The semaphore array (sem-sem_base) holds an array of 'struct sem', which is a {int, int}. Because this translates into the same format on 32- and 64-bit architectures, the checkpoint format is simply the dump of this array as is. TODO: this patch does not handle semaphore-undo -- this data should be saved per-task while iterating through the tasks. Signed-off-by: Oren Laadan or...@cs.columbia.edu --- include/linux/checkpoint_hdr.h |8 ++ ipc/Makefile |3 +- ipc/checkpoint.c |4 - ipc/checkpoint_sem.c | 220 ipc/util.h |5 + 5 files changed, 235 insertions(+), 5 deletions(-) diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index b05f39c..cd427d8 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -376,6 +376,14 @@ struct ckpt_hdr_ipc_msg_msg { __u32 m_ts; } __attribute__((aligned(8))); +struct ckpt_hdr_ipc_sem { + struct ckpt_hdr h; + struct ckpt_hdr_ipc_perms perms; + __u64 sem_otime; + __u64 sem_ctime; + __u32 sem_nsems; +} __attribute__((aligned(8))); + #define CKPT_TST_OVERFLOW_16(a, b) \ ((sizeof(a) sizeof(b)) ((a) SHORT_MAX)) diff --git a/ipc/Makefile b/ipc/Makefile index ca408ff..81af168 100644 --- a/ipc/Makefile +++ b/ipc/Makefile @@ -9,5 +9,6 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y) obj-$(CONFIG_IPC_NS) += namespace.o obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o -obj-$(CONFIG_CHECKPOINT) += checkpoint.o checkpoint_shm.o checkpoint_msg.o +obj-$(CONFIG_CHECKPOINT) += checkpoint.o \ + checkpoint_shm.o checkpoint_msg.o checkpoint_sem.o diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c index 7eece96..f621226 100644 --- a/ipc/checkpoint.c +++ b/ipc/checkpoint.c @@ -108,12 +108,10 @@ int checkpoint_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns) return ret; 
ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS, CKPT_HDR_IPC_MSG, checkpoint_ipc_msg); -#if 0 /* NEXT FEW PATCHES */ if (ret 0) return ret; ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS, CKPT_HDR_IPC_SEM, checkpoint_ipc_sem); -#endif return ret; } @@ -220,12 +218,10 @@ static int do_restore_ipc_ns(struct ckpt_ctx *ctx) goto out; ret = restore_ipc_any(ctx, IPC_MSG_IDS, CKPT_HDR_IPC_MSG, restore_ipc_msg); -#if 0 /* NEXT FEW PATCHES */ if (ret 0) goto out; ret = restore_ipc_any(ctx, IPC_SEM_IDS, CKPT_HDR_IPC_SEM, restore_ipc_sem); -#endif out: ckpt_hdr_put(ctx, h); return ret; diff --git a/ipc/checkpoint_sem.c b/ipc/checkpoint_sem.c new file mode 100644 index 000..34dea40 --- /dev/null +++ b/ipc/checkpoint_sem.c @@ -0,0 +1,220 @@ +/* + * Checkpoint/restart - dump state of sysvipc sem + * + * Copyright (C) 2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DIPC + +#include linux/mm.h +#include linux/sem.h +#include linux/rwsem.h +#include linux/sched.h +#include linux/syscalls.h +#include linux/nsproxy.h +#include linux/ipc_namespace.h + +#include linux/msg.h /* needed for util.h that uses 'struct msg_msg' */ +#include util.h + +#include linux/checkpoint.h +#include linux/checkpoint_hdr.h + +/ + * ipc checkpoint + */ + +static int fill_ipc_sem_hdr(struct ckpt_ctx *ctx, + struct ckpt_hdr_ipc_sem *h, + struct sem_array *sem) +{ + int ret = 0; + + ipc_lock_by_ptr(sem-sem_perm); + + ret = checkpoint_fill_ipc_perms(h-perms, sem-sem_perm); + if (ret 0) + goto unlock; + + h-sem_otime = sem-sem_otime; + h-sem_ctime = sem-sem_ctime; + h-sem_nsems = sem-sem_nsems; + + unlock: + ipc_unlock(sem-sem_perm); + ckpt_debug(sem: nsems %u\n, h-sem_nsems); + + return ret; +} + +/** + * ckpt_write_sem_array - dump the state of a semaphore array + * @ctx: checkpoint context + * @sem: semphore array + * + * The state of a sempahore is an array of 'struct sem'. This structure + * is {int, int}, which translates to the same format {32 bits, 32 bits} + * on both 32- and 64-bit architectures. So we simply dump the array. + * + * The sem-undo information is not saved
[Devel] [RFC v16][PATCH 18/43] c/r: restore anonymous- and file-mapped- shared memory
The bulk of the work is in ckpt_read_vma(), which has been refactored: the part that create the suitable 'struct file *' for the mapping is now larger and moved to a separate function. What's left is to read the VMA description, get the file pointer, create the mapping, and proceed to read the contents in. Both anonymous shared VMAs that have been read earlier (as indicated by a look up to objhash) and file-mapped shared VMAs are skipped. Anonymous shared VMAs seen for the first time have their contents read in directly to the backing inode, as indexed by the page numbers (as opposed to virtual addresses). Changelog[v14]: - Introduce patch Signed-off-by: Oren Laadan or...@cs.columbia.edu --- checkpoint/memory.c| 66 --- include/linux/checkpoint.h |6 include/linux/mm.h |2 + mm/filemap.c | 13 - mm/shmem.c | 49 5 files changed, 118 insertions(+), 18 deletions(-) diff --git a/checkpoint/memory.c b/checkpoint/memory.c index 2b73abc..c163b76 100644 --- a/checkpoint/memory.c +++ b/checkpoint/memory.c @@ -785,13 +785,36 @@ static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p) return 0; } +static struct page *bring_private_page(unsigned long addr) +{ + struct page *page; + int ret; + + ret = get_user_pages(current, current-mm, addr, 1, 1, 1, page, NULL); + if (ret 0) + page = ERR_PTR(ret); + return page; +} + +static struct page *bring_shared_page(unsigned long idx, struct inode *ino) +{ + struct page *page = NULL; + int ret; + + ret = shmem_getpage(ino, idx, page, SGP_WRITE, NULL); + if (ret 0) + return ERR_PTR(ret); + if (page) + unlock_page(page); + return page; +} + /** * read_pages_contents - read in data of pages in page-array chain * @ctx - restart context */ -static int read_pages_contents(struct ckpt_ctx *ctx) +static int read_pages_contents(struct ckpt_ctx *ctx, struct inode *inode) { - struct mm_struct *mm = current-mm; struct ckpt_pgarr *pgarr; unsigned long *vaddrs; char *buf; @@ -801,17 +824,22 @@ static int read_pages_contents(struct 
ckpt_ctx *ctx) if (!buf) return -ENOMEM; - down_read(mm-mmap_sem); + down_read(current-mm-mmap_sem); list_for_each_entry_reverse(pgarr, ctx-pgarr_list, list) { vaddrs = pgarr-vaddrs; for (i = 0; i pgarr-nr_used; i++) { struct page *page; _ckpt_debug(CKPT_DPAGE, got page %#lx\n, vaddrs[i]); - ret = get_user_pages(current, mm, vaddrs[i], -1, 1, 1, page, NULL); - if (ret 0) + if (inode) + page = bring_shared_page(vaddrs[i], inode); + else + page = bring_private_page(vaddrs[i]); + + if (IS_ERR(page)) { + ret = PTR_ERR(page); goto out; + } ret = restore_read_page(ctx, page, buf); page_cache_release(page); @@ -822,14 +850,15 @@ static int read_pages_contents(struct ckpt_ctx *ctx) } out: - up_read(mm-mmap_sem); + up_read(current-mm-mmap_sem); kfree(buf); return 0; } /** - * restore_memory_contents - restore contents of a VMA with private memory + * restore_memory_contents - restore contents of a memory region * @ctx - restart context + * @inode - backing inode * * Reads a header that specifies how many pages will follow, then reads * a list of virtual addresses into ctx-pgarr_list page-array chain, @@ -837,7 +866,7 @@ static int read_pages_contents(struct ckpt_ctx *ctx) * these steps until reaching a header specifying 0 pages, which marks * the end of the contents. */ -static int restore_memory_contents(struct ckpt_ctx *ctx) +int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode) { struct ckpt_hdr_pgarr *h; unsigned long nr_pages; @@ -864,7 +893,7 @@ static int restore_memory_contents(struct ckpt_ctx *ctx) ret = read_pages_vaddrs(ctx, nr_pages); if (ret 0) break; - ret = read_pages_contents(ctx); + ret = read_pages_contents(ctx, inode); if (ret 0) break; pgarr_reset_all(ctx); @@ -922,9 +951,9 @@ static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags) * @file - file to map (NULL for anonymous) * @h - vma header data */ -static unsigned long generic_vma_restore(struct mm_struct *mm, -
[Devel] [RFC v16][PATCH 10/43] c/r: restore open file descriptors
For each fd read 'struct ckpt_hdr_file_desc' and lookup objref in the hash table; If not found in the hash table, (first occurence), read in 'struct ckpt_hdr_file', create a new file and register in the hash. Otherwise attach the file pointer from the hash as an FD. Changelog[v16]: - Reorder patch (move earlier in series) - Handle shared files_struct objects Changelog[v14]: - Introduce a per file-type restore() callback - Revert change to pr_debug(), back to ckpt_debug() - Rename: restore_files() = restore_fd_table() - Rename: ckpt_read_fd_data() = restore_file() - Check whether calls to ckpt_hbuf_get() fail - Discard field 'hh-parent' Changelog[v12]: - Replace obsolete ckpt_debug() with pr_debug() Changelog[v6]: - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put() (even though it's not really needed) Signed-off-by: Oren Laadan or...@cs.columbia.edu --- checkpoint/files.c | 285 checkpoint/objhash.c |2 + checkpoint/process.c | 20 +++ checkpoint/restart.c |9 ++ include/linux/checkpoint.h |5 + 5 files changed, 321 insertions(+), 0 deletions(-) diff --git a/checkpoint/files.c b/checkpoint/files.c index d10dfb6..d7583d3 100644 --- a/checkpoint/files.c +++ b/checkpoint/files.c @@ -16,6 +16,8 @@ #include linux/sched.h #include linux/file.h #include linux/fdtable.h +#include linux/fsnotify.h +#include linux/syscalls.h #include linux/checkpoint.h #include linux/checkpoint_hdr.h @@ -309,3 +311,286 @@ int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t) return objref; } + +/** + * Restart + */ + +/** + * read_open_fname - read a file name and open a file + * @ctx: checkpoint context + * @flags: file flags + * @mode: file mode + */ +static struct file *read_open_fname(struct ckpt_ctx *ctx, int flags, int mode) +{ + struct ckpt_hdr *h; + struct file *file; + char *fname; + + h = ckpt_read_buf_type(ctx, PATH_MAX, CKPT_HDR_FILE_NAME); + if (IS_ERR(h)) + return (struct file *) h; + fname = (char *) (h + 1); + ckpt_debug(fname '%s' flags %#x 
mode %#x\n, fname, flags, mode); + + file = filp_open(fname, flags, mode); + ckpt_hdr_put(ctx, h); + return file; +} + +static int close_all_fds(struct files_struct *files) +{ + int *fdtable; + int nfds; + + nfds = scan_fds(files, fdtable); + if (nfds 0) + return nfds; + while (nfds--) + sys_close(fdtable[nfds]); + kfree(fdtable); + return 0; +} + +/** + * attach_file - attach a lonely file ptr to a file descriptor + * @file: lonely file pointer + */ +static int attach_file(struct file *file) +{ + int fd = get_unused_fd_flags(0); + + if (fd = 0) { + get_file(file); + fsnotify_open(file-f_path.dentry); + fd_install(fd, file); + } + return fd; +} + +#define CKPT_SETFL_MASK \ + (O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME) + +int restore_file_common(struct ckpt_ctx *ctx, struct file *file, + struct ckpt_hdr_file *h) +{ + int ret; + + /* FIX: need to restore uid, gid, owner etc */ + + /* safe to set 1st arg (fd) to 0, as command is F_SETFL */ + ret = vfs_fcntl(0, F_SETFL, h-f_flags CKPT_SETFL_MASK, file); + if (ret 0) + goto out; + + ret = vfs_llseek(file, h-f_pos, SEEK_SET); + if (ret == -ESPIPE) /* ignore error on non-seekable files */ + ret = 0; + out: + return ret; +} + +static struct file *generic_file_restore(struct ckpt_ctx *ctx, +struct ckpt_hdr_file *ptr) +{ + struct file *file; + int ret; + + if (ptr-h.type != CKPT_HDR_FILE || + ptr-h.len != sizeof(*ptr) || ptr-f_type != CKPT_FILE_GENERIC) + return ERR_PTR(-EINVAL); + + file = read_open_fname(ctx, ptr-f_flags, ptr-f_mode); + if (IS_ERR(file)) + return file; + + ret = restore_file_common(ctx, file, ptr); + if (ret 0) { + fput(file); + file = ERR_PTR(ret); + } + return file; +} + +struct restore_file_ops { + char *file_name; + enum file_type file_type; + struct file * (*restore) (struct ckpt_ctx *ctx, + struct ckpt_hdr_file *ptr); +}; + +static struct restore_file_ops restore_file_ops[] = { + /* ignored file */ + { + .file_name = IGNORE, + .file_type = CKPT_FILE_IGNORE, + .restore = NULL, + 
}, + /* regular file/directory */ + { + .file_name = GENERIC, + .file_type = CKPT_FILE_GENERIC, + .restore = generic_file_restore, +
[Devel] [RFC v16][PATCH 38/43] c/r: support message-queues sysv-ipc
Checkpoint of sysvipc message-queues is performed by iterating through all 'msq' objects and dumping the contents of each one. The message queued on each 'msq' are dumped with that object. Message of a specific queue get written one by one. The queue lock cannot be held while dumping them, but the loop must be protected from someone (who ?) writing or reading. To do that we grab the lock, then hijack the entire chain of messages from the queue, drop the lock, and then safely dump them in a loop. Finally, with the lock held, we re-attach the chain while verifying that there isn't other (new) data on that queue. Writing the message contents themselves is straight forward. The code is similar to that in ipc/msgutil.c, the main difference being that we deal with kernel memory and not user memory. Signed-off-by: Oren Laadan or...@cs.columbia.edu --- include/linux/checkpoint_hdr.h | 21 +++- ipc/Makefile |2 +- ipc/checkpoint.c |6 +- ipc/checkpoint_msg.c | 362 ipc/util.h |3 + 5 files changed, 389 insertions(+), 5 deletions(-) diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index f7e331d..b05f39c 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -73,6 +73,7 @@ enum { CKPT_HDR_IPC = 501, CKPT_HDR_IPC_SHM, CKPT_HDR_IPC_MSG, + CKPT_HDR_IPC_MSG_MSG, CKPT_HDR_IPC_SEM, CKPT_HDR_TAIL = 9001, @@ -356,6 +357,25 @@ struct ckpt_hdr_ipc_shm { __u32 objref; } __attribute__((aligned(8))); +struct ckpt_hdr_ipc_msg { + struct ckpt_hdr h; + struct ckpt_hdr_ipc_perms perms; + __u64 q_stime; + __u64 q_rtime; + __u64 q_ctime; + __u64 q_cbytes; + __u64 q_qnum; + __u64 q_qbytes; + __s32 q_lspid; + __s32 q_lrpid; +} __attribute__((aligned(8))); + +struct ckpt_hdr_ipc_msg_msg { + struct ckpt_hdr h; + __s32 m_type; + __u32 m_ts; +} __attribute__((aligned(8))); + #define CKPT_TST_OVERFLOW_16(a, b) \ ((sizeof(a) sizeof(b)) ((a) SHORT_MAX)) @@ -366,5 +386,4 @@ struct ckpt_hdr_ipc_shm { #define CKPT_TST_OVERFLOW_64(a, b) \ ((sizeof(a) 
sizeof(b)) ((a) LONG_MAX)) - #endif /* _CHECKPOINT_CKPT_HDR_H_ */ diff --git a/ipc/Makefile b/ipc/Makefile index 7e23683..ca408ff 100644 --- a/ipc/Makefile +++ b/ipc/Makefile @@ -9,5 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y) obj-$(CONFIG_IPC_NS) += namespace.o obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o -obj-$(CONFIG_CHECKPOINT) += checkpoint.o checkpoint_shm.o +obj-$(CONFIG_CHECKPOINT) += checkpoint.o checkpoint_shm.o checkpoint_msg.o diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c index 25d2277..7eece96 100644 --- a/ipc/checkpoint.c +++ b/ipc/checkpoint.c @@ -104,11 +104,11 @@ int checkpoint_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns) ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS, CKPT_HDR_IPC_SHM, checkpoint_ipc_shm); -#if 0 /* NEXT FEW PATCHES */ if (ret 0) return ret; ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS, CKPT_HDR_IPC_MSG, checkpoint_ipc_msg); +#if 0 /* NEXT FEW PATCHES */ if (ret 0) return ret; ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS, @@ -216,11 +216,11 @@ static int do_restore_ipc_ns(struct ckpt_ctx *ctx) ret = restore_ipc_any(ctx, IPC_SHM_IDS, CKPT_HDR_IPC_SHM, restore_ipc_shm); -#if 0 /* NEXT FEW PATCHES */ if (ret 0) goto out; - ret = ckpt_read_ipc_any(ctx, IPC_MSG_IDS, + ret = restore_ipc_any(ctx, IPC_MSG_IDS, CKPT_HDR_IPC_MSG, restore_ipc_msg); +#if 0 /* NEXT FEW PATCHES */ if (ret 0) goto out; ret = restore_ipc_any(ctx, IPC_SEM_IDS, diff --git a/ipc/checkpoint_msg.c b/ipc/checkpoint_msg.c new file mode 100644 index 000..a988a9e --- /dev/null +++ b/ipc/checkpoint_msg.c @@ -0,0 +1,362 @@ +/* + * Checkpoint/restart - dump state of sysvipc msg + * + * Copyright (C) 2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DIPC + +#include linux/mm.h +#include linux/msg.h +#include linux/rwsem.h +#include linux/sched.h +#include linux/syscalls.h +#include linux/nsproxy.h +#include linux/ipc_namespace.h + +#include util.h + +#include linux/checkpoint.h +#include linux/checkpoint_hdr.h + +/ + * ipc checkpoint + */ + +static int fill_ipc_msg_hdr(struct ckpt_ctx *ctx, +
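The detach, dump, re-attach sequence from the description can be modeled in a few lines of user-space C. This is only an illustration under assumptions: struct msg, struct queue and the q_lock()/q_unlock() helpers are hypothetical, the "lock" is a flag that merely checks balanced use, and returning -1 when new messages appeared is one possible way to report the failed verification.

```c
#include <assert.h>
#include <stddef.h>

struct msg { long type; struct msg *next; };
struct queue { int locked; struct msg *head; };

/* Placeholder lock: only checks that lock/unlock calls are balanced. */
static void q_lock(struct queue *q)   { assert(!q->locked); q->locked = 1; }
static void q_unlock(struct queue *q) { assert(q->locked);  q->locked = 0; }

static long type_sum;
static void dump_msg(const struct msg *m) { type_sum += m->type; }

/* Returns the number of messages dumped, or -1 if new data showed up. */
int dump_queue(struct queue *q)
{
	struct msg *chain, *m, *tail = NULL;
	int n = 0, ret;

	q_lock(q);
	chain = q->head;                /* hijack the entire chain */
	q->head = NULL;
	q_unlock(q);

	for (m = chain; m; m = m->next) {       /* lock is not held here */
		dump_msg(m);
		tail = m;
		n++;
	}

	q_lock(q);
	ret = q->head ? -1 : n;         /* anything queued in the meantime? */
	if (tail) {
		tail->next = q->head;   /* re-attach ahead of any newcomers */
		q->head = chain;
	}
	q_unlock(q);
	return ret;
}

int msg_demo(void)
{
	struct msg c = { 30, NULL }, b = { 20, &c }, a = { 10, &b };
	struct queue q = { 0, &a };
	int n = dump_queue(&q);

	assert(q.head == &a);           /* chain is back on the queue */
	return n;
}
```

The point of the pattern is that the potentially slow dump loop runs on a private chain, so the queue lock is held only for two short pointer swaps.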
[Devel] [RFC v16][PATCH 22/43] c/r: checkpoint multiple processes
Checkpointing of multiple processes works by recording the task tree structure below a given root task. The root task is expected to be a container init, in which case an entire container is checkpointed. However, passing CHECKPOINT_SUBTREE to checkpoint(2) relaxes this requirement and allows checkpointing a subtree of processes starting from the root task.

For a given root task, do a DFS scan of the task tree and collect the tasks into an array (keeping a reference to each task). Using DFS simplifies the recreation of tasks either in user space or kernel space. For each task collected, test if it can be checkpointed, and save its pid, tgid, and ppid.

The actual work is divided into two passes: a first scan counts the tasks, then memory is allocated and a second scan fills the array.

Whether checkpoint and restart require CAP_SYS_ADMIN is determined by the sysctl 'ckpt_unpriv_allowed': if 1, the regular permission checks are relied upon to prevent privilege escalation; if 0, unprivileged users are kept from exploiting any privilege escalation bugs.

The logic is suitable for creation of processes during restart either in userspace or by the kernel.

Currently we ignore threads and zombies.
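The two-pass scheme (count, allocate, fill) can be sketched in user space as follows. This is not the patch's code: struct task here is a toy with hypothetical child/sibling links, the walk is recursive for brevity, and no locking or task references are modeled.

```c
#include <assert.h>
#include <stdlib.h>

struct task {
	int pid;
	struct task *child;             /* first child */
	struct task *sibling;           /* next sibling */
};

/* Pre-order DFS. With arr == NULL it only counts; otherwise it fills. */
static int walk(struct task *t, struct task **arr, int n)
{
	struct task *c;

	if (arr)
		arr[n] = t;
	n++;
	for (c = t->child; c; c = c->sibling)
		n = walk(c, arr, n);
	return n;
}

struct task **build_tasks_arr(struct task *root, int *nr)
{
	struct task **arr;

	*nr = walk(root, NULL, 0);              /* pass 1: count */
	arr = malloc(*nr * sizeof(*arr));
	if (arr)
		walk(root, arr, 0);             /* pass 2: fill */
	return arr;
}

/* Tree: 1 -> {2 -> {4}, 3}. Encodes the visit order as decimal digits. */
int tree_demo(void)
{
	struct task t4 = { 4, NULL, NULL };
	struct task t3 = { 3, NULL, NULL };
	struct task t2 = { 2, &t4, &t3 };
	struct task t1 = { 1, &t2, NULL };
	struct task **arr;
	int nr, i, order = 0;

	arr = build_tasks_arr(&t1, &nr);
	assert(arr != NULL && nr == 4);
	for (i = 0; i < nr; i++)
		order = order * 10 + arr[i]->pid;
	free(arr);
	return order;
}
```

Because parents always precede their children in the resulting array, a restart can create tasks in array order and find each parent already in place.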
Changelog[v16]: - CHECKPOINT_SUBTREE flags allows subtree (not whole container) - sysctl variable 'ckpt_unpriv_allowed' controls needed privileges Changelog[v14]: - Refuse non-self checkpoint if target task isn't frozen - Refuse checkpoint (for now) if task is ptraced - Revert change to pr_debug(), back to ckpt_debug() - Use only unsigned fields in checkpoint headers - Check retval of ckpt_tree_count_tasks() in ckpt_build_tree() - Discard 'h.parent' field - Check whether calls to ckpt_hbuf_get() fail - Disallow threads or siblings to container init Changelog[v13]: - Release tasklist_lock in error path in ckpt_tree_count_tasks() - Use separate index for 'tasks_arr' and 'hh' in ckpt_write_pids() Changelog[v12]: - Replace obsolete ckpt_debug() with pr_debug() Signed-off-by: Oren Laadan or...@cs.columbia.edu --- checkpoint/checkpoint.c | 237 -- checkpoint/restart.c |2 +- checkpoint/sys.c | 33 +- include/linux/checkpoint_hdr.h | 16 +++- include/linux/checkpoint_types.h | 16 ++- kernel/sysctl.c | 17 +++ 6 files changed, 305 insertions(+), 16 deletions(-) diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index 3999d80..92f219e 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -249,8 +249,27 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx) return ret; } +/* dump all tasks in ctx-tasks_arr[] */ +static int checkpoint_all_tasks(struct ckpt_ctx *ctx) +{ + int n, ret = 0; + + for (n = 0; n ctx-nr_tasks; n++) { + ckpt_debug(dumping task #%d\n, n); + ret = checkpoint_task(ctx, ctx-tasks_arr[n]); + if (ret 0) + break; + } + + return ret; +} + static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) { + struct task_struct *root = ctx-root_task; + + ckpt_debug(check %d\n, task_pid_nr_ns(t, ctx-root_nsproxy-pid_ns)); + if (t-state == TASK_DEAD) { pr_warning(c/r: task %d is TASK_DEAD\n, task_pid_vnr(t)); return -EAGAIN; @@ -276,14 +295,211 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) return 
-EBUSY; } + /* +* FIX: for now, disallow siblings of container init created +* via CLONE_PARENT (unclear if they will remain possible) +*/ + if (ctx-root_init t != root t-tgid != root-tgid + t-real_parent == root-real_parent) { + __ckpt_write_err(ctx, task %d (%s) is sibling of root, +task_pid_vnr(t), t-comm); + return -EINVAL; + } + + /* FIX: change this when namespaces are added */ + if (task_nsproxy(t) != ctx-root_nsproxy) + return -EPERM; + + return 0; +} + +#define CKPT_HDR_PIDS_CHUNK256 + +static int checkpoint_pids(struct ckpt_ctx *ctx) +{ + struct ckpt_hdr_pids *h; + struct pid_namespace *ns; + struct task_struct *task; + struct task_struct **tasks_arr; + int nr_tasks, n, pos = 0, ret = 0; + + ns = ctx-root_nsproxy-pid_ns; + tasks_arr = ctx-tasks_arr; + nr_tasks = ctx-nr_tasks; + BUG_ON(nr_tasks = 0); + + ret = ckpt_write_obj_type(ctx, NULL, + sizeof(*h) * nr_tasks, + CKPT_HDR_BUFFER); + if (ret 0) + return ret; + + h = ckpt_hdr_get(ctx, sizeof(*h) * CKPT_HDR_PIDS_CHUNK); + if (!h) + return -ENOMEM; + + do { + rcu_read_lock(); + for (n = 0; n min(nr_tasks, CKPT_HDR_PIDS_CHUNK); n++) {
[Devel] [RFC v16][PATCH 17/43] c/r: dump anonymous- and file-mapped- shared memory
We now handle anonymous and file-mapped shared memory. Support for IPC shared memory requires support for IPC first. We extend ckpt_write_vma() to detect shared memory VMAs and handle them separately from private memory.

There is not much to do for file-mapped shared memory, except to force msync() on the region to ensure that the file system is consistent with the checkpoint image. Use our internal type CKPT_VMA_SHM_FILE.

Anonymous shared memory is always backed by an inode in the shmem filesystem. We use that inode to look up the VMA in the objhash and register it if not found (on first encounter). In this case, the type of the VMA is CKPT_VMA_SHM_ANON, and we dump the contents. On the other hand, if it is found there, we must have already saved it before, so we change the type to CKPT_VMA_SHM_ANON_SKIP and skip it.

To dump the contents of a shmem VMA, we loop through the pages of the inode in the shmem filesystem, and dump the contents of each dirty (allocated) page; unallocated pages must be clean.

Note that we save the original size of a shmem VMA because it may have been re-mapped partially. The format itself remains the same as for private VMAs, except that instead of addresses we record _indices_ (page nr) into the backing inode.
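The first-encounter logic for anonymous shared memory might look roughly like this in isolation. It is a toy under stated assumptions: the "objhash" is a small linear table rather than a real hash, obj_lookup_add() and classify_shm_vma() are invented names, and the two enum values are redefined locally just to mirror the type names used in the description.

```c
#include <assert.h>
#include <stddef.h>

enum vma_type { CKPT_VMA_SHM_ANON, CKPT_VMA_SHM_ANON_SKIP };

#define OBJHASH_MAX 16
struct objhash { const void *ptr[OBJHASH_MAX]; int used; };

/*
 * Look up ptr and register it on first encounter. Returns its objref
 * (index + 1) and sets *first, or returns -1 if the table is full.
 */
static int obj_lookup_add(struct objhash *h, const void *ptr, int *first)
{
	int i;

	for (i = 0; i < h->used; i++) {
		if (h->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (h->used == OBJHASH_MAX)
		return -1;
	h->ptr[h->used++] = ptr;
	*first = 1;
	return h->used;
}

/* Decide whether a shmem VMA backed by inode needs its contents dumped. */
static int classify_shm_vma(struct objhash *h, const void *inode,
			    enum vma_type *type)
{
	int first;
	int objref = obj_lookup_add(h, inode, &first);

	if (objref < 0)
		return objref;
	*type = first ? CKPT_VMA_SHM_ANON : CKPT_VMA_SHM_ANON_SKIP;
	return objref;
}

int shm_demo(void)
{
	struct objhash h = { { NULL }, 0 };
	int inode_a, inode_b;           /* stand-ins for two shmem inodes */
	enum vma_type t1, t2, t3;
	int r1 = classify_shm_vma(&h, &inode_a, &t1);
	int r2 = classify_shm_vma(&h, &inode_a, &t2);
	int r3 = classify_shm_vma(&h, &inode_b, &t3);

	assert(t1 == CKPT_VMA_SHM_ANON && t2 == CKPT_VMA_SHM_ANON_SKIP);
	assert(t3 == CKPT_VMA_SHM_ANON);
	return r1 * 100 + r2 * 10 + r3;
}
```

Two VMAs mapping the same inode get the same objref, and only the first one is marked for a contents dump.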
Signed-off-by: Oren Laadan or...@cs.columbia.edu --- checkpoint/memory.c| 143 +++ checkpoint/objhash.c | 29 include/linux/checkpoint.h | 15 +++-- include/linux/checkpoint_hdr.h |8 ++ mm/filemap.c | 39 +++- mm/mmap.c |2 +- mm/shmem.c | 33 + 7 files changed, 246 insertions(+), 23 deletions(-) diff --git a/checkpoint/memory.c b/checkpoint/memory.c index 99bafaa..2b73abc 100644 --- a/checkpoint/memory.c +++ b/checkpoint/memory.c @@ -21,6 +21,7 @@ #include linux/pagemap.h #include linux/mm_types.h #include linux/proc_fs.h +#include linux/swap.h #include linux/checkpoint.h #include linux/checkpoint_hdr.h @@ -281,6 +282,54 @@ static struct page *consider_private_page(struct vm_area_struct *vma, } /** + * consider_shared_page - return page pointer for dirty pages + * @ino - inode of shmem object + * @idx - page index in shmem object + * + * Looks up the page that corresponds to the index in the shmem object, + * and returns the page if it was modified (and grabs a reference to it), + * or otherwise returns NULL (or error). + */ +static struct page *consider_shared_page(struct inode *ino, unsigned long idx) +{ + struct page *page = NULL; + int ret; + + /* +* Inspired by do_shmem_file_read(): very simplified version. +* +* FIXME: consolidate with do_shmem_file_read() +*/ + + ret = shmem_getpage(ino, idx, page, SGP_READ, NULL); + if (ret 0) + return ERR_PTR(ret); + + /* +* Only care about dirty pages; shmem_getpage() only returns +* pages that have been allocated, so they must be dirty. The +* pages returned are locked and referenced. +*/ + + if (page) { + unlock_page(page); + /* +* If users can be writing to this page using arbitrary +* virtual addresses, take care about potential aliasing +* before reading the page on the kernel side. +*/ + if (mapping_writably_mapped(ino-i_mapping)) + flush_dcache_page(page); + /* +* Mark the page accessed if we read the beginning. 
+*/ + mark_page_accessed(page); + } + + return page; +} + +/** * vma_fill_pgarr - fill a page-array with addr/page tuples * @ctx - checkpoint context * @vma - vma to scan @@ -289,17 +338,16 @@ static struct page *consider_private_page(struct vm_area_struct *vma, * Returns the number of pages collected */ static int vma_fill_pgarr(struct ckpt_ctx *ctx, - struct vm_area_struct *vma, - unsigned long *start) + struct vm_area_struct *vma, struct inode *inode, + unsigned long *start, unsigned long end) { - unsigned long end = vma-vm_end; unsigned long addr = *start; struct ckpt_pgarr *pgarr; int nr_used; int cnt = 0; /* this function is only for private memory (anon or file-mapped) */ - BUG_ON(vma-vm_flags (VM_SHARED | VM_MAYSHARE)); + BUG_ON(inode vma); do { pgarr = pgarr_current(ctx); @@ -311,7 +359,11 @@ static int vma_fill_pgarr(struct ckpt_ctx *ctx, while (addr end) { struct page *page; - page = consider_private_page(vma, addr); + if (vma) + page = consider_private_page(vma, addr); + else +
[Devel] [RFC v16][PATCH 21/43] c/r: restart-blocks
(Paraphrasing what's said in this message: http://lists.openwall.net/linux-kernel/2007/12/05/64)

Restart blocks are callbacks used to cause a system call to be restarted with the arguments specified in the system call restart block. They are useful for system calls that are not idempotent, i.e. where the argument(s) might be a relative timeout and some adjustment is required when restarting the system call. The mechanism relies on the system call itself to set up its restart point and the argument save area. They are rare: an actual signal would turn that into an EINTR. The only case that should ever trigger this is some kernel action that interrupts the system call, but does not actually result in any user-visible state changes, like freeze and thaw.

So restart blocks are about the time remaining for the system call to sleep/wait. Generally in c/r, there are two possible time models that we can follow: absolute and relative. Here, I chose to save the relative timeout, measured from the beginning of the checkpoint. The time when the checkpoint (and restart) begins is also saved. This information is sufficient to restart in either model (absolute or relative). Which model to use should eventually be a per-application choice (and possibly configurable via cradvise() or some such). For now, we adopt the relative model, namely, at restart the timeout is set relative to the beginning of the restart.

To checkpoint, we check if a task has a valid restart block, and if so we save the *remaining* time that it has to wait/sleep, and the type of the restart block.

To restart, we fill in the data required at the proper place in the thread information. If the system call returns an error (possibly -ERESTARTSYS, for example), we not only use that error as our own return value, but also arrange for the task to execute the signal handler (by faking a signal). The handler, in turn, already has the code to handle these restart requests gracefully.
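The arithmetic behind the two time models is small enough to write down. This is a sketch with hypothetical names (struct saved_timeout, save_timeout(), restore_relative(), restore_absolute()); times are plain 64-bit counters in arbitrary units, and already-expired timeouts are not treated specially.

```c
#include <assert.h>
#include <stdint.h>

struct saved_timeout {
	int64_t ckpt_begin;     /* time at which the checkpoint began */
	int64_t remaining;      /* expiry measured from ckpt_begin */
};

/* Checkpoint side: store the timeout relative to the checkpoint start. */
static struct saved_timeout save_timeout(int64_t expire, int64_t ckpt_begin)
{
	struct saved_timeout s;

	s.ckpt_begin = ckpt_begin;
	s.remaining = expire - ckpt_begin;
	return s;
}

/* Relative model: the task waits as long as it still had to wait. */
static int64_t restore_relative(const struct saved_timeout *s,
				int64_t restart_begin)
{
	return restart_begin + s->remaining;
}

/* Absolute model: the original expiry time is reconstructed as is. */
static int64_t restore_absolute(const struct saved_timeout *s)
{
	return s->ckpt_begin + s->remaining;
}

int64_t timeout_demo_relative(void)
{
	struct saved_timeout s = save_timeout(1500, 1000);

	return restore_relative(&s, 9000);
}

int64_t timeout_demo_absolute(void)
{
	struct saved_timeout s = save_timeout(1500, 1000);

	return restore_absolute(&s);
}
```

Saving both the checkpoint start time and the remaining time is what keeps either model available at restart: the relative one needs only the remainder, the absolute one needs both.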
Signed-off-by: Oren Laadan or...@cs.columbia.edu --- arch/x86/include/asm/checkpoint_hdr.h |1 - arch/x86/mm/checkpoint.c | 10 +- checkpoint/checkpoint.c |1 + checkpoint/process.c | 226 + checkpoint/restart.c | 35 +- checkpoint/sys.c |1 + include/linux/checkpoint.h|4 + include/linux/checkpoint_hdr.h| 22 +++ include/linux/checkpoint_types.h |3 + 9 files changed, 293 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h index cf90170..ee23df9 100644 --- a/arch/x86/include/asm/checkpoint_hdr.h +++ b/arch/x86/include/asm/checkpoint_hdr.h @@ -57,7 +57,6 @@ struct ckpt_hdr_header_arch { struct ckpt_hdr_thread { struct ckpt_hdr h; - /* FIXME: restart blocks */ __u16 gdt_entry_tls_entries; __u16 sizeof_tls_array; __u16 ntls; /* number of TLS entries to follow */ diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c index c781416..7cd7494 100644 --- a/arch/x86/mm/checkpoint.c +++ b/arch/x86/mm/checkpoint.c @@ -63,13 +63,9 @@ int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t) * FIXME: the TLS descriptors in the GDT should be called out and * not tied to the in-kernel representation. */ - ret = ckpt_write_obj_type(ctx, thread-tls_array, - sizeof(thread-tls_array), - CKPT_HDR_THREAD_TLS); - - /* IGNORE RESTART BLOCKS FOR NOW ... 
*/ - - return ret; + return ckpt_write_obj_type(ctx, thread-tls_array, + sizeof(thread-tls_array), + CKPT_HDR_THREAD_TLS); } #ifndef CONFIG_X86_64 diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index 086f2d9..3999d80 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -23,6 +23,7 @@ #include linux/mount.h #include linux/utsname.h #include linux/magic.h +#include linux/hrtimer.h #include linux/checkpoint.h #include linux/checkpoint_hdr.h diff --git a/checkpoint/process.c b/checkpoint/process.c index 3ce82cb..876be3e 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -12,6 +12,9 @@ #define CKPT_DFLAG CKPT_DSYS #include linux/sched.h +#include linux/posix-timers.h +#include linux/futex.h +#include linux/poll.h #include linux/checkpoint.h #include linux/checkpoint_hdr.h @@ -80,6 +83,116 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t) return ret; } +/* dump the task_struct of a given task */ +int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr_restart_block *h; + struct restart_block *restart_block; + long (*fn)(struct restart_block *); + s64 base, expire = 0; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h),
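As a reading aid for the two time models described in the commit message above, here is a minimal userspace sketch. It is not code from the patch: the struct and function names (saved_timeout, save_timeout, rearm_relative, rearm_absolute) are made up for illustration, and all times are plain nanosecond counters on one assumed clock.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical record, NOT the layout of struct ckpt_hdr_restart_block:
 * it only holds what the commit message says is saved.
 */
struct saved_timeout {
	int64_t checkpoint_begin_ns;	/* when the checkpoint began */
	int64_t remaining_ns;		/* time left, measured from that point */
};

/* At checkpoint: turn an absolute expiry into a remaining time. */
static struct saved_timeout save_timeout(int64_t checkpoint_begin_ns,
					 int64_t expire_ns)
{
	struct saved_timeout s;

	s.checkpoint_begin_ns = checkpoint_begin_ns;
	/* an already-expired timeout is clamped to zero remaining time */
	s.remaining_ns = expire_ns > checkpoint_begin_ns ?
			 expire_ns - checkpoint_begin_ns : 0;
	return s;
}

/* Relative model: the remaining time is counted from the restart. */
static int64_t rearm_relative(const struct saved_timeout *s,
			      int64_t restart_begin_ns)
{
	return restart_begin_ns + s->remaining_ns;
}

/* Absolute model: the original expiry is reconstructed unchanged. */
static int64_t rearm_absolute(const struct saved_timeout *s)
{
	assert(s->remaining_ns >= 0);
	return s->checkpoint_begin_ns + s->remaining_ns;
}
```

Because both the checkpoint start time and the remaining time are kept, either model can be chosen at restart, which is the point the commit message makes about deferring the choice.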
[Devel] [RFC v16][PATCH 36/43] c/r: support share-memory sysv-ipc
Checkpoint of sysvipc shared memory is performed in two steps: first, the entire ipc namespace is dumped as a whole by iterating through all shm objects and dumping the contents of each one. The shmem inode is registered in the objhash. Second, for each vma that refers to ipc shared memory we find the inode in the objhash, and save the objref. (If we find a new inode, that indicates that the ipc namespace is not entirely frozen and someone must have manipulated it since step 1). Handling of shm objects that have been deleted (via IPC_RMID) is left to a later patch in this series. Signed-off-by: Oren Laadan or...@cs.columbia.edu --- checkpoint/memory.c | 28 - checkpoint/sys.c | 10 ++ include/linux/checkpoint.h |3 + include/linux/checkpoint_hdr.h | 19 +++- include/linux/checkpoint_types.h |1 + include/linux/shm.h |9 ++ ipc/Makefile |2 +- ipc/checkpoint.c |4 +- ipc/checkpoint_shm.c | 261 ++ ipc/shm.c| 73 +++ ipc/util.h |4 +- 11 files changed, 406 insertions(+), 8 deletions(-) diff --git a/checkpoint/memory.c b/checkpoint/memory.c index c163b76..997359f 100644 --- a/checkpoint/memory.c +++ b/checkpoint/memory.c @@ -20,6 +20,7 @@ #include linux/mman.h #include linux/pagemap.h #include linux/mm_types.h +#include linux/shm.h #include linux/proc_fs.h #include linux/swap.h #include linux/checkpoint.h @@ -459,9 +460,9 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total) * virtual addresses into ctx-pgarr_list page-array chain. Then dump * the addresses, followed by the page contents. 
*/ -static int checkpoint_memory_contents(struct ckpt_ctx *ctx, - struct vm_area_struct *vma, - struct inode *inode) +int checkpoint_memory_contents(struct ckpt_ctx *ctx, + struct vm_area_struct *vma, + struct inode *inode) { struct ckpt_hdr_pgarr *h; unsigned long addr, end; @@ -1022,6 +1023,13 @@ static int anon_private_restore(struct ckpt_ctx *ctx, return private_vma_restore(ctx, mm, NULL, h); } +static int bad_vma_restore(struct ckpt_ctx *ctx, + struct mm_struct *mm, + struct ckpt_hdr_vma *h) +{ + return -EINVAL; +} + /* callbacks to restore vma per its type: */ struct restore_vma_ops { char *vma_name; @@ -1074,6 +1082,20 @@ static struct restore_vma_ops restore_vma_ops[] = { .vma_type = CKPT_VMA_SHM_FILE, .restore = filemap_restore, }, + /* sysvipc shared */ + { + .vma_name = IPC SHARED, + .vma_type = CKPT_VMA_SHM_IPC, + /* ipc inode itself is restore by restore_ipc_ns()... */ + .restore = bad_vma_restore, + + }, + /* sysvipc shared (skip) */ + { + .vma_name = IPC SHARED (skip), + .vma_type = CKPT_VMA_SHM_IPC_SKIP, + .restore = ipcshm_restore, + }, }; /** diff --git a/checkpoint/sys.c b/checkpoint/sys.c index f6cf0ac..ac3bf7c 100644 --- a/checkpoint/sys.c +++ b/checkpoint/sys.c @@ -20,6 +20,7 @@ #include linux/uaccess.h #include linux/capability.h #include linux/checkpoint.h +#include linux/deferqueue.h /* * ckpt_unpriv_allowed - sysctl controlled, do not allow checkpoints or @@ -188,8 +189,17 @@ static void task_arr_free(struct ckpt_ctx *ctx) static void ckpt_ctx_free(struct ckpt_ctx *ctx) { + int ret; + BUG_ON(atomic_read(ctx-refcount)); + if (ctx-deferqueue) { + ret = deferqueue_run(ctx-deferqueue); + if (ret != 0) + pr_warning(c/r: deferqueue had %d entries\n, ret); + deferqueue_destroy(ctx-deferqueue); + } + if (ctx-file) fput(ctx-file); diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h index d5498bc..064dd25 100644 --- a/include/linux/checkpoint.h +++ b/include/linux/checkpoint.h @@ -145,6 +145,9 @@ extern unsigned long 
generic_vma_restore(struct mm_struct *mm, extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm, struct file *file, struct ckpt_hdr_vma *h); +extern int checkpoint_memory_contents(struct ckpt_ctx *ctx, + struct vm_area_struct *vma, + struct inode *inode); extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode); diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 406b5d6..f7e331d 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -278,7 +278,9 @@ enum vma_type { CKPT_VMA_SHM_ANON, /*
[Devel] [RFC v16][PATCH 07/43] c/r: infrastructure for shared objects
The state of shared objects is saved once. On the first encounter, the state is dumped and the object is assigned a unique identifier (objref) and also stored in a hash table (indexed by its physical kernel address). From then on the object will be found in the hash and only its identifier is saved.

On restart the identifier is looked up in the hash table; if not found then the state is read, the object is created, and added to the hash table (this time indexed by its identifier). Otherwise, the object in the hash table is used.

The hash is one-way: objects added to it are never deleted until the hash is discarded. The hash is discarded at the end of checkpoint or restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching the object's type, and maintains this reference during its lifetime. Therefore, it is always safe to use an object that is stored in the hash.

Changelog[v16]:
  - Introduce ckpt_obj_lookup() to find an object by its ptr
Changelog[v14]:
  - Introduce 'struct ckpt_obj_ops' to better modularize shared objs.
  - Replace long 'switch' statements with table lookups and callbacks.
  - Introduce checkpoint_obj() and restart_obj() helpers
  - Shared objects now dumped/saved right before they are referenced
  - Cleanup interface of shared objects
Changelog[v13]:
  - Use hash_long() with 'unsigned long' cast to support 64bit archs (Nathan Lynch n...@pobox.com)
Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime
Changelog[v4]:
  - Fix calculation of hash table size
Changelog[v3]:
  - Use standard hlist_...
for hash table Signed-off-by: Oren Laadan or...@cs.columbia.edu --- checkpoint/Makefile |1 + checkpoint/objhash.c | 397 ++ checkpoint/restart.c | 46 + checkpoint/sys.c |7 + include/linux/checkpoint.h | 15 ++ include/linux/checkpoint_hdr.h | 14 ++ include/linux/checkpoint_types.h |2 + 7 files changed, 482 insertions(+), 0 deletions(-) diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 99364cc..5aa6a75 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -4,6 +4,7 @@ obj-$(CONFIG_CHECKPOINT) += \ sys.o \ + objhash.o \ checkpoint.o \ restart.o \ process.o diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c new file mode 100644 index 000..82b4618 --- /dev/null +++ b/checkpoint/objhash.c @@ -0,0 +1,397 @@ +/* + * Checkpoint-restart - object hash infrastructure to manage shared objects + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DOBJ + +#include linux/kernel.h +#include linux/hash.h +#include linux/checkpoint.h +#include linux/checkpoint_hdr.h + +struct ckpt_obj; +struct ckpt_obj_ops; + +/* object operations */ +struct ckpt_obj_ops { + char *obj_name; + enum obj_type obj_type; + void (*ref_drop)(void *ptr); + int (*ref_grab)(void *ptr); + int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr); + void *(*restore)(struct ckpt_ctx *ctx); +}; + +struct ckpt_obj { + int objref; + void *ptr; + struct ckpt_obj_ops *ops; + struct hlist_node hash; +}; + +struct ckpt_obj_hash { + struct hlist_head *head; + int next_free_objref; +}; + +/* helper grab/drop functions: */ + +static void obj_no_drop(void *ptr) +{ + return; +} + +static int obj_no_grab(void *ptr) +{ + return 0; +} + +static struct ckpt_obj_ops ckpt_obj_ops[] = { + /* ignored object */ + { + .obj_name = IGNORED, + .obj_type = CKPT_OBJ_IGNORE, + .ref_drop = obj_no_drop, + .ref_grab = obj_no_grab, + }, +}; + + +#define CKPT_OBJ_HASH_NBITS 10 +#define CKPT_OBJ_HASH_TOTAL (1UL CKPT_OBJ_HASH_NBITS) + +static void obj_hash_clear(struct ckpt_obj_hash *obj_hash) +{ + struct hlist_head *h = obj_hash-head; + struct hlist_node *n, *t; + struct ckpt_obj *obj; + int i; + + for (i = 0; i CKPT_OBJ_HASH_TOTAL; i++) { + hlist_for_each_entry_safe(obj, n, t, h[i], hash) { + obj-ops-ref_drop(obj-ptr); + kfree(obj); + } + } +} + +void ckpt_obj_hash_free(struct ckpt_ctx *ctx) +{ + struct ckpt_obj_hash *obj_hash = ctx-obj_hash; + + if (obj_hash) { + obj_hash_clear(obj_hash); + kfree(obj_hash-head); + kfree(ctx-obj_hash); + ctx-obj_hash = NULL; + } +} + +int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx) +{ + struct ckpt_obj_hash *obj_hash; + struct hlist_head *head; + + obj_hash =
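The "dump once, then refer by objref" rule from the commit message can be sketched in userspace as follows. This is only an illustration under stated assumptions: it uses a small fixed array with a linear scan instead of the kernel hlist hash, takes no references on the objects, and the names obj_table, obj_table_init and obj_lookup_add are hypothetical rather than the patch's API.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_MAX 64	/* arbitrary capacity for the sketch */

/* Hypothetical stand-in for the object hash: pointer -> objref pairs. */
struct obj_table {
	const void *ptr[OBJ_MAX];
	int objref[OBJ_MAX];
	int count;
	int next_free_objref;
};

static void obj_table_init(struct obj_table *t)
{
	t->count = 0;
	t->next_free_objref = 1;	/* assumed: objrefs start at 1 */
}

/*
 * Look up @ptr; on the first encounter add it and assign a new objref.
 * Sets *first to 1 when the caller must dump the full object state,
 * 0 when only the identifier needs to be written.
 * Returns the objref, or -1 if the table is full.
 */
static int obj_lookup_add(struct obj_table *t, const void *ptr, int *first)
{
	int i;

	assert(t != NULL && ptr != NULL && first != NULL);

	for (i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;
			return t->objref[i];
		}
	}
	if (t->count == OBJ_MAX)
		return -1;

	t->ptr[t->count] = ptr;
	t->objref[t->count] = t->next_free_objref++;
	*first = 1;
	return t->objref[t->count++];
}
```

On restart the same idea runs in the other direction: the table would be keyed by objref, and a miss means the object state follows in the image and must be read and created.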
[Devel] [RFC v16][PATCH 09/43] c/r: dump open file descriptors
Dump the file table with 'struct ckpt_hdr_file_table', followed by all open file descriptors. Because the 'struct file' corresponding to an fd can be shared, it is assigned an objref and registered in the object hash. A reference to the 'file *' is kept for as long as it lives in the hash (the hash is only cleaned up at the end of the checkpoint).

Also provide generic_checkpoint_file() and generic_restore_file(), which are suitable for normal files and directories. Unlinked files and directories are not supported yet.

Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects
Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename: ckpt_write_files() => checkpoint_fd_table()
  - Rename: ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put() (even though it's not really needed)

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/Makefile              |    3 +-
 checkpoint/checkpoint.c          |   25 +++
 checkpoint/files.c               |  311 ++
 checkpoint/objhash.c             |   40 +
 checkpoint/process.c             |   28
 checkpoint/sys.c                 |    1 +
 include/linux/checkpoint.h       |   14 ++-
 include/linux/checkpoint_hdr.h   |   49 ++
 include/linux/checkpoint_types.h |    8 +
 include/linux/fs.h               |    4 +
 10 files
changed, 481 insertions(+), 2 deletions(-) diff --git a/checkpoint/Makefile b/checkpoint/Makefile index 5aa6a75..1d0c058 100644 --- a/checkpoint/Makefile +++ b/checkpoint/Makefile @@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \ objhash.o \ checkpoint.o \ restart.o \ - process.o + process.o \ + files.o diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index 409c78b..a346b7e 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -15,6 +15,7 @@ #include linux/time.h #include linux/fs.h #include linux/file.h +#include linux/fs_struct.h #include linux/dcache.h #include linux/mount.h #include linux/utsname.h @@ -244,10 +245,34 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx) return ret; } +/* setup checkpoint-specific parts of ctx */ +static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid) +{ + struct fs_struct *fs; + + ctx-root_pid = pid; + + /* +* assume checkpointer is in container's root vfs +* FIXME: this works for now, but will change with real containers +*/ + + fs = current-fs; + read_lock(fs-lock); + ctx-fs_mnt = fs-root; + path_get(ctx-fs_mnt); + read_unlock(fs-lock); + + return 0; +} + int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid) { int ret; + ret = init_checkpoint_ctx(ctx, pid); + if (ret 0) + goto out; ret = checkpoint_write_header(ctx); if (ret 0) goto out; diff --git a/checkpoint/files.c b/checkpoint/files.c new file mode 100644 index 000..d10dfb6 --- /dev/null +++ b/checkpoint/files.c @@ -0,0 +1,311 @@ +/* + * Checkpoint file descriptors + * + * Copyright (C) 2008-2009 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +/* default debug level for output */ +#define CKPT_DFLAG CKPT_DFILE + +#include linux/kernel.h +#include linux/module.h +#include linux/sched.h +#include linux/file.h +#include linux/fdtable.h +#include linux/checkpoint.h +#include linux/checkpoint_hdr.h + + +/** + * Checkpoint + */ + +/** + * fill_fname - return pathname of a given file + * @path: path name + * @root: relative root + * @buf: buffer for pathname + * @len: buffer length (in) and pathname length (out) + */ +static char *fill_fname(struct path *path, struct path *root, + char *buf, int *len) +{ + struct path tmp = *root; + char *fname; + + BUG_ON(!buf); + spin_lock(dcache_lock); + fname =
[Devel] [RFC v16][PATCH 15/43] c/r: restore memory address space (private memory)
Restoring the memory address space begins with nuking the existing one of the current process, and then reading the vma state and contents. Call do_mmap_pgoffset() for each vma and then read in the data.

Changelog[v16]:
  - Restore mm->exe_file
Changelog[v14]:
  - Introduce per vma-type restore() function
  - Merge restart code into same file as checkpoint (memory.c)
  - Compare saved 'vdso' field of mm_context with current value
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'
  - Revert change to pr_debug(), back to ckpt_debug()
Changelog[v13]:
  - Avoid access to hh->vma_type after the header is freed
  - Test for no vma's in exit_mmap() before calling unmap_vma() (or it may crash if restart fails after having removed all vma's)
Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put() (even though it's not really needed)
Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of vaddrs, pages instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them, instead of reading directly to user space; got rid of mprotect_fixup()
Changelog[v4]:
  - Use standard list_...
for ckpt_pgarr Signed-off-by: Oren Laadan or...@cs.columbia.edu --- arch/x86/include/asm/ldt.h |7 + arch/x86/mm/checkpoint.c | 64 ++ checkpoint/memory.c | 463 ++ checkpoint/objhash.c |1 + checkpoint/process.c |3 + fs/exec.c|2 +- include/linux/checkpoint.h |7 + include/linux/checkpoint_hdr.h |2 +- include/linux/checkpoint_types.h |1 + include/linux/mm.h | 12 + mm/filemap.c | 19 ++ mm/mmap.c| 23 ++- 12 files changed, 601 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/ldt.h b/arch/x86/include/asm/ldt.h index 46727eb..f2845f9 100644 --- a/arch/x86/include/asm/ldt.h +++ b/arch/x86/include/asm/ldt.h @@ -37,4 +37,11 @@ struct user_desc { #define MODIFY_LDT_CONTENTS_CODE 2 #endif /* !__ASSEMBLY__ */ + +#ifdef __KERNEL__ +#include linux/linkage.h +asmlinkage int sys_modify_ldt(int func, void __user *ptr, + unsigned long bytecount); +#endif + #endif /* _ASM_X86_LDT_H */ diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c index dc4fbb4..c781416 100644 --- a/arch/x86/mm/checkpoint.c +++ b/arch/x86/mm/checkpoint.c @@ -13,6 +13,7 @@ #include asm/desc.h #include asm/i387.h +#include asm/elf.h #include linux/checkpoint_types.h #include asm/checkpoint_hdr.h @@ -461,3 +462,66 @@ int restore_read_header_arch(struct ckpt_ctx *ctx) ckpt_hdr_put(ctx, h); return ret; } + +int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm) +{ + struct ckpt_hdr_mm_context *h; + unsigned int n; + int ret; + + h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT); + if (IS_ERR(h)) + return PTR_ERR(h); + + ckpt_debug(nldt %d vdso %#lx (%p)\n, +h-nldt, (unsigned long) h-vdso, mm-context.vdso); + + ret = -EINVAL; + if (h-vdso != (unsigned long) mm-context.vdso) + goto out; + if (h-ldt_entry_size != LDT_ENTRY_SIZE) + goto out; + + ret = _ckpt_read_obj_type(ctx, NULL, + h-nldt * LDT_ENTRY_SIZE, + CKPT_HDR_MM_CONTEXT_LDT); + if (ret 0) + goto out; + + /* +* to utilize the syscall modify_ldt() we first convert the data +* in the checkpoint image from 'struct 
desc_struct' to 'struct +* user_desc' with reverse logic of include/asm/desc.h:fill_ldt() +*/ + for (n = 0; n h-nldt; n++) { + struct user_desc info; + struct desc_struct desc; + mm_segment_t old_fs; + + ret = ckpt_kread(ctx, desc, LDT_ENTRY_SIZE); + if (ret 0) + break; + + info.entry_number = n; + info.base_addr = desc.base0 | (desc.base1 16); + info.limit = desc.limit0; + info.seg_32bit = desc.d; + info.contents = desc.type 2; + info.read_exec_only = (desc.type 1) ^ 1; + info.limit_in_pages = desc.g; + info.seg_not_present = desc.p ^ 1; + info.useable = desc.avl; + + old_fs = get_fs(); + set_fs(get_ds()); + ret = sys_modify_ldt(1, (struct user_desc __user *) info, +
[Devel] [RFC v16][PATCH 14/43] c/r: dump memory address space (private memory)
For each vma, there is a 'struct ckpt_vma'; then come the actual contents, in one or more chunks: each chunk begins with a header that specifies how many pages it holds, then the virtual addresses of all the dumped pages in that chunk, followed by the actual contents of all dumped pages. A header with zero number of pages marks the end of the contents. Then comes the next vma and so on.

To checkpoint a vma, call the ops->checkpoint() method of that vma. Normally the per-vma function will invoke generic_vma_checkpoint() which first writes the vma description, followed by the specific logic to dump the contents of the pages.

Currently for private mapped memory we save the pathname of the file that is mapped (restart will use it to re-open it and then map it). Later we change that to reference a file object.

Changelog[v16]:
  - Precede vaddrs/pages with a buffer header
  - Checkpoint mm->exe_file
  - Handle shared task->mm
Changelog[v14]:
  - Modify the ops->checkpoint method to be much more powerful
  - Improve support for VDSO (with special_mapping checkpoint callback)
  - Save new field 'vdso' in mm_context
  - Revert change to pr_debug(), back to ckpt_debug()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'
Changelog[v13]:
  - pgprot_t is an abstract type; use the proper accessor (fix for 64-bit powerpc) (Nathan Lynch n...@pobox.com)
Changelog[v12]:
  - Hide pgarr management inside ckpt_private_vma_fill_pgarr()
  - Fix management of pgarr chain reset and alloc/expand: keep empty pgarr in a pool chain
  - Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory
Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in ckpt_fill_name()
Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem namespace boundary).
for now ckpt_fill_fname() fails the checkpoint. Changelog[v7]: - Fix argument given to kunmap_atomic() in memory dump/restore Changelog[v6]: - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put() (even though it's not really needed) Changelog[v5]: - Improve memory dump code (following Dave Hansen's comments) - Change dump format (and code) to allow chunks of vaddrs, pages instead of one long list of each - Fix use of follow_page() to avoid faulting in non-present pages Changelog[v4]: - Use standard list_... for ckpt_pgarr Signed-off-by: Oren Laadan or...@cs.columbia.edu --- arch/x86/include/asm/checkpoint_hdr.h |8 + arch/x86/mm/checkpoint.c | 32 ++ checkpoint/Makefile |3 +- checkpoint/memory.c | 633 + checkpoint/objhash.c | 19 + checkpoint/process.c | 10 + checkpoint/sys.c |4 + include/linux/checkpoint.h| 26 ++- include/linux/checkpoint_hdr.h| 47 +++ include/linux/checkpoint_types.h |3 + mm/filemap.c | 28 ++ mm/mmap.c | 31 ++ 12 files changed, 842 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h index 362b499..cf90170 100644 --- a/arch/x86/include/asm/checkpoint_hdr.h +++ b/arch/x86/include/asm/checkpoint_hdr.h @@ -43,6 +43,7 @@ enum { CKPT_HDR_THREAD_TLS = 201, CKPT_HDR_CPU_FPU, + CKPT_HDR_MM_CONTEXT_LDT, }; struct ckpt_hdr_header_arch { @@ -107,4 +108,11 @@ struct ckpt_hdr_cpu { /* thread_xstate contents follow (if used_math) */ } __attribute__((aligned(8))); +struct ckpt_hdr_mm_context { + struct ckpt_hdr h; + __u64 vdso; + __u32 ldt_entry_size; + __u32 nldt; +} __attribute__((aligned(8))); + #endif /* __ASM_X86_CKPT_HDR__H */ diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c index f54fe80..dc4fbb4 100644 --- a/arch/x86/mm/checkpoint.c +++ b/arch/x86/mm/checkpoint.c @@ -14,6 +14,7 @@ #include asm/desc.h #include asm/i387.h +#include linux/checkpoint_types.h #include asm/checkpoint_hdr.h #include linux/checkpoint.h @@ -239,6 +240,37 @@ int 
checkpoint_write_header_arch(struct ckpt_ctx *ctx) return ret; } +/* dump the mm-context state */ +int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm) +{ + struct ckpt_hdr_mm_context *h; + int ret; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT); + if (!h) + return -ENOMEM; + + mutex_lock(mm-context.lock); + + h-vdso = (unsigned long) mm-context.vdso; + h-ldt_entry_size = LDT_ENTRY_SIZE; + h-nldt = mm-context.size; + + ckpt_debug(nldt %d vdso %#llx\n, h-nldt, h-vdso); + + ret = ckpt_write_obj(ctx, h-h); +
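The chunked layout described in this patch's commit message (header with a page count, then the virtual addresses, then the page contents, with a zero-count header as terminator) can be sketched in userspace like this. The header struct, the tiny DEMO_PAGE_SIZE and the function names are assumptions made for the example; the real image uses 'struct ckpt_hdr_pgarr' and kernel pages.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16	/* tiny "page" so the example stays small */

/* Hypothetical chunk header: only the page count from the description. */
struct chunk_hdr {
	uint64_t nr_pages;
};

/*
 * Append one chunk at @out: header, then @nr vaddrs, then @nr pages.
 * With @nr == 0 this writes just the terminating header.
 * Returns the number of bytes written.
 */
static size_t write_chunk(unsigned char *out, const uint64_t *vaddrs,
			  const unsigned char *pages, uint64_t nr)
{
	struct chunk_hdr h = { nr };
	size_t off = 0;

	memcpy(out + off, &h, sizeof(h));
	off += sizeof(h);
	if (nr) {
		memcpy(out + off, vaddrs, nr * sizeof(*vaddrs));
		off += nr * sizeof(*vaddrs);
		memcpy(out + off, pages, nr * DEMO_PAGE_SIZE);
		off += nr * DEMO_PAGE_SIZE;
	}
	return off;
}

/* Walk the chunks of one vma until the zero-count header is seen. */
static uint64_t count_pages(const unsigned char *in)
{
	uint64_t total = 0;
	size_t off = 0;

	for (;;) {
		struct chunk_hdr h;

		memcpy(&h, in + off, sizeof(h));
		off += sizeof(h);
		if (h.nr_pages == 0)
			break;
		/* skip this chunk's vaddrs and page contents */
		off += h.nr_pages * (sizeof(uint64_t) + DEMO_PAGE_SIZE);
		total += h.nr_pages;
	}
	assert(off >= sizeof(struct chunk_hdr));
	return total;
}
```

Grouping addresses ahead of contents lets a reader learn where every page of a chunk belongs before it touches any page data, which fits the changelog note about chunks of vaddrs and pages replacing one long list of each.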
[Devel] [RFC v16][PATCH 42/43] c/r: add CKPT_COPY() macro
From: Dan Smith da...@us.ibm.com

As suggested by Dave[1], this provides us a way to make the copy-in and copy-out processes symmetric. CKPT_COPY_ARRAY() provides us a way to do the same thing but for arrays. It's not critical, but it helps us unify the checkpoint and restart paths for some things.

Changelog:
  Mar 04:
    . Removed semicolons
    . Added build-time check for __must_be_array in CKPT_COPY_ARRAY
  Feb 27:
    . Changed CKPT_COPY() to use assignment, eliminating the need for the CKPT_COPY_BIT() macro
    . Add CKPT_COPY_ARRAY() macro to help copying register arrays, etc
    . Move the macro definitions inside the CR #ifdef
  Feb 25:
    . Changed WARN_ON() to BUILD_BUG_ON()

Signed-off-by: Dan Smith da...@us.ibm.com
Signed-off-by: Oren Laadan or...@cs.columbia.edu

1: https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html (all the way at the bottom)
---
 include/linux/checkpoint.h |   29 +
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 064dd25..669e90c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -157,6 +157,34 @@ extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 	 VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
 
+/* useful macros to copy fields and buffers to/from ckpt_hdr_xxx structures */
+#define CKPT_CPT 1
+#define CKPT_RST 2
+
+#define CKPT_COPY(op, SAVE, LIVE)				\
+	do {							\
+		if (op == CKPT_CPT)				\
+			SAVE = LIVE;				\
+		else						\
+			LIVE = SAVE;				\
+	} while (0)
+
+/*
+ * Copy @count items from @LIVE to @SAVE if op is CKPT_CPT (otherwise,
+ * copy in the reverse direction)
+ */
+#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)			\
+	do {							\
+		(void)__must_be_array(SAVE);			\
+		(void)__must_be_array(LIVE);			\
+		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));	\
+		if (op == CKPT_CPT)				\
+			memcpy(SAVE, LIVE, count * sizeof(*SAVE)); \
+		else						\
+			memcpy(LIVE, SAVE, count * sizeof(*SAVE)); \
+	} while (0)
+
+
 /* debugging flags */
 #define CKPT_DBASE	0x1		/* anything */
 #define CKPT_DSYS	0x2		/* generic (system) */
@@ -189,6 +217,7 @@ extern unsigned long ckpt_debug_level;
  * CKPT_DBASE is the base flags, doesn't change
  * CKPT_DFLAG is to be redfined in each source file
  */
+
 #define ckpt_debug(fmt, args...)	\
 	_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
-- 
1.6.0.4
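To show what the symmetry buys, here is a hedged userspace adaptation of the two macros with one helper that serves both directions. It is a sketch, not the kernel code: __must_be_array() and BUILD_BUG_ON() are kernel-only, so the size check is approximated with C11 _Static_assert, the macro arguments are parenthesized, and the structs live_regs/saved_regs and the helper copy_regs() are invented for the example.

```c
#include <assert.h>
#include <string.h>

#define CKPT_CPT 1	/* checkpoint: LIVE -> SAVE */
#define CKPT_RST 2	/* restart:    SAVE -> LIVE */

/* Userspace adaptation of the patch's CKPT_COPY(). */
#define CKPT_COPY(op, SAVE, LIVE)				\
	do {							\
		if ((op) == CKPT_CPT)				\
			SAVE = LIVE;				\
		else						\
			LIVE = SAVE;				\
	} while (0)

/* Userspace adaptation of CKPT_COPY_ARRAY(); kernel checks replaced. */
#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)				\
	do {								\
		_Static_assert(sizeof(*(SAVE)) == sizeof(*(LIVE)),	\
			       "element sizes must match");		\
		if ((op) == CKPT_CPT)					\
			memcpy(SAVE, LIVE, (count) * sizeof(*(SAVE)));	\
		else							\
			memcpy(LIVE, SAVE, (count) * sizeof(*(SAVE)));	\
	} while (0)

/* Hypothetical live state and its image counterpart. */
struct live_regs {
	unsigned long ip;
	unsigned long gpr[4];
};

struct saved_regs {
	unsigned long ip;
	unsigned long gpr[4];
};

/* One field list, used for both checkpoint and restart. */
static void copy_regs(int op, struct saved_regs *h, struct live_regs *regs)
{
	assert(op == CKPT_CPT || op == CKPT_RST);
	CKPT_COPY(op, h->ip, regs->ip);
	CKPT_COPY_ARRAY(op, h->gpr, regs->gpr, 4);
}
```

With a single list of fields, a field added for checkpoint cannot be forgotten on the restart side, which is the unification the commit message refers to.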
[Devel] [RFC v16][PATCH 28/43] c/r: make ckpt_may_checkpoint_task() check each namespace individually
From: Dan Smith da...@us.ibm.com Signed-off-by: Dan Smith da...@us.ibm.com Signed-off-by: Oren Laadan or...@cs.columbia.edu --- checkpoint/checkpoint.c| 20 ++-- checkpoint/objhash.c | 28 +++ checkpoint/process.c | 101 include/linux/checkpoint.h |4 ++ include/linux/checkpoint_hdr.h |8 +++ 5 files changed, 157 insertions(+), 4 deletions(-) diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c index b70adf4..e66f82b 100644 --- a/checkpoint/checkpoint.c +++ b/checkpoint/checkpoint.c @@ -267,6 +267,8 @@ static int checkpoint_all_tasks(struct ckpt_ctx *ctx) static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) { struct task_struct *root = ctx-root_task; + struct nsproxy *nsproxy; + int ret = 0; ckpt_debug(check %d\n, task_pid_nr_ns(t, ctx-root_nsproxy-pid_ns)); @@ -306,11 +308,21 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t) return -EINVAL; } - /* FIX: change this when namespaces are added */ - if (task_nsproxy(t) != ctx-root_nsproxy) - return -EPERM; + rcu_read_lock(); + nsproxy = task_nsproxy(t); + if (nsproxy-uts_ns != ctx-root_nsproxy-uts_ns) + ret = -EPERM; + if (nsproxy-ipc_ns != ctx-root_nsproxy-ipc_ns) + ret = -EPERM; + if (nsproxy-mnt_ns != ctx-root_nsproxy-mnt_ns) + ret = -EPERM; + if (nsproxy-pid_ns != ctx-root_nsproxy-pid_ns) + ret = -EPERM; + if (nsproxy-net_ns != ctx-root_nsproxy-net_ns) + ret = -EPERM; + rcu_read_unlock(); - return 0; + return ret; } #define CKPT_HDR_PIDS_CHUNK256 diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c index e481911..56553ae 100644 --- a/checkpoint/objhash.c +++ b/checkpoint/objhash.c @@ -127,6 +127,22 @@ static int obj_mm_users(void *ptr) return atomic_read(((struct mm_struct *) ptr)-mm_users); } +static int obj_ns_grab(void *ptr) +{ + get_nsproxy((struct nsproxy *) ptr); + return 0; +} + +static void obj_ns_drop(void *ptr) +{ + put_nsproxy((struct nsproxy *) ptr); +} + +static int obj_ns_users(void *ptr) +{ + return atomic_read(((struct nsproxy *) 
ptr)-count); +} + static struct ckpt_obj_ops ckpt_obj_ops[] = { /* ignored object */ { @@ -174,6 +190,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = { .checkpoint = checkpoint_mm, .restore = restore_mm, }, + /* ns object */ + { + .obj_name = NSPROXY, + .obj_type = CKPT_OBJ_NS, + .ref_drop = obj_ns_drop, + .ref_grab = obj_ns_grab, + .ref_users = obj_ns_users, + .checkpoint = checkpoint_ns, + .restore = restore_ns, + }, }; @@ -396,6 +422,8 @@ int ckpt_obj_contained(struct ckpt_ctx *ctx) /* account for ctx-file reference (if in the table already) */ ckpt_obj_users_inc(ctx, ctx-file, 1); + /* account for ctx-root_nsproxy reference (if in the table already) */ + ckpt_obj_users_inc(ctx, ctx-root_nsproxy, 1); hlist_for_each_entry(obj, node, ctx-obj_hash-list, next) { if (!obj-ops-ref_users) diff --git a/checkpoint/process.c b/checkpoint/process.c index 876be3e..fbe0d16 100644 --- a/checkpoint/process.c +++ b/checkpoint/process.c @@ -12,6 +12,7 @@ #define CKPT_DFLAG CKPT_DSYS #include linux/sched.h +#include linux/nsproxy.h #include linux/posix-timers.h #include linux/futex.h #include linux/poll.h @@ -49,6 +50,45 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t) return ckpt_write_string(ctx, t-comm, TASK_COMM_LEN); } + +static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy) +{ + return 0; +} + +int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr) +{ + return do_checkpoint_ns(ctx, (struct nsproxy *) ptr); +} + +static int checkpoint_task_ns(struct ckpt_ctx *ctx, struct task_struct *t) +{ + struct ckpt_hdr_task_ns *h; + struct nsproxy *nsproxy; + int ns_objref; + int ret; + + rcu_read_lock(); + nsproxy = task_nsproxy(t); + get_nsproxy(nsproxy); + rcu_read_unlock(); + + ns_objref = checkpoint_obj(ctx, nsproxy, CKPT_OBJ_NS); + put_nsproxy(nsproxy); + + ckpt_debug(nsproxy: objref %d\n, ns_objref); + if (ns_objref 0) + return ns_objref; + + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_NS); + if (!h) + return -ENOMEM; + 
h-ns_objref = ns_objref; + ret = ckpt_write_obj(ctx, h-h); + ckpt_hdr_put(ctx, h); + return ret; +} + static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t) { struct ckpt_hdr_task_objs *h; @@ -56,6 +96,18
[Devel] [RFC v16][PATCH 30/43] c/r: stub implementation for IPC namespace
From: Dan Smith da...@us.ibm.com

Changes:
  - Update to match UTS changes

Signed-off-by: Dan Smith da...@us.ibm.com
Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/checkpoint.c        |    2 --
 checkpoint/objhash.c           |   28
 checkpoint/process.c           |   24 ++--
 include/linux/checkpoint.h     |   15 +++
 include/linux/checkpoint_hdr.h |    3 +++
 5 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 904f19b..afc7300 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -310,8 +310,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 	rcu_read_lock();
 	nsproxy = task_nsproxy(t);
-	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
-		ret = -EPERM;
 	if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns)
 		ret = -EPERM;
 	if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 8b7adc6..045a920 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -15,6 +15,8 @@
 #include <linux/hash.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -159,6 +161,22 @@ static int obj_uts_ns_users(void *ptr)
 	return atomic_read(&((struct uts_namespace *) ptr)->kref.refcount);
 }
+static int obj_ipc_ns_grab(void *ptr)
+{
+	get_ipc_ns((struct ipc_namespace *) ptr);
+	return 0;
+}
+
+static void obj_ipc_ns_drop(void *ptr)
+{
+	put_ipc_ns((struct ipc_namespace *) ptr);
+}
+
+static int obj_ipc_ns_users(void *ptr)
+{
+	return atomic_read(&((struct ipc_namespace *) ptr)->count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -226,6 +244,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_bad,
 		.restore = restore_bad,
 	},
+	/* ipc_ns object */
+	{
+		.obj_name = "IPC_NS",
+		.obj_type = CKPT_OBJ_IPC_NS,
+		.ref_drop = obj_ipc_ns_drop,
+		.ref_grab = obj_ipc_ns_grab,
+		.ref_users = obj_ipc_ns_users,
+		.checkpoint = checkpoint_bad,
+		.restore = restore_bad,
+	},
 };
diff --git a/checkpoint/process.c b/checkpoint/process.c
index a827987..eff3d76 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -89,6 +89,7 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	struct ckpt_hdr_ns *h;
 	int ns_flags = 0;
 	int uts_objref;
+	int ipc_objref;
 	int first, ret;
 	uts_objref = ckpt_obj_lookup_add(ctx, nsproxy->uts_ns,
@@ -98,12 +99,20 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (first)
 		ns_flags |= CLONE_NEWUTS;
+	ipc_objref = ckpt_obj_lookup_add(ctx, nsproxy->ipc_ns,
+					 CKPT_OBJ_IPC_NS, &first);
+	if (ipc_objref < 0)
+		return ipc_objref;
+	if (first)
+		ns_flags |= CLONE_NEWIPC;
+
 	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NS);
 	if (!h)
 		return -ENOMEM;
 	h->flags = ns_flags;
 	h->uts_objref = uts_objref;
+	h->ipc_objref = ipc_objref;
 	ret = ckpt_write_obj(ctx, &h->h);
 	ckpt_hdr_put(ctx, h);
@@ -112,6 +121,10 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 	if (ns_flags & CLONE_NEWUTS)
 		ret = checkpoint_uts_ns(ctx, nsproxy->uts_ns);
+#if 0
+	if (!ret && (ns_flags & CLONE_NEWIPC))
+		ret = checkpoint_ipc_ns(ctx, nsproxy->ipc_ns);
+#endif
 	/* FIX: Write other namespaces here */
 	return ret;
@@ -438,9 +451,10 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 		return (struct nsproxy *) h;
 	ret = -EINVAL;
-	if (h->uts_objref <= 0)
+	if (h->uts_objref <= 0 ||
+	    h->ipc_objref <= 0)
 		goto out;
-	if (h->flags & ~CLONE_NEWUTS)
+	if (h->flags & ~(CLONE_NEWUTS | CLONE_NEWIPC))
 		goto out;
 	/* each unseen-before namespace will be un-shared now */
@@ -456,6 +470,12 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	 */
 	ret = restore_uts_ns(ctx, h->uts_objref, h->flags);
 	ckpt_debug("uts ns: %d\n", ret);
+	if (ret < 0)
+		goto out;
+#if 0
+	ret = restore_ipc_ns(ctx, h->ipc_objref, h->flags);
+	ckpt_debug("ipc ns: %d\n", ret);
+#endif
 	/* FIX: add more namespaces here */
 out:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index a7125fc..5a42399 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -81,6 +81,21
[Devel] [RFC v16][PATCH 43/43] c/r: define s390-specific checkpoint-restart code
From: Dan Smith da...@us.ibm.com

Implement the s390 arch-specific checkpoint/restart helpers. This is on
top of Oren Laadan's c/r code.

With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro. While on x86 I never had to freeze a single task to
checkpoint it, on s390 I do need to. That is a prereq for consistent
snapshots (esp with multiple processes) anyway so I don't see that as a
problem.

Changelog:
  Apr 11:
    . Introduce ckpt_arch_vdso()
  Feb 27:
    . Add checkpoint_s390.h
    . Fixed up save and restore of PSW, with the non-address bits
      properly masked out
  Feb 25:
    . Make checkpoint_hdr.h safe for inclusion in userspace
    . Replace comment about vdso code
    . Add comment about restoring access registers
    . Write and read an empty ckpt_hdr_head_arch record to appease
      code (mktree) that expects it to be there
    . Utilize NUM_CKPT_WORDS in checkpoint_hdr.h
  Feb 24:
    . Use CKPT_COPY() to unify the un/loading of cpu and mm state
    . Fix fprs definition in ckpt_hdr_cpu
    . Remove debug WARN_ON() from checkpoint.c
  Feb 23:
    . Macro-ize the un/packing of trace flags
    . Fix the crash when externally-linked
    . Break out the restart functions into restart.c
    . Remove unneeded s390_enable_sie() call
  Jan 30:
    . Switched types in ckpt_hdr_cpu to __u64 etc.
      (Per Oren suggestion)
    . Replaced direct inclusion of structs in ckpt_hdr_cpu with the
      struct members. (Per Oren suggestion)
    . Also ended up adding a bunch of new things into restart
      (mm_segment, ksp, etc) in vain attempt to get code using fpu
      to not segfault after restart.

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
Signed-off-by: Dan Smith da...@us.ibm.com
---
 arch/s390/include/asm/checkpoint_hdr.h |   81 ++
 arch/s390/include/asm/unistd.h         |    4 +-
 arch/s390/kernel/compat_wrapper.S      |   12 ++
 arch/s390/kernel/syscalls.S            |    2 +
 arch/s390/mm/Makefile                  |    1 +
 arch/s390/mm/checkpoint.c              |  183
 arch/s390/mm/checkpoint_s390.h         |   23
 7 files changed, 305 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
new file mode 100644
index 000..185194b
--- /dev/null
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -0,0 +1,81 @@
+#ifndef __ASM_S390_CKPT_HDR_H
+#define __ASM_S390_CKPT_HDR_H
+/*
+ * Checkpoint/restart - architecture specific headers s/390
+ *
+ * Copyright IBM Corp. 2009
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/checkpoint_hdr.h>
+#include <asm/ptrace.h>
+
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+/*
+ * Notes
+ * NUM_GPRS defined in asm/ptrace.h to be 16
+ * NUM_FPRS defined in asm/ptrace.h to be 16
+ * NUM_ACRS defined in asm/ptrace.h to be 16
+ * NUM_CR_WORDS defined in asm/ptrace.h to be 3
+ */
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	__u64 args[1];
+	__u64 gprs[NUM_GPRS];
+	__u64 orig_gpr2;
+	__u16 svcnr;
+	__u16 ilc;
+	__u32 acrs[NUM_ACRS];
+	__u64 ieee_instruction_pointer;
+
+	/* psw_t */
+	__u64 psw_t_mask;
+	__u64 psw_t_addr;
+
+	/* s390_fp_regs_t */
+	__u32 fpc;
+	union {
+		float f;
+		double d;
+		__u64 ui;
+		struct {
+			__u32 fp_hi;
+			__u32 fp_lo;
+		} fp;
+	} fprs[NUM_FPRS];
+
+	/* per_struct */
+	__u64 per_control_regs[NUM_CR_WORDS];
+	__u64 starting_addr;
+	__u64 ending_addr;
+	__u64 address;
+	__u16 perc_atmid;
+	__u8 access_id;
+	__u8 single_step;
+	__u8 instruction_fetch;
+};
+
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	unsigned long vdso_base;
+	int noexec;
+	int has_pgste;
+	int alloc_pgste;
+	unsigned long asce_bits;
+	unsigned long asce_limit;
+};
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+};
+
+#endif /* __ASM_S390_CKPT_HDR_H */
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index f0f19e6..3d22f17 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -267,7 +267,9 @@
 #define __NR_epoll_create1	327
 #define	__NR_preadv		328
 #define	__NR_pwritev		329
-#define NR_syscalls 330
+#define __NR_checkpoint	330
+#define
[Devel] [RFC v16][PATCH 11/43] c/r: add generic '->checkpoint' f_op to ext fses
From: Dave Hansen d...@linux.vnet.ibm.com

This marks ext[234] as being checkpointable. There will be many more to
do this to, but this is a start.

Signed-off-by: Dave Hansen d...@linux.vnet.ibm.com
---
 fs/ext2/dir.c  |    1 +
 fs/ext2/file.c |    2 ++
 fs/ext3/dir.c  |    1 +
 fs/ext3/file.c |    1 +
 fs/ext4/dir.c  |    1 +
 fs/ext4/file.c |    1 +
 6 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 2999d72..4f1dd79 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -721,4 +721,5 @@ const struct file_operations ext2_dir_operations = {
 	.compat_ioctl	= ext2_compat_ioctl,
 #endif
 	.fsync		= ext2_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 45ed071..e1731c5 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -58,6 +58,7 @@ const struct file_operations ext2_file_operations = {
 	.fsync		= ext2_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 #ifdef CONFIG_EXT2_FS_XIP
@@ -73,6 +74,7 @@ const struct file_operations ext2_xip_file_operations = {
 	.open		= generic_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_sync_file,
+	.checkpoint	= generic_file_checkpoint,
 };
 #endif
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 3d724a9..54b05d2 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
 #endif
 	.fsync		= ext3_sync_file,	/* BKL held */
 	.release	= ext3_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..a421e07 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -126,6 +126,7 @@ const struct file_operations ext3_file_operations = {
 	.fsync		= ext3_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index b647899..2787fdb 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_operations = {
 #endif
 	.fsync		= ext4_sync_file,
 	.release	= ext4_release_dir,
+	.checkpoint	= generic_file_checkpoint,
 };
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 588af8c..c2dab33 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -161,6 +161,7 @@ const struct file_operations ext4_file_operations = {
 	.fsync		= ext4_sync_file,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.checkpoint	= generic_file_checkpoint,
 };
 const struct inode_operations ext4_file_inode_operations = {
--
1.6.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
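The patch above opts each ext file type in by filling a new slot in its operations table. A minimal userspace sketch of that pattern follows. All demo_* names are hypothetical, and returning -EBADF for a file type that has no hook is an assumption made for illustration, not something this patch states.

```c
#include <errno.h>
#include <stddef.h>

struct demo_file;

/* Hypothetical stand-in for struct file_operations. */
struct demo_file_operations {
	int (*fsync)(struct demo_file *file);
	int (*checkpoint)(struct demo_file *file);	/* optional hook */
};

struct demo_file {
	const struct demo_file_operations *f_op;
	int checkpointed;
};

/* Plays the role of a generic hook shared by several file types. */
static int demo_generic_file_checkpoint(struct demo_file *file)
{
	file->checkpointed = 1;
	return 0;
}

/* Caller side: a file type without the hook cannot be checkpointed. */
static int demo_checkpoint_file(struct demo_file *file)
{
	if (!file->f_op || !file->f_op->checkpoint)
		return -EBADF;	/* assumed error code */
	return file->f_op->checkpoint(file);
}

/* One table opts in, the other leaves the slot NULL. */
static const struct demo_file_operations demo_ext_ops = {
	.checkpoint = demo_generic_file_checkpoint,
};

static const struct demo_file_operations demo_other_ops = {
	.fsync = NULL,
};
```

With designated initializers every unnamed slot is NULL, so a file type that has not been converted is rejected by default rather than silently skipped.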
[Devel] Re: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page
Sukadev Bhattiprolu wrote:
> From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> Date: Mon, 4 May 2009 01:17:39 -0700
> Subject: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page
>
> Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> ---

Reviewed-by: Oren Laadan or...@cs.columbia.edu

>  kernel/pid.c |   43 ---
>  1 files changed, 28 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index b2e5f78..c0aaebe 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -122,6 +122,31 @@ static void free_pidmap(struct upid *upid)
>  	atomic_inc(&map->nr_free);
>  }
> +static int alloc_pidmap_page(struct pidmap *map)
> +{
> +	void *page;
> +
> +	if (likely(map->page))
> +		return 0;
> +
> +	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +
> +	/*
> +	 * Free the page if someone raced with us installing it:
> +	 */
> +	spin_lock_irq(&pidmap_lock);
> +	if (map->page)
> +		kfree(page);
> +	else
> +		map->page = page;
> +	spin_unlock_irq(&pidmap_lock);
> +
> +	if (unlikely(!map->page))
> +		return -1;
> +
> +	return 0;
> +}
> +
>  static int alloc_pidmap(struct pid_namespace *pid_ns)
>  {
>  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
> @@ -134,21 +159,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
>  	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
>  	for (i = 0; i <= max_scan; ++i) {
> -		if (unlikely(!map->page)) {
> -			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> -			/*
> -			 * Free the page if someone raced with us
> -			 * installing it:
> -			 */
> -			spin_lock_irq(&pidmap_lock);
> -			if (map->page)
> -				kfree(page);
> -			else
> -				map->page = page;
> -			spin_unlock_irq(&pidmap_lock);
> -			if (unlikely(!map->page))
> -				break;
> -		}
> +		if (alloc_pidmap_page(map))
> +			break;
> +
>  		if (likely(atomic_read(&map->nr_free))) {
>  			do {
>  				if (!test_and_set_bit(offset, map->page)) {
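The helper factored out above allocates outside the lock and installs the page under it, freeing its own copy if another caller got there first. A minimal userspace sketch of the same idea follows, assuming a pthread mutex in place of pidmap_lock and calloc() in place of kzalloc(); the demo_* names are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct pidmap. */
struct demo_map {
	void *page;
};

static pthread_mutex_t demo_map_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 if map->page is usable on return, -1 if allocation failed. */
static int demo_alloc_map_page(struct demo_map *map, size_t page_size)
{
	void *page;

	/* Unlocked fast path, as in the patch; strict C11 code would
	 * want an atomic load here. */
	if (map->page)
		return 0;

	/* Allocate without holding the lock; calloc() zeroes the memory. */
	page = calloc(1, page_size);

	pthread_mutex_lock(&demo_map_lock);
	if (map->page)
		free(page);	/* lost the race; free(NULL) is a no-op */
	else
		map->page = page;
	pthread_mutex_unlock(&demo_map_lock);

	return map->page ? 0 : -1;
}
```

A second call on the same map takes the fast path and leaves the installed page untouched, which is the property the caller in alloc_pidmap() relies on.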
[Devel] Re: [RFC v16][PATCH 23/43] c/r: restart multiple processes
On Wed, May 27, 2009 at 01:32:49PM -0400, Oren Laadan wrote:
> Restarting of multiple processes expects all restarting tasks to call
> sys_restart(). Once inside the system call, each task will restart
> itself in the same order that they were saved. The internals of the
> syscall will take care of in-kernel synchronization between tasks.
>
> This patch does _not_ create the task tree in the kernel. Instead it
> assumes that all tasks are created in some way and then invoke the
> restart syscall. You can use the userspace mktree.c program to do
> that.
>
> The init task (*) has a special role: it allocates the restart context
> (ctx), and coordinates the operation. In particular, it first waits
> until all participating tasks enter the kernel, and provides them the
> common restart context. Once everyone is ready, it begins to restart
> itself.
>
> In contrast, the other tasks enter the kernel, locate the init task (*)
> and grab its restart context, and then wait for their turn to restore.
>
> When a task (init or not) completes its restart, it hands the control
> over to the next in line, by waking that task.
>
> An array of pids (the one saved during the checkpoint) is used to
> synchronize the operation. The first task in the array is the init
> task (*). The restart context (ctx) maintains a current position in
> the array, which indicates which task is currently active. Once the
> currently active task completes its own restart, it increments that
> position and wakes up the next task.
>
> Restart assumes that userspace provides meaningful data, otherwise
> it's garbage-in-garbage-out. In this case, the syscall may block
> indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
> otherwise kill the stray restarting tasks.
>
> In terms of security, restart runs as the user that invokes it, so it
> will not allow a user to do more than is otherwise permitted by the
> usual system semantics and policy.
>
> Currently we ignore threads and zombies

Let's discuss threads and zombies.

1. Will a zombie end up in an image?
2. If yes, how will it be restored? Will it be forked, call restart(2)
   and then somehow be zombified inside the kernel?
3. How will a thread group be restored, will every thread be
   CLONE_THREAD'ed? What to do with exited thread group leaders, will
   they be forked, then CLONE_THREAD the thread group?
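The turn-taking described in the quoted patch text (an array of saved pids plus a current position that the active task advances) can be modelled in a few lines. This is a single-threaded sketch of the bookkeeping only, with hypothetical demo_* names; the real code sleeps and wakes tasks inside the kernel, which is not modelled here.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical model of the shared restart context. */
struct demo_restart_ctx {
	const pid_t *pids;	/* pids saved at checkpoint; pids[0] is init */
	int nr_pids;
	int active;		/* index of the task whose turn it is */
};

/*
 * Returns 1 if it is pid's turn to restore, 0 if it has to keep
 * waiting, and -ESRCH if pid is not part of the saved array.
 */
static int demo_my_turn(const struct demo_restart_ctx *ctx, pid_t pid)
{
	int i;

	for (i = 0; i < ctx->nr_pids; i++)
		if (ctx->pids[i] == pid)
			return i == ctx->active;
	return -ESRCH;
}

/*
 * Called by the active task once its own restart is complete.
 * Advances the position and returns the pid of the task to wake
 * next, or 0 when every task has been restored.
 */
static pid_t demo_restart_done(struct demo_restart_ctx *ctx)
{
	ctx->active++;
	return ctx->active < ctx->nr_pids ? ctx->pids[ctx->active] : 0;
}
```

Because only the active task ever advances the position, the order of restoration is fixed by the saved array regardless of the order in which tasks entered the syscall.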
[Devel] Re: [PATCH 2/7] [PATCH] Have alloc_pidmap() return actual error code
Sukadev Bhattiprolu wrote:
> From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> Date: Mon, 4 May 2009 01:17:40 -0700
> Subject: [PATCH 2/7] [PATCH] Have alloc_pidmap() return actual error code
>
> alloc_pidmap() can fail either because all pid numbers are in use or
> we can't allocate memory. With support for setting a specific pid
> number, alloc_pidmap() would also fail if either the given pid number
> is invalid or in use.
>
> Rather than have caller assume -ENOMEM, have alloc_pidmap() return the
> actual error.
>
> Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> ---

Reviewed-by: Oren Laadan or...@cs.columbia.edu

>  kernel/fork.c |    5 +++--
>  kernel/pid.c  |    9 ++---
>  2 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b9e2edd..f8411a8 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1119,10 +1119,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  		goto bad_fork_cleanup_io;
>  	if (pid != &init_struct_pid) {
> -		retval = -ENOMEM;
>  		pid = alloc_pid(p->nsproxy->pid_ns);
> -		if (!pid)
> +		if (IS_ERR(pid)) {
> +			retval = PTR_ERR(pid);
>  			goto bad_fork_cleanup_io;
> +		}
>  		if (clone_flags & CLONE_NEWPID) {
>  			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
> diff --git a/kernel/pid.c b/kernel/pid.c
> index c0aaebe..fd72ad9 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -151,6 +151,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  {
>  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
>  	struct pidmap *map;
> +	int rc = -EAGAIN;
>  	pid = last + 1;
>  	if (pid >= pid_max)
> @@ -159,8 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
>  	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
>  	for (i = 0; i <= max_scan; ++i) {
> -		if (alloc_pidmap_page(map))
> +		if (alloc_pidmap_page(map)) {
> +			rc = -ENOMEM;
>  			break;
> +		}
>  		if (likely(atomic_read(&map->nr_free))) {
>  			do {
> @@ -192,7 +195,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  		}
>  		pid = mk_pid(pid_ns, map, offset);
>  	}
> -	return -1;
> +	return rc;
>  }
>  int next_pidmap(struct pid_namespace *pid_ns, int last)
> @@ -297,7 +300,7 @@ out_free:
>  		free_pidmap(pid->numbers + i);
>  	kmem_cache_free(ns->pid_cachep, pid);
> -	pid = NULL;
> +	pid = ERR_PTR(nr);
>  	goto out;
>  }
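The patch above switches alloc_pid() from returning NULL on failure to returning an encoded errno, which is why the caller's test changes from !pid to IS_ERR(pid). The following is a simplified userspace model of that error-pointer convention, written for illustration; the demo_* helpers are hypothetical and are not the kernel's definitions from include/linux/err.h.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the top DEMO_MAX_ERRNO addresses. */
static inline int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

static int demo_slot;

/* Hypothetical allocator: fails with the given errno when nonzero. */
static void *demo_alloc(int fail_with)
{
	if (fail_with)
		return demo_err_ptr(-fail_with);
	return &demo_slot;
}
```

Note that NULL is not an error pointer under this scheme, so a function converted to it has to stop returning NULL for failures, as the out_free hunk does with pid = ERR_PTR(nr).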
[Devel] Re: [PATCH 6/7] [PATCH] Define do_fork_with_pids()
Sukadev Bhattiprolu wrote:
> From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> Date: Mon, 4 May 2009 01:17:44 -0700
> Subject: [PATCH 6/7] [PATCH] Define do_fork_with_pids()
>
> do_fork_with_pids() is same as do_fork(), except that it takes an
> additional, target_pids, parameter. This parameter, currently unused,
> specifies the target_pids of the process in each of its pid namespaces.
>
> Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> ---

Reviewed-by: Oren Laadan or...@cs.columbia.edu

>  include/linux/sched.h |    1 +
>  kernel/fork.c         |   17 ++---
>  2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b4c38bc..2173df1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1995,6 +1995,7 @@ extern int disallow_signal(int);
>  extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
>  extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
> +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t *target_pids);
>  struct task_struct *fork_idle(int);
>  extern void set_task_comm(struct task_struct *tsk, char *from);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 373411e..912d008 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1340,17 +1340,17 @@ struct task_struct * __cpuinit fork_idle(int cpu)
>   * It copies the process, and if successful kick-starts
>   * it and waits for it to finish using the VM if required.
>   */
> -long do_fork(unsigned long clone_flags,
> +long do_fork_with_pids(unsigned long clone_flags,
>  	      unsigned long stack_start,
>  	      struct pt_regs *regs,
>  	      unsigned long stack_size,
>  	      int __user *parent_tidptr,
> -	      int __user *child_tidptr)
> +	      int __user *child_tidptr,
> +	      pid_t *target_pids)
>  {
>  	struct task_struct *p;
>  	int trace = 0;
>  	long nr;
> -	pid_t *target_pids = NULL;
>  	/*
>  	 * Do some preliminary argument and permissions checking before we
> @@ -1448,6 +1448,17 @@ long do_fork(unsigned long clone_flags,
>  	return nr;
>  }
> +long do_fork(unsigned long clone_flags,
> +	      unsigned long stack_start,
> +	      struct pt_regs *regs,
> +	      unsigned long stack_size,
> +	      int __user *parent_tidptr,
> +	      int __user *child_tidptr)
> +{
> +	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
> +				 parent_tidptr, child_tidptr, NULL);
> +}
> +
>  #ifndef ARCH_MIN_MMSTRUCT_ALIGN
>  #define ARCH_MIN_MMSTRUCT_ALIGN 0
>  #endif
[Devel] Re: [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()
Sukadev Bhattiprolu wrote:
> From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> Date: Mon, 4 May 2009 01:17:43 -0700
> Subject: [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()
>
> The new parameter will be used in a follow-on patch when
> clone_with_pids() is implemented.
>
> Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> ---

Reviewed-by: Oren Laadan or...@cs.columbia.edu

>  kernel/fork.c |    7 ---
>  1 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d2d69d3..373411e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  					unsigned long stack_size,
>  					int __user *child_tidptr,
>  					struct pid *pid,
> +					pid_t *target_pids,
>  					int trace)
>  {
>  	int retval;
>  	struct task_struct *p;
>  	int cgroup_callbacks_done = 0;
> -	pid_t *target_pids = NULL;
>  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
>  		return ERR_PTR(-EINVAL);
> @@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
>  	struct pt_regs regs;
>  	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
> -			    &init_struct_pid, 0);
> +			    &init_struct_pid, NULL, 0);
>  	if (!IS_ERR(task))
>  		init_idle(task, cpu);
> @@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags,
>  	struct task_struct *p;
>  	int trace = 0;
>  	long nr;
> +	pid_t *target_pids = NULL;
>  	/*
>  	 * Do some preliminary argument and permissions checking before we
> @@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags,
>  		trace = tracehook_prepare_clone(clone_flags);
>  	p = copy_process(clone_flags, stack_start, regs, stack_size,
> -			 child_tidptr, NULL, trace);
> +			 child_tidptr, NULL, target_pids, trace);
>  	/*
>  	 * Do this prior waking up the new thread - the thread pointer
>  	 * might get invalid after that point, if the thread exits quickly.
[Devel] Re: [PATCH 7/7] [PATCH] Define clone_with_pids syscall
Sukadev Bhattiprolu wrote:
> From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> Date: Mon, 4 May 2009 01:17:45 -0700
> Subject: [PATCH 7/7] [PATCH] Define clone_with_pids syscall
>
> clone_with_pids() is same as clone(), except that it takes a
> 'target_pid_set' parameter which lets caller choose a specific pid
> number for the child process in each of the child process's pid
> namespace. This system call would be needed to implement
> Checkpoint/Restart (i.e after a checkpoint, restart a process with its
> original pids).
>
> Call clone_with_pids as follows:
>
>	pid_t pids[] = { 0, 77, 99 };
>	struct target_pid_set pid_set;
>
>	pid_set.num_pids = sizeof(pids) / sizeof(int);
>	pid_set.target_pids = pids;
>
>	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);
>
> If a target-pid is 0, the kernel continues to assign a pid for the
> process in that namespace. In the above example, pids[0] is 0, meaning
> the kernel will assign next available pid to the process in
> init_pid_ns. But kernel will assign pid 77 in the child pid namespace
> 1 and pid 99 in pid namespace 2. If either 77 or 99 are taken, the
> system call fails with -EBUSY.
>
> If 'pid_set.num_pids' exceeds the current nesting level of pid
> namespaces, the system call fails with -EINVAL.
>
> It's mostly an exploratory patch seeking feedback on the interface.
>
> NOTE:
>	Compared to clone(), clone_with_pids() needs to pass in two more
>	pieces of information:
>
>		- number of pids in the set
>		- user buffer containing the list of pids.
>
>	But since clone() already takes 5 parameters, use a
>	'struct target_pid_set'.
>
> TODO:
>	- Gently tested.
>	- May need additional sanity checks in check_target_pids()
>	- Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid
>	  in the namespace is either 1 or 0).
>
> Changelog[v1]:
>	- Fixed some compile errors (had fixed these errors earlier in my
>	  git tree but had not refreshed patches before emailing them)
>
> Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
> ---

Reviewed-by: Oren Laadan or...@cs.columbia.edu

but ...

[...]

> +static pid_t *copy_target_pids(unsigned long clone_flags, void __user *upid_setp)
> +{
> +	int rc;
> +	int size;
> +	pid_t __user *utarget_pids;
> +	pid_t *target_pids;
> +	struct target_pid_set pid_set;
> +
> +	if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set)))
> +		return ERR_PTR(-EFAULT);
> +
> +	size = pid_set.num_pids * sizeof(pid_t);

...either test pid_set.num_pids < 0 (and give -EINVAL), or...

[...]

> +struct target_pid_set {
> +	int num_pids;

... make this 'size_t' ?

> +	pid_t *target_pids;
> +};
> +
>  #endif /* __KERNEL__ */
>  #endif /* __ASSEMBLY__ */
>  #endif /* _LINUX_TYPES_H */
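The review comment above concerns computing a buffer size from a user-supplied signed count. A minimal sketch of the first suggested option, rejecting a bad count before the multiplication, follows. The helper name and the upper bound DEMO_MAX_PID_NS_LEVEL are assumptions made for illustration and do not appear in the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#define DEMO_MAX_PID_NS_LEVEL 32	/* assumed bound, for illustration only */

/*
 * Returns the size in bytes of an array of num_pids pid_t entries,
 * or -EINVAL if the count is negative or implausibly large.
 */
static long demo_target_pids_size(int num_pids)
{
	if (num_pids < 0 || num_pids > DEMO_MAX_PID_NS_LEVEL)
		return -EINVAL;
	return (long)num_pids * (long)sizeof(pid_t);
}
```

Checking the count first means a negative num_pids can never turn into a huge or negative size passed on to an allocation or a copy from userspace. Switching the field to an unsigned type, the other option raised in the review, removes the negative case but still leaves the upper bound to be checked.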
[Devel] Re: [PATCH 18/38] C/R: core stuff
Alexey Dobriyan wrote:
> On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
>> Quoting Alexey Dobriyan (adobri...@gmail.com):
>>> Introduction
>>>
>>> Checkpoint/restart (C/R from now) allows to dump group of processes
>>> to disk for various reasons like saving process state in case of box
>>> failure or restoration of group of processes on another or same
>>> machine later.
>>>
>>> Unlike, let's say, hypervisor C/R style which only needs to freeze
>>> guest kernel and dump more or less raw pages, proposed C/R doesn't
>>> require hypervisor. For that C/R code needs to know about all little
>>> and big intimate kernel details.
>>>
>>> The good thing is that not all details needs to be serialized and
>>> saved like, say, readahead state. The bad things is still quite a
>>> few things need to be.
>>
>> Hi Alexey,
>>
>> the last time you posted this, I went through and tried to discern
>> the meaningful differences between yours and Oren's patchsets. Then
>> I sent some patches to Oren to make his set configurable to act more
>> like yours. And Oren took them! But now you resend this patchset
>> with no real changelog, no acknowledgment that Oren's set even exists
>
> Is this a requirement? Everybody following topic already knows about
> Oren's patchset.

Some people do ack other people's work. See for example patches #1 and
#24 in my recent post. You're welcome.

>> - or is much farther along and pretty widely reviewed and tested
>> (which is only because he started earlier and, when we asked for your
>> counterpatches at an earlier stage, you would never reply) - or, most
>> importantly, what it is that you think your patchset does that his
>> does not and cannot.
>
> There are differences. And they're not small like you're trying to
> describe but pretty big compared to the scale of the problem.

I've asked before, and I repeat now: can you enumerate these big scary
differences that make it such a big problem ?

So far, we identified two main design issues -

1) Whether or not allow c/r of sub-container (partial hierarchy)
2) Creation of restarting process hierarchy in kernel or in userspace

As for #1, you are the _only_ one who advocates restricting c/r to a
full container only. I guess you have your reasons, but I'm unsure what
they may be. On the other hand, there has been a handful of use-cases
and opinions in favor of allowing both capabilities to co-exist. Not to
mention that nearly no additional code is necessary, on the contrary.

As for #2, you didn't even bother to reply to the discussion that I had
started about it. This decision is important to allow future
flexibility of the mechanism, and to address the needs of several
potential users, as seen in that discussion and others. Here, too, you
are the _only_ one that advocates that direction.

And the funniest thing -- *both* decisions can be *easily* overturned
in my patchset. In fact, regarding #2 - either way can be easily done
in it.

So I wonder, what are the big issues that bother you so much ?
if there is a will, there is a way.

>> *Why* are you spending your time on this instead of helping with
>> Oren's set?
>
> Because we disagree with some core directions Oren chose.
> ANK literally said: I don't know how to dump live netns.

Eh... and you have it all sorted out ? (yeah, I do, but not in this
patchset).

> So, partly patchset was created so that absolutely nobody will tell us
> to shut up and show the code.

Oh well ... the code meaning your code I suppose.

> The other part, is that I looked at Oren patchset, found quite a lot
> of suspicious, broken and unclean places and decided that it'd be
> faster to start from scratch because sending patches will overhaul
> like 85% of the code.

So you actually took the time to read and review. And then you spent
even more time in calculating this number !

Feedback appreciated. If you looked closely you would have seen that we
do address your concerns over time.

> One example, is why CKPT_HDR_CPU and CKPT_RESTART_BLOCK exist at all?
> Should objects in image be only what sharable objects are in kernel
> (expect VMAs, pages and possibly file descriptors)? pt_regs don't
> exist by themselves after all.

A good reason to break it into small pieces is for ease of maintenance
and debugging, as well as in the future easier transition between
incompatible kernel versions. I think it's better than a
few-pages-long single struct. And it encourages more naming of things.

But ... I'm confused ... is this your big concern ? Oh well, if that's
what stands in your way, we could even rework that (~1.3% of the
code ? I reckon...).

> And since you guys showed that just idea of in-kernel checkpointing is
> not rejected outright, it doesn't mean that you can drag every single
> idea too. Because history shows, that once something (especially
> user-visible, like restart syscall semantics) is in kernel it's nearly
> impossible to cut it out, so it's very-very important to get it right
> from the very beginning.

Yes. Let's indeed talk about how to get
[Devel] Re: [RFC v16][PATCH 19/43] c/r: external checkpoint of a task other than ourself
On Wed, May 27, 2009 at 01:32:45PM -0400, Oren Laadan wrote:
> Now we can do external checkpoint, i.e. act on another task.

> +static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
> +{
> +	if (t->state == TASK_DEAD) {
> +		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
> +		return -EAGAIN;
> +	}
> +
> +	if (!ptrace_may_access(t, PTRACE_MODE_READ)) {
> +		__ckpt_write_err(ctx, "access to task %d (%s) denied",
> +				 task_pid_vnr(t), t->comm);
> +		return -EPERM;
> +	}
> +
> +	/* verify that the task is frozen (unless self) */
> +	if (t != current && !frozen(t)) {
> +		__ckpt_write_err(ctx, "task %d (%s) is not frozen",
> +				 task_pid_vnr(t), t->comm);
> +		return -EBUSY;
> +	}
> +
> +	/* FIX: add support for ptraced tasks */
> +	if (task_ptrace(t)) {
> +		__ckpt_write_err(ctx, "task %d (%s) is ptraced",
> +				 task_pid_vnr(t), t->comm);
> +		return -EBUSY;
> +	}
> +
> +	return 0;
> +}
> +
> +static int get_container(struct ckpt_ctx *ctx, pid_t pid)
> +{
> +	struct task_struct *task = NULL;
> +	struct nsproxy *nsproxy = NULL;
> +	int ret;
> +
> +	ctx->root_pid = pid;
> +
> +	read_lock(&tasklist_lock);
> +	task = find_task_by_vpid(pid);
> +	if (task)
> +		get_task_struct(task);
> +	read_unlock(&tasklist_lock);
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	ret = may_checkpoint_task(ctx, task);
> +	if (ret) {
> +		ckpt_write_err(ctx, NULL);
> +		put_task_struct(task);
> +		return ret;
> +	}
> +
> +	rcu_read_lock();
> +	nsproxy = task_nsproxy(task);
> +	get_nsproxy(nsproxy);

Will oops if init is multi-threaded and thread group leader exited
(nsproxy = NULL). I need to think what to do, too.

> +	rcu_read_unlock();
> +
> +	ctx->root_task = task;
> +	ctx->root_nsproxy = nsproxy;
> +
> +	return 0;
> +}
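The oops pointed out above comes from taking a reference through a pointer that can be NULL once the task has exited. The reviewer leaves the real fix open, so the following is only a minimal userspace sketch of one possible guard, with hypothetical demo_* types and -ESRCH as an assumed error code. It does not address how a multi-threaded init whose leader has exited should be checkpointed.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct nsproxy, task_struct and ckpt_ctx. */
struct demo_nsproxy {
	int count;
};

struct demo_task {
	struct demo_nsproxy *nsproxy;	/* NULL once the task has exited */
};

struct demo_ctx {
	struct demo_task *root_task;
	struct demo_nsproxy *root_nsproxy;
};

/* Take a reference on the task's nsproxy only if it still has one. */
static int demo_get_container(struct demo_ctx *ctx, struct demo_task *task)
{
	struct demo_nsproxy *nsproxy = task->nsproxy;

	if (!nsproxy)
		return -ESRCH;	/* assumed error code */

	nsproxy->count++;
	ctx->root_task = task;
	ctx->root_nsproxy = nsproxy;
	return 0;
}
```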
[Devel] Re: [RFC v16][PATCH 23/43] c/r: restart multiple processes
Alexey Dobriyan wrote:
> On Wed, May 27, 2009 at 01:32:49PM -0400, Oren Laadan wrote:
>> Restarting of multiple processes expects all restarting tasks to call
>> sys_restart(). Once inside the system call, each task will restart
>> itself in the same order in which they were saved. The internals of
>> the syscall will take care of in-kernel synchronization between tasks.
>>
>> This patch does _not_ create the task tree in the kernel. Instead it
>> assumes that all tasks are created in some way and then invoke the
>> restart syscall. You can use the userspace mktree.c program to do
>> that.
>>
>> The init task (*) has a special role: it allocates the restart context
>> (ctx), and coordinates the operation. In particular, it first waits
>> until all participating tasks enter the kernel, and provides them the
>> common restart context. Once everyone is ready, it begins to restart
>> itself.
>>
>> In contrast, the other tasks enter the kernel, locate the init task (*)
>> and grab its restart context, and then wait for their turn to restore.
>>
>> When a task (init or not) completes its restart, it hands the control
>> over to the next in line, by waking that task.
>>
>> An array of pids (the one saved during the checkpoint) is used to
>> synchronize the operation. The first task in the array is the init
>> task (*). The restart context (ctx) maintains a current position in
>> the array, which indicates which task is currently active. Once the
>> currently active task completes its own restart, it increments that
>> position and wakes up the next task.
>>
>> Restart assumes that userspace provides meaningful data, otherwise
>> it's garbage-in-garbage-out. In this case, the syscall may block
>> indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
>> otherwise kill the stray restarting tasks.
>>
>> In terms of security, restart runs as the user that invokes it, so it
>> will not allow a user to do more than is otherwise permitted by the
>> usual system semantics and policy.
>>
>> Currently we ignore threads and zombies
>
> Let's discuss threads and zombies.
>
> 1. Will zombies end up in an image?

Zombies will be mentioned in the hierarchy description, and will have
very little state saved (e.g. exit status, parent).

> 2. If yes, how will they be restored? Will a zombie be forked, call
>    restart(2) and then somehow be zombified inside the kernel?

(not part of this patchset, but soon will be added to ckpt-v16-dev)

A zombie will be restarted as a normal process, will restore the bare
minimum needed, and will call do_exit(). It will have to ensure that
there are no side effects on (= signals to) parent/children.

> 3. How will a thread group be restored? Will every thread be
>    CLONE_THREAD'ed? What to do with exited thread group leaders, will
>    they be forked, then CLONE_THREAD the thread group?

First, user space creates the entire tree hierarchy, including zombies.
Then each task calls sys_restart(). Inside, they are coordinated to
restore their state one after the other. So eventually the
to-be-zombies, be it a thread-group-leader or not, will call do_exit()
and zombify themselves.

Take a look at mktree.c (part of the user tools). It's already done
there using CLONE_THREAD.

The reason I wrote that it isn't supported well is because I think that
in full-container mode the link count won't work correctly. Other than
that, threads should work as long as you don't play with partial
sharing (e.g. only CLONE_FS).

Oren.
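The hand-off described in the patch description (an array of saved pids plus a "current position" in the restart context) can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct restart_ctx`, `restart_is_my_turn()` and `restart_finish_turn()` are hypothetical names chosen for illustration, not the patchset's data structures, and the real code sleeps and wakes tasks in the kernel rather than returning a pid.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical userspace model of the restart context; not kernel code. */
struct restart_ctx {
	const pid_t *pids;	/* pids saved at checkpoint; pids[0] is the init task */
	size_t nr_pids;		/* number of entries in pids[] */
	size_t active;		/* index of the task currently restoring itself */
};

/* Return 1 if 'pid' is the task whose turn it is, 0 otherwise. */
static int restart_is_my_turn(const struct restart_ctx *ctx, pid_t pid)
{
	return ctx->active < ctx->nr_pids && ctx->pids[ctx->active] == pid;
}

/*
 * Called by the active task once its own restore is complete.
 * Advances the position and returns the pid that should be woken next,
 * 0 when every task has been restored, or -1 if the caller is not the
 * active task (it has to keep waiting for its turn).
 */
static pid_t restart_finish_turn(struct restart_ctx *ctx, pid_t pid)
{
	if (!restart_is_my_turn(ctx, pid))
		return -1;

	ctx->active++;
	if (ctx->active == ctx->nr_pids)
		return 0;

	return ctx->pids[ctx->active];
}
```

With a saved array of { 1, 7, 9 }, task 1 (the init task) finishes first and hands over to 7, which hands over to 9; a task calling out of order is simply told to wait.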
[Devel] Re: [PATCH 18/38] C/R: core stuff
On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
> Alexey Dobriyan wrote:
>> On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
>>> Quoting Alexey Dobriyan (adobri...@gmail.com):
>>>> Introduction
>>>>
>>>> Checkpoint/restart (C/R from now) allows to dump a group of
>>>> processes to disk for various reasons like saving process state in
>>>> case of box failure or restoration of a group of processes on
>>>> another or the same machine later.
>>>>
>>>> Unlike, let's say, hypervisor C/R style which only needs to freeze
>>>> the guest kernel and dump more or less raw pages, proposed C/R
>>>> doesn't require a hypervisor. For that, C/R code needs to know about
>>>> all little and big intimate kernel details.
>>>>
>>>> The good thing is that not all details need to be serialized and
>>>> saved like, say, readahead state. The bad thing is that still quite
>>>> a few things need to be.
>>>
>>> Hi Alexey,
>>>
>>> the last time you posted this, I went through and tried to discern
>>> the meaningful differences between yours and Oren's patchsets. Then
>>> I sent some patches to Oren to make his set configurable to act more
>>> like yours. And Oren took them! But now you resend this patchset
>>> with no real changelog, no acknowledgment that Oren's set even exists
>>
>> Is this a requirement? Everybody following the topic already knows
>> about Oren's patchset.
>
> Some people do ack other people's work. See for example patches #1 and
> #24 in my recent post. You're welcome.
>
>>> - or is much farther along and pretty widely reviewed and tested
>>> (which is only because he started earlier and, when we asked for
>>> your counterpatches at an earlier stage, you would never reply) -
>>> or, most importantly, what it is that you think your patchset does
>>> that his does not and cannot.
>>
>> There are differences. And they're not small like you're trying to
>> describe but pretty big compared to the scale of the problem.
>
> I've asked before, and I repeat now: can you enumerate these big scary
> differences that make it such a big problem ?
>
> So far, we identified two main "design" issues -

Why in ""?

Yes, they are high-level design issues.

> 1) Whether or not allow c/r of sub-container (partial hierarchy)
> 2) Creation of restarting process hierarchy in kernel or in userspace
>
> As for #1, you are the _only_ one who advocates restricting c/r to a
> full container only. I guess you have your reasons, but I'm unsure
> what they may be.

The reason is that checkpointing a half-frozen, half-live container is
essentially equivalent to checkpointing a live container, which adds
much complexity to the code and fundamentally prevents the kernel from
taking a coherent snapshot. In such situations the kernel will do its
job badly.

The manpage will be filled with strings like "if $FOO is shared then
$BAR is not guaranteed".

What to do if the user simply doesn't know if the container is bounded?
Checkpoint and to hell with consequences?

If two tasks share mm_struct you can't even detect that the pages you
dump aren't filled with garbage meanwhile by the second task.

If two tasks share mm_struct, the other task can issue AIO indefinitely,
preventing even a coherent filesystem snapshot from being taken.

That's why I raise this issue again to hear from people what they think,
and these people shouldn't be containers and C/R people, because the
latter already made up their minds.

This is a super-important issue to get right from the beginning.

> On the other hand, there has been a handful of use-cases and opinions
> in favor of allowing both capabilities to co-exist. Not to mention
> that nearly no additional code is necessary, on the contrary.
>
> As for #2, you didn't even bother to reply to the discussion that I
> had started about it. This decision is important to allow future
> flexibility of the mechanism, and to address the needs of several
> potential users, as seen in that discussion and others. Here, too,
> you are the _only_ one that advocates that direction.

Are you going to fork to-become-zombies, make them call restart(2) and
zombify?

> And the funniest thing -- *both* decisions can be *easily* overturned
> in my patchset. In fact, regarding #2 - either way can be easily done
> in it.
>
> So I wonder, what are the big issues that bother you so much ?
>
> "if there is a will, there is a way".

Oren, don't you really understand?

Users want millions of things, but every thing has a price. Some think
hardlinking of directories should be implemented. You can ask the VFS
guys how hard it would be, and how hard it would be to do reliably
without races/deadlocks et al.
[Devel] Re: [PATCH 18/38] C/R: core stuff
On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
>> Now here goes the second version, with prefixes fixed (kstate_) like
>> Ingo suggested and so Linus could look at the code, and with C/R code
>> moved close to the usual code, and with more checks added (which you
>> should have already!) to not restore a null selector in %cs for
>> example.
>
> It is far from perfect. In fact, it's even clearly commented as such,
> and exactly there. It would have been helpful if you pointed that out
> in a review, or even - god forbid - sent a patch to improve it.

This is ridiculous.

First, you declare that restart(2) should be allowed for anyone(!),
and then send a patchset for inclusion in -mm which doesn't even check
if selectors are right!

> But it works, and it lets people play with a more-than-a-toy
> implementation and provide us with important feedback.
>
> Oh, and by the way, it doesn't require that people use containers to
> try it out.

Setting up a container for playing is not hard:

	CLONE_NEWUTS=y
	CLONE_NEWIPC=y
	CLONE_NEWPID=y
	CLONE_NEWUSER=y
	CLONE_NEWNET=y

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>

#define CLONE_NEWNS	0x00020000
#define CLONE_NEWUTS	0x04000000
#define CLONE_NEWIPC	0x08000000
#define CLONE_NEWUSER	0x10000000
#define CLONE_NEWPID	0x20000000
#define CLONE_NEWNET	0x40000000

/* Child entry point: become a session leader and exec the command. */
static int fn(void *_argv)
{
	char **argv = (char **)_argv;

	setsid();
	setpgid(getpid(), getpid());
	execve(argv[0], argv, __environ);
	return 1;
}

int main(int argc, char *argv[])
{
	unsigned long flags = 0;
	int status;
	pid_t pid;
	char *p;

	/* Put the child into a fresh instance of every namespace. */
	flags |= CLONE_NEWNS;
	flags |= CLONE_NEWUTS;
	flags |= CLONE_NEWIPC;
	flags |= CLONE_NEWUSER;
	flags |= CLONE_NEWPID;
	flags |= CLONE_NEWNET;

	/* Child stack; clone() expects a pointer to its top. */
	p = malloc(4 * 4096);
	if (!p)
		return 1;

	argv++;
	pid = clone(fn, p + 4 * 4096, flags, (void *)argv);
	fprintf(stderr, "pid = %d\n", pid);
	if (pid == -1)
		return 1;
	waitpid(pid, &status, __WALL);
	return 0;
}
[Devel] Re: [RFC v16][PATCH 19/43] c/r: external checkpoint of a task other than ourself
On Thu, 28 May 2009, Alexey Dobriyan wrote:
> On Wed, May 27, 2009 at 01:32:45PM -0400, Oren Laadan wrote:
>> Now we can do external checkpoint, i.e. act on another task.
>
>> +static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
>> +{
>> +	if (t->state == TASK_DEAD) {
>> +		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
>> +		return -EAGAIN;
>> +	}
>> +
>> +	if (!ptrace_may_access(t, PTRACE_MODE_READ)) {
>> +		__ckpt_write_err(ctx, "access to task %d (%s) denied",
>> +				 task_pid_vnr(t), t->comm);
>> +		return -EPERM;
>> +	}
>> +
>> +	/* verify that the task is frozen (unless self) */
>> +	if (t != current && !frozen(t)) {
>> +		__ckpt_write_err(ctx, "task %d (%s) is not frozen",
>> +				 task_pid_vnr(t), t->comm);
>> +		return -EBUSY;
>> +	}
>> +
>> +	/* FIX: add support for ptraced tasks */
>> +	if (task_ptrace(t)) {
>> +		__ckpt_write_err(ctx, "task %d (%s) is ptraced",
>> +				 task_pid_vnr(t), t->comm);
>> +		return -EBUSY;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int get_container(struct ckpt_ctx *ctx, pid_t pid)
>> +{
>> +	struct task_struct *task = NULL;
>> +	struct nsproxy *nsproxy = NULL;
>> +	int ret;
>> +
>> +	ctx->root_pid = pid;
>> +
>> +	read_lock(&tasklist_lock);
>> +	task = find_task_by_vpid(pid);
>> +	if (task)
>> +		get_task_struct(task);
>> +	read_unlock(&tasklist_lock);
>> +
>> +	if (!task)
>> +		return -ESRCH;
>> +
>> +	ret = may_checkpoint_task(ctx, task);
>> +	if (ret) {
>> +		ckpt_write_err(ctx, NULL);
>> +		put_task_struct(task);
>> +		return ret;
>> +	}
>> +
>> +	rcu_read_lock();
>> +	nsproxy = task_nsproxy(task);
>> +	get_nsproxy(nsproxy);
>
> Will oops if init is multi-threaded and the thread group leader exited
> (nsproxy = NULL). I need to think what to do, too.

Good catch. Since all threads share the same nsproxy (except those that
exit.. duh) we can test for this case, and get the nsproxy from any of
the other threads, something like this (untested):

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index afc7300..b303876 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -522,9 +522,33 @@ static int get_container(struct ckpt_ctx *ctx, pid_t pid)
 
 	rcu_read_lock();
 	nsproxy = task_nsproxy(task);
-	get_nsproxy(nsproxy);
+	if (nsproxy)
+		get_nsproxy(nsproxy);
 	rcu_read_unlock();
 
+	/*
+	 * If we hit a zombie thread-group-leader, nsproxy will be NULL,
+	 * and we instead grab it from one of the other threads.
+	 */
+	if (!nsproxy) {
+		struct task_struct *p = next_thread(task);
+
+		BUG_ON(task->state != TASK_DEAD);
+		read_lock(&tasklist_lock);
+		while (p != task && !task_nsproxy(p))
+			p = next_thread(p);
+		nsproxy = task_nsproxy(p);
+		if (nsproxy)
+			get_nsproxy(nsproxy);
+		read_unlock(&tasklist_lock);
+	}
+
+	/* still not ... too bad ... */
+	if (!nsproxy) {
+		put_task_struct(task);
+		return -ESRCH;
+	}
+
 	ctx->root_task = task;
 	ctx->root_nsproxy = nsproxy;
 	ctx->root_init = is_container_init(task);
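The fallback in this reply, walking the thread group until some thread still has an nsproxy, can be modelled in userspace as a scan over a circular list. The following is a minimal sketch, not the checkpoint code: `struct task_model`, `struct ns_model` and `group_nsproxy()` are hypothetical stand-ins, and the locking (tasklist_lock, RCU) that the kernel version needs is deliberately left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct nsproxy: only a reference count. */
struct ns_model {
	int refcount;
};

/* Hypothetical stand-in for a task in a thread group. */
struct task_model {
	struct task_model *next_thread;	/* circular list of the thread group */
	struct ns_model *nsproxy;	/* NULL once the thread has exited */
};

/*
 * Starting at 'leader', return the first non-NULL nsproxy found in the
 * thread group, with a reference taken on it. Returns NULL when every
 * thread in the group has already exited.
 */
static struct ns_model *group_nsproxy(struct task_model *leader)
{
	struct task_model *p = leader;

	do {
		if (p->nsproxy) {
			p->nsproxy->refcount++;
			return p->nsproxy;
		}
		p = p->next_thread;
	} while (p != leader);

	return NULL;
}
```

A do/while loop is used so that the leader itself is checked first and the walk ends after exactly one lap, including for a single-threaded group whose list points back at itself.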
[Devel] Re: [PATCH 18/38] C/R: core stuff
On Thu, 28 May 2009 02:17:53 +0400 Alexey Dobriyan <adobri...@gmail.com> wrote:
>> 1) Whether or not allow c/r of sub-container (partial hierarchy)
>> 2) Creation of restarting process hierarchy in kernel or in userspace
>>
>> As for #1, you are the _only_ one who advocates restricting c/r to a
>> full container only. I guess you have your reasons, but I'm unsure
>> what they may be.
>
> The reason is that checkpointing a half-frozen, half-live container is
> essentially equivalent to checkpointing a live container, which adds
> much complexity to the code and fundamentally prevents the kernel from
> taking a coherent snapshot. In such situations the kernel will do its
> job badly.
>
> The manpage will be filled with strings like "if $FOO is shared then
> $BAR is not guaranteed".
>
> What to do if the user simply doesn't know if the container is
> bounded? Checkpoint and to hell with consequences?
>
> If two tasks share mm_struct you can't even detect that the pages you
> dump aren't filled with garbage meanwhile by the second task.
>
> If two tasks share mm_struct, the other task can issue AIO
> indefinitely, preventing even a coherent filesystem snapshot from
> being taken.
>
> That's why I raise this issue again to hear from people what they
> think, and these people shouldn't be containers and C/R people,
> because the latter already made up their minds.
>
> This is a super-important issue to get right from the beginning.

<pipes up>

yeah, checkpointing a partial hierarchy at this stage sounds like
overreach. Get full-container working usably first, think about
sub-containers in version 2.

<pipes down again>
[Devel] Re: [PATCH 18/38] C/R: core stuff
Alexey Dobriyan wrote:
> On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
>> Alexey Dobriyan wrote:
>>> On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
>>>> Quoting Alexey Dobriyan (adobri...@gmail.com):
>>>>> Introduction
>>>>>
>>>>> Checkpoint/restart (C/R from now) allows to dump a group of
>>>>> processes to disk for various reasons like saving process state
>>>>> in case of box failure or restoration of a group of processes on
>>>>> another or the same machine later.
>>>>>
>>>>> Unlike, let's say, hypervisor C/R style which only needs to freeze
>>>>> the guest kernel and dump more or less raw pages, proposed C/R
>>>>> doesn't require a hypervisor. For that, C/R code needs to know
>>>>> about all little and big intimate kernel details.
>>>>>
>>>>> The good thing is that not all details need to be serialized and
>>>>> saved like, say, readahead state. The bad thing is that still
>>>>> quite a few things need to be.
>>>>
>>>> Hi Alexey,
>>>>
>>>> the last time you posted this, I went through and tried to discern
>>>> the meaningful differences between yours and Oren's patchsets.
>>>> Then I sent some patches to Oren to make his set configurable to
>>>> act more like yours. And Oren took them! But now you resend this
>>>> patchset with no real changelog, no acknowledgment that Oren's set
>>>> even exists
>>>
>>> Is this a requirement? Everybody following the topic already knows
>>> about Oren's patchset.
>>
>> Some people do ack other people's work. See for example patches #1
>> and #24 in my recent post. You're welcome.
>>
>>>> - or is much farther along and pretty widely reviewed and tested
>>>> (which is only because he started earlier and, when we asked for
>>>> your counterpatches at an earlier stage, you would never reply) -
>>>> or, most importantly, what it is that you think your patchset does
>>>> that his does not and cannot.
>>>
>>> There are differences. And they're not small like you're trying to
>>> describe but pretty big compared to the scale of the problem.
>>
>> I've asked before, and I repeat now: can you enumerate these big
>> scary differences that make it such a big problem ?
>>
>> So far, we identified two main "design" issues -
>
> Why in ""?
>
> Yes, they are high-level design issues.

In quotes, because I argued further on that, although my patchset takes
a stand on both issues, it can be easily reverted _within_ that
patchset. Moreover, I argue that they can co-exist.

>> 1) Whether or not allow c/r of sub-container (partial hierarchy)
>> 2) Creation of restarting process hierarchy in kernel or in userspace
>>
>> As for #1, you are the _only_ one who advocates restricting c/r to a
>> full container only. I guess you have your reasons, but I'm unsure
>> what they may be.
>
> The reason is that checkpointing a half-frozen, half-live container is
> essentially equivalent to checkpointing a live container, which adds
> much complexity to the code and fundamentally prevents the kernel from
> taking a coherent snapshot. In such situations the kernel will do its
> job badly.

In such a situation the kernel will do a bad job if the user is asking
for a bad job. Just like checkpointing without snapshotting the file
system and expecting it to always work.

But if the user is a bit more careful (and even then, not that much),
she can enjoy the wonderful benefits of c/r without the wonderful
benefits of containers.

If useful, it's easy to pass a flag to checkpoint() that will ask to
enforce, say, shared memory leaks but not nsproxy or file leaks. In
fact, even shared memory leaks may be useful for some users (e.g. what
the guys from kerlabs pointed out).

> The manpage will be filled with strings like "if $FOO is shared then
> $BAR is not guaranteed".
>
> What to do if the user simply doesn't know if the container is
> bounded? Checkpoint and to hell with consequences?
>
> If two tasks share mm_struct you can't even detect that the pages you
> dump aren't filled with garbage meanwhile by the second task.
>
> If two tasks share mm_struct, the other task can issue AIO
> indefinitely, preventing even a coherent filesystem snapshot from
> being taken.
>
> That's why I raise this issue again to hear from people what they
> think, and these people shouldn't be containers and C/R people,
> because the latter already made up their minds.

Lol .. and disagreement persists among us :)

And indeed, I have heard and seen already a few opinions in favor of
permitting non-container checkpoint. From potential users (not c/r
people).

> This is a super-important issue to get right from the beginning.
>
>> On the other hand, there has been a handful of use-cases and
>> opinions in favor of allowing both capabilities to co-exist. Not to
>> mention that nearly no additional code is necessary, on the contrary.
>>
>> As for #2, you didn't even bother to reply to the discussion that I
>> had started about it. This decision is important to allow future
>> flexibility of the mechanism, and to address the needs of several
>> potential users, as seen in that discussion and others. Here, too,
>> you are the _only_ one that advocates that direction.
>
> Are you going to fork to-become-zombies, make them call restart(2) and
> zombify?

Yes.

>> And the funniest thing -- *both* decisions
[Devel] Re: [PATCH 7/7] [PATCH] Define clone_with_pids syscall
| +	if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set)))
| +		return ERR_PTR(-EFAULT);
| +
| +	size = pid_set.num_pids * sizeof(pid_t);
|
| ...either test pid_set.num_pids < 0 (and give -EINVAL),
| or...

Good point. I now check for num_pids < 0 and treat num_pids == 0 as
normal clone().

While addressing this I realized I had a lot of arch-independent code
in arch/x86/kernel/process_32.c. I have now moved this common code to
kernel/fork.c.

It's a non-trivial code move, so I need new review/acks from you and
Serge for at least patches 6 and 7.

Sukadev
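The rule agreed on in this reply (negative `num_pids` is an error, zero means "behave like plain clone()") is easy to state as a tiny classifier. This is an illustrative sketch only; `classify_num_pids()` and the `pid_set_action` enum are hypothetical names and do not appear in the patch, which performs the check inline before sizing the copy.

```c
#include <errno.h>

/* Hypothetical result codes for a validated num_pids value. */
enum pid_set_action {
	PID_SET_PLAIN_CLONE = 0,	/* num_pids == 0: no target pids requested */
	PID_SET_USE_TARGETS = 1		/* num_pids > 0: copy in and use the pid list */
};

/*
 * Validate num_pids before it is used to compute a buffer size.
 * A negative count multiplied by sizeof(pid_t) would yield a bogus
 * size, so it is rejected up front with -EINVAL.
 */
static int classify_num_pids(int num_pids)
{
	if (num_pids < 0)
		return -EINVAL;
	if (num_pids == 0)
		return PID_SET_PLAIN_CLONE;
	return PID_SET_USE_TARGETS;
}
```

Checking the sign before the multiplication matters because the size computed from `num_pids` is what later bounds the copy from user space.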
[Devel] [RFC][PATCH 1/7][v2] Factor out code to allocate pidmap page
From: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
Date: Mon, 4 May 2009 01:17:39 -0700
Subject: [RFC][PATCH 1/7][v2] Factor out code to allocate pidmap page

To implement support for clone_with_pids() system call we would
need to allocate pidmap page in more than one place. Move this
code to a new function alloc_pidmap_page().

Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <se...@us.ibm.com>
Reviewed-by: Oren Laadan <or...@cs.columbia.edu>
---
 kernel/pid.c |   46 ++++++++++++++++++++++++++++++----------------
 1 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index b2e5f78..9ff33cc 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,9 +122,34 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, rc, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -134,21 +159,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		rc = alloc_pidmap_page(map);
+		if (rc)
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
-- 
1.5.2.5
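The pattern factored out by this patch (allocate outside the lock, then install under the lock and free the allocation if another caller got there first) can be sketched in userspace. This is an analogy rather than the kernel code: a pthread mutex stands in for `pidmap_lock`, `calloc()` for `kzalloc()`, and the names `struct pidmap_model` and `pidmap_model_alloc_page()` are made up for the illustration.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for struct pidmap: just the lazily allocated page. */
struct pidmap_model {
	void *page;
};

static pthread_mutex_t pidmap_model_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Make sure map->page exists. Returns 0 on success, -ENOMEM if the page
 * could not be allocated and nobody else installed one in the meantime.
 */
static int pidmap_model_alloc_page(struct pidmap_model *map)
{
	void *page;

	/* Fast path: the page is already there. */
	if (map->page)
		return 0;

	/* Allocate before taking the lock, since allocation may be slow. */
	page = calloc(1, MODEL_PAGE_SIZE);

	pthread_mutex_lock(&pidmap_model_lock);
	if (map->page)
		free(page);		/* lost the race; free(NULL) is a no-op */
	else
		map->page = page;	/* may install NULL if calloc() failed */
	pthread_mutex_unlock(&pidmap_model_lock);

	return map->page ? 0 : -ENOMEM;
}
```

Note that an allocation failure is only reported when no page ended up installed at all, which matches the final `!map->page` check in the patch.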
[Devel] [RFC][PATCH 2/7][v2] Have alloc_pidmap() return actual error code
From 991fb474b055d36c4516cf7f79a247b7d79819ae Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
Date: Mon, 4 May 2009 01:17:40 -0700
Subject: [RFC][PATCH 2/7][v2] Have alloc_pidmap() return actual error code

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed. With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Signed-off-by: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <se...@us.ibm.com>
Reviewed-by: Oren Laadan <or...@cs.columbia.edu>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |    9 ++++++---
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index b9e2edd..f8411a8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1119,10 +1119,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index 9ff33cc..b2d6a19 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -158,6 +158,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	rc = -EAGAIN;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -188,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset))
+			if (unlikely(last == offset)) {
+				rc = -EAGAIN;
 				break;
+			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return rc;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -298,7 +301,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 
-- 
1.5.2.5
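This patch switches alloc_pid() from returning NULL on failure to returning an error encoded in the pointer, which copy_process() then decodes with IS_ERR() and PTR_ERR(). For readers unfamiliar with that convention, here is a minimal userspace model of the encoding. The `model_` names and `MODEL_MAX_ERRNO` are stand-ins invented for this sketch; the real helpers live in the kernel's include/linux/err.h.

```c
#include <errno.h>
#include <stdint.h>

/* Error codes occupy the top 4095 values of the address space. */
#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *model_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the negative errno value from an error pointer. */
static inline long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if 'ptr' lies in the range reserved for encoded errors. */
static inline int model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}
```

Because NULL is not in the error range, a caller that used to test `!pid` has to be converted to the IS_ERR() form, which is exactly what the kernel/fork.c hunk does.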
[Devel] [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap()
From a1fdec1036a952359d02a7c667d126bd2fff6804 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
Date: Mon, 4 May 2009 01:17:41 -0700
Subject: [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap()

With support for setting a specific pid number for a process,
alloc_pidmap() will need a 'target_pid' parameter.

Changelog[v2]:
	- (Serge Hallyn) Check for 'pid < 0' in set_pidmap(). (Code
	  actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
---
 kernel/pid.c |   28 ++++++++++++++++++++++++++--
 1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index b2d6a19..b44dd21 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,11 +147,35 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	if (pid <= 0 || pid >= pid_max)
+		return -EINVAL;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+
+	if (alloc_pidmap_page(map))
+		return -ENOMEM;
+
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+
+	atomic_dec(&map->nr_free);
+	return pid;
+}
+
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
 {
 	int i, rc, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
 
+	if (target_pid)
+		return set_pidmap(pid_ns, target_pid);
+
 	pid = last + 1;
 	if (pid >= pid_max)
 		pid = RESERVED_PIDS;
@@ -270,7 +294,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.5.2.5
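The behaviour of set_pidmap() (reject out-of-range pids, fail with -EBUSY if the bit is already set, otherwise claim it) can be tried out with a small single-threaded bitmap model. This is a sketch under simplifying assumptions: `struct pid_bitmap`, `model_set_pid()` and `MODEL_PID_MAX` are hypothetical, the bitmap is one flat array rather than per-page maps, and the test-and-set here is not atomic as it is in the kernel.

```c
#include <errno.h>
#include <limits.h>

#define MODEL_PID_MAX	1024
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical flat bitmap of allocated pids plus a free counter. */
struct pid_bitmap {
	unsigned long words[MODEL_PID_MAX / (sizeof(unsigned long) * CHAR_BIT)];
	int nr_free;
};

/*
 * Try to claim exactly 'pid'. Returns the pid on success, -EINVAL if it
 * is out of range, or -EBUSY if it is already in use.
 */
static int model_set_pid(struct pid_bitmap *map, int pid)
{
	unsigned long *word;
	unsigned long mask;

	if (pid <= 0 || pid >= MODEL_PID_MAX)
		return -EINVAL;

	word = &map->words[pid / BITS_PER_WORD];
	mask = 1UL << (pid % BITS_PER_WORD);

	/* Non-atomic test-and-set; the kernel uses test_and_set_bit(). */
	if (*word & mask)
		return -EBUSY;

	*word |= mask;
	map->nr_free--;
	return pid;
}
```

Returning the pid itself on success mirrors the patch, which lets alloc_pidmap() pass the result straight back to its caller.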
[Devel] [RFC][PATCH 4/7][v2] Add target_pids parameter to alloc_pid()
From: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
Date: Mon, 4 May 2009 01:17:42 -0700
Subject: [RFC][PATCH 4/7][v2] Add target_pids parameter to alloc_pid()

With support for setting specific pid numbers, alloc_pid() would need
to take a set of 'target-pids' which gives the user-specified pids.
Add this parameter to alloc_pid(), but leave it set to NULL for now.
The parameter will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <se...@us.ibm.com>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index f8411a8..d2d69d3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -954,6 +954,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1119,7 +1120,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index b44dd21..090b221 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -280,13 +280,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	int tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
@@ -294,7 +295,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp, 0);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = alloc_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.5.2.5
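The loop added to alloc_pid() walks from the task's own pid namespace level down to level 0 and, at each level, either requests the caller's target pid or passes 0 to let the kernel choose. A minimal sketch of just that selection logic follows; `model_pick_targets()` is a hypothetical helper, and the assumption that `target_pids[i]` corresponds to namespace level `i` is taken from the indexing in the patch.

```c
#include <stddef.h>
#include <sys/types.h>

/*
 * For a task living at namespace depth 'level', fill requested[0..level]
 * with the pid to request at each level. A value of 0 means "no specific
 * pid requested, allocate the next free one". 'target_pids' may be NULL,
 * in which case every level gets 0.
 */
static void model_pick_targets(int level, const pid_t *target_pids,
			       pid_t *requested)
{
	int i;

	/* Same iteration order as the patch: innermost namespace first. */
	for (i = level; i >= 0; i--) {
		pid_t tpid = 0;

		if (target_pids)
			tpid = target_pids[i];

		requested[i] = tpid;
	}
}
```

Passing NULL reproduces the pre-patch behaviour at every level, which is why copy_process() can keep `target_pids` set to NULL until the follow-on patch wires it up.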
[Devel] [RFC][PATCH 6/7][v2] Define do_fork_with_pids()
From: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
Date: Mon, 4 May 2009 01:17:44 -0700
Subject: [RFC][PATCH 6/7][v2] Define do_fork_with_pids()

do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v2]:
	- [v1] of this patch had some architecture-independent code in
	  arch/x86/kernel/process_32.c. To facilitate moving this code
	  to kernel/fork.c, in the next patch, [v2] of the patch passes
	  'struct target_pid_set __user *' to do_fork_with_pids()
	  instead of 'pid_t *'.

Signed-off-by: Sukadev Bhattiprolu <suka...@linux.vnet.ibm.com>
---
 include/linux/sched.h |    1 +
 include/linux/types.h |    5 +++++
 kernel/fork.c         |   16 ++++++++++++++--
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..8468e54 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1995,6 +1995,7 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, struct target_pid_set __user *pid_set);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/include/linux/types.h b/include/linux/types.h
index 5abe354..17ec186 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,11 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct target_pid_set {
+	int num_pids;
+	pid_t *target_pids;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 373411e..a16ef7b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1340,12 +1340,13 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      struct target_pid_set __user *pid_setp)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1448,6 +1449,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.5.2.5
[Devel] [RFC][PATCH 7/7][v2] Define clone_with_pids syscall
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com Date: Mon, 4 May 2009 01:17:45 -0700 Subject: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall clone_with_pids() is same as clone(), except that it takes a 'target_pid_set' paramter which lets caller choose a specific pid number for the child process in each of the child process's pid namespace. This system call would be needed to implement Checkpoint/Restart (i.e after a checkpoint, restart a process with its original pids). Call clone_with_pids as follows: pid_t pids[] = { 0, 77, 99 }; struct target_pid_set pid_set; pid_set.num_pids = sizeof(pids) / sizeof(int); pid_set.target_pids = pids; syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, pid_set); If a target-pid is 0, the kernel continues to assign a pid for the process in that namespace. In the above example, pids[0] is 0, meaning the kernel will assign next available pid to the process in init_pid_ns. But kernel will assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either 77 or 99 are taken, the system call fails with -EBUSY. If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces, the system call fails with -EINVAL. Its mostly an exploratory patch seeking feedback on the interface. NOTE: 1. clone_with_pids(), at least for now, needs CAP_SYS_ADMIN to prevent misuse of the interface. 2. Compared to clone(), clone_with_pids() needs to pass in two more pieces of information: - number of pids in the set - user buffer containing the list of pids. But since clone() already takes 5 parameters, use a 'struct target_pid_set'. TODO: - Gently tested. - May need additional sanity checks in do_fork_with_pids(). - Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in the namespace is either 1 or 0). Changelog[v2]: - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description. - (Oren Laadan) Add checks for 'num_pids 0' (return -EINVAL) and 'num_pids == 0' (fall back to normal clone()). 
- Move arch-independent code (sanity checks and copy-in of target-pids) into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
- Fixed some compile errors (had fixed these errors earlier in my git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/x86/include/asm/syscalls.h    |    1 +
 arch/x86/include/asm/unistd_32.h   |    1 +
 arch/x86/kernel/entry_32.S         |    1 +
 arch/x86/kernel/process_32.c       |   21 +
 arch/x86/kernel/syscall_table_32.S |    1 +
 kernel/fork.c                      |   81 +++-
 6 files changed, 105 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 7043408..1fdc149 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -31,6 +31,7 @@ asmlinkage int sys_get_thread_area(struct user_desc __user *);
 /* kernel/process_32.c */
 int sys_fork(struct pt_regs *);
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
 int sys_vfork(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..90f906f 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1	332
 #define __NR_preadv		333
 #define __NR_pwritev		334
+#define __NR_clone_with_pids	335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c929add..ee92b0d 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -707,6 +707,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 76f8f84..1efc3de 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -445,6 +445,27 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+int sys_clone_with_pids(struct pt_regs *regs)
+{
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr;
+	int __user *child_tidptr;
+	void __user *upid_setp;
+
+	clone_flags = regs->bx;
+	newsp = regs->cx;
+	parent_tidptr = (int __user *)regs->dx;
+	child_tidptr = (int __user *)regs->di;
+	upid_setp = (void __user *)regs->bp;
+
+	if (!newsp)
+		newsp = regs->sp;
+
+	return do_fork_with_pids(clone_flags, newsp, regs,
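
The argument rules given in the patch 7/7 description (a target pid of 0 means "let the kernel choose", a taken pid yields -EBUSY, and a 'num_pids' that is negative or deeper than the namespace nesting yields -EINVAL) can be modelled in plain userspace C. The sketch below is only an illustration of those rules, not the code from the patch: model_check_pid_set(), demo_pid_taken() and the 'nesting_level' argument are hypothetical names, and the field types of 'struct target_pid_set' are assumed from the usage example in the description.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Field names come from the patch description; the types are assumed. */
struct target_pid_set {
	int num_pids;
	pid_t *target_pids;
};

/*
 * Hypothetical stand-in for the kernel's pid allocator state:
 * pretend that only pid 77 in namespace level 1 is already in use.
 */
static int demo_pid_taken(int level, pid_t pid)
{
	return level == 1 && pid == 77;
}

/*
 * Hypothetical userspace model of the checks described in the patch text.
 * nesting_level is the number of pid namespaces the child will be visible in;
 * target_pids[i] is the pid requested at namespace level i.
 * Returns 0 if the request would be accepted, or a negative errno value.
 */
static int model_check_pid_set(const struct target_pid_set *set,
			       int nesting_level,
			       int (*pid_taken)(int level, pid_t pid))
{
	int i;

	/* No set, or an empty set: behave like a normal clone(). */
	if (set == NULL || set->num_pids == 0)
		return 0;

	/* Negative counts and counts deeper than the nesting level are invalid. */
	if (set->num_pids < 0 || set->num_pids > nesting_level)
		return -EINVAL;

	assert(set->target_pids != NULL);

	for (i = 0; i < set->num_pids; i++) {
		pid_t pid = set->target_pids[i];

		if (pid == 0)
			continue;	/* kernel picks the next free pid at this level */
		if (pid_taken(i, pid))
			return -EBUSY;	/* requested pid already in use */
	}
	return 0;
}
```

With the { 0, 77, 99 } set from the description and three namespace levels, the model rejects the request with -EBUSY because demo_pid_taken() reports pid 77 as in use at level 1. The same set checked against a nesting level of 2 is rejected with -EINVAL before any pid is examined.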
[Devel] [RFC][PATCH 5/7][v2] Add target_pids parameter to copy_process()
From 432bc68b622661cd4a28379e98e3ecc8f44d915d Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:43 -0700
Subject: [RFC][PATCH 5/7][v2] Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process(). The new parameter will be used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge Hallyn se...@us.ibm.com
Reviewed-by: Oren Laadan or...@cs.columbia.edu
---
 kernel/fork.c |    7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d2d69d3..373411e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
--
1.5.2.5