[Devel] Re: [PATCH] cgroups: handle failure of cgroup_populate_dir() at mount/remount

2009-05-27 Thread KAMEZAWA Hiroyuki
On Wed, 27 May 2009 11:24:22 +0800
Li Zefan l...@cn.fujitsu.com wrote:

 KAMEZAWA Hiroyuki wrote:
  On Wed, 27 May 2009 09:07:31 +0800
  Li Zefan l...@cn.fujitsu.com wrote:
  
  Paul Menage wrote:
  On Fri, May 22, 2009 at 1:25 AM, KAMEZAWA Hiroyuki
  kamezawa.hir...@jp.fujitsu.com wrote:
  Hm, shouldn't we allow noprefix to be effective only against cpuset ?
  I think it's just for backward-compatibility of cpuset.
  (I don't like the option at all.)
  Yes, exposing the noprefix option externally was one of the mistakes
  I made when developing cgroups.
 
  It seems to me really unlikely that anyone is using noprefix for
  And noprefix is not documented in cgroups.txt, so I guess not
  many people know this option. Even libcgroup doesn't handle it.
 
  anything other than implicitly when mounting the cpuset filesystem.
  So I'd be inclined to just forbid it if we're mounting more than just
  the cpuset subsystem. A bit of a nasty abstraction violation, but it
  makes more sense overall. The only problem is that someone *might* be
  using it - do we have any way to determine how, and how big do they
  have to be before we care?
 
  I think we can never know..
  
  How about this method ?
  
   - add noprefix to to-be-removed list.
   - add WARNING: noprefix option will be removed in 2.6.32 (or 2.6.31) now
   - remove noprefix in 2.6.31-rc or later
  
 
 I don't see how we can remove noprefix while preserving compatibility with
 old cpuset..
 
 As Paul Menage said, we can allow noprefix to be used only if we mount just
 cpuset subsystem:
 
I have no objection.

-Kame


 (pseudo code)
 
 diff --git a/kernel/cgroup.c b/kernel/cgroup.c
 --- a/kernel/cgroup.c
 +++ b/kernel/cgroup.c
 @@ -886,6 +886,11 @@ static int parse_cgroupfs_options(char *data,
 }
 }
 
 +
 +	if (test_bit(ROOT_NOPREFIX, &opts->flags) &&
 +	    (opts->subsys_bits & ~cpuset_subsys_id) != 0)
 +		return -EINVAL;
 +
 	/* We can't have an empty hierarchy */
 	if (!opts->subsys_bits)
 		return -EINVAL;
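
For illustration, the check proposed in the pseudo code can be modelled in user space roughly as below. This is only a sketch under assumptions: `struct mount_opts`, the enum values and `check_mount_opts()` are hypothetical stand-ins, not the kernel definitions. It also assumes that a subsystem ID is a bit index, so the cpuset mask is built as `1UL << id` rather than by negating the ID itself.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's subsystem IDs and root flags. */
enum { CPUSET_SUBSYS_ID = 0, CPU_SUBSYS_ID = 1, MEMORY_SUBSYS_ID = 2 };
enum { ROOT_NOPREFIX = 0 };	/* bit number within flags */

struct mount_opts {
	unsigned long subsys_bits;	/* one bit per requested subsystem */
	unsigned long flags;		/* one bit per mount flag */
};

/* Returns 0 if the option combination is acceptable, -EINVAL otherwise. */
static int check_mount_opts(const struct mount_opts *opts)
{
	unsigned long cpuset_only = 1UL << CPUSET_SUBSYS_ID;

	/* noprefix is accepted only when nothing but cpuset is requested. */
	if ((opts->flags & (1UL << ROOT_NOPREFIX)) &&
	    (opts->subsys_bits & ~cpuset_only) != 0)
		return -EINVAL;

	/* An empty hierarchy is rejected, as in the existing code. */
	if (!opts->subsys_bits)
		return -EINVAL;

	return 0;
}
```

With these definitions, mounting cpuset alone with noprefix passes, while cpuset plus any other subsystem with noprefix is rejected.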
 
 

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] io-controller: Add io group reference handling for request

2009-05-27 Thread Ryo Tsuruta
Hi Andrea and Vivek,

Ryo Tsuruta r...@valinux.co.jp wrote:
 Hi Andrea and Vivek,
 
 From: Andrea Righi righi.and...@gmail.com
 Subject: Re: [PATCH] io-controller: Add io group reference handling for 
 request
 Date: Mon, 18 May 2009 16:39:23 +0200
 
  On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
   On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
 On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
  On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
   Vivek Goyal wrote:
   ...
 }
@@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
 /*
  * Find the io group bio belongs to.
  * If create is set, io group is created if it is not already present.
+ * If curr is set, io group information is searched for current
+ * task and not with the help of bio.
+ *
+ * FIXME: Can we assume that if bio is NULL then lookup group for current
+ * task and not create extra function parameter ?
  *
- * Note: There is a narrow window of race where a group is being freed
- * by cgroup deletion path and some rq has slipped through in this group.
- * Fix it.
  */
-struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
-					int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
+					int create, int curr)
   
 Hi Vivek,
   
 IIUC we can get rid of curr, and just determine iog from bio. 
   If bio is not NULL,
 get iog from bio, otherwise get it from current task.
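
The suggestion of dropping `curr` could look roughly like the following sketch. The structures and `lookup_io_group()` are simplified, hypothetical stand-ins and not the io-controller code; the only point is that a NULL bio can itself act as the "use the current task" signal.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct io_group { int id; };
struct bio  { struct io_group *owner_group; };	/* group found via page tracking */
struct task { struct io_group *group; };	/* group the task belongs to */

/*
 * Pick the io group without a separate 'curr' flag: a non-NULL bio
 * selects the group tracked for the bio, a NULL bio selects the group
 * of the submitting task.
 */
static struct io_group *lookup_io_group(const struct bio *bio,
					const struct task *curr)
{
	if (bio != NULL)
		return bio->owner_group;
	return curr->group;
}
```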
  
  Consider also that get_cgroup_from_bio() is much slower than
  task_cgroup() and needs to lock/unlock_page_cgroup() in
  get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
  
 
 True.
 
  BTW another optimization could be to use the blkio-cgroup 
  functionality
  only for dirty pages and cut out some blkio_set_owner(). For all the
  other cases IO always occurs in the same context of the current 
  task,
  and you can use task_cgroup().
  
 
 Yes, may be in some cases we can avoid setting page owner. I will get
 to it once I have got functionality going well. In the mean time if
 you have a patch for it, it will be great.
 
  However, this is true only for page cache pages, for IO generated by
  anonymous pages (swap) you still need the page tracking 
  functionality
  both for reads and writes.
  
 
 Right now I am assuming that all the sync IO will belong to task
 submitting the bio hence use task_cgroup() for that. Only for async
 IO, I am trying to use page tracking functionality to determine the 
 owner.
 Look at elv_bio_sync(bio).
 
 You seem to be saying that there are cases where even for sync IO, we
 can't use submitting task's context and need to rely on page tracking
 functionality?
 
 I think that there are some kernel threads (e.g., dm-crypt, LVM and md
 devices) which actually submit IOs instead of tasks which originate the
 IOs. When IOs are submitted from such kernel threads, we can't use
 submitting task's context to determine to which cgroup the IO belongs.
 
 In case of getting page (read) from swap, will it not happen
 in the context of process who will take a page fault and initiate the
 swap read?

No, for example in read_swap_cache_async():

@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*

This is a read, but the current task is not always the owner of this
swap cache page, because it's a readahead operation.

   
   But will this readahead be not initiated in the context of the task taking
   the page fault?
   
   handle_pte_fault()
 do_swap_page()
 swapin_readahead()
 read_swap_cache_async()
   
   If yes, then swap reads issued will still be in the context of process and
   we should be fine?
  
  Right. I was trying to say that the current task may swap-in also pages
  belonging to a different task, so from a certain point of view it's not
  so fair to charge the current task for the whole activity. But ok, I
  think it's a minor issue.
  
   
Anyway, this is a minor corner case I think. And probably it 

[Devel] Re: [PATCH] io-controller: Add io group reference handling for request

2009-05-27 Thread Andrea Righi
On Wed, May 27, 2009 at 03:56:31PM +0900, Ryo Tsuruta wrote:
  I think that only putting the hook in try_to_unmap() doesn't work
  correctly, because IOs will be charged to reclaiming processes or
  kswapd. These IOs should be charged to processes which cause memory
  pressure.
 
 Consider the following case:
 
   (1) There are two processes Proc-A and Proc-B.
   (2) Proc-A maps a large file into many pages by mmap() and writes
   a lot of data to the file.
   (3) After (2), Proc-B tries to get a page, but there are no available
   pages because Proc-A has used them.
   (4) The kernel starts to reclaim pages and calls try_to_unmap() to unmap
   a page which is owned by Proc-A; blkio_cgroup_set_owner() then
   sets Proc-B's ID on the page because the task's context is Proc-B.
   (5) After (4), the kernel writes the page out to disk. This IO is
   charged to Proc-B.
 
 In the above case, I think that the IO should be charged to Proc-A,
 because the IO is caused by Proc-A's memory pressure.
 I think we should also consider the case without memory and swap
 isolation.
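
The scenario in steps (1) to (5) can be replayed with a toy model. Everything here is hypothetical (`struct page_model`, `dirty_page()`, `reclaim_page()`, `charged_task()`); it only shows how the place where the owner is recorded decides who gets charged.

```c
/* Hypothetical model: each page remembers the task charged for its IO. */
enum { PROC_A = 1, PROC_B = 2 };

struct page_model { int owner; };

/* Hook where the page is dirtied: runs in the writer's context. */
static void dirty_page(struct page_model *pg, int current_task)
{
	pg->owner = current_task;
}

/* Hook in the reclaim path (unmap time): runs in the reclaimer's context. */
static void reclaim_page(struct page_model *pg, int current_task,
			 int set_owner_in_reclaim)
{
	if (set_owner_in_reclaim)
		pg->owner = current_task;
}

/* Returns the task that ends up charged for the writeback. */
static int charged_task(int set_owner_in_reclaim)
{
	struct page_model pg = { 0 };

	dirty_page(&pg, PROC_A);				/* step (2) */
	reclaim_page(&pg, PROC_B, set_owner_in_reclaim);	/* step (4) */
	return pg.owner;					/* step (5) */
}
```

In this model, setting the owner in the reclaim path charges Proc-B, while leaving the owner as recorded at dirtying time charges Proc-A.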

mmmh.. even if they're strictly related I think we're mixing two
different problems in this way: memory pressure control and IO control.

It seems you're proposing something like the badness() for OOM
conditions to charge swap IO depending on how bad a cgroup is in terms
of memory consumption. I don't think this is the right way to proceed,
also because we already have the memory and swap control.

-Andrea


[Devel] Re: [PATCH 0/8] a start to credentials c/r

2009-05-27 Thread Serge E. Hallyn
Quoting Casey Schaufler (ca...@schaufler-ca.com):
 Serge E. Hallyn wrote:
  Following is the next version of the credentials c/r patchset,
  on top of the c/r patchset at
  git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
 
  It implements checkpoint and restart of user, user namespaces,
  groups, supplementary groups, and struct cred.
 
  There is a question as to what to do about LSM data at
  restart.  Right now I'm ignoring it, which means that
  prepare_creds() should ensure that the restart tasks get
  the context of the task calling sys_restart().  I
  suspect the right thing to do is to add two new LSM
  hooks, one which checks current's authorization to
  restart from the checkpoint file,
 
 How would that work? Based on information in the file?
 You have to assume that some number of checkpoint files
 have been hand written by Elbonian ne'er do wells.

Not based on information in the file, but based on the
credentials of the task which created the file, and
whether an unprivileged task could have hand-edited the
file before feeding it to sys_restart().

So some example decisions in terms of selinux contexts might be,
1. a task of user_u may restart a file of type user_u
if the checkpointed context is user_u
2. a task of user_u may NOT restart a file of type user_u
if the checkpointed context is root_u
3. a task of root_u may restart a file of type root_u
if the checkpointed context is user_u

Uh, so yes, based on info in the file as well  :)  Except
of course the LSM would just be fed the checkpointed context
and the checkpoint file context (and can deduce current's context).
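
As a purely illustrative sketch, one toy rule that reproduces the three example decisions above is shown below. The ordering of contexts and the function `may_restart()` are assumptions made for this example; a real hook would consult the loaded policy rather than compare two enum values.

```c
#include <stdbool.h>

/* Hypothetical ordering of contexts, lowest privilege first. */
enum ctx { USER_U = 0, ROOT_U = 1 };

/*
 * task:         context of the task calling sys_restart()
 * file:         context of the checkpoint file
 * checkpointed: context recorded inside the checkpoint image
 */
static bool may_restart(enum ctx task, enum ctx file, enum ctx checkpointed)
{
	/* The caller must be allowed to use the checkpoint file at all. */
	if (file > task)
		return false;

	/* The image must not claim more than the file's own context. */
	return checkpointed <= file;
}
```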

   and one which determines
  the task->cred->security field based upon any of:
  1. current_security() of the task calling sys_restart()
  2. the task->cred->security checkpointed in the ckpt file
  3. the ->security of the checkpoint file

 
 For Smack the correct behavior would be:
 
 1. for sys_restart() callers without CAP_MAC_ADMIN
 2. for sys_restart() callers with CAP_MAC_ADMIN
 3. never
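
Read against the three candidate sources listed earlier, the Smack behaviour described here amounts to the small selection function sketched below. The enum and `smack_restart_label_source()` are hypothetical names for illustration, not part of Smack.

```c
#include <stdbool.h>

enum label_source {
	LABEL_FROM_CALLER,	/* 1. current_security() of the sys_restart() caller */
	LABEL_FROM_IMAGE,	/* 2. label checkpointed in the image */
	LABEL_FROM_FILE		/* 3. label of the checkpoint file: never chosen */
};

/* Choose where the restarted task's label comes from. */
static enum label_source smack_restart_label_source(bool has_cap_mac_admin)
{
	return has_cap_mac_admin ? LABEL_FROM_IMAGE : LABEL_FROM_CALLER;
}
```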

That makes sense, and is basically analogous (if I'm thinking
right) to how I'm doing capabilities.

So the first (authorization hook) for smack would just always
return TRUE?

I can hook that up right now...

 sys_restart() callers running with CAP_MAC_ADMIN would have to be
 very very careful about the files they restart. But that's nothing
 new in the MAC world.

Yup.

Mind you eventually I expect a setup where some privileged program
is asked (by privileged or unprivileged tasks) to create a checkpoint
and ask the TPM to sign it.  No unprivileged program can sign an
image directly, so then a restart of a task with privilege can be
restricted to anything with a valid signature.  In that case, it
may be safe to have the checkpointed task's credentials completely
restored, including LSM labels.

But that's a ways off.

thanks,
-serge


[Devel] [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page

2009-05-27 Thread Sukadev Bhattiprolu

From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:39 -0700
Subject: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 kernel/pid.c |   43 ---
 1 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index b2e5f78..c0aaebe 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,31 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -1;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -134,21 +159,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		if (alloc_pidmap_page(map))
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
-- 
1.5.2.5
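
The helper factored out by this patch follows an "allocate outside the lock, install under the lock, free if someone else won" pattern. A user-space sketch of that pattern, assuming C11 atomics in place of the kernel spinlock, might look like this; `struct pidmap_model`, `alloc_map_page()` and the lock helpers are hypothetical names.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define MAP_PAGE_SIZE 4096

/* Stand-in for pidmap_lock: a minimal spinlock built on atomic_flag. */
static atomic_flag map_lock = ATOMIC_FLAG_INIT;

struct pidmap_model {
	void *_Atomic page;	/* atomic so the unlocked fast-path read is well defined */
};

static void map_lock_acquire(void)
{
	while (atomic_flag_test_and_set(&map_lock))
		;	/* spin */
}

static void map_lock_release(void)
{
	atomic_flag_clear(&map_lock);
}

/* Returns 0 if map->page is usable on return, -1 if allocation failed. */
static int alloc_map_page(struct pidmap_model *map)
{
	void *page;

	if (map->page)
		return 0;

	/* Allocate outside the lock, since the allocation may be slow. */
	page = calloc(1, MAP_PAGE_SIZE);

	map_lock_acquire();
	if (map->page)
		free(page);	/* lost the race; free(NULL) is a no-op */
	else
		map->page = page;
	map_lock_release();

	return map->page ? 0 : -1;
}
```

A second call on the same map takes the fast path and leaves the installed page untouched.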



[Devel] [PATCH 3/7] [PATCH] Add target_pid parameter to alloc_pidmap()

2009-05-27 Thread Sukadev Bhattiprolu

From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:41 -0700
Subject: [PATCH 3/7] [PATCH] Add target_pid parameter to alloc_pidmap()

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 kernel/pid.c |   28 ++--
 1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index fd72ad9..93406c6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,12 +147,36 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	if (pid >= pid_max)
+		return -EINVAL;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+
+	if (alloc_pidmap_page(map))
+		return -ENOMEM;
+
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+
+	atomic_dec(&map->nr_free);
+	return pid;
+}
+
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
 	int rc = -EAGAIN;
 
+	if (target_pid)
+		return set_pidmap(pid_ns, target_pid);
+
 	pid = last + 1;
 	if (pid >= pid_max)
 		pid = RESERVED_PIDS;
@@ -269,7 +293,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0);
 		if (nr < 0)
 			goto out_free;
 
 
-- 
1.5.2.5
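
The core of set_pidmap() is "claim one specific bit, or fail if it is already set". A single-threaded user-space sketch of that idea follows. `struct pid_bitmap`, `claim_pid()` and the small `PID_MAX` are assumptions for the example; it uses plain bit operations instead of the atomic test_and_set_bit(), and it additionally rejects non-positive pids, which the patch does not do.

```c
#include <errno.h>
#include <limits.h>

#define PID_MAX		1024
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

struct pid_bitmap {
	unsigned long bits[PID_MAX / BITS_PER_WORD];
	int nr_free;
};

/* Claim exactly 'pid'. Returns pid on success or a negative errno. */
static int claim_pid(struct pid_bitmap *map, int pid)
{
	unsigned long mask;

	if (pid <= 0 || pid >= PID_MAX)
		return -EINVAL;

	mask = 1UL << (pid % BITS_PER_WORD);
	if (map->bits[pid / BITS_PER_WORD] & mask)
		return -EBUSY;		/* someone already owns this pid */

	map->bits[pid / BITS_PER_WORD] |= mask;
	map->nr_free--;
	return pid;
}
```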



[Devel] [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()

2009-05-27 Thread Sukadev Bhattiprolu

From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:43 -0700
Subject: [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()

The new parameter will be used in a follow-on patch when clone_with_pids()
is implemented.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 kernel/fork.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d2d69d3..373411e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
-- 
1.5.2.5



[Devel] [PATCH 4/7] [PATCH] Add target_pids parameter to alloc_pid()

2009-05-27 Thread Sukadev Bhattiprolu

From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:42 -0700
Subject: [PATCH 4/7] [PATCH] Add target_pids parameter to alloc_pid()

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 include/linux/pid.h |2 +-
 kernel/fork.c   |3 ++-
 kernel/pid.c|9 +++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index f8411a8..d2d69d3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -954,6 +954,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1119,7 +1120,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 93406c6..4b2373a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -279,13 +279,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	int tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
@@ -293,7 +294,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp, 0);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = alloc_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 
 
-- 
1.5.2.5



[Devel] [PATCH 7/7] [PATCH] Define clone_with_pids syscall

2009-05-27 Thread Sukadev Bhattiprolu

From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:45 -0700
Subject: [PATCH 7/7] [PATCH] Define clone_with_pids syscall

clone_with_pids() is the same as clone(), except that it takes a 'target_pid_set'
parameter which lets the caller choose a specific pid number for the child process
in each of the child process's pid namespaces. This system call would be needed
to implement Checkpoint/Restart (i.e. after a checkpoint, restart a process with
its original pids).

Call clone_with_pids as follows:

pid_t pids[] = { 0, 77, 99 };
struct target_pid_set pid_set;

pid_set.num_pids = sizeof(pids) / sizeof(int);
pid_set.target_pids = pids;

syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel continues to assign a pid for the process in
that namespace. In the above example, pids[0] is 0, meaning the kernel will
assign the next available pid to the process in init_pid_ns. But the kernel will
assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If
either 77 or 99 is taken, the system call fails with -EBUSY.

If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
the system call fails with -EINVAL.
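
The semantics described above can be sketched as a small user-space model. This is not the kernel implementation: `struct ns_model`, `assign_pids()` and the tiny `PID_MAX` are hypothetical, and the model validates all levels before committing, whereas the kernel allocates level by level and undoes the work on failure.

```c
#include <errno.h>
#include <stdbool.h>

#define PID_MAX 128

/* Hypothetical model of one pid namespace level. */
struct ns_model {
	bool used[PID_MAX];
	int last_pid;
};

static int next_free_pid(const struct ns_model *ns)
{
	for (int pid = ns->last_pid + 1; pid < PID_MAX; pid++)
		if (!ns->used[pid])
			return pid;
	return -EAGAIN;
}

/*
 * Choose a pid at each of 'levels' nested namespaces. target[i] == 0
 * (or i >= num_pids) means "next free pid at level i". On success the
 * chosen pids are stored in out[] and 0 is returned.
 */
static int assign_pids(struct ns_model ns[], int levels,
		       const int target[], int num_pids, int out[])
{
	if (num_pids > levels)
		return -EINVAL;

	/* First pass: pick a pid per level without changing any state. */
	for (int i = 0; i < levels; i++) {
		int want = i < num_pids ? target[i] : 0;

		if (want < 0 || want >= PID_MAX)
			return -EINVAL;
		if (want && ns[i].used[want])
			return -EBUSY;
		if (!want) {
			want = next_free_pid(&ns[i]);
			if (want < 0)
				return want;
		}
		out[i] = want;
	}

	/* Second pass: commit the choices. */
	for (int i = 0; i < levels; i++) {
		bool automatic = i >= num_pids || target[i] == 0;

		ns[i].used[out[i]] = true;
		if (automatic)
			ns[i].last_pid = out[i];
	}
	return 0;
}
```

With three empty levels and targets { 0, 77, 99 }, the model assigns the first free pid at level 0 and the requested pids at levels 1 and 2; repeating the same request fails with -EBUSY because 77 is now taken.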

It's mostly an exploratory patch seeking feedback on the interface.

NOTE:
Compared to clone(), clone_with_pids() needs to pass in two more
pieces of information:

- number of pids in the set
- user buffer containing the list of pids.

But since clone() already takes 5 parameters, use a 'struct
target_pid_set'.

TODO:
- Gently tested.
- May need additional sanity checks in check_target_pids()
- Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in
  the namespace is either 1 or 0).

Changelog[v1]:
- Fixed some compile errors (had fixed these errors earlier in my
  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/x86/include/asm/syscalls.h|1 +
 arch/x86/include/asm/unistd_32.h   |1 +
 arch/x86/kernel/entry_32.S |1 +
 arch/x86/kernel/process_32.c   |   94 
 arch/x86/kernel/syscall_table_32.S |1 +
 include/linux/types.h  |5 ++
 6 files changed, 103 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 7043408..1fdc149 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -31,6 +31,7 @@ asmlinkage int sys_get_thread_area(struct user_desc __user *);
 /* kernel/process_32.c */
 int sys_fork(struct pt_regs *);
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
 int sys_vfork(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..90f906f 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1 332
 #define __NR_preadv333
 #define __NR_pwritev   334
+#define __NR_clone_with_pids   335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c929add..ee92b0d 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -707,6 +707,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 76f8f84..65b27a8 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -445,6 +445,100 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+static int check_target_pids(unsigned long clone_flags,
+			struct target_pid_set *pid_setp)
+{
+	/*
+	 * CLONE_NEWPID implies pid == 1
+	 *
+	 * TODO: Maybe this should be more fine-grained (i.e would we want
+	 *	 to have a container-init have a specific pid in ancestor
+	 *	 namespaces ?)
+	 */
+	if (clone_flags & CLONE_NEWPID)
+		return -EINVAL;
+
+	/* number of pids must match current nesting level of pid ns */
+	if (pid_setp->num_pids > task_pid(current)->level + 1)
+		return -EINVAL;
+
+	/* TODO: More sanity checks ?  */
+
+	return 0;
+}
+
+static pid_t *copy_target_pids(unsigned long clone_flags, void __user *upid_setp)
+{
+	int rc;
+	int size;
+	pid_t __user *utarget_pids;
+	pid_t *target_pids;
+	struct target_pid_set pid_set;
+
+	if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set)))
+		return ERR_PTR(-EFAULT);
+
+	size = pid_set.num_pids * 

[Devel] [PATCH 6/7] [PATCH] Define do_fork_with_pids()

2009-05-27 Thread Sukadev Bhattiprolu

From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:44 -0700
Subject: [PATCH 6/7] [PATCH] Define do_fork_with_pids()

do_fork_with_pids() is the same as do_fork(), except that it takes an
additional parameter, target_pids. This parameter, currently unused,
specifies the target pids of the process in each of its pid namespaces.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 include/linux/sched.h |1 +
 kernel/fork.c |   17 ++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..2173df1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1995,6 +1995,7 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t *target_pids);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index 373411e..912d008 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1340,17 +1340,17 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
  unsigned long stack_start,
  struct pt_regs *regs,
  unsigned long stack_size,
  int __user *parent_tidptr,
- int __user *child_tidptr)
+ int __user *child_tidptr,
+ pid_t *target_pids)
 {
struct task_struct *p;
int trace = 0;
long nr;
-   pid_t *target_pids = NULL;
 
/*
 * Do some preliminary argument and permissions checking before we
@@ -1448,6 +1448,17 @@ long do_fork(unsigned long clone_flags,
return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+ unsigned long stack_start,
+ struct pt_regs *regs,
+ unsigned long stack_size,
+ int __user *parent_tidptr,
+ int __user *child_tidptr)
+{
+   return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+   parent_tidptr, child_tidptr, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.5.2.5



[Devel] Re: [PATCH 0/8] a start to credentials c/r

2009-05-27 Thread Casey Schaufler
Serge E. Hallyn wrote:
 Quoting Casey Schaufler (ca...@schaufler-ca.com):
   
 Serge E. Hallyn wrote:
 
 Following is the next version of the credentials c/r patchset,
 on top of the c/r patchset at
 git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

 It implements checkpoint and restart of user, user namespaces,
 groups, supplementary groups, and struct cred.

 There is a question as to what to do about LSM data at
 restart.  Right now I'm ignoring it, which means that
 prepare_creds() should ensure that the restart tasks get
 the context of the task calling sys_restart().  I
 suspect the right thing to do is to add two new LSM
 hooks, one which checks current's authorization to
 restart from the checkpoint file,
   
 How would that work? Based on information in the file?
 You have to assume that some number of checkpoint files
 have been hand written by Elbonian ne'er do wells.
 

 Not based on information in the file, but based on the
 credentials of the task which created the file, and
 whether an unprivileged task could have hand-edited the
 file before feeding it to sys_restart().

 So some example decisions in terms of selinux contexts might be,
   1. a task of user_u may restart a file of type user_u
   if the checkpointed context is user_u
   2. a task of user_u may NOT restart a file of type user_u
   if the checkpointed context is root_u
   3. a task of root_u may restart a file of type root_u
   if the checkpointed context is user_u

 Uh, so yes, based on info in the file as well  :)  Except
 of course the LSM would just be fed the checkpointed context
 and the checkpoint file context (and can deduce current's context).
   

And SELinux can do whatever calculations it likes based on the
three contexts and the loaded policy.  Are you at all concerned
about the possibility that the policy may have changed? I can
envision scenarios in which it would be impossible for a process
to gain a particular context under current policy, but that a
checkpointed process may have stored away.

   
  and one which determines
 the task->cred->security field based upon any of:
 1. current_security() of the task calling sys_restart()
 2. the task->cred->security checkpointed in the ckpt file
 3. the ->security of the checkpoint file
   
   
 For Smack the correct behavior would be:

 1. for sys_restart() callers without CAP_MAC_ADMIN
 2. for sys_restart() callers with CAP_MAC_ADMIN
 3. never
 

 That makes sense, and is basically analogous (if I'm thinking
 right) to how I'm doing capabilities.

 So the first (authorization hook) for smack would just always
 return TRUE?
   

I suggest that it needs to check for a valid Smack label. Even though
they're just text strings they do have limitations, including size
(> 0 and < 24) and character set. A call to smk_import() is the right
way to do it, as it also makes sure the label is in the internal list.
If smk_import() returns NULL something's amiss.


 I can hook that up right now...
   

I bet you could do it even with the call to smk_import. (smiley here)

   
 sys_restart() callers running with CAP_MAC_ADMIN would have to be
 very very careful about the files they restart. But that's nothing
 new in the MAC world.
 

 Yup.

 Mind you eventually I expect a setup where some privileged program
 is asked (by privileged or unprivileged tasks) to create a checkpoint
 and ask the TPM to sign it.  No unprivileged program can sign an
 image directly, so then a restart of a task with privilege can be
 restricted to anything with a valid signature.  In that case, it
 may be safe to have the checkpointed task's credentials completely
 restored, including LSM labels.
   

All of the current LSMs share the property that the access control
rules (SELinux policy, Smack access rules, TOMOYO policy) may change
between the time of checkpoint and the time of restart. If I had a
silver bullet answer to the concerns that raises I'd pass it along,
but as I don't I'll stick to the answer I have for Smack (The rules
of the moment are those that matter, and the architecture of Smack
supports that) and leave the other LSMs to their own devices.


 But that's a ways off.
   

It does look like a bit of work.

Thank you.


 thanks,
 -serge
 --
 To unsubscribe from this list: send the line unsubscribe 
 linux-security-module in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


   
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 18/38] C/R: core stuff

2009-05-27 Thread Alexey Dobriyan
On Fri, May 22, 2009 at 08:55:12AM +0400, Alexey Dobriyan wrote:
 +static int task_struct_restorer(void *_tsk_ctx)
 +{
 + struct task_struct_restore_context *tsk_ctx = _tsk_ctx;
 + struct kstate_image_task_struct *i = tsk_ctx->i;
 + struct kstate_context *ctx = tsk_ctx->ctx;
 + /* In the name of symmetry. */
 + struct task_struct *tsk = current, *real_parent;
 + int rv;
 +
 + pr_debug("%s: ENTER tsk %p/%s\n", __func__, tsk, tsk->comm);
 +
 + write_lock_irq(&tasklist_lock);
 + real_parent = ctx->init_tsk->nsproxy->pid_ns->child_reaper;
 + tsk->real_parent = tsk->parent = real_parent;
 + list_move_tail(&tsk->sibling, &tsk->real_parent->sibling);
 ^^^
 + write_unlock_irq(&tasklist_lock);

Eek, what a stupid bug here


commit 2c4b5f5d606a1892b702d95a0e4d29f207685381
Author: Alexey Dobriyan adobri...@gmail.com
Date:   Wed May 27 20:21:59 2009 +0400

C/R: fix stupid bug in reparenting

Child process should be added to ->children list of course

Signed-off-by: Alexey Dobriyan adobri...@gmail.com

diff --git a/kernel/kstate/kstate-task.c b/kernel/kstate/kstate-task.c
index 9ed5a19..6df7d25 100644
--- a/kernel/kstate/kstate-task.c
+++ b/kernel/kstate/kstate-task.c
@@ -490,7 +490,7 @@ static int task_struct_restorer(void *_tsk_ctx)
real_parent = tmp->o_obj;
}
tsk->real_parent = tsk->parent = real_parent;
-   list_move_tail(&tsk->sibling, &tsk->real_parent->sibling);
+   list_move_tail(&tsk->sibling, &tsk->real_parent->children);
write_unlock_irq(&tasklist_lock);
 
rv = restore_mm(ctx, i->ref_mm);
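
Why the target list matters can be shown with a small userspace sketch. It assumes a cut-down stand-in for the kernel's circular list and a hypothetical two-field struct task; none of this is the kstate code itself. The child's sibling node has to be linked onto the parent's children head, otherwise walking the parent's children never finds the child.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_del_entry(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Unlink @e from wherever it is and append it just before @head. */
static void list_move_tail(struct list_head *e, struct list_head *head)
{
	list_del_entry(e);
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

/* Hypothetical, heavily reduced task structure. */
struct task {
	struct list_head children;	/* head: list of my children */
	struct list_head sibling;	/* node: my link in the parent's list */
};

static void task_init(struct task *t)
{
	list_init(&t->children);
	list_init(&t->sibling);
}

/* Count the nodes hanging off @head. */
static int list_count(const struct list_head *head)
{
	const struct list_head *p;
	int n = 0;

	for (p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Reparenting: the child's node goes onto the parent's children head. */
static void reparent(struct task *child, struct task *parent)
{
	list_move_tail(&child->sibling, &parent->children);
}
```

After reparent(&child, &parent), list_count(&parent.children) reports one entry and the parent's own sibling node is left untouched.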


[Devel] [PATCH 1/1] cr: safely restore a task's securebits

2009-05-27 Thread Serge E. Hallyn
As I started doing c/r of LSM credentials, I realized I wasn't
handling securebits yet with the current patchset.  This is on
top of the set I sent out yesterday, and if I'm sending out another
full patchset then I'll integrate this back into the previous
patches.

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 include/linux/capability.h |2 +
 include/linux/checkpoint_hdr.h |1 +
 kernel/cred.c  |4 +++
 security/commoncap.c   |   52 +++-
 4 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 572b5a0..b3853ca 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -540,6 +540,8 @@ extern void checkpoint_save_cap(__u64 *dest, kernel_cap_t 
src);
 struct cred;
 extern int checkpoint_restore_cap(__u64 e, __u64 i, __u64 p, __u64 x,
struct cred *cred);
+extern void checkpoint_save_securebits(unsigned *, unsigned);
+extern int checkpoint_restore_securebits(unsigned, struct cred *);
 
 /**
  * has_capability - Determine if a task has a superior capability available
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0bad447..7f65964 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -192,6 +192,7 @@ struct ckpt_hdr_cred {
__u32 version; /* especially since capability sets might grow */
__u32 uid, suid, euid, fsuid;
__u32 gid, sgid, egid, fsgid;
+   __u32 securebits;
__u64 cap_i, cap_p, cap_e;
__u64 cap_x;  /* bounding set ('X') */
__s32 user_ref;
diff --git a/kernel/cred.c b/kernel/cred.c
index c05192e..fe2941d 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -739,6 +739,7 @@ int checkpoint_write_cred(struct ckpt_ctx *ctx, const 
struct cred *cred)
checkpoint_save_cap(&h->cap_p, cred->cap_permitted);
checkpoint_save_cap(&h->cap_e, cred->cap_effective);
checkpoint_save_cap(&h->cap_x, cred->cap_bset);
+   checkpoint_save_securebits(&h->securebits, cred->securebits);
 
h->user_ref = user_ref;
h->groupinfo_ref = groupinfo_ref;
@@ -811,6 +812,9 @@ struct cred *restore_read_cred(struct ckpt_ctx *ctx)
cred);
if (ret)
goto err_putcred;
+   ret = checkpoint_restore_securebits(h->securebits, cred);
+   if (ret)
+   goto err_putcred;
 
ckpt_hdr_put(ctx, h);
return cred;
diff --git a/security/commoncap.c b/security/commoncap.c
index beac025..31ecd3d 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -804,6 +804,29 @@ int cap_task_setnice (struct task_struct *p, int nice)
 }
 #endif
 
+int cap_set_securebits(struct cred *new, unsigned securebits)
+{
+   if ((((new->securebits & SECURE_ALL_LOCKS) >> 1)
+ & (new->securebits ^ securebits))  /*[1]*/
+   || ((new->securebits & SECURE_ALL_LOCKS & ~securebits)) /*[2]*/
+   || (securebits & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS)) /*[3]*/
+   || (cap_capable(current, current_cred(), CAP_SETPCAP,
+   SECURITY_CAP_AUDIT) != 0)   /*[4]*/
+   /*
+* [1] no changing of bits that are locked
+* [2] no unlocking of locks
+* [3] no setting of unsupported bits
+* [4] doing anything requires privilege (go read about
+* the sendmail capabilities bug)
+*/
+   )
+   /* cannot change a locked bit */
+   return -EPERM;
+   new->securebits = securebits;
+   return 0;
+}
+
+
 /**
  * cap_task_prctl - Implement process control functions for this security 
module
  * @option: The process control function requested
@@ -861,24 +884,9 @@ int cap_task_prctl(int option, unsigned long arg2, 
unsigned long arg3,
 * capability-based-privilege environment.
 */
case PR_SET_SECUREBITS:
-   error = -EPERM;
-   if ((((new->securebits & SECURE_ALL_LOCKS) >> 1)
- & (new->securebits ^ arg2))/*[1]*/
-   || ((new->securebits & SECURE_ALL_LOCKS & ~arg2))   /*[2]*/
-   || (arg2 & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS))   /*[3]*/
-   || (cap_capable(current, current_cred(), CAP_SETPCAP,
-   SECURITY_CAP_AUDIT) != 0)   /*[4]*/
-   /*
-* [1] no changing of bits that are locked
-* [2] no unlocking of locks
-* [3] no setting of unsupported bits
-* [4] doing anything requires privilege (go read about
-* the sendmail capabilities bug)
-*/
-   )
-   /* cannot change a locked 

[Devel] Re: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page

2009-05-27 Thread Dave Hansen
On Wed, 2009-05-27 at 08:42 -0700, Sukadev Bhattiprolu wrote:
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:39 -0700
 Subject: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---
  kernel/pid.c |   43 ---
  1 files changed, 28 insertions(+), 15 deletions(-)
 
 diff --git a/kernel/pid.c b/kernel/pid.c
 index b2e5f78..c0aaebe 100644
 --- a/kernel/pid.c
 +++ b/kernel/pid.c
 @@ -122,6 +122,31 @@ static void free_pidmap(struct upid *upid)
   atomic_inc(&map->nr_free);
  }
 
 +static int alloc_pidmap_page(struct pidmap *map)
 +{
 + void *page;
 +
 + if (likely(map->page))
 + return 0;
 +
 + page = kzalloc(PAGE_SIZE, GFP_KERNEL);
 +
 + /*
 +  * Free the page if someone raced with us installing it:
 +  */
 + spin_lock_irq(&pidmap_lock);
 + if (map->page)
 + kfree(page);
 + else
 + map->page = page;
 + spin_unlock_irq(&pidmap_lock);
 +
 + if (unlikely(!map->page))
 + return -1;
 +

-ENOMEM, please

Otherwise looks fine.  Please at least add some minimal patch
description about what you're doing and why, though.

-- Dave
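
A minimal userspace sketch of the requested return convention, assuming a hypothetical struct page_slot in place of struct pidmap and calloc() in place of kzalloc(). Locking is left out on purpose, so this shows only the error code, not the race handling from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pidmap: one lazily allocated page. */
struct page_slot {
	void *page;
};

/*
 * Allocate the backing page on first use. Returns 0 on success and a
 * negative errno (-ENOMEM) on allocation failure, never a bare -1, so
 * callers can pass the value straight up the call chain.
 */
static int page_slot_alloc(struct page_slot *slot, size_t size)
{
	if (slot->page)
		return 0;

	slot->page = calloc(1, size);
	if (!slot->page)
		return -ENOMEM;

	return 0;
}
```

A second call on an already populated slot returns 0 without allocating again.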



[Devel] Re: [PATCH 2/7] [PATCH] Have alloc_pidmap() return actual error code

2009-05-27 Thread Dave Hansen
On Wed, 2009-05-27 at 08:42 -0700, Sukadev Bhattiprolu wrote:
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:40 -0700
 Subject: [PATCH 2/7] [PATCH] Have alloc_pidmap() return actual error code
 
 alloc_pidmap() can fail either because all pid numbers are in use or
 we can't allocate memory. With support for setting a specific pid
 number, alloc_pidmap() would also fail if either the given pid
 number is invalid or in use.
 
 Rather than have caller assume -ENOMEM, have alloc_pidmap() return
 the actual error.
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---
  kernel/fork.c |5 +++--
  kernel/pid.c  |9 ++---
  2 files changed, 9 insertions(+), 5 deletions(-)
 
 diff --git a/kernel/fork.c b/kernel/fork.c
 index b9e2edd..f8411a8 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -1119,10 +1119,11 @@ static struct task_struct *copy_process(unsigned long 
 clone_flags,
   goto bad_fork_cleanup_io;
 
   if (pid != &init_struct_pid) {
 - retval = -ENOMEM;
   pid = alloc_pid(p->nsproxy->pid_ns);
 - if (!pid)
 + if (IS_ERR(pid)) {
 + retval = PTR_ERR(pid);
   goto bad_fork_cleanup_io;
 + }
 
   if (clone_flags & CLONE_NEWPID) {
   retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
 diff --git a/kernel/pid.c b/kernel/pid.c
 index c0aaebe..fd72ad9 100644
 --- a/kernel/pid.c
 +++ b/kernel/pid.c
 @@ -151,6 +151,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
  {
   int i, offset, max_scan, pid, last = pid_ns->last_pid;
   struct pidmap *map;
 + int rc = -EAGAIN;
 
   pid = last + 1;
   if (pid >= pid_max)
 @@ -159,8 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
   map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
   max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
   for (i = 0; i <= max_scan; ++i) {
 - if (alloc_pidmap_page(map))
 + if (alloc_pidmap_page(map)) {
 + rc = -ENOMEM;
   break;
 + }

OK, pet peeve time:

rc = alloc_pidmap_page(map);
if (rc)
break;

It saves the bracket and saves a line of assignment, *and* it clarifies
program flow.

-- Dave
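
The pattern suggested above, sketched as standalone C under stated assumptions: prepare_slot() and scan_slots() are hypothetical names, and the failure is simulated. The helper's return value is stored in rc and tested directly, so the loop needs neither braces nor a separate assignment.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical step: fails with -ENOMEM at index fail_at, else succeeds. */
static int prepare_slot(int idx, int fail_at)
{
	return idx == fail_at ? -ENOMEM : 0;
}

/*
 * Run the step for each slot. Returns 0 on success or the first error
 * reported by prepare_slot(); *done receives the number of slots that
 * were prepared before stopping.
 */
static int scan_slots(int nslots, int fail_at, int *done)
{
	int i, rc = 0;

	for (i = 0; i < nslots; i++) {
		rc = prepare_slot(i, fail_at);
		if (rc)
			break;
	}
	*done = i;
	return rc;
}
```

One design consequence: rc always holds the most recent helper result, so any default error value set before the loop is overwritten by the first successful step.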



[Devel] [RFC v16][PATCH 02/43] c/r: make file_pos_read/write() public

2009-05-27 Thread Oren Laadan
These two are used in the next patch when calling vfs_read/write()

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 fs/read_write.c|   10 --
 include/linux/fs.h |   10 ++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 9d1e76b..ed63ea3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user 
*buf, size_t count, loff_
 
 EXPORT_SYMBOL(vfs_write);
 
-static inline loff_t file_pos_read(struct file *file)
-{
-   return file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
-   file->f_pos = pos;
-}
-
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
struct file *file;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b534e5..9c4348a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1546,6 +1546,16 @@ ssize_t rw_copy_check_uvector(int type, const struct 
iovec __user * uvector,
struct iovec *fast_pointer,
struct iovec **ret_pointer);
 
+static inline loff_t file_pos_read(struct file *file)
+{
+   return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+   file->f_pos = pos;
+}
+
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-- 
1.6.0.4



[Devel] [RFC v16][PATCH 00/43] Kernel based checkpoint/restart

2009-05-27 Thread Oren Laadan
Application checkpoint/restart (c/r) is the ability to save the state
of a running application so that it can later resume its execution
from the time at which it was checkpointed, on the same or a different
machine.

Here is another round of the c/r patchset. The patches are reordered
to reduce size and for easier review, and the code is more stable.
See the changelog below for details. Hey, it even includes renaming
of functions and files ...

Most importantly, it's a working proof-of-concept and has been tested
with v2.6.30-rc7. And while not everything is supported, it provides
a glimpse at _how_ things are done.

For more information, check out Documentation/checkpoint/*.txt

Q: How useful is this code as it stands in real-world usage?
A: Right now, the application can be single- or multi-process.
   Supports open files - regular files and directories on ext[234],
   pipes, and /dev/{null,zero,random,urandom}. All sorts of shared
   memory work. sysv IPC also works (except for semaphore undo).
   The restart does not yet preserve the original pid(s), but 
   patches are already circulating. Definitely already suitable
   for many types of batch jobs. (Note: it is assumed that the fs
   view is available at restart).

Q: What can it checkpoint and restart ?
A: A (single threaded) process can checkpoint itself, aka self
   checkpoint, if it calls the new system calls. Otherwise, for an
   external checkpoint, the caller must first freeze the target
   process(es). One can either checkpoint an entire container (and
   we make best effort to ensure that the result is self-contained),
   or merely a subtree of a process hierarchy.

Q: What about namespaces ?
A: Currently, UTS and IPC namespaces are restored. They demonstrate
   how namespaces are handled. More to come.

Q: What additional work needs to be done to it?
A: Fill in the gory details following the examples so far. Short
   term plan is: restore pids, complete work on threads, zombies,
   signals, and more file types.
   
Q: How can I try it ?
A: This one can actually be used for simple batch jobs (pipes, too),
   a whole container or just a subtree of tasks. Try it:

   create the freezer cgroup:
 $ mount -t cgroup -ofreezer freezer /freezer
 $ mkdir /freezer/0
   
   run the test, freeze it:  
 $ test/multitask &
 [1] 2754
 $ for i in `pidof multitask`; do echo $i > /freezer/0/tasks; done
 $ echo FROZEN > /freezer/0/freezer.state
   
   checkpoint:
 $ ./ckpt 2754 > ckpt.out
   
   restart:
 $ ./mktree < ckpt.out
   
   voila :)
   
To do all this, you'll need:

The git tree tracking v14, branch 'ckpt-v14' (and past versions):
git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

Restarting multiple processes requires 'mktree' userspace tool with
the matching branch (v14):
git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

Oren.


Changelog:

[2009-May-27] v16
  - Privilege checks for IPC checkpoint
  - Fix error string generation during checkpoint
  - Use kzalloc for header allocation
  - Restart blocks are arch-independent
  - Redo pipe c/r using splice
  - Fixes to s390 arch
  - Remove powerpc arch (temporary)
  - Explicitly restore ->nsproxy
  - All objects in image are preceded by 'struct ckpt_hdr'
  - Fix leaks detection (and leaks)
  - Reorder of patchset
  - Misc bugs and compilation fixes

[2009-Apr-12] v15
  - Minor fixes

[2009-Apr-28] v14
  - Tested against kernel v2.6.30-rc3 on x86_32.
  - Refactor files checkpoint to use f_ops (file operations)
  - Refactor mm/vma to use vma_ops
  - Explicitly handle VDSO vma (and require compat mode)
  - Added code to c/r restart-blocks (restart timeout related syscalls)
  - Added code to c/r namespaces: uts, ipc (with Dan Smith)
  - Added code to c/r sysvipc (shm, msg, sem)
  - Support for VM_CLONE shared memory
  - Added resource leak detection for whole-container checkpoint
  - Added sysctl gauge to allow unprivileged restart/checkpoint
  - Improve and simplify the code and logic of shared objects
  - Rework image format: shared objects appear prior to their use
  - Merge checkpoint and restart functionality into same files
  - Massive renaming of functions: prefix ckpt_ for generics,
checkpoint_ for checkpoint, and restore_ for restart.
  - Report checkpoint errors as a valid (string record) in the output
  - Merged PPC architecture (by Nathan Lynch),
  - Requires updates to userspace tools too.
  - Misc nits and bug fixes

[2009-Mar-31] v14-rc2
  - Change along Dave's suggestion to use f_ops->checkpoint() for files
  - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
  - Merge support for PPC arch (Nathan Lynch)
  - Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
  - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
  - Check whether calls to cr_hbuf_get() succeed or fail.
  - Fixes to pipe c/r code
  - Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
  - Refuse non-self 

[Devel] Re: [PATCH] io-controller: Add io group reference handling for request

2009-05-27 Thread Vivek Goyal
On Wed, May 27, 2009 at 03:56:31PM +0900, Ryo Tsuruta wrote:
 Hi Andrea and Vivek,
 
 Ryo Tsuruta r...@valinux.co.jp wrote:
  Hi Andrea and Vivek,
  
  From: Andrea Righi righi.and...@gmail.com
  Subject: Re: [PATCH] io-controller: Add io group reference handling for 
  request
  Date: Mon, 18 May 2009 16:39:23 +0200
  
   On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
 On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
  On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
   On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
Vivek Goyal wrote:
...
  }
 @@ -1462,20 +1462,27 @@ struct io_cgroup 
 *get_iocg_from_bio(stru
  /*
   * Find the io group bio belongs to.
   * If create is set, io group is created if it is not 
 already present.
 + * If curr is set, io group is information is searched for 
 current
 + * task and not with the help of bio.
 + *
 + * FIXME: Can we assume that if bio is NULL then lookup 
 group for current
 + * task and not create extra function parameter ?
   *
 - * Note: There is a narrow window of race where a group is 
 being freed
 - * by cgroup deletion path and some rq has slipped through 
 in this group.
 - * Fix it.
   */
 -struct io_group *io_get_io_group_bio(struct request_queue 
 *q, struct bio *bio,
 - int create)
 +struct io_group *io_get_io_group(struct request_queue *q, 
 struct bio *bio,
 + int create, int curr)

  Hi Vivek,

  IIUC we can get rid of curr, and just determine iog from bio. 
If bio is not NULL,
  get iog from bio, otherwise get it from current task.
   
   Consider also that get_cgroup_from_bio() is much slower than
   task_cgroup() and needs to lock/unlock_page_cgroup() in
   get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
   
  
  True.
  
   BTW another optimization could be to use the blkio-cgroup 
   functionality
   only for dirty pages and cut out some blkio_set_owner(). For all 
   the
   other cases IO always occurs in the same context of the current 
   task,
   and you can use task_cgroup().
   
  
  Yes, may be in some cases we can avoid setting page owner. I will 
  get
  to it once I have got functionality going well. In the mean time if
  you have a patch for it, it will be great.
  
   However, this is true only for page cache pages, for IO generated 
   by
   anonymous pages (swap) you still need the page tracking 
   functionality
   both for reads and writes.
   
  
  Right now I am assuming that all the sync IO will belong to task
  submitting the bio hence use task_cgroup() for that. Only for async
  IO, I am trying to use page tracking functionality to determine the 
  owner.
  Look at elv_bio_sync(bio).
  
  You seem to be saying that there are cases where even for sync IO, 
  we
  can't use submitting task's context and need to rely on page 
  tracking
  functionality? 
  
  I think that there are some kernel threads (e.g., dm-crypt, LVM and md
  devices) which actually submit IOs instead of tasks which originate the
  IOs. When IOs are submitted from such kernel threads, we can't use
  submitting task's context to determine to which cgroup the IO belongs.
  
  In case of getting page (read) from swap, will it not happen
  in the context of process who will take a page fault and initiate 
  the
  swap read?
 
 No, for example in read_swap_cache_async():
 
 @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t 
 entry, gfp_t gfp_mask,
*/
   __set_page_locked(new_page);
   SetPageSwapBacked(new_page);
 + blkio_cgroup_set_owner(new_page, current->mm);
   err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
   if (likely(!err)) {
   /*
 
 This is a read, but the current task is not always the owner of this
 swap cache page, because it's a readahead operation.
 

But will this readahead be not initiated in the context of the task 
taking
the page fault?

handle_pte_fault()
do_swap_page()
swapin_readahead()
read_swap_cache_async()

If yes, then swap reads issued will still be in the context of process 
and
we should be fine?
   
   Right. I was trying to say that the current task may swap-in also 

[Devel] [RFC v16][PATCH 35/43] c/r (ipc): export interface from ipc/shm.c to delete ipc shm

2009-05-27 Thread Oren Laadan
Export shmctl_down() which will be used in the next patch during
restart to delete an ipc shm (the shm is mapped already, so it
won't be lost).

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 include/linux/shm.h |4 
 ipc/shm.c   |4 ++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/shm.h b/include/linux/shm.h
index eca6235..ec36e99 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -118,6 +118,10 @@ static inline int is_file_shm_hugepages(struct file *file)
 }
 #endif
 
+struct ipc_namespace;
+extern int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+  struct shmid_ds __user *buf, int version);
+
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_SHM_H_ */
diff --git a/ipc/shm.c b/ipc/shm.c
index 7dd5f0c..8aba22f 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -598,8 +598,8 @@ static void shm_get_stat(struct ipc_namespace *ns, unsigned 
long *rss,
  * to be held in write mode.
  * NOTE: no locks must be held, the rw_mutex is taken inside this function.
  */
-static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
-  struct shmid_ds __user *buf, int version)
+int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+   struct shmid_ds __user *buf, int version)
 {
struct kern_ipc_perm *ipcp;
struct shmid64_ds shmid64;
-- 
1.6.0.4



[Devel] [RFC v16][PATCH 33/43] c/r (ipc): helpers to save and restore kern_ipc_perm structures

2009-05-27 Thread Oren Laadan
Add the helpers to save and restore the contents of 'struct
kern_ipc_perm'. Add header structures for ipc state. Put
place-holders to save and restore ipc state.

TODO:
This patch does _not_ address the issues of users/groups and the
related security issues. For now, it saves the old user/group of
ipc objects, but does not restore them during restart.
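
The restore helper in this patch rejects image values that do not fit the narrower in-kernel fields. Below is a hedged userspace sketch of that idea; struct image_perms, struct live_perms, TST_OVERFLOW_16() and load_perms() are hypothetical names, the 16-bit destination fields and the use of SHRT_MAX from <limits.h> are assumptions of the sketch, and only two fields are shown.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/*
 * True when the source type is wider than the destination field and the
 * value exceeds the 16-bit signed maximum. Both sides are widened so the
 * comparison is done on unsigned values of the same type.
 */
#define TST_OVERFLOW_16(a, b) \
	((sizeof(a) > sizeof(b)) && \
	 ((unsigned long long)(a) > (unsigned long long)SHRT_MAX))

/* Hypothetical on-disk (image) and in-memory forms. */
struct image_perms {
	uint32_t uid;
	uint32_t mode;
};

struct live_perms {
	unsigned short uid;
	unsigned short mode;
};

/* Copy image values into the live structure, refusing bad input. */
static int load_perms(const struct image_perms *h, struct live_perms *p)
{
	if (TST_OVERFLOW_16(h->uid, p->uid) ||
	    TST_OVERFLOW_16(h->mode, p->mode))
		return -EINVAL;
	if (h->mode & ~0777u)	/* only rwx bits for user/group/other */
		return -EINVAL;

	p->uid = (unsigned short)h->uid;
	p->mode = (unsigned short)h->mode;
	return 0;
}
```

Validating before copying means a corrupted or hostile image is refused with -EINVAL instead of being silently truncated into a different, valid-looking value.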

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 include/linux/checkpoint.h |7 +++-
 include/linux/checkpoint_hdr.h |   29 ++
 ipc/Makefile   |1 +
 ipc/checkpoint.c   |   81 
 ipc/util.h |8 
 5 files changed, 125 insertions(+), 1 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 5a42399..9a7517f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,6 +10,10 @@
  *  distribution for more details.
  */
 
+#include <linux/sched.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+
 #include <linux/checkpoint_types.h>
 #include <linux/checkpoint_hdr.h>
 #include <asm/checkpoint_hdr.h>
@@ -157,8 +161,9 @@ extern int restore_memory_contents(struct ckpt_ctx *ctx, 
struct inode *inode);
 #define CKPT_DFILE 0x10/* files and filesystem */
 #define CKPT_DMEM  0x20/* memory state */
 #define CKPT_DPAGE 0x40/* memory pages */
+#define CKPT_DIPC  0x80/* sysvipc */
 
-#define CKPT_DDEFAULT  0x37/* default debug level */
+#define CKPT_DDEFAULT  0xb7/* default debug level */
 
 #ifndef CKPT_DFLAG
 #define CKPT_DFLAG 0x0 /* nothing */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 44a48dc..05769f4 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -70,6 +70,11 @@ enum {
CKPT_HDR_PGARR,
CKPT_HDR_MM_CONTEXT,
 
+   CKPT_HDR_IPC = 501,
+   CKPT_HDR_IPC_SHM,
+   CKPT_HDR_IPC_MSG,
+   CKPT_HDR_IPC_SEM,
+
CKPT_HDR_TAIL = 9001,
 
CKPT_HDR_ERROR = ,
@@ -299,4 +304,28 @@ struct ckpt_hdr_pgarr {
 } __attribute__((aligned(8)));
 
 
+/* ipc commons */
+struct ckpt_hdr_ipc_perms {
+   __s32 id;
+   __u32 key;
+   __u32 uid;
+   __u32 gid;
+   __u32 cuid;
+   __u32 cgid;
+   __u32 mode;
+   __u32 _padding;
+   __u64 seq;
+} __attribute__((aligned(8)));
+
+
+#define CKPT_TST_OVERFLOW_16(a, b) \
+   ((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
+
+#define CKPT_TST_OVERFLOW_32(a, b) \
+   ((sizeof(a) > sizeof(b)) && ((a) > INT_MAX))
+
+#define CKPT_TST_OVERFLOW_64(a, b) \
+   ((sizeof(a) > sizeof(b)) && ((a) > LONG_MAX))
+
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/ipc/Makefile b/ipc/Makefile
index 4e1955e..aa6c8dd 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o
 
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
new file mode 100644
index 000..b7b48b0
--- /dev/null
+++ b/ipc/checkpoint.c
@@ -0,0 +1,81 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/ipc.h>
+#include <linux/msg.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "util.h"
+
+int checkpoint_ipcns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
+{
+   return 0;
+}
+
+int restore_ipcns(struct ckpt_ctx *ctx)
+{
+   return 0;
+}
+
+int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+ struct kern_ipc_perm *perm)
+{
+   if (ipcperms(perm, S_IROTH))
+   return -EACCES;
+
+   h->id = perm->id;
+   h->key = perm->key;
+   h->uid = perm->uid;
+   h->gid = perm->gid;
+   h->cuid = perm->cuid;
+   h->cgid = perm->cgid;
+   h->mode = perm->mode & S_IRWXUGO;
+   h->seq = perm->seq;
+
+   return 0;
+}
+
+int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+  struct kern_ipc_perm *perm)
+{
+   if (h->id < 0)
+   return -EINVAL;
+   if (CKPT_TST_OVERFLOW_16(h->uid, perm->uid) ||
+   CKPT_TST_OVERFLOW_16(h->gid, perm->gid) ||
+   CKPT_TST_OVERFLOW_16(h->cuid, perm->cuid) ||
+   CKPT_TST_OVERFLOW_16(h->cgid, perm->cgid) ||
+   CKPT_TST_OVERFLOW_16(h->mode, perm->mode))
+   return -EINVAL;
+   if (h->seq >= USHORT_MAX)
+   return -EINVAL;
+   if (h->mode & ~S_IRWXUGO)
+   

[Devel] [RFC v16][PATCH 37/43] c/r (ipc): make 'struct msg_msgseg' visible in ipc/util.h

2009-05-27 Thread Oren Laadan
Move the definition of 'struct msg_msgseg' and constants DATALEN_*
to ipc/util.h, where they are visible to ipc/ckpt_msg.c

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 ipc/msg.c |3 +--
 ipc/msgutil.c |8 
 ipc/util.h|   10 ++
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 1db7c45..1d5d087 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -72,7 +72,6 @@ struct msg_sender {
 
 #define msg_unlock(msq) ipc_unlock(&(msq)->q_perm)
 
-static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
 static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
@@ -278,7 +277,7 @@ static void expunge_all(struct msg_queue *msq, int res)
  * msg_ids.rw_mutex (writer) and the spinlock for this message queue are held
  * before freeque() is called. msg_ids.rw_mutex remains locked on exit.
  */
-static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
struct list_head *tmp;
struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index f095ee2..e119243 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -36,14 +36,6 @@ struct ipc_namespace init_ipc_ns = {
 
 atomic_t nr_ipc_ns = ATOMIC_INIT(1);
 
-struct msg_msgseg {
-   struct msg_msgseg* next;
-   /* the next part of the message follows immediately */
-};
-
-#define DATALEN_MSG (PAGE_SIZE-sizeof(struct msg_msg))
-#define DATALEN_SEG (PAGE_SIZE-sizeof(struct msg_msgseg))
-
 struct msg_msg *load_msg(const void __user *src, int len)
 {
struct msg_msg *msg;
diff --git a/ipc/util.h b/ipc/util.h
index 5a6373f..db067b0 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -140,6 +140,14 @@ extern void free_msg(struct msg_msg *msg);
 extern struct msg_msg *load_msg(const void __user *src, int len);
 extern int store_msg(void __user *dest, struct msg_msg *msg, int len);
 
+struct msg_msgseg {
+   struct msg_msgseg *next;
+   /* the next part of the message follows immediately */
+};
+
+#define DATALEN_MSG (PAGE_SIZE-sizeof(struct msg_msg))
+#define DATALEN_SEG (PAGE_SIZE-sizeof(struct msg_msgseg))
+
 extern void recompute_msgmni(struct ipc_namespace *);
 
 static inline int ipc_buildid(int id, int seq)
@@ -175,6 +183,8 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 
 /* for checkpoint/restart */
 extern int do_shmget(key_t key, size_t size, int shmflg, int req_id);
+extern int do_msgget(key_t key, int msgflg, int req_id);
+extern void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
 extern void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
-- 
1.6.0.4



[Devel] [RFC v16][PATCH 32/43] c/r (ipc): allow allocation of a desired ipc identifier

2009-05-27 Thread Oren Laadan
During restart, we need to allocate ipc objects with the same
identifiers as recorded during checkpoint. Modify the allocation
code to allow an in-kernel caller to request a specific ipc identifier.
The system call interface remains unchanged.
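
The req_id convention (-1 means "don't care") can be sketched in plain C. This is an illustration only: struct id_table, id_alloc() and id_alloc_any() are hypothetical names, the table is a fixed array, and the errno values are choices made for the example, not taken from ipc/util.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_IDS 8

/* Hypothetical fixed-size id table; true means the id is taken. */
struct id_table {
	bool used[MAX_IDS];
};

/*
 * Allocate an id. With req_id == -1 any free id will do; otherwise only
 * req_id itself is acceptable. Returns the id or a negative errno.
 */
static int id_alloc(struct id_table *t, int req_id)
{
	int id;

	if (req_id >= 0) {
		if (req_id >= MAX_IDS)
			return -EINVAL;
		if (t->used[req_id])
			return -EBUSY;
		t->used[req_id] = true;
		return req_id;
	}
	if (req_id != -1)
		return -EINVAL;

	for (id = 0; id < MAX_IDS; id++) {
		if (!t->used[id]) {
			t->used[id] = true;
			return id;
		}
	}
	return -ENOSPC;
}

/* Wrapper mirroring an unchanged external interface: no id requested. */
static int id_alloc_any(struct id_table *t)
{
	return id_alloc(t, -1);
}
```

Keeping the "any id" path behind a thin wrapper is what lets existing callers stay as they are while a restart path passes an explicit id.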

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 ipc/msg.c  |   17 -
 ipc/sem.c  |   17 -
 ipc/shm.c  |   19 +--
 ipc/util.c |   42 +-
 ipc/util.h |   12 +---
 5 files changed, 75 insertions(+), 32 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 2ceab7f..1db7c45 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -73,7 +73,7 @@ struct msg_sender {
 #define msg_unlock(msq) ipc_unlock(&(msq)->q_perm)
 
 static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
-static int newque(struct ipc_namespace *, struct ipc_params *);
+static int newque(struct ipc_namespace *, struct ipc_params *, int);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
 #endif
@@ -174,10 +174,12 @@ static inline void msg_rmid(struct ipc_namespace *ns, 
struct msg_queue *s)
  * newque - Create a new msg queue
  * @ns: namespace
  * @params: ptr to the structure that contains the key and msgflg
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with msg_ids.rw_mutex held (writer)
  */
-static int newque(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newque(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
struct msg_queue *msq;
int id, retval;
@@ -201,7 +203,7 @@ static int newque(struct ipc_namespace *ns, struct 
ipc_params *params)
/*
 * ipc_addid() locks msq
 */
-   id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
+   id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni, req_id);
 if (id < 0) {
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
@@ -309,7 +311,7 @@ static inline int msg_security(struct kern_ipc_perm *ipcp, 
int msgflg)
return security_msg_queue_associate(msq, msgflg);
 }
 
-SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+int do_msgget(key_t key, int msgflg, int req_id)
 {
struct ipc_namespace *ns;
struct ipc_ops msg_ops;
@@ -324,7 +326,12 @@ SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
msg_params.key = key;
msg_params.flg = msgflg;
 
-   return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
+   return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params, req_id);
+}
+
+SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+{
+   return do_msgget(key, msgflg, -1);
 }
 
 static inline unsigned long
diff --git a/ipc/sem.c b/ipc/sem.c
index 16a2189..207dbbb 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -92,7 +92,7 @@
 #define sem_unlock(sma)	ipc_unlock(&(sma)->sem_perm)
 #define sem_checkid(sma, semid)	ipc_checkid(&sma->sem_perm, semid)
 
-static int newary(struct ipc_namespace *, struct ipc_params *);
+static int newary(struct ipc_namespace *, struct ipc_params *, int);
 static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
@@ -227,11 +227,13 @@ static inline void sem_rmid(struct ipc_namespace *ns, 
struct sem_array *s)
  * newary - Create a new semaphore set
  * @ns: namespace
  * @params: ptr to the structure that contains key, semflg and nsems
+ * @req_id: request desired id if available (-1 if don't care)
  *
  * Called with sem_ids.rw_mutex held (as a writer)
  */
 
-static int newary(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newary(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
 {
int id;
int retval;
@@ -263,7 +265,7 @@ static int newary(struct ipc_namespace *ns, struct 
ipc_params *params)
return retval;
}
 
-   id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
+   id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni, req_id);
 if (id < 0) {
security_sem_free(sma);
ipc_rcu_putref(sma);
@@ -308,7 +310,7 @@ static inline int sem_more_checks(struct kern_ipc_perm 
*ipcp,
return 0;
 }
 
-SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+int do_semget(key_t key, int nsems, int semflg, int req_id)
 {
struct ipc_namespace *ns;
struct ipc_ops sem_ops;
@@ -327,7 +329,12 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, 
semflg)
sem_params.flg = semflg;
sem_params.u.nsems = nsems;
 
-   return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
+   return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params, req_id);
+}
+
+SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+{
+   return do_semget(key, nsems, semflg, -1);
 }
 
 /*
diff --git a/ipc/shm.c b/ipc/shm.c
index faa46da..7dd5f0c 100644
--- a/ipc/shm.c
+++ 

[Devel] [RFC v16][PATCH 12/43] c/r: add generic '-checkpoint()' f_op to simple devices

2009-05-27 Thread Oren Laadan
* /dev/null
* /dev/zero
* /dev/random
* /dev/urandom

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 drivers/char/mem.c|2 ++
 drivers/char/random.c |2 ++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 8f05c38..bfde41f 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -797,6 +797,7 @@ static const struct file_operations null_fops = {
.read   = read_null,
.write  = write_null,
.splice_write   = splice_write_null,
+   .checkpoint = generic_file_checkpoint,
 };
 
 #ifdef CONFIG_DEVPORT
@@ -813,6 +814,7 @@ static const struct file_operations zero_fops = {
.read   = read_zero,
.write  = write_zero,
.mmap   = mmap_zero,
+   .checkpoint = generic_file_checkpoint,
 };
 
 /*
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 8c74448..211ca70 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1164,6 +1164,7 @@ const struct file_operations random_fops = {
.poll  = random_poll,
.unlocked_ioctl = random_ioctl,
.fasync = random_fasync,
+   .checkpoint = generic_file_checkpoint,
 };
 
 const struct file_operations urandom_fops = {
@@ -1171,6 +1172,7 @@ const struct file_operations urandom_fops = {
.write = random_write,
.unlocked_ioctl = random_ioctl,
.fasync = random_fasync,
+   .checkpoint = generic_file_checkpoint,
 };
 
 /***
-- 
1.6.0.4



[Devel] [RFC v16][PATCH 41/43] c/r: (s390): expose a constant for the number of words (CRs)

2009-05-27 Thread Oren Laadan
We need to use this value in the checkpoint/restart code and would like to
have a constant instead of a magic '3'.

Changelog:
Mar 30:
. Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Mar 03:
. Picked up additional use of magic '3' in ptrace.h

Signed-off-by: Dan Smith da...@us.ibm.com
---
 arch/s390/Kconfig |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2eca5fe..bf62cad 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
 config GENERIC_CLOCKEVENTS
def_bool y
 
+config CHECKPOINT_SUPPORT
+   bool
+   default y if 64BIT
+
 config GENERIC_BUG
bool
depends on BUG
-- 
1.6.0.4



[Devel] [RFC v16][PATCH 27/43] c/r: support for open pipes

2009-05-27 Thread Oren Laadan
A pipe is a double-headed inode with a buffer attached to it. We
checkpoint the pipe buffer only once, as soon as we hit one side of
the pipe, regardless of whether it is the read or the write end.

To checkpoint a file descriptor that refers to a pipe (either end), we
first look up the inode in the hash table. If not found, it is the
first encounter of this pipe: besides the file descriptor, we also (a)
save the pipe data, and (b) register the pipe inode in the hash. If
found, it is the second encounter of this pipe, namely, we have hit the
other end of the same pipe. In both cases we write the pipe_objref of
the inode.
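
The first/second encounter bookkeeping can be sketched in user space.
This is only an illustration under stated assumptions: the table is a
small linear array rather than the kernel's hash table, and
obj_lookup_add(), checkpoint_pipe_end() and buffers_saved are
hypothetical names, not the helpers used by this patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 16	/* hypothetical capacity */

/* Linear table standing in for the checkpoint object hash. */
struct obj_table {
	const void *ptr[MAX_OBJS];
	int count;
};

/*
 * Return the objref (1-based) of ptr, adding it if unseen.
 * *first is set to 1 on the first encounter and to 0 afterwards.
 * Returns -1 when the table is full.
 */
static int obj_lookup_add(struct obj_table *t, const void *ptr, int *first)
{
	assert(t != NULL && first != NULL);

	for (int i = 0; i < t->count; i++) {
		if (t->ptr[i] == ptr) {
			*first = 0;
			return i + 1;
		}
	}
	if (t->count == MAX_OBJS)
		return -1;
	t->ptr[t->count++] = ptr;
	*first = 1;
	return t->count;
}

/* Counts how many times pipe contents would be written to the image. */
static int buffers_saved;

/* Checkpoint one end of a pipe; returns the objref recorded for it. */
static int checkpoint_pipe_end(struct obj_table *t, const void *inode)
{
	int first;
	int objref = obj_lookup_add(t, inode, &first);

	if (objref < 0)
		return objref;
	if (first)
		buffers_saved++;	/* stand-in for saving the pipe data */
	return objref;
}
```

Both ends of one pipe share an inode, so they resolve to the same
objref and the buffer is saved once, on whichever end is seen first.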

To restore, we create a new pipe and thus have two file pointers (the
read and write ends). We use only one of them, depending on which side was
checkpointed first. We register the file pointer of the other end in
the hash table, with the pipe_objref given for this pipe from the
checkpoint, to be used later when the other end arrives. At this point we
also restore the contents of the pipe buffers.
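
The restore side can be sketched with file descriptors in user space.
This is a rough analogue only: it uses pipe(2) and an array indexed by
objref instead of in-kernel file pointers and the object hash,
restore_table and restore_pipe_end() are hypothetical names, and
refilling the pipe buffer is left out.

```c
#include <assert.h>
#include <unistd.h>

#define MAX_REFS 16	/* hypothetical number of objrefs */

/* pending_fd[objref] holds the not-yet-claimed end of a restored pipe. */
struct restore_table {
	int pending_fd[MAX_REFS];
};

static void restore_table_init(struct restore_table *t)
{
	for (int i = 0; i < MAX_REFS; i++)
		t->pending_fd[i] = -1;
}

/*
 * Restore one end of the pipe identified by objref.
 * On the first encounter, want_write selects which end the caller gets;
 * on the second encounter the parked (other) end is returned.
 * Returns a file descriptor, or -1 on error.
 */
static int restore_pipe_end(struct restore_table *t, int objref, int want_write)
{
	int fds[2];

	assert(t != NULL);
	if (objref < 0 || objref >= MAX_REFS)
		return -1;

	if (t->pending_fd[objref] >= 0) {
		/* second encounter: claim the end parked earlier */
		int fd = t->pending_fd[objref];

		t->pending_fd[objref] = -1;
		return fd;
	}

	/* first encounter: create the pipe, keep one end, park the other */
	if (pipe(fds) < 0)
		return -1;
	t->pending_fd[objref] = fds[want_write ? 0 : 1];
	return fds[want_write ? 1 : 0];
}
```

Parking the unused end under the objref is what lets the second
descriptor attach to the same pipe instead of creating a new one.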

To save the pipe buffer, given a source pipe, use do_tee() to clone
its contents into a temporary 'struct pipe_inode_info', and then use
do_splice_from() to transfer it directly to the checkpoint image file.

To restore the pipe buffer, with a fresh newly allocated target pipe,
use do_splice_to() to splice the data directly between the checkpoint
image file and the pipe.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/files.c |7 ++
 fs/pipe.c  |  173 
 fs/splice.c|   10 +-
 include/linux/checkpoint_hdr.h |   12 +++
 include/linux/pipe_fs_i.h  |6 ++
 include/linux/splice.h |9 ++
 6 files changed, 212 insertions(+), 5 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index d7583d3..b264e40 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -17,6 +17,7 @@
 #include <linux/file.h>
 #include <linux/fdtable.h>
 #include <linux/fsnotify.h>
+#include <linux/pipe_fs_i.h>
 #include <linux/syscalls.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
@@ -433,6 +434,12 @@ static struct restore_file_ops restore_file_ops[] = {
.file_type = CKPT_FILE_GENERIC,
.restore = generic_file_restore,
},
+   /* pipes */
+   {
+   .file_name = "PIPE",
+   .file_type = CKPT_FILE_PIPE,
+   .restore = pipe_file_restore,
+   },
 };
 
 static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index 13414ec..d0aba56 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -13,6 +13,7 @@
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/pipe_fs_i.h>
+#include <linux/splice.h>
 #include <linux/uio.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
@@ -22,6 +23,9 @@
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
 
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
 /*
  * We use a start+len construction, which provides full use of the 
  * allocated memory.
@@ -795,6 +799,172 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
return 0;
 }
 
+#ifdef CONFIG_CHECKPOINT
+static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
+{
+   struct ckpt_hdr_file_pipe_state *h;
+   struct pipe_inode_info *pipe;
+   int len, ret = -ENOMEM;
+
+   pipe = alloc_pipe_info(NULL);
+   if (!pipe)
+   return ret;
+
+   pipe->readers = 1;  /* bluff link_pipe() below */
+   len = link_pipe(inode->i_pipe, pipe, INT_MAX, SPLICE_F_NONBLOCK);
+   if (len == -EAGAIN)
+   len = 0;
+   if (len < 0) {
+   ret = len;
+   goto out;
+   }
+
+   h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_PIPE);
+   if (!h)
+   goto out;
+   h->pipe_len = len;
+   ret = ckpt_write_obj(ctx, &h->h);
+   ckpt_hdr_put(ctx, h);
+   if (ret < 0)
+   goto out;
+
+   ret = do_splice_from(pipe, ctx->file, &ctx->file->f_pos, len, 0);
+   if (ret < 0)
+   goto out;
+   if (ret != len)
+   ret = -EPIPE;  /* can occur due to an error in target file */
+ out:
+   __free_pipe_info(pipe);
+   return ret;
+}
+
+static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+   struct ckpt_hdr_file_pipe *h;
+   struct inode *inode = file->f_dentry->d_inode;
+   int objref, first, ret;
+
+   objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+   if (objref < 0)
+   return objref;
+
+   h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+   if (!h)
+   return -ENOMEM;
+
+   h->common.f_type = CKPT_FILE_PIPE;
+   h->pipe_objref = objref;
+
+   ret = checkpoint_file_common(ctx, file, &h->common);
+   if (ret < 0)
+   goto out;
+   ret = ckpt_write_obj(ctx, &h->common.h);
+   if (ret < 0)
+   goto out;
+
+   if 

[Devel] [RFC v16][PATCH 23/43] c/r: restart multiple processes

2009-05-27 Thread Oren Laadan
Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself in the same order in which the tasks were saved. The internals of
the syscall will take care of in-kernel synchronization between tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

The init task (*) has a special role: it allocates the restart context
(ctx), and coordinates the operation. In particular, it first waits
until all participating tasks enter the kernel, and provides them the
common restart context. Once everyone is ready, it begins to restart
itself.

In contrast, the other tasks enter the kernel, locate the init task (*)
and grab its restart context, and then wait for their turn to restore.

When a task (init or not) completes its restart, it hands the control
over to the next in line, by waking that task.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (ctx) maintains a current position in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.

Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Changelog[v14]:
  - Revert change to pr_debug(), back to ckpt_debug()
  - Discard field 'h.parent'
  - Check whether calls to ckpt_hbuf_get() fail

Changelog[v13]:
  - Clear root_task-checkpoint_ctx regardless of error condition
  - Remove unused argument 'ctx' from do_restore_task() prototype
  - Remove unused member 'pids_err' from 'struct ckpt_ctx'

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/restart.c |  242 --
 checkpoint/sys.c |   27 -
 include/linux/checkpoint.h   |3 +
 include/linux/checkpoint_types.h |   17 +++-
 include/linux/sched.h|4 +
 5 files changed, 277 insertions(+), 16 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 8b8229e..5e68835 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -13,6 +13,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/utsname.h>
@@ -353,12 +354,233 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
return ret;
 }
 
+/* restore_read_tree - read the tasks tree into the checkpoint context */
+static int restore_read_tree(struct ckpt_ctx *ctx)
+{
+   struct ckpt_hdr_tree *h;
+   int size, ret;
+
+   h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+   if (IS_ERR(h))
+   return PTR_ERR(h);
+
+   ret = -EINVAL;
+   if (h->nr_tasks < 0)
+   goto out;
+
+   ctx->nr_pids = h->nr_tasks;
+   size = sizeof(*ctx->pids_arr) * ctx->nr_pids;
+   if (size < 0)   /* overflow ? */
+   goto out;
+
+   ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+   if (!ctx->pids_arr) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   ret = _ckpt_read_buffer(ctx, ctx->pids_arr, size);
+ out:
+   ckpt_hdr_put(ctx, h);
+   return ret;
+}
+
+static inline pid_t active_pid(struct ckpt_ctx *ctx)
+{
+   return ctx->pids_arr[ctx->active_pid].vpid;
+}
+
+static int restore_wait_task(struct ckpt_ctx *ctx)
+{
+   pid_t pid = task_pid_vnr(current);
+
+   ckpt_debug("pid %d waiting\n", pid);
+   return wait_event_interruptible(ctx->waitq, active_pid(ctx) == pid);
+}
+
+static int restore_next_task(struct ckpt_ctx *ctx)
+{
+   struct task_struct *task;
+
+   ctx->active_pid++;
+
+   ckpt_debug("active_pid %d of %d\n", ctx->active_pid, ctx->nr_pids);
+   if (ctx->active_pid == ctx->nr_pids) {
+   complete(&ctx->complete);
+   

[Devel] [RFC v16][PATCH 25/43] tee: don't return 0 when another task drains/fills a pipe

2009-05-27 Thread Oren Laadan
This patch is a modified version of Max Kellerman's patch that fixes
a race in do_tee() (see http://patchwork/kernel/org/patch/21040).

It differs in that it refactors link_pipe() so that the following
patch (which adds support for splice() between pipes, also based on
a patch by Max Kellerman) can better share code.

Below is Max's original description:
--
Cite from the tee() manual page:

 A return value of 0 means that there was no data to transfer, and it
 would not make sense to block, because there are no writers connected
 to the write end of the pipe

There is however a race condition in the tee() implementation, which
violates this definition:

- do_tee() ensures that ipipe is readable and opipe is writable by
  calling link_ipipe_prep() and link_opipe_prep()
- these two functions unlock the pipe after they have waited
- during this unlocked phase, there is a short window where other
  tasks may drain the input pipe or fill the output pipe
- do_tee() now calls link_pipe(), which re-locks both pipes
- link_pipe() sees that it is unable to read (i >= ipipe->nrbufs ||
  opipe->nrbufs >= PIPE_BUFFERS) and breaks from the loop
- link_pipe() returns 0

Although there may be writers connected to the input pipe, tee() now
returns 0, and the caller (spuriously) assumes this is the end of the
stream.

This patch wraps the link_[io]pipe_prep() invocation in a loop within
link_pipe(), and loops until the result is reliable.
--
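
The check, lock, re-check pattern described above can be modelled in a
single thread. This is only a simulation under stated assumptions:
fake_pipe, PIPE_SLOTS, the hook type and prep_until_reliable() are
hypothetical, the blocking wait is replaced by pretending that a writer
queued one buffer, and no real locking takes place.

```c
#include <assert.h>
#include <stddef.h>

#define PIPE_SLOTS 16	/* stand-in for the kernel's PIPE_BUFFERS */

struct fake_pipe {
	int nrbufs;	/* buffers currently queued */
	int writers;	/* writers still attached */
};

/* Hook run in the "unlocked window" to simulate another task. */
typedef void (*window_hook)(struct fake_pipe *in, struct fake_pipe *out,
			    int attempt);

static int transfer_possible(const struct fake_pipe *in,
			     const struct fake_pipe *out)
{
	return (in->nrbufs > 0 || in->writers == 0) &&
	       out->nrbufs < PIPE_SLOTS;
}

/*
 * Loop until the readiness check still holds after "taking the locks".
 * Returns the number of attempts needed, or -1 if max_tries is exhausted.
 */
static int prep_until_reliable(struct fake_pipe *in, struct fake_pipe *out,
			       window_hook hook, int max_tries)
{
	assert(in != NULL && out != NULL);

	for (int attempt = 0; attempt < max_tries; attempt++) {
		/* Stand-in for the blocking wait: a writer queued data. */
		if (in->nrbufs == 0 && in->writers > 0)
			in->nrbufs = 1;

		/* Unlocked window: another task may drain or fill here. */
		if (hook)
			hook(in, out, attempt);

		/* "Locks taken": re-validate before trusting the result. */
		if (transfer_possible(in, out))
			return attempt + 1;
	}
	return -1;
}

/* Example hook: a competing reader drains the input on the first pass. */
static void drain_once(struct fake_pipe *in, struct fake_pipe *out,
		       int attempt)
{
	(void)out;
	if (attempt == 0)
		in->nrbufs = 0;
}
```

With drain_once as the hook the first pass fails its re-check and the
second succeeds, which is the situation where the old code would have
returned 0 to the caller.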

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 fs/splice.c |   80 +--
 1 files changed, 61 insertions(+), 19 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 666953d..92dd63c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1586,6 +1586,59 @@ static int link_opipe_prep(struct pipe_inode_info *pipe, 
unsigned int flags)
return ret;
 }
 
+/**
+ * link_pipe_prep - make sure there's readable data and writable room
+ * @ipipe: the input pipe
+ * @opipe: the output pipe
+ * @flags: splice modifier flags
+ *
+ * Wrap the link_[io]pipe_prep() invocation in a loop until the result
+ * is reliable.
+ *
+ * Expects pipes to be unlocked, and on success returns them locked.
+ */
+static int link_pipe_prep(struct pipe_inode_info *ipipe,
+ struct pipe_inode_info *opipe,
+ unsigned int flags)
+{
+   int ret;
+
+   while (1) {
+   /* wait for ipipe to become ready to read */
+   ret = link_ipipe_prep(ipipe, flags);
+   if (ret)
+   return ret;
+
+   /* wait for opipe to become ready to write */
+   ret = link_opipe_prep(opipe, flags);
+   if (ret)
+   return ret;
+
+   /*
+* Potential ABBA deadlock, work around it by ordering
+* lock grabbing by inode address. Otherwise two
+* different processes could deadlock (one doing tee
+* from A -> B, the other from B -> A).
+*/
+   pipe_double_lock(ipipe, opipe);
+
+   /* see if the tee() is still possible */
+   if ((ipipe->nrbufs > 0 || ipipe->writers == 0) &&
+   opipe->nrbufs < PIPE_BUFFERS)
+   /* yes, it is - keep the locks and end this
+  loop */
+   break;
+
+   /* no - someone has drained ipipe or has filled opipe
+  between link_[io]pipe_prep()'s lock and our lock.
+  Drop both locks and wait again. */
+   pipe_unlock(ipipe);
+   pipe_unlock(opipe);
+   }
+
+   return 0;
+}
+
 /*
  * Link contents of ipipe to opipe.
  */
@@ -1594,14 +1647,13 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 size_t len, unsigned int flags)
 {
struct pipe_buffer *ibuf, *obuf;
-   int ret = 0, i = 0, nbuf;
+   int ret, i = 0, nbuf;
 
-   /*
-* Potential ABBA deadlock, work around it by ordering lock
-* grabbing by pipe info address. Otherwise two different processes
-* could deadlock (one doing tee from A -> B, the other from B -> A).
-*/
-   pipe_double_lock(ipipe, opipe);
+   ret = link_pipe_prep(ipipe, opipe, flags);
+   if (ret < 0)
+   return ret;
+
+   /* pipes are now locked */
 
do {
if (!opipe-readers) {
@@ -1685,18 +1737,8 @@ static long do_tee(struct file *in, struct file *out, 
size_t len,
 * Duplicate the contents of ipipe to opipe without actually
 * copying the data.
 */
-   if (ipipe && opipe && ipipe != opipe) {
-   /*
-* Keep going, unless we encounter an error. The ipipe/opipe
-* ordering doesn't really matter.
-*/
-   ret = link_ipipe_prep(ipipe, flags);
-   if (!ret) {
-   ret = 

[Devel] Re: [PATCH 3/7] [PATCH] Add target_pid parameter to alloc_pidmap()

2009-05-27 Thread Serge E. Hallyn
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com):
 
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:41 -0700
 Subject: [PATCH 3/7] [PATCH] Add target_pid parameter to alloc_pidmap()
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

How about

#define TARGET_PID_UNSPECIFIED 0

or something to pass to alloc_pidmap() from alloc_pid()?
Up to you  More importantly:

 ---
  kernel/pid.c |   28 ++--
  1 files changed, 26 insertions(+), 2 deletions(-)
 
 diff --git a/kernel/pid.c b/kernel/pid.c
 index fd72ad9..93406c6 100644
 --- a/kernel/pid.c
 +++ b/kernel/pid.c
 @@ -147,12 +147,36 @@ static int alloc_pidmap_page(struct pidmap *map)
   return 0;
  }
 
 -static int alloc_pidmap(struct pid_namespace *pid_ns)
 +static int set_pidmap(struct pid_namespace *pid_ns, int pid)
 +{
 + int offset;
 + struct pidmap *map;
 +
 + if (pid >= pid_max)
 + return -EINVAL;

what about pid < 0?

 +
 + offset = pid & BITS_PER_PAGE_MASK;
 + map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 +
 + if (alloc_pidmap_page(map))
 + return -ENOMEM;
 +
 + if (test_and_set_bit(offset, map->page))
 + return -EBUSY;
 +
 + atomic_dec(&map->nr_free);
 + return pid;
 +}
 +
 +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
  {
   int i, offset, max_scan, pid, last = pid_ns-last_pid;
   struct pidmap *map;
   int rc = -EAGAIN;
 
 + if (target_pid)
 + return set_pidmap(pid_ns, target_pid);
 +
   pid = last + 1;
   if (pid = pid_max)
   pid = RESERVED_PIDS;
 @@ -269,7 +293,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
   tmp = ns;
   for (i = ns-level; i = 0; i--) {
 - nr = alloc_pidmap(tmp);
 + nr = alloc_pidmap(tmp, 0);
   if (nr  0)
   goto out_free;
 
 -- 
 1.5.2.5


[Devel] Re: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page

2009-05-27 Thread Serge E. Hallyn
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com):
 
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:39 -0700
 Subject: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

Acked-by: Serge Hallyn se...@us.ibm.com

 ---
  kernel/pid.c |   43 ---
  1 files changed, 28 insertions(+), 15 deletions(-)
 
 diff --git a/kernel/pid.c b/kernel/pid.c
 index b2e5f78..c0aaebe 100644
 --- a/kernel/pid.c
 +++ b/kernel/pid.c
 @@ -122,6 +122,31 @@ static void free_pidmap(struct upid *upid)
   atomic_inc(map-nr_free);
  }
 
 +static int alloc_pidmap_page(struct pidmap *map)
 +{
 + void *page;
 +
 + if (likely(map-page))
 + return 0;
 +
 + page = kzalloc(PAGE_SIZE, GFP_KERNEL);
 +
 + /*
 +  * Free the page if someone raced with us installing it:
 +  */
 + spin_lock_irq(pidmap_lock);
 + if (map-page)
 + kfree(page);
 + else
 + map-page = page;
 + spin_unlock_irq(pidmap_lock);
 +
 + if (unlikely(!map-page))
 + return -1;
 +
 + return 0;
 +}
 +
  static int alloc_pidmap(struct pid_namespace *pid_ns)
  {
   int i, offset, max_scan, pid, last = pid_ns-last_pid;
 @@ -134,21 +159,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
   map = pid_ns-pidmap[pid/BITS_PER_PAGE];
   max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
   for (i = 0; i = max_scan; ++i) {
 - if (unlikely(!map-page)) {
 - void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
 - /*
 -  * Free the page if someone raced with us
 -  * installing it:
 -  */
 - spin_lock_irq(pidmap_lock);
 - if (map-page)
 - kfree(page);
 - else
 - map-page = page;
 - spin_unlock_irq(pidmap_lock);
 - if (unlikely(!map-page))
 - break;
 - }
 + if (alloc_pidmap_page(map))
 + break;
 +
   if (likely(atomic_read(map-nr_free))) {
   do {
   if (!test_and_set_bit(offset, map-page)) {
 -- 
 1.5.2.5


[Devel] Re: [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()

2009-05-27 Thread Serge E. Hallyn
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com):
 
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:43 -0700
 Subject: [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()
 
 The new parameter will be used in a follow-on patch when clone_with_pids()
 is implemented.
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

Acked-by: Serge Hallyn se...@us.ibm.com

 ---
  kernel/fork.c |7 ---
  1 files changed, 4 insertions(+), 3 deletions(-)
 
 diff --git a/kernel/fork.c b/kernel/fork.c
 index d2d69d3..373411e 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long 
 clone_flags,
   unsigned long stack_size,
   int __user *child_tidptr,
   struct pid *pid,
 + pid_t *target_pids,
   int trace)
  {
   int retval;
   struct task_struct *p;
   int cgroup_callbacks_done = 0;
 - pid_t *target_pids = NULL;
 
   if ((clone_flags  (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
   return ERR_PTR(-EINVAL);
 @@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
   struct pt_regs regs;
 
   task = copy_process(CLONE_VM, 0, idle_regs(regs), 0, NULL,
 - init_struct_pid, 0);
 + init_struct_pid, NULL, 0);
   if (!IS_ERR(task))
   init_idle(task, cpu);
 
 @@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags,
   struct task_struct *p;
   int trace = 0;
   long nr;
 + pid_t *target_pids = NULL;
 
   /*
* Do some preliminary argument and permissions checking before we
 @@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags,
   trace = tracehook_prepare_clone(clone_flags);
 
   p = copy_process(clone_flags, stack_start, regs, stack_size,
 -  child_tidptr, NULL, trace);
 +  child_tidptr, NULL, target_pids, trace);
   /*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
 -- 
 1.5.2.5


[Devel] Re: [PATCH 4/7] [PATCH] Add target_pids parameter to alloc_pid()

2009-05-27 Thread Serge E. Hallyn
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com):
 
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:42 -0700
 Subject: [PATCH 4/7] [PATCH] Add target_pids parameter to alloc_pid()
 
 This parameter is currently NULL, but will be used in a follow-on patch.
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

Acked-by: Serge Hallyn se...@us.ibm.com

 ---
  include/linux/pid.h |2 +-
  kernel/fork.c   |3 ++-
  kernel/pid.c|9 +++--
  3 files changed, 10 insertions(+), 4 deletions(-)
 
 diff --git a/include/linux/pid.h b/include/linux/pid.h
 index 49f1c2f..914185d 100644
 --- a/include/linux/pid.h
 +++ b/include/linux/pid.h
 @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
  int next_pidmap(struct pid_namespace *pid_ns, int last);
 
 -extern struct pid *alloc_pid(struct pid_namespace *ns);
 +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
  extern void free_pid(struct pid *pid);
 
  /*
 diff --git a/kernel/fork.c b/kernel/fork.c
 index f8411a8..d2d69d3 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -954,6 +954,7 @@ static struct task_struct *copy_process(unsigned long 
 clone_flags,
   int retval;
   struct task_struct *p;
   int cgroup_callbacks_done = 0;
 + pid_t *target_pids = NULL;
 
   if ((clone_flags  (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
   return ERR_PTR(-EINVAL);
 @@ -1119,7 +1120,7 @@ static struct task_struct *copy_process(unsigned long 
 clone_flags,
   goto bad_fork_cleanup_io;
 
   if (pid != init_struct_pid) {
 - pid = alloc_pid(p-nsproxy-pid_ns);
 + pid = alloc_pid(p-nsproxy-pid_ns, target_pids);
   if (IS_ERR(pid)) {
   retval = PTR_ERR(pid);
   goto bad_fork_cleanup_io;
 diff --git a/kernel/pid.c b/kernel/pid.c
 index 93406c6..4b2373a 100644
 --- a/kernel/pid.c
 +++ b/kernel/pid.c
 @@ -279,13 +279,14 @@ void free_pid(struct pid *pid)
   call_rcu(pid-rcu, delayed_put_pid);
  }
 
 -struct pid *alloc_pid(struct pid_namespace *ns)
 +struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
  {
   struct pid *pid;
   enum pid_type type;
   int i, nr;
   struct pid_namespace *tmp;
   struct upid *upid;
 + int tpid;
 
   pid = kmem_cache_alloc(ns-pid_cachep, GFP_KERNEL);
   if (!pid)
 @@ -293,7 +294,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
   tmp = ns;
   for (i = ns-level; i = 0; i--) {
 - nr = alloc_pidmap(tmp, 0);
 + tpid = 0;
 + if (target_pids)
 + tpid = target_pids[i];
 +
 + nr = alloc_pidmap(tmp, tpid);
   if (nr  0)
   goto out_free;
 
 -- 
 1.5.2.5


[Devel] Re: [PATCH 6/7] [PATCH] Define do_fork_with_pids()

2009-05-27 Thread Serge E. Hallyn
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com):
 
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:44 -0700
 Subject: [PATCH 6/7] [PATCH] Define do_fork_with_pids()
 
 do_fork_with_pids() is same as do_fork(), except that it takes an
 additional, target_pids, parameter. This parameter, currently unused,
 specifies the target_pids of the process in each of its pid namespaces.
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

Acked-by: Serge Hallyn se...@us.ibm.com

 ---
  include/linux/sched.h |1 +
  kernel/fork.c |   17 ++---
  2 files changed, 15 insertions(+), 3 deletions(-)
 
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index b4c38bc..2173df1 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1995,6 +1995,7 @@ extern int disallow_signal(int);
 
  extern int do_execve(char *, char __user * __user *, char __user * __user *, 
 struct pt_regs *);
  extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned 
 long, int __user *, int __user *);
 +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs 
 *, unsigned long, int __user *, int __user *, pid_t *target_pids);
  struct task_struct *fork_idle(int);
 
  extern void set_task_comm(struct task_struct *tsk, char *from);
 diff --git a/kernel/fork.c b/kernel/fork.c
 index 373411e..912d008 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -1340,17 +1340,17 @@ struct task_struct * __cpuinit fork_idle(int cpu)
   * It copies the process, and if successful kick-starts
   * it and waits for it to finish using the VM if required.
   */
 -long do_fork(unsigned long clone_flags,
 +long do_fork_with_pids(unsigned long clone_flags,
 unsigned long stack_start,
 struct pt_regs *regs,
 unsigned long stack_size,
 int __user *parent_tidptr,
 -   int __user *child_tidptr)
 +   int __user *child_tidptr,
 +   pid_t *target_pids)
  {
   struct task_struct *p;
   int trace = 0;
   long nr;
 - pid_t *target_pids = NULL;
 
   /*
* Do some preliminary argument and permissions checking before we
 @@ -1448,6 +1448,17 @@ long do_fork(unsigned long clone_flags,
   return nr;
  }
 
 +long do_fork(unsigned long clone_flags,
 +   unsigned long stack_start,
 +   struct pt_regs *regs,
 +   unsigned long stack_size,
 +   int __user *parent_tidptr,
 +   int __user *child_tidptr)
 +{
 + return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
 + parent_tidptr, child_tidptr, NULL);
 +}
 +
  #ifndef ARCH_MIN_MMSTRUCT_ALIGN
  #define ARCH_MIN_MMSTRUCT_ALIGN 0
  #endif
 -- 
 1.5.2.5


[Devel] [RFC v16][PATCH 39/43] c/r (ipc): export interface from ipc/sem.c to cleanup ipc sem

2009-05-27 Thread Oren Laadan
Export freeary() which will be used in the next patch during restart
to cleanup an ipc sem.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 ipc/sem.c  |3 +--
 ipc/util.h |1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 207dbbb..c60076e 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -93,7 +93,6 @@
 #define sem_checkid(sma, semid) ipc_checkid(&sma->sem_perm, semid)
 
 static int newary(struct ipc_namespace *, struct ipc_params *, int);
-static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 #endif
@@ -521,7 +520,7 @@ static void free_un(struct rcu_head *head)
  * as a writer and the spinlock for this semaphore set hold. sem_ids.rw_mutex
  * remains locked on exit.
  */
-static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
 {
struct sem_undo *un, *tu;
struct sem_queue *q, *tq;
diff --git a/ipc/util.h b/ipc/util.h
index 2a05fb3..347ffb2 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -185,6 +185,7 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 extern int do_shmget(key_t key, size_t size, int shmflg, int req_id);
 extern int do_msgget(key_t key, int msgflg, int req_id);
 extern void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+extern void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
 
 extern void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
 
-- 
1.6.0.4



[Devel] [RFC v16][PATCH 16/43] c/r: export shmem_getpage() to support shared memory

2009-05-27 Thread Oren Laadan
Export functionality to retrieve specific pages from shared memory
given an inode in shmem-fs; this will be used in the next two patches
to provide support for c/r of shared memory.

mm/shmem.c:
- shmem_getpage() and 'enum sgp_type' moved to linux/mm.h

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 include/linux/mm.h |   11 +++
 mm/shmem.c |   15 ++-
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ae70b50..53e916a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -330,6 +330,17 @@ void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
 
+/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
+enum sgp_type {
+   SGP_READ,   /* don't exceed i_size, don't allocate page */
+   SGP_CACHE,  /* don't exceed i_size, may allocate page */
+   SGP_DIRTY,  /* like SGP_CACHE, but set new page dirty */
+   SGP_WRITE,  /* may exceed i_size, may allocate page */
+};
+
+extern int shmem_getpage(struct inode *inode, unsigned long idx,
+struct page **pagep, enum sgp_type sgp, int *type);
+
 /*
  * Compound pages have a destructor function.  Provide a
  * prototype for that function and accessor functions.
diff --git a/mm/shmem.c b/mm/shmem.c
index b25f95c..f260336 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -99,14 +99,6 @@ static struct vfsmount *shm_mnt;
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20
 
-/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
-enum sgp_type {
-   SGP_READ,   /* don't exceed i_size, don't allocate page */
-   SGP_CACHE,  /* don't exceed i_size, may allocate page */
-   SGP_DIRTY,  /* like SGP_CACHE, but set new page dirty */
-   SGP_WRITE,  /* may exceed i_size, may allocate page */
-};
-
 #ifdef CONFIG_TMPFS
 static unsigned long shmem_default_max_blocks(void)
 {
@@ -119,9 +111,6 @@ static unsigned long shmem_default_max_inodes(void)
 }
 #endif
 
-static int shmem_getpage(struct inode *inode, unsigned long idx,
-struct page **pagep, enum sgp_type sgp, int *type);
-
 static inline struct page *shmem_dir_alloc(gfp_t gfp_mask)
 {
/*
@@ -1202,8 +1191,8 @@ static inline struct mempolicy *shmem_get_sbmpol(struct 
shmem_sb_info *sbinfo)
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache
  */
-static int shmem_getpage(struct inode *inode, unsigned long idx,
-   struct page **pagep, enum sgp_type sgp, int *type)
+int shmem_getpage(struct inode *inode, unsigned long idx,
+ struct page **pagep, enum sgp_type sgp, int *type)
 {
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
-- 
1.6.0.4



[Devel] [RFC v16][PATCH 08/43] c/r: introduce '->checkpoint()' method in 'struct file_operations'

2009-05-27 Thread Oren Laadan
While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.

This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single generic
f_op entry.

Also introduce vfs_fcntl() so that it can be called from restart (see
patch adding restart of files).
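
To make the idea concrete, here is a minimal user-space sketch rather than kernel code. All demo_* names are hypothetical stand-ins (for struct ckpt_ctx, struct file and struct file_operations), and returning -EBADF for a file type without a method is an assumption made for illustration; the patch itself only adds the hook.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct ckpt_ctx and struct file. */
struct demo_ctx { int files_saved; };
struct demo_file;

struct demo_file_ops {
	/* Optional: left NULL when the file type does not opt in. */
	int (*checkpoint)(struct demo_ctx *ctx, struct demo_file *file);
};

struct demo_file {
	const struct demo_file_ops *ops;
};

/* A generic method that simple file types could share. */
static int demo_generic_checkpoint(struct demo_ctx *ctx, struct demo_file *file)
{
	(void)file;
	ctx->files_saved++;
	return 0;
}

/* Core code only dispatches; it needs no per-filesystem knowledge. */
static int demo_checkpoint_file(struct demo_ctx *ctx, struct demo_file *file)
{
	assert(ctx != NULL && file != NULL);
	if (!file->ops || !file->ops->checkpoint)
		return -EBADF;	/* assumption: refuse files without a method */
	return file->ops->checkpoint(ctx, file);
}
```

With this shape, supporting a simple file type amounts to one extra line in its operations table, which is the property the description above is after.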

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 fs/fcntl.c   |   21 +
 include/linux/checkpoint_types.h |2 ++
 include/linux/fs.h   |6 ++
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 1ad7031..17020a9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -337,6 +337,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned 
long arg,
return err;
 }
 
+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+   int err;
+
+   err = security_file_fcntl(filp, cmd, arg);
+   if (err)
+   goto out;
+   err = do_fcntl(fd, cmd, arg, filp);
+ out:
+   return err;
+}
+
 SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 {  
struct file *filp;
@@ -346,14 +358,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, 
cmd, unsigned long, arg)
if (!filp)
goto out;
 
-   err = security_file_fcntl(filp, cmd, arg);
-   if (err) {
-   fput(filp);
-   return err;
-   }
-
-   err = do_fcntl(fd, cmd, arg, filp);
-
+   err = vfs_fcntl(fd, cmd, arg, filp);
fput(filp);
 out:
return err;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index c1032fa..9c14034 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -15,6 +15,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/sched.h>
+
 struct ckpt_ctx {
int crid;   /* unique checkpoint id */
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9c4348a..60d9229 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -8,6 +8,7 @@
 
 #include <linux/limits.h>
 #include <linux/ioctl.h>
+#include <linux/checkpoint_types.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -1082,6 +1083,8 @@ struct file_lock {
 
 #include <linux/fcntl.h>
 
+extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
+
 extern void send_sigio(struct fown_struct *fown, int fd, int band);
 
 /* fs/sync.c */
@@ -1508,6 +1511,7 @@ struct file_operations {
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t 
*, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info 
*, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
+   int (*checkpoint)(struct ckpt_ctx *, struct file *);
 };
 
 struct inode_operations {
@@ -2306,6 +2310,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
 loff_t inode_get_bytes(struct inode *inode);
 void inode_set_bytes(struct inode *inode, loff_t bytes);
 
+#define generic_file_checkpoint NULL
+
 extern int vfs_readdir(struct file *, filldir_t, void *);
 
 extern int vfs_stat(char __user *, struct kstat *);
-- 
1.6.0.4



[Devel] Re: [PATCH 7/7] [PATCH] Define clone_with_pids syscall

2009-05-27 Thread Serge E. Hallyn
Quoting Sukadev Bhattiprolu (suka...@linux.vnet.ibm.com):
 
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:45 -0700
 Subject: [PATCH 7/7] [PATCH] Define clone_with_pids syscall
 
 clone_with_pids() is the same as clone(), except that it takes a
 'target_pid_set' parameter which lets the caller choose a specific pid number
 for the child process in each of the child process's pid namespaces. This
 system call would be needed to implement Checkpoint/Restart (i.e., after a
 checkpoint, restart a process with its original pids).

I think you should point out here that CAP_SYS_ADMIN is needed
to use the syscall, so unprivileged tasks can't use this to try to
play games with /var/run/*.pid.

...

 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

Acked-by: Serge Hallyn se...@us.ibm.com

-serge


[Devel] [RFC v16][PATCH 03/43] c/r: create syscalls: sys_checkpoint, sys_restart

2009-05-27 Thread Oren Laadan
Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a pid, a file descriptor (for the image file) and
flags as arguments. The pid identifies the top-most (root) task in the
process tree, e.g. the container init: for sys_checkpoint the first
argument identifies the pid of the target container/subtree; for
sys_restart it will identify the pid of the restarting root task.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.

We don't use copy_from_user()/copy_to_user() because it requires
holding the entire image in user space, and does not make sense for
restart.  Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic.  They also would significantly complicate
checkpoint that includes self.
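
As an illustration of the fork()-like convention described above, the caller's side could be sketched as follows. This is only a sketch under assumptions: the demo_* names are hypothetical, a positive value is taken to be the checkpoint identifier, and treating 0 as "returned from restart" is assumed purely by analogy with fork(); the description does not spell out the exact values.

```c
#include <assert.h>

enum demo_ckpt_outcome {
	DEMO_CKPT_ERROR,	/* negative errno-style value */
	DEMO_CKPT_RESTARTED,	/* execution resumed from a restart */
	DEMO_CKPT_TAKEN,	/* checkpoint written; value is its identifier */
};

/* Classify the value returned to a task that checkpointed itself. */
static enum demo_ckpt_outcome demo_classify(long ret)
{
	if (ret < 0)
		return DEMO_CKPT_ERROR;
	if (ret == 0)
		return DEMO_CKPT_RESTARTED;	/* assumption, by analogy with fork() */
	return DEMO_CKPT_TAKEN;
}

/* Only meaningful when a checkpoint was actually taken. */
static long demo_checkpoint_id(long ret)
{
	assert(demo_classify(ret) == DEMO_CKPT_TAKEN);
	return ret;
}
```

A descriptor-based interface could not carry this three-way distinction in a single return value, which is the point the paragraph above makes.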

Changelog[v16]:
  - Change sys_restart() first argument to be 'pid_t pid'

Changelog[v14]:
  - Change CONFIG_CHEKCPOINT_RESTART to CONFIG_CHECKPOINT (Ingo)
  - Remove line 'def_bool n' (default is already 'n')
  - Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)

Changelog[v5]:
  - Config is 'def_bool n' by default

Signed-off-by: Oren Laadan or...@cs.columbia.edu
Acked-by: Serge Hallyn se...@us.ibm.com
Signed-off-by: Dave Hansen d...@linux.vnet.ibm.com
---
 arch/x86/Kconfig   |4 +++
 arch/x86/include/asm/unistd_32.h   |2 +
 arch/x86/kernel/syscall_table_32.S |2 +
 checkpoint/Kconfig |   14 
 checkpoint/Makefile|5 
 checkpoint/sys.c   |   41 
 include/linux/syscalls.h   |2 +
 init/Kconfig   |2 +
 kernel/sys_ni.c|4 +++
 9 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a6efe0a..2891a26 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -77,6 +77,10 @@ config STACKTRACE_SUPPORT
 config HAVE_LATENCYTOP_SUPPORT
def_bool y
 
+config CHECKPOINT_SUPPORT
+   bool
+   default y if X86_32
+
 config FAST_CMPXCHG_LOCAL
bool
default y
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..48557e1 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,8 @@
 #define __NR_inotify_init1 332
 #define __NR_preadv333
 #define __NR_pwritev   334
+#define __NR_checkpoint335
+#define __NR_restart   336
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S 
b/arch/x86/kernel/syscall_table_32.S
index ff5c873..e70b7ee 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,5 @@ ENTRY(sys_call_table)
.long sys_inotify_init1
.long sys_preadv
.long sys_pwritev
+   .long sys_checkpoint
+   .long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 000..1761b0a
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,14 @@
+# Architectures should define CHECKPOINT_SUPPORT when they have
+# implemented the hooks for processor state etc. needed by the
+# core checkpoint/restart code.
+
+config CHECKPOINT
+   bool "Enable checkpoint/restart (EXPERIMENTAL)"
+   depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+   help
+ Application checkpoint/restart is the ability to save the
+ state of a running application so that it can later resume
+ its execution from the time at which it was checkpointed.
+
+ Turning this option on will enable checkpoint and restart
+ functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 000..8a32c6f
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 000..9d4caff
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ 

[Devel] [RFC v16][PATCH 24/43] c/r: detect resource leaks for whole-container checkpoint

2009-05-27 Thread Oren Laadan
Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
checkpoint, return an error code if the actual objects' counts are
higher, indicating leaks (references to the objects from a task not
being checkpointed).  Of course, by this time most of the checkpoint
image has been written out to disk, so this is purely advisory.  But
then, it's probably naive to argue that anything more than an advisory
'this went wrong' error code is useful.

The comparison of the objhash user counts to object refcounts as a
basis for checking for leaks comes from Alexey's OpenVZ-based c/r
patchset.
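
The comparison can be modelled in a few lines of user-space C. This is a sketch under assumptions, not the objhash code: struct demo_obj and demo_objs_contained() are hypothetical names, and each entry simply carries the two counters being compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one objhash entry. */
struct demo_obj {
	int users;	/* references counted while walking checkpointed tasks */
	int refcount;	/* the object's actual reference count */
};

/* Returns true when no object is referenced from outside the set. */
static bool demo_objs_contained(const struct demo_obj *objs, size_t n)
{
	assert(objs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (objs[i].refcount > objs[i].users)
			return false;	/* someone else holds a reference: leak */
	}
	return true;
}
```

In the patch, a failed containment check makes do_checkpoint() return -EBUSY for a whole-container checkpoint.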

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/checkpoint.c|8 
 checkpoint/objhash.c   |   82 ++--
 include/linux/checkpoint.h |1 +
 3 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 92f219e..b70adf4 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -578,6 +578,14 @@ int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
if (ret < 0)
goto out;
 
+   if (!(ctx->uflags & CHECKPOINT_SUBTREE)) {
+   /* verify that all objects are contained (no leaks) */
+   if (!ckpt_obj_contained(ctx)) {
+   ret = -EBUSY;
+   goto out;
+   }
+   }
+
/* on success, return (unique) checkpoint identifier */
ctx->crid = atomic_inc_return(&ctx_count);
ret = ctx->crid;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index ff9388d..e481911 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -27,19 +27,23 @@ struct ckpt_obj_ops {
enum obj_type obj_type;
void (*ref_drop)(void *ptr);
int (*ref_grab)(void *ptr);
+   int (*ref_users)(void *ptr);
int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
void *(*restore)(struct ckpt_ctx *ctx);
 };
 
 struct ckpt_obj {
+   int users;
int objref;
void *ptr;
struct ckpt_obj_ops *ops;
struct hlist_node hash;
+   struct hlist_node next;
 };
 
 struct ckpt_obj_hash {
struct hlist_head *head;
+   struct hlist_head list;
int next_free_objref;
 };
 
@@ -53,7 +57,7 @@ void *restore_bad(struct ckpt_ctx *ctx)
return ERR_PTR(-EINVAL);
 }
 
-/* helper grab/drop functions: */
+/* helper grab/drop/users functions */
 
 static void obj_no_drop(void *ptr)
 {
@@ -86,6 +90,11 @@ static void obj_file_table_drop(void *ptr)
put_files_struct((struct files_struct *) ptr);
 }
 
+static int obj_file_table_users(void *ptr)
+{
+   return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
 static int obj_file_grab(void *ptr)
 {
get_file((struct file *) ptr);
@@ -97,6 +106,11 @@ static void obj_file_drop(void *ptr)
fput((struct file *) ptr);
 }
 
+static int obj_file_users(void *ptr)
+{
+   return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
 static int obj_mm_grab(void *ptr)
 {
atomic_inc(&((struct mm_struct *) ptr)->mm_users);
@@ -108,6 +122,11 @@ static void obj_mm_drop(void *ptr)
mmput((struct mm_struct *) ptr);
 }
 
+static int obj_mm_users(void *ptr)
+{
+   return atomic_read(&((struct mm_struct *) ptr)->mm_users);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
/* ignored object */
{
@@ -131,6 +150,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.obj_type = CKPT_OBJ_FILE_TABLE,
.ref_drop = obj_file_table_drop,
.ref_grab = obj_file_table_grab,
+   .ref_users = obj_file_table_users,
.checkpoint = checkpoint_file_table,
.restore = restore_file_table,
},
@@ -140,6 +160,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.obj_type = CKPT_OBJ_FILE,
.ref_drop = obj_file_drop,
.ref_grab = obj_file_grab,
+   .ref_users = obj_file_users,
.checkpoint = checkpoint_file,
.restore = restore_file,
},
@@ -149,6 +170,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.obj_type = CKPT_OBJ_MM,
.ref_drop = obj_mm_drop,
.ref_grab = obj_mm_grab,
+   .ref_users = obj_mm_users,
.checkpoint = checkpoint_mm,
.restore = restore_mm,
},
@@ -201,6 +223,7 @@ int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)
 
obj_hash->head = head;
obj_hash->next_free_objref = 1;
+   INIT_HLIST_HEAD(&obj_hash->list);
 
ctx->obj_hash = obj_hash;
return 0;
@@ -259,6 +282,7 @@ static int obj_new(struct ckpt_ctx *ctx, void *ptr, int 
objref,
 
obj->ptr = ptr;
obj->ops = ops;
+   obj->users = 2;  /* extra reference that objhash itself takes */
 
if (objref) {
/* use @obj->objref to index (restart) */
@@ -271,10 +295,12 @@ 

[Devel] Re: [PATCH 6/8] cr: checkpoint and restore task credentials

2009-05-27 Thread Alexey Dobriyan
On Tue, May 26, 2009 at 12:33:54PM -0500, Serge E. Hallyn wrote:
 +struct ckpt_hdr_cred {
 + struct ckpt_hdr h;
 + __u32 version; /* especially since capability sets might grow */

Oh, no. Image version should be incremented.

 + __u32 uid, suid, euid, fsuid;
 + __u32 gid, sgid, egid, fsgid;
 + __u64 cap_i, cap_p, cap_e;
 + __u64 cap_x;  /* bounding set ('X') */
 + __s32 user_ref;
 + __s32 groupinfo_ref;
 + __u32 padding;
 +} __attribute__((aligned(8)));
 +
 +struct ckpt_hdr_groupinfo {
 + struct ckpt_hdr h;
 + __u32 ngroups;
 + /*
 +  * This is followed by ngroups __u32s
 +  */
 + __u32 groups[0];
 +} __attribute__((aligned(8)));

 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1871,6 +1871,12 @@ static inline struct user_struct *get_uid(struct 
 user_struct *u)
  extern void free_uid(struct user_struct *);
  extern void release_uids(struct user_namespace *ns);
  
 +#ifdef CONFIG_CHECKPOINT
 +struct ckpt_ctx;
 +int checkpoint_write_user(struct ckpt_ctx *, struct user_struct *);
 +struct user_struct *restore_read_user(struct ckpt_ctx *);
 +#endif

I'll rip credential stuff from sched.h, better not add more.

 --- a/kernel/groups.c
 +++ b/kernel/groups.c
 @@ -287,3 +288,58 @@ int in_egroup_p(gid_t grp)
  }
  
  EXPORT_SYMBOL(in_egroup_p);
 +
 +#ifdef CONFIG_CHECKPOINT
 +int checkpoint_write_groupinfo(struct ckpt_ctx *ctx, struct group_info *g)
 +{
 + int ret, i, size;
 + struct ckpt_hdr_groupinfo *h;
 +
 + size = sizeof(*h) + g->ngroups * sizeof(__u32);
 + h = ckpt_hdr_get_type(ctx, size, CKPT_HDR_GROUPINFO);
 + if (!h)
 + return -ENOMEM;
 +
 + h->ngroups = g->ngroups;
 + for (i = 0; i < g->ngroups; i++)
 + h->groups[i] = GROUP_AT(g, i);
 +
 + ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
 + ckpt_hdr_put(ctx, h);
 +
 + return ret;
 +}
 +
 +/*
 + * TODO - switch to reading in blocks, and only return an
 + * error for truly obscene # groups (like 1)
 + */
 +#define CKPT_MAXGROUPS 100
 +#define MAX_GROUPINFO_SIZE (sizeof(*h)+CKPT_MAXGROUPS*sizeof(gid_t))
 +struct group_info *restore_read_groupinfo(struct ckpt_ctx *ctx)
 +{
 + struct group_info *g;
 + struct ckpt_hdr_groupinfo *h;
 + int i;
 +
 + h = ckpt_read_buf_type(ctx, MAX_GROUPINFO_SIZE, CKPT_HDR_GROUPINFO);
 + if (IS_ERR(h))
 + return ERR_PTR(PTR_ERR(h));
 + if (h->ngroups > CKPT_MAXGROUPS) {
 + g = ERR_PTR(-EINVAL);
 + goto out;
 + }
 + g = groups_alloc(h->ngroups);
 + if (!g) {
 + g = ERR_PTR(-ENOMEM);
 + goto out;
 + }
 + for (i = 0; i < h->ngroups; i++)
 + GROUP_AT(g, i) = h->groups[i];
 +
 +out:
 + ckpt_hdr_put(ctx, h);
 + return g;
 +}

No checks that a) the groups in the image are sorted, b) ->ngroups is compatible
with the object image.
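
A rough sketch of what those two checks might look like, written as a standalone user-space function rather than against the patch. The demo_* names and the payload_len parameter are hypothetical (the quoted code does not expose the record length this way), and "sorted" is assumed to mean non-decreasing gid order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAXGROUPS 100	/* mirrors CKPT_MAXGROUPS in the quoted patch */

/*
 * payload_len: number of bytes that follow the fixed header in the image
 * (hypothetical parameter, used here to cross-check ngroups).
 */
static bool demo_groupinfo_valid(uint32_t ngroups, const uint32_t *groups,
				 size_t payload_len)
{
	assert(groups != NULL || ngroups == 0);

	if (ngroups > DEMO_MAXGROUPS)
		return false;
	/* (b) the count must match the data actually present in the image */
	if (payload_len != (size_t)ngroups * sizeof(uint32_t))
		return false;
	/* (a) gids must appear in non-decreasing order */
	for (uint32_t i = 1; i < ngroups; i++) {
		if (groups[i - 1] > groups[i])
			return false;
	}
	return true;
}
```

Rejecting the record before groups_alloc() would keep a malformed image from producing a group_info that later lookups cannot search correctly.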


[Devel] Re: [PATCH 0/8] a start to credentials c/r

2009-05-27 Thread Serge E. Hallyn
Quoting Casey Schaufler (ca...@schaufler-ca.com):
 Serge E. Hallyn wrote:
  Quoting Casey Schaufler (ca...@schaufler-ca.com):
...
  Uh, so yes, based on info in the file as well  :)  Except
  of course the LSM would just be fed the checkpointed context
  and the checkpoint file context (and can deduce current's context).

 
 And SELinux can do whatever calculations it likes based on the
 three contexts and the loaded policy.  Are you at all concerned
 about the possibility that the policy may have changed? I can
 envision scenarios in which it would be impossible for a process
 to gain a particular context under current policy, but that a
 checkpointed process may have stored away.

Good point.  But on the other hand, if the program were running
the whole time, instead of being checkpointed and restarted, then
the running program wouldn't be relabeled when the policy changed,
right?  Now if the domain becomes invalid, then presumably the
restart would fail.  But if the (source_domain,entry_type)->new_domain
set changes from (root_t,x_entry_t)->x_t to (root_t,x_entry_t)->y_t,
a task running as x_t won't be relabeled to y_t.  So I don't think
restarting a task which is checkpointed as x_t, under the x_t
domain, is wrong.

   and one which determines
  the task->cred->security field based upon any of:
1. current_security() of the task calling sys_restart()
2. the task->cred->security checkpointed in the ckpt file
3. the ->security of the checkpoint file


  For Smack the correct behavior would be:
 
  1. for sys_restart() callers without CAP_MAC_ADMIN
  2. for sys_restart() callers with CAP_MAC_ADMIN
  3. never
  
 
  That makes sense, and is basically analogous (if I'm thinking
  right) to how I'm doing capabilities.
 
  So the first (authorization hook) for smack would just always
  return TRUE?

 
 I suggest that it needs to check for a valid Smack label. Even though
 they're just text strings they do have limitations, including size
 (> 0 < 24) and character set. A call to smk_import() is the right
 way to do it, as it also makes sure the label is in the internal list.
 If smk_import() returns NULL something's amiss.

Ok, thanks.

-serge


[Devel] Re: [RFC v16][PATCH 41/43] c/r: (s390): expose a constant for the number of words (CRs)

2009-05-27 Thread Alexey Dobriyan
On Wed, May 27, 2009 at 01:33:07PM -0400, Oren Laadan wrote:
 We need to use this value in the checkpoint/restart code and would like to
 have a constant instead of a magic '3'.
 
 Changelog:
 Mar 30:
 . Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
 Mar 03:
 . Picked up additional use of magic '3' in ptrace.h
 
 Signed-off-by: Dan Smith da...@us.ibm.com
 ---
  arch/s390/Kconfig |4 
  1 files changed, 4 insertions(+), 0 deletions(-)
 
 diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
 index 2eca5fe..bf62cad 100644
 --- a/arch/s390/Kconfig
 +++ b/arch/s390/Kconfig
 @@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
  config GENERIC_CLOCKEVENTS
   def_bool y
  
 +config CHECKPOINT_SUPPORT
 + bool
 + default y if 64BIT
 +
  config GENERIC_BUG
   bool
   depends on BUG

Changelog and content aren't compatible.


[Devel] Re: [PATCH 18/38] C/R: core stuff

2009-05-27 Thread Dave Hansen
On Tue, 2009-05-26 at 23:35 +0400, Alexey Dobriyan wrote:
 The other part, is that I looked at Oren patchset, found quite a lot of
 suspicious, broken and unclean places and decided that it'd be faster
 to start from scratch because sending patches will overhaul like 85% of
 the code.

I know the feeling.  I get sudden urges to rewrite the whole patch set,
but I'm working on getting past them too. :)

As long as we have two patch sets, *nobody* is going to get their
patches in, that's virtually guaranteed.  Just look at the poor I/O
controller.

The OpenVZ users are almost certainly the most important container and
c/r users out there today.  Meeting their needs with whatever we come up
with should be a top priority and I know I'm counting on you to help us
do that.

But, I'm having a really hard time culling the OpenVZ user needs from
your patch set.  If we really need to rewrite 85% of Oren's stuff to
meet the OpenVZ needs, then by all means let's do it.  I'm even willing
to help you.  But, I honestly don't know what you need.

Can we talk about specifics?

-- Dave



[Devel] [RFC v16][PATCH 26/43] splice: added support for pipe-to-pipe splice()

2009-05-27 Thread Oren Laadan
This patch is a modified version of Max Kellerman's patch that allows
splice() between pipes (see http://patchwork.kernel.org/patch/21042).
By refactoring link_pipe(), do_tee() and do_splice_pipes() shrink
considerably. Below is Max's original description:

--
This patch enables the splice() system call to copy buffers from one
pipe to another.  This obvious and trivial use case for splice() was
not supported until now.

It reuses the functions link_ipipe_prep() and link_opipe_prep() from
the tee() system call implementation.
--
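
From user space, the use case described above might look roughly like the following. This is a sketch assuming a Linux system whose kernel accepts pipes on both sides of splice(); demo_pipe_to_pipe() is a hypothetical helper name, and only the splice(2) call itself is the real interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move up to len bytes from the read end of one pipe to the write end
 * of another without copying the data through a user-space buffer.
 * Returns the number of bytes moved, or -1 with errno set.
 */
static ssize_t demo_pipe_to_pipe(int in_rd, int out_wr, size_t len)
{
	assert(in_rd >= 0 && out_wr >= 0);
	/* Offsets must be NULL when the descriptor refers to a pipe. */
	return splice(in_rd, NULL, out_wr, NULL, len, 0);
}
```

A caller would create two pipes with pipe(2), write into the first, call the helper with the first pipe's read end and the second pipe's write end, and then read the same bytes from the second pipe.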

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 fs/splice.c |  203 ---
 1 files changed, 166 insertions(+), 37 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 92dd63c..96e0d58 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -903,13 +903,95 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info 
*pipe, struct file *out,
 EXPORT_SYMBOL(generic_splice_sendpage);
 
 /*
+ * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
+ * location, so checking ->i_pipe is not enough to verify that this is a
+ * pipe.
+ */
+static inline struct pipe_inode_info *pipe_info(struct inode *inode)
+{
+   if (S_ISFIFO(inode->i_mode))
+   return inode->i_pipe;
+
+   return NULL;
+}
+
+static int link_pipe_prep(struct pipe_inode_info *ipipe,
+ struct pipe_inode_info *opipe,
+ unsigned int flags);
+static long do_link_pipe(struct pipe_inode_info *ipipe,
+struct pipe_inode_info *opipe,
+size_t len, unsigned int flags, int move);
+
+/**
+* Splice pages from one pipe to another.
+*
+* @ipipe the input pipe
+* @opipe the output pipe
+* @len the maximum number of bytes to move
+* @flags splice modifier flags
+*/
+static long do_splice_pipes(struct pipe_inode_info *ipipe,
+   struct pipe_inode_info *opipe,
+   size_t len, unsigned int flags)
+{
+   int do_wakeup;
+   long ret;
+
+   if (ipipe == opipe)
+   return -EINVAL;
+
+   ret = link_pipe_prep(ipipe, opipe, flags);
+   if (ret < 0)
+   return ret;
+
+   /* both pipes are now locked */
+
+   do_wakeup = ipipe->nrbufs;
+   ret = do_link_pipe(ipipe, opipe, len, flags, 1);
+   do_wakeup -= ipipe->nrbufs;
+
+   pipe_unlock(ipipe);
+   pipe_unlock(opipe);
+
+   if (do_wakeup) {
+   /* at least one buffer was removed from the
+  input pipe: wake up potential writers */
+   smp_mb();
+   if (waitqueue_active(&ipipe->wait))
+   wake_up_interruptible(&ipipe->wait);
+   kill_fasync(&ipipe->fasync_writers, SIGIO, POLL_OUT);
+   }
+
+   /*
+* If we put data in the output pipe, wakeup any potential
+* readers.
+*/
+   if (ret > 0) {
+   smp_mb();
+   if (waitqueue_active(&opipe->wait))
+   wake_up_interruptible(&opipe->wait);
+   kill_fasync(&opipe->fasync_readers, SIGIO, POLL_IN);
+   }
+
+   return ret;
+}
+
+/*
  * Attempt to initiate a splice from pipe to file.
  */
 static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
   loff_t *ppos, size_t len, unsigned int flags)
 {
+   struct pipe_inode_info *opipe;
int ret;
 
+   opipe = pipe_info(out->f_dentry->d_inode);
+   if (opipe) {
+   if (unlikely(!(out->f_mode & FMODE_WRITE)))
+   return -EBADF;
+   return do_splice_pipes(pipe, opipe, len, flags);
+   }
+
if (unlikely(!out->f_op || !out->f_op->splice_write))
return -EINVAL;
 
@@ -933,8 +1015,16 @@ static long do_splice_to(struct file *in, loff_t *ppos,
 struct pipe_inode_info *pipe, size_t len,
 unsigned int flags)
 {
+   struct pipe_inode_info *ipipe;
int ret;
 
+   ipipe = pipe_info(in->f_dentry->d_inode);
+   if (ipipe) {
+   if (unlikely(!(in->f_mode & FMODE_READ)))
+   return -EBADF;
+   return do_splice_pipes(ipipe, pipe, len, flags);
+   }
+
if (unlikely(!in->f_op || !in->f_op->splice_read))
return -EINVAL;
 
@@ -1113,19 +1203,6 @@ long do_splice_direct(struct file *in, loff_t *ppos, 
struct file *out,
 }
 
 /*
- * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
- * location, so checking ->i_pipe is not enough to verify that this is a
- * pipe.
- */
-static inline struct pipe_inode_info *pipe_info(struct inode *inode)
-{
-   if (S_ISFIFO(inode->i_mode))
-   return inode->i_pipe;
-
-   return NULL;
-}
-
-/*
  * Determine where to splice to/from.
  */
 static long do_splice(struct file *in, loff_t __user *off_in,
@@ -1140,7 +1217,10 @@ static long do_splice(struct file *in, loff_t __user 

[Devel] [RFC v16][PATCH 01/43] c/r: extend arch_setup_additional_pages()

2009-05-27 Thread Oren Laadan
From: Alexey Dobriyan adobri...@gmail.com

Add a 'start' argument to request that the vDSO be mapped at a specific
address, and fail the operation if it cannot be.

This is useful for restart(2) to ensure that the memory layout is restored
exactly as needed.

Signed-off-by: Alexey Dobriyan adobri...@gmail.com
Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 arch/powerpc/include/asm/elf.h |1 +
 arch/powerpc/kernel/vdso.c |   11 ++-
 arch/s390/include/asm/elf.h|2 +-
 arch/s390/kernel/vdso.c|   13 -
 arch/sh/include/asm/elf.h  |1 +
 arch/sh/kernel/vsyscall/vsyscall.c |2 +-
 arch/x86/include/asm/elf.h |3 ++-
 arch/x86/vdso/vdso32-setup.c   |9 +++--
 arch/x86/vdso/vma.c|9 +++--
 fs/binfmt_elf.c|2 +-
 10 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index d6b4a12..3946e01 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -271,6 +271,7 @@ extern int ucache_bsize;
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+  unsigned long start,
   int uses_interp);
 #define VDSO_AUX_ENT(a,b) NEW_AUX_ENT(a,b);
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index ad06d5c..48beff6 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -184,7 +184,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	struct page **vdso_pagelist;
@@ -211,6 +212,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_base = VDSO32_MBASE;
 #endif
 
+	/* in case restart(2) mandates a specific location */
+	if (start)
+		vdso_base = start;
+
 	current->mm->context.vdso_base = 0;
 
 	/* vDSO has a problem and was disabled, just don't enable it for the
@@ -234,6 +239,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto fail_mmapsem;
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && vdso_base != start)
+		goto fail_mmapsem;
+
 	/*
 	 * our vma flags don't have VM_WRITE so by default, the process isn't
 	 * allowed to write those pages.
diff --git a/arch/s390/include/asm/elf.h b/arch/s390/include/asm/elf.h
index 74d0bbb..54235bc 100644
--- a/arch/s390/include/asm/elf.h
+++ b/arch/s390/include/asm/elf.h
@@ -205,6 +205,6 @@ do { \
 struct linux_binprm;
 
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
-int arch_setup_additional_pages(struct linux_binprm *, int);
+int arch_setup_additional_pages(struct linux_binprm *, unsigned long, int);
 
 #endif
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 89b2e7f..bab43b3 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -182,7 +182,8 @@ static void vdso_init_cr5(void)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	struct page **vdso_pagelist;
@@ -213,6 +214,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_pages = vdso32_pages;
 #endif
 
+	/* in case restart(2) mandates a specific location */
+	if (start)
+		vdso_base = start;
+
 	/*
 	 * vDSO has a problem and was disabled, just don't enable it for
 	 * the process
@@ -235,6 +240,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto out_up;
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && vdso_base != start) {
+		rc = -EINVAL;
+		goto out_up;
+	}
+
 	/*
 	 * our vma flags don't have VM_WRITE so by default, the process
 	 * isn't allowed to write those pages.
diff --git a/arch/sh/include/asm/elf.h b/arch/sh/include/asm/elf.h
index ccb1d93..6c27b1f 100644
--- a/arch/sh/include/asm/elf.h
+++ b/arch/sh/include/asm/elf.h
@@ -202,6 +202,7 @@ do {  

[Devel] [RFC v16][PATCH 34/43] c/r: save and restore ipc namespace basics

2009-05-27 Thread Oren Laadan
Save and restore the common state (parameters) of the ipc namespace.

Also add logic to iterate through the objects of sysvipc shared memory,
message queues and semaphores. The logic to save and restore the state
of these objects will be added in the next few patches.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/process.c   |4 -
 include/linux/checkpoint.h |5 +-
 include/linux/checkpoint_hdr.h |   22 +
 ipc/checkpoint.c   |  203 ++--
 4 files changed, 220 insertions(+), 14 deletions(-)

diff --git a/checkpoint/process.c b/checkpoint/process.c
index eff3d76..b604a85 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -121,10 +121,8 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
 
 	if (ns_flags & CLONE_NEWUTS)
 		ret = checkpoint_uts_ns(ctx, nsproxy->uts_ns);
-#if 0
 	if (!ret && (ns_flags & CLONE_NEWIPC))
 		ret = checkpoint_ipc_ns(ctx, nsproxy->ipc_ns);
-#endif
 
 	/* FIX: Write other namespaces here */
 	return ret;
@@ -472,10 +470,8 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 	ckpt_debug("uts ns: %d\n", ret);
 	if (ret < 0)
 		goto out;
-#if 0
 	ret = restore_ipc_ns(ctx, h->ipc_objref, h->flags);
 	ckpt_debug("ipc ns: %d\n", ret);
-#endif
 
 	/* FIX: add more namespaces here */
  out:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 9a7517f..d5498bc 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -85,7 +85,6 @@ extern int restore_restart_block(struct ckpt_ctx *ctx);
 extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
 extern void *restore_ns(struct ckpt_ctx *ctx);
 
-#if 0
 /* ipc-ns */
 #ifdef CONFIG_SYSVIPC
 extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx,
@@ -98,7 +97,9 @@ static inline int checkpoint_ipc_ns(struct ckpt_ctx *ctx,
 static inline int restore_ipc_ns(struct ckpt_ctx *ctx)
 { return 0; }
 #endif /* CONFIG_SYSVIPC */
-#endif
+
+extern int checkpoint_ipcns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns);
+extern int restore_ipcns(struct ckpt_ctx *ctx);
 
 /* file table */
 extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 05769f4..406b5d6 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -305,6 +305,28 @@ struct ckpt_hdr_pgarr {
 
 
 /* ipc commons */
+struct ckpt_hdr_ipcns {
+   struct ckpt_hdr h;
+   __u64 shm_ctlmax;
+   __u64 shm_ctlall;
+   __s32 shm_ctlmni;
+
+   __s32 msg_ctlmax;
+   __s32 msg_ctlmnb;
+   __s32 msg_ctlmni;
+
+   __s32 sem_ctl_msl;
+   __s32 sem_ctl_mns;
+   __s32 sem_ctl_opm;
+   __s32 sem_ctl_mni;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc {
+   struct ckpt_hdr h;
+   __u32 ipc_type;
+   __u32 ipc_count;
+} __attribute__((aligned(8)));
+
 struct ckpt_hdr_ipc_perms {
__s32 id;
__u32 key;
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index b7b48b0..436be5e 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -20,15 +20,12 @@
 
 #include "util.h"
 
-int checkpoint_ipcns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
-{
-   return 0;
-}
+/* for ckpt_debug */
+static char *ipc_ind_to_str[] = { "sem", "msg", "shm" };
 
-int restore_ipcns(struct ckpt_ctx *ctx)
-{
-   return 0;
-}
+/**
+ * Checkpoint
+ */
 
 int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
  struct kern_ipc_perm *perm)
@@ -48,6 +45,82 @@ int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
return 0;
 }
 
+static int checkpoint_ipc_any(struct ckpt_ctx *ctx,
+			      struct ipc_namespace *ipc_ns,
+			      int ipc_ind, int ipc_type,
+			      int (*func)(int id, void *p, void *data))
+{
+	struct ckpt_hdr_ipc *h;
+	struct ipc_ids *ipc_ids = &ipc_ns->ids[ipc_ind];
+	int ret = -ENOMEM;
+
+	down_read(&ipc_ids->rw_mutex);
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+	if (!h)
+		goto out;
+
+	h->ipc_type = ipc_type;
+	h->ipc_count = ipc_ids->in_use;
+	ckpt_debug("ipc-%s count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	if (ret < 0)
+		goto out;
+
+	ret = idr_for_each(&ipc_ids->ipcs_idr, func, ctx);
+	ckpt_debug("ipc-%s ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+ out:
+	up_read(&ipc_ids->rw_mutex);
+	return ret;
+}
+
+int checkpoint_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
+{
+   struct ckpt_hdr_ipcns *h;
+   int ret;
+
+   h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+   if (!h)
+   

[Devel] [RFC v16][PATCH 31/43] deferqueue: generic queue to defer work

2009-05-27 Thread Oren Laadan
Add an interface to postpone an action until the end of the entire
checkpoint or restart operation. This is useful when, during the
scan of tasks, an operation cannot be performed in place; deferring it
avoids the need for a second scan.

One use case is restoring an ipc shared memory region that has
been deleted (but is still attached): during restart it needs to be
created, attached and then deleted. However, creation and attachment
are performed in distinct locations, so deletion cannot be performed
on the spot. Instead, this work (delete) is deferred until later.
(This example is in one of the following patches).

This interface allows chronic procrastination in the kernel:

deferqueue_create(void):
Allocates and returns a new deferqueue.

deferqueue_run(deferqueue):
Executes all the pending works in the queue. Returns the number
of works executed, or an error upon the first error reported by
a deferred work.

deferqueue_add(deferqueue, data, size, func, dtor):
Enqueue a deferred work. @func is the callback function to
do the work, which will be called with @data as an argument.
@size tells the size of data. @dtor is a destructor callback
that is invoked for deferred works remaining in the queue when
the queue is destroyed. NOTE: for a given deferred work, @dtor
is _not_ called if @func was already called (regardless of the
return value of the latter).

deferqueue_destroy(deferqueue):
Free the deferqueue and any queued items while invoking the
@dtor callback for each queued item.
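
The semantics listed above can be sketched in user space as follows.
This is not the patch's implementation: the dq_* names are hypothetical,
there is no locking, and running works in FIFO order is an assumption.
It only illustrates that data is copied at add time, that run stops at
the first error, and that @dtor is reserved for works that never ran.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int (*dq_func_t)(void *);

struct dq_entry {
	dq_func_t func;		/* the deferred work */
	dq_func_t dtor;		/* cleanup if the work never ran */
	struct dq_entry *next;
	char data[];		/* private copy of the caller's data */
};

struct dq_head {
	struct dq_entry *first, *last;
};

static struct dq_head *dq_create(void)
{
	return calloc(1, sizeof(struct dq_head));
}

static int dq_add(struct dq_head *h, const void *data, size_t size,
		  dq_func_t func, dq_func_t dtor)
{
	struct dq_entry *e;

	assert(h && func && data);
	e = malloc(sizeof(*e) + size);
	if (!e)
		return -1;
	e->func = func;
	e->dtor = dtor;
	e->next = NULL;
	memcpy(e->data, data, size);
	if (h->last)
		h->last->next = e;
	else
		h->first = e;
	h->last = e;
	return 0;
}

/* Run pending works in order; stop at the first error. */
static int dq_run(struct dq_head *h)
{
	int nr = 0;

	while (h->first) {
		struct dq_entry *e = h->first;
		int ret;

		h->first = e->next;
		if (!h->first)
			h->last = NULL;
		ret = e->func(e->data);
		free(e);	/* func ran, so dtor is not called */
		if (ret < 0)
			return ret;
		nr++;
	}
	return nr;
}

/* Call dtor for whatever is still queued, then free everything. */
static void dq_destroy(struct dq_head *h)
{
	while (h->first) {
		struct dq_entry *e = h->first;

		h->first = e->next;
		if (e->dtor)
			e->dtor(e->data);
		free(e);
	}
	free(h);
}

/* Demo callbacks: sum up queued ints; a negative int is an error. */
static int dq_sum, dq_dropped;

static int add_to_sum(void *data)
{
	int v;

	memcpy(&v, data, sizeof(v));
	if (v < 0)
		return -1;
	dq_sum += v;
	return 0;
}

static int count_drop(void *data)
{
	(void)data;
	dq_dropped++;
	return 0;
}
```

Queueing 2, 3, -1 and 5 and then running makes the run stop at -1; the
remaining 5 is handed to the destructor when the queue is destroyed.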

Why aren't we using the existing kernel workqueue mechanism?  We need
to defer the work until the end of the operation: not earlier, since we
need other things to be in place; not later, so as not to block waiting
for it. However, the workqueue schedules the work for 'some time later'.
Also, the kernel workqueue may run in any task context, but we often
require that an operation be run in the context of some specific
restarting task (e.g., restoring IPC state of a certain ipc_ns).

Instead, this mechanism is a simple way for the c/r operation as a
whole, and later a task in particular, to defer some action until
later (but not arbitrarily later) _in the restore_ operation.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/Kconfig |5 ++
 include/linux/deferqueue.h |   58 +++
 kernel/Makefile|1 +
 kernel/deferqueue.c|  109 
 4 files changed, 173 insertions(+), 0 deletions(-)

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
index 1761b0a..53ed6fa 100644
--- a/checkpoint/Kconfig
+++ b/checkpoint/Kconfig
@@ -2,9 +2,14 @@
 # implemented the hooks for processor state etc. needed by the
 # core checkpoint/restart code.
 
+config DEFERQUEUE
+   bool
+   default n
+
 config CHECKPOINT
 	bool "Enable checkpoint/restart (EXPERIMENTAL)"
 	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+   select DEFERQUEUE
help
  Application checkpoint/restart is the ability to save the
  state of a running application so that it can later resume
diff --git a/include/linux/deferqueue.h b/include/linux/deferqueue.h
new file mode 100644
index 000..2eb58cf
--- /dev/null
+++ b/include/linux/deferqueue.h
@@ -0,0 +1,58 @@
+/*
+ * deferqueue.h --- deferred work queue handling for Linux.
+ */
+
+#ifndef _LINUX_DEFERQUEUE_H
+#define _LINUX_DEFERQUEUE_H
+
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+/*
+ * This interface allows chronic procrastination in the kernel:
+ *
+ * deferqueue_create(void):
+ * Allocates and returns a new deferqueue.
+ *
+ * deferqueue_run(deferqueue):
+ * Executes all the pending works in the queue. Returns the number
+ * of works executed, or an error upon the first error reported by
+ * a deferred work.
+ *
+ * deferqueue_add(deferqueue, data, size, func, dtor):
+ * Enqueue a deferred work. @func is the callback function to
+ *  do the work, which will be called with @data as an argument.
+ *  @size tells the size of data. @dtor is a destructor callback
+ *  that is invoked for deferred works remaining in the queue when
+ *  the queue is destroyed. NOTE: for a given deferred work, @dtor
+ *  is _not_ called if @func was already called (regardless of the
+ *  return value of the latter).
+ *
+ * deferqueue_destroy(deferqueue):
+ *  Free the deferqueue and any queued items while invoking the
+ *  @dtor callback for each queued item.
+ */
+
+
+typedef int (*deferqueue_func_t)(void *);
+
+struct deferqueue_entry {
+   deferqueue_func_t function;
+   deferqueue_func_t destructor;
+   struct list_head list;
+   char data[0];
+};
+
+struct deferqueue_head {
+   spinlock_t lock;
+   struct list_head list;
+};
+
+struct deferqueue_head *deferqueue_create(void);
+void deferqueue_destroy(struct deferqueue_head *head);
+int 

[Devel] Re: [PATCH 1/1] cr: nsproxy: fix refcounting

2009-05-27 Thread Oren Laadan
thanks, applied.

Serge E. Hallyn wrote:
 [This is the fix for the bug I was trying to nail down earlier today]
 
 If more than one restarted task are to share a checkpointed nsproxy,
 then we must inc the count on the nsproxy for each new task, as
 switch_task_namespaces() does not do that for us.
 
 Signed-off-by: Serge E. Hallyn se...@us.ibm.com
 ---
  checkpoint/process.c |4 +++-
  1 files changed, 3 insertions(+), 1 deletions(-)
 
 diff --git a/checkpoint/process.c b/checkpoint/process.c
 index fa166cd..52d2a9c 100644
 --- a/checkpoint/process.c
 +++ b/checkpoint/process.c
 @@ -603,8 +603,10 @@ static int restore_ns_obj(struct ckpt_ctx *ctx, int 
 ns_objref)
   if (IS_ERR(nsproxy))
   return PTR_ERR(nsproxy);
  
 - if (nsproxy != task_nsproxy(current))
 + if (nsproxy != task_nsproxy(current)) {
 + get_nsproxy(nsproxy);
   switch_task_namespaces(current, nsproxy);
 + }
  
   return 0;
  }

[Devel] [RFC v16][PATCH 40/43] c/r: support semaphore sysv-ipc

2009-05-27 Thread Oren Laadan
Checkpoint of sysvipc semaphores is performed by iterating through all
sem objects and dumping the contents of each one. The semaphore array
of each sem is dumped with that object.

The semaphore array (sem->sem_base) holds an array of 'struct sem',
which is a {int, int}. Because this translates into the same format
on 32- and 64-bit architectures, the checkpoint format is simply the
dump of this array as is.
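
As a rough user-space illustration of why the array can be dumped as is:
the record below is a hypothetical mirror of the two-int 'struct sem'
using fixed-width types, and dump_sem_array() is an invented helper, not
the function from this patch. The sketch checks the layout and copies
the records verbatim; it does not address byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the two-int 'struct sem' described above. */
struct sem_rec {
	int32_t semval;		/* current value */
	int32_t sempid;		/* pid of last operation */
};

/* The record must be exactly 2 x 32 bits for the dump to be portable. */
_Static_assert(sizeof(struct sem_rec) == 8, "unexpected padding");

/* Copy nsems records into buf as is; returns bytes written or -1. */
static long dump_sem_array(const struct sem_rec *sems, uint32_t nsems,
			   void *buf, size_t buflen)
{
	size_t len = (size_t)nsems * sizeof(*sems);

	assert(sems && buf);
	if (len > buflen)
		return -1;
	memcpy(buf, sems, len);
	return (long)len;
}
```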

TODO: this patch does not handle semaphore-undo -- this data should be
saved per-task while iterating through the tasks.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 include/linux/checkpoint_hdr.h |8 ++
 ipc/Makefile   |3 +-
 ipc/checkpoint.c   |4 -
 ipc/checkpoint_sem.c   |  220 
 ipc/util.h |5 +
 5 files changed, 235 insertions(+), 5 deletions(-)

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b05f39c..cd427d8 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -376,6 +376,14 @@ struct ckpt_hdr_ipc_msg_msg {
__u32 m_ts;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_sem {
+   struct ckpt_hdr h;
+   struct ckpt_hdr_ipc_perms perms;
+   __u64 sem_otime;
+   __u64 sem_ctime;
+   __u32 sem_nsems;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/ipc/Makefile b/ipc/Makefile
index ca408ff..81af168 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,5 +9,6 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_CHECKPOINT) += checkpoint.o checkpoint_shm.o checkpoint_msg.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o \
+   checkpoint_shm.o checkpoint_msg.o checkpoint_sem.o
 
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 7eece96..f621226 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -108,12 +108,10 @@ int checkpoint_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
 				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
 				 CKPT_HDR_IPC_SEM, checkpoint_ipc_sem);
-#endif
 	return ret;
 }
 
@@ -220,12 +218,10 @@ static int do_restore_ipc_ns(struct ckpt_ctx *ctx)
 		goto out;
 	ret = restore_ipc_any(ctx, IPC_MSG_IDS,
 			      CKPT_HDR_IPC_MSG, restore_ipc_msg);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
 	ret = restore_ipc_any(ctx, IPC_SEM_IDS,
 			      CKPT_HDR_IPC_SEM, restore_ipc_sem);
-#endif
  out:
 	ckpt_hdr_put(ctx, h);
 	return ret;
diff --git a/ipc/checkpoint_sem.c b/ipc/checkpoint_sem.c
new file mode 100644
index 000..34dea40
--- /dev/null
+++ b/ipc/checkpoint_sem.c
@@ -0,0 +1,220 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc sem
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/sem.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+
+#include <linux/msg.h>	/* needed for util.h that uses 'struct msg_msg' */
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * ipc checkpoint
+ */
+
+static int fill_ipc_sem_hdr(struct ckpt_ctx *ctx,
+			    struct ckpt_hdr_ipc_sem *h,
+			    struct sem_array *sem)
+{
+	int ret = 0;
+
+	ipc_lock_by_ptr(&sem->sem_perm);
+
+	ret = checkpoint_fill_ipc_perms(&h->perms, &sem->sem_perm);
+	if (ret < 0)
+		goto unlock;
+
+	h->sem_otime = sem->sem_otime;
+	h->sem_ctime = sem->sem_ctime;
+	h->sem_nsems = sem->sem_nsems;
+
+ unlock:
+	ipc_unlock(&sem->sem_perm);
+	ckpt_debug("sem: nsems %u\n", h->sem_nsems);
+
+	return ret;
+}
+
+/**
+ * ckpt_write_sem_array - dump the state of a semaphore array
+ * @ctx: checkpoint context
+ * @sem: semaphore array
+ *
+ * The state of a semaphore is an array of 'struct sem'. This structure
+ * is {int, int}, which translates to the same format {32 bits, 32 bits}
+ * on both 32- and 64-bit architectures. So we simply dump the array.
+ *
+ * The sem-undo information is not saved 

[Devel] [RFC v16][PATCH 18/43] c/r: restore anonymous- and file-mapped- shared memory

2009-05-27 Thread Oren Laadan
The bulk of the work is in ckpt_read_vma(), which has been refactored:
the part that creates the suitable 'struct file *' for the mapping is
now larger and has been moved to a separate function. What's left is to
read the VMA description, get the file pointer, create the mapping, and
proceed to read the contents in.

Both anonymous shared VMAs that have been read earlier (as indicated
by a lookup in objhash) and file-mapped shared VMAs are skipped.
Anonymous shared VMAs seen for the first time have their contents
read in directly to the backing inode, as indexed by the page numbers
(as opposed to virtual addresses).

Changelog[v14]:
  - Introduce patch

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/memory.c|   66 ---
 include/linux/checkpoint.h |6 
 include/linux/mm.h |2 +
 mm/filemap.c   |   13 -
 mm/shmem.c |   49 
 5 files changed, 118 insertions(+), 18 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 2b73abc..c163b76 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -785,13 +785,36 @@ static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p)
 	return 0;
 }
 
+static struct page *bring_private_page(unsigned long addr)
+{
+	struct page *page;
+	int ret;
+
+	ret = get_user_pages(current, current->mm, addr, 1, 1, 1, &page, NULL);
+	if (ret < 0)
+		page = ERR_PTR(ret);
+	return page;
+}
+
+static struct page *bring_shared_page(unsigned long idx, struct inode *ino)
+{
+	struct page *page = NULL;
+	int ret;
+
+	ret = shmem_getpage(ino, idx, &page, SGP_WRITE, NULL);
+	if (ret < 0)
+		return ERR_PTR(ret);
+	if (page)
+		unlock_page(page);
+	return page;
+}
+
+
 /**
  * read_pages_contents - read in data of pages in page-array chain
  * @ctx - restart context
  */
-static int read_pages_contents(struct ckpt_ctx *ctx)
+static int read_pages_contents(struct ckpt_ctx *ctx, struct inode *inode)
 {
-   struct mm_struct *mm = current-mm;
struct ckpt_pgarr *pgarr;
unsigned long *vaddrs;
char *buf;
@@ -801,17 +824,22 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
 	if (!buf)
 		return -ENOMEM;
 
-	down_read(&mm->mmap_sem);
+	down_read(&current->mm->mmap_sem);
 	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
 		vaddrs = pgarr->vaddrs;
 		for (i = 0; i < pgarr->nr_used; i++) {
 			struct page *page;
 
 			_ckpt_debug(CKPT_DPAGE, "got page %#lx\n", vaddrs[i]);
-			ret = get_user_pages(current, mm, vaddrs[i],
-					     1, 1, 1, &page, NULL);
-			if (ret < 0)
+			if (inode)
+				page = bring_shared_page(vaddrs[i], inode);
+			else
+				page = bring_private_page(vaddrs[i]);
+
+			if (IS_ERR(page)) {
+				ret = PTR_ERR(page);
 				goto out;
+			}
 
 			ret = restore_read_page(ctx, page, buf);
 			page_cache_release(page);
@@ -822,14 +850,15 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
 	}
 
  out:
-	up_read(&mm->mmap_sem);
+	up_read(&current->mm->mmap_sem);
 	kfree(buf);
 	return 0;
 }
 
 /**
- * restore_memory_contents - restore contents of a VMA with private memory
+ * restore_memory_contents - restore contents of a memory region
  * @ctx - restart context
+ * @inode - backing inode
  *
  * Reads a header that specifies how many pages will follow, then reads
  * a list of virtual addresses into ctx->pgarr_list page-array chain,
@@ -837,7 +866,7 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
  * these steps until reaching a header specifying 0 pages, which marks
  * the end of the contents.
  */
-static int restore_memory_contents(struct ckpt_ctx *ctx)
+int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode)
 {
struct ckpt_hdr_pgarr *h;
unsigned long nr_pages;
@@ -864,7 +893,7 @@ static int restore_memory_contents(struct ckpt_ctx *ctx)
 		ret = read_pages_vaddrs(ctx, nr_pages);
 		if (ret < 0)
 			break;
-		ret = read_pages_contents(ctx);
+		ret = read_pages_contents(ctx, inode);
 		if (ret < 0)
 			break;
 		pgarr_reset_all(ctx);
@@ -922,9 +951,9 @@ static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
  * @file - file to map (NULL for anonymous)
  * @h - vma header data
  */
-static unsigned long generic_vma_restore(struct mm_struct *mm,
-

[Devel] [RFC v16][PATCH 10/43] c/r: restore open file descriptors

2009-05-27 Thread Oren Laadan
For each fd, read 'struct ckpt_hdr_file_desc' and look up objref in the
hash table. If it is not found in the hash table (first occurrence), read
in 'struct ckpt_hdr_file', create a new file and register it in the hash.
Otherwise, attach the file pointer from the hash as an FD.
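
The lookup-or-create flow can be sketched in user space roughly as
below. Everything here is hypothetical (fake_file, obj_table,
restore_file_obj); the real code uses the checkpoint objhash and real
'struct file' objects. The point is only that the first occurrence of an
objref creates and registers the object, and later occurrences share it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 16

/* Hypothetical stand-in for a restored 'struct file'. */
struct fake_file {
	int objref;
	int refs;	/* how many descriptors point at it */
};

/* Toy object table indexed by objref (assumed to be 1..MAX_OBJS-1). */
static struct fake_file *obj_table[MAX_OBJS];
static struct fake_file file_pool[MAX_OBJS];

/* Return the file for objref, creating and registering it on first use. */
static struct fake_file *restore_file_obj(int objref, int *created)
{
	struct fake_file *f;

	assert(objref > 0 && objref < MAX_OBJS);
	f = obj_table[objref];
	*created = (f == NULL);
	if (!f) {
		/* first occurrence: "read" the file state and register it */
		f = &file_pool[objref];
		f->objref = objref;
		f->refs = 0;
		obj_table[objref] = f;
	}
	f->refs++;	/* attach to one more descriptor */
	return f;
}
```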

Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects

Changelog[v14]:
  - Introduce a per file-type restore() callback
  - Revert change to pr_debug(), back to ckpt_debug()
  - Rename:  restore_files() => restore_fd_table()
  - Rename:  ckpt_read_fd_data() => restore_file()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'hh-parent'

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/files.c |  285 
 checkpoint/objhash.c   |2 +
 checkpoint/process.c   |   20 +++
 checkpoint/restart.c   |9 ++
 include/linux/checkpoint.h |5 +
 5 files changed, 321 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index d10dfb6..d7583d3 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -16,6 +16,8 @@
 #include <linux/sched.h>
 #include <linux/file.h>
 #include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -309,3 +311,286 @@ int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
 
return objref;
 }
+
+/**
+ * Restart
+ */
+
+/**
+ * read_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ * @mode: file mode
+ */
+static struct file *read_open_fname(struct ckpt_ctx *ctx, int flags, int mode)
+{
+	struct ckpt_hdr *h;
+	struct file *file;
+	char *fname;
+
+	h = ckpt_read_buf_type(ctx, PATH_MAX, CKPT_HDR_FILE_NAME);
+	if (IS_ERR(h))
+		return (struct file *) h;
+	fname = (char *) (h + 1);
+	ckpt_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
+
+	file = filp_open(fname, flags, mode);
+	ckpt_hdr_put(ctx, h);
+	return file;
+}
+
+static int close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		get_file(file);
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CKPT_SETFL_MASK  \
+	(O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+
+int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+			struct ckpt_hdr_file *h)
+{
+	int ret;
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
+	ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
+	if (ret < 0)
+		goto out;
+
+	ret = vfs_llseek(file, h->f_pos, SEEK_SET);
+	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
+		ret = 0;
+ out:
+	return ret;
+}
+
+static struct file *generic_file_restore(struct ckpt_ctx *ctx,
+					 struct ckpt_hdr_file *ptr)
+{
+	struct file *file;
+	int ret;
+
+	if (ptr->h.type != CKPT_HDR_FILE  ||
+	    ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
+		return ERR_PTR(-EINVAL);
+
+	file = read_open_fname(ctx, ptr->f_flags, ptr->f_mode);
+	if (IS_ERR(file))
+		return file;
+
+	ret = restore_file_common(ctx, file, ptr);
+	if (ret < 0) {
+		fput(file);
+		file = ERR_PTR(ret);
+	}
+	return file;
+}
+
+struct restore_file_ops {
+	char *file_name;
+	enum file_type file_type;
+	struct file * (*restore) (struct ckpt_ctx *ctx,
+				  struct ckpt_hdr_file *ptr);
+};
+
+static struct restore_file_ops restore_file_ops[] = {
+	/* ignored file */
+	{
+		.file_name = "IGNORE",
+		.file_type = CKPT_FILE_IGNORE,
+		.restore = NULL,
+	},
+	/* regular file/directory */
+	{
+		.file_name = "GENERIC",
+		.file_type = CKPT_FILE_GENERIC,
+		.restore = generic_file_restore,
+   

[Devel] [RFC v16][PATCH 38/43] c/r: support message-queues sysv-ipc

2009-05-27 Thread Oren Laadan
Checkpoint of sysvipc message-queues is performed by iterating through
all 'msq' objects and dumping the contents of each one. The messages
queued on each 'msq' are dumped with that object.

The messages of a specific queue are written one by one. The queue lock
cannot be held while dumping them, but the loop must be protected from
someone (who ?) writing or reading. To do that we grab the lock, then
hijack the entire chain of messages from the queue, drop the lock,
and then safely dump them in a loop. Finally, with the lock held, we
re-attach the chain while verifying that there isn't other (new) data
on that queue.
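
The detach, dump, re-attach sequence can be sketched in user space as
below. All names are hypothetical and the "lock" is a toy flag that only
asserts correct pairing; what the real code does when new data is found
on the queue is not shown in this excerpt, so the sketch simply reports
it to the caller.

```c
#include <assert.h>
#include <stddef.h>

struct msg {
	int payload;
	struct msg *next;
};

struct msg_queue {
	int locked;		/* toy lock: 1 while held */
	struct msg *head;	/* singly linked chain of messages */
};

static void q_lock(struct msg_queue *q)   { assert(!q->locked); q->locked = 1; }
static void q_unlock(struct msg_queue *q) { assert(q->locked);  q->locked = 0; }

/* Stand-in for writing one message to the checkpoint image. */
static void dump_msg(const struct msg *m, int *sum)
{
	*sum += m->payload;
}

/* Returns the number of messages dumped, or -1 if new data showed up. */
static int dump_queue(struct msg_queue *q, int *sum)
{
	struct msg *chain, *m;
	int nr = 0, ret;

	/* 1. detach the whole chain while holding the lock */
	q_lock(q);
	chain = q->head;
	q->head = NULL;
	q_unlock(q);

	/* 2. dump without the lock; the queue looks empty to others */
	for (m = chain; m; m = m->next) {
		dump_msg(m, sum);
		nr++;
	}

	/* 3. re-attach under the lock, checking that nothing new arrived */
	q_lock(q);
	ret = q->head ? -1 : nr;
	if (chain) {
		for (m = chain; m->next; m = m->next)
			;
		m->next = q->head;	/* keep any newcomers after the chain */
		q->head = chain;
	}
	q_unlock(q);
	return ret;
}
```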

Writing the message contents themselves is straightforward. The code
is similar to that in ipc/msgutil.c, the main difference being that
we deal with kernel memory and not user memory.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 include/linux/checkpoint_hdr.h |   21 +++-
 ipc/Makefile   |2 +-
 ipc/checkpoint.c   |6 +-
 ipc/checkpoint_msg.c   |  362 
 ipc/util.h |3 +
 5 files changed, 389 insertions(+), 5 deletions(-)

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index f7e331d..b05f39c 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -73,6 +73,7 @@ enum {
CKPT_HDR_IPC = 501,
CKPT_HDR_IPC_SHM,
CKPT_HDR_IPC_MSG,
+   CKPT_HDR_IPC_MSG_MSG,
CKPT_HDR_IPC_SEM,
 
CKPT_HDR_TAIL = 9001,
@@ -356,6 +357,25 @@ struct ckpt_hdr_ipc_shm {
__u32 objref;
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_ipc_msg {
+   struct ckpt_hdr h;
+   struct ckpt_hdr_ipc_perms perms;
+   __u64 q_stime;
+   __u64 q_rtime;
+   __u64 q_ctime;
+   __u64 q_cbytes;
+   __u64 q_qnum;
+   __u64 q_qbytes;
+   __s32 q_lspid;
+   __s32 q_lrpid;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc_msg_msg {
+   struct ckpt_hdr h;
+   __s32 m_type;
+   __u32 m_ts;
+} __attribute__((aligned(8)));
+
 
 #define CKPT_TST_OVERFLOW_16(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
@@ -366,5 +386,4 @@ struct ckpt_hdr_ipc_shm {
 #define CKPT_TST_OVERFLOW_64(a, b) \
 	((sizeof(a) > sizeof(b)) && ((a) > LONG_MAX))
 
-
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/ipc/Makefile b/ipc/Makefile
index 7e23683..ca408ff 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,5 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_CHECKPOINT) += checkpoint.o checkpoint_shm.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o checkpoint_shm.o checkpoint_msg.o
 
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 25d2277..7eece96 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -104,11 +104,11 @@ int checkpoint_ipc_ns(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns)
 
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
 				 CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
 				 CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		return ret;
 	ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
@@ -216,11 +216,11 @@ static int do_restore_ipc_ns(struct ckpt_ctx *ctx)
 
 	ret = restore_ipc_any(ctx, IPC_SHM_IDS,
 			      CKPT_HDR_IPC_SHM, restore_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
-	ret = ckpt_read_ipc_any(ctx, IPC_MSG_IDS,
+	ret = restore_ipc_any(ctx, IPC_MSG_IDS,
 			      CKPT_HDR_IPC_MSG, restore_ipc_msg);
+#if 0 /* NEXT FEW PATCHES */
 	if (ret < 0)
 		goto out;
 	ret = restore_ipc_any(ctx, IPC_SEM_IDS,
diff --git a/ipc/checkpoint_msg.c b/ipc/checkpoint_msg.c
new file mode 100644
index 000..a988a9e
--- /dev/null
+++ b/ipc/checkpoint_msg.c
@@ -0,0 +1,362 @@
+/*
+ *  Checkpoint/restart - dump state of sysvipc msg
+ *
+ *  Copyright (C) 2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/msg.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * ipc checkpoint
+ */
+
+static int fill_ipc_msg_hdr(struct ckpt_ctx *ctx,
+  

[Devel] [RFC v16][PATCH 22/43] c/r: checkpoint multiple processes

2009-05-27 Thread Oren Laadan
Checkpointing of multiple processes works by recording the tasks tree
structure below a given root task. The root task is expected to be a
container init, and then an entire container is checkpointed. However,
passing CHECKPOINT_SUBTREE to checkpoint(2) relaxes this requirement
and allows checkpointing a subtree of processes under the root task.

For a given root task, do a DFS scan of the tasks tree and collect them
into an array (keeping a reference to each task). Using DFS simplifies
the recreation of tasks either in user space or kernel space. For each
task collected, test if it can be checkpointed, and save its pid, tgid,
and ppid.

The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.
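
The count-then-fill scheme can be sketched in plain userspace C. This is
only an illustration, not the patch's code: 'struct demo_task' and the
demo_*() helpers are hypothetical stand-ins, and the real code walks
task_struct lists under tasklist_lock and takes a reference on each task.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a task: a pid plus child/sibling links. */
struct demo_task {
	int pid;
	struct demo_task *child;    /* first child, or NULL */
	struct demo_task *sibling;  /* next sibling, or NULL */
};

/*
 * Depth-first walk of the subtree under @root, parent before children.
 * With @arr == NULL the walk only counts; otherwise it also stores each
 * node in visiting order. Returns the number of nodes visited.
 */
static int demo_tree_walk(struct demo_task *root, struct demo_task **arr)
{
	struct demo_task *c;
	int n = 0;

	if (!root)
		return 0;
	if (arr)
		arr[n] = root;
	n++;
	for (c = root->child; c; c = c->sibling)
		n += demo_tree_walk(c, arr ? arr + n : NULL);
	return n;
}

/*
 * Pass 1 counts the tasks, then the array is allocated, then pass 2
 * fills it. Returns the number of tasks, or -1 if allocation fails.
 * The caller frees *out.
 */
static int demo_collect(struct demo_task *root, struct demo_task ***out)
{
	struct demo_task **arr;
	int nr, filled;

	nr = demo_tree_walk(root, NULL);
	if (nr == 0)
		return 0;
	arr = malloc(nr * sizeof(*arr));
	if (!arr)
		return -1;
	filled = demo_tree_walk(root, arr);
	assert(filled == nr);   /* holds only because the tree did not change */
	(void)filled;
	*out = arr;
	return nr;
}
```

Because parents always precede their children in the array, a restart
that recreates the tasks in array order never needs a parent that does
not exist yet.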

Whether checkpoint and restart require CAP_SYS_ADMIN is determined by
the sysctl 'ckpt_unpriv_allowed': if 1, unprivileged users may use them
and the regular permission checks are relied upon to prevent privilege
escalation; if 0, CAP_SYS_ADMIN is required, which prevents
unprivileged users from exploiting any privilege escalation bugs.
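
As a rough userspace sketch of that gate (demo_ckpt_permitted() is a
hypothetical name, and its second argument merely stands in for a
capable(CAP_SYS_ADMIN) test; the actual check in checkpoint/sys.c may
be structured differently):

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical gate, not the patch's code. @unpriv_allowed mirrors the
 * 'ckpt_unpriv_allowed' sysctl; @has_cap_sys_admin stands in for the
 * result of a capable(CAP_SYS_ADMIN) check.
 */
static int demo_ckpt_permitted(int unpriv_allowed, int has_cap_sys_admin)
{
	assert(unpriv_allowed == 0 || unpriv_allowed == 1);
	if (!unpriv_allowed && !has_cap_sys_admin)
		return -EPERM;
	/* the regular per-object permission checks would still follow */
	return 0;
}
```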

The logic is suitable for creation of processes during restart either
in userspace or by the kernel.

Currently we ignore threads and zombies.

Changelog[v16]:
  - CHECKPOINT_SUBTREE flag allows subtree (not whole container)
  - sysctl variable 'ckpt_unpriv_allowed' controls needed privileges

Changelog[v14]:
  - Refuse non-self checkpoint if target task isn't frozen
  - Refuse checkpoint (for now) if task is ptraced
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Check retval of ckpt_tree_count_tasks() in ckpt_build_tree()
  - Discard 'h.parent' field
  - Check whether calls to ckpt_hbuf_get() fail
  - Disallow threads or siblings to container init

Changelog[v13]:
  - Release tasklist_lock in error path in ckpt_tree_count_tasks()
  - Use separate index for 'tasks_arr' and 'hh' in ckpt_write_pids()

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/checkpoint.c  |  237 --
 checkpoint/restart.c |2 +-
 checkpoint/sys.c |   33 +-
 include/linux/checkpoint_hdr.h   |   16 +++-
 include/linux/checkpoint_types.h |   16 ++-
 kernel/sysctl.c  |   17 +++
 6 files changed, 305 insertions(+), 16 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 3999d80..92f219e 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -249,8 +249,27 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
return ret;
 }
 
+/* dump all tasks in ctx->tasks_arr[] */
+static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
+{
+   int n, ret = 0;
+
+   for (n = 0; n < ctx->nr_tasks; n++) {
+   ckpt_debug("dumping task #%d\n", n);
+   ret = checkpoint_task(ctx, ctx->tasks_arr[n]);
+   if (ret < 0)
+   break;
+   }
+
+   return ret;
+}
+
 static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
+   struct task_struct *root = ctx->root_task;
+
+   ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
+
if (t->state == TASK_DEAD) {
pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
return -EAGAIN;
@@ -276,14 +295,211 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
return -EBUSY;
}
 
+   /*
+* FIX: for now, disallow siblings of container init created
+* via CLONE_PARENT (unclear if they will remain possible)
+*/
+   if (ctx->root_init && t != root && t->tgid != root->tgid &&
+   t->real_parent == root->real_parent) {
+   __ckpt_write_err(ctx, "task %d (%s) is sibling of root",
+task_pid_vnr(t), t->comm);
+   return -EINVAL;
+   }
+
+   /* FIX: change this when namespaces are added */
+   if (task_nsproxy(t) != ctx->root_nsproxy)
+   return -EPERM;
+
+   return 0;
+}
+
+#define CKPT_HDR_PIDS_CHUNK 256
+
+static int checkpoint_pids(struct ckpt_ctx *ctx)
+{
+   struct ckpt_hdr_pids *h;
+   struct pid_namespace *ns;
+   struct task_struct *task;
+   struct task_struct **tasks_arr;
+   int nr_tasks, n, pos = 0, ret = 0;
+
+   ns = ctx->root_nsproxy->pid_ns;
+   tasks_arr = ctx->tasks_arr;
+   nr_tasks = ctx->nr_tasks;
+   BUG_ON(nr_tasks <= 0);
+
+   ret = ckpt_write_obj_type(ctx, NULL,
+ sizeof(*h) * nr_tasks,
+ CKPT_HDR_BUFFER);
+   if (ret < 0)
+   return ret;
+
+   h = ckpt_hdr_get(ctx, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+   if (!h)
+   return -ENOMEM;
+
+   do {
+   rcu_read_lock();
+   for (n = 0; n < min(nr_tasks, CKPT_HDR_PIDS_CHUNK); n++) {

[Devel] [RFC v16][PATCH 17/43] c/r: dump anonymous- and file-mapped- shared memory

2009-05-27 Thread Oren Laadan
We now handle anonymous and file-mapped shared memory. Support for IPC
shared memory requires support for IPC first. We extend ckpt_write_vma()
to detect shared memory VMAs and handle them separately from private
memory.

There is not much to do for file-mapped shared memory, except to force
msync() on the region to ensure that the file system is consistent
with the checkpoint image. Use our internal type CKPT_VMA_SHM_FILE.

Anonymous shared memory is always backed by an inode in the shmem filesystem.
We use that inode to look up the VMA in the objhash and register it if
not found (on first encounter). In this case, the type of the VMA is
CKPT_VMA_SHM_ANON, and we dump the contents. On the other hand, if it is
found there, we must have already saved it before, so we change the
type to CKPT_VMA_SHM_ANON_SKIP and skip it.

To dump the contents of a shmem VMA, we loop through the pages of the
inode in the shmem filesystem, and dump the contents of each dirty
(allocated) page - unallocated pages must be clean.

Note that we save the original size of a shmem VMA because it may have
been re-mapped partially. The format itself remains the same as for private
VMAs, except that instead of addresses we record _indices_ (page nr)
into the backing inode.
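
To illustrate what recording an index rather than an address means,
here is a small sketch. It assumes the usual relation between a mapping
and its backing object (the offset of vm_start in pages, plus the page
distance from vm_start); 'struct demo_vma', DEMO_PAGE_SHIFT and
demo_addr_to_index() are hypothetical and do not appear in the patch.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12   /* assumes 4 KiB pages */

/* Hypothetical subset of a VMA: where it starts and what it maps. */
struct demo_vma {
	unsigned long vm_start;  /* first virtual address of the mapping */
	unsigned long vm_pgoff;  /* offset of vm_start in the object, in pages */
};

/* Translate a virtual address inside @vma to a page index in the object. */
static unsigned long demo_addr_to_index(const struct demo_vma *vma,
					unsigned long addr)
{
	assert(addr >= vma->vm_start);
	return vma->vm_pgoff + ((addr - vma->vm_start) >> DEMO_PAGE_SHIFT);
}
```

An index stays meaningful no matter which part of the object a given
task happens to have mapped, or at which address.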

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/memory.c|  143 +++
 checkpoint/objhash.c   |   29 
 include/linux/checkpoint.h |   15 +++--
 include/linux/checkpoint_hdr.h |8 ++
 mm/filemap.c   |   39 +++-
 mm/mmap.c  |2 +-
 mm/shmem.c |   33 +
 7 files changed, 246 insertions(+), 23 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 99bafaa..2b73abc 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -21,6 +21,7 @@
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
 #include <linux/proc_fs.h>
+#include <linux/swap.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -281,6 +282,54 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
 }
 
 /**
+ * consider_shared_page - return page pointer for dirty pages
+ * @ino - inode of shmem object
+ * @idx - page index in shmem object
+ *
+ * Looks up the page that corresponds to the index in the shmem object,
+ * and returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_shared_page(struct inode *ino, unsigned long idx)
+{
+   struct page *page = NULL;
+   int ret;
+
+   /*
+* Inspired by do_shmem_file_read(): very simplified version.
+*
+* FIXME: consolidate with do_shmem_file_read()
+*/
+
+   ret = shmem_getpage(ino, idx, &page, SGP_READ, NULL);
+   if (ret < 0)
+   return ERR_PTR(ret);
+
+   /*
+* Only care about dirty pages; shmem_getpage() only returns
+* pages that have been allocated, so they must be dirty. The
+* pages returned are locked and referenced.
+*/
+
+   if (page) {
+   unlock_page(page);
+   /*
+* If users can be writing to this page using arbitrary
+* virtual addresses, take care about potential aliasing
+* before reading the page on the kernel side.
+*/
+   if (mapping_writably_mapped(ino->i_mapping))
+   flush_dcache_page(page);
+   /*
+* Mark the page accessed if we read the beginning.
+*/
+   mark_page_accessed(page);
+   }
+
+   return page;
+}
+
+/**
  * vma_fill_pgarr - fill a page-array with addr/page tuples
  * @ctx - checkpoint context
  * @vma - vma to scan
@@ -289,17 +338,16 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
  * Returns the number of pages collected
  */
 static int vma_fill_pgarr(struct ckpt_ctx *ctx,
- struct vm_area_struct *vma,
- unsigned long *start)
+ struct vm_area_struct *vma, struct inode *inode,
+ unsigned long *start, unsigned long end)
 {
-   unsigned long end = vma->vm_end;
unsigned long addr = *start;
struct ckpt_pgarr *pgarr;
int nr_used;
int cnt = 0;
 
/* this function is only for private memory (anon or file-mapped) */
-   BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+   BUG_ON(inode && vma);
 
do {
pgarr = pgarr_current(ctx);
@@ -311,7 +359,11 @@ static int vma_fill_pgarr(struct ckpt_ctx *ctx,
while (addr < end) {
struct page *page;
 
-   page = consider_private_page(vma, addr);
+   if (vma)
+   page = consider_private_page(vma, addr);
+   else
+   

[Devel] [RFC v16][PATCH 21/43] c/r: restart-blocks

2009-05-27 Thread Oren Laadan
(Paraphrasing what's said in this message:
http://lists.openwall.net/linux-kernel/2007/12/05/64)

Restart blocks are callbacks used to cause a system call to be restarted
with the arguments specified in the system call restart block. They are
useful for system calls that are not idempotent, e.g. when the argument(s)
include a relative timeout, where some adjustments are required when
restarting the system call. The mechanism relies on the system call itself
to set up its restart point and the argument save area.  They are rare: an
actual signal would turn that into an EINTR. The only case that should
ever trigger this is some kernel action that interrupts the system
call, but does not actually result in any user-visible state changes -
like freeze and thaw.

So restart blocks are about time remaining for the system call to
sleep/wait. Generally in c/r, there are two possible time models that
we can follow: absolute, relative. Here, I chose to save the relative
timeout, measured from the beginning of the checkpoint. The time when
the checkpoint (and restart) begin is also saved. This information is
sufficient to restart in either model (absolute or relative).

Which model to use should eventually be a per-application choice (and
possibly configurable via cradvise() or something of the sort). For now, we adopt
the relative model, namely, at restart the timeout is set relative to
the beginning of the restart.
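
The arithmetic behind the two models can be sketched as follows. This
is an illustration only: 'struct demo_restart_block' and the demo_*()
helpers are hypothetical, all times are plain nanosecond counts, and
the actual header layout in the patch differs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical saved state; times are nanosecond counts. */
struct demo_restart_block {
	int64_t remaining;  /* time left, measured from checkpoint start */
};

/* At checkpoint: turn an absolute expiry into the time remaining. */
static struct demo_restart_block demo_save(int64_t expire_abs,
					   int64_t ckpt_start)
{
	struct demo_restart_block rb;

	/* an expiry that already passed leaves nothing to wait for */
	rb.remaining = expire_abs > ckpt_start ? expire_abs - ckpt_start : 0;
	return rb;
}

/* Relative model: wait the remaining time, counted from restart start. */
static int64_t demo_restore_relative(const struct demo_restart_block *rb,
				     int64_t restart_start)
{
	assert(rb->remaining >= 0);
	return restart_start + rb->remaining;
}

/* Absolute model: keep the original deadline, using the saved start time. */
static int64_t demo_restore_absolute(const struct demo_restart_block *rb,
				     int64_t ckpt_start)
{
	return ckpt_start + rb->remaining;
}
```

With a checkpoint starting at t=1000 and a sleep due to expire at
t=1500, a restart beginning at t=9000 re-arms the timer for t=9500 in
the relative model, whereas the absolute model would yield the original
t=1500, which is already in the past.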

To checkpoint, we check if a task has a valid restart block, and if so
we save the *remaining* time that it has to wait/sleep, and the type
of the restart block.

To restart, we fill in the data required at the proper place in the
thread information. If the system call returns an error (possibly
-ERESTARTSYS, for example), we not only use that error as our own
return value, but also arrange for the task to execute the signal
handler (by faking a signal). The handler, in turn, already has the
code to handle these restart requests gracefully.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 arch/x86/include/asm/checkpoint_hdr.h |1 -
 arch/x86/mm/checkpoint.c  |   10 +-
 checkpoint/checkpoint.c   |1 +
 checkpoint/process.c  |  226 +
 checkpoint/restart.c  |   35 +-
 checkpoint/sys.c  |1 +
 include/linux/checkpoint.h|4 +
 include/linux/checkpoint_hdr.h|   22 +++
 include/linux/checkpoint_types.h  |3 +
 9 files changed, 293 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index cf90170..ee23df9 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -57,7 +57,6 @@ struct ckpt_hdr_header_arch {
 
 struct ckpt_hdr_thread {
struct ckpt_hdr h;
-   /* FIXME: restart blocks */
__u16 gdt_entry_tls_entries;
__u16 sizeof_tls_array;
__u16 ntls; /* number of TLS entries to follow */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index c781416..7cd7494 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -63,13 +63,9 @@ int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
 * FIXME: the TLS descriptors in the GDT should be called out and
 * not tied to the in-kernel representation.
 */
-   ret = ckpt_write_obj_type(ctx, thread->tls_array,
- sizeof(thread->tls_array),
- CKPT_HDR_THREAD_TLS);
-
-   /* IGNORE RESTART BLOCKS FOR NOW ... */
-
-   return ret;
+   return ckpt_write_obj_type(ctx, thread->tls_array,
+  sizeof(thread->tls_array),
+  CKPT_HDR_THREAD_TLS);
 }
 
 #ifndef CONFIG_X86_64
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 086f2d9..3999d80 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -23,6 +23,7 @@
 #include <linux/mount.h>
 #include <linux/utsname.h>
 #include <linux/magic.h>
+#include <linux/hrtimer.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 3ce82cb..876be3e 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -12,6 +12,9 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/sched.h>
+#include <linux/posix-timers.h>
+#include <linux/futex.h>
+#include <linux/poll.h>
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
@@ -80,6 +83,116 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
return ret;
 }
 
+/* dump the restart block of a given task */
+int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+   struct ckpt_hdr_restart_block *h;
+   struct restart_block *restart_block;
+   long (*fn)(struct restart_block *);
+   s64 base, expire = 0;
+   int ret;
+
+   h = ckpt_hdr_get_type(ctx, sizeof(*h), 

[Devel] [RFC v16][PATCH 36/43] c/r: support share-memory sysv-ipc

2009-05-27 Thread Oren Laadan
Checkpoint of sysvipc shared memory is performed in two steps: first,
the entire ipc namespace is dumped as a whole by iterating through all
shm objects and dumping the contents of each one. The shmem inode is
registered in the objhash. Second, for each vma that refers to ipc
shared memory we find the inode in the objhash, and save the objref.

(If we find a new inode, that indicates that the ipc namespace is not
entirely frozen and someone must have manipulated it since step 1).

Handling of shm objects that have been deleted (via IPC_RMID) is left
to a later patch in this series.

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/memory.c  |   28 -
 checkpoint/sys.c |   10 ++
 include/linux/checkpoint.h   |3 +
 include/linux/checkpoint_hdr.h   |   19 +++-
 include/linux/checkpoint_types.h |1 +
 include/linux/shm.h  |9 ++
 ipc/Makefile |2 +-
 ipc/checkpoint.c |4 +-
 ipc/checkpoint_shm.c |  261 ++
 ipc/shm.c|   73 +++
 ipc/util.h   |4 +-
 11 files changed, 406 insertions(+), 8 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index c163b76..997359f 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -20,6 +20,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/mm_types.h>
+#include <linux/shm.h>
 #include <linux/proc_fs.h>
 #include <linux/swap.h>
 #include <linux/checkpoint.h>
@@ -459,9 +460,9 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
  * virtual addresses into ctx->pgarr_list page-array chain. Then dump
  * the addresses, followed by the page contents.
  */
-static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
- struct vm_area_struct *vma,
- struct inode *inode)
+int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+  struct vm_area_struct *vma,
+  struct inode *inode)
 {
struct ckpt_hdr_pgarr *h;
unsigned long addr, end;
@@ -1022,6 +1023,13 @@ static int anon_private_restore(struct ckpt_ctx *ctx,
return private_vma_restore(ctx, mm, NULL, h);
 }
 
+static int bad_vma_restore(struct ckpt_ctx *ctx,
+  struct mm_struct *mm,
+  struct ckpt_hdr_vma *h)
+{
+   return -EINVAL;
+}
+
 /* callbacks to restore vma per its type: */
 struct restore_vma_ops {
char *vma_name;
@@ -1074,6 +1082,20 @@ static struct restore_vma_ops restore_vma_ops[] = {
.vma_type = CKPT_VMA_SHM_FILE,
.restore = filemap_restore,
},
+   /* sysvipc shared */
+   {
+   .vma_name = "IPC SHARED",
+   .vma_type = CKPT_VMA_SHM_IPC,
+   /* ipc inode itself is restored by restore_ipc_ns()... */
+   .restore = bad_vma_restore,
+
+   },
+   /* sysvipc shared (skip) */
+   {
+   .vma_name = "IPC SHARED (skip)",
+   .vma_type = CKPT_VMA_SHM_IPC_SKIP,
+   .restore = ipcshm_restore,
+   },
 };
 
 /**
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index f6cf0ac..ac3bf7c 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -20,6 +20,7 @@
 #include <linux/uaccess.h>
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
+#include <linux/deferqueue.h>
 
 /*
  * ckpt_unpriv_allowed - sysctl controlled, do not allow checkpoints or
@@ -188,8 +189,17 @@ static void task_arr_free(struct ckpt_ctx *ctx)
 
 static void ckpt_ctx_free(struct ckpt_ctx *ctx)
 {
+   int ret;
+
BUG_ON(atomic_read(&ctx->refcount));
 
+   if (ctx->deferqueue) {
+   ret = deferqueue_run(ctx->deferqueue);
+   if (ret != 0)
+   pr_warning("c/r: deferqueue had %d entries\n", ret);
+   deferqueue_destroy(ctx->deferqueue);
+   }
+
if (ctx->file)
fput(ctx->file);
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index d5498bc..064dd25 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -145,6 +145,9 @@ extern unsigned long generic_vma_restore(struct mm_struct *mm,
 extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
   struct file *file, struct ckpt_hdr_vma *h);
 
+extern int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma,
+ struct inode *inode);
 extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
 
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 406b5d6..f7e331d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -278,7 +278,9 @@ enum vma_type {
CKPT_VMA_SHM_ANON,  /* 

[Devel] [RFC v16][PATCH 07/43] c/r: infrastructure for shared objects

2009-05-27 Thread Oren Laadan
The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.
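
The checkpoint side of this scheme (look up by pointer, assign an
objref on first encounter) might look roughly like the sketch below. It
is a simplified userspace illustration with hypothetical names
('struct demo_objhash', demo_obj_lookup_add()), a fixed-size pool and a
trivial hash; the patch itself uses hlist buckets, hash_long() and
per-type ref_grab()/ref_drop() callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NBUCKETS 64
#define DEMO_MAXOBJS  128

struct demo_obj {
	const void *ptr;        /* key at checkpoint: the object's address */
	int objref;             /* unique identifier written to the image */
	struct demo_obj *next;  /* next entry in the same bucket */
};

struct demo_objhash {
	struct demo_obj *bucket[DEMO_NBUCKETS];
	struct demo_obj pool[DEMO_MAXOBJS];
	int used;
	int next_free_objref;
};

static void demo_objhash_init(struct demo_objhash *h)
{
	size_t i;

	for (i = 0; i < DEMO_NBUCKETS; i++)
		h->bucket[i] = NULL;
	h->used = 0;
	h->next_free_objref = 1;
}

static size_t demo_hash_ptr(const void *ptr)
{
	return (size_t)(((uintptr_t)ptr >> 4) % DEMO_NBUCKETS);
}

/*
 * Return the objref for @ptr. On first encounter the object is added
 * and *first is set to 1 (the caller would now dump its state); on
 * later encounters *first is 0 and only the objref needs saving.
 * Returns -1 if the fixed pool is exhausted.
 */
static int demo_obj_lookup_add(struct demo_objhash *h, const void *ptr,
			       int *first)
{
	size_t b = demo_hash_ptr(ptr);
	struct demo_obj *o;

	assert(first != NULL);
	for (o = h->bucket[b]; o; o = o->next) {
		if (o->ptr == ptr) {
			*first = 0;
			return o->objref;
		}
	}
	if (h->used == DEMO_MAXOBJS)
		return -1;
	o = &h->pool[h->used++];
	o->ptr = ptr;
	o->objref = h->next_free_objref++;
	o->next = h->bucket[b];
	h->bucket[b] = o;
	*first = 1;
	return o->objref;
}
```

On restart the same structure would be keyed by objref instead of by
pointer, since kernel addresses from the checkpointed system mean
nothing on the restarting one.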

The hash is one-way: objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.

Changelog[v16]:
  - Introduce ckpt_obj_lookup() to find an object by its ptr

Changelog[v14]:
  - Introduce 'struct ckpt_obj_ops' to better modularize shared objs.
  - Replace long 'switch' statements with table lookups and callbacks.
  - Introduce checkpoint_obj() and restart_obj() helpers
  - Shared objects now dumped/saved right before they are referenced
  - Cleanup interface of shared objects

Changelog[v13]:
  - Use hash_long() with 'unsigned long' cast to support 64bit archs
(Nathan Lynch n...@pobox.com)

Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime

Changelog[v4]:
  - Fix calculation of hash table size

Changelog[v3]:
  - Use standard hlist_... for hash table

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/Makefile  |1 +
 checkpoint/objhash.c |  397 ++
 checkpoint/restart.c |   46 +
 checkpoint/sys.c |7 +
 include/linux/checkpoint.h   |   15 ++
 include/linux/checkpoint_hdr.h   |   14 ++
 include/linux/checkpoint_types.h |2 +
 7 files changed, 482 insertions(+), 0 deletions(-)

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 99364cc..5aa6a75 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -4,6 +4,7 @@
 
 obj-$(CONFIG_CHECKPOINT) += \
sys.o \
+   objhash.o \
checkpoint.o \
restart.o \
process.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..82b4618
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,397 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DOBJ
+
+#include <linux/kernel.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+struct ckpt_obj;
+struct ckpt_obj_ops;
+
+/* object operations */
+struct ckpt_obj_ops {
+   char *obj_name;
+   enum obj_type obj_type;
+   void (*ref_drop)(void *ptr);
+   int (*ref_grab)(void *ptr);
+   int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
+   void *(*restore)(struct ckpt_ctx *ctx);
+};
+
+struct ckpt_obj {
+   int objref;
+   void *ptr;
+   struct ckpt_obj_ops *ops;
+   struct hlist_node hash;
+};
+
+struct ckpt_obj_hash {
+   struct hlist_head *head;
+   int next_free_objref;
+};
+
+/* helper grab/drop functions: */
+
+static void obj_no_drop(void *ptr)
+{
+   return;
+}
+
+static int obj_no_grab(void *ptr)
+{
+   return 0;
+}
+
+static struct ckpt_obj_ops ckpt_obj_ops[] = {
+   /* ignored object */
+   {
+   .obj_name = "IGNORED",
+   .obj_type = CKPT_OBJ_IGNORE,
+   .ref_drop = obj_no_drop,
+   .ref_grab = obj_no_grab,
+   },
+};
+
+
+#define CKPT_OBJ_HASH_NBITS  10
+#define CKPT_OBJ_HASH_TOTAL  (1UL << CKPT_OBJ_HASH_NBITS)
+
+static void obj_hash_clear(struct ckpt_obj_hash *obj_hash)
+{
+   struct hlist_head *h = obj_hash->head;
+   struct hlist_node *n, *t;
+   struct ckpt_obj *obj;
+   int i;
+
+   for (i = 0; i < CKPT_OBJ_HASH_TOTAL; i++) {
+   hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+   obj->ops->ref_drop(obj->ptr);
+   kfree(obj);
+   }
+   }
+}
+
+void ckpt_obj_hash_free(struct ckpt_ctx *ctx)
+{
+   struct ckpt_obj_hash *obj_hash = ctx->obj_hash;
+
+   if (obj_hash) {
+   obj_hash_clear(obj_hash);
+   kfree(obj_hash->head);
+   kfree(ctx->obj_hash);
+   ctx->obj_hash = NULL;
+   }
+}
+
+int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)
+{
+   struct ckpt_obj_hash *obj_hash;
+   struct hlist_head *head;
+
+   obj_hash = 

[Devel] [RFC v16][PATCH 09/43] c/r: dump open file descriptors

2009-05-27 Thread Oren Laadan
Dump the file table with 'struct ckpt_hdr_file_table', followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, they are assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).

Also provide generic_checkpoint_file() and generic_restore_file(),
which are suitable for normal files and directories. They do not yet
support unlinked files or directories.

Changelog[v16]:
  - Reorder patch (move earlier in series)
  - Handle shared files_struct objects

Changelog[v14]:
  - File objects are dumped/restored prior to the first reference
  - Introduce a per file-type restore() callback
  - Use struct file_operations->checkpoint()
  - Put code for generic file descriptors in a separate function
  - Use one CKPT_FILE_GENERIC for both regular files and dirs
  - Revert change to pr_debug(), back to ckpt_debug()
  - Use only unsigned fields in checkpoint headers
  - Rename:  ckpt_write_files() => checkpoint_fd_table()
  - Rename:  ckpt_write_fd_data() => checkpoint_file()
  - Discard field 'h->parent'

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - ckpt_scan_fds() retries from scratch if hits size limits

Changelog[v9]:
  - Fix a couple of leaks in ckpt_write_files()
  - Drop useless kfree from ckpt_scan_fds()

Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning

Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/Makefile  |3 +-
 checkpoint/checkpoint.c  |   25 +++
 checkpoint/files.c   |  311 ++
 checkpoint/objhash.c |   40 +
 checkpoint/process.c |   28 
 checkpoint/sys.c |1 +
 include/linux/checkpoint.h   |   14 ++-
 include/linux/checkpoint_hdr.h   |   49 ++
 include/linux/checkpoint_types.h |8 +
 include/linux/fs.h   |4 +
 10 files changed, 481 insertions(+), 2 deletions(-)

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
objhash.o \
checkpoint.o \
restart.o \
-   process.o
+   process.o \
+   files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 409c78b..a346b7e 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -15,6 +15,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fs_struct.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -244,10 +245,34 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
return ret;
 }
 
+/* setup checkpoint-specific parts of ctx */
+static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+   struct fs_struct *fs;
+
+   ctx->root_pid = pid;
+
+   /*
+* assume checkpointer is in container's root vfs
+* FIXME: this works for now, but will change with real containers
+*/
+
+   fs = current->fs;
+   read_lock(&fs->lock);
+   ctx->fs_mnt = fs->root;
+   path_get(&ctx->fs_mnt);
+   read_unlock(&fs->lock);
+
+   return 0;
+}
+
 int do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
 {
int ret;
 
+   ret = init_checkpoint_ctx(ctx, pid);
+   if (ret < 0)
+   goto out;
ret = checkpoint_write_header(ctx);
if (ret < 0)
goto out;
diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..d10dfb6
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,311 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG  CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**
+ * Checkpoint
+ */
+
+/**
+ * fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+static char *fill_fname(struct path *path, struct path *root,
+   char *buf, int *len)
+{
+   struct path tmp = *root;
+   char *fname;
+
+   BUG_ON(!buf);
+   spin_lock(&dcache_lock);
+   fname = 

[Devel] [RFC v16][PATCH 15/43] c/r: restore memory address space (private memory)

2009-05-27 Thread Oren Laadan
Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the vma state and contents.
Call do_mmap_pgoff() for each vma and then read in the data.

Changelog[v16]:
  - Restore mm->exe_file

Changelog[v14]:
  - Introduce per vma-type restore() function
  - Merge restart code into same file as checkpoint (memory.c)
  - Compare saved 'vdso' field of mm_context with current value
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'
  - Revert change to pr_debug(), back to ckpt_debug()

Changelog[v13]:
  - Avoid access to hh->vma_type after the header is freed
  - Test for no vma's in exit_mmap() before calling unmap_vma() (or it
may crash if restart fails after having removed all vma's)

Changelog[v12]:
  - Replace obsolete ckpt_debug() with pr_debug()

Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)

Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of vaddrs, pages
instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them,
instead of reading directly to user space; got rid of mprotect_fixup()

Changelog[v4]:
  - Use standard list_... for ckpt_pgarr


Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 arch/x86/include/asm/ldt.h   |7 +
 arch/x86/mm/checkpoint.c |   64 ++
 checkpoint/memory.c  |  463 ++
 checkpoint/objhash.c |1 +
 checkpoint/process.c |3 +
 fs/exec.c|2 +-
 include/linux/checkpoint.h   |7 +
 include/linux/checkpoint_hdr.h   |2 +-
 include/linux/checkpoint_types.h |1 +
 include/linux/mm.h   |   12 +
 mm/filemap.c |   19 ++
 mm/mmap.c|   23 ++-
 12 files changed, 601 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/ldt.h b/arch/x86/include/asm/ldt.h
index 46727eb..f2845f9 100644
--- a/arch/x86/include/asm/ldt.h
+++ b/arch/x86/include/asm/ldt.h
@@ -37,4 +37,11 @@ struct user_desc {
 #define MODIFY_LDT_CONTENTS_CODE   2
 
 #endif /* !__ASSEMBLY__ */
+
+#ifdef __KERNEL__
+#include <linux/linkage.h>
+asmlinkage int sys_modify_ldt(int func, void __user *ptr,
+ unsigned long bytecount);
+#endif
+
 #endif /* _ASM_X86_LDT_H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index dc4fbb4..c781416 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -13,6 +13,7 @@
 
 #include <asm/desc.h>
 #include <asm/i387.h>
+#include <asm/elf.h>
 
 #include <linux/checkpoint_types.h>
 #include <asm/checkpoint_hdr.h>
@@ -461,3 +462,66 @@ int restore_read_header_arch(struct ckpt_ctx *ctx)
ckpt_hdr_put(ctx, h);
return ret;
 }
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+   struct ckpt_hdr_mm_context *h;
+   unsigned int n;
+   int ret;
+
+   h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+   if (IS_ERR(h))
+   return PTR_ERR(h);
+
+   ckpt_debug("nldt %d vdso %#lx (%p)\n",
+h->nldt, (unsigned long) h->vdso, mm->context.vdso);
+
+   ret = -EINVAL;
+   if (h->vdso != (unsigned long) mm->context.vdso)
+   goto out;
+   if (h->ldt_entry_size != LDT_ENTRY_SIZE)
+   goto out;
+
+   ret = _ckpt_read_obj_type(ctx, NULL,
+ h->nldt * LDT_ENTRY_SIZE,
+ CKPT_HDR_MM_CONTEXT_LDT);
+   if (ret < 0)
+   goto out;
+
+   /*
+* to utilize the syscall modify_ldt() we first convert the data
+* in the checkpoint image from 'struct desc_struct' to 'struct
+* user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+*/
+   for (n = 0; n < h->nldt; n++) {
+   struct user_desc info;
+   struct desc_struct desc;
+   mm_segment_t old_fs;
+
+   ret = ckpt_kread(ctx, &desc, LDT_ENTRY_SIZE);
+   if (ret < 0)
+   break;
+
+   info.entry_number = n;
+   info.base_addr = desc.base0 | (desc.base1 << 16);
+   info.limit = desc.limit0;
+   info.seg_32bit = desc.d;
+   info.contents = desc.type >> 2;
+   info.read_exec_only = (desc.type >> 1) ^ 1;
+   info.limit_in_pages = desc.g;
+   info.seg_not_present = desc.p ^ 1;
+   info.useable = desc.avl;
+
+   old_fs = get_fs();
+   set_fs(get_ds());
+   ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+   

[Devel] [RFC v16][PATCH 14/43] c/r: dump memory address space (private memory)

2009-05-27 Thread Oren Laadan
For each vma, there is a 'struct ckpt_vma'; Then comes the actual
contents, in one or more chunk: each chunk begins with a header that
specifies how many pages it holds, then the virtual addresses of all
the dumped pages in that chunk, followed by the actual contents of all
dumped pages. A header with zero number of pages marks the end of the
contents.  Then comes the next vma and so on.
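
The chunk layout described above can be sketched in userspace C. This is
only an illustration under stated assumptions: the header struct, the
field widths, the fixed 4096-byte page size and every sketch_* name are
hypothetical and are not taken from the patch, which defines its own
headers in checkpoint_hdr.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration */

/* Hypothetical chunk header: the text only says it holds a page count. */
struct sketch_chunk_hdr {
	uint32_t nr_pages;
};

/*
 * Append one chunk to buf at offset off: the header, then the virtual
 * addresses of all pages, then the contents of all pages.
 * Returns the offset just past the chunk.
 */
static size_t sketch_put_chunk(unsigned char *buf, size_t off,
			       const uint64_t *vaddrs,
			       const unsigned char *pages, uint32_t nr_pages)
{
	struct sketch_chunk_hdr hdr = { .nr_pages = nr_pages };

	memcpy(buf + off, &hdr, sizeof(hdr));
	off += sizeof(hdr);
	memcpy(buf + off, vaddrs, nr_pages * sizeof(*vaddrs));
	off += nr_pages * sizeof(*vaddrs);
	memcpy(buf + off, pages, (size_t)nr_pages * SKETCH_PAGE_SIZE);
	off += (size_t)nr_pages * SKETCH_PAGE_SIZE;
	return off;
}

/* A header with zero pages marks the end of one vma's contents. */
static size_t sketch_put_end(unsigned char *buf, size_t off)
{
	struct sketch_chunk_hdr hdr = { .nr_pages = 0 };

	memcpy(buf + off, &hdr, sizeof(hdr));
	return off + sizeof(hdr);
}

/*
 * Walk the chunks of one vma and return the number of dumped pages.
 * *off is advanced past the terminating zero-page header.
 */
static uint64_t sketch_count_pages(const unsigned char *buf, size_t *off)
{
	uint64_t total = 0;

	for (;;) {
		struct sketch_chunk_hdr hdr;

		memcpy(&hdr, buf + *off, sizeof(hdr));
		*off += sizeof(hdr);
		if (hdr.nr_pages == 0)
			break;
		*off += hdr.nr_pages * sizeof(uint64_t);	  /* vaddrs */
		*off += (size_t)hdr.nr_pages * SKETCH_PAGE_SIZE; /* contents */
		total += hdr.nr_pages;
	}
	return total;
}
```

Grouping all addresses before all page contents inside a chunk lets a
reader learn where every page belongs before it touches the page data.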

To checkpoint a vma, call the ops->checkpoint() method of that vma.
Normally the per-vma function will invoke generic_vma_checkpoint()
which first writes the vma description, followed by the specific
logic to dump the contents of the pages.

Currently for private mapped memory we save the pathname of the file
that is mapped (restart will use it to re-open it and then map it).
Later we change that to reference a file object.

Changelog[v16]:
  - Precede vaddrs/pages with a buffer header
  - Checkpoint mm->exe_file
  - Handle shared task->mm

Changelog[v14]:
  - Modify the ops->checkpoint method to be much more powerful
  - Improve support for VDSO (with special_mapping checkpoint callback)
  - Save new field 'vdso' in mm_context
  - Revert change to pr_debug(), back to ckpt_debug()
  - Check whether calls to ckpt_hbuf_get() fail
  - Discard field 'h->parent'

Changelog[v13]:
  - pgprot_t is an abstract type; use the proper accessor (fix for
    64-bit powerpc) (Nathan Lynch n...@pobox.com)

Changelog[v12]:
  - Hide pgarr management inside ckpt_private_vma_fill_pgarr()
  - Fix management of pgarr chain reset and alloc/expand: keep empty
    pgarr in a pool chain
  - Replace obsolete ckpt_debug() with pr_debug()

Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory

Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in ckpt_fill_name()

Changelog[v9]:
  - Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary). For now ckpt_fill_fname() fails the checkpoint.

Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore

Changelog[v6]:
  - Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
    (even though it's not really needed)

Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of vaddrs, pages
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages

Changelog[v4]:
  - Use standard list_... for ckpt_pgarr

Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 arch/x86/include/asm/checkpoint_hdr.h |8 +
 arch/x86/mm/checkpoint.c  |   32 ++
 checkpoint/Makefile   |3 +-
 checkpoint/memory.c   |  633 +
 checkpoint/objhash.c  |   19 +
 checkpoint/process.c  |   10 +
 checkpoint/sys.c  |4 +
 include/linux/checkpoint.h|   26 ++-
 include/linux/checkpoint_hdr.h|   47 +++
 include/linux/checkpoint_types.h  |3 +
 mm/filemap.c  |   28 ++
 mm/mmap.c |   31 ++
 12 files changed, 842 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/checkpoint_hdr.h 
b/arch/x86/include/asm/checkpoint_hdr.h
index 362b499..cf90170 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -43,6 +43,7 @@
 enum {
CKPT_HDR_THREAD_TLS = 201,
CKPT_HDR_CPU_FPU,
+   CKPT_HDR_MM_CONTEXT_LDT,
 };
 
 struct ckpt_hdr_header_arch {
@@ -107,4 +108,11 @@ struct ckpt_hdr_cpu {
/* thread_xstate contents follow (if used_math) */
 } __attribute__((aligned(8)));
 
+struct ckpt_hdr_mm_context {
+   struct ckpt_hdr h;
+   __u64 vdso;
+   __u32 ldt_entry_size;
+   __u32 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index f54fe80..dc4fbb4 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -14,6 +14,7 @@
 #include <asm/desc.h>
 #include <asm/i387.h>
 
+#include <linux/checkpoint_types.h>
 #include <asm/checkpoint_hdr.h>
 #include <linux/checkpoint.h>
 
@@ -239,6 +240,37 @@ int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
 	return ret;
 }
 
+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	mutex_lock(&mm->context.lock);
+
+	h->vdso = (unsigned long) mm->context.vdso;
+	h->ldt_entry_size = LDT_ENTRY_SIZE;
+	h->nldt = mm->context.size;
+
+	ckpt_debug("nldt %d vdso %#llx\n", h->nldt, h->vdso);
+
+	ret = ckpt_write_obj(ctx, &h->h);
+   

[Devel] [RFC v16][PATCH 42/43] c/r: add CKPT_COPY() macro

2009-05-27 Thread Oren Laadan
From: Dan Smith da...@us.ibm.com

As suggested by Dave[1], this provides us a way to make the copy-in and
copy-out processes symmetric.  CKPT_COPY_ARRAY() provides us a way to do
the same thing but for arrays.  It's not critical, but it helps us unify
the checkpoint and restart paths for some things.
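
As a rough illustration of the symmetry, the userspace sketch below uses
a macro of the same shape as CKPT_COPY() so that a single routine serves
both directions. The macro arguments are parenthesized here for safety,
and the two structs and sketch_copy_regs() are hypothetical names made
up for this example; they do not appear in the patch.

```c
#define CKPT_CPT 1
#define CKPT_RST 2

/* Same idea as the macro in the patch: op selects the copy direction. */
#define CKPT_COPY(op, SAVE, LIVE)		\
	do {					\
		if ((op) == CKPT_CPT)		\
			(SAVE) = (LIVE);	\
		else				\
			(LIVE) = (SAVE);	\
	} while (0)

/* Hypothetical live state and its counterpart in the saved image. */
struct sketch_regs {
	unsigned long ip;
	unsigned long sp;
};

struct sketch_hdr_cpu {
	unsigned long long ip;
	unsigned long long sp;
};

/* One routine handles checkpoint (CKPT_CPT) and restart (CKPT_RST). */
static void sketch_copy_regs(int op, struct sketch_hdr_cpu *h,
			     struct sketch_regs *regs)
{
	CKPT_COPY(op, h->ip, regs->ip);
	CKPT_COPY(op, h->sp, regs->sp);
}
```

Because each field is listed once, a field added for checkpoint cannot
be forgotten on the restart side.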

Changelog:
Mar 04:
. Removed semicolons
. Added build-time check for __must_be_array in CKPT_COPY_ARRAY
Feb 27:
. Changed CKPT_COPY() to use assignment, eliminating the need
  for the CKPT_COPY_BIT() macro
. Add CKPT_COPY_ARRAY() macro to help copying register arrays,
  etc
. Move the macro definitions inside the CR #ifdef
Feb 25:
. Changed WARN_ON() to BUILD_BUG_ON()

Signed-off-by: Dan Smith da...@us.ibm.com
Signed-off-by: Oren Laadan or...@cs.columbia.edu

1: 
https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html
 (all the way at the bottom)
---
 include/linux/checkpoint.h |   29 +
 1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 064dd25..669e90c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -157,6 +157,34 @@ extern int restore_memory_contents(struct ckpt_ctx *ctx, 
struct inode *inode);
 VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
 
 
+/* useful macros to copy fields and buffers to/from ckpt_hdr_xxx structures */
+#define CKPT_CPT 1
+#define CKPT_RST 2
+
+#define CKPT_COPY(op, SAVE, LIVE)  \
+   do {\
+   if (op == CKPT_CPT) \
+   SAVE = LIVE;\
+   else\
+   LIVE = SAVE;\
+   } while (0)
+
+/*
+ * Copy @count items from @LIVE to @SAVE if op is CKPT_CPT (otherwise,
+ * copy in the reverse direction)
+ */
+#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count) \
+   do {\
+   (void)__must_be_array(SAVE);\
+   (void)__must_be_array(LIVE);\
+   BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));   \
+   if (op == CKPT_CPT) \
+   memcpy(SAVE, LIVE, count * sizeof(*SAVE));  \
+   else\
+   memcpy(LIVE, SAVE, count * sizeof(*SAVE));  \
+   } while (0)
+
+
 /* debugging flags */
 #define CKPT_DBASE 0x1 /* anything */
 #define CKPT_DSYS  0x2 /* generic (system) */
@@ -189,6 +217,7 @@ extern unsigned long ckpt_debug_level;
  * CKPT_DBASE is the base flags, doesn't change
  * CKPT_DFLAG is to be redfined in each source file
  */
+
 #define ckpt_debug(fmt, args...)  \
_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
 
-- 
1.6.0.4

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC v16][PATCH 28/43] c/r: make ckpt_may_checkpoint_task() check each namespace individually

2009-05-27 Thread Oren Laadan
From: Dan Smith da...@us.ibm.com

Signed-off-by: Dan Smith da...@us.ibm.com
Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/checkpoint.c|   20 ++--
 checkpoint/objhash.c   |   28 +++
 checkpoint/process.c   |  101 
 include/linux/checkpoint.h |4 ++
 include/linux/checkpoint_hdr.h |8 +++
 5 files changed, 157 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index b70adf4..e66f82b 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -267,6 +267,8 @@ static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
 static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 {
 	struct task_struct *root = ctx->root_task;
+	struct nsproxy *nsproxy;
+	int ret = 0;
 
 	ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
 
@@ -306,11 +308,21 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 		return -EINVAL;
 	}
 
-	/* FIX: change this when namespaces are added */
-	if (task_nsproxy(t) != ctx->root_nsproxy)
-		return -EPERM;
+	rcu_read_lock();
+	nsproxy = task_nsproxy(t);
+	if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns)
+		ret = -EPERM;
+	if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
+		ret = -EPERM;
+	if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns)
+		ret = -EPERM;
+	if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns)
+		ret = -EPERM;
+	if (nsproxy->net_ns != ctx->root_nsproxy->net_ns)
+		ret = -EPERM;
+	rcu_read_unlock();
 
-	return 0;
+	return ret;
 }
 
 #define CKPT_HDR_PIDS_CHUNK256
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index e481911..56553ae 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -127,6 +127,22 @@ static int obj_mm_users(void *ptr)
 	return atomic_read(&((struct mm_struct *) ptr)->mm_users);
 }
 
+static int obj_ns_grab(void *ptr)
+{
+	get_nsproxy((struct nsproxy *) ptr);
+	return 0;
+}
+
+static void obj_ns_drop(void *ptr)
+{
+	put_nsproxy((struct nsproxy *) ptr);
+}
+
+static int obj_ns_users(void *ptr)
+{
+	return atomic_read(&((struct nsproxy *) ptr)->count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
 	/* ignored object */
 	{
@@ -174,6 +190,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
 		.checkpoint = checkpoint_mm,
 		.restore = restore_mm,
 	},
+	/* ns object */
+	{
+		.obj_name = "NSPROXY",
+		.obj_type = CKPT_OBJ_NS,
+		.ref_drop = obj_ns_drop,
+		.ref_grab = obj_ns_grab,
+		.ref_users = obj_ns_users,
+		.checkpoint = checkpoint_ns,
+		.restore = restore_ns,
+	},
 };
 
 
@@ -396,6 +422,8 @@ int ckpt_obj_contained(struct ckpt_ctx *ctx)
 
 	/* account for ctx->file reference (if in the table already) */
 	ckpt_obj_users_inc(ctx, ctx->file, 1);
+	/* account for ctx->root_nsproxy reference (if in the table already) */
+	ckpt_obj_users_inc(ctx, ctx->root_nsproxy, 1);
 
hlist_for_each_entry(obj, node, ctx-obj_hash-list, next) {
if (!obj-ops-ref_users)
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 876be3e..fbe0d16 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -12,6 +12,7 @@
 #define CKPT_DFLAG  CKPT_DSYS
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/posix-timers.h>
 #include <linux/futex.h>
 #include <linux/poll.h>
@@ -49,6 +50,45 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
 	return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+
+static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
+{
+	return 0;
+}
+
+int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+	return do_checkpoint_ns(ctx, (struct nsproxy *) ptr);
+}
+
+static int checkpoint_task_ns(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_task_ns *h;
+	struct nsproxy *nsproxy;
+	int ns_objref;
+	int ret;
+
+	rcu_read_lock();
+	nsproxy = task_nsproxy(t);
+	get_nsproxy(nsproxy);
+	rcu_read_unlock();
+
+	ns_objref = checkpoint_obj(ctx, nsproxy, CKPT_OBJ_NS);
+	put_nsproxy(nsproxy);
+
+	ckpt_debug("nsproxy: objref %d\n", ns_objref);
+	if (ns_objref < 0)
+		return ns_objref;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_NS);
+	if (!h)
+		return -ENOMEM;
+	h->ns_objref = ns_objref;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+	return ret;
+}
+
 static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
 {
struct ckpt_hdr_task_objs *h;
@@ -56,6 +96,18 

[Devel] [RFC v16][PATCH 30/43] c/r: stub implementation for IPC namespace

2009-05-27 Thread Oren Laadan
From: Dan Smith da...@us.ibm.com

Changes:
 - Update to match UTS changes

Signed-off-by: Dan Smith da...@us.ibm.com
Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 checkpoint/checkpoint.c|2 --
 checkpoint/objhash.c   |   28 
 checkpoint/process.c   |   24 ++--
 include/linux/checkpoint.h |   15 +++
 include/linux/checkpoint_hdr.h |3 +++
 5 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 904f19b..afc7300 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -310,8 +310,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct 
task_struct *t)
 
rcu_read_lock();
nsproxy = task_nsproxy(t);
-   if (nsproxy-ipc_ns != ctx-root_nsproxy-ipc_ns)
-   ret = -EPERM;
if (nsproxy-mnt_ns != ctx-root_nsproxy-mnt_ns)
ret = -EPERM;
if (nsproxy-pid_ns != ctx-root_nsproxy-pid_ns)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 8b7adc6..045a920 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -15,6 +15,8 @@
 #include linux/hash.h
 #include linux/file.h
 #include linux/fdtable.h
+#include linux/sched.h
+#include linux/ipc_namespace.h
 #include linux/checkpoint.h
 #include linux/checkpoint_hdr.h
 
@@ -159,6 +161,22 @@ static int obj_uts_ns_users(void *ptr)
return atomic_read(((struct uts_namespace *) ptr)-kref.refcount);
 }
 
+static int obj_ipc_ns_grab(void *ptr)
+{
+   get_ipc_ns((struct ipc_namespace *) ptr);
+   return 0;
+}
+
+static void obj_ipc_ns_drop(void *ptr)
+{
+   put_ipc_ns((struct ipc_namespace *) ptr);
+}
+
+static int obj_ipc_ns_users(void *ptr)
+{
+   return atomic_read(((struct ipc_namespace *) ptr)-count);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
/* ignored object */
{
@@ -226,6 +244,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.checkpoint = checkpoint_bad,
.restore = restore_bad,
},
+   /* ipc_ns object */
+   {
+   .obj_name = IPC_NS,
+   .obj_type = CKPT_OBJ_IPC_NS,
+   .ref_drop = obj_ipc_ns_drop,
+   .ref_grab = obj_ipc_ns_grab,
+   .ref_users = obj_ipc_ns_users,
+   .checkpoint = checkpoint_bad,
+   .restore = restore_bad,
+   },
 };
 
 
diff --git a/checkpoint/process.c b/checkpoint/process.c
index a827987..eff3d76 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -89,6 +89,7 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct 
nsproxy *nsproxy)
struct ckpt_hdr_ns *h;
int ns_flags = 0;
int uts_objref;
+   int ipc_objref;
int first, ret;
 
uts_objref = ckpt_obj_lookup_add(ctx, nsproxy-uts_ns,
@@ -98,12 +99,20 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct 
nsproxy *nsproxy)
if (first)
ns_flags |= CLONE_NEWUTS;
 
+   ipc_objref = ckpt_obj_lookup_add(ctx, nsproxy-ipc_ns,
+CKPT_OBJ_IPC_NS, first);
+   if (ipc_objref  0)
+   return ipc_objref;
+   if (first)
+   ns_flags |= CLONE_NEWIPC;
+
h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NS);
if (!h)
return -ENOMEM;
 
h-flags = ns_flags;
h-uts_objref = uts_objref;
+   h-ipc_objref = ipc_objref;
 
ret = ckpt_write_obj(ctx, h-h);
ckpt_hdr_put(ctx, h);
@@ -112,6 +121,10 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct 
nsproxy *nsproxy)
 
 	if (ns_flags & CLONE_NEWUTS)
 		ret = checkpoint_uts_ns(ctx, nsproxy->uts_ns);
+#if 0
+	if (!ret && (ns_flags & CLONE_NEWIPC))
+		ret = checkpoint_ipc_ns(ctx, nsproxy->ipc_ns);
+#endif
 
/* FIX: Write other namespaces here */
return ret;
@@ -438,9 +451,10 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
return (struct nsproxy *) h;
 
 	ret = -EINVAL;
-	if (h->uts_objref <= 0)
+	if (h->uts_objref <= 0 ||
+	    h->ipc_objref <= 0)
 		goto out;
-	if (h->flags & ~CLONE_NEWUTS)
+	if (h->flags & ~(CLONE_NEWUTS | CLONE_NEWIPC))
 		goto out;
 
/* each unseen-before namespace will be un-shared now */
@@ -456,6 +470,12 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
 */
ret = restore_uts_ns(ctx, h-uts_objref, h-flags);
ckpt_debug(uts ns: %d\n, ret);
+   if (ret  0)
+   goto out;
+#if 0
+   ret = restore_ipc_ns(ctx, h-ipc_objref, h-flags);
+   ckpt_debug(ipc ns: %d\n, ret);
+#endif
 
/* FIX: add more namespaces here */
  out:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index a7125fc..5a42399 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -81,6 +81,21 

[Devel] [RFC v16][PATCH 43/43] c/r: define s390-specific checkpoint-restart code

2009-05-27 Thread Oren Laadan
From: Dan Smith da...@us.ibm.com

Implement the s390 arch-specific checkpoint/restart helpers.  This
is on top of Oren Laadan's c/r code.

With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro.  While on x86 I never had to freeze a single task
to checkpoint it, on s390 I do need to.  That is a prereq for consistent
snapshots (esp with multiple processes) anyway so I don't see that as
a problem.

Changelog:
Apr 11:
. Introduce ckpt_arch_vdso()
Feb 27:
. Add checkpoint_s390.h
. Fixed up save and restore of PSW, with the non-address bits
  properly masked out
Feb 25:
. Make checkpoint_hdr.h safe for inclusion in userspace
. Replace comment about vdso code
. Add comment about restoring access registers
. Write and read an empty ckpt_hdr_head_arch record to appease
  code (mktree) that expects it to be there
. Utilize NUM_CKPT_WORDS in checkpoint_hdr.h
Feb 24:
. Use CKPT_COPY() to unify the un/loading of cpu and mm state
. Fix fprs definition in ckpt_hdr_cpu
. Remove debug WARN_ON() from checkpoint.c
Feb 23:
. Macro-ize the un/packing of trace flags
. Fix the crash when externally-linked
. Break out the restart functions into restart.c
. Remove unneeded s390_enable_sie() call
Jan 30:
. Switched types in ckpt_hdr_cpu to __u64 etc.
  (Per Oren suggestion)
. Replaced direct inclusion of structs in
  ckpt_hdr_cpu with the struct members.
  (Per Oren suggestion)
. Also ended up adding a bunch of new things
  into restart (mm_segment, ksp, etc) in vain
  attempt to get code using fpu to not segfault
  after restart.

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
Signed-off-by: Dan Smith da...@us.ibm.com
---
 arch/s390/include/asm/checkpoint_hdr.h |   81 ++
 arch/s390/include/asm/unistd.h |4 +-
 arch/s390/kernel/compat_wrapper.S  |   12 ++
 arch/s390/kernel/syscalls.S|2 +
 arch/s390/mm/Makefile  |1 +
 arch/s390/mm/checkpoint.c  |  183 
 arch/s390/mm/checkpoint_s390.h |   23 
 7 files changed, 305 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/checkpoint_hdr.h 
b/arch/s390/include/asm/checkpoint_hdr.h
new file mode 100644
index 000..185194b
--- /dev/null
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -0,0 +1,81 @@
+#ifndef __ASM_S390_CKPT_HDR_H
+#define __ASM_S390_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers s/390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/checkpoint_hdr.h>
+#include <asm/ptrace.h>
+
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+/*
+ * Notes
+ * NUM_GPRS defined in asm/ptrace.h to be 16
+ * NUM_FPRS defined in asm/ptrace.h to be 16
+ * NUM_APRS defined in asm/ptrace.h to be 16
+ * NUM_CR_WORDS defined in asm/ptrace.h to be 3
+ */
+struct ckpt_hdr_cpu {
+   struct ckpt_hdr h;
+   __u64 args[1];
+   __u64 gprs[NUM_GPRS];
+   __u64 orig_gpr2;
+   __u16 svcnr;
+   __u16 ilc;
+   __u32 acrs[NUM_ACRS];
+   __u64 ieee_instruction_pointer;
+
+   /* psw_t */
+   __u64 psw_t_mask;
+   __u64 psw_t_addr;
+
+   /* s390_fp_regs_t */
+   __u32 fpc;
+   union {
+   float f;
+   double d;
+   __u64 ui;
+   struct {
+   __u32 fp_hi;
+   __u32 fp_lo;
+   } fp;
+   } fprs[NUM_FPRS];
+
+   /* per_struct */
+   __u64 per_control_regs[NUM_CR_WORDS];
+   __u64 starting_addr;
+   __u64 ending_addr;
+   __u64 address;
+   __u16 perc_atmid;
+   __u8 access_id;
+   __u8 single_step;
+   __u8 instruction_fetch;
+};
+
+struct ckpt_hdr_mm_context {
+   struct ckpt_hdr h;
+   unsigned long vdso_base;
+   int noexec;
+   int has_pgste;
+   int alloc_pgste;
+   unsigned long asce_bits;
+   unsigned long asce_limit;
+};
+
+struct ckpt_hdr_header_arch {
+   struct ckpt_hdr h;
+};
+
+#endif /* __ASM_S390_CKPT_HDR__H */
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index f0f19e6..3d22f17 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -267,7 +267,9 @@
 #define __NR_epoll_create1 327
 #define__NR_preadv 328
 #define__NR_pwritev329
-#define NR_syscalls 330
+#define __NR_checkpoint330
+#define 

[Devel] [RFC v16][PATCH 11/43] c/r: add generic '->checkpoint' f_op to ext fses

2009-05-27 Thread Oren Laadan
From: Dave Hansen d...@linux.vnet.ibm.com

This marks ext[234] as being checkpointable.  There will be many
more to do this to, but this is a start.

Signed-off-by: Dave Hansen d...@linux.vnet.ibm.com
---
 fs/ext2/dir.c  |1 +
 fs/ext2/file.c |2 ++
 fs/ext3/dir.c  |1 +
 fs/ext3/file.c |1 +
 fs/ext4/dir.c  |1 +
 fs/ext4/file.c |1 +
 6 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 2999d72..4f1dd79 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -721,4 +721,5 @@ const struct file_operations ext2_dir_operations = {
.compat_ioctl   = ext2_compat_ioctl,
 #endif
.fsync  = ext2_sync_file,
+   .checkpoint = generic_file_checkpoint,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 45ed071..e1731c5 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -58,6 +58,7 @@ const struct file_operations ext2_file_operations = {
.fsync  = ext2_sync_file,
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
+   .checkpoint = generic_file_checkpoint,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
@@ -73,6 +74,7 @@ const struct file_operations ext2_xip_file_operations = {
.open   = generic_file_open,
.release= ext2_release_file,
.fsync  = ext2_sync_file,
+   .checkpoint = generic_file_checkpoint,
 };
 #endif
 
diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 3d724a9..54b05d2 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
 #endif
.fsync  = ext3_sync_file,   /* BKL held */
.release= ext3_release_dir,
+   .checkpoint = generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..a421e07 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -126,6 +126,7 @@ const struct file_operations ext3_file_operations = {
.fsync  = ext3_sync_file,
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
+   .checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index b647899..2787fdb 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_operations = {
 #endif
.fsync  = ext4_sync_file,
.release= ext4_release_dir,
+   .checkpoint = generic_file_checkpoint,
 };
 
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 588af8c..c2dab33 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -161,6 +161,7 @@ const struct file_operations ext4_file_operations = {
.fsync  = ext4_sync_file,
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
+   .checkpoint = generic_file_checkpoint,
 };
 
 const struct inode_operations ext4_file_inode_operations = {
-- 
1.6.0.4



[Devel] Re: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page

2009-05-27 Thread Oren Laadan


Sukadev Bhattiprolu wrote:
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:39 -0700
 Subject: [PATCH 1/7] [PATCH] Factor out code to allocate pidmap page
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---

Reviewed-by: Oren Laadan or...@cs.columbia.edu

  kernel/pid.c |   43 ---
  1 files changed, 28 insertions(+), 15 deletions(-)
 
 diff --git a/kernel/pid.c b/kernel/pid.c
 index b2e5f78..c0aaebe 100644
 --- a/kernel/pid.c
 +++ b/kernel/pid.c
 @@ -122,6 +122,31 @@ static void free_pidmap(struct upid *upid)
   atomic_inc(map-nr_free);
  }
  
 +static int alloc_pidmap_page(struct pidmap *map)
 +{
 +	void *page;
 +
 +	if (likely(map->page))
 +		return 0;
 +
 +	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
 +
 +	/*
 +	 * Free the page if someone raced with us installing it:
 +	 */
 +	spin_lock_irq(&pidmap_lock);
 +	if (map->page)
 +		kfree(page);
 +	else
 +		map->page = page;
 +	spin_unlock_irq(&pidmap_lock);
 +
 +	if (unlikely(!map->page))
 +		return -1;
 +
 +	return 0;
 +}
 +
  static int alloc_pidmap(struct pid_namespace *pid_ns)
  {
   int i, offset, max_scan, pid, last = pid_ns-last_pid;
 @@ -134,21 +159,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
   map = pid_ns-pidmap[pid/BITS_PER_PAGE];
   max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
   for (i = 0; i = max_scan; ++i) {
 - if (unlikely(!map-page)) {
 - void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
 - /*
 -  * Free the page if someone raced with us
 -  * installing it:
 -  */
 - spin_lock_irq(pidmap_lock);
 - if (map-page)
 - kfree(page);
 - else
 - map-page = page;
 - spin_unlock_irq(pidmap_lock);
 - if (unlikely(!map-page))
 - break;
 - }
 + if (alloc_pidmap_page(map))
 + break;
 +
   if (likely(atomic_read(map-nr_free))) {
   do {
   if (!test_and_set_bit(offset, map-page)) {


[Devel] Re: [RFC v16][PATCH 23/43] c/r: restart multiple processes

2009-05-27 Thread Alexey Dobriyan
On Wed, May 27, 2009 at 01:32:49PM -0400, Oren Laadan wrote:
 Restarting of multiple processes expects all restarting tasks to call
 sys_restart(). Once inside the system call, each task will restart
 itself in the same order that they were saved. The internals of the
 syscall will take care of in-kernel synchronization between tasks.
 
 This patch does _not_ create the task tree in the kernel. Instead it
 assumes that all tasks are created in some way and then invoke the
 restart syscall. You can use the userspace mktree.c program to do
 that.
 
 The init task (*) has a special role: it allocates the restart context
 (ctx), and coordinates the operation. In particular, it first waits
 until all participating tasks enter the kernel, and provides them the
 common restart context. Once everyone is ready, it begins to restart
 itself.
 
 In contrast, the other tasks enter the kernel, locate the init task (*)
 and grab its restart context, and then wait for their turn to restore.
 
 When a task (init or not) completes its restart, it hands the control
 over to the next in line, by waking that task.
 
 An array of pids (the one saved during the checkpoint) is used to
 synchronize the operation. The first task in the array is the init
 task (*). The restart context (ctx) maintains a current position in
 the array, which indicates which task is currently active. Once the
 currently active task completes its own restart, it increments that
 position and wakes up the next task.
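
The hand-over scheme in the paragraph above can be modelled with a small
single-threaded sketch. Everything here is an assumption made for
illustration: the struct, its fields and the sketch_* functions are
hypothetical, and the sleeping and waking that the kernel code would do
is left out.

```c
#include <stddef.h>

/* Hypothetical restart context: pids in checkpoint order plus a cursor. */
struct sketch_restart_ctx {
	const int *pids;	/* pids[0] is the init (coordinator) task */
	size_t nr_pids;
	size_t active;		/* index of the task allowed to restore now */
};

/* Is it this task's turn to restore itself? */
static int sketch_is_active(const struct sketch_restart_ctx *ctx, int pid)
{
	return ctx->active < ctx->nr_pids && ctx->pids[ctx->active] == pid;
}

/*
 * Called by the active task once its own restart is complete.
 * Returns the pid of the next task to wake, or -1 when all are done.
 */
static int sketch_hand_over(struct sketch_restart_ctx *ctx)
{
	ctx->active++;
	if (ctx->active >= ctx->nr_pids)
		return -1;
	return ctx->pids[ctx->active];
}
```

In this model a task that is not active would simply wait until
sketch_is_active() becomes true for its pid.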
 
 Restart assumes that userspace provides meaningful data, otherwise
 it's garbage-in-garbage-out. In this case, the syscall may block
 indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
 otherwise kill the stray restarting tasks.
 
 In terms of security, restart runs as the user that invokes it, so it
 will not allow a user to do more than is otherwise permitted by the
 usual system semantics and policy.
 
 Currently we ignore threads and zombies

Let's discuss threads and zombies.

1. Will a zombie end up in the image?
2. If yes, how will it be restored? Will it be forked, call restart(2),
   and then somehow be zombified inside the kernel?
3. How will a thread group be restored? Will every thread be CLONE_THREAD'ed?
   What about exited thread group leaders: will they be forked, and then
   the thread group CLONE_THREAD'ed?
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/7] [PATCH] Have alloc_pidmap() return actual error code

2009-05-27 Thread Oren Laadan


Sukadev Bhattiprolu wrote:
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:40 -0700
 Subject: [PATCH 2/7] [PATCH] Have alloc_pidmap() return actual error code
 
 alloc_pidmap() can fail either because all pid numbers are in use or
 we can't allocate memory. With support for setting a specific pid
 number, alloc_pidmap() would also fail if either the given pid
 number is invalid or in use.
 
 Rather than have caller assume -ENOMEM, have alloc_pidmap() return
 the actual error.
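
The error-in-pointer convention this relies on can be sketched in
userspace as follows. The sketch_* helpers are re-creations written for
this example (the kernel provides ERR_PTR(), IS_ERR() and PTR_ERR() in
linux/err.h), and sketch_alloc_pid() is a hypothetical stand-in, not the
real alloc_pid().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Pointers in the top SKETCH_MAX_ERRNO bytes are treated as errors. */
static inline int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Recover the errno value from an error pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

struct sketch_pid {
	int nr;
};

/* Hypothetical allocator that reports why it failed, not only that it did. */
static struct sketch_pid *sketch_alloc_pid(struct sketch_pid *slot, int nr_free)
{
	if (!slot)
		return sketch_err_ptr(-ENOMEM);	/* no memory for the object */
	if (nr_free <= 0)
		return sketch_err_ptr(-EAGAIN);	/* every pid number in use */
	slot->nr = 1;
	return slot;
}
```

A caller then tests the result with sketch_is_err() and forwards
sketch_ptr_err() instead of assuming -ENOMEM.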
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---

Reviewed-by: Oren Laadan or...@cs.columbia.edu

  kernel/fork.c |5 +++--
  kernel/pid.c  |9 ++---
  2 files changed, 9 insertions(+), 5 deletions(-)
 
 diff --git a/kernel/fork.c b/kernel/fork.c
 index b9e2edd..f8411a8 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -1119,10 +1119,11 @@ static struct task_struct *copy_process(unsigned long 
 clone_flags,
   goto bad_fork_cleanup_io;
  
  	if (pid != &init_struct_pid) {
 -		retval = -ENOMEM;
  		pid = alloc_pid(p->nsproxy->pid_ns);
 -		if (!pid)
 +		if (IS_ERR(pid)) {
 +			retval = PTR_ERR(pid);
  			goto bad_fork_cleanup_io;
 +		}
  
  		if (clone_flags & CLONE_NEWPID) {
  			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
 diff --git a/kernel/pid.c b/kernel/pid.c
 index c0aaebe..fd72ad9 100644
 --- a/kernel/pid.c
 +++ b/kernel/pid.c
 @@ -151,6 +151,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
  {
  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
  	struct pidmap *map;
 +	int rc = -EAGAIN;
  
  	pid = last + 1;
  	if (pid >= pid_max)
 @@ -159,8 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
  	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
  	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
  	for (i = 0; i <= max_scan; ++i) {
 -		if (alloc_pidmap_page(map))
 +		if (alloc_pidmap_page(map)) {
 +			rc = -ENOMEM;
  			break;
 +		}
  
  		if (likely(atomic_read(&map->nr_free))) {
  			do {
 @@ -192,7 +195,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
   }
   pid = mk_pid(pid_ns, map, offset);
   }
 - return -1;
 + return rc;
  }
  
  int next_pidmap(struct pid_namespace *pid_ns, int last)
 @@ -297,7 +300,7 @@ out_free:
   free_pidmap(pid-numbers + i);
  
   kmem_cache_free(ns-pid_cachep, pid);
 - pid = NULL;
 + pid = ERR_PTR(nr);
   goto out;
  }
  


[Devel] Re: [PATCH 6/7] [PATCH] Define do_fork_with_pids()

2009-05-27 Thread Oren Laadan


Sukadev Bhattiprolu wrote:
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:44 -0700
 Subject: [PATCH 6/7] [PATCH] Define do_fork_with_pids()
 
 do_fork_with_pids() is same as do_fork(), except that it takes an
 additional, target_pids, parameter. This parameter, currently unused,
 specifies the target_pids of the process in each of its pid namespaces.
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---

Reviewed-by: Oren Laadan or...@cs.columbia.edu

  include/linux/sched.h |1 +
  kernel/fork.c |   17 ++---
  2 files changed, 15 insertions(+), 3 deletions(-)
 
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index b4c38bc..2173df1 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1995,6 +1995,7 @@ extern int disallow_signal(int);
  
  extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
  extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
 +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t *target_pids);
  struct task_struct *fork_idle(int);
  
  extern void set_task_comm(struct task_struct *tsk, char *from);
 diff --git a/kernel/fork.c b/kernel/fork.c
 index 373411e..912d008 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -1340,17 +1340,17 @@ struct task_struct * __cpuinit fork_idle(int cpu)
   * It copies the process, and if successful kick-starts
   * it and waits for it to finish using the VM if required.
   */
 -long do_fork(unsigned long clone_flags,
 +long do_fork_with_pids(unsigned long clone_flags,
 unsigned long stack_start,
 struct pt_regs *regs,
 unsigned long stack_size,
 int __user *parent_tidptr,
 -   int __user *child_tidptr)
 +   int __user *child_tidptr,
 +   pid_t *target_pids)
  {
   struct task_struct *p;
   int trace = 0;
   long nr;
 - pid_t *target_pids = NULL;
  
   /*
* Do some preliminary argument and permissions checking before we
 @@ -1448,6 +1448,17 @@ long do_fork(unsigned long clone_flags,
   return nr;
  }
  
 +long do_fork(unsigned long clone_flags,
 +   unsigned long stack_start,
 +   struct pt_regs *regs,
 +   unsigned long stack_size,
 +   int __user *parent_tidptr,
 +   int __user *child_tidptr)
 +{
 + return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
 + parent_tidptr, child_tidptr, NULL);
 +}
 +
  #ifndef ARCH_MIN_MMSTRUCT_ALIGN
  #define ARCH_MIN_MMSTRUCT_ALIGN 0
  #endif


[Devel] Re: [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()

2009-05-27 Thread Oren Laadan


Sukadev Bhattiprolu wrote:
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:43 -0700
 Subject: [PATCH 5/7] [PATCH] Add target_pids parameter to copy_process()
 
 The new parameter will be used in a follow-on patch when clone_with_pids()
 is implemented.
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---

Reviewed-by: Oren Laadan or...@cs.columbia.edu

  kernel/fork.c |7 ---
  1 files changed, 4 insertions(+), 3 deletions(-)
 
 diff --git a/kernel/fork.c b/kernel/fork.c
 index d2d69d3..373411e 100644
 --- a/kernel/fork.c
 +++ b/kernel/fork.c
 @@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
   unsigned long stack_size,
   int __user *child_tidptr,
   struct pid *pid,
 + pid_t *target_pids,
   int trace)
  {
   int retval;
   struct task_struct *p;
   int cgroup_callbacks_done = 0;
 - pid_t *target_pids = NULL;
  
   if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
   return ERR_PTR(-EINVAL);
 @@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
   struct pt_regs regs;
  
   task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
 - &init_struct_pid, 0);
 + &init_struct_pid, NULL, 0);
   if (!IS_ERR(task))
   init_idle(task, cpu);
  
 @@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags,
   struct task_struct *p;
   int trace = 0;
   long nr;
 + pid_t *target_pids = NULL;
  
   /*
* Do some preliminary argument and permissions checking before we
 @@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags,
   trace = tracehook_prepare_clone(clone_flags);
  
   p = copy_process(clone_flags, stack_start, regs, stack_size,
 -  child_tidptr, NULL, trace);
 +  child_tidptr, NULL, target_pids, trace);
   /*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.


[Devel] Re: [PATCH 7/7] [PATCH] Define clone_with_pids syscall

2009-05-27 Thread Oren Laadan


Sukadev Bhattiprolu wrote:
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 4 May 2009 01:17:45 -0700
 Subject: [PATCH 7/7] [PATCH] Define clone_with_pids syscall
 
 clone_with_pids() is the same as clone(), except that it takes a 'target_pid_set'
 parameter which lets the caller choose a specific pid number for the child process
 in each of the child process's pid namespaces. This system call would be needed
 to implement Checkpoint/Restart (i.e., after a checkpoint, restart a process with
 its original pids).
 
 Call clone_with_pids as follows:
 
   pid_t pids[] = { 0, 77, 99 };
   struct target_pid_set pid_set;
 
   pid_set.num_pids = sizeof(pids) / sizeof(int);
   pid_set.target_pids = pids;
 
   syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);
 
 If a target-pid is 0, the kernel continues to assign a pid for the process in
 that namespace. In the above example, pids[0] is 0, meaning the kernel will
 assign the next available pid to the process in init_pid_ns. But the kernel will
 assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If
 either 77 or 99 is taken, the system call fails with -EBUSY.
 
 If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
 the system call fails with -EINVAL.
 
 It's mostly an exploratory patch seeking feedback on the interface.
 
 NOTE:
   Compared to clone(), clone_with_pids() needs to pass in two more
   pieces of information:
 
   - number of pids in the set
   - user buffer containing the list of pids.
 
   But since clone() already takes 5 parameters, use a 'struct
   target_pid_set'.
 
 TODO:
   - Gently tested.
   - May need additional sanity checks in check_target_pids()
   - Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in
 the namespace is either 1 or 0).
 
 Changelog[v1]:
   - Fixed some compile errors (had fixed these errors earlier in my
 git tree but had not refreshed patches before emailing them)
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---

Reviewed-by: Oren Laadan or...@cs.columbia.edu

but ...

[...]

 +static pid_t *copy_target_pids(unsigned long clone_flags, void __user *upid_setp)
 +{
 + int rc;
 + int size;
 + pid_t __user *utarget_pids;
 + pid_t *target_pids;
 + struct target_pid_set pid_set;
 +
 + if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set)))
 + return ERR_PTR(-EFAULT);
 +
 + size = pid_set.num_pids * sizeof(pid_t);

...either test pid_set.num_pids < 0 (and give -EINVAL),
or...

[...]

  
 +struct target_pid_set {
 + int num_pids;

... make this 'size_t' ?


 + pid_t *target_pids;
 +};
 +
  #endif   /* __KERNEL__ */
  #endif /*  __ASSEMBLY__ */
  #endif /* _LINUX_TYPES_H */
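
To make the review comment concrete: the size computation is only safe if the
count is checked before the multiplication. A minimal userland sketch follows.
target_pids_size() and SKETCH_MAX_LEVELS are hypothetical names invented for
this example, and the cap of 32 is an assumption, not a value from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical cap on pid-namespace nesting, for illustration only. */
#define SKETCH_MAX_LEVELS 32

/*
 * Return the buffer size in bytes for num_pids entries, or a negative
 * errno. Rejecting a bad count before multiplying avoids both a negative
 * size and an integer overflow in the product.
 */
static long target_pids_size(int num_pids)
{
	if (num_pids < 0 || num_pids > SKETCH_MAX_LEVELS)
		return -EINVAL;
	return (long)num_pids * (long)sizeof(pid_t);
}
```

Making num_pids a size_t, as suggested above, removes the negative case but
still leaves the upper bound to be checked.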


[Devel] Re: [PATCH 18/38] C/R: core stuff

2009-05-27 Thread Oren Laadan


Alexey Dobriyan wrote:
 On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
 Quoting Alexey Dobriyan (adobri...@gmail.com):
 Introduction
 
 Checkpoint/restart (C/R from now) allows to dump group of processes to disk
 for various reasons like saving process state in case of box failure or
 restoration of group of processes on another or same machine later.

 Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
 and dump more or less raw pages, proposed C/R doesn't require hypervisor.
 For that C/R code needs to know about all little and big intimate kernel details.

 The good thing is that not all details needs to be serialized and saved
 like, say, readahead state. The bad things is still quite a few things
 need to be.
 Hi Alexey,

 the last time you posted this, I went through and tried to discern the
 meaningful differences between yours and Oren's patchsets.  Then I sent some
 patches to Oren to make his set configurable to act more like yours.  And Oren
 took them!  But now you resend this patchset with no real changelog, no
 acknowledgment that Oren's set even exists
 
 Is this a requirement? Everybody following topic already knows about
 Oren's patchset.

Some people do ack other people's work. See for example patches #1
and #24 in my recent post. You're welcome.

 
 - or is much farther along and pretty widely reviewed and tested (which is
 only because he started earlier and, when we asked for your counterpatches
 at an earlier stage, you would never reply) - or, most importantly, what
 it is that you think your patchset does that his does not and cannot.
 
 There are differences. And they're not small like you're trying to describe
 but pretty big compared the scale of the problem.

I've asked before, and I repeat now: can you enumerate these big
scary differences that make it such a big problem ?

So far, we identified two main "design" issues -

1) Whether or not allow c/r of sub-container (partial hierarchy)

2) Creation of restarting process hierarchy in kernel or in userspace

As for #1, you are the _only_ one who advocates restricting c/r to
a full container only. I guess you have your reasons, but I'm unsure
what they may be.

On the other hand, there has been a handful of use-cases and opinions
in favor of allowing both capabilities to co-exist. Not to mention
that nearly no additional code is necessary, on the contrary.

As for #2, you didn't even bother to reply to the discussion that I
had started about it. This decision is important to allow future
flexibility of the mechanism, and to address the needs of several
potential users, as seen in that discussion and others. Here, too,
you are the _only_ one that advocates that direction.

And the funniest thing -- *both* decisions can be *easily* overturned
in my patchset. In fact, regarding #2 - either way can be easily done
in it.

So I wonder, what are the big issues that bother you so much ?
if there is a will, there is a way.

 
 *Why* are you spending your time on this instead of helping with Oren's set?
 
 Because we disagree with some core directions Oren chose.
 ANK literally said: "I don't know how to dump live netns."

Eh... and you have it all sorted out ?  (yeah, I do, but not in
this patchset).

 
 So, partly patchset was created so that absolutely nobody will tell us
 to shut up and show the code.

Oh well ... "the code" meaning "your code" I suppose.

 
 The other part, is that I looked at Oren patchset, found quite a lot of
 suspicious, broken and unclean places and decided that it'd be faster
 to start from scratch because sending patches will overhaul like 85% of
 the code.

So you actually took the time to read and review. And then you spent
even more time calculating this number !  Feedback appreciated.

If you looked closely you would have seen that we do address your
concerns over time.

 
 One example, is why CKPT_HDR_CPU and CKPT_RESTART_BLOCK exist at all?
 Should objects in image be only what sharable objects are in kernel
 (except VMAs, pages and possibly file descriptors)? pt_regs don't exist
 by themselves after all.

A good reason to break it into small pieces is for ease of maintenance
and debugging, as well as in the future easier transition between
incompatible kernel versions. I think it's better than a few-pages-long
single struct. And it encourages more naming of things.

But ... I'm confused ... is this your big concern ?  Oh well, if
that's what stands in your way, we could even rework that (~1.3% of
the code ? I reckon...).

 
 And since you guys showed that just idea of in-kernel checkpointing is not
 rejected outright, it doesn't mean that you can drag every single idea too.
 Because history shows, that once something (especially user-visible,
 like restart syscall semantics) is in kernel it's nearly impossible
 to cut it out, so it's very-very important to get it right from the very
 beginning.
 

Yes. Let's indeed talk about how to get 

[Devel] Re: [RFC v16][PATCH 19/43] c/r: external checkpoint of a task other than ourself

2009-05-27 Thread Alexey Dobriyan
On Wed, May 27, 2009 at 01:32:45PM -0400, Oren Laadan wrote:
 Now we can do external checkpoint, i.e. act on another task.

 +static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
 +{
 +	if (t->state == TASK_DEAD) {
 +		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
 +		return -EAGAIN;
 +	}
 +
 +	if (!ptrace_may_access(t, PTRACE_MODE_READ)) {
 +		__ckpt_write_err(ctx, "access to task %d (%s) denied",
 +				 task_pid_vnr(t), t->comm);
 +		return -EPERM;
 +	}
 +
 +	/* verify that the task is frozen (unless self) */
 +	if (t != current && !frozen(t)) {
 +		__ckpt_write_err(ctx, "task %d (%s) is not frozen",
 +				 task_pid_vnr(t), t->comm);
 +		return -EBUSY;
 +	}
 +
 +	/* FIX: add support for ptraced tasks */
 +	if (task_ptrace(t)) {
 +		__ckpt_write_err(ctx, "task %d (%s) is ptraced",
 +				 task_pid_vnr(t), t->comm);
 +		return -EBUSY;
 +	}
 +
 +	return 0;
 +}
 +
 +static int get_container(struct ckpt_ctx *ctx, pid_t pid)
 +{
 +	struct task_struct *task = NULL;
 +	struct nsproxy *nsproxy = NULL;
 +	int ret;
 +
 +	ctx->root_pid = pid;
 +
 +	read_lock(&tasklist_lock);
 +	task = find_task_by_vpid(pid);
 +	if (task)
 +		get_task_struct(task);
 +	read_unlock(&tasklist_lock);
 +
 +	if (!task)
 +		return -ESRCH;
 +
 +	ret = may_checkpoint_task(ctx, task);
 +	if (ret) {
 +		ckpt_write_err(ctx, NULL);
 +		put_task_struct(task);
 +		return ret;
 +	}
 +
 +	rcu_read_lock();
 +	nsproxy = task_nsproxy(task);
 +	get_nsproxy(nsproxy);

Will oops if init is multi-threaded and thread group leader exited
(nsproxy = NULL). I need to think what to do, too.

 +	rcu_read_unlock();
 +
 +	ctx->root_task = task;
 +	ctx->root_nsproxy = nsproxy;
 +
 +	return 0;
 +}


[Devel] Re: [RFC v16][PATCH 23/43] c/r: restart multiple processes

2009-05-27 Thread Oren Laadan


Alexey Dobriyan wrote:
 On Wed, May 27, 2009 at 01:32:49PM -0400, Oren Laadan wrote:
 Restarting of multiple processes expects all restarting tasks to call
 sys_restart(). Once inside the system call, each task will restart
 itself in the same order that they were saved. The internals of the
 syscall will take care of in-kernel synchronization between tasks.

 This patch does _not_ create the task tree in the kernel. Instead it
 assumes that all tasks are created in some way and then invoke the
 restart syscall. You can use the userspace mktree.c program to do
 that.

 The init task (*) has a special role: it allocates the restart context
 (ctx), and coordinates the operation. In particular, it first waits
 until all participating tasks enter the kernel, and provides them the
 common restart context. Once everyone is ready, it begins to restart
 itself.

 In contrast, the other tasks enter the kernel, locate the init task (*)
 and grab its restart context, and then wait for their turn to restore.

 When a task (init or not) completes its restart, it hands the control
 over to the next in line, by waking that task.

 An array of pids (the one saved during the checkpoint) is used to
 synchronize the operation. The first task in the array is the init
 task (*). The restart context (ctx) maintains a current position in
 the array, which indicates which task is currently active. Once the
 currently active task completes its own restart, it increments that
 position and wakes up the next task.

 Restart assumes that userspace provides meaningful data, otherwise
 it's garbage-in-garbage-out. In this case, the syscall may block
 indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
 otherwise kill the stray restarting tasks.

 In terms of security, restart runs as the user that invokes it, so it
 will not allow a user to do more than is otherwise permitted by the
 usual system semantics and policy.

 Currently we ignore threads and zombies
 
 Let's discuss threads and zombies.
 
 1. Will a zombie end up in an image?

Zombies will be mentioned in the hierarchy description, and will
have very little state saved (e.g. exit status, parent).

 2. If yes, how it will be restored. Will it be forked, call restart(2)
and then somehow zombified inside kernel?

(not part of this patchset, but soon will be added to ckpt-v16-dev)
Zombie will be restarted as a normal process, will restore bare
minimum needed, and will call do_exit(). It will have to ensure
that there are no side effects on (=signals to) parent/children.

 3. How thread group will be restored, will every thread be CLONE_THREAD'ed?
What to do with exited thread group leaders, will they be forked, then
CLONE_THREAD thread group?

First, user space creates the entire tree hierarchy, including
zombies. Then each task calls sys_restart(). Inside, they are
coordinated to restore their state one after the other. So that
eventually, the to-be-zombies, be it a thread-group-leader or not,
will call do_exit() and zombify themselves.

Take a look at mktree.c (part of the user tools). It's already done
there using CLONE_THREAD.  The reason I wrote that it isn't supported
well is because I think that in full-container mode the link count
won't work correctly. Other than that, threads should work as long
as you don't play with partial sharing (e.g. only CLONE_FS).

Oren.
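
For readers who want to see the hand-off scheme from the quoted patch
description in isolation, here is a rough userland sketch using pthreads
instead of kernel tasks. It is not the patchset's code: struct restart_ctx,
restart_task() and run_restart() are hypothetical names, NTASKS is fixed for
brevity, and "restoring" is reduced to recording the pid.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NTASKS 4

/* Hypothetical stand-in for the shared restart context (ctx). */
struct restart_ctx {
	pthread_mutex_t lock;
	pthread_cond_t turn;
	int pids[NTASKS];	/* order saved at checkpoint; pids[0] is "init" */
	int active;		/* index of the task allowed to restore now */
	int order[NTASKS];	/* order in which tasks actually restored */
};

struct task_arg {
	struct restart_ctx *ctx;
	int pid;
};

static void *restart_task(void *p)
{
	struct task_arg *arg = p;
	struct restart_ctx *ctx = arg->ctx;

	pthread_mutex_lock(&ctx->lock);
	/* Sleep until the current position in the array names this task. */
	while (ctx->active < NTASKS && ctx->pids[ctx->active] != arg->pid)
		pthread_cond_wait(&ctx->turn, &ctx->lock);
	if (ctx->active < NTASKS) {
		ctx->order[ctx->active] = arg->pid;	/* "restore" own state */
		ctx->active++;				/* advance the position */
		pthread_cond_broadcast(&ctx->turn);	/* wake the next in line */
	}
	pthread_mutex_unlock(&ctx->lock);
	return NULL;
}

/* Run NTASKS threads; returns 0 on success, -1 if a thread could not start. */
static int run_restart(const int *pids, int *order_out)
{
	struct restart_ctx ctx;
	struct task_arg args[NTASKS];
	pthread_t tids[NTASKS];
	int i, ret = 0;

	pthread_mutex_init(&ctx.lock, NULL);
	pthread_cond_init(&ctx.turn, NULL);
	ctx.active = 0;
	for (i = 0; i < NTASKS; i++) {
		ctx.pids[i] = pids[i];
		ctx.order[i] = 0;
	}

	/* Start in reverse: creation order must not affect restore order. */
	for (i = NTASKS - 1; i >= 0; i--) {
		args[i].ctx = &ctx;
		args[i].pid = pids[i];
		if (pthread_create(&tids[i], NULL, restart_task, &args[i])) {
			/* Release the waiters so they can be joined. */
			pthread_mutex_lock(&ctx.lock);
			ctx.active = NTASKS;
			pthread_cond_broadcast(&ctx.turn);
			pthread_mutex_unlock(&ctx.lock);
			ret = -1;
			break;
		}
	}

	/* Threads i+1 .. NTASKS-1 were created; wait for all of them. */
	for (i = i + 1; i < NTASKS; i++)
		pthread_join(tids[i], NULL);

	for (i = 0; i < NTASKS; i++)
		order_out[i] = ctx.order[i];
	pthread_cond_destroy(&ctx.turn);
	pthread_mutex_destroy(&ctx.lock);
	return ret;
}
```

Calling run_restart() with pids {1, 7, 3, 9} should fill the output array in
that same order even though the threads are created last-to-first. Unlike the
in-kernel version, the waiters here are not interruptible.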




[Devel] Re: [PATCH 18/38] C/R: core stuff

2009-05-27 Thread Alexey Dobriyan
On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
 Alexey Dobriyan wrote:
  On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
  Quoting Alexey Dobriyan (adobri...@gmail.com):
  Introduction
  
  Checkpoint/restart (C/R from now) allows to dump group of processes to disk
  for various reasons like saving process state in case of box failure or
  restoration of group of processes on another or same machine later.
 
  Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
  and dump more or less raw pages, proposed C/R doesn't require hypervisor.
  For that C/R code needs to know about all little and big intimate kernel details.
 
  The good thing is that not all details needs to be serialized and saved
  like, say, readahead state. The bad things is still quite a few things
  need to be.
  Hi Alexey,
 
  the last time you posted this, I went through and tried to discern the
  meaningful differences between yours and Oren's patchsets.  Then I sent some
  patches to Oren to make his set configurable to act more like yours.  And Oren
  took them!  But now you resend this patchset with no real changelog, no
  acknowledgment that Oren's set even exists
  
  Is this a requirement? Everybody following topic already knows about
  Oren's patchset.
 
 Some people do ack other people's work. See for example patches #1
 and #24 in my recent post. You're welcome.
 
  
  - or is much farther along and pretty widely reviewed and tested (which is
  only because he started earlier and, when we asked for your counterpatches
  at an earlier stage, you would never reply) - or, most importantly, what
  it is that you think your patchset does that his does not and cannot.
  
  There are differences. And they're not small like you're trying to describe
  but pretty big compared the scale of the problem.
 
 I've asked before, and I repeat now: can you enumerate these big
 scary differences that make it such a big problem ?
 
 So far, we identified two main "design" issues -

Why in ""? Yes, they are high-level design issues.

 1) Whether or not allow c/r of sub-container (partial hierarchy)
 
 2) Creation of restarting process hierarchy in kernel or in userspace
 
 As for #1, you are the _only_ one who advocates restricting c/r to
 a full container only. I guess you have your reasons, but I'm unsure
 what they may be.

The reason is that checkpointing half-frozen, half-live container is
essentially equivalent to live container which adds much complexity
to code fundamentally preventing kernel from taking coherent snapshot.

In such situations kernel will do its job badly.

Manpage will be filled with strings like "if $FOO is shared then $BAR is
not guaranteed".

What to do if user simply doesn't know if container is bounded?
Checkpoint and to hell with consequences?

If two tasks share mm_struct you can't even detect that pages you dump
aren't filled with garbage meanwhile from second task.

If two tasks share mm_struct, other task can issue AIO indefinitely
preventing from taking even coherent filesystem snapshot.

That's why I raise this issue again to hear from people what they think
and these people shouldn't be containers and C/R people, because the
latter already made up their minds.

This is super-important issue to get right from the beginning.

 On the other hand, there has been a handful of use-cases and opinions
 in favor of allowing both capabilities to co-exist. Not to mention
 that nearly no additional code is necessary, on the contrary.
 
 As for #2, you didn't even bother to reply to the discussion that I
 had started about it. This decision is important to allow future
 flexibility of the mechanism, and to address the needs of several
 potential users, as seen in that discussion and others. Here, too,
 you are the _only_ one that advocates that direction.

Are you going to fork to-become-zombies, make them call restart(2) and
zombify?

 And the funniest thing -- *both* decisions can be *easily* overturned
 in my patchset. In fact, regarding #2 - either way can be easily done
 in it.
 
 So I wonder, what are the big issues that bother you so much ?
 if there is a will, there is a way.

Oren, don't you really understand?

Users want millions of things, but every thing has price.

Some think hardlinking of directories should be implemented. You can ask
VFS guys how hard would it be and how hard would it be to do reliably
without races/deadlocks et al.


[Devel] Re: [PATCH 18/38] C/R: core stuff

2009-05-27 Thread Alexey Dobriyan
On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
  Now here goes second version, with prefixes fixed (kstate_) like Ingo
  suggested and so Linus could look at the code and with C/R code moved
  close to usual code and with more checks added (which you should have
  already!) to not restore null selector in %cs for example.
 
 It is far from perfect. In fact, it's even clearly commented as such,
 and exactly there.  It would have been helpful if you pointed that
 out in a review, or even - god forbid - sent a patch to improve it.

This is ridiculous.

First, you declare that restart(2) should be allowed for anyone(!),
and then send a patchset for inclusion in -mm which doesn't even check
if selectors are right!

 But it works, and it lets people play with a more-than-a-toy
 implementation and provide us with important feedback. Oh, and by
 the way, it doesn't require that people use containers to try it out.

Setting up container for playing is not hard:

CLONE_NEWUTS=y
CLONE_NEWIPC=y
CLONE_NEWPID=y
CLONE_NEWUSER=y
CLONE_NEWNET=y

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>

#define CLONE_NEWNS	0x00020000
#define CLONE_NEWUTS	0x04000000
#define CLONE_NEWIPC	0x08000000
#define CLONE_NEWUSER	0x10000000
#define CLONE_NEWPID	0x20000000
#define CLONE_NEWNET	0x40000000

static int fn(void *_argv)
{
	char **argv = (char **)_argv;

	setsid();
	setpgid(getpid(), getpid());

	execve(argv[0], argv, __environ);
	return 1;
}

int main(int argc, char *argv[])
{
	unsigned long flags = 0;
	int status;
	pid_t pid;
	void *p;

	flags |= CLONE_NEWNS;
	flags |= CLONE_NEWUTS;
	flags |= CLONE_NEWIPC;
	flags |= CLONE_NEWUSER;
	flags |= CLONE_NEWPID;
	flags |= CLONE_NEWNET;

	p = malloc(4 * 4096);
	if (!p)
		return 1;
	argv++;
	pid = clone(fn, p + 4 * 4096, flags, (void *)argv);
	fprintf(stderr, "pid = %d\n", pid);
	if (pid == -1)
		return 1;
	waitpid(pid, &status, __WALL);
	return 0;
}


[Devel] Re: [RFC v16][PATCH 19/43] c/r: external checkpoint of a task other than ourself

2009-05-27 Thread Oren Laadan
On Thu, 28 May 2009, Alexey Dobriyan wrote:

 On Wed, May 27, 2009 at 01:32:45PM -0400, Oren Laadan wrote:
  Now we can do external checkpoint, i.e. act on another task.
 
  +static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
  +{
  +	if (t->state == TASK_DEAD) {
  +		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
  +		return -EAGAIN;
  +	}
  +
  +	if (!ptrace_may_access(t, PTRACE_MODE_READ)) {
  +		__ckpt_write_err(ctx, "access to task %d (%s) denied",
  +				 task_pid_vnr(t), t->comm);
  +		return -EPERM;
  +	}
  +
  +	/* verify that the task is frozen (unless self) */
  +	if (t != current && !frozen(t)) {
  +		__ckpt_write_err(ctx, "task %d (%s) is not frozen",
  +				 task_pid_vnr(t), t->comm);
  +		return -EBUSY;
  +	}
  +
  +	/* FIX: add support for ptraced tasks */
  +	if (task_ptrace(t)) {
  +		__ckpt_write_err(ctx, "task %d (%s) is ptraced",
  +				 task_pid_vnr(t), t->comm);
  +		return -EBUSY;
  +	}
  +
  +	return 0;
  +}
  +
  +static int get_container(struct ckpt_ctx *ctx, pid_t pid)
  +{
  +	struct task_struct *task = NULL;
  +	struct nsproxy *nsproxy = NULL;
  +	int ret;
  +
  +	ctx->root_pid = pid;
  +
  +	read_lock(&tasklist_lock);
  +	task = find_task_by_vpid(pid);
  +	if (task)
  +		get_task_struct(task);
  +	read_unlock(&tasklist_lock);
  +
  +	if (!task)
  +		return -ESRCH;
  +
  +	ret = may_checkpoint_task(ctx, task);
  +	if (ret) {
  +		ckpt_write_err(ctx, NULL);
  +		put_task_struct(task);
  +		return ret;
  +	}
  +
  +	rcu_read_lock();
  +	nsproxy = task_nsproxy(task);
  +	get_nsproxy(nsproxy);
 
 Will oops if init is multi-threaded and thread group leader exited
 (nsproxy = NULL). I need to think what to do, too.


Good catch. Since all threads share the same nsproxy (except those
that have exited.. duh) we can test for this case, and get the nsproxy
from any of the other threads, something like this (untested):

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index afc7300..b303876 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -522,9 +522,33 @@ static int get_container(struct ckpt_ctx *ctx, pid_t pid)
 
 	rcu_read_lock();
 	nsproxy = task_nsproxy(task);
-	get_nsproxy(nsproxy);
+	if (nsproxy)
+		get_nsproxy(nsproxy);
 	rcu_read_unlock();
 
+	/*
+	 * If we hit a zombie thread-group-leader, nsproxy will be NULL,
+	 * and we instead grab it from one of the other threads.
+	 */
+	if (!nsproxy) {
+		struct task_struct *p = next_thread(task);
+
+		BUG_ON(task->state != TASK_DEAD);
+		read_lock(&tasklist_lock);
+		while (p != task && !task_nsproxy(p))
+			p = next_thread(p);
+		nsproxy = task_nsproxy(p);
+		if (nsproxy)
+			get_nsproxy(nsproxy);
+		read_unlock(&tasklist_lock);
+	}
+
+	/* still not ... too bad ... */
+	if (!nsproxy) {
+		put_task_struct(task);
+		return -ESRCH;
+	}
+
 	ctx->root_task = task;
 	ctx->root_nsproxy = nsproxy;
 	ctx->root_init = is_container_init(task);
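
The fallback in the patch above boils down to walking a circular thread list
until some member still has an nsproxy. Here is a rough userland sketch of
just that walk. struct ns, struct thread and group_nsproxy() are hypothetical
stand-ins invented for this example, and the locking that the kernel version
needs (tasklist_lock, RCU) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland stand-ins for the kernel structures. */
struct ns {
	int refcount;
};

struct thread {
	struct ns *nsproxy;	/* NULL once the thread has exited */
	struct thread *next;	/* circular list of the thread group */
};

/*
 * Return a referenced nsproxy for the group that 'leader' belongs to,
 * trying the leader first and then every other thread exactly once.
 * Returns NULL if every thread in the group has already exited.
 */
static struct ns *group_nsproxy(struct thread *leader)
{
	struct thread *t = leader;

	do {
		if (t->nsproxy) {
			t->nsproxy->refcount++;	/* plays the role of get_nsproxy() */
			return t->nsproxy;
		}
		t = t->next;
	} while (t != leader);

	return NULL;
}
```

The do/while form visits the leader once and stops after a full lap, so a
group whose threads have all exited yields NULL rather than looping forever.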



[Devel] Re: [PATCH 18/38] C/R: core stuff

2009-05-27 Thread Andrew Morton
On Thu, 28 May 2009 02:17:53 +0400
Alexey Dobriyan adobri...@gmail.com wrote:

  1) Whether or not allow c/r of sub-container (partial hierarchy)
  
  2) Creation of restarting process hierarchy in kernel or in userspace
  
  As for #1, you are the _only_ one who advocates restricting c/r to
  a full container only. I guess you have your reasons, but I'm unsure
  what they may be.
 
 The reason is that checkpointing half-frozen, half-live container is
 essentially equivalent to live container which adds much complexity
 to code fundamentally preventing kernel from taking coherent snapshot.
 
 In such situations kernel will do its job badly.
 
 Manpage will be filled with strings like "if $FOO is shared then $BAR is
 not guaranteed".
 
 What to do if user simply doesn't know if container is bounded?
 Checkpoint and to hell with consequences?
 
 If two tasks share mm_struct you can't even detect that pages you dump
 aren't filled with garbage meanwhile from second task.
 
 If two tasks share mm_struct, other task can issue AIO indefinitely
 preventing from taking even coherent filesystem snapshot.
 
 That's why I raise this issue again to hear from people what they think
 and these people shouldn't be containers and C/R people, because the
 latter already made up their minds.
 
 This is super-important issue to get right from the beginning.

<pipes up>

yeah, checkpointing a partial hierarchy at this stage sounds like
overreach.  Get full-container working usably first, think about
sub-containers in version 2.

<pipes down again>


[Devel] Re: [PATCH 18/38] C/R: core stuff

2009-05-27 Thread Oren Laadan


Alexey Dobriyan wrote:
 On Wed, May 27, 2009 at 04:56:27PM -0400, Oren Laadan wrote:
 Alexey Dobriyan wrote:
 On Tue, May 26, 2009 at 08:16:44AM -0500, Serge E. Hallyn wrote:
 Quoting Alexey Dobriyan (adobri...@gmail.com):
 Introduction
 
 Checkpoint/restart (C/R from now) allows to dump group of processes to disk
 for various reasons like saving process state in case of box failure or
 restoration of group of processes on another or same machine later.

 Unlike, let's say, hypervisor C/R style which only needs to freeze guest kernel
 and dump more or less raw pages, proposed C/R doesn't require hypervisor.
 For that C/R code needs to know about all little and big intimate kernel details.

 The good thing is that not all details needs to be serialized and saved
 like, say, readahead state. The bad things is still quite a few things
 need to be.
 Hi Alexey,

 the last time you posted this, I went through and tried to discern the
 meaningful differences between yours and Oren's patchsets.  Then I sent some
 patches to Oren to make his set configurable to act more like yours.  And Oren
 took them!  But now you resend this patchset with no real changelog, no
 acknowledgment that Oren's set even exists
 Is this a requirement? Everybody following topic already knows about
 Oren's patchset.
 Some people do ack other people's work. See for example patches #1
 and #24 in my recent post. You're welcome.

 - or is much farther along and pretty widely reviewed and tested (which is
 only because he started earlier and, when we asked for your counterpatches
 at an earlier stage, you would never reply) - or, most importantly, what
 it is that you think your patchset does that his does not and cannot.
 There are differences. And they're not small like you're trying to describe
 but pretty big compared the scale of the problem.
 I've asked before, and I repeat now: can you enumerate these big
 scary differences that make it such a big problem ?

 So far, we identified two main "design" issues -
 
 Why in ""? Yes, they are high-level design issues.
 

In quotes, because I argued further on that, although my patchset
takes a stand on both issues, it can be easily reverted _within_
that patchset. Moreover, I argue that they can co-exist.

 1) Whether or not allow c/r of sub-container (partial hierarchy)

 2) Creation of restarting process hierarchy in kernel or in userspace

 As for #1, you are the _only_ one who advocates restricting c/r to
 a full container only. I guess you have your reasons, but I'm unsure
 what they may be.
 
 The reason is that checkpointing half-frozen, half-live container is
 essentially equivalent to live container which adds much complexity
 to code fundamentally preventing kernel from taking coherent snapshot.
 
 In such situations kernel will do its job badly.

In such a situation the kernel will do a bad job if the user is asking
for a bad job. Just like checkpointing without snapshotting the
file system and expecting it to always work.

But if the user is a bit more careful (and even then, not that much),
she can enjoy the wonderful benefits of c/r without the wonderful
benefits of containers.

If useful, it's easy to pass a flag to checkpoint() that will ask
to enforce, say, shared memory leaks but not nsproxy or file leaks.

In fact, even shared memory leaks may be useful for some users (e.g.
what the guys from kerlabs pointed out).

 
 Manpage will be filled with strings like "if $FOO is shared then $BAR is
 not guaranteed".
 
 What to do if user simply doesn't know if container is bounded?
 Checkpoint and to hell with consequences?
 
 If two tasks share mm_struct you can't even detect that pages you dump
 aren't filled with garbage meanwhile from second task.
 
 If two tasks share mm_struct, other task can issue AIO indefinitely
 preventing from taking even coherent filesystem snapshot.
 
 That's why I raise this issue again to hear from people what they think
 and these people shouldn't be containers and C/R people, because the
 latter already made up their minds.

Lol .. and disagreement persists among us :)

And indeed, I have heard and seen already a few opinions in favor
of permitting non-container checkpoint. From potential users (not
c/r people).

 
 This is super-important issue to get right from the beginning.
 
 On the other hand, there has been a handful of use-cases and opinions
 in favor of allowing both capabilities to co-exist. Not to mention
 that nearly no additional code is necessary, on the contrary.

 As for #2, you didn't even bother to reply to the discussion that I
 had started about it. This decision is important to allow future
 flexibility of the mechanism, and to address the needs of several
 potential users, as seen in that discussion and others. Here, too,
 you are the _only_ one that advocates that direction.
 
 Are you going to fork to-become-zombies, make them call restart(2) and
 zombify?

Yes.

 
 And the funniest thing -- *both* decisions 

[Devel] Re: [PATCH 7/7] [PATCH] Define clone_with_pids syscall

2009-05-27 Thread Sukadev Bhattiprolu
|  +   if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set)))
|  +   return ERR_PTR(-EFAULT);
|  +
|  +   size = pid_set.num_pids * sizeof(pid_t);
| 
| ...either test pid_set.num_pids < 0 (and give -EINVAL),
| or...

Good point. I now check for num_pids < 0 and treat num_pids == 0 as
normal clone().

While addressing this I realized I had a lot of arch-independent code
in arch/x86/kernel/process_32.c. I have now moved this common code to
kernel/fork.c. It's a non-trivial code move, so I need new reviews/acks
from you and Serge for at least patches 6 and 7.

Sukadev
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC][PATCH 1/7][v2] Factor out code to allocate pidmap page

2009-05-27 Thread Sukadev Bhattiprolu

From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:39 -0700
Subject: [RFC][PATCH 1/7][v2] Factor out code to allocate pidmap page

To implement support for clone_with_pids() system call we would
need to allocate pidmap page in more than one place. Move this
code to a new function alloc_pidmap_page().

Changelog[v2]:
- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
  -ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge Hallyn se...@us.ibm.com
Reviewed-by: Oren Laadan or...@cs.columbia.edu
---
 kernel/pid.c |   46 ++
 1 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index b2e5f78..9ff33cc 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,9 +122,34 @@ static void free_pidmap(struct upid *upid)
atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+   void *page;
+
+   if (likely(map->page))
+   return 0;
+
+   page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+   /*
+* Free the page if someone raced with us installing it:
+*/
+   spin_lock_irq(&pidmap_lock);
+   if (map->page)
+   kfree(page);
+   else
+   map->page = page;
+   spin_unlock_irq(&pidmap_lock);
+
+   if (unlikely(!map->page))
+   return -ENOMEM;
+
+   return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
-   int i, offset, max_scan, pid, last = pid_ns->last_pid;
+   int i, rc, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;
 
pid = last + 1;
@@ -134,21 +159,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
-   if (unlikely(!map->page)) {
-   void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-   /*
-* Free the page if someone raced with us
-* installing it:
-*/
-   spin_lock_irq(&pidmap_lock);
-   if (map->page)
-   kfree(page);
-   else
-   map->page = page;
-   spin_unlock_irq(&pidmap_lock);
-   if (unlikely(!map->page))
-   break;
-   }
+   rc = alloc_pidmap_page(map);
+   if (rc)
+   break;
+
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {
-- 
1.5.2.5
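
For readers who want to try the "allocate outside the lock, install under
the lock, free on a lost race" pattern that alloc_pidmap_page() factors out,
here is a minimal userspace sketch. It is not the kernel code: calloc() and a
pthread mutex stand in for kzalloc() and pidmap_lock, and every name ending
in _sketch is hypothetical. The unlocked fast-path read mirrors the patch but
would need an atomic load to be strictly race-free in portable userspace C.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096	/* hypothetical stand-in for PAGE_SIZE */

struct pidmap_sketch {
	void *page;		/* NULL until a page has been installed */
};

static pthread_mutex_t pidmap_lock_sketch = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 if map->page is usable afterwards, -ENOMEM otherwise. */
static int alloc_pidmap_page_sketch(struct pidmap_sketch *map)
{
	void *page;

	assert(map != NULL);

	if (map->page)		/* fast path: page already installed */
		return 0;

	/* Allocate before taking the lock, since allocation may block. */
	page = calloc(1, SKETCH_PAGE_SIZE);

	pthread_mutex_lock(&pidmap_lock_sketch);
	if (map->page)
		free(page);	/* lost the race; free(NULL) is a no-op */
	else
		map->page = page;	/* may still be NULL if calloc() failed */
	pthread_mutex_unlock(&pidmap_lock_sketch);

	return map->page ? 0 : -ENOMEM;
}
```

Calling the helper twice on the same map leaves the first page in place, so
callers can invoke it unconditionally before touching map->page.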



[Devel] [RFC][PATCH 2/7][v2] Have alloc_pidmap() return actual error code

2009-05-27 Thread Sukadev Bhattiprolu

From 991fb474b055d36c4516cf7f79a247b7d79819ae Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:40 -0700
Subject: [RFC][PATCH 2/7][v2] Have alloc_pidmap() return actual error code

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge Hallyn se...@us.ibm.com
Reviewed-by: Oren Laadan or...@cs.columbia.edu
---
 kernel/fork.c |5 +++--
 kernel/pid.c  |9 ++---
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index b9e2edd..f8411a8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1119,10 +1119,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
 
if (pid != &init_struct_pid) {
-   retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns);
-   if (!pid)
+   if (IS_ERR(pid)) {
+   retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
+   }
 
if (clone_flags & CLONE_NEWPID) {
retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index 9ff33cc..b2d6a19 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -158,6 +158,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
offset = pid & BITS_PER_PAGE_MASK;
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+   rc = -EAGAIN;
for (i = 0; i <= max_scan; ++i) {
rc = alloc_pidmap_page(map);
if (rc)
@@ -188,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
} else {
map = &pid_ns->pidmap[0];
offset = RESERVED_PIDS;
-   if (unlikely(last == offset))
+   if (unlikely(last == offset)) {
+   rc = -EAGAIN;
break;
+   }
}
pid = mk_pid(pid_ns, map, offset);
}
-   return -1;
+   return rc;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -298,7 +301,7 @@ out_free:
free_pidmap(pid->numbers + i);
 
kmem_cache_free(ns->pid_cachep, pid);
-   pid = NULL;
+   pid = ERR_PTR(nr);
goto out;
 }
 
-- 
1.5.2.5
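
As background for the IS_ERR()/PTR_ERR()/ERR_PTR() conversion above, the
following is a small userspace re-implementation of the idea: a negative
errno is encoded in the top page of the pointer range, so one return value
can carry either a valid pointer or an error code. The *_sketch names and
SKETCH_MAX_ERRNO are assumptions made for illustration; the kernel's own
helpers live in include/linux/err.h.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095	/* assumed upper bound on errno values */

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr_sketch(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an encoded pointer. */
static inline long ptr_err_sketch(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the last SKETCH_MAX_ERRNO bytes of the range. */
static inline int is_err_sketch(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}
```

Note that NULL is not an error pointer under this check, so a function whose
callers test only for the error encoding should not return NULL on any of
its failure paths.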



[Devel] [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap()

2009-05-27 Thread Sukadev Bhattiprolu

From a1fdec1036a952359d02a7c667d126bd2fff6804 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:41 -0700
Subject: [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap()

With support for setting a specific pid number for a process,
alloc_pidmap() will need a 'target_pid' parameter.

Changelog[v2]:
- (Serge Hallyn) Check for 'pid < 0' in set_pidmap(). (Code
  actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 kernel/pid.c |   28 ++--
 1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index b2d6a19..b44dd21 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,11 +147,35 @@ static int alloc_pidmap_page(struct pidmap *map)
return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+   int offset;
+   struct pidmap *map;
+
+   if (pid <= 0 || pid >= pid_max)
+   return -EINVAL;
+
+   offset = pid & BITS_PER_PAGE_MASK;
+   map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+
+   if (alloc_pidmap_page(map))
+   return -ENOMEM;
+
+   if (test_and_set_bit(offset, map->page))
+   return -EBUSY;
+
+   atomic_dec(&map->nr_free);
+   return pid;
+}
+
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
 {
int i, rc, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;
 
+   if (target_pid)
+   return set_pidmap(pid_ns, target_pid);
+
pid = last + 1;
if (pid >= pid_max)
pid = RESERVED_PIDS;
@@ -270,7 +294,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
tmp = ns;
for (i = ns->level; i >= 0; i--) {
-   nr = alloc_pidmap(tmp);
+   nr = alloc_pidmap(tmp, 0);
if (nr < 0)
goto out_free;
 
-- 
1.5.2.5
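
The contract of set_pidmap() (range check, then test-and-set of one bit, with
-EINVAL / -EBUSY / the pid itself as results) can be tried out in userspace
with the sketch below. It is a simplification under stated assumptions: a
single fixed-size bitmap instead of per-page pidmaps, plain non-atomic bit
operations instead of test_and_set_bit() and atomic_dec(), and hypothetical
names (SKETCH_PID_MAX, set_pid_sketch) that do not exist in the kernel.

```c
#include <errno.h>
#include <limits.h>

#define SKETCH_PID_MAX		1024	/* hypothetical pid_max */
#define SKETCH_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

struct pidmap_bits_sketch {
	unsigned long bits[SKETCH_PID_MAX / (sizeof(unsigned long) * CHAR_BIT)];
	int nr_free;		/* number of pids still unclaimed */
};

/*
 * Claim one specific pid. Returns the pid on success, -EINVAL if it is
 * out of range, -EBUSY if it is already taken. Not thread-safe.
 */
static int set_pid_sketch(struct pidmap_bits_sketch *map, int pid)
{
	unsigned long *word;
	unsigned long mask;

	if (pid <= 0 || pid >= SKETCH_PID_MAX)
		return -EINVAL;

	word = &map->bits[pid / SKETCH_BITS_PER_WORD];
	mask = 1UL << (pid % SKETCH_BITS_PER_WORD);

	if (*word & mask)
		return -EBUSY;

	*word |= mask;
	map->nr_free--;
	return pid;
}
```

Returning the pid on success lets the caller treat the result exactly like
the scanning allocator's return value: negative means error, positive is the
allocated number.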



[Devel] [RFC][PATCH 4/7][v2] Add target_pids parameter to alloc_pid()

2009-05-27 Thread Sukadev Bhattiprolu

From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:42 -0700
Subject: [RFC][PATCH 4/7][v2] Add target_pids parameter to alloc_pid()

With support for setting specific pid numbers, alloc_pid() would need
to take a set of 'target_pids' which gives the user-specified pids. Add
this parameter to alloc_pid(), but leave it set to NULL for now. The
parameter will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge Hallyn se...@us.ibm.com
---
 include/linux/pid.h |2 +-
 kernel/fork.c   |3 ++-
 kernel/pid.c|9 +++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index f8411a8..d2d69d3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -954,6 +954,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
+   pid_t *target_pids = NULL;
 
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1119,7 +1120,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
 
if (pid != &init_struct_pid) {
-   pid = alloc_pid(p->nsproxy->pid_ns);
+   pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index b44dd21..090b221 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -280,13 +280,14 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
+   int tpid;
 
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
@@ -294,7 +295,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
tmp = ns;
for (i = ns->level; i >= 0; i--) {
-   nr = alloc_pidmap(tmp, 0);
+   tpid = 0;
+   if (target_pids)
+   tpid = target_pids[i];
+
+   nr = alloc_pidmap(tmp, tpid);
if (nr < 0)
goto out_free;
 
-- 
1.5.2.5
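
To illustrate how target_pids[] is indexed by namespace level in the loop
above, here is a toy userspace model. It assumes each namespace is reduced to
a "next free pid" counter and performs no collision checking; struct
ns_sketch, pick_pid_sketch() and alloc_pid_sketch() are hypothetical names
for this sketch only.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-namespace state: just a "next free pid" counter. */
struct ns_sketch {
	pid_t next_pid;
};

/* A target of 0 means "let the namespace pick the next number". */
static pid_t pick_pid_sketch(struct ns_sketch *ns, pid_t target)
{
	if (target)
		return target;	/* no in-use check in this toy model */
	return ns->next_pid++;
}

/*
 * Walk from the innermost namespace (index 'level') out to the initial
 * one (index 0), the same direction as the loop in alloc_pid().
 * A NULL target_pids behaves like an all-zero array.
 */
static void alloc_pid_sketch(struct ns_sketch *levels, int level,
			     const pid_t *target_pids, pid_t *out)
{
	int i;

	for (i = level; i >= 0; i--) {
		pid_t tpid = target_pids ? target_pids[i] : 0;

		out[i] = pick_pid_sketch(&levels[i], tpid);
	}
}
```

With targets { 0, 77, 99 } and a process two levels deep, index 0 is left to
the allocator while levels 1 and 2 receive 77 and 99.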



[Devel] [RFC][PATCH 6/7][v2] Define do_fork_with_pids()

2009-05-27 Thread Sukadev Bhattiprolu

From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:44 -0700
Subject: [RFC][PATCH 6/7][v2] Define do_fork_with_pids()

do_fork_with_pids() is the same as do_fork(), except that it takes an
additional parameter, 'pid_set'. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v2]:
- [v1] of this patch had some architecture-independent code in
  arch/x86/kernel/process_32.c.  To facilitate moving this code
  to kernel/fork.c, in the next patch, [v2] of the patch passes
  'struct target_pid_set __user *' to do_fork_with_pids() instead
  of 'pid_t *'.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 include/linux/sched.h |1 +
 include/linux/types.h |5 +
 kernel/fork.c |   16 ++--
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..8468e54 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1995,6 +1995,7 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, struct target_pid_set __user *pid_set);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/include/linux/types.h b/include/linux/types.h
index 5abe354..17ec186 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,11 @@ struct ustat {
char f_fpack[6];
 };
 
+struct target_pid_set {
+   int num_pids;
+   pid_t *target_pids;
+};
+
 #endif /* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 373411e..a16ef7b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1340,12 +1340,13 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
  unsigned long stack_start,
  struct pt_regs *regs,
  unsigned long stack_size,
  int __user *parent_tidptr,
- int __user *child_tidptr)
+ int __user *child_tidptr,
+ struct target_pid_set __user *pid_setp)
 {
struct task_struct *p;
int trace = 0;
@@ -1448,6 +1449,17 @@ long do_fork(unsigned long clone_flags,
return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+ unsigned long stack_start,
+ struct pt_regs *regs,
+ unsigned long stack_size,
+ int __user *parent_tidptr,
+ int __user *child_tidptr)
+{
+   return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+   parent_tidptr, child_tidptr, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.5.2.5



[Devel] [RFC][PATCH 7/7][v2] Define clone_with_pids syscall

2009-05-27 Thread Sukadev Bhattiprolu

From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:45 -0700
Subject: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall

clone_with_pids() is the same as clone(), except that it takes a 'target_pid_set'
parameter which lets the caller choose a specific pid number for the child process
in each of the child process's pid namespaces. This system call would be needed
to implement Checkpoint/Restart (i.e. after a checkpoint, restart a process with
its original pids).

Call clone_with_pids as follows:

pid_t pids[] = { 0, 77, 99 };
struct target_pid_set pid_set;

pid_set.num_pids = sizeof(pids) / sizeof(int);
pid_set.target_pids = pids;

syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel continues to assign a pid for the process in
that namespace. In the above example, pids[0] is 0, meaning the kernel will
assign next available pid to the process in init_pid_ns. But kernel will assign
pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either
77 or 99 are taken, the system call fails with -EBUSY.

If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
the system call fails with -EINVAL.
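
The num_pids rules described in this patch (negative is rejected, zero falls
back to a normal clone(), too many is rejected) can be written down as the
small sketch below. It is only an illustration: check_num_pids_sketch() and
the SKETCH_* constants are hypothetical, and it assumes that "nesting level"
means the zero-based level of the caller's pid namespace, so a process at
level 2 may pass at most 3 pids, as in the { 0, 77, 99 } example.

```c
#include <errno.h>

#define SKETCH_PLAIN_CLONE	0	/* behave like ordinary clone() */
#define SKETCH_USE_PID_SET	1	/* honour the supplied target pids */

/*
 * Classify a num_pids value for a caller whose pid namespace sits at
 * zero-based nesting level 'ns_level'. Returns one of the SKETCH_*
 * values above, or -EINVAL for an unacceptable count.
 */
static int check_num_pids_sketch(int num_pids, int ns_level)
{
	if (num_pids < 0)
		return -EINVAL;
	if (num_pids == 0)
		return SKETCH_PLAIN_CLONE;
	if (num_pids > ns_level + 1)
		return -EINVAL;
	return SKETCH_USE_PID_SET;
}
```

Doing this classification before copying the pid array in keeps the size
computation (num_pids * sizeof(pid_t)) from ever seeing a negative count.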

It's mostly an exploratory patch seeking feedback on the interface.

NOTE:
1. clone_with_pids(), at least for now, needs CAP_SYS_ADMIN to prevent
   misuse of the interface.
   
2. Compared to clone(), clone_with_pids() needs to pass in two more
   pieces of information:

- number of pids in the set
- user buffer containing the list of pids.

   But since clone() already takes 5 parameters, use a 'struct
   target_pid_set'.

TODO:
- Gently tested.
- May need additional sanity checks in do_fork_with_pids().
- Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in
  the namespace is either 1 or 0).

Changelog[v2]:
- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
  'num_pids == 0' (fall back to normal clone()).
- Move arch-independent code (sanity checks and copy-in of target-pids)
  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
- Fixed some compile errors (had fixed these errors earlier in my
  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
---
 arch/x86/include/asm/syscalls.h|1 +
 arch/x86/include/asm/unistd_32.h   |1 +
 arch/x86/kernel/entry_32.S |1 +
 arch/x86/kernel/process_32.c   |   21 +
 arch/x86/kernel/syscall_table_32.S |1 +
 kernel/fork.c  |   81 +++-
 6 files changed, 105 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 7043408..1fdc149 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -31,6 +31,7 @@ asmlinkage int sys_get_thread_area(struct user_desc __user *);
 /* kernel/process_32.c */
 int sys_fork(struct pt_regs *);
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
 int sys_vfork(struct pt_regs *);
 int sys_execve(struct pt_regs *);
 
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..90f906f 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1 332
 #define __NR_preadv333
 #define __NR_pwritev   334
+#define __NR_clone_with_pids   335
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c929add..ee92b0d 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -707,6 +707,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 76f8f84..1efc3de 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -445,6 +445,27 @@ int sys_clone(struct pt_regs *regs)
return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+int sys_clone_with_pids(struct pt_regs *regs)
+{
+   unsigned long clone_flags;
+   unsigned long newsp;
+   int __user *parent_tidptr;
+   int __user *child_tidptr;
+   void __user *upid_setp;
+
+   clone_flags = regs->bx;
+   newsp = regs->cx;
+   parent_tidptr = (int __user *)regs->dx;
+   child_tidptr = (int __user *)regs->di;
+   upid_setp = (void __user *)regs->bp;
+
+   if (!newsp)
+   newsp = regs->sp;
+
+   return do_fork_with_pids(clone_flags, newsp, regs, 

[Devel] [RFC][PATCH 5/7][v2] Add target_pids parameter to copy_process()

2009-05-27 Thread Sukadev Bhattiprolu

From 432bc68b622661cd4a28379e98e3ecc8f44d915d Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Date: Mon, 4 May 2009 01:17:43 -0700
Subject: [RFC][PATCH 5/7][v2] Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process().  The new parameter will
be used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge Hallyn se...@us.ibm.com
Reviewed-by: Oren Laadan or...@cs.columbia.edu
---
 kernel/fork.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d2d69d3..373411e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
+   pid_t *target_pids,
int trace)
 {
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
-   pid_t *target_pids = NULL;
 
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
struct pt_regs regs;
 
task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-   &init_struct_pid, 0);
+   &init_struct_pid, NULL, 0);
if (!IS_ERR(task))
init_idle(task, cpu);
 
@@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags,
struct task_struct *p;
int trace = 0;
long nr;
+   pid_t *target_pids = NULL;
 
/*
 * Do some preliminary argument and permissions checking before we
@@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags,
trace = tracehook_prepare_clone(clone_flags);
 
p = copy_process(clone_flags, stack_start, regs, stack_size,
-child_tidptr, NULL, trace);
+child_tidptr, NULL, target_pids, trace);
/*
 * Do this prior waking up the new thread - the thread pointer
 * might get invalid after that point, if the thread exits quickly.
-- 
1.5.2.5
