Re: [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page

2010-05-05 Thread Oren Laadan

Hi David,

I suppose you are looking for more details than those found in the
current patch-0 (http://lkml.org/lkml/2010/5/1/140).

We omitted them for brevity's sake; here is a link to patch-0 of a
previous posting of the patchset: http://lkml.org/lkml/2009/9/23/423


Thanks,

Oren.

David Howells wrote:

With a huge patch series like this, can you post a cover note at the front
(usually patch 0) saying what the point of the whole series is?

David



___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH v21 002/100] eclone (2/11): Have alloc_pidmap() return actual error code

2010-05-01 Thread Oren Laadan
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.
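
The error-pointer convention this patch relies on can be sketched in userspace as follows. This is only an illustration: the demo_* helpers stand in for the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(), and demo_alloc_pid() and demo_fork() are hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers (illustrative). */
#define DEMO_MAX_ERRNO 4095

static inline void *demo_err_ptr(long err)
{
	/* Only small negative errno values can be encoded in a pointer. */
	assert(err < 0 && err >= -DEMO_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

static inline long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int demo_is_err(const void *p)
{
	/* Error values occupy the top DEMO_MAX_ERRNO addresses. */
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_pid { int nr; };

/* Hypothetical allocator: reports a distinct errno for each failure cause. */
static struct demo_pid *demo_alloc_pid(struct demo_pid *slot, int nr, int in_use)
{
	if (!slot)
		return demo_err_ptr(-ENOMEM);	/* no memory for the object */
	if (in_use)
		return demo_err_ptr(-EBUSY);	/* requested number is taken */
	slot->nr = nr;
	return slot;
}

/* The caller propagates the real error instead of assuming -ENOMEM. */
static long demo_fork(struct demo_pid *slot, int nr, int in_use)
{
	struct demo_pid *pid = demo_alloc_pid(slot, nr, in_use);

	if (demo_is_err(pid))
		return demo_ptr_err(pid);
	return pid->nr;
}
```

With this shape, a caller can tell "out of memory" apart from "pid in use" without an extra out-parameter.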

Changelog[v1]:
- [Oren Laadan] Rebase to kernel 2.6.33

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
Tested-by: Serge E. Hallyn se...@us.ibm.com
Reviewed-by: Oren Laadan or...@cs.columbia.edu
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |   10 ++++++----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 44b0791..afdfb08 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1147,10 +1147,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
 
if (pid != &init_struct_pid) {
-   retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns);
-   if (!pid)
+   if (IS_ERR(pid)) {
+   retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
+   }
 
if (clone_flags & CLONE_NEWPID) {
retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index 52a371a..8330488 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -160,7 +160,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
for (i = 0; i <= max_scan; ++i) {
if (unlikely(!map->page))
if (alloc_pidmap_page(map) < 0)
-   break;
+   return -ENOMEM;
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {
@@ -191,7 +191,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
}
pid = mk_pid(pid_ns, map, offset);
}
-   return -1;
+   return -EBUSY;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -260,8 +260,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
struct upid *upid;
 
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-   if (!pid)
+   if (!pid) {
+   pid = ERR_PTR(-ENOMEM);
goto out;
+   }
 
tmp = ns;
for (i = ns->level; i >= 0; i--) {
@@ -295,7 +297,7 @@ out_free:
free_pidmap(pid->numbers + i);
 
kmem_cache_free(ns->pid_cachep, pid);
-   pid = NULL;
+   pid = ERR_PTR(nr);
goto out;
 }
 
-- 
1.6.3.3



[PATCH v21 008/100] eclone (8/11): Implement sys_eclone for x86 (32, 64)

2010-05-01 Thread Oren Laadan
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

eclone(), intended for use during restart, is the same as
clone(), except that it takes a 'pids' parameter. This parameter lets
the caller choose specific pid numbers for the child process in the
process's active and ancestor pid namespaces. (Descendant pid namespaces
in general don't matter since processes don't have pids in them anyway,
but see comments in copy_target_pids() regarding CLONE_NEWPID).

eclone() also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and all but
one of these are in use. If more new clone flags are needed, we will be
forced to define a new variant of the clone() system call. To address
this, eclone() allows at least 64 clone flags with some room
for more if necessary.
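
One way to picture the wider flag space is to treat the 32-bit flags argument as the low word and an extra word from the argument structure as the high word. The sketch below is an assumption about how the two could be combined into 64 bits; demo_build_flags() is a hypothetical helper, and the patch itself currently requires the high word to be 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine the low 32 flag bits with a high word
 * into one 64-bit flag value. Bits of the high word that would not fit
 * in 64 bits are rejected in this sketch.
 */
static int demo_build_flags(uint32_t flags_low, uint64_t flags_high,
			    uint64_t *out)
{
	if (flags_high >> 32)
		return -EINVAL;		/* beyond the 64 bits modelled here */
	*out = (flags_high << 32) | flags_low;
	return 0;
}
```

Keeping the low word identical to clone()'s existing flags argument means existing flag values keep their meaning, while new flags can be assigned to bits 32 and above.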

To prevent unprivileged processes from misusing this interface,
eclone() currently requires CAP_SYS_ADMIN when the 'pids' parameter
is non-NULL.

See Documentation/eclone in next patch for more details and an
example of its usage.

NOTE:
- System calls are restricted to 6 parameters and the number and sizes
  of parameters needed for eclone() exceed 6 integers. The new
  prototype works around this restriction while providing some
  flexibility if eclone() needs to be further extended in the
  future.
TODO:
- We should convert clone-flags to 64-bit value in all architectures.
  It's probably best to do that as a separate patchset since clone_flags
  touches several functions and that patchset seems independent of this
  new system call.

Changelog[v14]:
- [Oren Laadan] Rebase to kernel 2.6.33
  * introduce PTREGSCALL4 for sys_eclone
  * consolidate syscall definitions for 32/64 bit
- [Oren Laadan] Merge x86_64 (trivial patch) with current
- [Serge Hallyn] Add eclone stub for ia32 eclone

Changelog[v13]:
- [Dave Hansen]: Reorg to enable sharing code between x86 and x86-64.
- [Arnd Bergmann]: With args_size parameter, ->reserved1 is redundant
  and can be removed.
- [Nathan Lynch]: stop warnings about assigning u64 to a (32-bit) int*.
- [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to
  ->child_stack and ensure ->child_stack_size is 0 on architectures
  that don't need it (see comments in types.h for details).

Changelog[v12]:
- [Serge Hallyn] Ignore ->child_stack_size if ->child_stack_base
  is NULL.
- [Oren Laadan, Serge Hallyn] Rename clone_with_pids() to eclone()

Changelog[v11]:
- [Dave Hansen] Move clone_args validation checks to arch-independent
  code.
- [Oren Laadan] Make args_size a parameter to system call and remove
  it from 'struct clone_args'

Changelog[v10]:
- Rename clone3() to clone_with_pids()
- [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall
  implementation

Changelog[v9]:
- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
  architectures split the new clone-flags into 'low' and 'high'
  words and pass in the 'lower' flags as the first argument.
  This would maintain similarity of the clone3() with clone()/
  clone2(). Also has the side-effect of the name matching the
  number of parameters :-)
- [Roland McGrath] Rename structure to 'clone_args' and add a
  'child_stack_size' field

Changelog[v8]:
- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
  must be 64-bit.
- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
  and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
  is constant across architectures.
- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
  'unum_pids < 0' check.

Changelog[v4]:
- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
  in the target_pids[] list and setting it 0. See copy_target_pids()).
- (Oren Laadan) Specified target pids should apply only to youngest
  pid-namespaces (see copy_target_pids())
- (Matt Helsley) Update patch description.

Changelog[v2]:
- Remove unnecessary printk and add a note to callers of
  copy_target_pids() to free target_pids.
- (Serge

[PATCH v21 009/100] eclone (9/11): Implement sys_eclone for s390

2010-05-01 Thread Oren Laadan
From: Serge E. Hallyn se...@us.ibm.com

Implement the s390 hook for sys_eclone().

Changelog:
Nov 24: Removed user-space code from commit log. See user-cr git tree.
Nov 17: remove redundant flags_high check
Nov 13: As suggested by Heiko, convert eclone to take its
parameters via registers.

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 arch/s390/include/asm/unistd.h|3 ++-
 arch/s390/kernel/compat_linux.c   |   17 +
 arch/s390/kernel/compat_wrapper.S |8 
 arch/s390/kernel/process.c|   37 +
 arch/s390/kernel/syscalls.S   |1 +
 5 files changed, 65 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index 5f00751..ff13be1 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,8 @@
 #define __NR_pwritev           329
 #define __NR_rt_tgsigqueueinfo 330
 #define __NR_perf_event_open   331
-#define NR_syscalls 332
+#define __NR_eclone            332
+#define NR_syscalls 333
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 73b624e..1f70d6f 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -663,6 +663,23 @@ asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count)
return sys_write(fd, buf, count);
 }
 
+asmlinkage long sys32_clone(void)
+{
+   struct pt_regs *regs = task_pt_regs(current);
+   unsigned long clone_flags;
+   unsigned long newsp;
+   int __user *parent_tidptr, *child_tidptr;
+
+   clone_flags = regs->gprs[3] & 0xffffffffUL;
+   newsp = regs->orig_gpr2 & 0x7fffffffUL;
+   parent_tidptr = compat_ptr(regs->gprs[4]);
+   child_tidptr = compat_ptr(regs->gprs[5]);
+   if (!newsp)
+   newsp = regs->gprs[15];
+   return do_fork(clone_flags, newsp, regs, 0,
+  parent_tidptr, child_tidptr);
+}
+
 /*
  * 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64.
  * These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE}
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index 672ce52..b7bedfa 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1847,6 +1847,14 @@ sys_clone_wrapper:
llgtr   %r5,%r5 # int *
jg  sys_clone   # branch to system call
 
+   .globl  sys_eclone_wrapper
+sys_eclone_wrapper:
+   llgfr   %r2,%r2 # unsigned int
+   llgtr   %r3,%r3 # struct clone_args *
+   lgfr%r4,%r4 # int
+   llgtr   %r5,%r5 # pid_t *
+   jg  sys_eclone  # branch to system call
+
.globl  sys32_execve_wrapper
 sys32_execve_wrapper:
llgtr   %r2,%r2 # char *
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 1039fde..799cbb0 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -240,6 +240,43 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags,
   parent_tidptr, child_tidptr);
 }
 
+SYSCALL_DEFINE4(eclone, unsigned int, flags_low, struct clone_args __user *,
+   uca, int, args_size, pid_t __user *, pids)
+{
+   int rc;
+   struct pt_regs *regs = task_pt_regs(current);
+   struct clone_args kca;
+   int __user *parent_tid_ptr;
+   int __user *child_tid_ptr;
+   unsigned long flags;
+   unsigned long __user child_stack;
+   unsigned long stack_size;
+
+   rc = fetch_clone_args_from_user(uca, args_size, &kca);
+   if (rc)
+   return rc;
+
+   flags = flags_low;
+   parent_tid_ptr = (int __user *) kca.parent_tid_ptr;
+   child_tid_ptr =  (int __user *) kca.child_tid_ptr;
+
+   stack_size = (unsigned long) kca.child_stack_size;
+   if (stack_size)
+   return -EINVAL;
+
+   child_stack = (unsigned long) kca.child_stack;
+   if (!child_stack)
+   child_stack = regs->gprs[15];
+
+   /*
+* TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+* to several functions. Need to convert clone_flags to 64-bit.
+*/
+   return do_fork_with_pids(flags, child_stack, regs, stack_size,
+   parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+   pids);
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 201ce6b..08eab1d 100644
--- a/arch/s390/kernel/syscalls.S

[PATCH v21 010/100] eclone (10/11): Implement sys_eclone for powerpc

2010-05-01 Thread Oren Laadan
From: Nathan Lynch n...@pobox.com

Wired up for both ppc32 and ppc64, but tested only with the latter.

Changelog:
  - Jan 20: (ntl) fix 32-bit build
  - Nov 17: (serge) remove redundant flags_high check, and
don't fold it into flags.

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Signed-off-by: Nathan Lynch n...@pobox.com
Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 arch/powerpc/include/asm/syscalls.h |6 
 arch/powerpc/include/asm/systbl.h   |1 +
 arch/powerpc/include/asm/unistd.h   |3 +-
 arch/powerpc/kernel/entry_32.S  |8 +
 arch/powerpc/kernel/entry_64.S  |5 +++
 arch/powerpc/kernel/process.c   |   54 ++-
 6 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index 4084e56..920cefd 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -23,6 +23,12 @@ asmlinkage int sys_execve(unsigned long a0, unsigned long a1,
 asmlinkage int sys_clone(unsigned long clone_flags, unsigned long usp,
int __user *parent_tidp, void __user *child_threadptr,
int __user *child_tidp, int p6, struct pt_regs *regs);
+asmlinkage int sys_eclone(unsigned long flags_low,
+ struct clone_args __user *args,
+ size_t args_size,
+ pid_t __user *pids,
+ unsigned long p5, unsigned long p6,
+ struct pt_regs *regs);
 asmlinkage int sys_fork(unsigned long p1, unsigned long p2,
unsigned long p3, unsigned long p4, unsigned long p5,
unsigned long p6, struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index a5ee345..f94fc43 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -326,3 +326,4 @@ SYSCALL_SPU(perf_event_open)
 COMPAT_SYS_SPU(preadv)
 COMPAT_SYS_SPU(pwritev)
 COMPAT_SYS(rt_tgsigqueueinfo)
+PPC_SYS(eclone)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index f0a1026..4cdbd5c 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -345,10 +345,11 @@
 #define __NR_preadv            320
 #define __NR_pwritev   321
 #define __NR_rt_tgsigqueueinfo 322
+#define __NR_eclone            323
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls  323
+#define __NR_syscalls  324
 
 #define __NR__exit __NR_exit
 #define NR_syscalls    __NR_syscalls
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 1175a85..579f1da 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -586,6 +586,14 @@ ppc_clone:
stw r0,_TRAP(r1)    /* register set saved */
b   sys_clone
 
+   .globl  ppc_eclone
+ppc_eclone:
+   SAVE_NVGPRS(r1)
+   lwz r0,_TRAP(r1)
+   rlwinm  r0,r0,0,0,30    /* clear LSB to indicate full */
+   stw r0,_TRAP(r1)    /* register set saved */
+   b   sys_eclone
+
.globl  ppc_swapcontext
 ppc_swapcontext:
SAVE_NVGPRS(r1)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 07109d8..b763340 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -344,6 +344,11 @@ _GLOBAL(ppc_clone)
bl  .sys_clone
b   syscall_exit
 
+_GLOBAL(ppc_eclone)
+   bl  .save_nvgprs
+   bl  .sys_eclone
+   b   syscall_exit
+
 _GLOBAL(ppc32_swapcontext)
bl  .save_nvgprs
bl  .compat_sys_swapcontext
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index e4d71ce..b183287 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -961,7 +961,59 @@ int sys_clone(unsigned long clone_flags, unsigned long usp,
child_tidp = TRUNC_PTR(child_tidp);
}
 #endif
-   return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
+   return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
+}
+
+int sys_eclone(unsigned long clone_flags_low,
+  struct clone_args __user *uclone_args,
+  size_t size,
+  pid_t __user *upids,
+  unsigned long p5, unsigned long p6,
+  struct pt_regs *regs)
+{
+   struct clone_args kclone_args;
+   unsigned long stack_base;
+   int __user *parent_tidp;
+   int __user *child_tidp;
+   unsigned long stack_sz;
+   unsigned int nr_pids;
+   unsigned long flags;
+   unsigned long usp;
+   int rc;
+
+   CHECK_FULL_REGS(regs);
+
+   rc = fetch_clone_args_from_user(uclone_args, size, &kclone_args);
+   if (rc)
+   return rc;
+
+   stack_sz = 

[PATCH v21 004/100] eclone (4/11): Add target_pids parameter to alloc_pid()

2010-05-01 Thread Oren Laadan
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

This parameter is currently NULL, but will be used in a follow-on patch.

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
Tested-by: Serge E. Hallyn se...@us.ibm.com
Reviewed-by: Oren Laadan or...@cs.columbia.edu
---
 include/linux/pid.h |2 +-
 kernel/fork.c   |3 ++-
 kernel/pid.c|9 +++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index afdfb08..62018c8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -962,6 +962,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
+   pid_t *target_pids = NULL;
 
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1147,7 +1148,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
 
if (pid != &init_struct_pid) {
-   pid = alloc_pid(p->nsproxy->pid_ns);
+   pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 4eaf975..57f1344 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -276,13 +276,14 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
+   pid_t tpid;
 
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid) {
@@ -292,7 +293,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
tmp = ns;
for (i = ns->level; i >= 0; i--) {
-   nr = alloc_pidmap(tmp);
+   tpid = 0;
+   if (target_pids)
+   tpid = target_pids[i];
+
+   nr = set_pidmap(tmp, tpid);
if (nr < 0)
goto out_free;
 
-- 
1.6.3.3



[PATCH v21 012/100] c/r: extend arch_setup_additional_pages()

2010-05-01 Thread Oren Laadan
From: Alexey Dobriyan adobri...@gmail.com

Add a 'start' argument to request that the vDSO be mapped at a specific
address, and fail the operation if it cannot be mapped there.

This is useful for restart(2) to ensure that the memory layout is restored
exactly as needed.
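
The "use the requested address or fail" rule can be sketched as below. This is a simplified userspace model, not the patch's code: demo_pick_base() and DEMO_ALIGN() are hypothetical names, and only the alignment step is modelled, whereas the kernel code also searches for an unmapped area. The sketch returns -EBUSY as the powerpc hunk does; the s390 hunk returns -EINVAL.

```c
#include <assert.h>
#include <errno.h>

/* Round x up to the next multiple of a (a must be a power of two). */
#define DEMO_ALIGN(x, a) (((x) + ((a) - 1)) & ~((unsigned long)(a) - 1))

/*
 * Pick the mapping base. start == 0 means "no preference" and the
 * default is used. Returns 0 and stores the base on success, or
 * -EBUSY when a specific address was requested but cannot be honoured.
 */
static int demo_pick_base(unsigned long dflt, unsigned long start,
			  unsigned long align, unsigned long *base)
{
	unsigned long b = start ? start : dflt;

	b = DEMO_ALIGN(b, align);
	if (start && b != start)
		return -EBUSY;	/* restart needs the exact address */
	*base = b;
	return 0;
}
```

Failing instead of silently relocating matters for restart: a checkpointed process may hold pointers into the vDSO, so a different address would leave them dangling.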

Changelog[v19]:
  - [serge hallyn] Fix potential use-before-set ret
Changelog[v2]:
  - [ntl] powerpc: vdso build fix (ckpt-v17)

Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Signed-off-by: Alexey Dobriyan adobri...@gmail.com
Signed-off-by: Oren Laadan or...@cs.columbia.edu
---
 arch/powerpc/include/asm/elf.h |1 +
 arch/powerpc/kernel/vdso.c |   13 -
 arch/s390/include/asm/elf.h|2 +-
 arch/s390/kernel/vdso.c|   13 -
 arch/sh/include/asm/elf.h  |1 +
 arch/sh/kernel/vsyscall/vsyscall.c |2 +-
 arch/x86/include/asm/elf.h |3 ++-
 arch/x86/vdso/vdso32-setup.c   |9 +++--
 arch/x86/vdso/vma.c|   11 ---
 fs/binfmt_elf.c|2 +-
 10 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index c376eda..0b06255 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -266,6 +266,7 @@ extern int ucache_bsize;
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+  unsigned long start,
   int uses_interp);
 #define VDSO_AUX_ENT(a,b) NEW_AUX_ENT(a,b);
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index d84d192..74210ab 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -188,7 +188,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+   unsigned long start, int uses_interp)
 {
struct mm_struct *mm = current->mm;
struct page **vdso_pagelist;
@@ -220,6 +221,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
vdso_base = VDSO32_MBASE;
 #endif
 
+   /* in case restart(2) mandates a specific location */
+   if (start)
+   vdso_base = start;
+
current->mm->context.vdso_base = 0;
 
/* vDSO has a problem and was disabled, just don't enable it for the
@@ -249,6 +254,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
/* Add required alignment. */
vdso_base = ALIGN(vdso_base, VDSO_ALIGNMENT);
 
+   /* for restart(2), double check that we got what we asked for */
+   if (start && vdso_base != start) {
+   rc = -EBUSY;
+   goto fail_mmapsem;
+   }
+
/*
 * Put vDSO base into mm struct. We need to do this before calling
 * install_special_mapping or the perf counter mmap tracking code
diff --git a/arch/s390/include/asm/elf.h b/arch/s390/include/asm/elf.h
index 354d426..5081938 100644
--- a/arch/s390/include/asm/elf.h
+++ b/arch/s390/include/asm/elf.h
@@ -216,6 +216,6 @@ do { \
 struct linux_binprm;
 
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
-int arch_setup_additional_pages(struct linux_binprm *, int);
+int arch_setup_additional_pages(struct linux_binprm *, unsigned long, int);
 
 #endif
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 6bc9c19..54dad2f 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -195,7 +195,8 @@ static void vdso_init_cr5(void)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+   unsigned long start, int uses_interp)
 {
struct mm_struct *mm = current->mm;
struct page **vdso_pagelist;
@@ -226,6 +227,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
vdso_pages = vdso32_pages;
 #endif
 
+   /* in case restart(2) mandates a specific location */
+   if (start)
+   vdso_base = start;
+
/*
 * vDSO has a problem and was disabled, just don't enable it for
 * the process
@@ -248,6 +253,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
goto out_up;
}
 
+   /* for restart(2), double check that we got what we asked for */
+   if (start && vdso_base != start) {
+   rc = -EINVAL;
+   goto out_up

[PATCH v21 011/100] eclone (11/11): Document sys_eclone

2010-05-01 Thread Oren Laadan
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

This gives a brief overview of the eclone() system call.  We should
eventually describe more details in existing clone(2) man page or in
a new man page.
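
As a rough illustration of the argument structure documented in this patch, the sketch below mirrors the 'struct clone_args' layout with fixed-width types and applies the checks the documentation describes for an architecture that takes a plain stack pointer. The demo_* names are hypothetical, and the exact-size check is an assumption of this sketch rather than something the patch states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the layout shown in Documentation/eclone, with fixed-width types. */
struct demo_clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/*
 * Hypothetical validation for an architecture that passes a plain
 * stack pointer: unused and reserved fields must be 0, and so must
 * child_stack_size.
 */
static int demo_check_clone_args(const struct demo_clone_args *a, size_t size)
{
	if (size != sizeof(*a))
		return -EINVAL;	/* assumption: size must match exactly */
	if (a->clone_flags_high || a->reserved0)
		return -EINVAL;	/* currently unused, must be 0 */
	if (a->child_stack_size)
		return -EINVAL;	/* only stack-region architectures use it */
	return 0;
}
```

Using 64-bit fields for the pointers keeps the structure the same size on 32-bit and 64-bit callers, so no compat translation of the layout is needed.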

Changelog[v13]:
- [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to
  ->child_stack and ensure ->child_stack_size is 0 on architectures
  that don't need it.
- [Arnd Bergmann] Remove ->reserved1 field
- [Louis Rilling, Dave Hansen] Combine the two asm statements in the
  example into one and use memory constraint to avoid unnecessary copies.
Changelog[v12]:
- [Serge Hallyn] Fix/simplify stack-setup in the example code
- [Serge Hallyn, Oren Laadan] Rename syscall to eclone()

Changelog[v11]:
- [Dave Hansen] Move clone_args validation checks to arch-independent
  code.
- [Oren Laadan] Make args_size a parameter to system call and remove
  it from 'struct clone_args'
- [Oren Laadan] Fix some typos and clarify the order of pids in the
  @pids parameter.

Changelog[v10]:
- Rename clone3() to clone_with_pids() and fix some typos.
- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
- [Pavel Machek]: Fix an inconsistency and rename new file to
  Documentation/clone3.
- [Roland McGrath, H. Peter Anvin] Updates to description and
  example to reflect new prototype of clone3() and the updated/
  renamed 'struct clone_args'.

Changelog[v8]:
- clone2() is already in use in IA64. Rename syscall to clone3()
- Add notes to say that we return -EINVAL if invalid clone flags
  are specified or if the reserved fields are not 0.
Changelog[v7]:
- Rename clone_with_pids() to clone2()
- Changes to reflect new prototype of clone2() (using clone_struct).

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
Acked-by: Oren Laadan  or...@cs.columbia.edu
---
 Documentation/eclone |  348 ++
 1 files changed, 348 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/eclone

diff --git a/Documentation/eclone b/Documentation/eclone
new file mode 100644
index 0000000..c2f1b4b
--- /dev/null
+++ b/Documentation/eclone
@@ -0,0 +1,348 @@
+
+struct clone_args {
+   u64 clone_flags_high;
+   u64 child_stack;
+   u64 child_stack_size;
+   u64 parent_tid_ptr;
+   u64 child_tid_ptr;
+   u32 nr_pids;
+   u32 reserved0;
+};
+
+
+sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
+   pid_t * __user pids)
+
+   In addition to doing everything that clone() system call does, the
+   eclone() system call:
+
+   - allows additional clone flags (31 of 32 bits in the flags
+ parameter to clone() are in use)
+
+   - allows user to specify a pid for the child process in its
+ active and ancestor pid namespaces.
+
+   This system call is meant to be used when restarting an application
+   from a checkpoint. Such restart requires that the processes in the
+   application have the same pids they had when the application was
+   checkpointed. When containers are nested, the processes within the
+   containers exist in multiple pid namespaces and hence have multiple
+   pids to specify during restart.
+
+   The @flags_low parameter is identical to the 'clone_flags' parameter
+   in existing clone() system call.
+
+   The fields in 'struct clone_args' are meant to be used as follows:
+
+   u64 clone_flags_high:
+
+   When eclone() supports more than 32 flags, the additional bits
+   in the clone_flags should be specified in this field. This
+   field is currently unused and must be set to 0.
+
+   u64 child_stack;
+   u64 child_stack_size;
+
+   These two fields correspond to the 'child_stack' fields in
+   clone() and clone2() (on IA64) system calls. The usage of
+   these two fields depends on the processor architecture.
+
+   Most architectures use ->child_stack to pass-in a stack-pointer
+   itself and don't need the ->child_stack_size field. On these
+   architectures the ->child_stack_size field must be 0.
+
+   Some architectures, e.g. IA64, use ->child_stack to pass-in the
+   base of the region allocated for stack. These architectures
+   must pass in the size of the stack-region in ->child_stack_size.
+
+   u64 parent_tid_ptr;
+   u64 child_tid_ptr;
+
+   These two fields correspond to the 'parent_tid_ptr' and
+   'child_tid_ptr' fields in the clone() system call

[PATCH v21 005/100] eclone (5/11): Add target_pids parameter to copy_process()

2010-05-01 Thread Oren Laadan
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

Add a 'target_pids' parameter to copy_process().  The new parameter will be
used in a follow-on patch when eclone() is implemented.

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Cc: Oleg Nesterov o...@redhat.com
Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
Tested-by: Serge E. Hallyn se...@us.ibm.com
Reviewed-by: Oren Laadan or...@cs.columbia.edu
---
 kernel/fork.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 62018c8..9d2b57e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -957,12 +957,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
+   pid_t *target_pids,
int trace)
 {
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
-   pid_t *target_pids = NULL;
 
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1339,7 +1339,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
struct pt_regs regs;
 
task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-   &init_struct_pid, 0);
+   &init_struct_pid, NULL, 0);
if (!IS_ERR(task))
init_idle(task, cpu);
 
@@ -1362,6 +1362,7 @@ long do_fork(unsigned long clone_flags,
struct task_struct *p;
int trace = 0;
long nr;
+   pid_t *target_pids = NULL;
 
/*
 * Do some preliminary argument and permissions checking before we
@@ -1402,7 +1403,7 @@ long do_fork(unsigned long clone_flags,
trace = tracehook_prepare_clone(clone_flags);
 
p = copy_process(clone_flags, stack_start, regs, stack_size,
-child_tidptr, NULL, trace);
+child_tidptr, NULL, target_pids, trace);
/*
 * Do this prior waking up the new thread - the thread pointer
 * might get invalid after that point, if the thread exits quickly.
-- 
1.6.3.3



[PATCH v21 006/100] eclone (6/11): Check invalid clone flags

2010-05-01 Thread Oren Laadan
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future use. To ensure this,
define a mask of valid clone-flags and check the flags in the clone()
system calls.
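
The mask check itself is a single bitwise test. The following is a minimal sketch with a deliberately tiny, hypothetical flag set; the DEMO_* names and demo_check_flags() are illustrative, and the real mask in the patch covers all defined clone flags.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits for the sketch; the real ones live in sched.h. */
#define DEMO_CSIGNAL     0x000000ffUL	/* low byte: exit signal number */
#define DEMO_CLONE_VM    0x00000100UL
#define DEMO_CLONE_FS    0x00000200UL
#define DEMO_VALID_FLAGS (DEMO_CSIGNAL | DEMO_CLONE_VM | DEMO_CLONE_FS)

/* Reject any bit that is not part of the known-valid mask. */
static int demo_check_flags(unsigned long flags)
{
	if (flags & ~DEMO_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

Rejecting unknown bits now is what keeps them available later: if stray bits were silently accepted, existing callers could already be setting them, and assigning them a meaning would change those callers' behaviour.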

Changelog[v9]:
- Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS
  to avoid breaking any applications that may have set it. IOW, this
  patch/check only applies to clone-flags bits 33 and higher.

Changelog[v8]:
- New patch in set

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Cc: Oleg Nesterov o...@redhat.com
Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
Tested-by: Serge E. Hallyn se...@us.ibm.com
Acked-by: Oren Laadan orenl.cs.columbia.edu
---
 include/linux/sched.h |   12 
 kernel/fork.c |3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dad7f66..5de3ce5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -29,6 +29,18 @@
 #define CLONE_NEWNET   0x40000000  /* New network namespace */
 #define CLONE_IO   0x80000000  /* Clone io context */
 
+#define CLONE_UNUSED   0x00001000  /* Can be reused ? */
+
+#define VALID_CLONE_FLAGS  (CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
+CLONE_VFORK  | CLONE_PARENT | CLONE_THREAD  |\
+CLONE_NEWNS  | CLONE_SYSVSEM | CLONE_SETTLS |\
+CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
+CLONE_DETACHED | CLONE_UNTRACED |\
+CLONE_CHILD_SETTID | CLONE_STOPPED  |\
+CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+CLONE_NEWPID | CLONE_NEWNET | CLONE_IO)
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9d2b57e..e41b3d1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -964,6 +964,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
struct task_struct *p;
int cgroup_callbacks_done = 0;
 
+   if (clone_flags & ~VALID_CLONE_FLAGS)
+   return ERR_PTR(-EINVAL);
+
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
 
-- 
1.6.3.3



[PATCH v21 007/100] eclone (7/11): Define do_fork_with_pids()

2010-05-01 Thread Oren Laadan
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

do_fork_with_pids() is the same as do_fork(), except that it takes an
additional 'pid_set' parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v7]:
- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
  instead of 'struct pid_set *'.

Changelog[v6]:
- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
  is constant across architectures.
- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
- Rename 'struct target_pid_set' to 'struct pid_set' since it may
  be useful in other contexts.

Changelog[v3]:
- Fix long-line warning from checkpatch.pl

Changelog[v2]:
- To facilitate moving architecture-independent code to kernel/fork.c
  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
  rather than 'pid_t *' (next patch moves the arch-independent
  code to kernel/fork.c)

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Cc: Oleg Nesterov o...@redhat.com
Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
Tested-by: Serge E. Hallyn se...@us.ibm.com
Reviewed-by: Oren Laadan or...@cs.columbia.edu
---
 include/linux/sched.h |3 +++
 kernel/fork.c |   17 +++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5de3ce5..f4ae3e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2129,6 +2129,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+   unsigned long, int __user *, int __user *,
+   unsigned int, pid_t __user *);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index e41b3d1..2559d7a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1355,12 +1355,14 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
  unsigned long stack_start,
  struct pt_regs *regs,
  unsigned long stack_size,
  int __user *parent_tidptr,
- int __user *child_tidptr)
+ int __user *child_tidptr,
+ unsigned int num_pids,
+ pid_t __user *upids)
 {
struct task_struct *p;
int trace = 0;
@@ -1463,6 +1465,17 @@ long do_fork(unsigned long clone_flags,
return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+ unsigned long stack_start,
+ struct pt_regs *regs,
+ unsigned long stack_size,
+ int __user *parent_tidptr,
+ int __user *child_tidptr)
+{
+   return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+   parent_tidptr, child_tidptr, 0, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.6.3.3



[PATCH v21 003/100] eclone (3/11): Define set_pidmap() function

2010-05-01 Thread Oren Laadan
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

Define a set_pidmap() interface, which is like alloc_pidmap() except that
the caller specifies the pid number to be assigned.
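
To make the idea concrete, here is a toy user-space sketch. It is an
illustration under stated assumptions, not the kernel code: a flat bool array
stands in for the pidmap pages, DEMO_PID_MAX and DEMO_RESERVED are made-up
limits, all demo_* names are hypothetical, and the extra RESERVED_PIDS check
that set_pidmap() performs is left out. What it keeps is the shape of the
patch: one range scanner, with "set" expressed as a scan over the
one-element range [target, target + 1).

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_PID_MAX  64	/* hypothetical, tiny on purpose */
#define DEMO_RESERVED 4		/* hypothetical stand-in for RESERVED_PIDS */

static bool demo_used[DEMO_PID_MAX];
static int demo_last_pid;

/*
 * Claim the first free pid, starting just after 'last' and wrapping
 * around to 'min' once 'max' is reached. Returns the pid or -EBUSY.
 */
static int demo_do_alloc(int last, int min, int max)
{
	int start = last + 1;
	bool wrapped = false;
	int pid;

	if (start >= max)
		start = min;
	pid = start;
	do {
		if (!demo_used[pid]) {
			demo_used[pid] = true;
			return pid;
		}
		if (++pid >= max) {
			pid = min;
			wrapped = true;
		}
	} while (!(wrapped && pid >= start));

	return -EBUSY;
}

/* Ordinary allocation: next free pid, remembered in demo_last_pid. */
static int demo_alloc(void)
{
	int nr = demo_do_alloc(demo_last_pid, DEMO_RESERVED, DEMO_PID_MAX);

	if (nr >= 0)
		demo_last_pid = nr;
	return nr;
}

/* Caller-chosen pid: 0 means "any", otherwise exactly 'target'. */
static int demo_set(int target)
{
	if (!target)
		return demo_alloc();
	if (target < 0 || target >= DEMO_PID_MAX)
		return -EINVAL;
	return demo_do_alloc(target - 1, target, target + 1);
}
```

Note that demo_set() with a non-zero target does not update demo_last_pid,
mirroring the way the patch moves the last_pid update out of the scanner and
into alloc_pidmap().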

Changelog[v13]:
- Don't let do_alloc_pidmap return 0 if it failed to find a pid.
Changelog[v9]:
- Completely rewrote this patch based on Eric Biederman's code.
Changelog[v7]:
- [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
- Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
- (Eric Biederman): Avoid set_pidmap() function. Added couple of
  checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
- (Serge Hallyn) Check for 'pid < 0' in set_pidmap(). (Code
  actually checks for 'pid <= 0' for completeness).

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu suka...@us.ibm.com
Signed-off-by: Serge E. Hallyn se...@us.ibm.com
Reviewed-by: Oren Laadan or...@cs.columbia.edu
---
 kernel/pid.c |   41 +
 1 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 8330488..4eaf975 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -146,17 +146,18 @@ static int alloc_pidmap_page(struct pidmap *map)
return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+   int max)
 {
-   int i, offset, max_scan, pid, last = pid_ns->last_pid;
+   int i, offset, max_scan, pid;
struct pidmap *map;
 
pid = last + 1;
if (pid >= pid_max)
-   pid = RESERVED_PIDS;
+   pid = min;
offset = pid & BITS_PER_PAGE_MASK;
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-   max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+   max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
if (unlikely(!map->page))
if (alloc_pidmap_page(map) < 0)
@@ -165,7 +166,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
do {
if (!test_and_set_bit(offset, map->page)) {
atomic_dec(&map->nr_free);
-   pid_ns->last_pid = pid;
return pid;
}
offset = find_next_offset(map, offset);
@@ -176,16 +176,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 * bitmap block and the final block was the same
 * as the starting point, pid is before last_pid.
 */
-   } while (offset < BITS_PER_PAGE && pid < pid_max &&
+   } while (offset < BITS_PER_PAGE && pid < max &&
(i != max_scan || pid < last ||
!((last+1) & BITS_PER_PAGE_MASK)));
}
-   if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+   if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
++map;
offset = 0;
} else {
map = &pid_ns->pidmap[0];
-   offset = RESERVED_PIDS;
+   offset = min;
if (unlikely(last == offset))
break;
}
@@ -194,6 +194,31 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
return -EBUSY;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+   int nr;
+
+   nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max);
+   if (nr >= 0)
+   pid_ns->last_pid = nr;
+   return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+   if (!target)
+   return alloc_pidmap(pid_ns);
+
+   if (target >= pid_max)
+   return -EINVAL;
+
+   if ((target < 0) || (target < RESERVED_PIDS &&
+   pid_ns->last_pid >= RESERVED_PIDS))
+   return -EINVAL;
+
+   return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
int offset;
-- 
1.6.3.3



[PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page

2010-05-01 Thread Oren Laadan
From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.
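
For readers unfamiliar with the pattern being factored out: the page is
allocated without the lock held, installed under the lock only if nobody else
got there first, and the loser's copy is freed. Below is a rough user-space
sketch of that pattern, with a pthread mutex standing in for pidmap_lock and
calloc()/free() standing in for kzalloc()/kfree(). All demo_* names are
hypothetical.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

struct demo_map {
	void *page;
};

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Make sure map->page exists. Returns 0 on success, -ENOMEM on failure. */
static int demo_alloc_page(struct demo_map *map, size_t size)
{
	void *page;

	/* Fast path: somebody already installed a page. */
	if (map->page)
		return 0;

	/* Allocation may be slow, so do it without holding the lock. */
	page = calloc(1, size);

	pthread_mutex_lock(&demo_lock);
	if (!map->page) {
		map->page = page;	/* we won the race: install ours */
		page = NULL;
	}
	pthread_mutex_unlock(&demo_lock);

	/* No-op if our page was installed, otherwise frees the spare. */
	free(page);

	return map->page ? 0 : -ENOMEM;
}
```

The unlocked read of map->page mirrors what the kernel code does; a strictly
portable user-space version would use an atomic load there.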

Changelog[v4]:
- [Oren Laadan] Adapt to kernel 2.6.33-rc5
Changelog[v3]:
- Earlier version of patchset called alloc_pidmap_page() from two
  places. But now its called from only one place. Even so, moving
  this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
  -ENOMEM on error instead of -1.

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
Tested-by: Serge E. Hallyn se...@us.ibm.com
Reviewed-by: Oren Laadan or...@cs.columbia.edu
---
 kernel/pid.c |   41 ++---
 1 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index aebb30d..52a371a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,30 @@ static void free_pidmap(struct upid *upid)
atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+   void *page;
+
+   if (likely(map->page))
+   return 0;
+
+   page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+   /*
+* Free the page if someone raced with us installing it:
+*/
+   spin_lock_irq(&pidmap_lock);
+   if (!map->page) {
+   map->page = page;
+   page = NULL;
+   }
+   spin_unlock_irq(&pidmap_lock);
+   kfree(page);
+   if (unlikely(!map->page))
+   return -ENOMEM;
+
+   return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -134,22 +158,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
-   if (unlikely(!map->page)) {
-   void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-   /*
-* Free the page if someone raced with us
-* installing it:
-*/
-   spin_lock_irq(&pidmap_lock);
-   if (!map->page) {
-   map->page = page;
-   page = NULL;
-   }
-   spin_unlock_irq(&pidmap_lock);
-   kfree(page);
-   if (unlikely(!map->page))
+   if (unlikely(!map->page))
+   if (alloc_pidmap_page(map) < 0)
break;
-   }
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {
-- 
1.6.3.3



[PATCH v21 021/100] c/r: create syscalls: sys_checkpoint, sys_restart

2010-05-01 Thread Oren Laadan
Create trivial sys_checkpoint and sys_restart system calls. They make it
possible to checkpoint and restart an entire container, to and from a
checkpoint image file descriptor.

The syscalls take a pid, a file descriptor (for the image file) and
flags as arguments. The pid identifies the top-most (root) task in the
process tree, e.g. the container init: for sys_checkpoint the first
argument identifies the pid of the target container/subtree; for
sys_restart it will identify the pid of restarting root task.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.
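
For reference, this is the fork() behaviour the analogy leans on: a single
call site whose return value tells the caller which of three situations it is
in. The sketch below shows only fork() itself. How sys_checkpoint and
sys_restart encode their return values is not spelled out in this message, so
nothing about them is assumed here, and demo_fork_three_way() is a
hypothetical helper name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * One call, three outcomes:
 *   < 0  error, no child was created
 *   == 0 we are the child (the "second return" from the same call)
 *   > 0  we are the parent, and the value is the child's pid
 *
 * Returns the child's exit status as seen by the parent, or -1 on error.
 */
static int demo_fork_three_way(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;		/* error condition */

	if (pid == 0)
		_exit(7);		/* child side of the return */

	/* parent side of the return */
	if (waitpid(pid, &status, 0) < 0)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```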

We don't use copy_from_user()/copy_to_user() because it requires
holding the entire image in user space, and does not make sense for
restart.  Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic.  They also would significantly complicate
checkpoint that includes self.

Changelog[v21-rc3]:
  - Reorganize code: move checkpoint/* to kernel/checkpoint/*
Changelog[v19-rc1]:
  - Add 'int logfd' to prototype of sys_{checkpoint,restart}
Changelog[v18]:
  - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
Changelog[v17]:
  - Move checkpoint closer to namespaces (kconfig)
  - Kill Enable in c/r config option
Changelog[v16]:
  - Change sys_restart() first argument to be 'pid_t pid'
Changelog[v14]:
  - Change CONFIG_CHEKCPOINT_RESTART to CONFIG_CHECKPOINT (Ingo)
  - Remove line 'def_bool n' (default is already 'n')
  - Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Changelog[v5]:
  - Config is 'def_bool n' by default

Cc: linux-...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-s...@vger.kernel.org
Cc: linuxppc-...@ozlabs.org
Signed-off-by: Oren Laadan or...@cs.columbia.edu
Signed-off-by: Dave Hansen d...@linux.vnet.ibm.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
Tested-by: Serge E. Hallyn se...@us.ibm.com
---
 Makefile   |2 +-
 arch/x86/Kconfig   |4 +++
 arch/x86/include/asm/unistd_32.h   |4 ++-
 arch/x86/kernel/syscall_table_32.S |2 +
 include/linux/syscalls.h   |4 +++
 init/Kconfig   |2 +
 kernel/Makefile|1 +
 kernel/checkpoint/Kconfig  |   14 +++
 kernel/checkpoint/Makefile |5 
 kernel/checkpoint/sys.c|   45 
 kernel/sys_ni.c|4 +++
 11 files changed, 85 insertions(+), 2 deletions(-)
 create mode 100644 kernel/checkpoint/Kconfig
 create mode 100644 kernel/checkpoint/Makefile
 create mode 100644 kernel/checkpoint/sys.c

diff --git a/Makefile b/Makefile
index fa1db90..93be4e1 100644
--- a/Makefile
+++ b/Makefile
@@ -409,7 +409,7 @@ endif
 # of make so .config is not included in this case either (for *config).
 
 no-dot-config-targets := clean mrproper distclean \
-cscope TAGS tags help %docs check% \
+cscope TAGS tags help %docs checkstack \
 include/linux/version.h headers_% \
 kernelrelease kernelversion
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9458685..0874484 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -93,6 +93,10 @@ config STACKTRACE_SUPPORT
 config HAVE_LATENCYTOP_SUPPORT
def_bool y
 
+config CHECKPOINT_SUPPORT
+   bool
+   default y if X86_32
+
 config MMU
def_bool y
 
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index e543b0e..007d7cd 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -344,10 +344,12 @@
 #define __NR_perf_event_open   336
 #define __NR_recvmmsg  337
 #define __NR_eclone338
+#define __NR_checkpoint339
+#define __NR_restart   340
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 339
+#define NR_syscalls 341
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0c92570..2d5a6b0 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -338,3 +338,5 @@ ENTRY(sys_call_table)
.long sys_perf_event_open
.long

[PATCH v21 084/100] powerpc: reserve checkpoint arch identifiers

2010-05-01 Thread Oren Laadan
From: Nathan Lynch n...@pobox.com

Changelog [v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums

Cc: linuxppc-...@ozlabs.org
Signed-off-by: Nathan Lynch n...@pobox.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
Tested-by: Serge E. Hallyn se...@us.ibm.com
---
 include/linux/checkpoint_hdr.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index f2779d1..90cbc15 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -193,6 +193,10 @@ enum {
 #define CKPT_ARCH_X86_64 CKPT_ARCH_X86_64
CKPT_ARCH_S390X,
 #define CKPT_ARCH_S390X CKPT_ARCH_S390X
+   CKPT_ARCH_PPC32,
+#define CKPT_ARCH_PPC32 CKPT_ARCH_PPC32
+   CKPT_ARCH_PPC64,
+#define CKPT_ARCH_PPC64 CKPT_ARCH_PPC64
 };
 
 /* shared objrects (objref) */
-- 
1.6.3.3



[PATCH v21 086/100] powerpc: checkpoint/restart implementation

2010-05-01 Thread Oren Laadan
From: Nathan Lynch n...@pobox.com

Support for checkpointing and restarting GPRs, FPU state, DABR, and
Altivec state.

The portion of the checkpoint image manipulated by this code begins
with a bitmask of features indicating the various contexts saved.
Fields in the image that can vary depending on kernel configuration
(e.g. FP regs due to VSX) have their sizes explicitly recorded, except
for GPRs, so migrating between ppc32 and ppc64 won't work yet.

The restart code ensures that the task is not modified until the
checkpoint image is validated against the current kernel configuration
and hardware features (e.g. can't restart a task using Altivec on
non-Altivec systems).
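
The validate-before-modify rule for the feature bitmask can be sketched as
follows. This is an illustration only: the demo_* names are hypothetical, the
enum lists just three features, and returning -EINVAL for an unsupported
feature is an assumption rather than something stated in this message.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical feature numbers, in the spirit of enum ckpt_cpu_feature. */
enum demo_feature {
	DEMO_USED_FP,
	DEMO_USED_DEBUG,
	DEMO_USED_ALTIVEC,
};

#define DEMO_BIT(ftr) (UINT32_C(1) << (ftr))

/*
 * Compare the features recorded in the image with what this kernel and
 * CPU can restore. Nothing is modified here; the caller only goes on to
 * touch the task if this returns 0.
 */
static int demo_validate_features(uint32_t image, uint32_t supported)
{
	if (image & ~supported)
		return -EINVAL;	/* image needs something we lack */
	return 0;
}
```

For example, an image that recorded Altivec state is refused on a system
whose supported mask has only the FP and debug bits set.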

What works:
* self and external checkpoint of simple (single thread, one open
  file) 32- and 64-bit processes on a ppc64 kernel

What doesn't work:
* restarting a 32-bit task from a 64-bit task and vice versa

Untested:
* ppc32 (but it builds)

Changelog[v21]:
  - Do not include checkpoint_hdr.h explicitly
Changelog[v19]:
  - [Serge Hallyn] Add hook task_has_saved_sigmask()
Changelog[v19-rc3]:
  - [Oren Laadan] Move checkpoint.c from arch/powerpc/{mm->kernel}
  - [Nathan Lynch] Warn if full register state unavailable

Cc: linuxppc-...@ozlabs.org
Signed-off-by: Nathan Lynch n...@pobox.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
[Oren Laadan or...@cs.columbia.edu] Add arch-specific tty support
---
 arch/powerpc/include/asm/Kbuild   |1 +
 arch/powerpc/include/asm/checkpoint_hdr.h |   37 ++
 arch/powerpc/kernel/Makefile  |1 +
 arch/powerpc/kernel/checkpoint.c  |  532 +
 arch/powerpc/kernel/signal.c  |6 +
 5 files changed, 577 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h
 create mode 100644 arch/powerpc/kernel/checkpoint.c

diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index 5ab7d7f..20379f1 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -12,6 +12,7 @@ header-y += shmbuf.h
 header-y += socket.h
 header-y += termbits.h
 header-y += fcntl.h
+header-y += checkpoint_hdr.h
 header-y += poll.h
 header-y += sockios.h
 header-y += ucontext.h
diff --git a/arch/powerpc/include/asm/checkpoint_hdr.h b/arch/powerpc/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..fbb1705
--- /dev/null
+++ b/arch/powerpc/include/asm/checkpoint_hdr.h
@@ -0,0 +1,37 @@
+#ifndef __ASM_POWERPC_CKPT_HDR_H
+#define __ASM_POWERPC_CKPT_HDR_H
+
+#include <linux/types.h>
+
+/* arch dependent constants */
+#define CKPT_ARCH_NSIG 64
+#define CKPT_TTY_NCC  10
+
+#ifdef __KERNEL__
+
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _NSIG
+#error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
+#endif
+
+#include <linux/tty.h>
+#if CKPT_TTY_NCC != NCC
+#error CKPT_TTY_NCC size is wrong per asm-generic/termios.h
+#endif
+
+#endif /* __KERNEL__ */
+
+#ifdef __KERNEL__
+#ifdef CONFIG_PPC64
+#define CKPT_ARCH_ID CKPT_ARCH_PPC64
+#else
+#define CKPT_ARCH_ID CKPT_ARCH_PPC32
+#endif
+#endif
+
+struct ckpt_hdr_header_arch {
+   struct ckpt_hdr h;
+   __u32 what;
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_POWERPC_CKPT_HDR_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 8773263..6d294a4 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -63,6 +63,7 @@ obj64-$(CONFIG_HIBERNATION)   += swsusp_asm64.o
 obj-$(CONFIG_MODULES)  += module.o module_$(CONFIG_WORD_SIZE).o
 obj-$(CONFIG_44x)  += cpu_setup_44x.o
 obj-$(CONFIG_FSL_BOOKE)+= cpu_setup_fsl_booke.o dbell.o
+obj-$(CONFIG_CHECKPOINT)   += checkpoint.o
 
 extra-y:= head_$(CONFIG_WORD_SIZE).o
 extra-$(CONFIG_PPC_BOOK3E_32)  := head_new_booke.o
diff --git a/arch/powerpc/kernel/checkpoint.c b/arch/powerpc/kernel/checkpoint.c
new file mode 100644
index 0000000..492c604
--- /dev/null
+++ b/arch/powerpc/kernel/checkpoint.c
@@ -0,0 +1,532 @@
+/*
+ * PowerPC architecture support for checkpoint/restart.
+ * Based on x86 implementation.
+ *
+ * Copyright (C) 2008 Oren Laadan
+ * Copyright 2009 IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ */
+
+#if 0
+#define DEBUG
+#endif
+
+#include <linux/checkpoint.h>
+#include <linux/kernel.h>
+#include <asm/processor.h>
+#include <asm/ptrace.h>
+#include <asm/system.h>
+
+enum ckpt_cpu_feature {
+   CKPT_USED_FP,
+   CKPT_USED_DEBUG,
+   CKPT_USED_ALTIVEC,
+   CKPT_USED_SPE,
+   CKPT_USED_VSX,
+   CKPT_FTR_END = 31,
+};
+
+#define x(ftr) (1UL << ftr)
+
+/* features this kernel can handle for restart */
+enum {
+   CKPT_FTRS_POSSIBLE =
+#ifdef CONFIG_PPC_FPU
+   x(CKPT_USED_FP) |
+#endif
+   x(CKPT_USED_DEBUG) |
+#ifdef CONFIG_ALTIVEC
+   x(CKPT_USED_ALTIVEC

[PATCH v21 085/100] powerpc: provide APIs for validating and updating DABR

2010-05-01 Thread Oren Laadan
From: Nathan Lynch n...@pobox.com

A checkpointed task image may specify a value for the DABR (Data
Access Breakpoint Register).  The restart code needs to validate this
value before making any changes to the current task.

ptrace_set_debugreg encapsulates the bounds checking and platform
dependencies of programming the DABR.  Split this into validate
(debugreg_valid) and update (debugreg_update) functions, and make
them available for use outside of the ptrace code.

Also ptrace_set_debugreg has extern linkage, but no users outside of
ptrace.c.  Make it static.
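
The point of the split is that restart can check every value first and only
then start changing the task. A simplified user-space sketch of that
validate-then-update flow follows. The demo_* names, DEMO_TASK_SIZE and the
DEMO_TRANSLATION bit value are assumptions made for illustration, and only
the DABR-style checks are modelled.

```c
#include <stdbool.h>

#define DEMO_TASK_SIZE   0x80000000UL	/* hypothetical address limit */
#define DEMO_TRANSLATION 0x4UL		/* assumed stand-in for DABR_TRANSLATION */

struct demo_thread {
	unsigned long dabr;
};

/* Pure check: looks at the value, never at (or into) the task. */
static bool demo_debugreg_valid(unsigned long val, unsigned int index)
{
	/* Only one debug register is supported. */
	if (index != 0)
		return false;

	/* The bottom 3 bits are flags; the rest is an address. */
	if ((val & ~0x7UL) >= DEMO_TASK_SIZE)
		return false;

	/* A non-zero value must have the translation bit set. */
	if (val && !(val & DEMO_TRANSLATION))
		return false;

	return true;
}

/* Unconditional update: the caller must have validated 'val' already. */
static void demo_debugreg_update(struct demo_thread *t, unsigned long val)
{
	t->dabr = val;
}

/* Restart-style caller: the task stays untouched if validation fails. */
static int demo_restore_debugreg(struct demo_thread *t, unsigned long val)
{
	if (!demo_debugreg_valid(val, 0))
		return -1;
	demo_debugreg_update(t, val);
	return 0;
}
```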

Cc: linuxppc-...@ozlabs.org
Signed-off-by: Nathan Lynch n...@pobox.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
---
 arch/powerpc/include/asm/ptrace.h |7 +++
 arch/powerpc/kernel/ptrace.c  |   83 ++---
 2 files changed, 66 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index 9e2d84c..a88d711 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -87,6 +87,8 @@ struct pt_regs {
 
 #ifndef __ASSEMBLY__
 
+#include <linux/types.h>
+
 #define instruction_pointer(regs) ((regs)->nip)
 #define user_stack_pointer(regs) ((regs)->gpr[1])
 #define regs_return_value(regs) ((regs)->gpr[3])
@@ -141,6 +143,11 @@ do {   
  \
 #define arch_has_block_step()  (!cpu_has_feature(CPU_FTR_601))
 #define ARCH_HAS_USER_SINGLE_STEP_INFO
 
+/* for reprogramming DABR/DAC during restart of a checkpointed task */
+extern bool debugreg_valid(unsigned long val, unsigned int index);
+extern void debugreg_update(struct task_struct *task, unsigned long val,
+   unsigned int index);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index ed2cfe1..972e6a1 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -763,19 +763,23 @@ void user_disable_single_step(struct task_struct *task)
clear_tsk_thread_flag(task, TIF_SINGLESTEP);
 }
 
-int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
-  unsigned long data)
+/**
+ * debugreg_valid() - validate the value to be written to a debug register
+ * @val:   The prospective contents of the register.
+ * @index: Must be zero.
+ *
+ * Returns true if @val is an acceptable value for the register indicated by
+ * @index, false otherwise.
+ */
+bool debugreg_valid(unsigned long val, unsigned int index)
 {
-   /* For ppc64 we support one DABR and no IABR's at the moment (ppc64).
-*  For embedded processors we support one DAC and no IAC's at the
-*  moment.
-*/
-   if (addr > 0)
-   return -EINVAL;
+   /* We support only one debug register for now */
+   if (index != 0)
+   return false;
 
/* The bottom 3 bits in dabr are flags */
-   if ((data & ~0x7UL) >= TASK_SIZE)
-   return -EIO;
+   if ((val & ~0x7UL) >= TASK_SIZE)
+   return false;
 
 #ifndef CONFIG_PPC_ADV_DEBUG_REGS
/* For processors using DABR (i.e. 970), the bottom 3 bits are flags.
@@ -791,19 +795,38 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
 */
 
/* Ensure breakpoint translation bit is set */
-   if (data && !(data & DABR_TRANSLATION))
-   return -EIO;
-
-   /* Move contents to the DABR register */
-   task->thread.dabr = data;
-#else /* CONFIG_PPC_ADV_DEBUG_REGS */
+   if (val && !(val & DABR_TRANSLATION))
+   return false;
+#else
/* As described above, it was assumed 3 bits were passed with the data
 *  address, but we will assume only the mode bits will be passed
 *  as to not cause alignment restrictions for DAC-based processors.
 */
 
+   /* Read or Write bits must be set */
+   if (!(val & 0x3UL))
+   return false;
+#endif
+   return true;
+}
+
+/**
+ * debugreg_update() - update a debug register associated with a task
+ * @task:  The task whose register state is to be modified.
+ * @val:   The value to be written to the debug register.
+ * @index: Specifies the debug register.  Currently unused.
+ *
+ * Set a task's DABR/DAC to @val, which should be validated with
+ * debugreg_valid() beforehand.
+ */
+void debugreg_update(struct task_struct *task, unsigned long val,
+unsigned int index)
+{
+#ifndef CONFIG_PPC_ADV_DEBUG_REGS
+   task->thread.dabr = val;
+#else
/* DAC's hold the whole address without any mode flags */
-   task->thread.dac1 = data & ~0x3UL;
+   task->thread.dac1 = val & ~0x3UL;
 
if (task->thread.dac1 == 0) {
dbcr_dac(task) &= ~(DBCR_DAC1R | DBCR_DAC1W);
@@ -812,13 +835,8 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned 
long addr,

[PATCH v21 088/100] powerpc: enable checkpoint support in Kconfig

2010-05-01 Thread Oren Laadan
From: Nathan Lynch n...@pobox.com

Cc: linuxppc-...@ozlabs.org
Signed-off-by: Nathan Lynch n...@pobox.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
---
 arch/powerpc/Kconfig |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2e19500..16416b0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -26,6 +26,9 @@ config MMU
bool
default y
 
+config CHECKPOINT_SUPPORT
+   def_bool y
+
 config GENERIC_CMOS_UPDATE
def_bool y
 
-- 
1.6.3.3



[PATCH v21 087/100] powerpc: wire up checkpoint and restart syscalls

2010-05-01 Thread Oren Laadan
From: Nathan Lynch n...@pobox.com

Changelog [v21]:
 - Fix build break with CONFIG_CHECKPOINT=n
Changelog [v19]:
 - Checkpoint/powerpc: fix up checkpoint syscall, tidy restart

Cc: linuxppc-...@ozlabs.org
Signed-off-by: Nathan Lynch n...@pobox.com
Acked-by: Serge E. Hallyn se...@us.ibm.com
---
 arch/powerpc/include/asm/systbl.h |2 ++
 arch/powerpc/include/asm/unistd.h |4 +++-
 arch/powerpc/kernel/checkpoint.c  |   18 ++
 arch/powerpc/kernel/entry_32.S|   23 +++
 arch/powerpc/kernel/entry_64.S|   16 
 arch/powerpc/kernel/process.c |1 +
 6 files changed, 63 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index f94fc43..b5afba3 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -327,3 +327,5 @@ COMPAT_SYS_SPU(preadv)
 COMPAT_SYS_SPU(pwritev)
 COMPAT_SYS(rt_tgsigqueueinfo)
 PPC_SYS(eclone)
+PPC_SYS(checkpoint)
+PPC_SYS(restart)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 4cdbd5c..54f6ecb 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -346,10 +346,12 @@
 #define __NR_pwritev   321
 #define __NR_rt_tgsigqueueinfo 322
 #define __NR_eclone323
+#define __NR_checkpoint324
+#define __NR_restart   325
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls  324
+#define __NR_syscalls  326
 
 #define __NR__exit __NR_exit
 #define NR_syscalls__NR_syscalls
diff --git a/arch/powerpc/kernel/checkpoint.c b/arch/powerpc/kernel/checkpoint.c
index 492c604..9aeab89 100644
--- a/arch/powerpc/kernel/checkpoint.c
+++ b/arch/powerpc/kernel/checkpoint.c
@@ -530,3 +530,21 @@ int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
 {
return 0;
 }
+
+int sys_checkpoint(unsigned long pid, unsigned long fd, unsigned long flags,
+  unsigned long logfd, unsigned long p5, unsigned long p6,
+  struct pt_regs *regs)
+{
+   CHECK_FULL_REGS(regs);
+
+   return do_sys_checkpoint(pid, fd, flags, logfd);
+}
+
+int sys_restart(unsigned long pid, unsigned long fd, unsigned long flags,
+   unsigned long logfd, unsigned long p5, unsigned long p6,
+   struct pt_regs *regs)
+{
+   CHECK_FULL_REGS(regs);
+
+   return do_sys_restart(pid, fd, flags, logfd);
+}
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 579f1da..853814b 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -594,6 +594,29 @@ ppc_eclone:
stw r0,_TRAP(r1)/* register set saved */
b   sys_eclone
 
+/* To handle self-checkpoint we must save nvgprs */
+   .globl  ppc_checkpoint
+ppc_checkpoint:
+   SAVE_NVGPRS(r1)
+   lwz r0,_TRAP(r1)
+   rlwinm  r0,r0,0,0,30/* clear LSB to indicate full */
+   stw r0,_TRAP(r1)/* register set saved */
+   b   sys_checkpoint
+
+/* The full register set must be restored upon return from restart.
+ * Save nvgprs unconditionally so the caller's state is
+ * restored correctly in case of error.
+ */
+   .globl  ppc_restart
+ppc_restart:
+   SAVE_NVGPRS(r1)
+   lwz r0,_TRAP(r1)
+   rlwinm  r0,r0,0,0,30/* clear LSB to indicate full */
+   stw r0,_TRAP(r1)/* register set saved */
+   bl  sys_restart
+   REST_NVGPRS(r1)
+   b ret_from_syscall
+
.globl  ppc_swapcontext
 ppc_swapcontext:
SAVE_NVGPRS(r1)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index b763340..228f592 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -349,6 +349,22 @@ _GLOBAL(ppc_eclone)
bl  .sys_eclone
b   syscall_exit
 
+/* To handle self-checkpoint we must save nvgprs */
+_GLOBAL(ppc_checkpoint)
+   bl  .save_nvgprs
+   bl  .sys_checkpoint
+   b   syscall_exit
+
+/* The full register set must be restored upon return from restart.
+ * Save nvgprs unconditionally so the caller's state is
+ * restored correctly in case of error.
+ */
+_GLOBAL(ppc_restart)
+   bl  .save_nvgprs
+   bl  .sys_restart
+   REST_NVGPRS(r1)
+   b   syscall_exit
+
 _GLOBAL(ppc32_swapcontext)
bl  .save_nvgprs
bl  .compat_sys_swapcontext
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index b183287..1664586 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -30,6 +30,7 @@
 #include <linux/init_task.h>
 #include <linux/module.h>
 #include <linux/kallsyms.h>
+#include <linux/checkpoint.h>
 #include <linux/mqueue.h>
 #include <linux/hardirq.h>
 #include <linux/utsname.h>
-- 
1.6.3.3


Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation

2009-03-18 Thread Oren Laadan

An alternative: the task that created the container is the parent
(outside the container) of the container's init(1). In turn, init(1) creates
a special 'monitor' thread that monitors the restart, and the outside task
reaps the exit status of that thread (and only that thread).

[Hmmm... thinking about this: what happens if the container's init(1) calls
clone() with CLONE_PARENT? Does it not generate a sort of competing
container init(1)?]

Oren.


Cedric Le Goater wrote:
 Again, how would 'cr' obtain exit status for these tasks, and how would
 it distinguish failure from normal operation?
 
 Here's our solution to this issue.
 
 mcr maintains in its kernel container object an exitcode attribute for 
 the mcr-restart process. This process is detached from the fork tree of 
 the restarted application.  
 
 When the restart is finished, an mcr-wait command can be called to reap
 this exitcode. This makes it possible to distinguish an exit of the
 application process from an exit of the mcr-restart process.
 
 This is a must-have for batch managers in an HPC environment. 
 
 Cheers,
 
 C.
 


Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation

2009-03-12 Thread Oren Laadan


Nathan Lynch wrote:
 On Tue, 24 Feb 2009 13:58:26 -0600
 Serge E. Hallyn se...@us.ibm.com wrote:
 
 Quoting Nathan Lynch (n...@pobox.com):
 Nathan Lynch n...@pobox.com wrote:
 Oren Laadan wrote:
 Nathan Lynch wrote:
 What doesn't work:
 * restarting a 32-bit task from a 64-bit task and vice versa
 Is there a test to bail if we attempt to checkpoint such tasks ?
 No, but I'll add one if it looks too hard to fix for the next round.
 Unfortunately, adding a check for this is hard.

 The point of no return in the restart path is cr_read_mm, which tears
 down current's address space.  cr_read_mm runs way before cr_read_cpu,
 which is the only restart method I've implemented for powerpc so far.
 So, checking for this condition in cr_read_cpu is too late if I want
 restart(2) to return an error and leave the caller's memory map
 intact.  (And I do want this: restart should be as robust as execve.)

 Well okay then, cr_read_head_arch seems to be the right place in the
 restart sequence for the architecture code to handle this.  However,
 cr_write_head_arch (which produces the buffer that cr_read_head_arch
 consumes) is not provided a reference to the task to be checkpointed,
 nor can it assume that it's operating on current.  I need a reference
 to a task before I can determine whether it's running in 32- or 64-bit
 mode, or using the FPU, Altivec, SPE, whatever.

 In any case, mixing 32- and 64-bit tasks across restart is something I
 eventually want to support, not reject.  But the problem I've outlined
 applies to FPU state and vector extensions (VMX, SPE), as well as
 sanity-checking debug register (DABR) contents.  We'll need to be able
 to error out gracefully from restart when a checkpoint image specifies a
 feature unsupported by the current kernel or hardware.  But I don't see
 how to do it with the current architecture.  Am I missing something?
 I suspect I can guess the response to this suggestion, but how about we
 accept that if sys_restart() fails due to something like this, the
 task is lost and can't exit gracefully?
 
 In the short term it might be necessary.  But the restart code should
 forcibly kill the task instead of returning an error back up to
 userspace in this case.  Once the memory map of the process has been
 altered, there is no point in allowing it to continue (and likely dump
 a useless core).  Btw, this failure mode seems to apply when
 cr_read_files() fails, too...
 
 But in the long term, things need to be more robust (e.g. restart(2)
 returns ENOEXEC without messing with current->mm).  I think it's worth
 looking at how execve operates... if I understand correctly, it sets up
 a new mm_struct disconnected from the current task and activates it at
 the last moment.
 

That's a good idea, and I have considered it in the past.

However, it is easier to restart a task in its own, new context,
including the MM. For instance, you can leverage all memory syscalls.

An in-between way would be to switch to the new MM but not tear down
the original one, saving it alongside instead. If a failure occurs,
restore it.

Then you'll have to ask the same question about all other resources:
signal handlers, open files, etc. If you want the operation to be
non-intrusive in the case of an error, you must apply all the changes
atomically, or none of them.
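
To make the "all or none" idea concrete, here is a minimal userspace sketch, not the actual restart code. It assumes a hypothetical struct task_state standing in for the MM and the other resources, and a hypothetical build_state() standing in for the restore steps. The new state is built on the side and committed in a single step, so a failure leaves the caller's state intact.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-task state that restart replaces. */
struct task_state {
	char *mm;	/* stands in for the memory map */
	int nfiles;	/* stands in for the open-file table */
};

/* Hypothetical restore step: builds the new state from an image. */
static int build_state(struct task_state *out, const char *image)
{
	size_t len;

	if (!image)
		return -1;	/* simulated restore failure */
	len = strlen(image) + 1;
	out->mm = malloc(len);
	if (!out->mm)
		return -1;
	memcpy(out->mm, image, len);
	out->nfiles = 1;
	return 0;
}

/*
 * Build the new state on the side, then commit it in one step.
 * On any failure 'cur' is left untouched.
 */
static int restart_atomic(struct task_state *cur, const char *image)
{
	struct task_state next = { NULL, 0 };

	if (build_state(&next, image) < 0)
		return -1;	/* nothing committed, 'cur' is intact */

	free(cur->mm);		/* point of no return */
	*cur = next;
	return 0;
}
```

The cost is that every resource needs the same treatment, which is exactly the question raised above for signal handlers and open files.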

However, I do think that this is not necessary: the tasks that are
doing the restart have been created from scratch for that purpose,
so they need not return any specific value to the user. It is the
task that initiates the restart that needs to handle errors gracefully.
The scheme I proposed in the previous email does exactly that.

(This does not apply to self-restart, for obvious reasons, but that
is a special case anyway).

Oren.



Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation

2009-03-12 Thread Oren Laadan


Nathan Lynch wrote:
 Nathan Lynch n...@pobox.com wrote:
 Oren Laadan wrote:
 Nathan Lynch wrote:
 What doesn't work:
 * restarting a 32-bit task from a 64-bit task and vice versa
 Is there a test to bail if we attempt to checkpoint such tasks ?
 No, but I'll add one if it looks too hard to fix for the next round.
 
 Unfortunately, adding a check for this is hard.
 
 The point of no return in the restart path is cr_read_mm, which tears
 down current's address space.  cr_read_mm runs way before cr_read_cpu,
 which is the only restart method I've implemented for powerpc so far.
 So, checking for this condition in cr_read_cpu is too late if I want
 restart(2) to return an error and leave the caller's memory map
 intact.  (And I do want this: restart should be as robust as execve.)

In the case of restarting a container, I think it's ok if a restarting
task dies in an ugly way -- this will be observed and handled by the
initiating task outside the container, which will gracefully report to
the caller/user.

Even if you close this hole, any other failure later on during
restart, even a failure to allocate kernel memory due to memory pressure,
will have the undesired effect that you are trying to avoid.

That said, any difference in the architecture that may cause restart to
fail is probably best recorded by cr_write_head_arch.

 
 Well okay then, cr_read_head_arch seems to be the right place in the
 restart sequence for the architecture code to handle this.  However,
 cr_write_head_arch (which produces the buffer that cr_read_head_arch
 consumes) is not provided a reference to the task to be checkpointed,
 nor can it assume that it's operating on current.  I need a reference
 to a task before I can determine whether it's running in 32- or 64-bit
 mode, or using the FPU, Altivec, SPE, whatever.
 
 In any case, mixing 32- and 64-bit tasks across restart is something I
 eventually want to support, not reject.  But the problem I've outlined
 applies to FPU state and vector extensions (VMX, SPE), as well as
 sanity-checking debug register (DABR) contents.  We'll need to be able
 to error out gracefully from restart when a checkpoint image specifies a
 feature unsupported by the current kernel or hardware.  But I don't see
 how to do it with the current architecture.  Am I missing something?
 

More specifically, I envision restart to work like this:

1) user invokes user-land utility (e.g. cr --restart ...)
2) 'cr' will create a new container
3) 'cr' will start a child in that container
4) child will create rest of tree (in kernel or in user space - tbd)
5) each task in that tree will restore itself
6) 'cr' monitors this process
7) if all goes well - 'cr' reports ok.
8) if something goes bad, 'cr' notices and notifies caller/user

so tasks that are restarting may just as well die badly - we don't care.
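
Steps 6-8 can be sketched in plain userspace C. This is only an illustration: container creation (step 2) is omitted, and cr_monitor() and the restore_fn callback are hypothetical names standing in for the real 'cr' utility and the in-kernel restore.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for steps 3-5: the child would restore itself
 * here.  The return value becomes the child's exit code.
 */
typedef int (*restore_fn)(void);

/*
 * Steps 6-8: run the restore in a child and report the outcome.
 * Returns 0 if the child exited cleanly, -1 if it failed or was killed.
 */
static int cr_monitor(restore_fn restore)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(restore());	/* child: may die badly, we don't care */

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
		return 0;
	return -1;	/* non-zero exit code or killed by a signal */
}

/* Two trivial restore callbacks for demonstration. */
static int restore_ok(void)  { return 0; }
static int restore_bad(void) { return 1; }
```

Calling cr_monitor(restore_ok) reports success and cr_monitor(restore_bad) reports failure; in both cases only the monitoring task has to return a meaningful value to the user.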

Does that make sense ?

Oren.


Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation

2009-02-04 Thread Oren Laadan


Benjamin Herrenschmidt wrote:
 On Wed, 2009-02-04 at 09:54 -0600, Serge E. Hallyn wrote:
 Quoting Benjamin Herrenschmidt (b...@kernel.crashing.org):
 +struct cr_hdr_cpu {
 +  struct pt_regs pt_regs;
 +  /* relevant fields from thread_struct */
 +  double fpr[32][TS_FPRWIDTH];
 +  unsigned int fpscr;
 +  int fpexc_mode;
 +  /* unsigned int align_ctl; this is never updated? */
 +  unsigned long dabr;
 +};
 Is there some version or other identification somewhere ? If not there
 should be. ie, we're going to add things here. For example, what about
 the vector registers ? Also, some CPUs will have more HW debug registers
 than just the DABR (we plan to add support for all the BookE architected
 IACs and DACs for example), etc...
 The arch-independent checkpoint header does have kernel
 maj:min:rev:patch info.  We expect to have to do more,
 assuming that the .config can change the arch-dependent
 cpu header (i.e. perhaps TS_FPRWIDTH could be changed).
 
 It could to a certain extent... things like VSX, VSR, or freescale SPE,
 or even the Cell SPU state etc
 
 I wonder if we want a tagged structure so we can easily add things...

From the little bit I read here, I suspect that the sub-arch classification
is best done in an arch-dependent header. I'd follow this rule
of thumb:

* Anything that is decided at compile time should probably go in the
arch-dependent header.

* Anything that can change at boot time (e.g., for x86 that would include
the capabilities of the FPU), or even run time (is there any ?) should
be described to the letter (in fine print) in 'struct cr_hdr_cpu' and
friends.
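
As a rough sketch of what "described to the letter" could look like for boot-time or run-time features, the image can carry a bitmask of the features the task used, and restart can refuse an image that needs something the running system lacks. The struct, the bit names and cr_check_features() below are all hypothetical; nothing like them exists in the patch under review.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits; names and values are illustrative only. */
#define CR_CPU_FEAT_FPU	(1u << 0)
#define CR_CPU_FEAT_VMX	(1u << 1)
#define CR_CPU_FEAT_SPE	(1u << 2)

/* Hypothetical header fragment recorded at checkpoint time. */
struct cr_cpu_features {
	uint32_t used;	/* features the checkpointed task used */
};

/*
 * Returns 0 if every feature recorded in the image is available on the
 * restarting system, -1 otherwise.
 */
static int cr_check_features(const struct cr_cpu_features *img,
			     uint32_t supported)
{
	return (img->used & ~supported) ? -1 : 0;
}
```

A check like this would have to run before any of the caller's state is replaced for the error to be reported gracefully.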

Oren.



Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation

2009-01-29 Thread Oren Laadan
Nathan,

Thanks for the patch. Looks good, see some comments below.
(disclaimer: I'm not very familiar with ppc architecture)

Nathan Lynch wrote:
 The only thing of significance here is that the checkpointed task's
 pt_regs and fp state are saved and restored (see cr_write_cpu and
 cr_read_cpu); the rest of the code consists of dummy implementations
 of the APIs the arch needs to provide to the checkpoint/restart core.
 
 What works:
 * self and external checkpoint of simple (single thread, one open
   file) 32- and 64-bit processes on a ppc64 kernel
 
 What doesn't work:
 * restarting a 32-bit task from a 64-bit task and vice versa

Is there a test to bail if we attempt to checkpoint such tasks ?

 
 Untested:
 * ppc32 (but it builds)
 
 Signed-off-by: Nathan Lynch n...@pobox.com
 ---
  arch/powerpc/include/asm/checkpoint_hdr.h |   40 +
  arch/powerpc/mm/Makefile  |1 +
  arch/powerpc/mm/checkpoint.c  |  261 
 +
  3 files changed, 302 insertions(+), 0 deletions(-)
  create mode 100644 arch/powerpc/include/asm/checkpoint_hdr.h
  create mode 100644 arch/powerpc/mm/checkpoint.c
 
 diff --git a/arch/powerpc/include/asm/checkpoint_hdr.h 
 b/arch/powerpc/include/asm/checkpoint_hdr.h
 new file mode 100644
 index 000..752c53f
 --- /dev/null
 +++ b/arch/powerpc/include/asm/checkpoint_hdr.h
 @@ -0,0 +1,40 @@
 +#ifndef __ASM_PPC_CKPT_HDR_H
 +#define __ASM_PPC_CKPT_HDR_H
 +/*
 + *  Checkpoint/restart - architecture specific headers ppc
 + *
 + *  Copyright (C) 2008 Oren Laadan
 + *
 + *  This file is subject to the terms and conditions of the GNU General 
 Public
 + *  License.  See the file COPYING in the main directory of the Linux
 + *  distribution for more details.
 + */
 +
 +#include <linux/types.h>
 +#include <asm/ptrace.h>
 +#include <asm/mmu.h>
 +#include <asm/processor.h>
 +
 +struct cr_hdr_head_arch {
 + __u32 unimplemented;
 +};
 +
 +struct cr_hdr_thread {
 + __u32 unimplemented;
 +};
 +
 +struct cr_hdr_cpu {
 + struct pt_regs pt_regs;

It has been suggested (as done in x86/32 code) not to use 'struct pt_regs'
because it can (and has) changed on x86 and because it only contains
the registers that the kernel trashes, not all usermode registers.

https://lists.linux-foundation.org/pipermail/containers/2008-August/012355.html

 + /* relevant fields from thread_struct */
 + double fpr[32][TS_FPRWIDTH];

Can TS_FPRWIDTH change between sub-archs or kernel versions ?  If so, it
needs to be stated explicitly.

 + unsigned int fpscr;
 + int fpexc_mode;
 + /* unsigned int align_ctl; this is never updated? */
 + unsigned long dabr;

Are these fields always guaranteed to compile to the same number of bytes
regardless of 32/64 bit choice of compiler (or sub-arch?) ?

In the x86(32/64) architecture we use types with explicit size such as
__u32 and the like to ensure that it always compiles to the same size.
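
For illustration only, a fixed-layout variant of the non-pt_regs fields might look like the sketch below. The struct name cr_hdr_cpu_fixed is hypothetical, and uint32_t/uint64_t from <stdint.h> stand in for the kernel's __u32/__u64 so that the example builds in userspace. The widest field is placed first so no padding is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical fixed-layout variant of the non-pt_regs fields.
 * uint32_t/uint64_t stand in for the kernel's __u32/__u64.
 */
struct cr_hdr_cpu_fixed {
	uint64_t dabr;		/* was: unsigned long (4 or 8 bytes) */
	uint32_t fpscr;		/* was: unsigned int */
	int32_t  fpexc_mode;	/* was: int */
};

/* 8 + 4 + 4 bytes, no padding, on both 32- and 64-bit builds. */
_Static_assert(sizeof(struct cr_hdr_cpu_fixed) == 16,
	       "checkpoint header layout must not depend on word size");
_Static_assert(offsetof(struct cr_hdr_cpu_fixed, fpscr) == 8,
	       "fpscr must follow dabr directly");
```

The compile-time assertions turn an accidental layout change into a build failure instead of a silently incompatible image.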

 +};
 +
 +struct cr_hdr_mm_context {
 + __u32 unimplemented;
 +};
 +
 +#endif /* __ASM_PPC_CKPT_HDR_H */
 diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
 index e7392b4..8a523a0 100644
 --- a/arch/powerpc/mm/Makefile
 +++ b/arch/powerpc/mm/Makefile
 @@ -24,3 +24,4 @@ obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o
  obj-$(CONFIG_PPC_MM_SLICES)  += slice.o
  obj-$(CONFIG_HUGETLB_PAGE)   += hugetlbpage.o
  obj-$(CONFIG_PPC_SUBPAGE_PROT)   += subpage-prot.o
 +obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o
 diff --git a/arch/powerpc/mm/checkpoint.c b/arch/powerpc/mm/checkpoint.c
 new file mode 100644
 index 000..8cdff24
 --- /dev/null
 +++ b/arch/powerpc/mm/checkpoint.c
 @@ -0,0 +1,261 @@
 +/*
 + *  Checkpoint/restart - architecture specific support for powerpc.
 + *  Based on x86 implementation.
 + *
 + *  Copyright (C) 2008 Oren Laadan
 + *  Copyright 2009 IBM Corp.
 + *
 + *  This file is subject to the terms and conditions of the GNU General 
 Public
 + *  License.  See the file COPYING in the main directory of the Linux
 + *  distribution for more details.
 + */
 +
 +#define DEBUG 1 /* for pr_debug */
 +
 +#include <linux/checkpoint.h>
 +#include <linux/checkpoint_hdr.h>
 +#include <linux/kernel.h>
 +#include <asm/processor.h>
 +
 +static void cr_hdr_init(struct cr_hdr *hdr, __s16 type, __s16 len, __u32 
 parent)
 +{
 + hdr->type = type;
 + hdr->len = len;
 + hdr->parent = parent;
 +}
 +

This function is rather generic and useful to arch-independent code and
to other architectures. Perhaps put it in a separate patch ?

 +/* dump the thread_struct of a given task */
 +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
 +{
 + struct cr_hdr_thread *thread_hdr;
 + struct cr_hdr cr_hdr;
 + u32 parent;
 + int ret;
 +
 + thread_hdr = cr_hbuf_get(ctx, sizeof(*thread_hdr));
 + if (!thread_hdr)
 + return -ENOMEM;
 +
 + parent = task_pid_vnr(t);
 +
 + cr_hdr_init(&cr_hdr, CR_HDR_THREAD, sizeof(*thread_hdr), parent);
 +
 + thread_hdr->unimplemented

Re: [PATCH 1/3] powerpc: bare minimum checkpoint/restart implementation

2009-01-29 Thread Oren Laadan

Nathan Lynch wrote:
 Hey Oren, thanks for taking a look.
 
 Oren Laadan wrote:
 Nathan Lynch wrote:
 What doesn't work:
 * restarting a 32-bit task from a 64-bit task and vice versa
 Is there a test to bail if we attempt to checkpoint such tasks ?
 
 No, but I'll add one if it looks too hard to fix for the next round.
 
 
 +struct cr_hdr_cpu {
 +   struct pt_regs pt_regs;
 It has been suggested (as done in x86/32 code) not to use 'struct pt_regs'
 because it can (and has) changed on x86 and because it only contains
 the registers that the kernel trashes, not all usermode registers.

 https://lists.linux-foundation.org/pipermail/containers/2008-August/012355.html
 
 Yeah, I considered that discussion, but the situation is different for
 powerpc (someone on linuxppc-dev smack me if I'm wrong here :)
 pt_regs is part of the ABI, and it encompasses all user mode registers
 except for floating point, which are handled separately.
 
 
 +   /* relevant fields from thread_struct */
 +   double fpr[32][TS_FPRWIDTH];
 Can TS_FPRWIDTH change between sub-archs or kernel versions ?  If so, it
 needs to be stated explicitly.

 +   unsigned int fpscr;
 +   int fpexc_mode;
 +   /* unsigned int align_ctl; this is never updated? */
 +   unsigned long dabr;
 Are these fields always guaranteed to compile to the same number of bytes
 regardless of 32/64 bit choice of compiler (or sub-arch?) ?

 In the x86(32/64) architecture we use types with explicit size such as
 __u32 and the like to ensure that it always compiles to the same
 size.
 
 Yeah, I'll have to fix these up.
 
 
 
 +static void cr_hdr_init(struct cr_hdr *hdr, __s16 type, __s16 len, __u32 
 parent)
 +{
 +   hdr->type = type;
 +   hdr->len = len;
 +   hdr->parent = parent;
 +}
 +
 This function is rather generic and useful to arch-independent code and
 to other architectures. Perhaps put it in a separate patch ?
 
 Alright.  By the way, why are cr_hdr->type and cr_hdr->len signed
 types?
 

No particular reason. I can change that in v14.

 
 +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
 +{
 +   struct cr_hdr_cpu *cpu_hdr;
 +   struct pt_regs *pt_regs;
 +   struct cr_hdr cr_hdr;
 +   u32 parent;
 +   int ret;
 +
 +   cpu_hdr = cr_hbuf_get(ctx, sizeof(*cpu_hdr));
 +   if (!cpu_hdr)
 +   return -ENOMEM;
 +
 +   parent = task_pid_vnr(t);
 +
 +   cr_hdr_init(&cr_hdr, CR_HDR_CPU, sizeof(*cpu_hdr), parent);
 +
 +   /* pt_regs: GPRs, MSR, etc */
 +   pt_regs = task_pt_regs(t);
 +   cpu_hdr->pt_regs = *pt_regs;
 +
 +   /* FP state */
 +   memcpy(cpu_hdr->fpr, t->thread.fpr, sizeof(cpu_hdr->fpr));
 As noted above, is sizeof(cpu_hdr->fpr) the same on all chips ?
 
 It can differ depending on kernel configuration.

So the actual size needs to be explicitly indicated (and compared with).
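
Something along these lines is what I mean, as a userspace sketch rather than a proposal for the actual header. The record layout, FPR_WORDS, fpr_write() and fpr_read() are hypothetical: the writer stores the size of the FP state next to the payload, and the reader refuses the image unless that size matches what the running kernel expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FPR_WORDS 32	/* hypothetical: 64-bit words of FP state per image */

/* Hypothetical record: the payload size is stored next to the payload. */
struct fpr_record {
	uint32_t fpr_size;	/* bytes of FP state in the image */
	uint64_t fpr[FPR_WORDS];
};

/* Checkpoint side: record the size explicitly, then the data. */
static void fpr_write(struct fpr_record *rec, const uint64_t *fpr)
{
	rec->fpr_size = sizeof(rec->fpr);
	memcpy(rec->fpr, fpr, sizeof(rec->fpr));
}

/*
 * Restart side: 'fpr_bytes' is the size this kernel expects.
 * Returns -1 without copying anything if the sizes disagree.
 */
static int fpr_read(const struct fpr_record *rec, uint64_t *fpr,
		    size_t fpr_bytes)
{
	if (rec->fpr_size != fpr_bytes || fpr_bytes > sizeof(rec->fpr))
		return -1;	/* image built with a different FP layout */
	memcpy(fpr, rec->fpr, fpr_bytes);
	return 0;
}
```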

 
 
 +/* restart APIs */
 +
 The restart APIs belong in a separate file: arch/powerpc/mm/restart.c
 
 Explain why, please?  This isn't a lot of code, and it seems likely
 that checkpoint and restart paths will share data structures and tend
 to be modified together over time.

This one has little code, but usually that isn't the case, and many of
the shared data structures are exported anyway. Since the split makes
sense in other cases, it makes sense to follow convention.

Personally I don't have a strong opinion on this. However, some of the
early feedback on the existing patchset requested that I split the
functionality between files (and into separate commits).

In other words, if nobody else cries, I won't spoil it ;)

 
 
 +   pr_debug("%s: unexpected thread_hdr contents: 0x%lx\n",
 +__func__, (unsigned long)thread_hdr->unimplemented);
 Given the macro for 'pr_fmt' in include/linux/checkpoint.h, the use of
 __func__ is redundant.
 
 It seems to me that defining your own pr_fmt in a public header like
 that is inappropriate, or at least unconventional.  Any file that
 happens to include linux/checkpoint.h will have any prior definitions
 of pr_fmt overridden, no?
 

Hmmm.. didn't think of it this way. Using pr_debug() there was yet
another piece of feedback from LKML, and it seemed reasonable to me. Can
you think of a case where linux/checkpoint.h will happen to be included
in code that is not checkpoint-related ?

 
 +   regs = task_pt_regs(current);
 +   *regs = cpu_hdr->pt_regs;
 +
 +   regs->msr = sanitize_msr(regs->msr);
 +
 +   /* FP state */
 +   memcpy(current->thread.fpr, cpu_hdr->fpr, sizeof(current->thread.fpr));
 +   current->thread.fpscr.val = cpu_hdr->fpscr;
 +   current->thread.fpexc_mode = cpu_hdr->fpexc_mode;
 +
 +   /* debug registers */
 +   current->thread.dabr = cpu_hdr->dabr;
 I'm unfamiliar with powerpc; is it necessary to sanitize any of the registers
 here ?  For instance, can the user cause harm with specially crafted values
 of some registers ?
 
 I had this in mind with the treatment of MSR, but I'll check on the
 others, thanks.
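
One common way to sanitize a register restored from an untrusted image is to accept only the bits userspace is allowed to control and take every other bit from the task's current value. The helper below is a generic sketch of that pattern; sanitize_reg() and the mask passed to it are hypothetical, and it makes no claim about which MSR bits are actually user-modifiable on powerpc or about what sanitize_msr() in the patch does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Keep the bits in 'user_mask' from the checkpoint image and take all
 * other bits from the value the kernel currently holds for the task.
 * The caller chooses the mask; no real MSR layout is assumed here.
 */
static uint64_t sanitize_reg(uint64_t image, uint64_t current_val,
			     uint64_t user_mask)
{
	return (image & user_mask) | (current_val & ~user_mask);
}
```

With a mask of 0x00ff, an image value of 0xffff and a current value of 0x1200, the result is 0x12ff: the privileged high bits come from the kernel, the low bits from the image.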
 
 
 +int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int 
 rparent)
 +{
 +   struct cr_hdr_mm_context *mm_hdr