[Devel] Re: [PATCH] io-controller: Add io group reference handling for request

2009-05-18 Thread Vivek Goyal
On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
 On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
  On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
   On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
Vivek Goyal wrote:
...
  }
 @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
  /*
   * Find the io group bio belongs to.
   * If create is set, io group is created if it is not already 
 present.
 + * If curr is set, io group is information is searched for current
 + * task and not with the help of bio.
 + *
 + * FIXME: Can we assume that if bio is NULL then lookup group for 
 current
 + * task and not create extra function parameter ?
   *
 - * Note: There is a narrow window of race where a group is being 
 freed
 - * by cgroup deletion path and some rq has slipped through in this 
 group.
 - * Fix it.
   */
 -struct io_group *io_get_io_group_bio(struct request_queue *q, struct 
 bio *bio,
 - int create)
 +struct io_group *io_get_io_group(struct request_queue *q, struct bio 
 *bio,
 + int create, int curr)

  Hi Vivek,

  IIUC we can get rid of curr, and just determine iog from bio. If bio 
is not NULL,
  get iog from bio, otherwise get it from current task.
   
   Consider also that get_cgroup_from_bio() is much more slow than
   task_cgroup() and need to lock/unlock_page_cgroup() in
   get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
   
  
  True.
  
   BTW another optimization could be to use the blkio-cgroup functionality
   only for dirty pages and cut out some blkio_set_owner(). For all the
   other cases IO always occurs in the same context of the current task,
   and you can use task_cgroup().
   
  
  Yes, may be in some cases we can avoid setting page owner. I will get
  to it once I have got functionality going well. In the mean time if
  you have a patch for it, it will be great.
  
   However, this is true only for page cache pages, for IO generated by
   anonymous pages (swap) you still need the page tracking functionality
   both for reads and writes.
   
  
  Right now I am assuming that all the sync IO will belong to task
  submitting the bio hence use task_cgroup() for that. Only for async
  IO, I am trying to use page tracking functionality to determine the owner.
  Look at elv_bio_sync(bio).
  
  You seem to be saying that there are cases where even for sync IO, we
  can't use submitting task's context and need to rely on page tracking
  functionlity? In case of getting page (read) from swap, will it not happen
  in the context of process who will take a page fault and initiate the
  swap read?
 
 No, for example in read_swap_cache_async():
 
 @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, 
 gfp_t gfp_mask,
*/
   __set_page_locked(new_page);
   SetPageSwapBacked(new_page);
 + blkio_cgroup_set_owner(new_page, current-mm);
   err = add_to_swap_cache(new_page, entry, gfp_mask  GFP_KERNEL);
   if (likely(!err)) {
   /*
 
 This is a read, but the current task is not always the owner of this
 swap cache page, because it's a readahead operation.
 

But will this readahead be not initiated in the context of the task taking
the page fault?

handle_pte_fault()
do_swap_page()
swapin_readahead()
read_swap_cache_async()

If yes, then swap reads issued will still be in the context of process and
we should be fine?

 Anyway, this is a minor corner case I think. And probably it is safe to
 consider this like any other read IO and get rid of the
 blkio_cgroup_set_owner().

Agreed.

 
 I wonder if it would be better to attach the blkio_cgroup to the
 anonymous page only when swap-out occurs.

Swap seems to be an interesting case in general. Somebody raised this
question on lwn io controller article also. A user process never asked
for swap activity. It is something enforced by kernel. So while doing
some swap outs, it does not seem too fair to charge the write out to
the process page belongs to and the fact of the matter may be that there
is some other memory hungry application which is forcing these swap outs.

Keeping this in mind, should swap activity be considered as system
activity and be charged to root group instead of to user tasks in other
cgroups?
  
 I mean, just put the
 blkio_cgroup_set_owner() hook in try_to_umap() in order to keep track of
 the IO generated by direct reclaim of anon memory. For all the other
 cases we can simply use the submitting task's context.
 
 BTW, O_DIRECT is another case that is possible to optimize, because all
 the bios generated by direct IO occur in the same context of the current
 task.

Agreed about the direct IO 

[Devel] Re: [PATCH] io-controller: Add io group reference handling for request

2009-05-18 Thread Andrea Righi
On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
 On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
  On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
   On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
 Vivek Goyal wrote:
 ...
   }
  @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
   /*
* Find the io group bio belongs to.
* If create is set, io group is created if it is not already 
  present.
  + * If curr is set, io group is information is searched for 
  current
  + * task and not with the help of bio.
  + *
  + * FIXME: Can we assume that if bio is NULL then lookup group for 
  current
  + * task and not create extra function parameter ?
*
  - * Note: There is a narrow window of race where a group is being 
  freed
  - * by cgroup deletion path and some rq has slipped through in this 
  group.
  - * Fix it.
*/
  -struct io_group *io_get_io_group_bio(struct request_queue *q, 
  struct bio *bio,
  -   int create)
  +struct io_group *io_get_io_group(struct request_queue *q, struct 
  bio *bio,
  +   int create, int curr)
 
   Hi Vivek,
 
   IIUC we can get rid of curr, and just determine iog from bio. If 
 bio is not NULL,
   get iog from bio, otherwise get it from current task.

Consider also that get_cgroup_from_bio() is much more slow than
task_cgroup() and need to lock/unlock_page_cgroup() in
get_blkio_cgroup_id(), while task_cgroup() is rcu protected.

   
   True.
   
BTW another optimization could be to use the blkio-cgroup functionality
only for dirty pages and cut out some blkio_set_owner(). For all the
other cases IO always occurs in the same context of the current task,
and you can use task_cgroup().

   
   Yes, may be in some cases we can avoid setting page owner. I will get
   to it once I have got functionality going well. In the mean time if
   you have a patch for it, it will be great.
   
However, this is true only for page cache pages, for IO generated by
anonymous pages (swap) you still need the page tracking functionality
both for reads and writes.

   
   Right now I am assuming that all the sync IO will belong to task
   submitting the bio hence use task_cgroup() for that. Only for async
   IO, I am trying to use page tracking functionality to determine the owner.
   Look at elv_bio_sync(bio).
   
   You seem to be saying that there are cases where even for sync IO, we
   can't use submitting task's context and need to rely on page tracking
   functionlity? In case of getting page (read) from swap, will it not happen
   in the context of process who will take a page fault and initiate the
   swap read?
  
  No, for example in read_swap_cache_async():
  
  @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, 
  gfp_t gfp_mask,
   */
  __set_page_locked(new_page);
  SetPageSwapBacked(new_page);
  +   blkio_cgroup_set_owner(new_page, current-mm);
  err = add_to_swap_cache(new_page, entry, gfp_mask  GFP_KERNEL);
  if (likely(!err)) {
  /*
  
  This is a read, but the current task is not always the owner of this
  swap cache page, because it's a readahead operation.
  
 
 But will this readahead be not initiated in the context of the task taking
 the page fault?
 
 handle_pte_fault()
   do_swap_page()
   swapin_readahead()
   read_swap_cache_async()
 
 If yes, then swap reads issued will still be in the context of process and
 we should be fine?

Right. I was trying to say that the current task may swap-in also pages
belonging to a different task, so from a certain point of view it's not
so fair to charge the current task for the whole activity. But ok, I
think it's a minor issue.

 
  Anyway, this is a minor corner case I think. And probably it is safe to
  consider this like any other read IO and get rid of the
  blkio_cgroup_set_owner().
 
 Agreed.
 
  
  I wonder if it would be better to attach the blkio_cgroup to the
  anonymous page only when swap-out occurs.
 
 Swap seems to be an interesting case in general. Somebody raised this
 question on lwn io controller article also. A user process never asked
 for swap activity. It is something enforced by kernel. So while doing
 some swap outs, it does not seem too fair to charge the write out to
 the process page belongs to and the fact of the matter may be that there
 is some other memory hungry application which is forcing these swap outs.
 
 Keeping this in mind, should swap activity be considered as system
 activity and be charged to root group instead of to user tasks in other
 cgroups?

In 

[Devel] bugs with ckpt-v15-dev

2009-05-18 Thread Nathan Lynch
Last commit is ed3b275 allow error string during checkpoint while
holding a spinlock.

# bash -c 'exec - - 2- ; while : ; do : ; done' 
[1] 2269
# ckpt $!  /tmp/bash.ckpt

BUG: sleeping function called from invalid context at mm/slub.c:1595
in_atomic(): 1, irqs_disabled(): 0, pid: 2270, name: ckpt
1 lock held by ckpt/2270:
 #0:  (tasklist_lock){.+.+.+}, at: [c03911e6] tree_count_tasks+0x2a/0x2a2
Pid: 2270, comm: ckpt Not tainted 2.6.30-rc3-00074-ged3b275 #30
Call Trace:
 [c024b6f9] ? __debug_show_held_locks+0x1e/0x20
 [c02234da] __might_sleep+0x100/0x107
 [c02a9372] kmem_cache_alloc+0x35/0x11f
 [c039100f] ? __ckpt_generate_err+0x25/0x12b
 [c024a9c7] ? put_lock_stats+0x1e/0x29
 [c039100f] __ckpt_generate_err+0x25/0x12b
 [c0203703] ? ftrace_call+0x5/0x8
 [c03911ba] __ckpt_write_err+0x16/0x18
 [c03912ae] tree_count_tasks+0xf2/0x2a2
 [c03915ae] do_checkpoint+0x150/0x5f2
 [c0390cd8] ? kzalloc+0x10/0x12
 [c0390d0f] ? ckpt_obj_hash_alloc+0x35/0x60
 [c039033d] ? ckpt_ctx_alloc+0x77/0x99
 [c0390465] sys_checkpoint+0x6c/0x82
 [c0202ce5] syscall_call+0x7/0xb
[ cut here ]
kernel BUG at checkpoint/checkpoint.c:136!
invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/block/sda/size
Modules linked in:

Pid: 2270, comm: ckpt Not tainted (2.6.30-rc3-00074-ged3b275 #30) 
EIP: 0060:[c03910dc] EFLAGS: 00010246 CPU: 0
EIP is at __ckpt_generate_err+0xf2/0x12b
EAX: df051300 EBX: deb72f30 ECX: df051530 EDX: 001c
ESI: df051430 EDI: deb72f28 EBP: deb72f10 ESP: deb72ef8
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process ckpt (pid: 2270, ti=deb72000 task=df9adf60 task.ti=deb72000)
Stack:
 c072ce85 df051300 001c deb75600 df9ad1c0  deb72f18 c03911ba
 deb72f50 c03912ae df051300 c072ce85 08dd df9ad4ec df051300 df9ad1c0
    deb75600 deb75604 df051300 deb72f98 c03915ae
Call Trace:
 [c03911ba] ? __ckpt_write_err+0x16/0x18
 [c03912ae] ? tree_count_tasks+0xf2/0x2a2
 [c03915ae] ? do_checkpoint+0x150/0x5f2
 [c0390cd8] ? kzalloc+0x10/0x12
 [c0390d0f] ? ckpt_obj_hash_alloc+0x35/0x60
 [c039033d] ? ckpt_ctx_alloc+0x77/0x99
 [c0390465] ? sys_checkpoint+0x6c/0x82
 [c0202ce5] ? syscall_call+0x7/0xb
Code: 08 0c 8b c0 03 74 1b f6 05 c2 8f ff c0 20 74 12 f6 05 c9 8f ff c0 10 74 
09 80 3d 47 94 83 c0 00 75 1d 8b 45 ec 83 78 2c 00 75 04 0f 0b eb fe 8b 55 ec 
31 c0 89 72 2c 8d 65 f4 5b 5e 5f 5d c3 31 
EIP: [c03910dc] __ckpt_generate_err+0xf2/0x12b SS:ESP 0068:deb72ef8
---[ end trace d54433b47f0c4829 ]---
note: ckpt[2270] exited with preempt_count 1
BUG: scheduling while atomic: ckpt/2270/0x1002
INFO: lockdep is turned off.
Modules linked in:
Pid: 2270, comm: ckpt Tainted: G  D2.6.30-rc3-00074-ged3b275 #30
Call Trace:
 [c0223f6b] __schedule_bug+0x63/0x6a
 [c05ec7dc] __schedule+0x8f/0x7ac
 [c024d299] ? print_lock_contention_bug+0x14/0xd7
 [c0298093] ? unmap_vmas+0x1e1/0x518
 [c0203703] ? ftrace_call+0x5/0x8
 [c0203703] ? ftrace_call+0x5/0x8
 [c05ecf10] schedule+0x17/0x38
 [c0224738] __cond_resched+0x26/0x3b
 [c05ed034] _cond_resched+0x2c/0x37
 [c0298379] unmap_vmas+0x4c7/0x518
 [c029b81b] exit_mmap+0x6c/0xb7
 [c022906a] mmput+0x3c/0x8f
 [c022c8a0] exit_mm+0xe3/0xeb
 [c022e0e2] do_exit+0x188/0x64b
 [c05ec415] ? printk+0x14/0x16
 [c022b08d] ? oops_exit+0x28/0x2d
 [c05efbe7] oops_end+0x92/0x9a
 [c020560f] die+0x59/0x5f
 [c05ef56b] do_trap+0x89/0xa2
 [c02039fc] ? do_invalid_op+0x0/0x80
 [c0203a72] do_invalid_op+0x76/0x80
 [c03910dc] ? __ckpt_generate_err+0xf2/0x12b
 [c0203703] ? ftrace_call+0x5/0x8
 [c039c95d] ? strnlen+0x8/0x1f
 [c039b8bd] ? string+0x34/0x82
 [c039c14a] ? vsnprintf+0x173/0x311
 [c039c05a] ? vsnprintf+0x83/0x311
 [c039c9d0] ? trace_hardirqs_off_thunk+0xc/0x10
 [c05ef322] error_code+0x72/0x78
 [c02039fc] ? do_invalid_op+0x0/0x80
 [c03910dc] ? __ckpt_generate_err+0xf2/0x12b
 [c03911ba] __ckpt_write_err+0x16/0x18
 [c03912ae] tree_count_tasks+0xf2/0x2a2
 [c03915ae] do_checkpoint+0x150/0x5f2
 [c0390cd8] ? kzalloc+0x10/0x12
 [c0390d0f] ? ckpt_obj_hash_alloc+0x35/0x60
 [c039033d] ? ckpt_ctx_alloc+0x77/0x99
 [c0390465] sys_checkpoint+0x6c/0x82
 [c0202ce5] syscall_call+0x7/0xb
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] failure to restart bash with ckpt-v15-dev

2009-05-18 Thread Nathan Lynch
Last commit is ed3b275 allow error string during checkpoint while
holding a spinlock.

The failure seems to involve a vma corresponding to
/usr/lib/gconv/gconv-modules.cache which is mapped read-only and shared,
e.g.

open(/usr/lib/gconv/gconv-modules.cache, O_RDONLY) = 3
mmap2(NULL, 26048, PROT_READ, MAP_SHARED, 3, 0) = 0xb7f52000

I believe the second check in filemap_restore() is where the restart
goes awry; the object is of type CKPT_VMA_SHM_FILE but doesn't have
VM_SHARED set in the flags (it does have VM_MAY_SHARE set, however).


Testcase:

# bash -c 'exec - - 2- ; while : ; do : ; done' 
[1] 2274
# mkdir -p  /cgroup/foo  for k in cpus mems ; do cat /cgroup/cpuset.$k  
/cgroup/foo/cpuset.$k ; done
# echo $!  /cgroup/foo/tasks 
# echo FROZEN  /cgroup/foo/freezer.state 
# cat  /cgroup/foo/freezer.state 
FROZEN
# ckpt $!  /tmp/bash.ckpt
# rstr  /tmp/bash.ckpt 
Segmentation fault


Debug output (which, btw, I can't figure out how to enable with
ckpt_debug=, so I opted to hack a printk into _ckpt_debug):

[2279:c/r:may_checkpoint_task] check 2274
[2279:c/r:may_checkpoint_task] check 2274
[2279:c/r:ckpt_write_obj] type 1 len 48
[2279:c/r:ckpt_write_obj_type] type 3 len 73
[2279:c/r:ckpt_write_obj_type] type 3 len 73
[2279:c/r:ckpt_write_obj_type] type 3 len 73
[2279:c/r:ckpt_write_obj] type 2 len 16
[2279:c/r:ckpt_write_obj] type 101 len 16
[2279:c/r:checkpoint_pids] task[0]: vpid 2274 vtgid 2274 parent 2252
[2279:c/r:checkpoint_all_tasks] dumping task #0
[2279:c/r:ckpt_write_obj] type 102 len 32
[2279:c/r:ckpt_write_obj_type] type 4 len 24
[2279:c/r:checkpoint_task] ret 0
[2279:c/r:ckpt_write_obj] type 6 len 16
[2279:c/r:ckpt_obj_lookup_add] UTS_NS objref 2 first 1
[2279:c/r:ckpt_obj_lookup_add] IPC_NS objref 3 first 1
[2279:c/r:ckpt_write_obj] type 108 len 24
[2279:c/r:ckpt_write_obj] type 109 len 16
[2279:c/r:ckpt_write_obj_type] type 4 len 73
[2279:c/r:ckpt_write_obj_type] type 4 len 73
[2279:c/r:ckpt_write_obj] type 110 len 56
[2279:c/r:checkpoint_ipc_any] ipc-shm count 0
[2279:c/r:ckpt_write_obj] type 401 len 16
[2279:c/r:checkpoint_ipc_any] ipc-shm ret 0
[2279:c/r:checkpoint_ipc_any] ipc-msg count 0
[2279:c/r:ckpt_write_obj] type 401 len 16
[2279:c/r:checkpoint_ipc_any] ipc-msg ret 0
[2279:c/r:checkpoint_ipc_any] ipc-sem count 0
[2279:c/r:ckpt_write_obj] type 401 len 16
[2279:c/r:checkpoint_ipc_any] ipc-sem ret 0
[2279:c/r:checkpoint_task_ns] nsproxy: objref 1
[2279:c/r:ckpt_write_obj] type 103 len 16
[2279:c/r:ckpt_write_obj] type 6 len 16
[2279:c/r:ckpt_write_obj] type 6 len 16
[2279:c/r:ckpt_write_obj] type 303 len 40
[2279:c/r:ckpt_write_obj_type] type 5 len 18
[2279:c/r:ckpt_write_obj] type 201 len 104
[2279:c/r:do_checkpoint_mm] vma 0x2bd000-0x2dd000 flags 0x8000875
[2279:c/r:ckpt_write_obj] type 6 len 16
[2279:c/r:ckpt_write_obj] type 303 len 40
[2279:c/r:ckpt_write_obj_type] type 5 len 23
[2279:c/r:generic_vma_checkpoint] vma 0x2bd000-0x2dd000 flags 0x8000875 type 3
[2279:c/r:ckpt_write_obj] type 202 len 72
[2279:c/r:ckpt_write_obj] type 203 len 16
[2279:c/r:do_checkpoint_mm] vma 0x2de000-0x2df000 flags 0x8100871
[2279:c/r:generic_vma_checkpoint] vma 0x2de000-0x2df000 flags 0x8100871 type 3
[2279:c/r:ckpt_write_obj] type 202 len 72
[2279:c/r:vma_fill_pgarr] got page 0x2de000
[2279:c/r:checkpoint_memory_contents] collected 1 pages
[2279:c/r:ckpt_write_obj] type 203 len 16
[2279:c/r:pgarr_release_pages] total pages 1
[2279:c/r:ckpt_write_obj] type 203 len 16
[2279:c/r:do_checkpoint_mm] vma 0x2df000-0x2e flags 0x8100873
[2279:c/r:generic_vma_checkpoint] vma 0x2df000-0x2e flags 0x8100873 type 3
[2279:c/r:ckpt_write_obj] type 202 len 72
[2279:c/r:vma_fill_pgarr] got page 0x2df000
[2279:c/r:checkpoint_memory_contents] collected 1 pages
[2279:c/r:ckpt_write_obj] type 203 len 16
[2279:c/r:pgarr_release_pages] total pages 1
[2279:c/r:ckpt_write_obj] type 203 len 16
[2279:c/r:do_checkpoint_mm] vma 0x2e2000-0x45 flags 0x875
[2279:c/r:ckpt_write_obj] type 6 len 16
[2279:c/r:ckpt_write_obj] type 303 len 40
[2279:c/r:ckpt_write_obj_type] type 5 len 25
[2279:c/r:generic_vma_checkpoint] vma 0x2e2000-0x45 flags 0x875 type 3
[2279:c/r:ckpt_write_obj] type 202 len 72
[2279:c/r:ckpt_write_obj] type 203 len 16
[2279:c/r:do_checkpoint_mm] vma 0x45-0x452000 flags 0x8100071
[2279:c/r:generic_vma_checkpoint] vma 0x45-0x452000 flags 0x8100071 type 3
[2279:c/r:ckpt_write_obj] type 202 len 72
[2279:c/r:vma_fill_pgarr] got page 0x451000
[2279:c/r:checkpoint_memory_contents] collected 1 pages
[2279:c/r:ckpt_write_obj] type 203 len 16
[2279:c/r:pgarr_release_pages] total pages 1
[2279:c/r:ckpt_write_obj] type 203 len 16
[2279:c/r:do_checkpoint_mm] vma 0x452000-0x453000 flags 0x8100073
[2279:c/r:generic_vma_checkpoint] vma 0x452000-0x453000 flags 0x8100073 type 3
[2279:c/r:ckpt_write_obj] type 202 len 72
[2279:c/r:vma_fill_pgarr] got page 0x452000
[2279:c/r:checkpoint_memory_contents] collected 1 pages
[2279:c/r:ckpt_write_obj] type 203 len 16
[2279:c/r:pgarr_release_pages] total pages 1

[Devel] Re: bugs with ckpt-v15-dev

2009-05-18 Thread Serge E. Hallyn
Quoting Nathan Lynch (n...@pobox.com):
 Last commit is ed3b275 allow error string during checkpoint while
 holding a spinlock.
 
 # bash -c 'exec - - 2- ; while : ; do : ; done' 
 [1] 2269
 # ckpt $!  /tmp/bash.ckpt
 
 BUG: sleeping function called from invalid context at mm/slub.c:1595

Yeah, not only does ckpt_write_err() get called under task_lock, but
the fn returns without ver doing put_task_struct.  (I'd generate and
send the quick trivial patch, but my git tree is in a bit of a debugme
state right now)

Now mind you this shows that your ckpt program isn't sending
CHECKPOINT_SUBTREE with flags.  This in turns means you are probably
not using the ckpt-v15-dev version of user-cr, and if that is
the case it makes your problems with gconf shared file mapping more
suspect ask well...?

-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] cr test cases moved

2009-05-18 Thread Serge E. Hallyn
I was keeping a few automated testcases in a tarball under
sf.net/projects/lxc.  I've moved these to a git tree which you
can fetch using

git clone git://git.sr71.net/~hallyn/cr_tests.git

Two purposes for these:  first, to eventually hook some of them
into LTP.  Second, to try and let people easily and reliably
reproduce tests across arches.  For instance, it appears the
fileio test has problems on x86, while it works fine on s390.
The (not-yet-automated) sleep testcase shows that restart blocks
on s390 aren't yet working, but I've not yet tested on x86 to
make sure those work at all.

Suka, please let me know if the fileio test succeeds on your
x86 box.  If you could also try the sleep testcase, I'd be
very interested.

thanks,
-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: bugs with ckpt-v15-dev

2009-05-18 Thread Nathan Lynch
Serge E. Hallyn se...@us.ibm.com writes:

 Quoting Nathan Lynch (n...@pobox.com):
 Last commit is ed3b275 allow error string during checkpoint while
 holding a spinlock.
 
 # bash -c 'exec - - 2- ; while : ; do : ; done' 
 [1] 2269
 # ckpt $!  /tmp/bash.ckpt
 
 BUG: sleeping function called from invalid context at mm/slub.c:1595

 Yeah, not only does ckpt_write_err() get called under task_lock, but
 the fn returns without ver doing put_task_struct.  (I'd generate and
 send the quick trivial patch, but my git tree is in a bit of a debugme
 state right now)

Would prefer to just rip that thing out, it's cost me more trouble then
it's worth.


 Now mind you this shows that your ckpt program isn't sending
 CHECKPOINT_SUBTREE with flags.

I don't follow.  There is user error here in that I'm not freezing the
task before checkpointing[1], but my ckpt command is passing the subtree
flag (0x4) afaict:

SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540


 This in turns means you are probably
 not using the ckpt-v15-dev version of user-cr, and if that is
 the case it makes your problems with gconf shared file mapping more
 suspect ask well...?

After updating to the latest user-cr I get the same BUGs.

[1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
before checkpoint, is there any mechanism apart from
cgroup/freezer.state to do this?
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: failure to restart bash with ckpt-v15-dev

2009-05-18 Thread Nathan Lynch
Nathan Lynch n...@pobox.com writes:

 Last commit is ed3b275 allow error string during checkpoint while
 holding a spinlock.

 The failure seems to involve a vma corresponding to
 /usr/lib/gconv/gconv-modules.cache which is mapped read-only and shared,
 e.g.

 open(/usr/lib/gconv/gconv-modules.cache, O_RDONLY) = 3
 mmap2(NULL, 26048, PROT_READ, MAP_SHARED, 3, 0) = 0xb7f52000

 I believe the second check in filemap_restore() is where the restart
 goes awry; the object is of type CKPT_VMA_SHM_FILE but doesn't have
 VM_SHARED set in the flags (it does have VM_MAY_SHARE set, however).

As Serge suggested in other bug report, updated to latest user-cr and
got same result.

Also, I forgot to mention that if I modify the check in filemap_restore
to allow VM_MAY_SHARE, the restart operation appears to complete
successfully but the kernel gets an oops on an address in that vma when
the task runs.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling

2009-05-18 Thread IKEDA, Munehiro
Hi Gui,

Gui Jianfeng wrote:
 Hi Vivek,
 
 This patch enables per-cgroup per-device weight and ioprio_class handling.
 A new cgroup interface policy is introduced. You can make use of this 
 file to configure weight and ioprio_class for each device in a given cgroup.
 The original weight and ioprio_class files are still available. If you
 don't do special configuration for a particular device, weight and 
 ioprio_class are used as default values in this device.
 
 You can use the following format to play with the new interface.
 #echo DEV:weight:ioprio_class  /patch/to/cgroup/policy
 weight=0 means removing the policy for DEV.
 
 Examples:
 Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
 # echo /dev/hdb:300:2  io.policy
 # cat io.policy
 dev weight class
 /dev/hdb 300 2

Users can specify a device file of a partition for io.policy.
In this case, io_policy_node::dev_name is set as a name of the
partition device like /dev/sda2.

ex)
  # cd /mnt/cgroup
  # cat /dev/sda2:500:2  io.policy
  # echo io.policy
dev weight class
/dev/sda2 500 2

I believe io_policy_node::dev_name should be set a generic
device name like /dev/sda.
What do you think about it?

Signed-off-by: Munehiro Muuhh Ikeda m-ik...@ds.jp.nec.com
---
 block/elevator-fq.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 39fa2a1..5d3d55c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1631,11 +1631,12 @@ static struct io_policy_node *policy_search_node(const 
struct io_cgroup *iocg,
return NULL;
 }
 
-static int devname_to_devnum(const char *buf, dev_t *dev)
+static int devname_to_devnum(char *buf, dev_t *dev)
 {
struct block_device *bdev;
struct gendisk *disk;
int part;
+   char *c;
 
bdev = lookup_bdev(buf);
if (IS_ERR(bdev))
@@ -1645,6 +1646,10 @@ static int devname_to_devnum(const char *buf, dev_t *dev)
*dev = MKDEV(disk-major, disk-first_minor);
bdput(bdev);
 
+   c = strrchr(buf, '/');
+   if (c)
+   strcpy(c+1, disk-disk_name);
+
return 0;
 }
 
-- 
1.5.4.3


-- 
IKEDA, Munehiro
  NEC Corporation of America
m-ik...@ds.jp.nec.com

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: bugs with ckpt-v15-dev

2009-05-18 Thread Serge E. Hallyn
Quoting Nathan Lynch (n...@pobox.com):
 Serge E. Hallyn se...@us.ibm.com writes:
 
  Quoting Nathan Lynch (n...@pobox.com):
  Last commit is ed3b275 allow error string during checkpoint while
  holding a spinlock.
  
  # bash -c 'exec - - 2- ; while : ; do : ; done' 
  [1] 2269
  # ckpt $!  /tmp/bash.ckpt
  
  BUG: sleeping function called from invalid context at mm/slub.c:1595
 
  Yeah, not only does ckpt_write_err() get called under task_lock, but
  the fn returns without ver doing put_task_struct.  (I'd generate and
  send the quick trivial patch, but my git tree is in a bit of a debugme
  state right now)
 
 Would prefer to just rip that thing out, it's cost me more trouble then
 it's worth.

Which thing - CHECKPOINT_SUBTREE, freezer check, or ckpt_write_err?

  Now mind you this shows that your ckpt program isn't sending
  CHECKPOINT_SUBTREE with flags.
 
 I don't follow.  There is user error here in that I'm not freezing the
 task before checkpointing[1], but my ckpt command is passing the subtree
 flag (0x4) afaict:
 
 SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 
 2540

Oh, it's the freezer test in may_checkpoint_task you're getting the
error on?  (in my git tree I'd commented that one out temporarily so I
just assumed it was the subtree check in get_container :)

  This in turns means you are probably
  not using the ckpt-v15-dev version of user-cr, and if that is
  the case it makes your problems with gconf shared file mapping more
  suspect ask well...?
 
 After updating to the latest user-cr I get the same BUGs.
 
 [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
 CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
 before checkpoint, is there any mechanism apart from
 cgroup/freezer.state to do this?

A task can self-checkpoint without the freezer though.

-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: bugs with ckpt-v15-dev

2009-05-18 Thread Matt Helsley
On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
 Serge E. Hallyn se...@us.ibm.com writes:
 
  Quoting Nathan Lynch (n...@pobox.com):
  Last commit is ed3b275 allow error string during checkpoint while
  holding a spinlock.
  
  # bash -c 'exec - - 2- ; while : ; do : ; done' 
  [1] 2269
  # ckpt $!  /tmp/bash.ckpt
  
  BUG: sleeping function called from invalid context at mm/slub.c:1595
 
  Yeah, not only does ckpt_write_err() get called under task_lock, but
  the fn returns without ver doing put_task_struct.  (I'd generate and
  send the quick trivial patch, but my git tree is in a bit of a debugme
  state right now)
 
 Would prefer to just rip that thing out, it's cost me more trouble then
 it's worth.
 
 
  Now mind you this shows that your ckpt program isn't sending
  CHECKPOINT_SUBTREE with flags.
 
 I don't follow.  There is user error here in that I'm not freezing the
 task before checkpointing[1], but my ckpt command is passing the subtree
 flag (0x4) afaict:
 
 SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 
 2540
 
 
  This in turns means you are probably
  not using the ckpt-v15-dev version of user-cr, and if that is
  the case it makes your problems with gconf shared file mapping more
  suspect ask well...?
 
 After updating to the latest user-cr I get the same BUGs.
 
 [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
 CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
 before checkpoint, is there any mechanism apart from
 cgroup/freezer.state to do this?

Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze the tasks
-- they'd still be capable of responding to some signals (CONT, TERM..). Also 
they'd presumably be placed in the stopped state upon restart so a SIGCONT will
be needed. In the case of bash, at least, that will technically change what
happens upon restart. My guess is that in many cases it won't matter but there
are some where it will. 

The freezer documentation shows an example of what happens with bash
when attempting to use only STOP/CONT rather than the freezer. gdb might
also present interesting cases when just utilizing STOP/CONT signals..

Cheers,
-Matt Helsley
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: bugs with ckpt-v15-dev

2009-05-18 Thread Nathan Lynch
Matt Helsley matth...@us.ibm.com writes:

 On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
 
 [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
 CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
 before checkpoint, is there any mechanism apart from
 cgroup/freezer.state to do this?

 Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
 the tasks -- they'd still be capable of responding to some signals
 (CONT, TERM..). Also they'd presumably be placed in the stopped state
 upon restart so a SIGCONT will be needed. In the case of bash, at
 least, that will technically change what happens upon restart. My
 guess is that in many cases it won't matter but there are some where
 it will.

Hmm, I'm having trouble understanding your suggestion.  The current
checkpoint implementation requires non-self tasks to be frozen (p-flags
 PF_FROZEN), which is not equivalent to stopped state (task-state 
__TASK_STOPPED).  That is, it would refuse to checkpoint tasks in
stopped state.  See may_checkpoint_task().
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: bugs with ckpt-v15-dev

2009-05-18 Thread Matt Helsley
On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
 Matt Helsley matth...@us.ibm.com writes:
 
  On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
  
  [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
  CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
  before checkpoint, is there any mechanism apart from
  cgroup/freezer.state to do this?
 
  Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
  the tasks -- they'd still be capable of responding to some signals
  (CONT, TERM..). Also they'd presumably be placed in the stopped state
  upon restart so a SIGCONT will be needed. In the case of bash, at
  least, that will technically change what happens upon restart. My
  guess is that in many cases it won't matter but there are some where
  it will.
 
 Hmm, I'm having trouble understanding your suggestion.  The current
 checkpoint implementation requires non-self tasks to be frozen (p-flags
  PF_FROZEN), which is not equivalent to stopped state (task-state 
 __TASK_STOPPED).  That is, it would refuse to checkpoint tasks in
 stopped state.  See may_checkpoint_task().

Oops. You're right. That would require changing may_checkpoint_task() to include
__TASK_STOPPED -- not something we'd want in the final code. I had assumed
you wanted to try a different mechanism for debugging purposes.

Cheers,
-Matt Helsley
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 0/6] cr: credentials

2009-05-18 Thread Serge E. Hallyn
Here is my latest version of the task credentials c/r patchset.
The last patch isn't meant to go upstream - it just helped me
to straighten out the refcounting, to the point where nested
user namespace c/r now appears to be robust.

thanks,
-serge
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/6] cr: break out new_user_ns()

2009-05-18 Thread Serge E. Hallyn
Break out the core function which checks privilege and (if
allowed) creates a new user namespace, with the passed-in
creating user_struct.  Note that a user_namespace, unlike
other namespace pointers, is not stored in the nsproxy.
Rather it is purely a property of user_structs.

This will let us keep the task restore code simpler.

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 include/linux/user_namespace.h |8 ++
 kernel/user_namespace.c|   53 
 2 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index cc4f453..a2b82d5 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -20,6 +20,8 @@ extern struct user_namespace init_user_ns;
 
 #ifdef CONFIG_USER_NS
 
+struct user_namespace *new_user_ns(struct user_struct *creator,
+  struct user_struct **newroot);
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 {
if (ns)
@@ -38,6 +40,12 @@ static inline void put_user_ns(struct user_namespace *ns)
 
 #else
 
+static inline struct user_namespace *new_user_ns(struct user_struct *creator,
+  struct user_struct **newroot)
+{
+   return -EINVAL;
+}
+
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 {
return init_user_ns;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 076c7c8..e624b0f 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -11,15 +11,8 @@
 #include linux/user_namespace.h
 #include linux/cred.h
 
-/*
- * Create a new user namespace, deriving the creator from the user in the
- * passed credentials, and replacing that user with the new root user for the
- * new namespace.
- *
- * This is called by copy_creds(), which will finish setting the target task's
- * credentials.
- */
-int create_user_ns(struct cred *new)
+static struct user_namespace *_new_user_ns(struct user_struct *creator,
+  struct user_struct **newroot)
 {
struct user_namespace *ns;
struct user_struct *root_user;
@@ -27,7 +20,7 @@ int create_user_ns(struct cred *new)
 
ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
if (!ns)
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
 
kref_init(ns-kref);
 
@@ -38,12 +31,43 @@ int create_user_ns(struct cred *new)
root_user = alloc_uid(ns, 0);
if (!root_user) {
kfree(ns);
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
}
 
/* set the new root user in the credentials under preparation */
-   ns-creator = new-user;
-   new-user = root_user;
+   ns-creator = creator;
+
+   /* alloc_uid() incremented the userns refcount.  Just set it to 1 */
+   kref_set(ns-kref, 1);
+
+   *newroot = root_user;
+   return ns;
+}
+
+struct user_namespace *new_user_ns(struct user_struct *creator,
+  struct user_struct **newroot)
+{
+   if (!capable(CAP_SYS_ADMIN))
+   return ERR_PTR(-EPERM);
+   return _new_user_ns(creator, newroot);
+}
+
+/*
+ * Create a new user namespace, deriving the creator from the user in the
+ * passed credentials, and replacing that user with the new root user for the
+ * new namespace.
+ *
+ * This is called by copy_creds(), which will finish setting the target task's
+ * credentials.
+ */
+int create_user_ns(struct cred *new)
+{
+   struct user_namespace *ns;
+
+   ns = new_user_ns(new-user, new-user);
+   if (IS_ERR(ns))
+   return PTR_ERR(ns);
+
new-uid = new-euid = new-suid = new-fsuid = 0;
new-gid = new-egid = new-sgid = new-fsgid = 0;
put_group_info(new-group_info);
@@ -54,9 +78,6 @@ int create_user_ns(struct cred *new)
 #endif
/* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */
 
-   /* alloc_uid() incremented the userns refcount.  Just set it to 1 */
-   kref_set(ns-kref, 1);
-
return 0;
 }
 
-- 
1.6.1

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/6] cr: split core function out of some set*{u, g}id functions

2009-05-18 Thread Serge E. Hallyn
When restarting tasks, we want to be able to change xuid and
xgid in a struct cred, and do so with security checks.  Break
the core functionality of set{fs,res}{u,g}id into cred_setX
which performs the access checks based on current_cred(),
but performs the requested change on a passed-in cred.

This will allow us to securely construct struct creds based
on a checkpoint image, constrained by the caller's permissions,
and apply them to the caller at the end of sys_restart().

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 include/linux/cred.h |8 +++
 kernel/cred.c|  114 ++
 kernel/sys.c |  134 --
 3 files changed, 143 insertions(+), 113 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 3282ee4..bc5ffc2 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -20,6 +20,9 @@ struct user_struct;
 struct cred;
 struct inode;
 
+/* defined in sys.c, used in cred_setresuid */
+extern int set_user(struct cred *new);
+
 /*
  * COW Supplementary groups list
  */
@@ -343,4 +346,9 @@ do {\
*(_fsgid) = __cred-fsgid;  \
 } while(0)
 
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid);
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, gid_t sgid);
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid);
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid);
+
 #endif /* _LINUX_CRED_H */
diff --git a/kernel/cred.c b/kernel/cred.c
index 3a03918..a017399 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -589,3 +589,117 @@ int set_create_files_as(struct cred *new, struct inode 
*inode)
return security_kernel_create_files_as(new, inode);
 }
 EXPORT_SYMBOL(set_create_files_as);
+
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid)
+{
+   int retval;
+   const struct cred *old;
+
+   retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES);
+   if (retval)
+   return retval;
+   old = current_cred();
+
+   if (!capable(CAP_SETUID)) {
+   if (ruid != (uid_t) -1  ruid != old-uid 
+   ruid != old-euid   ruid != old-suid)
+   return -EPERM;
+   if (euid != (uid_t) -1  euid != old-uid 
+   euid != old-euid   euid != old-suid)
+   return -EPERM;
+   if (suid != (uid_t) -1  suid != old-uid 
+   suid != old-euid   suid != old-suid)
+   return -EPERM;
+   }
+
+   if (ruid != (uid_t) -1) {
+   new-uid = ruid;
+   if (ruid != old-uid) {
+   retval = set_user(new);
+   if (retval  0)
+   return retval;
+   }
+   }
+   if (euid != (uid_t) -1)
+   new-euid = euid;
+   if (suid != (uid_t) -1)
+   new-suid = suid;
+   new-fsuid = new-euid;
+
+   return security_task_fix_setuid(new, old, LSM_SETID_RES);
+}
+
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid,
+   gid_t sgid)
+{
+   const struct cred *old = current_cred();
+   int retval;
+
+   retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES);
+   if (retval)
+   return retval;
+
+   if (!capable(CAP_SETGID)) {
+   if (rgid != (gid_t) -1  rgid != old-gid 
+   rgid != old-egid   rgid != old-sgid)
+   return -EPERM;
+   if (egid != (gid_t) -1  egid != old-gid 
+   egid != old-egid   egid != old-sgid)
+   return -EPERM;
+   if (sgid != (gid_t) -1  sgid != old-gid 
+   sgid != old-egid   sgid != old-sgid)
+   return -EPERM;
+   }
+
+   if (rgid != (gid_t) -1)
+   new-gid = rgid;
+   if (egid != (gid_t) -1)
+   new-egid = egid;
+   if (sgid != (gid_t) -1)
+   new-sgid = sgid;
+   new-fsgid = new-egid;
+   return 0;
+}
+
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid)
+{
+   const struct cred *old;
+
+   old = current_cred();
+   *old_fsuid = old-fsuid;
+
+   if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS)  0)
+   return -EPERM;
+
+   if (uid == old-uid  || uid == old-euid  ||
+   uid == old-suid || uid == old-fsuid ||
+   capable(CAP_SETUID)) {
+   if (uid != *old_fsuid) {
+   new-fsuid = uid;
+   if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 
0)
+   return 0;
+   }
+   }
+   return -EPERM;
+}
+
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid)
+{
+   const struct cred *old;
+
+ 

[Devel] [PATCH 3/6] cr: capabilities: define checkpoint and restore fns

2009-05-18 Thread Serge E. Hallyn
An application checkpoint image will store capability sets
(and the bounding set) as __u64s.  Define checkpoint and
restart functions to translate between those and kernel_cap_t's.

Define a common function do_capset_tocred() which applies capability
set changes to a passed-in struct cred.

The restore function uses do_capset_tocred() to apply the restored
capabilities to the struct cred being crafted, subject to the
current task's (task executing sys_restart()) permissions.

TODO: one day we'll want to c/r the securebits as well.

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 include/linux/capability.h |5 ++
 kernel/capability.c|   94 +--
 2 files changed, 86 insertions(+), 13 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index c302110..572b5a0 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -536,6 +536,11 @@ extern const kernel_cap_t __cap_empty_set;
 extern const kernel_cap_t __cap_full_set;
 extern const kernel_cap_t __cap_init_eff_set;
 
+extern void checkpoint_save_cap(__u64 *dest, kernel_cap_t src);
+struct cred;
+extern int checkpoint_restore_cap(__u64 e, __u64 i, __u64 p, __u64 x,
+   struct cred *cred);
+
 /**
  * has_capability - Determine if a task has a superior capability available
  * @t: The task in question
diff --git a/kernel/capability.c b/kernel/capability.c
index 4e17041..d2c9bb3 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -217,6 +217,45 @@ SYSCALL_DEFINE2(capget, cap_user_header_t, header, 
cap_user_data_t, dataptr)
return ret;
 }
 
+static int do_capset_tocred(kernel_cap_t *effective, kernel_cap_t *inheritable,
+   kernel_cap_t *permitted, struct cred *new)
+{
+   int ret;
+
+   ret = security_capset(new, current_cred(),
+ effective, inheritable, permitted);
+   if (ret  0)
+   return ret;
+
+   /*
+* for checkpoint-restart, do we want to wait until end of restart?
+* not sure we care */
+   audit_log_capset(current-pid, new, current_cred());
+
+   return 0;
+}
+
+static int do_capset(kernel_cap_t *effective, kernel_cap_t *inheritable,
+   kernel_cap_t *permitted)
+{
+   struct cred *new;
+   int ret;
+
+   new = prepare_creds();
+   if (!new)
+   return -ENOMEM;
+
+   ret = do_capset_tocred(effective, inheritable, permitted, new);
+   if (ret  0)
+   goto error;
+
+   return commit_creds(new);
+
+error:
+   abort_creds(new);
+   return ret;
+}
+
 /**
  * sys_capset - set capabilities for a process or (*) a group of processes
  * @header: pointer to struct that contains capability version and
@@ -240,7 +279,6 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const 
cap_user_data_t, data)
struct __user_cap_data_struct kdata[_KERNEL_CAPABILITY_U32S];
unsigned i, tocopy;
kernel_cap_t inheritable, permitted, effective;
-   struct cred *new;
int ret;
pid_t pid;
 
@@ -271,22 +309,52 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const 
cap_user_data_t, data)
i++;
}
 
-   new = prepare_creds();
-   if (!new)
-   return -ENOMEM;
+   return do_capset(effective, inheritable, permitted);
 
-   ret = security_capset(new, current_cred(),
- effective, inheritable, permitted);
-   if (ret  0)
-   goto error;
+}
 
-   audit_log_capset(pid, new, current_cred());
 
-   return commit_creds(new);
+void checkpoint_save_cap(__u64 *dest, kernel_cap_t src)
+{
+   *dest = src.cap[0] | (src.cap[1]  sizeof(__u32));
+}
 
-error:
-   abort_creds(new);
-   return ret;
+static void do_capbset_drop(struct cred *cred, int cap)
+{
+   cap_lower(cred-cap_bset, cap);
+}
+
+int checkpoint_restore_cap(__u64 newe, __u64 newi, __u64 newp, __u64 newx,
+   struct cred *cred)
+{
+   kernel_cap_t effective, inheritable, permitted, bset;
+   int may_dropbcap = capable(CAP_SETPCAP);
+   int ret, i;
+
+   effective.cap[0] = newe;
+   effective.cap[1] = (newe  sizeof(__u32));
+   inheritable.cap[0] = newi;
+   inheritable.cap[1] = (newi  sizeof(__u32));
+   permitted.cap[0] = newp;
+   permitted.cap[1] = (newp  sizeof(__u32));
+   bset.cap[0] = newx;
+   bset.cap[1] = (newx  sizeof(__u32));
+
+   ret = do_capset_tocred(effective, inheritable, permitted, cred);
+   if (ret  0)
+   return ret;
+
+   for (i = 0; i  CAP_LAST_CAP; i++) {
+   if (cap_raised(bset, i))
+   continue;
+   if (!cap_raised(current_cred()-cap_bset, i))
+   continue;
+   if (!may_dropbcap)
+   return -EPERM;
+   do_capbset_drop(cred, i);
+   

[Devel] [PATCH 6/6] user namespaces: debug refcounts

2009-05-18 Thread Serge E. Hallyn
Create /proc/userns, which prints out all user namespaces.  It
prints the address of the user_ns itself, the uid and userns address
of the user who created it, and the reference count.

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 include/linux/user_namespace.h |2 +
 kernel/user.c  |1 +
 kernel/user_namespace.c|   84 
 3 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index a2b82d5..e64d602 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -14,8 +14,10 @@ struct user_namespace {
struct hlist_head   uidhash_table[UIDHASH_SZ];
struct user_struct  *creator;
struct work_struct  destroyer;
+   struct list_headlist;
 };
 
+extern spinlock_t usernslist_lock;
 extern struct user_namespace init_user_ns;
 
 #ifdef CONFIG_USER_NS
diff --git a/kernel/user.c b/kernel/user.c
index 850e0ba..6474422 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -23,6 +23,7 @@ struct user_namespace init_user_ns = {
.refcount   = ATOMIC_INIT(2),
},
.creator = root_user,
+   .list = LIST_HEAD_INIT(init_user_ns.list),
 };
 EXPORT_SYMBOL_GPL(init_user_ns);
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index e624b0f..8d7c725 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -10,6 +10,11 @@
 #include linux/slab.h
 #include linux/user_namespace.h
 #include linux/cred.h
+#include linux/proc_fs.h
+#include linux/seq_file.h
+#include linux/spinlock.h
+
+DEFINE_SPINLOCK(usernslist_lock);
 
 static struct user_namespace *_new_user_ns(struct user_struct *creator,
   struct user_struct **newroot)
@@ -40,6 +45,9 @@ static struct user_namespace *_new_user_ns(struct user_struct 
*creator,
/* alloc_uid() incremented the userns refcount.  Just set it to 1 */
kref_set(ns-kref, 1);
 
+   spin_lock(usernslist_lock);
+   list_add_tail(ns-list, init_user_ns.list);
+   spin_unlock(usernslist_lock);
*newroot = root_user;
return ns;
 }
@@ -90,6 +98,9 @@ static void free_user_ns_work(struct work_struct *work)
 {
struct user_namespace *ns =
container_of(work, struct user_namespace, destroyer);
+   spin_lock(usernslist_lock);
+   list_del(ns-list);
+   spin_unlock(usernslist_lock);
free_uid(ns-creator);
kfree(ns);
 }
@@ -103,3 +114,76 @@ void free_user_ns(struct kref *kref)
schedule_work(ns-destroyer);
 }
 EXPORT_SYMBOL(free_user_ns);
+
+#ifdef CONFIG_PROC_FS
+static int proc_userns_show(struct seq_file *m, void *v)
+{
+   struct user_namespace *ns = v;
+   seq_printf(m, userns %p creator (uid %d ns %p) count %d\n,
+   (void *)ns, ns-creator-uid, (void *) ns-creator-user_ns,
+   atomic_read(ns-kref.refcount));
+   return 0;
+}
+
+static void *proc_userns_start(struct seq_file *p, loff_t *_pos)
+{
+   loff_t pos = *_pos;
+   struct user_namespace *ns = init_user_ns;
+   spin_lock(usernslist_lock);
+   while (pos) {
+   pos--;
+   ns = list_entry(ns-list.next, struct user_namespace, list);
+   if (ns  == init_user_ns)
+   return NULL;
+   }
+   return ns;
+}
+
+static void *proc_userns_next(struct seq_file *p, void *v, loff_t *_pos)
+{
+   struct user_namespace *ns = v;
+   (*_pos)++;
+   ns = list_entry(ns-list.next, struct user_namespace, list);
+   if (ns == init_user_ns)
+   return NULL;
+   return ns;
+}
+
+static void proc_userns_stop(struct seq_file *p, void *v)
+{
+   spin_unlock(usernslist_lock);
+}
+
+static const struct seq_operations proc_userns_ops;
+
+static int proc_userns_open(struct inode *inode, struct file *filp)
+{
+   return seq_open(filp, proc_userns_ops);
+}
+
+static const struct seq_operations proc_userns_ops = {
+   .start  = proc_userns_start,
+   .next   = proc_userns_next,
+   .stop   = proc_userns_stop,
+   .show   = proc_userns_show,
+};
+
+const struct file_operations proc_userns_fops = {
+   .open   = proc_userns_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= seq_release,
+};
+
+static __init int user_ns_debug(void)
+{
+   struct proc_dir_entry *p;
+
+   p = proc_create(userns, 0, NULL, proc_userns_fops);
+   if (!p)
+   panic(cannot create /proc/userns\n);
+   return 0;
+}
+
+__initcall(user_ns_debug);
+#endif
-- 
1.6.1

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 4/6] cr: checkpoint and restore task credentials

2009-05-18 Thread Serge E. Hallyn
This patch adds the checkpointing and restart of credentials
(uids, gids, and capabilities) to Oren's c/r patchset (on top
of v14).  It goes to great pains to re-use (and define when
needed) common helpers, in order to make sure that as security
code is modified, the cr code will be updated.  Some of the
helpers should still be moved (i.e. _creds() functions should
be in kernel/cred.c).

When building the credentials for the restarted process, I
1. create a new struct cred as a copy of the running task's
cred (using prepare_cred())
2. always authorize any changes to the new struct cred
based on the permissions of current_cred() (not the current
transient state of the new cred).

While this may mean that certain transient_cred1-transient_cred2
states are allowed which otherwise wouldn't be allowed, the
fact remains that current_cred() is allowed to transition to
transient_cred2.

The reconstructed creds are applied to the task at the very
end of the sys_restart call.  This ensures that any objects which
need to be re-created (file, socket, etc) are re-created using
the creds of the task calling sys_restart - preventing an unpriv
user from creating a privileged object, and ensuring that a
root task can restart a process which had started out privileged,
created some privileged objects, then dropped its privilege.

With these patches, the root user can restart checkpoint images
(created by either hallyn or root) of user hallyn's tasks,
resulting in a program owned by hallyn.

Plenty of bugs to be found, no doubt.

Changelog:
May 18: fix more refcounting: if (userns 5, uid 0) had
no active tasks or child user_namespaces, then
it shouldn't exist at restart or it, its namespace,
and its whole chain of creators will be leaked.
May 14: fix some refcounting:
1. a new user_ns needs a ref to remain pinned
   by its root user
2. current_user_ns needs an extra ref bc objhash
   drops two on restart
3. cred needs a ref for the real credentials bc
   commit_creds eats one ref.
May 13: folded in fix to userns refcounting.

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 checkpoint/objhash.c |  119 ++-
 checkpoint/process.c |  457 +-
 include/linux/checkpoint.h   |   11 +
 include/linux/checkpoint_hdr.h   |   57 +
 include/linux/checkpoint_types.h |1 +
 5 files changed, 642 insertions(+), 3 deletions(-)

diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 5618fff..d2268c2 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -16,6 +16,7 @@
 #include linux/file.h
 #include linux/sched.h
 #include linux/ipc_namespace.h
+#include linux/user_namespace.h
 #include linux/checkpoint.h
 #include linux/checkpoint_hdr.h
 
@@ -155,6 +156,71 @@ static int obj_ipc_ns_users(void *ptr)
return atomic_read(((struct ipc_namespace *) ptr)-count);
 }
 
+static int obj_cred_grab(void *ptr)
+{
+   get_cred((struct cred *) ptr);
+   return 0;
+}
+
+static void obj_cred_drop(void *ptr)
+{
+   put_cred((struct cred *) ptr);
+}
+
+static int obj_cred_users(void *ptr)
+{
+   return atomic_read(((struct cred *) ptr)-usage);
+}
+
+static int obj_user_grab(void *ptr)
+{
+   struct user_struct *u = ptr;
+   (void) get_uid(u);
+   return 0;
+}
+
+static void obj_user_drop(void *ptr)
+{
+   free_uid((struct user_struct *) ptr);
+}
+
+static int obj_user_users(void *ptr)
+{
+   return atomic_read(((struct user_struct *) ptr)-__count);
+}
+
+static int obj_userns_grab(void *ptr)
+{
+   get_user_ns((struct user_namespace *) ptr);
+   return 0;
+}
+
+static void obj_userns_drop(void *ptr)
+{
+   put_user_ns((struct user_namespace *) ptr);
+}
+
+static int obj_user_ns_users(void *ptr)
+{
+   return atomic_read(((struct user_namespace *) ptr)-kref.refcount);
+}
+
+static int obj_groupinfo_grab(void *ptr)
+{
+   get_group_info((struct group_info *) ptr);
+   return 0;
+}
+
+static void obj_groupinfo_drop(void *ptr)
+{
+   put_group_info((struct group_info *) ptr);
+}
+
+static int obj_groupinfo_users(void *ptr)
+{
+   return atomic_read(((struct group_info *) ptr)-usage);
+}
+
 static struct ckpt_obj_ops ckpt_obj_ops[] = {
/* ignored object */
{
@@ -221,6 +287,46 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.checkpoint = checkpoint_bad,
.restore = restore_bad,
},
+   /* user_ns object */
+   {
+   .obj_name = USER_NS,
+   .obj_type = CKPT_OBJ_USER_NS,
+   .ref_drop = obj_userns_drop,
+   .ref_grab = obj_userns_grab,
+   .ref_users = obj_user_ns_users,
+   .checkpoint = checkpoint_userns,
+   .restore = restore_userns,
+   },
+   /* struct cred */
+   {
+

[Devel] [PATCH 5/6] cr: restore file-f_cred

2009-05-18 Thread Serge E. Hallyn
Ony seems useful if you're using coda or hppfs, but go ahead and
restore a file's f_cred.  This is set to the cred of the task doing
the open, so often it will be the same as that of the restarted task.

Signed-off-by: Serge E. Hallyn se...@us.ibm.com
---
 checkpoint/files.c |   16 ++--
 include/linux/checkpoint_hdr.h |1 +
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 6107361..edf81ce 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -151,7 +151,11 @@ int checkpoint_file_common(struct ckpt_ctx *ctx, struct 
file *file,
h-f_pos = file-f_pos;
h-f_version = file-f_version;
 
-   /* FIX: need also file-uid, file-gid, file-f_owner, etc */
+   h-f_credref = checkpoint_obj(ctx, file-f_cred, CKPT_OBJ_CRED);
+   if (h-f_credref  0)
+   return h-f_credref;
+
+   /* FIX: need also file-f_owner, etc */
 
return 0;
 }
@@ -359,8 +363,16 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file 
*file,
struct ckpt_hdr_file *h)
 {
int ret;
+   struct cred *cred;
+
+   /* FIX: need to restore owner etc */
 
-   /* FIX: need to restore uid, gid, owner etc */
+   /* restore the cred */
+   cred = ckpt_obj_fetch(ctx, h-f_credref, CKPT_OBJ_CRED);
+   if (IS_ERR(cred))
+   return PTR_ERR(cred);
+   put_cred(file-f_cred);
+   file-f_cred = get_cred(cred);
 
/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
ret = vfs_fcntl(0, F_SETFL, h-f_flags  CKPT_SETFL_MASK, file);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 98fc0f7..2693a72 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -341,6 +341,7 @@ struct ckpt_hdr_file {
__u32 _padding;
__u64 f_pos;
__u64 f_version;
+   __s32 f_credref;
 } __attribute__((aligned(8)));
 
 struct ckpt_hdr_file_generic {
-- 
1.6.1

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel