[Devel] Re: [PATCH] io-controller: Add io group reference handling for request
On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote: On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote: On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote: On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote: Vivek Goyal wrote: ... } @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru /* * Find the io group bio belongs to. * If create is set, io group is created if it is not already present. + * If curr is set, io group is information is searched for current + * task and not with the help of bio. + * + * FIXME: Can we assume that if bio is NULL then lookup group for current + * task and not create extra function parameter ? * - * Note: There is a narrow window of race where a group is being freed - * by cgroup deletion path and some rq has slipped through in this group. - * Fix it. */ -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio, - int create) +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio, + int create, int curr) Hi Vivek, IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL, get iog from bio, otherwise get it from current task. Consider also that get_cgroup_from_bio() is much more slow than task_cgroup() and need to lock/unlock_page_cgroup() in get_blkio_cgroup_id(), while task_cgroup() is rcu protected. True. BTW another optimization could be to use the blkio-cgroup functionality only for dirty pages and cut out some blkio_set_owner(). For all the other cases IO always occurs in the same context of the current task, and you can use task_cgroup(). Yes, may be in some cases we can avoid setting page owner. I will get to it once I have got functionality going well. In the mean time if you have a patch for it, it will be great. However, this is true only for page cache pages, for IO generated by anonymous pages (swap) you still need the page tracking functionality both for reads and writes. Right now I am assuming that all the sync IO will belong to task submitting the bio hence use task_cgroup() for that. Only for async IO, I am trying to use page tracking functionality to determine the owner. Look at elv_bio_sync(bio). You seem to be saying that there are cases where even for sync IO, we can't use submitting task's context and need to rely on page tracking functionlity? In case of getting page (read) from swap, will it not happen in the context of process who will take a page fault and initiate the swap read? No, for example in read_swap_cache_async(): @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, */ __set_page_locked(new_page); SetPageSwapBacked(new_page); + blkio_cgroup_set_owner(new_page, current-mm); err = add_to_swap_cache(new_page, entry, gfp_mask GFP_KERNEL); if (likely(!err)) { /* This is a read, but the current task is not always the owner of this swap cache page, because it's a readahead operation. But will this readahead be not initiated in the context of the task taking the page fault? handle_pte_fault() do_swap_page() swapin_readahead() read_swap_cache_async() If yes, then swap reads issued will still be in the context of process and we should be fine? Anyway, this is a minor corner case I think. And probably it is safe to consider this like any other read IO and get rid of the blkio_cgroup_set_owner(). Agreed. I wonder if it would be better to attach the blkio_cgroup to the anonymous page only when swap-out occurs. Swap seems to be an interesting case in general. Somebody raised this question on lwn io controller article also. A user process never asked for swap activity. It is something enforced by kernel. So while doing some swap outs, it does not seem too fair to charge the write out to the process page belongs to and the fact of the matter may be that there is some other memory hungry application which is forcing these swap outs. Keeping this in mind, should swap activity be considered as system activity and be charged to root group instead of to user tasks in other cgroups? I mean, just put the blkio_cgroup_set_owner() hook in try_to_umap() in order to keep track of the IO generated by direct reclaim of anon memory. For all the other cases we can simply use the submitting task's context. BTW, O_DIRECT is another case that is possible to optimize, because all the bios generated by direct IO occur in the same context of the current task. Agreed about the direct IO
[Devel] Re: [PATCH] io-controller: Add io group reference handling for request
On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote: On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote: On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote: On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote: On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote: Vivek Goyal wrote: ... } @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru /* * Find the io group bio belongs to. * If create is set, io group is created if it is not already present. + * If curr is set, io group is information is searched for current + * task and not with the help of bio. + * + * FIXME: Can we assume that if bio is NULL then lookup group for current + * task and not create extra function parameter ? * - * Note: There is a narrow window of race where a group is being freed - * by cgroup deletion path and some rq has slipped through in this group. - * Fix it. */ -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio, - int create) +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio, + int create, int curr) Hi Vivek, IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL, get iog from bio, otherwise get it from current task. Consider also that get_cgroup_from_bio() is much more slow than task_cgroup() and need to lock/unlock_page_cgroup() in get_blkio_cgroup_id(), while task_cgroup() is rcu protected. True. BTW another optimization could be to use the blkio-cgroup functionality only for dirty pages and cut out some blkio_set_owner(). For all the other cases IO always occurs in the same context of the current task, and you can use task_cgroup(). Yes, may be in some cases we can avoid setting page owner. I will get to it once I have got functionality going well. In the mean time if you have a patch for it, it will be great. However, this is true only for page cache pages, for IO generated by anonymous pages (swap) you still need the page tracking functionality both for reads and writes. Right now I am assuming that all the sync IO will belong to task submitting the bio hence use task_cgroup() for that. Only for async IO, I am trying to use page tracking functionality to determine the owner. Look at elv_bio_sync(bio). You seem to be saying that there are cases where even for sync IO, we can't use submitting task's context and need to rely on page tracking functionlity? In case of getting page (read) from swap, will it not happen in the context of process who will take a page fault and initiate the swap read? No, for example in read_swap_cache_async(): @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, */ __set_page_locked(new_page); SetPageSwapBacked(new_page); + blkio_cgroup_set_owner(new_page, current-mm); err = add_to_swap_cache(new_page, entry, gfp_mask GFP_KERNEL); if (likely(!err)) { /* This is a read, but the current task is not always the owner of this swap cache page, because it's a readahead operation. But will this readahead be not initiated in the context of the task taking the page fault? handle_pte_fault() do_swap_page() swapin_readahead() read_swap_cache_async() If yes, then swap reads issued will still be in the context of process and we should be fine? Right. I was trying to say that the current task may swap-in also pages belonging to a different task, so from a certain point of view it's not so fair to charge the current task for the whole activity. But ok, I think it's a minor issue. Anyway, this is a minor corner case I think. And probably it is safe to consider this like any other read IO and get rid of the blkio_cgroup_set_owner(). Agreed. I wonder if it would be better to attach the blkio_cgroup to the anonymous page only when swap-out occurs. Swap seems to be an interesting case in general. Somebody raised this question on lwn io controller article also. A user process never asked for swap activity. It is something enforced by kernel. So while doing some swap outs, it does not seem too fair to charge the write out to the process page belongs to and the fact of the matter may be that there is some other memory hungry application which is forcing these swap outs. Keeping this in mind, should swap activity be considered as system activity and be charged to root group instead of to user tasks in other cgroups? In
[Devel] bugs with ckpt-v15-dev
Last commit is ed3b275 allow error string during checkpoint while holding a spinlock. # bash -c 'exec - - 2- ; while : ; do : ; done' [1] 2269 # ckpt $! /tmp/bash.ckpt BUG: sleeping function called from invalid context at mm/slub.c:1595 in_atomic(): 1, irqs_disabled(): 0, pid: 2270, name: ckpt 1 lock held by ckpt/2270: #0: (tasklist_lock){.+.+.+}, at: [c03911e6] tree_count_tasks+0x2a/0x2a2 Pid: 2270, comm: ckpt Not tainted 2.6.30-rc3-00074-ged3b275 #30 Call Trace: [c024b6f9] ? __debug_show_held_locks+0x1e/0x20 [c02234da] __might_sleep+0x100/0x107 [c02a9372] kmem_cache_alloc+0x35/0x11f [c039100f] ? __ckpt_generate_err+0x25/0x12b [c024a9c7] ? put_lock_stats+0x1e/0x29 [c039100f] __ckpt_generate_err+0x25/0x12b [c0203703] ? ftrace_call+0x5/0x8 [c03911ba] __ckpt_write_err+0x16/0x18 [c03912ae] tree_count_tasks+0xf2/0x2a2 [c03915ae] do_checkpoint+0x150/0x5f2 [c0390cd8] ? kzalloc+0x10/0x12 [c0390d0f] ? ckpt_obj_hash_alloc+0x35/0x60 [c039033d] ? ckpt_ctx_alloc+0x77/0x99 [c0390465] sys_checkpoint+0x6c/0x82 [c0202ce5] syscall_call+0x7/0xb [ cut here ] kernel BUG at checkpoint/checkpoint.c:136! invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC last sysfs file: /sys/block/sda/size Modules linked in: Pid: 2270, comm: ckpt Not tainted (2.6.30-rc3-00074-ged3b275 #30) EIP: 0060:[c03910dc] EFLAGS: 00010246 CPU: 0 EIP is at __ckpt_generate_err+0xf2/0x12b EAX: df051300 EBX: deb72f30 ECX: df051530 EDX: 001c ESI: df051430 EDI: deb72f28 EBP: deb72f10 ESP: deb72ef8 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process ckpt (pid: 2270, ti=deb72000 task=df9adf60 task.ti=deb72000) Stack: c072ce85 df051300 001c deb75600 df9ad1c0 deb72f18 c03911ba deb72f50 c03912ae df051300 c072ce85 08dd df9ad4ec df051300 df9ad1c0 deb75600 deb75604 df051300 deb72f98 c03915ae Call Trace: [c03911ba] ? __ckpt_write_err+0x16/0x18 [c03912ae] ? tree_count_tasks+0xf2/0x2a2 [c03915ae] ? do_checkpoint+0x150/0x5f2 [c0390cd8] ? kzalloc+0x10/0x12 [c0390d0f] ? ckpt_obj_hash_alloc+0x35/0x60 [c039033d] ? ckpt_ctx_alloc+0x77/0x99 [c0390465] ? sys_checkpoint+0x6c/0x82 [c0202ce5] ? syscall_call+0x7/0xb Code: 08 0c 8b c0 03 74 1b f6 05 c2 8f ff c0 20 74 12 f6 05 c9 8f ff c0 10 74 09 80 3d 47 94 83 c0 00 75 1d 8b 45 ec 83 78 2c 00 75 04 0f 0b eb fe 8b 55 ec 31 c0 89 72 2c 8d 65 f4 5b 5e 5f 5d c3 31 EIP: [c03910dc] __ckpt_generate_err+0xf2/0x12b SS:ESP 0068:deb72ef8 ---[ end trace d54433b47f0c4829 ]--- note: ckpt[2270] exited with preempt_count 1 BUG: scheduling while atomic: ckpt/2270/0x1002 INFO: lockdep is turned off. Modules linked in: Pid: 2270, comm: ckpt Tainted: G D2.6.30-rc3-00074-ged3b275 #30 Call Trace: [c0223f6b] __schedule_bug+0x63/0x6a [c05ec7dc] __schedule+0x8f/0x7ac [c024d299] ? print_lock_contention_bug+0x14/0xd7 [c0298093] ? unmap_vmas+0x1e1/0x518 [c0203703] ? ftrace_call+0x5/0x8 [c0203703] ? ftrace_call+0x5/0x8 [c05ecf10] schedule+0x17/0x38 [c0224738] __cond_resched+0x26/0x3b [c05ed034] _cond_resched+0x2c/0x37 [c0298379] unmap_vmas+0x4c7/0x518 [c029b81b] exit_mmap+0x6c/0xb7 [c022906a] mmput+0x3c/0x8f [c022c8a0] exit_mm+0xe3/0xeb [c022e0e2] do_exit+0x188/0x64b [c05ec415] ? printk+0x14/0x16 [c022b08d] ? oops_exit+0x28/0x2d [c05efbe7] oops_end+0x92/0x9a [c020560f] die+0x59/0x5f [c05ef56b] do_trap+0x89/0xa2 [c02039fc] ? do_invalid_op+0x0/0x80 [c0203a72] do_invalid_op+0x76/0x80 [c03910dc] ? __ckpt_generate_err+0xf2/0x12b [c0203703] ? ftrace_call+0x5/0x8 [c039c95d] ? strnlen+0x8/0x1f [c039b8bd] ? string+0x34/0x82 [c039c14a] ? vsnprintf+0x173/0x311 [c039c05a] ? vsnprintf+0x83/0x311 [c039c9d0] ? trace_hardirqs_off_thunk+0xc/0x10 [c05ef322] error_code+0x72/0x78 [c02039fc] ? do_invalid_op+0x0/0x80 [c03910dc] ? __ckpt_generate_err+0xf2/0x12b [c03911ba] __ckpt_write_err+0x16/0x18 [c03912ae] tree_count_tasks+0xf2/0x2a2 [c03915ae] do_checkpoint+0x150/0x5f2 [c0390cd8] ? kzalloc+0x10/0x12 [c0390d0f] ? ckpt_obj_hash_alloc+0x35/0x60 [c039033d] ? ckpt_ctx_alloc+0x77/0x99 [c0390465] sys_checkpoint+0x6c/0x82 [c0202ce5] syscall_call+0x7/0xb ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] failure to restart bash with ckpt-v15-dev
Last commit is ed3b275 allow error string during checkpoint while holding a spinlock. The failure seems to involve a vma corresponding to /usr/lib/gconv/gconv-modules.cache which is mapped read-only and shared, e.g. open(/usr/lib/gconv/gconv-modules.cache, O_RDONLY) = 3 mmap2(NULL, 26048, PROT_READ, MAP_SHARED, 3, 0) = 0xb7f52000 I believe the second check in filemap_restore() is where the restart goes awry; the object is of type CKPT_VMA_SHM_FILE but doesn't have VM_SHARED set in the flags (it does have VM_MAY_SHARE set, however). Testcase: # bash -c 'exec - - 2- ; while : ; do : ; done' [1] 2274 # mkdir -p /cgroup/foo for k in cpus mems ; do cat /cgroup/cpuset.$k /cgroup/foo/cpuset.$k ; done # echo $! /cgroup/foo/tasks # echo FROZEN /cgroup/foo/freezer.state # cat /cgroup/foo/freezer.state FROZEN # ckpt $! /tmp/bash.ckpt # rstr /tmp/bash.ckpt Segmentation fault Debug output (which, btw, I can't figure out how to enable with ckpt_debug=, so I opted to hack a printk into _ckpt_debug): [2279:c/r:may_checkpoint_task] check 2274 [2279:c/r:may_checkpoint_task] check 2274 [2279:c/r:ckpt_write_obj] type 1 len 48 [2279:c/r:ckpt_write_obj_type] type 3 len 73 [2279:c/r:ckpt_write_obj_type] type 3 len 73 [2279:c/r:ckpt_write_obj_type] type 3 len 73 [2279:c/r:ckpt_write_obj] type 2 len 16 [2279:c/r:ckpt_write_obj] type 101 len 16 [2279:c/r:checkpoint_pids] task[0]: vpid 2274 vtgid 2274 parent 2252 [2279:c/r:checkpoint_all_tasks] dumping task #0 [2279:c/r:ckpt_write_obj] type 102 len 32 [2279:c/r:ckpt_write_obj_type] type 4 len 24 [2279:c/r:checkpoint_task] ret 0 [2279:c/r:ckpt_write_obj] type 6 len 16 [2279:c/r:ckpt_obj_lookup_add] UTS_NS objref 2 first 1 [2279:c/r:ckpt_obj_lookup_add] IPC_NS objref 3 first 1 [2279:c/r:ckpt_write_obj] type 108 len 24 [2279:c/r:ckpt_write_obj] type 109 len 16 [2279:c/r:ckpt_write_obj_type] type 4 len 73 [2279:c/r:ckpt_write_obj_type] type 4 len 73 [2279:c/r:ckpt_write_obj] type 110 len 56 [2279:c/r:checkpoint_ipc_any] ipc-shm count 0 [2279:c/r:ckpt_write_obj] type 401 len 16 [2279:c/r:checkpoint_ipc_any] ipc-shm ret 0 [2279:c/r:checkpoint_ipc_any] ipc-msg count 0 [2279:c/r:ckpt_write_obj] type 401 len 16 [2279:c/r:checkpoint_ipc_any] ipc-msg ret 0 [2279:c/r:checkpoint_ipc_any] ipc-sem count 0 [2279:c/r:ckpt_write_obj] type 401 len 16 [2279:c/r:checkpoint_ipc_any] ipc-sem ret 0 [2279:c/r:checkpoint_task_ns] nsproxy: objref 1 [2279:c/r:ckpt_write_obj] type 103 len 16 [2279:c/r:ckpt_write_obj] type 6 len 16 [2279:c/r:ckpt_write_obj] type 6 len 16 [2279:c/r:ckpt_write_obj] type 303 len 40 [2279:c/r:ckpt_write_obj_type] type 5 len 18 [2279:c/r:ckpt_write_obj] type 201 len 104 [2279:c/r:do_checkpoint_mm] vma 0x2bd000-0x2dd000 flags 0x8000875 [2279:c/r:ckpt_write_obj] type 6 len 16 [2279:c/r:ckpt_write_obj] type 303 len 40 [2279:c/r:ckpt_write_obj_type] type 5 len 23 [2279:c/r:generic_vma_checkpoint] vma 0x2bd000-0x2dd000 flags 0x8000875 type 3 [2279:c/r:ckpt_write_obj] type 202 len 72 [2279:c/r:ckpt_write_obj] type 203 len 16 [2279:c/r:do_checkpoint_mm] vma 0x2de000-0x2df000 flags 0x8100871 [2279:c/r:generic_vma_checkpoint] vma 0x2de000-0x2df000 flags 0x8100871 type 3 [2279:c/r:ckpt_write_obj] type 202 len 72 [2279:c/r:vma_fill_pgarr] got page 0x2de000 [2279:c/r:checkpoint_memory_contents] collected 1 pages [2279:c/r:ckpt_write_obj] type 203 len 16 [2279:c/r:pgarr_release_pages] total pages 1 [2279:c/r:ckpt_write_obj] type 203 len 16 [2279:c/r:do_checkpoint_mm] vma 0x2df000-0x2e flags 0x8100873 [2279:c/r:generic_vma_checkpoint] vma 0x2df000-0x2e flags 0x8100873 type 3 [2279:c/r:ckpt_write_obj] type 202 len 72 [2279:c/r:vma_fill_pgarr] got page 0x2df000 [2279:c/r:checkpoint_memory_contents] collected 1 pages [2279:c/r:ckpt_write_obj] type 203 len 16 [2279:c/r:pgarr_release_pages] total pages 1 [2279:c/r:ckpt_write_obj] type 203 len 16 [2279:c/r:do_checkpoint_mm] vma 0x2e2000-0x45 flags 0x875 [2279:c/r:ckpt_write_obj] type 6 len 16 [2279:c/r:ckpt_write_obj] type 303 len 40 [2279:c/r:ckpt_write_obj_type] type 5 len 25 [2279:c/r:generic_vma_checkpoint] vma 0x2e2000-0x45 flags 0x875 type 3 [2279:c/r:ckpt_write_obj] type 202 len 72 [2279:c/r:ckpt_write_obj] type 203 len 16 [2279:c/r:do_checkpoint_mm] vma 0x45-0x452000 flags 0x8100071 [2279:c/r:generic_vma_checkpoint] vma 0x45-0x452000 flags 0x8100071 type 3 [2279:c/r:ckpt_write_obj] type 202 len 72 [2279:c/r:vma_fill_pgarr] got page 0x451000 [2279:c/r:checkpoint_memory_contents] collected 1 pages [2279:c/r:ckpt_write_obj] type 203 len 16 [2279:c/r:pgarr_release_pages] total pages 1 [2279:c/r:ckpt_write_obj] type 203 len 16 [2279:c/r:do_checkpoint_mm] vma 0x452000-0x453000 flags 0x8100073 [2279:c/r:generic_vma_checkpoint] vma 0x452000-0x453000 flags 0x8100073 type 3 [2279:c/r:ckpt_write_obj] type 202 len 72 [2279:c/r:vma_fill_pgarr] got page 0x452000 [2279:c/r:checkpoint_memory_contents] collected 1 pages [2279:c/r:ckpt_write_obj] type 203 len 16 [2279:c/r:pgarr_release_pages] total pages 1
[Devel] Re: bugs with ckpt-v15-dev
Quoting Nathan Lynch (n...@pobox.com): Last commit is ed3b275 allow error string during checkpoint while holding a spinlock. # bash -c 'exec - - 2- ; while : ; do : ; done' [1] 2269 # ckpt $! /tmp/bash.ckpt BUG: sleeping function called from invalid context at mm/slub.c:1595 Yeah, not only does ckpt_write_err() get called under task_lock, but the fn returns without ver doing put_task_struct. (I'd generate and send the quick trivial patch, but my git tree is in a bit of a debugme state right now) Now mind you this shows that your ckpt program isn't sending CHECKPOINT_SUBTREE with flags. This in turns means you are probably not using the ckpt-v15-dev version of user-cr, and if that is the case it makes your problems with gconf shared file mapping more suspect ask well...? -serge ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] cr test cases moved
I was keeping a few automated testcases in a tarball under sf.net/projects/lxc. I've moved these to a git tree which you can fetch using git clone git://git.sr71.net/~hallyn/cr_tests.git Two purposes for these: first, to eventually hook some of them into LTP. Second, to try and let people easily and reliably reproduce tests across arches. For instance, it appears the fileio test has problems on x86, while it works fine on s390. The (not-yet-automated) sleep testcase shows that restart blocks on s390 aren't yet working, but I've not yet tested on x86 to make sure those work at all. Suka, please let me know if the fileio test succeeds on your x86 box. If you could also try the sleep testcase, I'd be very interested. thanks, -serge ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: bugs with ckpt-v15-dev
Serge E. Hallyn se...@us.ibm.com writes: Quoting Nathan Lynch (n...@pobox.com): Last commit is ed3b275 allow error string during checkpoint while holding a spinlock. # bash -c 'exec - - 2- ; while : ; do : ; done' [1] 2269 # ckpt $! /tmp/bash.ckpt BUG: sleeping function called from invalid context at mm/slub.c:1595 Yeah, not only does ckpt_write_err() get called under task_lock, but the fn returns without ver doing put_task_struct. (I'd generate and send the quick trivial patch, but my git tree is in a bit of a debugme state right now) Would prefer to just rip that thing out, it's cost me more trouble then it's worth. Now mind you this shows that your ckpt program isn't sending CHECKPOINT_SUBTREE with flags. I don't follow. There is user error here in that I'm not freezing the task before checkpointing[1], but my ckpt command is passing the subtree flag (0x4) afaict: SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540 This in turns means you are probably not using the ckpt-v15-dev version of user-cr, and if that is the case it makes your problems with gconf shared file mapping more suspect ask well...? After updating to the latest user-cr I get the same BUGs. [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state before checkpoint, is there any mechanism apart from cgroup/freezer.state to do this? ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: failure to restart bash with ckpt-v15-dev
Nathan Lynch n...@pobox.com writes: Last commit is ed3b275 allow error string during checkpoint while holding a spinlock. The failure seems to involve a vma corresponding to /usr/lib/gconv/gconv-modules.cache which is mapped read-only and shared, e.g. open(/usr/lib/gconv/gconv-modules.cache, O_RDONLY) = 3 mmap2(NULL, 26048, PROT_READ, MAP_SHARED, 3, 0) = 0xb7f52000 I believe the second check in filemap_restore() is where the restart goes awry; the object is of type CKPT_VMA_SHM_FILE but doesn't have VM_SHARED set in the flags (it does have VM_MAY_SHARE set, however). As Serge suggested in other bug report, updated to latest user-cr and got same result. Also, I forgot to mention that if I modify the check in filemap_restore to allow VM_MAY_SHARE, the restart operation appears to complete successfully but the kernel gets an oops on an address in that vma when the task runs. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
Hi Gui, Gui Jianfeng wrote: Hi Vivek, This patch enables per-cgroup per-device weight and ioprio_class handling. A new cgroup interface policy is introduced. You can make use of this file to configure weight and ioprio_class for each device in a given cgroup. The original weight and ioprio_class files are still available. If you don't do special configuration for a particular device, weight and ioprio_class are used as default values in this device. You can use the following format to play with the new interface. #echo DEV:weight:ioprio_class /patch/to/cgroup/policy weight=0 means removing the policy for DEV. Examples: Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup # echo /dev/hdb:300:2 io.policy # cat io.policy dev weight class /dev/hdb 300 2 Users can specify a device file of a partition for io.policy. In this case, io_policy_node::dev_name is set as a name of the partition device like /dev/sda2. ex) # cd /mnt/cgroup # cat /dev/sda2:500:2 io.policy # echo io.policy dev weight class /dev/sda2 500 2 I believe io_policy_node::dev_name should be set a generic device name like /dev/sda. What do you think about it? Signed-off-by: Munehiro Muuhh Ikeda m-ik...@ds.jp.nec.com --- block/elevator-fq.c |7 ++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/block/elevator-fq.c b/block/elevator-fq.c index 39fa2a1..5d3d55c 100644 --- a/block/elevator-fq.c +++ b/block/elevator-fq.c @@ -1631,11 +1631,12 @@ static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg, return NULL; } -static int devname_to_devnum(const char *buf, dev_t *dev) +static int devname_to_devnum(char *buf, dev_t *dev) { struct block_device *bdev; struct gendisk *disk; int part; + char *c; bdev = lookup_bdev(buf); if (IS_ERR(bdev)) @@ -1645,6 +1646,10 @@ static int devname_to_devnum(const char *buf, dev_t *dev) *dev = MKDEV(disk-major, disk-first_minor); bdput(bdev); + c = strrchr(buf, '/'); + if (c) + strcpy(c+1, disk-disk_name); + return 0; } -- 1.5.4.3 -- IKEDA, Munehiro NEC Corporation of America m-ik...@ds.jp.nec.com ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: bugs with ckpt-v15-dev
Quoting Nathan Lynch (n...@pobox.com): Serge E. Hallyn se...@us.ibm.com writes: Quoting Nathan Lynch (n...@pobox.com): Last commit is ed3b275 allow error string during checkpoint while holding a spinlock. # bash -c 'exec - - 2- ; while : ; do : ; done' [1] 2269 # ckpt $! /tmp/bash.ckpt BUG: sleeping function called from invalid context at mm/slub.c:1595 Yeah, not only does ckpt_write_err() get called under task_lock, but the fn returns without ver doing put_task_struct. (I'd generate and send the quick trivial patch, but my git tree is in a bit of a debugme state right now) Would prefer to just rip that thing out, it's cost me more trouble then it's worth. Which thing - CHECKPOINT_SUBTREE, freezer check, or ckpt_write_err? Now mind you this shows that your ckpt program isn't sending CHECKPOINT_SUBTREE with flags. I don't follow. There is user error here in that I'm not freezing the task before checkpointing[1], but my ckpt command is passing the subtree flag (0x4) afaict: SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540 Oh, it's the freezer test in may_checkpoint_task you're getting the error on? (in my git tree I'd commented that one out temporarily so I just assumed it was the subtree check in get_container :) This in turns means you are probably not using the ckpt-v15-dev version of user-cr, and if that is the case it makes your problems with gconf shared file mapping more suspect ask well...? After updating to the latest user-cr I get the same BUGs. [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state before checkpoint, is there any mechanism apart from cgroup/freezer.state to do this? A task can self-checkpoint without the freezer though. -serge ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: bugs with ckpt-v15-dev
On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote: Serge E. Hallyn se...@us.ibm.com writes: Quoting Nathan Lynch (n...@pobox.com): Last commit is ed3b275 allow error string during checkpoint while holding a spinlock. # bash -c 'exec - - 2- ; while : ; do : ; done' [1] 2269 # ckpt $! /tmp/bash.ckpt BUG: sleeping function called from invalid context at mm/slub.c:1595 Yeah, not only does ckpt_write_err() get called under task_lock, but the fn returns without ver doing put_task_struct. (I'd generate and send the quick trivial patch, but my git tree is in a bit of a debugme state right now) Would prefer to just rip that thing out, it's cost me more trouble then it's worth. Now mind you this shows that your ckpt program isn't sending CHECKPOINT_SUBTREE with flags. I don't follow. There is user error here in that I'm not freezing the task before checkpointing[1], but my ckpt command is passing the subtree flag (0x4) afaict: SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540 This in turns means you are probably not using the ckpt-v15-dev version of user-cr, and if that is the case it makes your problems with gconf shared file mapping more suspect ask well...? After updating to the latest user-cr I get the same BUGs. [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state before checkpoint, is there any mechanism apart from cgroup/freezer.state to do this? Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze the tasks -- they'd still be capable of responding to some signals (CONT, TERM..). Also they'd presumably be placed in the stopped state upon restart so a SIGCONT will be needed. In the case of bash, at least, that will technically change what happens upon restart. My guess is that in many cases it won't matter but there are some where it will. The freezer documentation shows an example of what happens with bash when attempting to use only STOP/CONT rather than the freezer. gdb might also present interesting cases when just utilizing STOP/CONT signals.. Cheers, -Matt Helsley ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: bugs with ckpt-v15-dev
Matt Helsley matth...@us.ibm.com writes: On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote: [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state before checkpoint, is there any mechanism apart from cgroup/freezer.state to do this? Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze the tasks -- they'd still be capable of responding to some signals (CONT, TERM..). Also they'd presumably be placed in the stopped state upon restart so a SIGCONT will be needed. In the case of bash, at least, that will technically change what happens upon restart. My guess is that in many cases it won't matter but there are some where it will. Hmm, I'm having trouble understanding your suggestion. The current checkpoint implementation requires non-self tasks to be frozen (p-flags PF_FROZEN), which is not equivalent to stopped state (task-state __TASK_STOPPED). That is, it would refuse to checkpoint tasks in stopped state. See may_checkpoint_task(). ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: bugs with ckpt-v15-dev
On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote: Matt Helsley matth...@us.ibm.com writes: On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote: [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state before checkpoint, is there any mechanism apart from cgroup/freezer.state to do this? Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze the tasks -- they'd still be capable of responding to some signals (CONT, TERM..). Also they'd presumably be placed in the stopped state upon restart so a SIGCONT will be needed. In the case of bash, at least, that will technically change what happens upon restart. My guess is that in many cases it won't matter but there are some where it will. Hmm, I'm having trouble understanding your suggestion. The current checkpoint implementation requires non-self tasks to be frozen (p-flags PF_FROZEN), which is not equivalent to stopped state (task-state __TASK_STOPPED). That is, it would refuse to checkpoint tasks in stopped state. See may_checkpoint_task(). Oops. You're right. That would require changing may_checkpoint_task() to include __TASK_STOPPED -- not something we'd want in the final code. I had assumed you wanted to try a different mechanism for debugging purposes. Cheers, -Matt Helsley ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 0/6] cr: credentials
Here is my latest version of the task credentials c/r patchset. The last patch isn't meant to go upstream - it just helped me to straighten out the refcounting, to the point where nested user namespace c/r now appears to be robust. thanks, -serge ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 1/6] cr: break out new_user_ns()
Break out the core function which checks privilege and (if allowed) creates a new user namespace, with the passed-in creating user_struct. Note that a user_namespace, unlike other namespace pointers, is not stored in the nsproxy. Rather it is purely a property of user_structs. This will let us keep the task restore code simpler. Signed-off-by: Serge E. Hallyn se...@us.ibm.com --- include/linux/user_namespace.h |8 ++ kernel/user_namespace.c| 53 2 files changed, 45 insertions(+), 16 deletions(-) diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index cc4f453..a2b82d5 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -20,6 +20,8 @@ extern struct user_namespace init_user_ns; #ifdef CONFIG_USER_NS +struct user_namespace *new_user_ns(struct user_struct *creator, + struct user_struct **newroot); static inline struct user_namespace *get_user_ns(struct user_namespace *ns) { if (ns) @@ -38,6 +40,12 @@ static inline void put_user_ns(struct user_namespace *ns) #else +static inline struct user_namespace *new_user_ns(struct user_struct *creator, + struct user_struct **newroot) +{ + return -EINVAL; +} + static inline struct user_namespace *get_user_ns(struct user_namespace *ns) { return init_user_ns; diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 076c7c8..e624b0f 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -11,15 +11,8 @@ #include linux/user_namespace.h #include linux/cred.h -/* - * Create a new user namespace, deriving the creator from the user in the - * passed credentials, and replacing that user with the new root user for the - * new namespace. - * - * This is called by copy_creds(), which will finish setting the target task's - * credentials. - */ -int create_user_ns(struct cred *new) +static struct user_namespace *_new_user_ns(struct user_struct *creator, + struct user_struct **newroot) { struct user_namespace *ns; struct user_struct *root_user; @@ -27,7 +20,7 @@ int create_user_ns(struct cred *new) ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL); if (!ns) - return -ENOMEM; + return ERR_PTR(-ENOMEM); kref_init(ns-kref); @@ -38,12 +31,43 @@ int create_user_ns(struct cred *new) root_user = alloc_uid(ns, 0); if (!root_user) { kfree(ns); - return -ENOMEM; + return ERR_PTR(-ENOMEM); } /* set the new root user in the credentials under preparation */ - ns-creator = new-user; - new-user = root_user; + ns-creator = creator; + + /* alloc_uid() incremented the userns refcount. Just set it to 1 */ + kref_set(ns-kref, 1); + + *newroot = root_user; + return ns; +} + +struct user_namespace *new_user_ns(struct user_struct *creator, + struct user_struct **newroot) +{ + if (!capable(CAP_SYS_ADMIN)) + return ERR_PTR(-EPERM); + return _new_user_ns(creator, newroot); +} + +/* + * Create a new user namespace, deriving the creator from the user in the + * passed credentials, and replacing that user with the new root user for the + * new namespace. + * + * This is called by copy_creds(), which will finish setting the target task's + * credentials. + */ +int create_user_ns(struct cred *new) +{ + struct user_namespace *ns; + + ns = new_user_ns(new-user, new-user); + if (IS_ERR(ns)) + return PTR_ERR(ns); + new-uid = new-euid = new-suid = new-fsuid = 0; new-gid = new-egid = new-sgid = new-fsgid = 0; put_group_info(new-group_info); @@ -54,9 +78,6 @@ int create_user_ns(struct cred *new) #endif /* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */ - /* alloc_uid() incremented the userns refcount. Just set it to 1 */ - kref_set(ns-kref, 1); - return 0; } -- 1.6.1 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 2/6] cr: split core function out of some set*{u, g}id functions
When restarting tasks, we want to be able to change xuid and xgid in a struct cred, and do so with security checks. Break the core functionality of set{fs,res}{u,g}id into cred_setX which performs the access checks based on current_cred(), but performs the requested change on a passed-in cred. This will allow us to securely construct struct creds based on a checkpoint image, constrained by the caller's permissions, and apply them to the caller at the end of sys_restart(). Signed-off-by: Serge E. Hallyn se...@us.ibm.com --- include/linux/cred.h |8 +++ kernel/cred.c| 114 ++ kernel/sys.c | 134 -- 3 files changed, 143 insertions(+), 113 deletions(-) diff --git a/include/linux/cred.h b/include/linux/cred.h index 3282ee4..bc5ffc2 100644 --- a/include/linux/cred.h +++ b/include/linux/cred.h @@ -20,6 +20,9 @@ struct user_struct; struct cred; struct inode; +/* defined in sys.c, used in cred_setresuid */ +extern int set_user(struct cred *new); + /* * COW Supplementary groups list */ @@ -343,4 +346,9 @@ do {\ *(_fsgid) = __cred-fsgid; \ } while(0) +int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid); +int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, gid_t sgid); +int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid); +int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid); + #endif /* _LINUX_CRED_H */ diff --git a/kernel/cred.c b/kernel/cred.c index 3a03918..a017399 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -589,3 +589,117 @@ int set_create_files_as(struct cred *new, struct inode *inode) return security_kernel_create_files_as(new, inode); } EXPORT_SYMBOL(set_create_files_as); + +int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid) +{ + int retval; + const struct cred *old; + + retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES); + if (retval) + return retval; + old = current_cred(); + + if (!capable(CAP_SETUID)) { + if (ruid != (uid_t) -1 ruid != old-uid + ruid != old-euid ruid != old-suid) + return -EPERM; + if (euid != (uid_t) -1 euid != old-uid + euid != old-euid euid != old-suid) + return -EPERM; + if (suid != (uid_t) -1 suid != old-uid + suid != old-euid suid != old-suid) + return -EPERM; + } + + if (ruid != (uid_t) -1) { + new-uid = ruid; + if (ruid != old-uid) { + retval = set_user(new); + if (retval 0) + return retval; + } + } + if (euid != (uid_t) -1) + new-euid = euid; + if (suid != (uid_t) -1) + new-suid = suid; + new-fsuid = new-euid; + + return security_task_fix_setuid(new, old, LSM_SETID_RES); +} + +int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, + gid_t sgid) +{ + const struct cred *old = current_cred(); + int retval; + + retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES); + if (retval) + return retval; + + if (!capable(CAP_SETGID)) { + if (rgid != (gid_t) -1 rgid != old-gid + rgid != old-egid rgid != old-sgid) + return -EPERM; + if (egid != (gid_t) -1 egid != old-gid + egid != old-egid egid != old-sgid) + return -EPERM; + if (sgid != (gid_t) -1 sgid != old-gid + sgid != old-egid sgid != old-sgid) + return -EPERM; + } + + if (rgid != (gid_t) -1) + new-gid = rgid; + if (egid != (gid_t) -1) + new-egid = egid; + if (sgid != (gid_t) -1) + new-sgid = sgid; + new-fsgid = new-egid; + return 0; +} + +int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid) +{ + const struct cred *old; + + old = current_cred(); + *old_fsuid = old-fsuid; + + if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS) 0) + return -EPERM; + + if (uid == old-uid || uid == old-euid || + uid == old-suid || uid == old-fsuid || + capable(CAP_SETUID)) { + if (uid != *old_fsuid) { + new-fsuid = uid; + if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0) + return 0; + } + } + return -EPERM; +} + +int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid) +{ + const struct cred *old; + +
[Devel] [PATCH 3/6] cr: capabilities: define checkpoint and restore fns
An application checkpoint image will store capability sets (and the bounding set) as __u64s. Define checkpoint and restart functions to translate between those and kernel_cap_t's. Define a common function do_capset_tocred() which applies capability set changes to a passed-in struct cred. The restore function uses do_capset_tocred() to apply the restored capabilities to the struct cred being crafted, subject to the current task's (task executing sys_restart()) permissions. TODO: one day we'll want to c/r the securebits as well. Signed-off-by: Serge E. Hallyn se...@us.ibm.com --- include/linux/capability.h |5 ++ kernel/capability.c| 94 +-- 2 files changed, 86 insertions(+), 13 deletions(-) diff --git a/include/linux/capability.h b/include/linux/capability.h index c302110..572b5a0 100644 --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -536,6 +536,11 @@ extern const kernel_cap_t __cap_empty_set; extern const kernel_cap_t __cap_full_set; extern const kernel_cap_t __cap_init_eff_set; +extern void checkpoint_save_cap(__u64 *dest, kernel_cap_t src); +struct cred; +extern int checkpoint_restore_cap(__u64 e, __u64 i, __u64 p, __u64 x, + struct cred *cred); + /** * has_capability - Determine if a task has a superior capability available * @t: The task in question diff --git a/kernel/capability.c b/kernel/capability.c index 4e17041..d2c9bb3 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -217,6 +217,45 @@ SYSCALL_DEFINE2(capget, cap_user_header_t, header, cap_user_data_t, dataptr) return ret; } +static int do_capset_tocred(kernel_cap_t *effective, kernel_cap_t *inheritable, + kernel_cap_t *permitted, struct cred *new) +{ + int ret; + + ret = security_capset(new, current_cred(), + effective, inheritable, permitted); + if (ret 0) + return ret; + + /* +* for checkpoint-restart, do we want to wait until end of restart? +* not sure we care */ + audit_log_capset(current-pid, new, current_cred()); + + return 0; +} + +static int do_capset(kernel_cap_t *effective, kernel_cap_t *inheritable, + kernel_cap_t *permitted) +{ + struct cred *new; + int ret; + + new = prepare_creds(); + if (!new) + return -ENOMEM; + + ret = do_capset_tocred(effective, inheritable, permitted, new); + if (ret 0) + goto error; + + return commit_creds(new); + +error: + abort_creds(new); + return ret; +} + /** * sys_capset - set capabilities for a process or (*) a group of processes * @header: pointer to struct that contains capability version and @@ -240,7 +279,6 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data) struct __user_cap_data_struct kdata[_KERNEL_CAPABILITY_U32S]; unsigned i, tocopy; kernel_cap_t inheritable, permitted, effective; - struct cred *new; int ret; pid_t pid; @@ -271,22 +309,52 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data) i++; } - new = prepare_creds(); - if (!new) - return -ENOMEM; + return do_capset(effective, inheritable, permitted); - ret = security_capset(new, current_cred(), - effective, inheritable, permitted); - if (ret 0) - goto error; +} - audit_log_capset(pid, new, current_cred()); - return commit_creds(new); +void checkpoint_save_cap(__u64 *dest, kernel_cap_t src) +{ + *dest = src.cap[0] | (src.cap[1] sizeof(__u32)); +} -error: - abort_creds(new); - return ret; +static void do_capbset_drop(struct cred *cred, int cap) +{ + cap_lower(cred-cap_bset, cap); +} + +int checkpoint_restore_cap(__u64 newe, __u64 newi, __u64 newp, __u64 newx, + struct cred *cred) +{ + kernel_cap_t effective, inheritable, permitted, bset; + int may_dropbcap = capable(CAP_SETPCAP); + int ret, i; + + effective.cap[0] = newe; + effective.cap[1] = (newe sizeof(__u32)); + inheritable.cap[0] = newi; + inheritable.cap[1] = (newi sizeof(__u32)); + permitted.cap[0] = newp; + permitted.cap[1] = (newp sizeof(__u32)); + bset.cap[0] = newx; + bset.cap[1] = (newx sizeof(__u32)); + + ret = do_capset_tocred(effective, inheritable, permitted, cred); + if (ret 0) + return ret; + + for (i = 0; i CAP_LAST_CAP; i++) { + if (cap_raised(bset, i)) + continue; + if (!cap_raised(current_cred()-cap_bset, i)) + continue; + if (!may_dropbcap) + return -EPERM; + do_capbset_drop(cred, i); +
[Devel] [PATCH 6/6] user namespaces: debug refcounts
Create /proc/userns, which prints out all user namespaces. It prints the address of the user_ns itself, the uid and userns address of the user who created it, and the reference count. Signed-off-by: Serge E. Hallyn se...@us.ibm.com --- include/linux/user_namespace.h |2 + kernel/user.c |1 + kernel/user_namespace.c| 84 3 files changed, 87 insertions(+), 0 deletions(-) diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index a2b82d5..e64d602 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -14,8 +14,10 @@ struct user_namespace { struct hlist_head uidhash_table[UIDHASH_SZ]; struct user_struct *creator; struct work_struct destroyer; + struct list_headlist; }; +extern spinlock_t usernslist_lock; extern struct user_namespace init_user_ns; #ifdef CONFIG_USER_NS diff --git a/kernel/user.c b/kernel/user.c index 850e0ba..6474422 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -23,6 +23,7 @@ struct user_namespace init_user_ns = { .refcount = ATOMIC_INIT(2), }, .creator = root_user, + .list = LIST_HEAD_INIT(init_user_ns.list), }; EXPORT_SYMBOL_GPL(init_user_ns); diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index e624b0f..8d7c725 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -10,6 +10,11 @@ #include linux/slab.h #include linux/user_namespace.h #include linux/cred.h +#include linux/proc_fs.h +#include linux/seq_file.h +#include linux/spinlock.h + +DEFINE_SPINLOCK(usernslist_lock); static struct user_namespace *_new_user_ns(struct user_struct *creator, struct user_struct **newroot) @@ -40,6 +45,9 @@ static struct user_namespace *_new_user_ns(struct user_struct *creator, /* alloc_uid() incremented the userns refcount. Just set it to 1 */ kref_set(ns-kref, 1); + spin_lock(usernslist_lock); + list_add_tail(ns-list, init_user_ns.list); + spin_unlock(usernslist_lock); *newroot = root_user; return ns; } @@ -90,6 +98,9 @@ static void free_user_ns_work(struct work_struct *work) { struct user_namespace *ns = container_of(work, struct user_namespace, destroyer); + spin_lock(usernslist_lock); + list_del(ns-list); + spin_unlock(usernslist_lock); free_uid(ns-creator); kfree(ns); } @@ -103,3 +114,76 @@ void free_user_ns(struct kref *kref) schedule_work(ns-destroyer); } EXPORT_SYMBOL(free_user_ns); + +#ifdef CONFIG_PROC_FS +static int proc_userns_show(struct seq_file *m, void *v) +{ + struct user_namespace *ns = v; + seq_printf(m, userns %p creator (uid %d ns %p) count %d\n, + (void *)ns, ns-creator-uid, (void *) ns-creator-user_ns, + atomic_read(ns-kref.refcount)); + return 0; +} + +static void *proc_userns_start(struct seq_file *p, loff_t *_pos) +{ + loff_t pos = *_pos; + struct user_namespace *ns = init_user_ns; + spin_lock(usernslist_lock); + while (pos) { + pos--; + ns = list_entry(ns-list.next, struct user_namespace, list); + if (ns == init_user_ns) + return NULL; + } + return ns; +} + +static void *proc_userns_next(struct seq_file *p, void *v, loff_t *_pos) +{ + struct user_namespace *ns = v; + (*_pos)++; + ns = list_entry(ns-list.next, struct user_namespace, list); + if (ns == init_user_ns) + return NULL; + return ns; +} + +static void proc_userns_stop(struct seq_file *p, void *v) +{ + spin_unlock(usernslist_lock); +} + +static const struct seq_operations proc_userns_ops; + +static int proc_userns_open(struct inode *inode, struct file *filp) +{ + return seq_open(filp, proc_userns_ops); +} + +static const struct seq_operations proc_userns_ops = { + .start = proc_userns_start, + .next = proc_userns_next, + .stop = proc_userns_stop, + .show = proc_userns_show, +}; + +const struct file_operations proc_userns_fops = { + .open = proc_userns_open, + .read = seq_read, + .llseek = seq_lseek, + .release= seq_release, +}; + +static __init int user_ns_debug(void) +{ + struct proc_dir_entry *p; + + p = proc_create(userns, 0, NULL, proc_userns_fops); + if (!p) + panic(cannot create /proc/userns\n); + return 0; +} + +__initcall(user_ns_debug); +#endif -- 1.6.1 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 4/6] cr: checkpoint and restore task credentials
This patch adds the checkpointing and restart of credentials (uids, gids, and capabilities) to Oren's c/r patchset (on top of v14). It goes to great pains to re-use (and define when needed) common helpers, in order to make sure that as security code is modified, the cr code will be updated. Some of the helpers should still be moved (i.e. _creds() functions should be in kernel/cred.c). When building the credentials for the restarted process, I 1. create a new struct cred as a copy of the running task's cred (using prepare_cred()) 2. always authorize any changes to the new struct cred based on the permissions of current_cred() (not the current transient state of the new cred). While this may mean that certain transient_cred1-transient_cred2 states are allowed which otherwise wouldn't be allowed, the fact remains that current_cred() is allowed to transition to transient_cred2. The reconstructed creds are applied to the task at the very end of the sys_restart call. This ensures that any objects which need to be re-created (file, socket, etc) are re-created using the creds of the task calling sys_restart - preventing an unpriv user from creating a privileged object, and ensuring that a root task can restart a process which had started out privileged, created some privileged objects, then dropped its privilege. With these patches, the root user can restart checkpoint images (created by either hallyn or root) of user hallyn's tasks, resulting in a program owned by hallyn. Plenty of bugs to be found, no doubt. Changelog: May 18: fix more refcounting: if (userns 5, uid 0) had no active tasks or child user_namespaces, then it shouldn't exist at restart or it, its namespace, and its whole chain of creators will be leaked. May 14: fix some refcounting: 1. a new user_ns needs a ref to remain pinned by its root user 2. current_user_ns needs an extra ref bc objhash drops two on restart 3. cred needs a ref for the real credentials bc commit_creds eats one ref. May 13: folded in fix to userns refcounting. Signed-off-by: Serge E. Hallyn se...@us.ibm.com --- checkpoint/objhash.c | 119 ++- checkpoint/process.c | 457 +- include/linux/checkpoint.h | 11 + include/linux/checkpoint_hdr.h | 57 + include/linux/checkpoint_types.h |1 + 5 files changed, 642 insertions(+), 3 deletions(-) diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c index 5618fff..d2268c2 100644 --- a/checkpoint/objhash.c +++ b/checkpoint/objhash.c @@ -16,6 +16,7 @@ #include linux/file.h #include linux/sched.h #include linux/ipc_namespace.h +#include linux/user_namespace.h #include linux/checkpoint.h #include linux/checkpoint_hdr.h @@ -155,6 +156,71 @@ static int obj_ipc_ns_users(void *ptr) return atomic_read(((struct ipc_namespace *) ptr)-count); } +static int obj_cred_grab(void *ptr) +{ + get_cred((struct cred *) ptr); + return 0; +} + +static void obj_cred_drop(void *ptr) +{ + put_cred((struct cred *) ptr); +} + +static int obj_cred_users(void *ptr) +{ + return atomic_read(((struct cred *) ptr)-usage); +} + +static int obj_user_grab(void *ptr) +{ + struct user_struct *u = ptr; + (void) get_uid(u); + return 0; +} + +static void obj_user_drop(void *ptr) +{ + free_uid((struct user_struct *) ptr); +} + +static int obj_user_users(void *ptr) +{ + return atomic_read(((struct user_struct *) ptr)-__count); +} + +static int obj_userns_grab(void *ptr) +{ + get_user_ns((struct user_namespace *) ptr); + return 0; +} + +static void obj_userns_drop(void *ptr) +{ + put_user_ns((struct user_namespace *) ptr); +} + +static int obj_user_ns_users(void *ptr) +{ + return atomic_read(((struct user_namespace *) ptr)-kref.refcount); +} + +static int obj_groupinfo_grab(void *ptr) +{ + get_group_info((struct group_info *) ptr); + return 0; +} + +static void obj_groupinfo_drop(void *ptr) +{ + put_group_info((struct group_info *) ptr); +} + +static int obj_groupinfo_users(void *ptr) +{ + return atomic_read(((struct group_info *) ptr)-usage); +} + static struct ckpt_obj_ops ckpt_obj_ops[] = { /* ignored object */ { @@ -221,6 +287,46 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = { .checkpoint = checkpoint_bad, .restore = restore_bad, }, + /* user_ns object */ + { + .obj_name = USER_NS, + .obj_type = CKPT_OBJ_USER_NS, + .ref_drop = obj_userns_drop, + .ref_grab = obj_userns_grab, + .ref_users = obj_user_ns_users, + .checkpoint = checkpoint_userns, + .restore = restore_userns, + }, + /* struct cred */ + { +
[Devel] [PATCH 5/6] cr: restore file-f_cred
Ony seems useful if you're using coda or hppfs, but go ahead and restore a file's f_cred. This is set to the cred of the task doing the open, so often it will be the same as that of the restarted task. Signed-off-by: Serge E. Hallyn se...@us.ibm.com --- checkpoint/files.c | 16 ++-- include/linux/checkpoint_hdr.h |1 + 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/checkpoint/files.c b/checkpoint/files.c index 6107361..edf81ce 100644 --- a/checkpoint/files.c +++ b/checkpoint/files.c @@ -151,7 +151,11 @@ int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file, h-f_pos = file-f_pos; h-f_version = file-f_version; - /* FIX: need also file-uid, file-gid, file-f_owner, etc */ + h-f_credref = checkpoint_obj(ctx, file-f_cred, CKPT_OBJ_CRED); + if (h-f_credref 0) + return h-f_credref; + + /* FIX: need also file-f_owner, etc */ return 0; } @@ -359,8 +363,16 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file *file, struct ckpt_hdr_file *h) { int ret; + struct cred *cred; + + /* FIX: need to restore owner etc */ - /* FIX: need to restore uid, gid, owner etc */ + /* restore the cred */ + cred = ckpt_obj_fetch(ctx, h-f_credref, CKPT_OBJ_CRED); + if (IS_ERR(cred)) + return PTR_ERR(cred); + put_cred(file-f_cred); + file-f_cred = get_cred(cred); /* safe to set 1st arg (fd) to 0, as command is F_SETFL */ ret = vfs_fcntl(0, F_SETFL, h-f_flags CKPT_SETFL_MASK, file); diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h index 98fc0f7..2693a72 100644 --- a/include/linux/checkpoint_hdr.h +++ b/include/linux/checkpoint_hdr.h @@ -341,6 +341,7 @@ struct ckpt_hdr_file { __u32 _padding; __u64 f_pos; __u64 f_version; + __s32 f_credref; } __attribute__((aligned(8))); struct ckpt_hdr_file_generic { -- 1.6.1 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel