Re: [Devel] [PATCH 3/8] memcg, slab: never try to merge memcg caches
On Tue, Feb 4, 2014 at 7:27 PM, Vladimir Davydov vdavy...@parallels.com wrote: On 02/04/2014 07:11 PM, Michal Hocko wrote: On Tue 04-02-14 18:59:23, Vladimir Davydov wrote: On 02/04/2014 06:52 PM, Michal Hocko wrote: On Sun 02-02-14 20:33:48, Vladimir Davydov wrote: Suppose we are creating memcg cache A that could be merged with cache B of the same memcg. Since any memcg cache has the same parameters as its parent cache, parent caches PA and PB of memcg caches A and B must be mergeable too. That means PA was merged with PB on creation or vice versa, i.e. PA = PB. From that it follows that A = B, and we couldn't even try to create cache B, because it already exists - a contradiction. I cannot tell I understand the above but I am totally not sure about the statement bellow. So let's remove unused code responsible for merging memcg caches. How come the code was unused? find_mergeable called cache_match_memcg... Oh, sorry for misleading comment. I mean the code handling merging of per-memcg caches is useless, AFAIU: if we find an alias for a per-memcg cache on kmem_cache_create_memcg(), the parent of the found alias must be the same as the parent_cache passed to kmem_cache_create_memcg(), but if it were so, we would never proceed to the memcg cache creation, because the cache we want to create already exists. I am still not sure I understand this correctly. So the outcome of this patch is that compatible caches of different memcgs can be merged together? Sorry if this is a stupid question but I am not that familiar with this area much I am just seeing that cache_match_memcg goes away and my understanding of the function is that it should prevent from different memcg's caches merging. Let me try to explain how I understand it. What is cache merging/aliasing? When we create a cache (kmem_cache_create()), we first try to find a compatible cache that already exists and can handle requests from the new cache. If it is, we do not create any new caches, instead we simply increment the old cache refcount and return it. What about memcgs? Currently, it operates in the same way, i.e. on memcg cache creation we also try to find a compatible cache of the same memcg first. But if there were such a cache, they parents would have been merged (i.e. it would be the same cache). That means we would not even get to this memcg cache creation, because it already exists. That's why the code handling memcg caches merging seems pointless to me. IIRC, this may not always hold. Some of the properties are configurable via sysfs, and it might be that you haven't merged two parent caches because they properties differ, but would be fine merging the child caches. If all properties we check are compile-time parameters, then it should be okay. -- E Mare, Libertas ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 1/8] memcg: export kmemcg cache id via cgroup fs
On Mon, Feb 3, 2014 at 10:57 AM, Vladimir Davydov vdavy...@parallels.com wrote: On 02/03/2014 10:21 AM, David Rientjes wrote: On Sun, 2 Feb 2014, Vladimir Davydov wrote: Per-memcg kmem caches are named as follows: global-cache-name(cgroup-kmem-id:cgroup-name) where cgroup-kmem-id is the unique id of the memcg the cache belongs to, cgroup-name is the relative name of the memcg on the cgroup fs. Cache names are exposed to userspace for debugging purposes (e.g. via sysfs in case of slub or via dmesg). Using relative names makes it impossible in general (in case the cgroup hierarchy is not flat) to find out which memcg a particular cache belongs to, because cgroup-kmem-id is not known to the user. Since using absolute cgroup names would be an overkill, let's fix this by exporting the id of kmem-active memcg via cgroup fs file memory.kmem.id. Hmm, I'm not sure exporting additional information is the best way to do it only for this purpose. I do understand the problem in naming collisions if the hierarchy isn't flat and we typically work around that by ensuring child memcgs still have a unique memcg. This isn't only a problem in slab cache naming, me also avoid printing the entire absolute names for things like the oom killer. AFAIU, cgroup identifiers dumped on oom (cgroup paths, currently) and memcg slab cache names serve for different purposes. The point is oom is a perfectly normal situation for the kernel, and info dumped to dmesg is for admin to find out the cause of the problem (a greedy user or cgroup). On the other hand, slab cache names are dumped to dmesg only on extraordinary situations - like bugs in slab implementation, or double free, or detected memory leaks - where we usually do not need the name of the memcg that triggered the problem, because the bug is likely to be in the kernel subsys using the cache. Plus, the names are exported to sysfs in case of slub, again for debugging purposes, AFAIK. So IMO the use cases for oom vs slab names are completely different - information vs debugging - and I want to export kmem.id only for the ability of debugging kmemcg and slab subsystems. Then maybe it is better to wrap it into some kind of CONFIG_DEBUG wrap. We already have other files like that. So it would be nice to have consensus on how people are supposed to identify memcgs with a hierarchy: either by exporting information like the id like you do here (but leave the oom killer still problematic) or by insisting people name their memcgs with unique names if they care to differentiate them. Anyway, I agree with you that this needs a consensus, because this is a functional change. Thanks. -- E Mare, Libertas ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v14 16/18] vmpressure: in-kernel notifications
I have the exact problem described above for a project I'm working on and this solution seems to solve it well. However, I had a few issues while trying to use this interface. I'll comment on them below, but please take this more as advice seeking than patch review. This patch extends that to also support in-kernel users. Events that should be generated for in-kernel consumption will be marked as such, and for those, we will call a registered function instead of triggering an eventfd notification. Please note that due to my lack of understanding of each shrinker user, I will stay away from converting the actual users, you are all welcome to do so. Signed-off-by: Glauber Costa glom...@openvz.org Signed-off-by: Vladimir Davydov vdavy...@parallels.com Acked-by: Anton Vorontsov an...@enomsg.org Acked-by: Pekka Enberg penb...@kernel.org Reviewed-by: Greg Thelen gthe...@google.com Cc: Dave Chinner dchin...@redhat.com Cc: John Stultz john.stu...@linaro.org Cc: Andrew Morton a...@linux-foundation.org Cc: Joonsoo Kim iamjoonsoo@lge.com Cc: Michal Hocko mho...@suse.cz Cc: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com Cc: Johannes Weiner han...@cmpxchg.org --- include/linux/vmpressure.h |5 + mm/vmpressure.c| 53 +--- 2 files changed, 55 insertions(+), 3 deletions(-) diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h index 3f3788d..9102e53 100644 --- a/include/linux/vmpressure.h +++ b/include/linux/vmpressure.h @@ -19,6 +19,9 @@ struct vmpressure { /* Have to grab the lock on events traversal or modifications. */ struct mutex events_lock; + /* False if only kernel users want to be notified, true otherwise. */ + bool notify_userspace; + struct work_struct work; }; @@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css, struct cftype *cft, struct eventfd_ctx *eventfd, const char *args); +extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css, + void (*fn)(void)); extern void vmpressure_unregister_event(struct cgroup_subsys_state *css, struct cftype *cft, struct eventfd_ctx *eventfd); diff --git a/mm/vmpressure.c b/mm/vmpressure.c index e0f6283..730e7c1 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned, } struct vmpressure_event { - struct eventfd_ctx *efd; + union { + struct eventfd_ctx *efd; + void (*fn)(void); How does the callback access its private data? + }; enum vmpressure_levels level; + bool kernel_event; struct list_head node; }; @@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr, mutex_lock(vmpr-events_lock); list_for_each_entry(ev, vmpr-events, node) { - if (level = ev-level) { + if (ev-kernel_event) { + ev-fn(); I think it would be interesting to pass 'level' to the callback (I'll probably use it myself), but we could wait for a in-tree user before adding it. + } else if (vmpr-notify_userspace level = ev-level) { eventfd_signal(ev-efd, 1); signalled = true; } } + vmpr-notify_userspace = false; mutex_unlock(vmpr-events_lock); return signalled; @@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, * we account it too. */ if (!(gfp (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS))) - return; + goto schedule; /* * If we got here with no pages scanned, then that is an indicator @@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, vmpr-scanned += scanned; vmpr-reclaimed += reclaimed; scanned = vmpr-scanned; + /* + * If we didn't reach this point, only kernel events will be triggered. + * It is the job of the worker thread to clean this up once the + * notifications are all delivered. + */ + vmpr-notify_userspace = true; spin_unlock(vmpr-sr_lock); +schedule: if (scanned vmpressure_win) return; schedule_work(vmpr-work); @@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css, } /** + * vmpressure_register_kernel_event() - Register kernel-side notification + * @css: css that is interested in vmpressure notifications + * @fn: function to be called when pressure happens + * + * This function register in-kernel users interested in receiving notifications
Re: [Devel] [PATCH v14 16/18] vmpressure: in-kernel notifications
One correction: int vmpressure_register_kernel_event(struct cgroup_subsys_state *css, - void (*fn)(void)) +void (*fn)(void *data, int level), void *data) { - struct vmpressure *vmpr = css_to_vmpressure(css); + struct vmpressure *vmpr; struct vmpressure_event *ev; + vmpr = css ? css_to_vmpressure(css) : memcg_to_vmpressure(NULL); + This looks like it could be improved. Better not to have that memcg specific thing here. Other than that it makes sense. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v14 16/18] vmpressure: in-kernel notifications
On Fri, Dec 20, 2013 at 8:44 PM, Luiz Capitulino lcapitul...@redhat.com wrote: On Fri, 20 Dec 2013 10:03:32 -0500 Luiz Capitulino lcapitul...@redhat.com wrote: The answer for all of your questions above can be summarized by noting that for the lack of other users (at the time), this patch does the bare minimum for memcg needs. I agree, for instance, that it would be good to pass the level but since memcg won't do anything with thta, I didn't pass it. That should be extended if you need to. That works for me. That is, including this minimal version first and extending it when we get in-tree users. Btw, there's something I was thinking just right now. If/when we convert shrink functions to use this API, they will come to depend on CONFIG_MEMCG=y. IOW, they won't work if CONFIG_MEMCG=n. Is this acceptable (this is an honest question)? Because today, they do work when CONFIG_MEMCG=n. Should those shrink functions use the shrinker API as a fallback? If you have a non-memcg user, that should obviously be available for CONFIG_MEMCG=n ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v14 16/18] vmpressure: in-kernel notifications
On Fri, Dec 20, 2013 at 8:53 PM, Luiz Capitulino lcapitul...@redhat.com wrote: On Fri, 20 Dec 2013 20:46:05 +0400 Glauber Costa glom...@gmail.com wrote: On Fri, Dec 20, 2013 at 8:44 PM, Luiz Capitulino lcapitul...@redhat.com wrote: On Fri, 20 Dec 2013 10:03:32 -0500 Luiz Capitulino lcapitul...@redhat.com wrote: The answer for all of your questions above can be summarized by noting that for the lack of other users (at the time), this patch does the bare minimum for memcg needs. I agree, for instance, that it would be good to pass the level but since memcg won't do anything with thta, I didn't pass it. That should be extended if you need to. That works for me. That is, including this minimal version first and extending it when we get in-tree users. Btw, there's something I was thinking just right now. If/when we convert shrink functions to use this API, they will come to depend on CONFIG_MEMCG=y. IOW, they won't work if CONFIG_MEMCG=n. Is this acceptable (this is an honest question)? Because today, they do work when CONFIG_MEMCG=n. Should those shrink functions use the shrinker API as a fallback? If you have a non-memcg user, that should obviously be available for CONFIG_MEMCG=n OK, which means we'll have to change it, right? Because, if I'm not missing something, today vmpressure does depend on CONFIG_MEMCG=y. You mean the main vmpressure mechanism? Sorry, this was out of my mental cachelines. Yes, vmpressure depends on MEMCG, because the pressure interface is memcg-specific (global == root memcg) You might want to change that so you can reuse the mechanism and let only the user interface depend on memcg. -- E Mare, Libertas ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 4/6] memcg, slab: check and init memcg_cahes under slab_mutex
On Thu, Dec 19, 2013 at 11:07 AM, Vladimir Davydov vdavy...@parallels.com wrote: On 12/18/2013 09:41 PM, Michal Hocko wrote: On Wed 18-12-13 17:16:55, Vladimir Davydov wrote: The memcg_params::memcg_caches array can be updated concurrently from memcg_update_cache_size() and memcg_create_kmem_cache(). Although both of these functions take the slab_mutex during their operation, the latter checks if memcg's cache has already been allocated w/o taking the mutex. This can result in a race as described below. Asume two threads schedule kmem_cache creation works for the same kmem_cache of the same memcg from __memcg_kmem_get_cache(). One of the works successfully creates it. Another work should fail then, but if it interleaves with memcg_update_cache_size() as follows, it does not: I am not sure I understand the race. memcg_update_cache_size is called when we start accounting a new memcg or a child is created and it inherits accounting from the parent. memcg_create_kmem_cache is called when a new cache is first allocated from, right? memcg_update_cache_size() is called when kmem accounting is activated for a memcg, no matter how. memcg_create_kmem_cache() is scheduled from __memcg_kmem_get_cache(). It's OK to have a bunch of such methods trying to create the same memcg cache concurrently, but only one of them should succeed. Why cannot we simply take slab_mutex inside memcg_create_kmem_cache? it is running from the workqueue context so it should clash with other locks. Hmm, Glauber's code never takes the slab_mutex inside memcontrol.c. I have always been wondering why, because it could simplify flow paths significantly (e.g. update_cache_sizes() - update_all_caches() - update_cache_size() - from memcontrol.c to slab_common.c and back again just to take the mutex). Because that is a layering violation and exposes implementation details of the slab to the outside world. I agree this would make things a lot simpler, but please check with Christoph if this is acceptable before going forward. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH] memcg: remove KMEM_ACCOUNTED_ACTIVATED
On Mon, Dec 2, 2013 at 10:15 PM, Michal Hocko mho...@suse.cz wrote: [CCing Glauber - please do so in other posts for kmem related changes] On Mon 02-12-13 17:08:13, Vladimir Davydov wrote: The KMEM_ACCOUNTED_ACTIVATED was introduced by commit a8964b9b (memcg: use static branches when code not in use) in order to guarantee that static_key_slow_inc(memcg_kmem_enabled_key) will be called only once for each memory cgroup when its kmem limit is set. The point is that at that time the memcg_update_kmem_limit() function's workflow looked like this: bool must_inc_static_branch = false; cgroup_lock(); mutex_lock(set_limit_mutex); if (!memcg-kmem_account_flags val != RESOURCE_MAX) { /* The kmem limit is set for the first time */ ret = res_counter_set_limit(memcg-kmem, val); memcg_kmem_set_activated(memcg); must_inc_static_branch = true; } else ret = res_counter_set_limit(memcg-kmem, val); mutex_unlock(set_limit_mutex); cgroup_unlock(); if (must_inc_static_branch) { /* We can't do this under cgroup_lock */ static_key_slow_inc(memcg_kmem_enabled_key); memcg_kmem_set_active(memcg); } Today, we don't use cgroup_lock in memcg_update_kmem_limit(), and static_key_slow_inc() is called under the set_limit_mutex, but the leftover from the above-mentioned commit is still here. Let's remove it. OK, so I have looked there again and 692e89abd154b (memcg: increment static branch right after limit set) which went in after cgroup_mutex has been removed. It came along with the following comment. /* * setting the active bit after the inc will guarantee no one * starts accounting before all call sites are patched */ This suggests that the flag is needed after all because we have to be sure that _all_ the places have to be patched. AFAIU memcg_kmem_newpage_charge might see the static key already patched so it would do a charge but memcg_kmem_commit_charge would still see it unpatched and so the charge won't be committed. Or am I missing something? You are correct. This flag is there due to the way we are using static branches. The patching of one call site is atomic, but the patching of all of them are not. Therefore we need to use a two-flag scheme to guarantee that in the first time we turn the static branches on, there will be a clear point after which we're going to start accounting. -- E Mare, Libertas ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH] memcg: remove KMEM_ACCOUNTED_ACTIVATED
On Mon, Dec 2, 2013 at 10:51 PM, Michal Hocko mho...@suse.cz wrote: On Mon 02-12-13 22:26:48, Glauber Costa wrote: On Mon, Dec 2, 2013 at 10:15 PM, Michal Hocko mho...@suse.cz wrote: [CCing Glauber - please do so in other posts for kmem related changes] On Mon 02-12-13 17:08:13, Vladimir Davydov wrote: The KMEM_ACCOUNTED_ACTIVATED was introduced by commit a8964b9b (memcg: use static branches when code not in use) in order to guarantee that static_key_slow_inc(memcg_kmem_enabled_key) will be called only once for each memory cgroup when its kmem limit is set. The point is that at that time the memcg_update_kmem_limit() function's workflow looked like this: bool must_inc_static_branch = false; cgroup_lock(); mutex_lock(set_limit_mutex); if (!memcg-kmem_account_flags val != RESOURCE_MAX) { /* The kmem limit is set for the first time */ ret = res_counter_set_limit(memcg-kmem, val); memcg_kmem_set_activated(memcg); must_inc_static_branch = true; } else ret = res_counter_set_limit(memcg-kmem, val); mutex_unlock(set_limit_mutex); cgroup_unlock(); if (must_inc_static_branch) { /* We can't do this under cgroup_lock */ static_key_slow_inc(memcg_kmem_enabled_key); memcg_kmem_set_active(memcg); } Today, we don't use cgroup_lock in memcg_update_kmem_limit(), and static_key_slow_inc() is called under the set_limit_mutex, but the leftover from the above-mentioned commit is still here. Let's remove it. OK, so I have looked there again and 692e89abd154b (memcg: increment static branch right after limit set) which went in after cgroup_mutex has been removed. It came along with the following comment. /* * setting the active bit after the inc will guarantee no one * starts accounting before all call sites are patched */ This suggests that the flag is needed after all because we have to be sure that _all_ the places have to be patched. AFAIU memcg_kmem_newpage_charge might see the static key already patched so it would do a charge but memcg_kmem_commit_charge would still see it unpatched and so the charge won't be committed. Or am I missing something? You are correct. This flag is there due to the way we are using static branches. The patching of one call site is atomic, but the patching of all of them are not. Therefore we need to use a two-flag scheme to guarantee that in the first time we turn the static branches on, there will be a clear point after which we're going to start accounting. So http://lkml.org/lkml/2013/11/27/314 is correct then, right? It looks correct. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH] memcg: remove KMEM_ACCOUNTED_ACTIVATED
On Mon, Dec 2, 2013 at 11:21 PM, Vladimir Davydov vdavy...@parallels.com wrote: On 12/02/2013 10:26 PM, Glauber Costa wrote: On Mon, Dec 2, 2013 at 10:15 PM, Michal Hocko mho...@suse.cz wrote: [CCing Glauber - please do so in other posts for kmem related changes] On Mon 02-12-13 17:08:13, Vladimir Davydov wrote: The KMEM_ACCOUNTED_ACTIVATED was introduced by commit a8964b9b (memcg: use static branches when code not in use) in order to guarantee that static_key_slow_inc(memcg_kmem_enabled_key) will be called only once for each memory cgroup when its kmem limit is set. The point is that at that time the memcg_update_kmem_limit() function's workflow looked like this: bool must_inc_static_branch = false; cgroup_lock(); mutex_lock(set_limit_mutex); if (!memcg-kmem_account_flags val != RESOURCE_MAX) { /* The kmem limit is set for the first time */ ret = res_counter_set_limit(memcg-kmem, val); memcg_kmem_set_activated(memcg); must_inc_static_branch = true; } else ret = res_counter_set_limit(memcg-kmem, val); mutex_unlock(set_limit_mutex); cgroup_unlock(); if (must_inc_static_branch) { /* We can't do this under cgroup_lock */ static_key_slow_inc(memcg_kmem_enabled_key); memcg_kmem_set_active(memcg); } Today, we don't use cgroup_lock in memcg_update_kmem_limit(), and static_key_slow_inc() is called under the set_limit_mutex, but the leftover from the above-mentioned commit is still here. Let's remove it. OK, so I have looked there again and 692e89abd154b (memcg: increment static branch right after limit set) which went in after cgroup_mutex has been removed. It came along with the following comment. /* * setting the active bit after the inc will guarantee no one * starts accounting before all call sites are patched */ This suggests that the flag is needed after all because we have to be sure that _all_ the places have to be patched. AFAIU memcg_kmem_newpage_charge might see the static key already patched so it would do a charge but memcg_kmem_commit_charge would still see it unpatched and so the charge won't be committed. Or am I missing something? You are correct. This flag is there due to the way we are using static branches. The patching of one call site is atomic, but the patching of all of them are not. Therefore we need to use a two-flag scheme to guarantee that in the first time we turn the static branches on, there will be a clear point after which we're going to start accounting. Hi, Glauber Sorry, but I don't understand why we need two flags. Isn't checking the flag set after all call sites have been patched (I mean KMEM_ACCOUNTED_ACTIVE) not enough? Take a look at net/ipv4/tcp_memcontrol.c. There are comprehensive comments there for a mechanism that basically achieves the same thing. The idea is that one flag is used at all times and means it is enabled. The second flags is a one time only flag to indicate that the patching process is complete. With one flag it seems to work, but it is racy. -- E Mare, Libertas ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH] memcg: remove KMEM_ACCOUNTED_ACTIVATED
Hi, Glauber Hi. In memcg_update_kmem_limit() we do the whole process of limit initialization under a mutex so the situation we need protection from in tcp_update_limit() is impossible. BTW once set, the 'activated' flag is never cleared and never checked alone, only along with the 'active' flag, that's why I doubt we need it at all. The updates are protected by a mutex, but the readers are not, and should not. So we can still be patching the readers, and the double-flag was initially used so we can make sure that both flags are only set after the static branches are in. Note that one of the flags is set inside memcg_update_cache_sizes(). After that, we call static_key_slow_inc(). At this point, someone whose code is patched in could start accounting, but it shouldn't - because not all sites are patched in. So after the update is done, we set the other flag, and now everybody will start going through. Could you do something clever with just one flag? Probably yes. But I doubt it would be that much cleaner, this is just the way that patching sites work. -- E Mare, Libertas ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH] memcg: remove KMEM_ACCOUNTED_ACTIVATED
Could you do something clever with just one flag? Probably yes. But I doubt it would be that much cleaner, this is just the way that patching sites work. Thank you for spending your time to listen to me. Don't worry! I thank you for carrying this forward. Let me try to explain what is bothering me. We have two state bits for each memcg, 'active' and 'activated'. There are two scenarios where the bits can be modified: 1) The kmem limit is set on a memcg for the first time - memcg_update_kmem_limit(). Here we call memcg_update_cache_sizes(), which sets the 'activated' bit on success, then update static branching, then set the 'active' bit. All three actions are done atomically in respect to other tasks setting the limit due to the set_limit_mutex. After both bits are set, they never get cleared for the memcg. So far so good. But again, note how you yourself describe it: the cations are done atomically *in respect to other tasks setting the limit* But there are also tasks that are running its courses naturally and just allocating memory. For those, some call sites will be on, some will be off. We need to make sure that *none* of them uses the patched site until *all* of them are patched. This has nothing to do with updates, this is all about the readers. 2) When a subgroup of a kmem-active cgroup is created - memcg_propagate_kmem(). Here we copy kmem_account_flags from the parent, then increase static branching refcounter, then call memcg_update_cache_sizes() for the new memcg, which may clear the 'activated' bit on failure. After successful execution, the state bits never get cleared for the new memcg. In scenario 2 there is no need bothering about the flags setting order, because we don't have any tasks in the cgroup yet - the tasks can be moved in only after css_online finishes when we have both of the bits set and the static branching enabled. Actually, we already do not bother about it, because we have both bits set before the cgroup is fully initialized (memcg_update_cache_sizes() is called). Yes, after the first cgroup is set none of that matters. But it is just easier and less error prone just to follow the same path every time. As I have said, if you can come up with a more clever way to deal with the problem above that doesn't involve the double flag - and you can prove it works - I am definitely fine with it. But this is subtle code, and in the past - Michal can attest this - we've changed this being sure it would work just to see it explode in our faces. So although I am willing to review every patch for correctness on that front (I never said I liked the 2-flags scheme...), unless you have a bug or real problem on it, I would advise against changing it if its just to make it more readable. But again, don't take me too seriously on this. If you and Michal think you can come up with something better, I'm all for it. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v13 00/16] kmemcg shrinkers
Please note that in contrast to previous versions this patch-set implements slab shrinking only when we hit the user memory limit so that kmem allocations will still fail if we are below the user memory limit, but close to the kmem limit. This is, because the implementation of kmem-only reclaim was rather incomplete - we had to fail GFP_NOFS allocations since everything we could reclaim was only FS data. I will try to improve this and send in a separate patch-set, but currently it is only worthwhile setting the kmem limit to be greater than the user mem limit just to enable per-memcg slab accounting and reclaim. That is unfortunate, but it makes sense as a first step. -- E Mare, Libertas ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v13 04/16] memcg: move memcg_caches_array_size() function
On Mon, Dec 9, 2013 at 12:05 PM, Vladimir Davydov vdavy...@parallels.com wrote: I need to move this up a bit, and I am doing in a separate patch just to reduce churn in the patch that needs it. Signed-off-by: Vladimir Davydov vdavy...@parallels.com Reviewed-by: Glauber Costa glom...@openvz.org ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v13 05/16] vmscan: move call to shrink_slab() to shrink_zones()
On Mon, Dec 9, 2013 at 12:05 PM, Vladimir Davydov vdavy...@parallels.com wrote: This reduces the indentation level of do_try_to_free_pages() and removes extra loop over all eligible zones counting the number of on-LRU pages. Looks correct to me. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v13 14/16] vmpressure: in-kernel notifications
On Mon, Dec 9, 2013 at 12:05 PM, Vladimir Davydov vdavy...@parallels.com wrote: From: Glauber Costa glom...@openvz.org During the past weeks, it became clear to us that the shrinker interface It has been more than a few weeks by now =) ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v13 13/16] vmscan: take at least one pass with shrinkers
On Tue, Dec 10, 2013 at 3:50 PM, Vladimir Davydov vdavy...@parallels.com wrote: On 12/10/2013 08:18 AM, Dave Chinner wrote: On Mon, Dec 09, 2013 at 12:05:54PM +0400, Vladimir Davydov wrote: From: Glauber Costa glom...@openvz.org In very low free kernel memory situations, it may be the case that we have less objects to free than our initial batch size. If this is the case, it is better to shrink those, and open space for the new workload then to keep them and fail the new allocations. In particular, we are concerned with the direct reclaim case for memcg. Although this same technique can be applied to other situations just as well, we will start conservative and apply it for that case, which is the one that matters the most. This should be at the start of the series. Since Glauber wanted to introduce this only for memcg-reclaim first, this can't be at the start of the series, but I'll move it to go immediately after per-memcg shrinking core in the next iteration. Thanks. So, the reason for that being memcg only, is that the reclaim for small objects triggered a bunch of XFS regressions (I am sure the regressions are general, but I've tested them using ZFS). In theory they shouldn't, so we can try to make it global again, so long as it comes together with benchmarks demonstrating that it is a safe change. I am not sure the filesystem people would benefit from that directly, though. So it may not be worth the hassle... ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] Race in memcg kmem?
On Tue, Dec 10, 2013 at 5:59 PM, Vladimir Davydov vdavy...@parallels.com wrote: Hi, Looking through the per-memcg kmem_cache initialization code, I have a bad feeling that it is prone to a race. Before getting to fixing it, I'd like to ensure this race is not only in my imagination. Here it goes. We keep per-memcg kmem caches in the memcg_params field of each root cache. The memcg_params array is grown dynamically by memcg_update_cache_size(). I guess that if this function is executed concurrently with memcg_create_kmem_cache() we can get a race resulting in a memory leak. Ok, let's see. -- memcg_create_kmem_cache(memcg, cachep) -- creates a new kmem_cache corresponding to a memcg and assigns it to the root cache; called in the background - it is OK to have several such functions trying to create a cache for the same memcg concurrently, but only one of them should succeed. Yes. @cachep is the root cache @memcg is the memcg we want to create a cache for. The function: A1) assures there is no cache corresponding to the memcg (if it is we have nothing to do): idx = memcg_cache_id(memcg); if (cachep-memcg_params[idx]) goto out; A2) creates and assigns a new cache: new_cachep = kmem_cache_dup(memcg, cachep); Please note this cannot proceed in parallel with memcg_update_cache_size(), because they both take the slab mutex. // init new_cachep cachep-memcg_params-memcg_caches[idx] = new_cachep; -- memcg_update_cache_size(s, num_groups) -- grows s-memcg_params to accomodate data for num_groups memcg's @s is the root cache whose memcg_params we want to grow @num_groups is the new number of kmem-active cgroups (defines the new size of memcg_params array). The function: B1) allocates and assigns a new cache: cur_params = s-memcg_params; s-memcg_params = kzalloc(size, GFP_KERNEL); B2) copies per-memcg cache ptrs from the old memcg_params array to the new one: for (i = 0; i memcg_limited_groups_array_size; i++) { if (!cur_params-memcg_caches[i]) continue; s-memcg_params-memcg_caches[i] = cur_params-memcg_caches[i]; } B3) frees the old array: kfree(cur_params); Since these two functions do not share any mutexes, we can get the They do share a mutex, the slab mutex. following race: Assume, by the time Cpu0 gets to memcg_create_kmem_cache(), the memcg cache has already been created by another thread, so this function should do nothing. Cpu0Cpu1 B1 A1 we haven't initialized memcg_params yet so Cpu0 will proceed to A2 to alloc and assign a new cache A2 B2 Cpu1 rewrites the memcg cache ptr set by Cpu0 at A2 - a memory leak? B3 I'd like to add that even if I'm right about the race, this is rather not critical, because memcg_update_cache_sizes() is called very rarely. Every race is critical. So, I am a bit lost by your description. Get back to me if I got anything wrong, but I am think that the point that you're missing is that all heavy slab operations take the slab_mutex underneath, and that includes cache creation and update. BTW, it seems to me that the way we update memcg_params in memcg_update_cache_size() make cache_from_memcg_idx() prone to use-after-free: static inline struct kmem_cache * cache_from_memcg_idx(struct kmem_cache *s, int idx) { if (!s-memcg_params) return NULL; return s-memcg_params-memcg_caches[idx]; } This is equivalent to 1) struct memcg_cache_params *params = s-memcg_params; 2) return params-memcg_caches[idx]; If memcg_update_cache_size() is executed between steps 1 and 2 on another CPU, at step 2 we will dereference memcg_params that has already been freed. This is very unlikely, but still possible. Perhaps, we should free old memcg params only after a sync_rcu()? You seem to be right in this one. Indeed, if my mind does not betray me, That is how I freed the LRUs. (with synchronize_rcus). ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v13 11/16] mm: list_lru: add per-memcg lists
OK, as far as I can tell, this is introducing a per-node, per-memcg LRU lists. Is that correct? If so, then that is not what Glauber and I originally intended for memcg LRUs. per-node LRUs are expensive in terms of memory and cross multiplying them by the number of memcgs in a system was not a good use of memory. According to Glauber, most memcgs are small and typically confined to a single node or two by external means and therefore don't need the scalability numa aware LRUs provide. Hence the idea was that the memcg LRUs would just be a single LRU list, just like a non-numa aware list_lru instantiation. IOWs, this is the structure that we had decided on as the best compromise between memory usage, complexity and memcg awareness: Sorry for jumping late into this particular e-mail. I just wanted to point out that the reason I adopted such matrix in my design was that it actually uses less memory this way. My reasoning for this was explained in the original patch that I posted that contained that implementation. This is because whenever an object would go on a memcg list, it *would not* go on the global list. Therefore, to keep information about nodes for global reclaim, you have to put them in node-lists. memcg reclaim, however, would reclaim regardless of node information. In global reclaim, the memcg lists would be scanned obeying the node structure in the lists. Because that has a fixed cost, it ends up using less memory that having a second list pointer in the objects, which is something that scale with the number of objects. Not to mention, that cost would be incurred even with memcg not being in use, which is something that we would like to avoid. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 1/2] memcg: fix memcg_size() calculation
On Mon, Dec 16, 2013 at 8:47 PM, Michal Hocko mho...@suse.cz wrote: On Sat 14-12-13 12:15:33, Vladimir Davydov wrote: The mem_cgroup structure contains nr_node_ids pointers to mem_cgroup_per_node objects, not the objects themselves. Ouch! This is 2k per node which is wasted. What a shame I haven't noticed this back then when reviewing 45cf7ebd5a033 (memcg: reduce the size of struct memcg 244-fold) IIRC, they weren't pointers back then. I think they were embedded in the structure, and I let them embedded. My mind may be tricking me, but I think I recall that Johannes changed them to pointers in a later time. No ? In any case, this is correct. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v7 00/11] per-cgroup cpu-stat
Peter et. al, I am coming again with this series, hoping this is a better time for you all to look at it. I am *not* going as far as marking cpuacct deprecated, because I think it deserves a special discussion (even though my position in this matter is widely known), but all the infrastructure to make it happen is here. But after this, it should be a matter of setting a flag (or not). Through this patchset I am making cpu cgroup provide the same functionality of cpuacct, and now with a more clear semantics, I attempt to provide userspace with enough information to reconstruct per-container version of files like /proc/stat. In particular, we are interested in knowing the per-cgroup slices of user time, system time, wait time, number of processes, and a variety of statistics. To make sure we can count nr of switches correctly, I am ressurecting one of Peter's patches that apparently got nowhere in the past. Glauber Costa (8): don't call cpuacct_charge in stop_task.c sched: adjust exec_clock to use it as cpu usage metric cpuacct: don't actually do anything. sched: document the cpu cgroup. sched: account guest time per-cgroup as well. sched: record per-cgroup number of context switches sched: change nr_context_switches calculation. sched: introduce cgroup file stat_percpu Peter Zijlstra (1): sched: Push put_prev_task() into pick_next_task() Tejun Heo (2): cgroup: implement CFTYPE_NO_PREFIX cgroup, sched: let cpu serve the same files as cpuacct Documentation/cgroups/cpu.txt | 99 +++ include/linux/cgroup.h| 1 + kernel/cgroup.c | 16 +- kernel/sched/core.c | 385 -- kernel/sched/cputime.c| 33 +++- kernel/sched/fair.c | 39 - kernel/sched/idle_task.c | 9 +- kernel/sched/rt.c | 42 +++-- kernel/sched/sched.h | 35 +++- kernel/sched/stop_task.c | 8 +- 10 files changed, 626 insertions(+), 41 deletions(-) create mode 100644 Documentation/cgroups/cpu.txt -- 1.8.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v7 01/11] don't call cpuacct_charge in stop_task.c
Commit 8f618968 changed stop_task to do the same bookkeping as the other classes. However, the call to cpuacct_charge() doesn't affect the scheduler decisions at all, and doesn't need to be moved over. Moreover, being a kthread, the migration thread won't belong to any cgroup anyway, rendering this call quite useless. Signed-off-by: Glauber Costa glom...@openvz.org CC: Mike Galbraith mgalbra...@suse.de CC: Peter Zijlstra a.p.zijls...@chello.nl CC: Thomas Gleixner t...@linutronix.de --- kernel/sched/stop_task.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index e08fbee..4142d7e 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -68,7 +68,6 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev) account_group_exec_runtime(curr, delta_exec); curr-se.exec_start = rq_clock_task(rq); - cpuacct_charge(curr, delta_exec); } static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued) -- 1.8.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v7 04/11] sched: adjust exec_clock to use it as cpu usage metric
exec_clock already provides per-group cpu usage metrics, and can be reused by cpuacct in case cpu and cpuacct are comounted. However, it is only provided by tasks in fair class. Doing the same for rt is easy, and can be done in an already existing hierarchy loop. This is an improvement over the independent hierarchy walk executed by cpuacct. Signed-off-by: Glauber Costa glom...@openvz.org CC: Dave Jones da...@redhat.com CC: Ben Hutchings b...@decadent.org.uk CC: Peter Zijlstra a.p.zijls...@chello.nl CC: Paul Turner p...@google.com CC: Lennart Poettering lenn...@poettering.net CC: Kay Sievers kay.siev...@vrfy.org CC: Tejun Heo t...@kernel.org --- kernel/sched/rt.c| 1 + kernel/sched/sched.h | 3 +++ 2 files changed, 4 insertions(+) diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index e9f8dcd..4a21045 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -907,6 +907,7 @@ static void update_curr_rt(struct rq *rq) for_each_sched_rt_entity(rt_se) { rt_rq = rt_rq_of_se(rt_se); + schedstat_add(rt_rq, exec_clock, delta_exec); if (sched_rt_runtime(rt_rq) != RUNTIME_INF) { raw_spin_lock(rt_rq-rt_runtime_lock); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 765c687..b05dd84 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -254,6 +254,7 @@ struct cfs_rq { unsigned int nr_running, h_nr_running; u64 exec_clock; + u64 prev_exec_clock; u64 min_vruntime; #ifndef CONFIG_64BIT u64 min_vruntime_copy; @@ -356,6 +357,8 @@ struct rt_rq { struct plist_head pushable_tasks; #endif int rt_throttled; + u64 exec_clock; + u64 prev_exec_clock; u64 rt_time; u64 rt_runtime; /* Nests inside the rq lock: */ -- 1.8.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v7 06/11] sched: document the cpu cgroup.
The CPU cgroup is so far, undocumented. Although data exists in the Documentation directory about its functioning, it is usually spread, and/or presented in the context of something else. This file consolidates all cgroup-related information about it. Signed-off-by: Glauber Costa glom...@openvz.org --- Documentation/cgroups/cpu.txt | 81 +++ 1 file changed, 81 insertions(+) create mode 100644 Documentation/cgroups/cpu.txt diff --git a/Documentation/cgroups/cpu.txt b/Documentation/cgroups/cpu.txt new file mode 100644 index 000..072fd58 --- /dev/null +++ b/Documentation/cgroups/cpu.txt @@ -0,0 +1,81 @@ +CPU Controller +-- + +The CPU controller is responsible for grouping tasks together that will be +viewed by the scheduler as a single unit. The CFS scheduler will first divide +CPU time equally between all entities in the same level, and then proceed by +doing the same in the next level. Basic use cases for that are described in the +main cgroup documentation file, cgroups.txt. + +Users of this functionality should be aware that deep hierarchies will of +course impose scheduler overhead, since the scheduler will have to take extra +steps and look up additional data structures to make its final decision. + +Through the CPU controller, the scheduler is also able to cap the CPU +utilization of a particular group. This is particularly useful in environments +in which CPU is paid for by the hour, and one values predictability over +performance. + +CPU Accounting +-- + +The CPU cgroup will also provide additional files under the prefix cpuacct. +Those files provide accounting statistics and were previously provided by the +separate cpuacct controller. Although the cpuacct controller will still be kept +around for compatibility reasons, its usage is discouraged. If both the CPU and +cpuacct controllers are present in the system, distributors are encouraged to +always mount them together. + +Files +- + +The CPU controller exposes the following files to the user: + + - cpu.shares: The weight of each group living in the same hierarchy, that + translates into the amount of CPU it is expected to get. Upon cgroup creation, + each group gets assigned a default of 1024. The percentage of CPU assigned to + the cgroup is the value of shares divided by the sum of all shares in all + cgroups in the same level. + + - cpu.cfs_period_us: The duration in microseconds of each scheduler period, for + bandwidth decisions. This defaults to 10us or 100ms. Larger periods will + improve throughput at the expense of latency, since the scheduler will be able + to sustain a cpu-bound workload for longer. The opposite of true for smaller + periods. Note that this only affects non-RT tasks that are scheduled by the + CFS scheduler. + +- cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us + in for the current group will be allowed to run. For instance, if it is set to + half of cpu_period_us, the cgroup will only be able to peak run for 50 % of + the time. One should note that this represents aggregate time over all CPUs + in the system. Therefore, in order to allow full usage of two CPUs, for + instance, one should set this value to twice the value of cfs_period_us. + +- cpu.stat: statistics about the bandwidth controls. No data will be presented + if cpu.cfs_quota_us is not set. The file presents three + numbers: + nr_periods: how many full periods have been elapsed. + nr_throttled: number of times we exausted the full allowed bandwidth + throttled_time: total time the tasks were not run due to being overquota + + - cpu.rt_runtime_us and cpu.rt_period_us: Those files are the RT-tasks + analogous to the CFS files cfs_quota_us and cfs_period_us. One important + difference, though, is that while the cfs quotas are upper bounds that + won't necessarily be met, the rt runtimes form a stricter guarantee. + Therefore, no overlap is allowed. Implications of that are that given a + hierarchy with multiple children, the sum of all rt_runtime_us may not exceed + the runtime of the parent. Also, a rt_runtime_us of 0, means that no rt tasks + can ever be run in this cgroup. For more information about rt tasks runtime + assignments, see scheduler/sched-rt-group.txt + + - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks + in this group. + + - cpuacct.usage_percpu: The CPU time, in nanoseconds, consumed by all tasks in + this group, separated by CPU. The format is an space-separated array of time + values, one for each present CPU. + + - cpuacct.stat: aggregate user and system time consumed by tasks in this group. + The format is + user: x + system: y -- 1.8.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v7 07/11] sched: account guest time per-cgroup as well.
We already track multiple tick statistics per-cgroup, using the task_group_account_field facility. This patch accounts guest_time in that manner as well. Signed-off-by: Glauber Costa glom...@openvz.org CC: Peter Zijlstra a.p.zijls...@chello.nl CC: Paul Turner p...@google.com --- kernel/sched/cputime.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 74a44ef..e653e52 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -183,8 +183,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime, static void account_guest_time(struct task_struct *p, cputime_t cputime, cputime_t cputime_scaled) { - u64 *cpustat = kcpustat_this_cpu-cpustat; - /* Add guest time to process. */ p-utime += cputime; p-utimescaled += cputime_scaled; @@ -193,11 +191,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime, /* Add guest time to cpustat. */ if (TASK_NICE(p) 0) { - cpustat[CPUTIME_NICE] += (__force u64) cputime; - cpustat[CPUTIME_GUEST_NICE] += (__force u64) cputime; + task_group_account_field(p, CPUTIME_NICE, (__force u64) cputime); + task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime); } else { - cpustat[CPUTIME_USER] += (__force u64) cputime; - cpustat[CPUTIME_GUEST] += (__force u64) cputime; + task_group_account_field(p, CPUTIME_USER, (__force u64) cputime); + task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime); } } -- 1.8.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v7 08/11] sched: Push put_prev_task() into pick_next_task()
From: Peter Zijlstra a.p.zijls...@chello.nl In order to avoid having to do put/set on a whole cgroup hierarchy when we context switch, push the put into pick_next_task() so that both operations are in the same function. Further changes then allow us to possibly optimize away redundant work. [ glom...@openvz.org: incorporated mailing list feedback ] Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl Signed-off-by: Glauber Costa glom...@openvz.org --- kernel/sched/core.c | 20 +++- kernel/sched/fair.c | 6 +- kernel/sched/idle_task.c | 6 +- kernel/sched/rt.c| 27 --- kernel/sched/sched.h | 8 +++- kernel/sched/stop_task.c | 5 - 6 files changed, 44 insertions(+), 28 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0fa0f87..3b1441f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2295,18 +2295,11 @@ static inline void schedule_debug(struct task_struct *prev) schedstat_inc(this_rq(), sched_count); } -static void put_prev_task(struct rq *rq, struct task_struct *prev) -{ - if (prev-on_rq || rq-skip_clock_update 0) - update_rq_clock(rq); - prev-sched_class-put_prev_task(rq, prev); -} - /* * Pick up the highest-prio task: */ static inline struct task_struct * -pick_next_task(struct rq *rq) +pick_next_task(struct rq *rq, struct task_struct *prev) { const struct sched_class *class; struct task_struct *p; @@ -2316,13 +2309,13 @@ pick_next_task(struct rq *rq) * the fair class we can call that function directly: */ if (likely(rq-nr_running == rq-cfs.h_nr_running)) { - p = fair_sched_class.pick_next_task(rq); + p = fair_sched_class.pick_next_task(rq, prev); if (likely(p)) return p; } for_each_class(class) { - p = class-pick_next_task(rq); + p = class-pick_next_task(rq, prev); if (p) return p; } @@ -2417,8 +2410,9 @@ need_resched: if (unlikely(!rq-nr_running)) idle_balance(cpu, rq); - put_prev_task(rq, prev); - next = pick_next_task(rq); + if (prev-on_rq || rq-skip_clock_update 0) + update_rq_clock(rq); + next = pick_next_task(rq, prev); clear_tsk_need_resched(prev); rq-skip_clock_update = 0; @@ -4395,7 +4389,7 @@ static void migrate_tasks(unsigned int dead_cpu) if (rq-nr_running == 1) break; - next = pick_next_task(rq); + next = pick_next_task(rq, NULL); BUG_ON(!next); next-sched_class-put_prev_task(rq, next); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a7f5c39..414cf1c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3628,7 +3628,8 @@ preempt: set_last_buddy(se); } -static struct task_struct *pick_next_task_fair(struct rq *rq) +static struct task_struct * +pick_next_task_fair(struct rq *rq, struct task_struct *prev) { struct task_struct *p; struct cfs_rq *cfs_rq = rq-cfs; @@ -3637,6 +3638,9 @@ static struct task_struct *pick_next_task_fair(struct rq *rq) if (!cfs_rq-nr_running) return NULL; + if (prev) + prev-sched_class-put_prev_task(rq, prev); + do { se = pick_next_entity(cfs_rq); set_next_entity(cfs_rq, se); diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c index d8da010..4f0a332 100644 --- a/kernel/sched/idle_task.c +++ b/kernel/sched/idle_task.c @@ -33,8 +33,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl resched_task(rq-idle); } -static struct task_struct *pick_next_task_idle(struct rq *rq) +static struct task_struct * +pick_next_task_idle(struct rq *rq, struct task_struct *prev) { + if (prev) + prev-sched_class-put_prev_task(rq, prev); + schedstat_inc(rq, sched_goidle); #ifdef CONFIG_SMP /* Trigger the post schedule to do an idle_enter for CFS */ diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 4a21045..6c41fa1 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1330,15 +1330,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq) { struct sched_rt_entity *rt_se; struct task_struct *p; - struct rt_rq *rt_rq; - - rt_rq = rq-rt; - - if (!rt_rq-rt_nr_running) - return NULL; - - if (rt_rq_throttled(rt_rq)) - return NULL; + struct rt_rq *rt_rq = rq-rt; do { rt_se = pick_next_rt_entity(rq, rt_rq); @@ -1352,9 +1344,22 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq) return p; } -static struct task_struct *pick_next_task_rt(struct rq *rq) +static struct
[Devel] [PATCH v7 09/11] sched: record per-cgroup number of context switches
Context switches are, to this moment, a property of the runqueue. When running containers, we would like to be able to present a separate figure for each container (or cgroup, in this context). The chosen way to accomplish this is to increment a per cfs_rq or rt_rq, depending on the task, for each of the sched entities involved, up to the parent. It is trivial to note that for the parent cgroup, we always add 1 by doing this. Also, we are not introducing any hierarchy walks in here. An already existent walk is reused. There are, however, two main issues: 1. the traditional context switch code only increment nr_switches when a different task is being inserted in the rq. Eventually, albeit not likely, we will pick the same task as before. Since for cfq and rt we only now which task will be next after the walk, we need to do the walk again, decrementing 1. Since this is by far not likely, it seems a fair price to pay. 2. Those figures do not include switches from and to the idle or stop task. Those need to be recorded separately, which will happen in a follow up patch. Signed-off-by: Glauber Costa glom...@openvz.org CC: Peter Zijlstra a.p.zijls...@chello.nl CC: Paul Turner p...@google.com --- kernel/sched/fair.c | 18 ++ kernel/sched/rt.c| 15 +-- kernel/sched/sched.h | 3 +++ 3 files changed, 34 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 414cf1c..b29e5dd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3642,6 +3642,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev) prev-sched_class-put_prev_task(rq, prev); do { + if (likely(prev)) + cfs_rq-nr_switches++; se = pick_next_entity(cfs_rq); set_next_entity(cfs_rq, se); cfs_rq = group_cfs_rq(se); @@ -3651,6 +3653,22 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev) if (hrtick_enabled(rq)) hrtick_start_fair(rq, p); + /* +* This condition is extremely unlikely, and most of the time will just +* consist of this unlikely branch, which is extremely cheap. But we +* still need to have it, because when we first loop through cfs_rq's, +* we can't possibly know which task we will pick. The call to +* set_next_entity above is not meant to mess up the tree in this case, +* so this should give us the same chain, in the same order. +*/ + if (unlikely(p == prev)) { + se = p-se; + for_each_sched_entity(se) { + cfs_rq = cfs_rq_of(se); + cfs_rq-nr_switches--; + } + } + return p; } diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 6c41fa1..894fc0b 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1326,13 +1326,16 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq, return next; } -static struct task_struct *_pick_next_task_rt(struct rq *rq) +static struct task_struct * +_pick_next_task_rt(struct rq *rq, struct task_struct *prev) { struct sched_rt_entity *rt_se; struct task_struct *p; struct rt_rq *rt_rq = rq-rt; do { + if (likely(prev)) + rt_rq-rt_nr_switches++; rt_se = pick_next_rt_entity(rq, rt_rq); BUG_ON(!rt_se); rt_rq = group_rt_rq(rt_se); @@ -1341,6 +1344,14 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq) p = rt_task_of(rt_se); p-se.exec_start = rq_clock_task(rq); + /* See fair.c for an explanation on this */ + if (unlikely(p == prev)) { + for_each_sched_rt_entity(rt_se) { + rt_rq = rt_rq_of_se(rt_se); + rt_rq-rt_nr_switches--; + } + } + return p; } @@ -1359,7 +1370,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev) if (prev) prev-sched_class-put_prev_task(rq, prev); - p = _pick_next_task_rt(rq); + p = _pick_next_task_rt(rq, prev); /* The running task is never eligible for pushing */ if (p) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a39eb9b..39d3284 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -273,6 +273,7 @@ struct cfs_rq { unsigned int nr_spread_over; #endif + u64 nr_switches; #ifdef CONFIG_SMP /* * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be @@ -342,6 +343,8 @@ static inline int rt_bandwidth_enabled(void) struct rt_rq { struct rt_prio_array active; unsigned int rt_nr_running; + u64 rt_nr_switches; + #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED struct { int curr; /* highest queued rt task prio */ -- 1.8.1.4
[Devel] [PATCH v7 10/11] sched: change nr_context_switches calculation.
This patch changes the calculation of nr_context_switches. The variable nr_switches is now used to account for the number of transition to the idle task, or stop task. It is removed from the schedule() path. The total calculation can be made using the fact that the transitions to fair and rt classes are recorded in the root_task_group. One can easily derive the total figure by adding those quantities together. Signed-off-by: Glauber Costa glom...@openvz.org CC: Peter Zijlstra a.p.zijls...@chello.nl CC: Paul Turner p...@google.com --- kernel/sched/core.c | 17 +++-- kernel/sched/idle_task.c | 3 +++ kernel/sched/stop_task.c | 2 ++ 3 files changed, 20 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3b1441f..892c828 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2031,13 +2031,27 @@ unsigned long nr_running(void) return sum; } +#ifdef CONFIG_FAIR_GROUP_SCHED +#define cfs_nr_switches(tg, cpu) (tg)-cfs_rq[cpu]-nr_switches +#else +#define cfs_nr_switches(tg, cpu) cpu_rq(cpu)-cfs.nr_switches +#endif +#ifdef CONFIG_RT_GROUP_SCHED +#define rt_nr_switches(tg, cpu) (tg)-rt_rq[cpu]-rt_nr_switches +#else +#define rt_nr_switches(tg, cpu) cpu_rq(cpu)-rt.rt_nr_switches +#endif + unsigned long long nr_context_switches(void) { int i; unsigned long long sum = 0; - for_each_possible_cpu(i) + for_each_possible_cpu(i) { + sum += cfs_nr_switches(root_task_group, i); + sum += rt_nr_switches(root_task_group, i); sum += cpu_rq(i)-nr_switches; + } return sum; } @@ -2417,7 +2431,6 @@ need_resched: rq-skip_clock_update = 0; if (likely(prev != next)) { - rq-nr_switches++; rq-curr = next; ++*switch_count; diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c index 4f0a332..923286c 100644 --- a/kernel/sched/idle_task.c +++ b/kernel/sched/idle_task.c @@ -39,6 +39,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev) if (prev) prev-sched_class-put_prev_task(rq, prev); + if (prev != rq-idle) + rq-nr_switches++; + schedstat_inc(rq, sched_goidle); #ifdef CONFIG_SMP /* Trigger the post schedule to do an idle_enter for CFS */ diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c index 6adf6a9..48c572a 100644 --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -32,6 +32,8 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev) stop-se.exec_start = rq_clock_task(rq); if (prev) prev-sched_class-put_prev_task(rq, prev); + if (prev != rq-stop) + rq-nr_switches++; return stop; } -- 1.8.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v7 11/11] sched: introduce cgroup file stat_percpu
The file cpu.stat_percpu will show various scheduler related information, that are usually available to the top level through other files. For instance, most of the meaningful data in /proc/stat is presented here. Given this file, a container can easily construct a local copy of /proc/stat for internal consumption. The data we export is comprised of: * all the tick information, previously available only through cpuacct, like user time, system time, etc. * wait time, which can be used to construct analogous information to steal time in hypervisors, * nr_switches and nr_running, which are cgroup-local versions of their global counterparts. The file format consists of a one-line header that describes the fields being listed. No guarantee is given that the fields will be kept the same between kernel releases, and readers should always check the header in order to introspect it. Each of the following lines will show the respective field value for each of the possible cpus in the system. All values are show in nanoseconds. One example output for this file is: cpu user nice system irq softirq guest guest_nice wait nr_switches nr_running cpu0 47100 0 1500 0 0 0 0 1996534 7205 1 cpu1 58800 0 1700 0 0 0 0 2848680 6510 1 cpu2 50500 0 1400 0 0 0 0 2350771 6183 1 cpu3 47200 0 1600 0 0 0 0 19766345 6277 2 Signed-off-by: Glauber Costa glom...@openvz.org CC: Peter Zijlstra a.p.zijls...@chello.nl CC: Paul Turner p...@google.com --- Documentation/cgroups/cpu.txt | 18 +++ kernel/sched/core.c | 109 ++ kernel/sched/fair.c | 14 ++ kernel/sched/sched.h | 10 +++- 4 files changed, 150 insertions(+), 1 deletion(-) diff --git a/Documentation/cgroups/cpu.txt b/Documentation/cgroups/cpu.txt index 072fd58..d40e500 100644 --- a/Documentation/cgroups/cpu.txt +++ b/Documentation/cgroups/cpu.txt @@ -68,6 +68,24 @@ The CPU controller exposes the following files to the user: can ever be run in this cgroup. For more information about rt tasks runtime assignments, see scheduler/sched-rt-group.txt + - cpu.stat_percpu: Various scheduler statistics for the current group. The + information provided in this file is akin to the one displayed in /proc/stat, + except for the fact that it is cgroup-aware. The file format consists of a + one-line header that describes the fields being listed. No guarantee is + given that the fields will be kept the same between kernel releases, and + readers should always check the header in order to introspect it. + + Each of the following lines will show the respective field value for + each of the possible cpus in the system. All values are show in + nanoseconds. One example output for this file is: + + cpu user nice system irq softirq guest guest_nice wait nr_switches nr_running + cpu0 47100 0 1500 0 0 0 0 1996534 7205 1 + cpu1 58800 0 1700 0 0 0 0 2848680 6510 1 + cpu2 50500 0 1400 0 0 0 0 2350771 6183 1 + cpu3 47200 0 1600 0 0 0 0 19766345 6277 2 + + - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks in this group. diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 892c828..e2963ec 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7239,6 +7239,7 @@ static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu) #else static inline u64 cfs_exec_clock(struct task_group *tg, int cpu) { + return 0; } static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu) @@ -7670,6 +7671,108 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft) } #endif /* CONFIG_RT_GROUP_SCHED */ +#ifdef CONFIG_SCHEDSTATS + +#ifdef CONFIG_FAIR_GROUP_SCHED +#define fair_rq(field, tg, i) (tg)-cfs_rq[i]-field +#else +#define fair_rq(field, tg, i) 0 +#endif + +#ifdef CONFIG_RT_GROUP_SCHED +#define rt_rq(field, tg, i) (tg)-rt_rq[i]-field +#else +#define rt_rq(field, tg, i) 0 +#endif + +static u64 tg_nr_switches(struct task_group *tg, int cpu) +{ + /* nr_switches, which counts idle and stop task, is added to all tgs */ + return cpu_rq(cpu)-nr_switches + + cfs_nr_switches(tg, cpu) + rt_nr_switches(tg, cpu); +} + +static u64 tg_nr_running(struct task_group *tg, int cpu) +{ + /* +* because of autogrouped groups in root_task_group, the +* following does not hold. +*/ + if (tg != root_task_group) + return rt_rq(rt_nr_running, tg, cpu) + fair_rq(nr_running, tg, cpu); + + return cpu_rq(cpu)-nr_running; +} + +static u64 tg_wait(struct task_group *tg, int cpu) +{ + u64 val; + + if (tg != root_task_group) + val = cfs_read_wait(tg, cpu); + else + /* +* There are many errors here that we are accumulating. +* However, we only provide this in the interest of having
[Devel] [PATCH v7 03/11] cgroup, sched: let cpu serve the same files as cpuacct
From: Tejun Heo t...@kernel.org cpuacct being on a separate hierarchy is one of the main cgroup related complaints from scheduler side and the consensus seems to be * Allowing cpuacct to be a separate controller was a mistake. In general multiple controllers on the same type of resource should be avoided, especially accounting-only ones. * Statistics provided by cpuacct are useful and should instead be served by cpu. This patch makes cpu maintain and serve all cpuacct.* files and make cgroup core ignore cpuacct if it's co-mounted with cpu. This is a step in deprecating cpuacct. The next patch will allow disabling or dropping cpuacct without affecting userland too much. Note that this creates some discrepancies in /proc/cgroups and /proc/PID/cgroup. The co-mounted cpuacct won't be reflected correctly there. cpuacct will eventually be removed completely probably except for the statistics filenames and I'd like to keep the amount of compatbility hackery to minimum as much as possible. The cpu statistics implementation isn't optimized in any way. It's mostly verbatim copy from cpuacct. The goal is allowing quick disabling and removal of CONFIG_CGROUP_CPUACCT and creating a base on top of which cpu can implement proper optimization. [ glommer: don't call *_charge in stop_task.c ] Signed-off-by: Tejun Heo t...@kernel.org Signed-off-by: Glauber Costa glom...@openvz.org Cc: Peter Zijlstra pet...@infradead.org Cc: Michal Hocko mho...@suse.cz Cc: Kay Sievers kay.siev...@vrfy.org Cc: Lennart Poettering mzxre...@0pointer.de Cc: Dave Jones da...@redhat.com Cc: Ben Hutchings b...@decadent.org.uk Cc: Paul Turner p...@google.com --- kernel/cgroup.c| 13 kernel/sched/core.c| 173 + kernel/sched/cputime.c | 23 +++ kernel/sched/fair.c| 1 + kernel/sched/rt.c | 1 + kernel/sched/sched.h | 7 ++ 6 files changed, 218 insertions(+) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 8bbeb4d..01c3ed0 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1256,6 +1256,19 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts) } /* +* cpuacct is deprecated and cpu will serve the same stat files. +* If co-mount with cpu is requested, ignore cpuacct. Note that +* this creates some discrepancies in /proc/cgroups and +* /proc/PID/cgroup. +* +* https://lkml.org/lkml/2012/9/13/542 +*/ +#if IS_ENABLED(CONFIG_CGROUP_SCHED) IS_ENABLED(CONFIG_CGROUP_CPUACCT) + if ((opts-subsys_mask (1 cpu_cgroup_subsys_id)) + (opts-subsys_mask (1 cpuacct_subsys_id))) + opts-subsys_mask = ~(1 cpuacct_subsys_id); +#endif + /* * Option noprefix was introduced just for backward compatibility * with the old cpuset, so we allow noprefix only if mounting just * the cpuset subsystem. diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 36f85be..5ae1adf 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6327,6 +6327,7 @@ int in_sched_functions(unsigned long addr) */ struct task_group root_task_group; LIST_HEAD(task_groups); +static DEFINE_PER_CPU(u64, root_tg_cpuusage); #endif DECLARE_PER_CPU(cpumask_var_t, load_balance_mask); @@ -6385,6 +6386,8 @@ void __init sched_init(void) #endif /* CONFIG_RT_GROUP_SCHED */ #ifdef CONFIG_CGROUP_SCHED + root_task_group.cpustat = kernel_cpustat; + root_task_group.cpuusage = root_tg_cpuusage; list_add(root_task_group.list, task_groups); INIT_LIST_HEAD(root_task_group.children); INIT_LIST_HEAD(root_task_group.siblings); @@ -6665,6 +6668,8 @@ static void free_sched_group(struct task_group *tg) free_fair_sched_group(tg); free_rt_sched_group(tg); autogroup_free(tg); + free_percpu(tg-cpuusage); + free_percpu(tg-cpustat); kfree(tg); } @@ -6677,6 +6682,11 @@ struct task_group *sched_create_group(struct task_group *parent) if (!tg) return ERR_PTR(-ENOMEM); + tg-cpuusage = alloc_percpu(u64); + tg-cpustat = alloc_percpu(struct kernel_cpustat); + if (!tg-cpuusage || !tg-cpustat) + goto err; + if (!alloc_fair_sched_group(tg, parent)) goto err; @@ -6776,6 +6786,24 @@ void sched_move_task(struct task_struct *tsk) task_rq_unlock(rq, tsk, flags); } + +void task_group_charge(struct task_struct *tsk, u64 cputime) +{ + struct task_group *tg; + int cpu = task_cpu(tsk); + + rcu_read_lock(); + + tg = container_of(task_subsys_state(tsk, cpu_cgroup_subsys_id), + struct task_group, css); + + for (; tg; tg = tg-parent) { + u64 *cpuusage = per_cpu_ptr(tg-cpuusage, cpu); + *cpuusage += cputime; + } + + rcu_read_unlock(); +} #endif /* CONFIG_CGROUP_SCHED */ #if defined
[Devel] [PATCH v7 05/11] cpuacct: don't actually do anything.
All the information we have that is needed for cpuusage (and cpuusage_percpu) is present in schedstats. It is already recorded in a sane hierarchical way. If we have CONFIG_SCHEDSTATS, we don't really need to do any extra work. All former functions become empty inlines. Signed-off-by: Glauber Costa glom...@openvz.org Cc: Peter Zijlstra pet...@infradead.org Cc: Michal Hocko mho...@suse.cz Cc: Kay Sievers kay.siev...@vrfy.org Cc: Lennart Poettering mzxre...@0pointer.de Cc: Dave Jones da...@redhat.com Cc: Ben Hutchings b...@decadent.org.uk Cc: Paul Turner p...@google.com --- kernel/sched/core.c | 102 ++- kernel/sched/sched.h | 10 +++-- 2 files changed, 90 insertions(+), 22 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5ae1adf..0fa0f87 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6787,6 +6787,7 @@ void sched_move_task(struct task_struct *tsk) task_rq_unlock(rq, tsk, flags); } +#ifndef CONFIG_SCHEDSTATS void task_group_charge(struct task_struct *tsk, u64 cputime) { struct task_group *tg; @@ -6804,6 +6805,7 @@ void task_group_charge(struct task_struct *tsk, u64 cputime) rcu_read_unlock(); } +#endif #endif /* CONFIG_CGROUP_SCHED */ #if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH) @@ -7199,22 +7201,92 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup *old_cgrp, sched_move_task(task); } -static u64 task_group_cpuusage_read(struct task_group *tg, int cpu) +/* + * Take rq-lock to make 64-bit write safe on 32-bit platforms. + */ +static inline void lock_rq_dword(int cpu) { - u64 *cpuusage = per_cpu_ptr(tg-cpuusage, cpu); - u64 data; - #ifndef CONFIG_64BIT - /* -* Take rq-lock to make 64-bit read safe on 32-bit platforms. -*/ raw_spin_lock_irq(cpu_rq(cpu)-lock); - data = *cpuusage; +#endif +} + +static inline void unlock_rq_dword(int cpu) +{ +#ifndef CONFIG_64BIT raw_spin_unlock_irq(cpu_rq(cpu)-lock); +#endif +} + +#ifdef CONFIG_SCHEDSTATS +#ifdef CONFIG_FAIR_GROUP_SCHED +static inline u64 cfs_exec_clock(struct task_group *tg, int cpu) +{ + return tg-cfs_rq[cpu]-exec_clock - tg-cfs_rq[cpu]-prev_exec_clock; +} + +static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu) +{ + tg-cfs_rq[cpu]-prev_exec_clock = tg-cfs_rq[cpu]-exec_clock; +} #else - data = *cpuusage; +static inline u64 cfs_exec_clock(struct task_group *tg, int cpu) +{ +} + +static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu) +{ +} +#endif +#ifdef CONFIG_RT_GROUP_SCHED +static inline u64 rt_exec_clock(struct task_group *tg, int cpu) +{ + return tg-rt_rq[cpu]-exec_clock - tg-rt_rq[cpu]-prev_exec_clock; +} + +static inline void rt_exec_clock_reset(struct task_group *tg, int cpu) +{ + tg-rt_rq[cpu]-prev_exec_clock = tg-rt_rq[cpu]-exec_clock; +} +#else +static inline u64 rt_exec_clock(struct task_group *tg, int cpu) +{ + return 0; +} + +static inline void rt_exec_clock_reset(struct task_group *tg, int cpu) +{ +} #endif +static u64 task_group_cpuusage_read(struct task_group *tg, int cpu) +{ + u64 ret = 0; + + lock_rq_dword(cpu); + ret = cfs_exec_clock(tg, cpu) + rt_exec_clock(tg, cpu); + unlock_rq_dword(cpu); + + return ret; +} + +static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val) +{ + lock_rq_dword(cpu); + cfs_exec_clock_reset(tg, cpu); + rt_exec_clock_reset(tg, cpu); + unlock_rq_dword(cpu); +} +#else +static u64 task_group_cpuusage_read(struct task_group *tg, int cpu) +{ + u64 *cpuusage = per_cpu_ptr(tg-cpuusage, cpu); + u64 data; + + lock_rq_dword(cpu); + data = *cpuusage; + unlock_rq_dword(cpu); + return data; } @@ -7222,17 +7294,11 @@ static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val) { u64 *cpuusage = per_cpu_ptr(tg-cpuusage, cpu); -#ifndef CONFIG_64BIT - /* -* Take rq-lock to make 64-bit write safe on 32-bit platforms. -*/ - raw_spin_lock_irq(cpu_rq(cpu)-lock); + lock_rq_dword(cpu); *cpuusage = val; - raw_spin_unlock_irq(cpu_rq(cpu)-lock); -#else - *cpuusage = val; -#endif + unlock_rq_dword(cpu); } +#endif /* return total cpu usage (in nanoseconds) of a group */ static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b05dd84..0e5e795 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -710,8 +710,6 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu) #endif } -extern void task_group_charge(struct task_struct *tsk, u64 cputime); - #else /* CONFIG_CGROUP_SCHED */ static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { } @@ -719,10 +717,14 @@ static inline struct task_group *task_group(struct
[Devel] [PATCH v7 02/11] cgroup: implement CFTYPE_NO_PREFIX
From: Tejun Heo t...@kernel.org When cgroup files are created, cgroup core automatically prepends the name of the subsystem as prefix. This patch adds CFTYPE_NO_PREFIX which disables the automatic prefix. This will be used to deprecate cpuacct which will make cpu create and serve the cpuacct files. Signed-off-by: Tejun Heo t...@kernel.org Cc: Peter Zijlstra pet...@infradead.org Cc: Glauber Costa glom...@openvz.org --- include/linux/cgroup.h | 1 + kernel/cgroup.c| 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 5047355..5a8a093 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -397,6 +397,7 @@ struct cgroup_map_cb { #define CFTYPE_ONLY_ON_ROOT(1U 0) /* only create on root cg */ #define CFTYPE_NOT_ON_ROOT (1U 1) /* don't create on root cg */ #define CFTYPE_INSANE (1U 2) /* don't create if sane_behavior */ +#define CFTYPE_NO_PREFIX (1U 3) /* skip subsys prefix */ #define MAX_CFTYPE_NAME64 diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 2a99262..8bbeb4d 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2681,7 +2681,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys, umode_t mode; char name[MAX_CGROUP_TYPE_NAMELEN + MAX_CFTYPE_NAME + 2] = { 0 }; - if (subsys !(cgrp-root-flags CGRP_ROOT_NOPREFIX)) { + if (subsys !(cft-flags CFTYPE_NO_PREFIX) + !(cgrp-root-flags CGRP_ROOT_NOPREFIX)) { strcpy(name, subsys-name); strcat(name, .); } -- 1.8.1.4 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 0/3] Container fixups
Kir, This is a slightly modified shot at the container fixups. First patch makes us more resilient, since by being isolated by the mount namespaces, we no longer have problems with leaking mounts. I identified one of those problems today, and patch #1 in this series fixes it. Patch #2 is the rebase of the fixups scripts patches taking this into account. With patch #3, we provide a host-side fixup (so no guest scripts) for all PAM-based distros, overriding PAM loginuid session decisions. That module is only used for the audit subsystem, which is not present in containers (and it is not clear if it will ever be) With this patches, I can successfully run vzctl enter and ssh into containers running totally unmodified kernels for: centos, ubuntu and suse. Please comment Glauber Costa (3): hooks_ct: create devices inside container allow for distro-specific fix ups at creation time. hooks_ct: trick PAM to not bail out in loginuid failures etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 ++ include/dist.h | 2 ++ include/env.h | 3 +- src/lib/dist.c | 10 +- src/lib/env.c | 10 +++--- src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 87 ++--- 8 files changed, 147 insertions(+), 11 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 2/3] allow for distro-specific fix ups at creation time.
From: Glauber Costa glom...@parallels.com We will need that infrastucture when running with Linux upstream, since some support is very unlikely to ever land in the Kernel. We need to do things like account for the fact that udev may kick in and destroy all the setup we have done for /dev. Since preemptively preventing those actions to happen is very hard ( as an example, centos6 init binary will issue mount syscalls itself), the most robust approach is to insert a script that will be run shortly after rc.sysinit. Although this can't guarantee that it will run *immediately* after, it should be early enough to undo every harmful operation done by /sbin/init and rc.sysinit, therefore allowing operation to continue freely. Signed-off-by: Glauber Costa glom...@parallels.com --- etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 +++ include/dist.h | 2 ++ include/env.h | 3 ++- src/lib/dist.c | 10 +- src/lib/env.c | 10 ++ src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 35 +++ 8 files changed, 99 insertions(+), 7 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf index 727461d..49226c5 100644 --- a/etc/dists/redhat.conf +++ b/etc/dists/redhat.conf @@ -25,3 +25,4 @@ SET_DNS=set_dns.sh SET_USERPASS=set_userpass.sh SET_UGID_QUOTA=set_ugid_quota.sh POST_CREATE=postcreate.sh +CT_FIXUP=fixups.sh diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh new file mode 100755 index 000..da3fd24 --- /dev/null +++ b/etc/dists/scripts/fixups.sh @@ -0,0 +1,43 @@ +#!/bin/bash +# Copyright (C) 2000-2009, Parallels, Inc. All rights reserved. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +# +# +# Run-time fixup for legacy distributions that will rely on udevd to populate +# their /dev. Those distributions will effectively mount an empty tmpfs on top +# of /dev, defeating any setup we may have done. This includes RHEL-based +# distributions up to RHEL6. We are better off not allowing that to run at all + +function fixup_udev() +{ + if [ -f /etc/fedora-release ]; then + return; + fi + + # All the first two may fail if the distribution didn't actually mount + # then. + umount /dev/pts + umount /dev/shm + # still may hold references to files open by rc.sysinit. Even if this is the + # case those will be simple things like /dev/null and should go away shortly. + # let us not fail because of that. + umount /dev -l +} + +fixup_udev + +exit 0 + diff --git a/include/dist.h b/include/dist.h index 1f6a5a6..4faa452 100644 --- a/include/dist.h +++ b/include/dist.h @@ -29,6 +29,7 @@ #defineSET_USERPASS5 #defineSET_UGID_QUOTA 6 #definePOST_CREATE 7 +#defineCT_FIXUP8 typedef struct { char *def_ostmpl; @@ -46,6 +47,7 @@ typedef struct dist_actions { char *set_userpass; /** setup user password. */ char *set_ugid_quota; /** setup 2level quota. */ char *post_create; /** sostcreate actions. */ + char *ct_fixup; /** run-time fixups. */ } dist_actions; /* Read distribution specific actions configuration file. diff --git a/include/env.h b/include/env.h index 4fef438..b10dfd9 100644 --- a/include/env.h +++ b/include/env.h @@ -104,7 +104,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param *param, int stop_mode, int vz_chroot(const char *root); int vz_env_create(vps_handler *h, envid_t veid, vps_res *res, int wait_p[2], int old_wait_p[2], int err_p[2], - env_create_FN fn, void *data); + env_create_FN fn, dist_actions *actions, void *data); struct vps_res; struct arg_start { @@ -116,6 +116,7 @@ struct arg_start { vps_handler *h; void *data; env_create_FN fn; + dist_actions *actions; int userns_p; /* while running in userns, there's extra sync needed */ }; diff --git a/src/lib/dist.c b/src/lib/dist.c index bde94ab..8047cf7 100644 --- a/src/lib/dist.c +++ b/src/lib/dist.c @@ -36,7 +36,8
[Devel] [PATCH 1/3] hooks_ct: create devices inside container
Our devices were being created from the parent of container's init, because we need to be still outside container context to do it. However, this creates quite an annoyance, because those bind mounts will show up in the host /proc/mounts. Turns out, we don't really need to do it from the root side. We can do it from the container side provided we do it before we chroot - and then the host side fs is still visible. The fact that we join a mount namespace will act to keep those mounts totally private, and exempt us from cleaning it up. Signed-off-by: Glauber Costa glom...@openvz.org --- src/lib/hooks_ct.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index daa85ed..7bc9814 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -306,6 +306,10 @@ static int _env_create(void *data) */ close(arg-userns_p); + if (arg-h-can_join_userns) { + create_devices(arg-h, arg-veid, arg-res-fs.root); + } + ret = ct_chroot(arg-res-fs.root); /* Probably means chroot failed */ if (ret) @@ -438,10 +442,6 @@ static int ct_env_create(struct arg_start *arg) } arg-userns_p = userns_p[0]; - if (arg-h-can_join_userns) { - create_devices(arg-h, arg-veid, arg-res-fs.root); - } - ret = clone(_env_create, child_stack, clone_flags, arg); close(userns_p[0]); if (ret 0) { -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 3/3] hooks_ct: trick PAM to not bail out in loginuid failures
From pam_loginuid manpage: The pam_loginuid module sets the loginuid process attribute for the process that was authenticated. This is necessary for applications to be correctly audited The audit subsystem is not namespace aware, and it is likely to be a long time before it becomes - if ever. There isn't still consensus in the comunity about whether or not it makes sense, and if it does, how to implement it. When authenticating, however, writes to the loginuid will fail under user namespaces, and login programs like ssh will fail to work. One solution that does not involve recompiling the kernel with audit disabled, or rebooting into a command line disabled state, is to bind mount the permit module - that basically permits anything - into the loginuid module. This is not distribution specific - aside from the right path - so we apply for all distros, before we go do the fixup scripts. Signed-off-by: Glauber Costa glom...@openvz.org --- src/lib/hooks_ct.c | 44 1 file changed, 44 insertions(+) diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index f769865..a4f9425 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -288,6 +288,50 @@ static int apply_fixups(vps_handler *h, envid_t veid, const char *root, dist_act char buf[STR_SIZE]; int runlevels[] = { 3, 5 }; /* where to apply the fixups */ unsigned int i; + char *pampaths[] = { /lib/security/, /lib64/security, /* centos, suse, debian */ +/lib/x86_64-linux-gnu/security/, /* ubuntu */ +/lib/i386-linux-gnu/security/ /* ubuntu */ + }; + + /* +* From pam_loginuid manpage: The pam_loginuid module sets the +* loginuid process attribute for the process that was authenticated. +* This is necessary for applications to be correctly audited +* +* The audit subsystem is not namespace aware, and it is likely to be a +* long time before it becomes - if ever. There isn't still consensus +* in the comunity about whether or not it makes sense, and if it does, +* how to implement it. +* +* When authenticating, however, writes to the loginuid will fail under +* user namespaces, and login programs like ssh will fail to work. One +* solution that does not involve recompiling the kernel with audit +* disabled, or rebooting into a command line disabled state, is to +* bind mount the permit module - that basically permits anything - +* into the loginuid module. +* +* This is not distribution specific - aside from the right path - so +* we apply for all distros, before we go do the fixup scripts. +*/ + for (i = 0; i ARRAY_SIZE(pampaths); i++) { + struct stat st; + char buf_orig[STR_SIZE]; + + snprintf(buf, sizeof(buf), %s%s/pam_loginuid.so, +root, pampaths[i]); + + if (stat(buf, st)) + continue; + + snprintf(buf_orig, sizeof(buf_orig), %s%s/pam_permit.so, +root, pampaths[i]); + + if ((stat(buf_orig, st)) || (mount(buf_orig, buf, , MS_BIND, 0) 0)) + logger(-1, errno, + Cannot overwrite pam_loginuid.so. Container will start, but + login-based programs like ssh may not work); + break; + } /* Distributions that don't need the fixup will can stop right here */ if (!actions || !actions-ct_fixup) -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 2/3] allow for distro-specific fix ups at creation time.
On 05/21/2013 12:32 AM, Kir Kolyshkin wrote: On 05/20/2013 05:49 AM, Glauber Costa wrote: From: Glauber Costa glom...@parallels.com We will need that infrastucture when running with Linux upstream, since some support is very unlikely to ever land in the Kernel. We need to do things like account for the fact that udev may kick in and destroy all the setup we have done for /dev. Since preemptively preventing those actions to happen is very hard ( as an example, centos6 init binary will issue mount syscalls itself), the most robust approach is to insert a script that will be run shortly after rc.sysinit. Although this can't guarantee that it will run *immediately* after, it should be early enough to undo every harmful operation done by /sbin/init and rc.sysinit, therefore allowing operation to continue freely. I understand that 1 this is something redhat/centos-specific? 2 there is no need to do that for debian/ubuntu etc? No, thi is old-udev specific. In more up to date distributions, udev is already aware that it might be being containerized and won't do all those initialization things. I haven't tested *all* distros we support, but among the most up to date in each variant, only centos6 needs it (ubuntu 12 runs fine, suse 12 runs fine) Am I right? Yes and no, see above. Signed-off-by: Glauber Costa glom...@parallels.com --- etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 +++ include/dist.h | 2 ++ include/env.h | 3 ++- src/lib/dist.c | 10 +- src/lib/env.c | 10 ++ src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 35 +++ 8 files changed, 99 insertions(+), 7 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf index 727461d..49226c5 100644 --- a/etc/dists/redhat.conf +++ b/etc/dists/redhat.conf @@ -25,3 +25,4 @@ SET_DNS=set_dns.sh SET_USERPASS=set_userpass.sh SET_UGID_QUOTA=set_ugid_quota.sh POST_CREATE=postcreate.sh +CT_FIXUP=fixups.sh diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh new file mode 100755 index 000..da3fd24 --- /dev/null +++ b/etc/dists/scripts/fixups.sh @@ -0,0 +1,43 @@ +#!/bin/bash Why bash? It is not available in some distros by default (notably Debian). So, unless we do have a reason to use bash-specific features we'd rather not to. Will change to /bin/sh +# Copyright (C) 2000-2009, Parallels, Inc. All rights reserved. :)) That was written by a past me. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +# +# +# Run-time fixup for legacy distributions that will rely on udevd to populate +# their /dev. Those distributions will effectively mount an empty tmpfs on top +# of /dev, defeating any setup we may have done. This includes RHEL-based +# distributions up to RHEL6. We are better off not allowing that to run at all + +function fixup_udev() s/function // as this is bashism ok index bde94ab..8047cf7 100644 --- a/src/lib/dist.c +++ b/src/lib/dist.c @@ -36,7 +36,8 @@ static struct distr_conf { {SET_DNS, SET_DNS}, {SET_USERPASS, SET_USERPASS}, {SET_UGID_QUOTA, SET_UGID_QUOTA}, -{POST_CREATE, POST_CREATE} +{POST_CREATE, POST_CREATE}, +{CT_FIXUP, CT_FIXUP } 1. Please add comma after } so the next author won't have to patch your line. I.e. +{CT_FIXUP, CT_FIXUP }, 2. Maybe we can give it more concrete name? FIXUP is, well, anything. This is very unfortunate. All the other distro scripts we maintain have very well defined names, associated actions and ways to run. They are described in details in etc/dists/scripts/distribution.conf-template file. In this case we have a script with a name that doesn't tell us anything, is only run for redhat* distros, only when user ns is available/enabled, and in an extremely strange way -- not by executing it in a context of CT like all other scripts do, but by putting it to /etc/rc{3,5}d/S00fixups.sh before CT start. More to say, the logic of putting it this way is built into vzctl C code. Distro scripts meant
[Devel] [PATCH v2 0/2] distro fixups
Kir, Here is a new attempt at implementing fixups scripts. They look nicer than the last version, and rely on a more generic and configurable script instead, that should make our lives a lot easier in the future. Please let me know what you think Glauber Costa (2): allow for distro-specific fix ups at creation time. prestart: fixup legacy udev effects etc/dists/debian.conf | 1 + etc/dists/redhat.conf | 1 + etc/dists/scripts/prestart.sh | 74 +++ etc/dists/suse.conf | 1 + include/dist.h| 2 ++ src/lib/dist.c| 10 +- src/lib/env.c | 15 + 7 files changed, 103 insertions(+), 1 deletion(-) create mode 100755 etc/dists/scripts/prestart.sh -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 2/2] prestart: fixup legacy udev effects
Legacy udev will do a couple of operations that will destroy all the setup we have done for /dev. This is because, unaware that it is that it is being containerized, it will mount a tmpfs on /dev, and then setup it all again. Since preemptively preventing those actions to happen is very hard ( as an example, centos6 init binary will issue mount syscalls itself), the most robust approach is to insert a script that will be run shortly after rc.sysinit. It is hard to detect if rc.sysinit was already ran, but we notice that at the end of each run (for both versions 4, 5 and 6), it will touch /.autofsck. We can check if the file was modified (non-existent - existent, or different modification time) and run our fixups after this. Signed-off-by: Glauber Costa glom...@openvz.org --- etc/dists/scripts/prestart.sh | 36 1 file changed, 36 insertions(+) diff --git a/etc/dists/scripts/prestart.sh b/etc/dists/scripts/prestart.sh index 1f28b13..52047a0 100755 --- a/etc/dists/scripts/prestart.sh +++ b/etc/dists/scripts/prestart.sh @@ -21,6 +21,41 @@ # is, so far, meaningless inside a container. This script will apply various # fixups if needed. +# Legacy udev will try to mount its own /dev in tmpfs, which will in turn +# destroy all our hand crafted setup. We need to undo it here. +fixup_udev() +{ + if [ -f /etc/fedora-release ]; then + return + fi + + if [ ! -f /etc/centos-release -a ! -f /etc/redhat-release ]; then + return + fi + + # rc.sysinit will touch this file after it finishes. + timestamp=$(stat -c %x /.autofsck 2/dev/null) + i=0 + while true; do + newstamp=$(stat -c %x /.autofsck 2/dev/null) + if [ x$newstamp = x$timestamp ]; then + sleep 0.5 + i=$((i+1)) + [ $i -gt 10 ] return + continue + fi + break + done + # All the first two may fail if the distribution didn't actually mount + # then. + umount /dev/pts + umount /dev/shm + # still may hold references to files open by rc.sysinit. Even if this is the + # case those will be simple things like /dev/null and should go away shortly. + # let us not fail because of that. + umount /dev -l +} + fixup_loginuid() { local pam_permit=security/pam_permit.so @@ -36,6 +71,7 @@ fixup_loginuid() [ x$VZ_KERNEL = xyes ] exit 0 [ x$USERNS = xno ] exit 0 +fixup_udev fixup_loginuid exit 0 -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v5 4/6] modify tar extraction to account for user namespace
On 05/19/2013 09:41 PM, Kir Kolyshkin wrote: + */ +#define VZ_DEFAULT_UID10 +#define VZ_DEFAULT_GID10 I assume these are no longer used, right? right ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH] allow for distro-specific fix ups at creation time.
From: Glauber Costa glom...@parallels.com We will need that infrastucture when running with Linux upstream, since some support is very unlikely to ever land in the Kernel. We need to do things like account for the fact that udev may kick in and destroy all the setup we have done for /dev. Since preemptively preventing those actions to happen is very hard ( as an example, centos6 init binary will issue mount syscalls itself), the most robust approach is to insert a script that will be run shortly after rc.sysinit. Although this can't guarantee that it will run *immediately* after, it should be early enough to undo every harmful operation done by /sbin/init and rc.sysinit, therefore allowing operation to continue freely. Signed-off-by: Glauber Costa glom...@parallels.com --- etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 +++ include/dist.h | 2 ++ include/env.h | 3 ++- src/lib/dist.c | 10 +- src/lib/env.c | 10 ++ src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 35 +++ 8 files changed, 99 insertions(+), 7 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf index 727461d..49226c5 100644 --- a/etc/dists/redhat.conf +++ b/etc/dists/redhat.conf @@ -25,3 +25,4 @@ SET_DNS=set_dns.sh SET_USERPASS=set_userpass.sh SET_UGID_QUOTA=set_ugid_quota.sh POST_CREATE=postcreate.sh +CT_FIXUP=fixups.sh diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh new file mode 100755 index 000..da3fd24 --- /dev/null +++ b/etc/dists/scripts/fixups.sh @@ -0,0 +1,43 @@ +#!/bin/bash +# Copyright (C) 2000-2009, Parallels, Inc. All rights reserved. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +# +# +# Run-time fixup for legacy distributions that will rely on udevd to populate +# their /dev. Those distributions will effectively mount an empty tmpfs on top +# of /dev, defeating any setup we may have done. This includes RHEL-based +# distributions up to RHEL6. We are better off not allowing that to run at all + +function fixup_udev() +{ + if [ -f /etc/fedora-release ]; then + return; + fi + + # All the first two may fail if the distribution didn't actually mount + # then. + umount /dev/pts + umount /dev/shm + # still may hold references to files open by rc.sysinit. Even if this is the + # case those will be simple things like /dev/null and should go away shortly. + # let us not fail because of that. + umount /dev -l +} + +fixup_udev + +exit 0 + diff --git a/include/dist.h b/include/dist.h index 1f6a5a6..4faa452 100644 --- a/include/dist.h +++ b/include/dist.h @@ -29,6 +29,7 @@ #defineSET_USERPASS5 #defineSET_UGID_QUOTA 6 #definePOST_CREATE 7 +#defineCT_FIXUP8 typedef struct { char *def_ostmpl; @@ -46,6 +47,7 @@ typedef struct dist_actions { char *set_userpass; /** setup user password. */ char *set_ugid_quota; /** setup 2level quota. */ char *post_create; /** sostcreate actions. */ + char *ct_fixup; /** run-time fixups. */ } dist_actions; /* Read distribution specific actions configuration file. diff --git a/include/env.h b/include/env.h index 4fef438..b10dfd9 100644 --- a/include/env.h +++ b/include/env.h @@ -104,7 +104,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param *param, int stop_mode, int vz_chroot(const char *root); int vz_env_create(vps_handler *h, envid_t veid, vps_res *res, int wait_p[2], int old_wait_p[2], int err_p[2], - env_create_FN fn, void *data); + env_create_FN fn, dist_actions *actions, void *data); struct vps_res; struct arg_start { @@ -116,6 +116,7 @@ struct arg_start { vps_handler *h; void *data; env_create_FN fn; + dist_actions *actions; int userns_p; /* while running in userns, there's extra sync needed */ }; diff --git a/src/lib/dist.c b/src/lib/dist.c index bde94ab..8047cf7 100644 --- a/src/lib/dist.c +++ b/src/lib/dist.c @@ -36,7 +36,8
Re: [Devel] [PATCH v5 6/6] allow for distro-specific fix ups at creation time.
+{ +char buf[STR_SIZE]; + +/* Distributions that don't need the fixup will can stop right here */ +if (!actions || !actions-ct_fixup) +return 0; + +if (snprintf(buf, sizeof(buf), %s/%s, root, /etc/rc3.d/S00vz-fixups.sh) 0) Again and again :( How this snprintf can return negative value? Will change. +goto out; + +if (open(buf, O_RDWR|O_CREAT, 0) 0) +goto out; + +if (mount(actions-ct_fixup, buf, , MS_BIND, 0) 0) +goto out; Please remind me why are you bind-mounting rather than just copying the file? Because I believe it to be more robust this way. The user won't be able to accidentally delete it, mess with permissions, etc. IOW, this is to to make it all temporary and avoid changing the container permanently. + +/* + * Being safe never killed anyone... + * We bind mount instead of symlinking the original rc3 so it won't appear in the original + * private mount. + */ +if (snprintf(buf, sizeof(buf), %s/%s, root, /etc/rc5.d/S00vz-fixups.sh) 0) +goto out; + +if (open(buf, O_RDWR|O_CREAT, 0) 0) +goto out; + +if (mount(actions-ct_fixup, buf, , MS_BIND, 0) 0) +goto out; This is copy-pasting. Can you at least redo this with something like char *dirs[] = {/etc/rc3.d, /etc/rc5.d}; for(i = 0; i ARRAY_SIZE(dirs); i++) { bla bla bla } Yes. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v4 2/7] add user mismatch test
On 05/17/2013 04:18 AM, Kir Kolyshkin wrote: +stat(res-fs.private, private_stat); +if ((local_uid (private_stat.st_uid != *local_uid)) || +(local_gid (private_stat.st_gid != *local_gid))) { As I just commented at the very end of a previous patch, indeed it does make sense to check for local_uid and local_gid being not-NULL, since here you dereference both without checking (relying on can_join_userns which only checks local_uid but not local_gid). So that comment was right (and taking it into account this code is correct). Look again, we check both variables here. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v4 1/7] user namespace support for upstream containers
On 05/17/2013 04:11 AM, Kir Kolyshkin wrote: +if ((arg-userns_p != -1) (read(arg-userns_p, ret, sizeof(ret)) != sizeof(ret))) { +logger(-1, errno, Cannot read from user namespace pipe); We don't close arg-userns_p in case of error here. And it seems we do not close the other end of it here. I had that concern before, you said we're closing all fds somewhere, I'd rather close it explicitly here to be on the safe side (and pass both fds to the child to be able to do so). No need. We also don't pass the other side of the pipe to the child for the other 2 pipes that we use. We don't close all fds somewhere. We close all fds in close_fds right before we call exec_container_init. As for the error here, I changed it, even though it is not strictly needed, since we always exit on errors here. +} } +close(userns_p[0]); +close(userns_p[1]); snprintf(procpath, STR_SIZE, /proc/%d/ns/net, ret); ret = symlink(procpath, ctpath); if (ret) { logger(-1, errno, Can't symlink into netns file %s, ctpath); -destroy_container(arg-veid); -return VZ_RESOURCE_ERROR; +goto err_noclose; } return 0; +err: +close(userns_p[0]); +close(userns_p[1]); +/* FIXME: remove ourselves from container first */ +err_noclose: +destroy_container(arg-veid); +return VZ_RESOURCE_ERROR; } I don't really enjoy this mess of close()'s, goto's and returns, it can be done a bit simple IMHO. But looks like there are no logic errors in here, so OK. Well, I changed this to avoid the goto's. We have more duplicate code, but less confusion. char path[STR_SIZE]; +char upath[STR_SIZE]; struct stat st; +unsigned long *local_uid = param-res.misc.local_uid; ret = container_init(); if (ret) { @@ -592,6 +813,9 @@ int ct_do_open(vps_handler *h, vps_param *param) if (snprintf(path, sizeof(path), /proc/%d/ns/pid, getpid()) 0) return VZ_RESOURCE_ERROR; +if (snprintf(upath, sizeof(upath), /proc/%d/ns/user, getpid()) 0) +return VZ_RESOURCE_ERROR; + Same, I fail to see how snprintf() can ever return negative value here. I fail as well, but in case it does, we will return an error. I don't expect this branch to be ever taken, but it is there as a safeguard. I will remove it if you want me to. I removed the other. ret = mkdir(NETNS_RUN_DIR, S_IRWXU|S_IRGRP|S_IXGRP|S_IROTH|S_IXOTH); if (ret (errno != EEXIST)) { @@ -601,6 +825,20 @@ int ct_do_open(vps_handler *h, vps_param *param) } h-can_join_pidns = !stat(path, st); +/* + * Being able to join the user namespace is a good indication that the + * user namespace is complete. For a long time, the user namespace + * existed, but were far away from being feature complete. When + * running in such a kernel, joining the user namespace will just + * cripple our container, since we won't be able to do anything. It is + * only good for people who are okay running containers as root. + * + * It is not enough, however, for user namespaces to be present in the + * kernel. The container must have been setup to use it (we need the + * mapped user to own the files, etc. So we also need to find suitable + * configuration in the config files. + */ +h-can_join_userns = !stat(upath, st) local_uid; Maybe local_gid as well? Here we are just testing if it is enable or disabled. It will be clear later on when we document it, but local_uid is the only thing we need to test. If it is disabled, we should not even look at local_gid, since we are, already, essentially root. local_uid enabled and local_gid disabled is a valid, though weird, configuration. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v4 3/7] Also pass cmd_p pointer to container open
On 05/17/2013 05:35 AM, Kir Kolyshkin wrote: This is kinda becoming over-engineered (and now I realized I should have said it earlier, when reviewing the patch that added the first param). I understand well why you need local_uid and local_gid from config and cmdline in ct_open(), but maybe this non-zero checks can be done later in the code, so instead of using just h-can_join_userns you can call a function, something like I only need cmd_p at one moment: When we create the container. I think it is better than to leave can_join_userns with the vps_p test only, then. We can test it manually during fs_create, because there is the place in which we really need it. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v5 0/6] User namespace support for upstream containers
Kir, In this patchset, I hope to be addressing all your concerns. I don't use cmd_p any longer, as you requested. I have also tried to merge most of your comments at the main userns patch. It is a bit massive code, so if is there still anything that you would me to change, don't shy away. I will go back to it ASAP. Glauber Costa (6): user namespace support for upstream containers add user mismatch test allow local uid and gid to be specified at container creation modify tar extraction to account for user namespace automatically add bridge venet0 when needed allow for distro-specific fix ups at creation time. etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 ++ include/dist.h | 2 + include/env.h | 4 +- include/res.h | 6 + include/types.h | 1 + man/vzctl.8.in | 17 +++ scripts/vps-create.in | 14 ++ scripts/vps-functions.in| 7 + src/lib/Makefile.am | 3 + src/lib/chown_preload.c | 93 + src/lib/create.c| 38 - src/lib/dist.c | 10 +- src/lib/env.c | 26 +++- src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 329 +++- src/vzctl-actions.c | 2 + src/vzctl.c | 1 + vzctl.spec | 2 +- 19 files changed, 581 insertions(+), 20 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh create mode 100644 src/lib/chown_preload.c -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v5 1/6] user namespace support for upstream containers
From: Glauber Costa glom...@parallels.com This patch allows the execution of unprivileged containers running ontop of an upstream Linux Kernel. We will run at whatever UID is found in the configuration file (so far empty, thus disabled). Signed-off-by: Glauber Costa glom...@parallels.com --- include/env.h | 1 + include/types.h| 1 + src/lib/hooks_ct.c | 254 +++-- 3 files changed, 251 insertions(+), 5 deletions(-) diff --git a/include/env.h b/include/env.h index 06729f6..4fef438 100644 --- a/include/env.h +++ b/include/env.h @@ -116,6 +116,7 @@ struct arg_start { vps_handler *h; void *data; env_create_FN fn; + int userns_p; /* while running in userns, there's extra sync needed */ }; struct env_create_param3; diff --git a/include/types.h b/include/types.h index a4bce73..fa511aa 100644 --- a/include/types.h +++ b/include/types.h @@ -95,6 +95,7 @@ typedef struct vps_handler { int vzfd; /** /dev/vzctl file descriptor. */ int stdfd; int can_join_pidns; /* can't enter otherwise */ + int can_join_userns; /* can't run non privileged otherwise */ int (*is_run)(struct vps_handler *h, envid_t veid); int (*enter)(struct vps_handler *h, envid_t veid, const char *root, int flags); int (*destroy)(struct vps_handler *h, envid_t veid); diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index e46c1df..e71f116 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -66,6 +66,8 @@ static int sys_setns(int fd, int nstype) # define MS_PRIVATE (1 18) #endif +#define UID_GID_RANGE 10 /* how many users per container */ + /* This function is there in GLIBC, but not in headers */ extern int pivot_root(const char * new_root, const char * put_old); @@ -203,16 +205,159 @@ static int ct_setlimits(vps_handler *h, envid_t veid, struct ub_struct *ub) } #undef add_value +static int write_uid_gid_mapping(vps_handler *h, unsigned long uid, unsigned long gid, pid_t pid) +{ + char buf[STR_SIZE]; + char map[STR_SIZE]; + int fd; + int len; + int ret = VZ_RESOURCE_ERROR; + + len = snprintf(map, sizeof(map), 0 %ld %d, uid, UID_GID_RANGE); + snprintf(buf, sizeof(buf), /proc/%d/uid_map, pid); + if ((fd = open(buf, O_WRONLY)) 0) + goto out; + + if ((write(fd, map, len) != len)) + goto out; + + close(fd); + + len = snprintf(map, sizeof(map), 0 %ld %d, gid, UID_GID_RANGE); + snprintf(buf, sizeof(map), /proc/%d/gid_map, pid); + if ((fd = open(buf, O_WRONLY)) 0) + goto out; + + if ((write(fd, map, len) != len)) + goto out; + ret = 0; +out: + if (fd = 0) + close(fd); + return ret; +} + +/* + * Those devices should exist in the container, and be valid device nodes with + * user access permission. But we need to be absolutely sure this is the case, + * so we will provide our own versions. That could actually happen since some + * distributions may come with emptied /dev's, waiting for udev to populate them. + * That won't happen, we do it ourselves. + */ +static void create_devices(vps_handler *h, envid_t veid, const char *root) +{ + unsigned int i; + char *devices[] = { + /dev/null, + /dev/zero, + /dev/random, + /dev/urandom, + }; + + /* +* We will tolerate errors, and keep the container running, because it is +* likely we will be able to boot it to a barely functional state. But +* be vocal about it +*/ + for (i = 0; i ARRAY_SIZE(devices); i++) { + char ct_devname[STR_SIZE]; + int ret; + + snprintf(ct_devname, sizeof(ct_devname), %s%s, root, devices[i]); + + /* +* No need to be crazy about file flags. When we bind mount, the +* source permissions will be inherited. +*/ + ret = open(ct_devname, O_RDWR|O_CREAT, 0); + if (ret 0) { + logger(-1, errno, Could not touch device %s, devices[i]); + continue; + } + close(ret); + + ret = mount(devices[i], ct_devname, , MS_BIND, 0); + if (ret 0) + logger(-1, errno, Could not bind mount device %s, devices[i]); + } +} + static int _env_create(void *data) { struct arg_start *arg = data; struct env_create_param3 create_param; int ret; - if ((ret = ct_chroot(arg-res-fs.root))) + if ((arg-userns_p != -1) (read(arg-userns_p, ret, sizeof(ret)) != sizeof(ret))) { + logger(-1, errno, Cannot read from user namespace pipe); + close(arg-userns_p); + return VZ_RESOURCE_ERROR; + } + + ret = ct_chroot(arg
[Devel] [PATCH v5 2/6] add user mismatch test
From: Glauber Costa glom...@parallels.com In theory, we won't be able to run if our private area is not owned by ourselves. We could, if it have very wide open security permissions, but we should never set up a container like that. Aside from a basic sanity check, this is intended to catch problems for the few people who may have already created containers that will be owned by root:root, and will now try to run it unprivileged. Signed-off-by: Glauber Costa glom...@parallels.com --- src/lib/env.c | 16 1 file changed, 16 insertions(+) diff --git a/src/lib/env.c b/src/lib/env.c index 898bfc8..96400c0 100644 --- a/src/lib/env.c +++ b/src/lib/env.c @@ -30,6 +30,7 @@ #include linux/reboot.h #include sys/mount.h #include sys/utsname.h +#include sys/stat.h #include vzerror.h #include res.h @@ -554,6 +555,21 @@ int vps_start_custom(vps_handler *h, envid_t veid, vps_param *param, logger(-1, 0, Container is already running); return VZ_VE_RUNNING; } + if (!is_vz_kernel(h) h-can_join_userns) { + struct stat private_stat; + unsigned long *local_uid = res-misc.local_uid; + unsigned long *local_gid = res-misc.local_gid; + + stat(res-fs.private, private_stat); + if ((local_uid (private_stat.st_uid != *local_uid)) || + (local_gid (private_stat.st_gid != *local_gid))) { + logger(-1, 0, Container private area is owned by %d:%d + , but configuration file says we should run with %lu:%lu.\n + Refusing to run., private_stat.st_uid, private_stat.st_gid, + *res-misc.local_uid, *res-misc.local_gid); + return VZ_FS_BAD_TMPL; + } + } if ((ret = check_ub(h, res-ub))) return ret; -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v5 3/6] allow local uid and gid to be specified at container creation
From: Glauber Costa glom...@parallels.com It is a valid use case to run a container with host uid and gid different than the default. In particular, already deployed versions of vzctl are expected to have this value unset, effectively meaning they are not expecting user namespaces to be present. We also deem as a valid use case to run a fully privileged container, in which case we will explicitly disable user namespaces. This patch provides and documents a way to do so. Signed-off-by: Glauber Costa glom...@parallels.com --- man/vzctl.8.in | 17 + src/lib/hooks_ct.c | 4 ++-- src/vzctl-actions.c | 2 ++ src/vzctl.c | 1 + 4 files changed, 22 insertions(+), 2 deletions(-) diff --git a/man/vzctl.8.in b/man/vzctl.8.in index 20a1856..e233696 100644 --- a/man/vzctl.8.in +++ b/man/vzctl.8.in @@ -871,6 +871,8 @@ List of available fields can be obtained using \fB-L\fR option. .OP --ipadd addr .OP --hostname name .OP --name name +.OP --local_uid uid +.OP --local_gid gid .YS .IP 4 Creates a new container area. This operation should be done once, before @@ -922,6 +924,21 @@ a container. Note that this option can be used multiple times. You can use \fB--hostname\fR \fIname\fR option to set a host name for a container. + +When running with an upstream Linux Kernel that supports user namespaces (= +3.8), the parameters \fB--local_uid\fR and \fB--local_gid\fR can be used to +select which \fIuid\fR and \fIgid\fR respectively will be used as a base user +in the host system. Note that user namespaces provide a 1:1 mapping between +container users and host users. If these options are not specified, the values +\fBLOCAL_UID\fR and \fBLOCAL_GID\fR from global configuration file +\fBvz.conf\fR(5) are used. An explicit \fB--local_uid\fR value of 0 will +disable user namespace support, and run the container as a privileged user. In +this case, \fB--local_gid\fR is ignored. + +\fBWarning:\fR use \fB--local_uid\fR and \fB--local_gid\fR with care, specially +when migrating containers. In all situations, the container's files in the +filesystem needs to be correctly owned by the host-side users. + .IP \fBdestroy\fR | \fBdelete\fR \fICTID\fR 4 Removes a container private area by deleting all files, directories and the configuration file of this container. diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index e71f116..299d43d 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -423,7 +423,7 @@ static int ct_env_create(struct arg_start *arg) clone_flags |= CLONE_NEWNET|CLONE_NEWNS; if (!arg-h-can_join_userns) { - logger(-1, 0, WARNING: Running container unprivileged. USER_NS not supported); + logger(-1, 0, WARNING: Running container unprivileged. USER_NS not supported, or runtime disabled); userns_p[0] = userns_p[1] = -1; } else { @@ -844,7 +844,7 @@ int ct_do_open(vps_handler *h, vps_param *param) * mapped user to own the files, etc. So we also need to find suitable * configuration in the config files. */ - h-can_join_userns = !stat(upath, st) local_uid; + h-can_join_userns = !stat(upath, st) local_uid (*local_uid != 0); h-is_run = ct_is_run; h-enter = ct_enter; h-destroy = ct_destroy; diff --git a/src/vzctl-actions.c b/src/vzctl-actions.c index 0dd2ae7..1ef61d1 100644 --- a/src/vzctl-actions.c +++ b/src/vzctl-actions.c @@ -391,6 +391,8 @@ static int parse_create_opt(envid_t veid, int argc, char **argv, {ve_layout, required_argument, NULL, PARAM_VE_LAYOUT}, {velayout,required_argument, NULL, PARAM_VE_LAYOUT}, {diskspace, required_argument, NULL, PARAM_DISKSPACE}, + {local_uid, required_argument, NULL, PARAM_LOCAL_UID}, + {local_gid, required_argument, NULL, PARAM_LOCAL_GID}, { NULL, 0, NULL, 0 } }; diff --git a/src/vzctl.c b/src/vzctl.c index 359bcde..54d66d1 100644 --- a/src/vzctl.c +++ b/src/vzctl.c @@ -65,6 +65,7 @@ static void usage(int rc) vzctl create ctid [--ostemplate name] [--config name]\n [--layout ploop|simfs] [--hostname name] [--name name] [--ipadd addr]\n [--diskspace kbytes] [--private path] [--root path]\n + [--local_uid UID] [--local_gid GID]\n vzctl start ctid [--force] [--wait]\n vzctl destroy | mount | umount | stop | restart | status ctid\n #ifdef HAVE_PLOOP -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v5 4/6] modify tar extraction to account for user namespace
From: Glauber Costa glom...@parallels.com If we are running upstream with user namespaces, we need to create the container filesystem not with the ownership preserved, but reflecting the mapping we need to apply. Note that according to our documentation, we should ignore this if the user explicitly requested an uid mapping of 0 (gid is ignored in this case). Our tooling doesn't allow any easy way to unpack a whol distribution with offsets mechanically applied like this. We could do the whole unpacking in a user namespace itself, but that does not come without problems on its own (for instance, we won't be able to create any device files, we have to carefully adjust permissions in the root directory, etc) To work around that, we can employ a trick to allow container creation right now, as well as to avoid compatibility problems: we will resort to LD_PRELOAD to load a schim that captures calls to the chown family of system calls and applies the offset manually. Signed-off-by: Glauber Costa glom...@parallels.com --- include/res.h | 6 scripts/vps-create.in | 14 src/lib/Makefile.am | 3 ++ src/lib/chown_preload.c | 93 + src/lib/create.c| 38 +--- vzctl.spec | 2 +- 6 files changed, 150 insertions(+), 6 deletions(-) create mode 100644 src/lib/chown_preload.c diff --git a/include/res.h b/include/res.h index 0dfacf7..047593a 100644 --- a/include/res.h +++ b/include/res.h @@ -47,6 +47,12 @@ struct env_param { }; typedef struct env_param env_param_t; +/* + * When running upstream kernels, we will need host-side UID and GID. Those + * are configurable, but if none is specified, those are used. + */ +#define VZ_DEFAULT_UID 10 +#define VZ_DEFAULT_GID 10 typedef struct { list_head_t userpw; list_head_t nameserver; diff --git a/scripts/vps-create.in b/scripts/vps-create.in index 126f048..a1b3382 100755 --- a/scripts/vps-create.in +++ b/scripts/vps-create.in @@ -22,11 +22,22 @@ # Required parameters: # VE_PRVT- path to root of CT private areas # PRIVATE_TEMPLATE - path to private template used as a source for copying +# +# Optional parameters: +# UID_OFFSET - offset to be added to all tar UIDs +# GID_OFFSET - offset to be added to all tar GIDs . @SCRIPTDIR@/vps-functions vzcheckvar VE_PRVT PRIVATE_TEMPLATE +chown_preload_if_needed() +{ + [ -z $UID_OFFSET -o -z $GID_OFFSET ] return + + export LD_PRELOAD=libvzchown.so +} + create_prvt() { local TMP AVAIL NEEDED HEADER OPT @@ -75,6 +86,9 @@ create_prvt() [ $AVAIL -ge $NEEDED ] || vzerror Insufficient disk space in $VE_PRVT; available: $AVAIL, needed: $NEEDED ${VZ_FS_NO_DISK_SPACE} CAT=cat + + chown_preload_if_needed + # Use pv to show nice progress bar if we can pv -V /dev/null 21 CAT=pv chmod 700 $VE_PRVT diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am index b009d15..34c463b 100644 --- a/src/lib/Makefile.am +++ b/src/lib/Makefile.am @@ -72,6 +72,9 @@ libvzctl_la_LIBADD = $(XML_LIBS) $(CGROUP_LIBS) $(DL_LIBS) if HAVE_CGROUP libvzctl_la_SOURCES += cgroup.c hooks_ct.c + +lib_LTLIBRARIES += libvzchown.la +libvzchown_la_SOURCES = chown_preload.c endif if HAVE_VZ_KERNEL diff --git a/src/lib/chown_preload.c b/src/lib/chown_preload.c new file mode 100644 index 000..4e2be4a --- /dev/null +++ b/src/lib/chown_preload.c @@ -0,0 +1,93 @@ +/* + * Copyright (C) 2013, Parallels, Inc. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Authors: Kir Kolyshkin and Glauber Costa + */ + +#include unistd.h +#include stdio.h +#include stdlib.h + +#include dlfcn.h + +static int (*real_chown)(const char *path, uid_t owner, gid_t group) = NULL; +static int (*real_fchown)(int fd, uid_t owner, gid_t group) = NULL; +static int (*real_lchown)(const char *path, uid_t owner, gid_t group) = NULL; +static int (*real_fchownat)(int dirfd, const char *pathname, + uid_t owner, gid_t group, int flags) = NULL; + +uid_t uid_offset = 0; +gid_t gid_offset = 0; + +static void __init(void) +{ + char *uidstr, *gidstr; + + uidstr = getenv(UID_OFFSET
[Devel] [PATCH v5 5/6] automatically add bridge venet0 when needed
From: Glauber Costa glom...@parallels.com The chosen architecture to deal with --ipadd with upstream containers is to create a veth pair and add the host side information to a bridge called venet0. This way, all the code that expects venet0 to exist can still work without modifications, (or with just a few). Our intention to do that was actually already stated in the comments, but the code was removed before merging because --ipadd would not work without full unshare support anyway. This patch implements that. Signed-off-by: Glauber Costa glom...@parallels.com --- scripts/vps-functions.in | 7 +++ src/lib/hooks_ct.c | 37 +++-- 2 files changed, 42 insertions(+), 2 deletions(-) diff --git a/scripts/vps-functions.in b/scripts/vps-functions.in index 826c0a1..37b2de5 100755 --- a/scripts/vps-functions.in +++ b/scripts/vps-functions.in @@ -170,6 +170,13 @@ vzadjustmacs() # other setups, the bridge is expected to already exist and be valid. vzconfbridge() { + if [ x$BRIDGE == xvenet0 ]; then + if [ $(brctl show venet0 2/dev/null | tail -n+2 | wc -l) == 0 ]; then + brctl addbr venet0 + ${IP_CMD} link set venet0 up + fi + fi + if [ x$BRIDGE != x ]; then brctl addif $BRIDGE $HNAME /dev/null 21 fi diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index 299d43d..f002fa0 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -17,6 +17,7 @@ #include logger.h #include script.h #include cgroup.h +#include linux/vzctl_venet.h #define NETNS_RUN_DIR /var/run/netns @@ -731,8 +732,40 @@ static int ct_netdev_ctl(vps_handler *h, envid_t veid, int op, char *name) static int ct_ip_ctl(vps_handler *h, envid_t veid, int op, const char *ipstr) { - logger(-1, 0, %s not yet supported upstream, __func__); - return 0; + int ret = -1; + char *envp[5]; + char buf[STR_SIZE]; + int i = 0; + + if (!h-can_join_pidns) { + logger(-1, 0, Cannot join pid namespace: + --ipadd is not supported in kernels without full pidns support); + return VZ_BAD_KERNEL; + } + envp[i++] = strdup(VNAME=venet0); + envp[i++] = strdup(BRIDGE=venet0); + + snprintf(buf, sizeof(buf), HNAME=venet0.%d, veid); + envp[i++] = strdup(buf); + + snprintf(buf, sizeof(buf), VEID=%d, veid); + envp[i++] = strdup(buf); + + envp[i] = NULL; + + if (op == VE_IP_ADD) { + char *argv[] = { VPS_NETNS_DEV_ADD, NULL }; + + ret = run_script(VPS_NETNS_DEV_ADD, argv, envp, 0); + } else { + char *argv[] = { VPS_NETNS_DEV_DEL, NULL }; + + ret = run_script(VPS_NETNS_DEV_DEL, argv, envp, 0); + } + free_arg(envp); + + return ret; + } /* -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v5 6/6] allow for distro-specific fix ups at creation time.
From: Glauber Costa glom...@parallels.com We will need that infrastucture when running with Linux upstream, since some support is very unlikely to ever land in the Kernel. We need to do things like account for the fact that udev may kick in and destroy all the setup we have done for /dev. Since preemptively preventing those actions to happen is very hard ( as an example, centos6 init binary will issue mount syscalls itself), the most robust approach is to insert a script that will be run shortly after rc.sysinit. Although this can't guarantee that it will run *immediately* after, it should be early enough to undo every harmful operation done by /sbin/init and rc.sysinit, therefore allowing operation to continue freely. Signed-off-by: Glauber Costa glom...@parallels.com --- etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 +++ include/dist.h | 2 ++ include/env.h | 3 ++- src/lib/dist.c | 10 +- src/lib/env.c | 10 ++ src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 38 ++ 8 files changed, 102 insertions(+), 7 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf index 727461d..49226c5 100644 --- a/etc/dists/redhat.conf +++ b/etc/dists/redhat.conf @@ -25,3 +25,4 @@ SET_DNS=set_dns.sh SET_USERPASS=set_userpass.sh SET_UGID_QUOTA=set_ugid_quota.sh POST_CREATE=postcreate.sh +CT_FIXUP=fixups.sh diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh new file mode 100755 index 000..da3fd24 --- /dev/null +++ b/etc/dists/scripts/fixups.sh @@ -0,0 +1,43 @@ +#!/bin/bash +# Copyright (C) 2000-2009, Parallels, Inc. All rights reserved. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +# +# +# Run-time fixup for legacy distributions that will rely on udevd to populate +# their /dev. Those distributions will effectively mount an empty tmpfs on top +# of /dev, defeating any setup we may have done. This includes RHEL-based +# distributions up to RHEL6. We are better off not allowing that to run at all + +function fixup_udev() +{ + if [ -f /etc/fedora-release ]; then + return; + fi + + # All the first two may fail if the distribution didn't actually mount + # then. + umount /dev/pts + umount /dev/shm + # still may hold references to files open by rc.sysinit. Even if this is the + # case those will be simple things like /dev/null and should go away shortly. + # let us not fail because of that. + umount /dev -l +} + +fixup_udev + +exit 0 + diff --git a/include/dist.h b/include/dist.h index 1f6a5a6..4faa452 100644 --- a/include/dist.h +++ b/include/dist.h @@ -29,6 +29,7 @@ #defineSET_USERPASS5 #defineSET_UGID_QUOTA 6 #definePOST_CREATE 7 +#defineCT_FIXUP8 typedef struct { char *def_ostmpl; @@ -46,6 +47,7 @@ typedef struct dist_actions { char *set_userpass; /** setup user password. */ char *set_ugid_quota; /** setup 2level quota. */ char *post_create; /** sostcreate actions. */ + char *ct_fixup; /** run-time fixups. */ } dist_actions; /* Read distribution specific actions configuration file. diff --git a/include/env.h b/include/env.h index 4fef438..b10dfd9 100644 --- a/include/env.h +++ b/include/env.h @@ -104,7 +104,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param *param, int stop_mode, int vz_chroot(const char *root); int vz_env_create(vps_handler *h, envid_t veid, vps_res *res, int wait_p[2], int old_wait_p[2], int err_p[2], - env_create_FN fn, void *data); + env_create_FN fn, dist_actions *actions, void *data); struct vps_res; struct arg_start { @@ -116,6 +116,7 @@ struct arg_start { vps_handler *h; void *data; env_create_FN fn; + dist_actions *actions; int userns_p; /* while running in userns, there's extra sync needed */ }; diff --git a/src/lib/dist.c b/src/lib/dist.c index bde94ab..8047cf7 100644 --- a/src/lib/dist.c +++ b/src/lib/dist.c @@ -36,7 +36,8
Re: [Devel] [PATCH 1/6] vzctl: split ct_env_create
On 05/16/2013 04:14 PM, Andrey Vagin wrote: + ret = ct_env_create_real(arg); + if (ret 0) return VZ_RESOURCE_ERROR; - } Isn't it better to just keep the return values intact in create_real, and then return them as is if ret != 0 ? ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 2/6] vzctl: save PID of init in a state file
On 05/16/2013 04:14 PM, Andrey Vagin wrote: CRIU requires a pid of the init. Signed-off-by: Andrey Vagin ava...@openvz.org The way you coded it, it seems to me that we will always overwrite the pid file, which is fine: this way we won't run into the usual pid file already exists kinds of problem. When we reach this point, we already tested if the container is running... My only fear is that we do that by verifying if there is any tasks already in the cgroup. So can we race against other container creation? We put ourselves in the container before we call clone, so *at this point* the vps_is_run test would fail, but by now we have already tested it a while ago... Maybe vzctl has other mechanism against such races that I am not aware of, but could you explain me why we don't have this race? It is my only concern, the rest looks fine. Having the init pid is quite handy for other situations as well, like stop --fast. So far we hack that by killing everybody until no more tasks are found in the cgroup, but what I always wanted was really a mechanism to kill init. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v3 4/9] user namespace support for upstream containers
On 05/14/2013 07:02 AM, Kir Kolyshkin wrote: Oh my, four cases of whitespace-at-eol which I had to fixed manually. Sorry about that. I think I got to used to checkpatch and the such that I forget to verify all of those manually. Wanted to apply it nevertheless and fix some things myself, but then the list grew out too big. Plus please see inline. Here we go. /* This function is there in GLIBC, but not in headers */ extern int pivot_root(const char * new_root, const char * put_old); @@ -138,10 +140,41 @@ static int _env_create(void *data) struct env_create_param3 create_param; int ret; -if ((ret = ct_chroot(arg-res-fs.root))) +if ((arg-userns_p != -1) (read(arg-userns_p, ret, sizeof(ret)) == 0)) 1 I'd suggest comparing with sizeof(ret) As you wish. 2 We can't print any errors here, can we? If we can I'd rather do it. I believe logger works in here, yes. +return -1; You're supposed to return VZ_* exit codes from here. ok. + +ret = ct_chroot(arg-res-fs.root); +close(arg-userns_p); +/* Probably means chroot failed */ +if (ret) return ret; -if ((ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 1))) +if (arg-h-can_join_userns) { +int fd; +setuid(0); +setgid(0); +/* + * We need the special flag newinstance. This is a requirement + * of the userns-aware implementation of devpts as of Linux 3.9. + * Because of that special requirement, we do it here rather than + * later. + */ +mount(devpts, /dev/pts, devpts, 0, newinstance); What if mount point does not exist, or is not a directory? I think it makes sense to 1. try creating it, ignoring EEXIST 2. check return from mount() Check return and then what? It is not clear for me if we should abort or not when things go wrong. It seems perfectly valid that the container continues. I have the same dilemma with the other devices. +/* /dev/ptmx, if it even exists, would refer to the root ptmx. + * We don't want that, we want our newly created instance to contain + * all ptys. So we bind mount the root device here + */ +fd = open(/dev/ptmx, O_RDWR|O_CREAT, 0); You do open() to create the /dev/ptmx file in case it does not exist, right? Why RDWR then? No special reason. +/* + * Those devices should exist in the container, and be valid device nodes with + * user access permission. But we need to be absolutely sure this is the case, + * so we will provide our own versions. That could actually happen since some + * distributions may come with emptied /dev's, waiting for udev to populate them. + * That won't happen, we do it ourselves. + */ +static void create_devices(vps_handler *h, envid_t veid, const char *root) +{ +unsigned int i; +char *devices[] = { +/dev/null, +/dev/zero, +/dev/random, +/dev/urandom, +}; + +/* + * We will tolerate errors, and keep the container running, because it is + * likely we will be able to boot it to a barely functional state. But + * be vocal about it + */ +for (i = 0; i ARRAY_SIZE(devices); i++) { +char ct_devname[STR_SIZE]; +int ret; + +ret = snprintf(ct_devname, sizeof(ct_devname), %s/%s, root, devices[i]); +if (ret 0) { +logger(-1, errno, Could not allocate device string\n); we don't need \n for logger() calls. Will remove, but the most important question here is the behavioral question. Is it okay to proceed ? static int ct_env_create(struct arg_start *arg) { @@ -233,6 +348,8 @@ static int ct_env_create(struct arg_start *arg) int ret; char procpath[STR_SIZE]; char ctpath[STR_SIZE]; +int userns_p[2]; +int err; stack_size = get_pagesize(); if (stack_size 0) @@ -272,17 +389,58 @@ static int ct_env_create(struct arg_start *arg) * Belong in the setup phase */ clone_flags = SIGCHLD; -/* FIXME: USERNS is still work in progress */ clone_flags |= CLONE_NEWUTS|CLONE_NEWPID|CLONE_NEWIPC; clone_flags |= CLONE_NEWNET|CLONE_NEWNS; +if (!arg-h-can_join_userns) { +logger(-1, 0, WARNING: Running container unprivileged. USER_NS not supported); + +userns_p[0] = userns_p[1] = -1; +} else { +clone_flags |= CLONE_NEWUSER; +if (pipe(userns_p) 0) { +logger(-1, errno, Can not create userns pipe); +return VZ_RESOURCE_ERROR; +} +} +arg-userns_p = userns_p[0]; + +if (arg-h-can_join_userns) { +create_devices(arg-h, arg-veid, arg-res-fs.root); +} + ret = clone(_env_create, child_stack, clone_flags, arg); if (ret 0) { logger(-1, errno, Unable to clone); /* FIXME: remove ourselves from container
Re: [Devel] [PATCH v3 0/9] Upstream Linux support for userns
On 05/14/2013 07:03 AM, Kir Kolyshkin wrote: On 04/29/2013 10:16 PM, Glauber Costa wrote: Kir, Please review the following patchset. The main difference from last version is that we support running with userns disabled even if it is present. This effectively means that containers that were already created and owned by root will keep working. It is also possible to explicitly disable it at container creation by setting local_uid to 0. There are also some bugfixes and changes according to the review you provided. Running a container works, and vzctl enter works as well. Most pressing, is the fact that although this patchset finally implements --ipadd (now all infrastructure is in place). First three patches merged, patch 4 needs more love, and the rest of the series depends on #4. Thanks, will fix. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v3 7/9] modify tar extraction to account for user namespace
On 05/14/2013 07:17 AM, Kir Kolyshkin wrote: Hmm... If I understand it correctly, in case LOCAL_UID/LOCAL_GID is not set in the global configuration file, and not supplied from command line, here you apply the default values of 1. The problem I see these values are not saved into configuration file and therefore not applied during the start. Can you please check if I'm right here? I can double check, but I remember having tested this exactly scenario: create container, (with multiple LOCAL_UID values), and then checking the result. It seems to work, but I will recheck that. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v3 6/9] allow local uid and gid to be specified at container creation
On 05/14/2013 07:09 AM, Kir Kolyshkin wrote: +When running with an upstream Linux Kernel that supports user namespaces (= +3.8), the parameters \fB--local_uid\fR and \fB--local_gid\fR can be used to +select which \fIuid\fR and \fIgid\fR respectively will be used as a base user +in the host system. Note that user namespaces provide a 1:1 mapping between +container users and host users. If these options are not specified, the value +10 is used. This probably comes from an older version of a patchset. We don't have compiled-in defaults, do we? I'd say If these options are not specified, the values \fBLOCAL_UID\fR and \fBLOCAL_GID\fR from global configuration file \fBvz.conf\fR(5) are used. Huummm, this is not my understanding. Please note that --local_uid and --local_uid are creation time switches. This means that they will only apply to newly created containers. Upon execution, these are read from the local configuration file. In older containers, this will be unset and we will run with userns disabled. In newly containers, the only way to disable it is to specify --local_uid = 0, but by default we run with 10. What is the problem with this? ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v3 7/9] modify tar extraction to account for user namespace
On 05/14/2013 07:17 AM, Kir Kolyshkin wrote: Hmm... If I understand it correctly, in case LOCAL_UID/LOCAL_GID is not set in the global configuration file, and not supplied from command line, here you apply the default values of 1. The problem I see these values are not saved into configuration file and therefore not applied during the start. Can you please check if I'm right here? I actually think it is better to remove this excerpt. I will assume that if we are running a new installation, the global configuration file will be a new one as well. So I will just default it to 0, which means userns disabled. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v4 3/7] Also pass cmd_p pointer to container open
We would like to know early on in the container lifetime (before creation) if the container should use user namespaces. At this point, we not yet have a ct configuration file, just the global one. It may very well be that the user have overriden the global configuration file information through the command line. If that is the case, this value should take precedence. To achieve this, we should also pass the cmd_p information to the open functions. Signed-off-by: Glauber Costa glom...@openvz.org --- include/env.h | 8 +--- src/lib/env.c | 7 --- src/lib/hooks_ct.c | 2 +- src/lib/hooks_vz.c | 2 +- src/vzctl-actions.c | 2 +- 5 files changed, 12 insertions(+), 9 deletions(-) diff --git a/include/env.h b/include/env.h index 4fef438..35d427f 100644 --- a/include/env.h +++ b/include/env.h @@ -49,8 +49,10 @@ enum { * * @param veid CT ID. * @return handler or NULL on error. + * @param global ct parameters. + * @cmdcommand line ct parameters. */ -vps_handler *vz_open(envid_t veid, vps_param *param); +vps_handler *vz_open(envid_t veid, vps_param *param, vps_param *cmd); /** Close CT handler. * @@ -126,6 +128,6 @@ int exec_container_init(struct arg_start *arg, struct env_create_param3 *create_param); int set_personality32(); -int vz_do_open(vps_handler *h, vps_param *param); -int ct_do_open(vps_handler *h, vps_param *param); +int vz_do_open(vps_handler *h, vps_param *param, vps_param *cmd); +int ct_do_open(vps_handler *h, vps_param *param, vps_param *cmd); #endif diff --git a/src/lib/env.c b/src/lib/env.c index 96400c0..166c83b 100644 --- a/src/lib/env.c +++ b/src/lib/env.c @@ -74,9 +74,10 @@ static int reset_std() * * @param veid CT ID. * @param paramCT parameters. + * @param cmd CT command line parameters. * @return handler or NULL on error. */ -vps_handler *vz_open(envid_t veid, vps_param *param) +vps_handler *vz_open(envid_t veid, vps_param *param, vps_param *cmd) { vps_handler *h = NULL; int ret = -1; @@ -90,7 +91,7 @@ vps_handler *vz_open(envid_t veid, vps_param *param) #ifdef VZ_KERNEL_SUPPORTED /* Find out if we are under OpenVZ or upstream kernel */ if (stat_file(/proc/vz)) - ret = vz_do_open(h, param); + ret = vz_do_open(h, param, cmd); else #endif { @@ -98,7 +99,7 @@ vps_handler *vz_open(envid_t veid, vps_param *param) non-OpenVZ kernel); h-vzfd = -1; #ifdef HAVE_UPSTREAM - ret = ct_do_open(h, param); + ret = ct_do_open(h, param, cmd); #else logger(-1, 0, No suitable kernel support compiled in); #endif diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index 8738eeb..f4d8f48 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -791,7 +791,7 @@ static int ct_setcontext(envid_t veid) return 0; } -int ct_do_open(vps_handler *h, vps_param *param) +int ct_do_open(vps_handler *h, vps_param *param, vps_param *cmd) { int ret; char path[STR_SIZE]; diff --git a/src/lib/hooks_vz.c b/src/lib/hooks_vz.c index 50770c0..c373b13 100644 --- a/src/lib/hooks_vz.c +++ b/src/lib/hooks_vz.c @@ -466,7 +466,7 @@ static int vz_veth_ctl(vps_handler *h, envid_t veid, int op, veth_dev *dev) return ret; } -int vz_do_open(vps_handler *h, vps_param *param) +int vz_do_open(vps_handler *h, vps_param *param, vps_param *cmd) { if ((h-vzfd = open(VZCTLDEV, O_RDWR)) 0) { logger(-1, errno, Unable to open %s, VZCTLDEV); diff --git a/src/vzctl-actions.c b/src/vzctl-actions.c index 0dd2ae7..7bf15c9 100644 --- a/src/vzctl-actions.c +++ b/src/vzctl-actions.c @@ -1470,7 +1470,7 @@ int run_action(envid_t veid, act_t action, vps_param *g_p, vps_param *vps_p, int ret = 0, lock_id = -1; struct sigaction act; - if ((h = vz_open(veid, g_p)) == NULL) { + if ((h = vz_open(veid, g_p, cmd_p)) == NULL) { /* Accept to run set --save --force on any kernel, * otherwise error out if initialization failed */ -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v4 4/7] allow local uid and gid to be specified at container creation
From: Glauber Costa glom...@parallels.com It is a valid use case to run a container with host uid and gid different than the default. In particular, already deployed versions of vzctl are expected to have this value unset, effectively meaning they are not expecting user namespaces to be present. We also deem as a valid use case to run a fully privileged container, in which case we will explicitly disable user namespaces. This patch provides and documents a way to do so. Signed-off-by: Glauber Costa glom...@parallels.com --- man/vzctl.8.in | 17 + src/lib/hooks_ct.c | 11 --- src/vzctl-actions.c | 2 ++ src/vzctl.c | 1 + 4 files changed, 28 insertions(+), 3 deletions(-) diff --git a/man/vzctl.8.in b/man/vzctl.8.in index 20a1856..e233696 100644 --- a/man/vzctl.8.in +++ b/man/vzctl.8.in @@ -871,6 +871,8 @@ List of available fields can be obtained using \fB-L\fR option. .OP --ipadd addr .OP --hostname name .OP --name name +.OP --local_uid uid +.OP --local_gid gid .YS .IP 4 Creates a new container area. This operation should be done once, before @@ -922,6 +924,21 @@ a container. Note that this option can be used multiple times. You can use \fB--hostname\fR \fIname\fR option to set a host name for a container. + +When running with an upstream Linux Kernel that supports user namespaces (= +3.8), the parameters \fB--local_uid\fR and \fB--local_gid\fR can be used to +select which \fIuid\fR and \fIgid\fR respectively will be used as a base user +in the host system. Note that user namespaces provide a 1:1 mapping between +container users and host users. If these options are not specified, the values +\fBLOCAL_UID\fR and \fBLOCAL_GID\fR from global configuration file +\fBvz.conf\fR(5) are used. An explicit \fB--local_uid\fR value of 0 will +disable user namespace support, and run the container as a privileged user. In +this case, \fB--local_gid\fR is ignored. + +\fBWarning:\fR use \fB--local_uid\fR and \fB--local_gid\fR with care, specially +when migrating containers. In all situations, the container's files in the +filesystem needs to be correctly owned by the host-side users. + .IP \fBdestroy\fR | \fBdelete\fR \fICTID\fR 4 Removes a container private area by deleting all files, directories and the configuration file of this container. diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index f4d8f48..9b2e929 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -418,7 +418,7 @@ static int ct_env_create(struct arg_start *arg) clone_flags |= CLONE_NEWNET|CLONE_NEWNS; if (!arg-h-can_join_userns) { - logger(-1, 0, WARNING: Running container unprivileged. USER_NS not supported); + logger(-1, 0, WARNING: Running container unprivileged. USER_NS not supported, or runtime disabled); userns_p[0] = userns_p[1] = -1; } else { @@ -797,7 +797,12 @@ int ct_do_open(vps_handler *h, vps_param *param, vps_param *cmd) char path[STR_SIZE]; char upath[STR_SIZE]; struct stat st; - unsigned long *local_uid = param-res.misc.local_uid; + unsigned long *local_uid; + + /* Command line takes precedence, but if it is unset, check global file */ + local_uid = cmd-res.misc.local_uid; + if (!local_uid) + local_uid = param-res.misc.local_uid; ret = container_init(); if (ret) { @@ -838,7 +843,7 @@ int ct_do_open(vps_handler *h, vps_param *param, vps_param *cmd) * mapped user to own the files, etc. So we also need to find suitable * configuration in the config files. */ - h-can_join_userns = !stat(upath, st) local_uid; + h-can_join_userns = !stat(upath, st) local_uid (*local_uid != 0); h-is_run = ct_is_run; h-enter = ct_enter; h-destroy = ct_destroy; diff --git a/src/vzctl-actions.c b/src/vzctl-actions.c index 7bf15c9..044a0ad 100644 --- a/src/vzctl-actions.c +++ b/src/vzctl-actions.c @@ -391,6 +391,8 @@ static int parse_create_opt(envid_t veid, int argc, char **argv, {ve_layout, required_argument, NULL, PARAM_VE_LAYOUT}, {velayout,required_argument, NULL, PARAM_VE_LAYOUT}, {diskspace, required_argument, NULL, PARAM_DISKSPACE}, + {local_uid, required_argument, NULL, PARAM_LOCAL_UID}, + {local_gid, required_argument, NULL, PARAM_LOCAL_GID}, { NULL, 0, NULL, 0 } }; diff --git a/src/vzctl.c b/src/vzctl.c index 359bcde..54d66d1 100644 --- a/src/vzctl.c +++ b/src/vzctl.c @@ -65,6 +65,7 @@ static void usage(int rc) vzctl create ctid [--ostemplate name] [--config name]\n [--layout ploop|simfs] [--hostname name] [--name name] [--ipadd addr]\n [--diskspace kbytes] [--private path] [--root path]\n + [--local_uid UID] [--local_gid GID]\n vzctl start ctid [--force] [--wait]\n vzctl destroy | mount | umount | stop | restart | status ctid\n #ifdef HAVE_PLOOP -- 1.7.11.7
[Devel] [PATCH v4 5/7] modify tar extraction to account for user namespace
From: Glauber Costa glom...@parallels.com If we are running upstream with user namespaces, we need to create the container filesystem not with the ownership preserved, but reflecting the mapping we need to apply. Note that according to our documentation, we should ignore this if the user explicitly requested an uid mapping of 0 (gid is ignored in this case). Our tooling doesn't allow any easy way to unpack a whol distribution with offsets mechanically applied like this. We could do the whole unpacking in a user namespace itself, but that does not come without problems on its own (for instance, we won't be able to create any device files, we have to carefully adjust permissions in the root directory, etc) To work around that, we can employ a trick to allow container creation right now, as well as to avoid compatibility problems: we will resort to LD_PRELOAD to load a schim that captures calls to the chown family of system calls and applies the offset manually. Signed-off-by: Glauber Costa glom...@parallels.com --- include/res.h | 6 scripts/vps-create.in | 14 src/lib/Makefile.am | 3 ++ src/lib/chown_preload.c | 93 + src/lib/create.c| 25 ++--- vzctl.spec | 2 +- 6 files changed, 137 insertions(+), 6 deletions(-) create mode 100644 src/lib/chown_preload.c diff --git a/include/res.h b/include/res.h index 0dfacf7..047593a 100644 --- a/include/res.h +++ b/include/res.h @@ -47,6 +47,12 @@ struct env_param { }; typedef struct env_param env_param_t; +/* + * When running upstream kernels, we will need host-side UID and GID. Those + * are configurable, but if none is specified, those are used. + */ +#define VZ_DEFAULT_UID 10 +#define VZ_DEFAULT_GID 10 typedef struct { list_head_t userpw; list_head_t nameserver; diff --git a/scripts/vps-create.in b/scripts/vps-create.in index 126f048..a1b3382 100755 --- a/scripts/vps-create.in +++ b/scripts/vps-create.in @@ -22,11 +22,22 @@ # Required parameters: # VE_PRVT- path to root of CT private areas # PRIVATE_TEMPLATE - path to private template used as a source for copying +# +# Optional parameters: +# UID_OFFSET - offset to be added to all tar UIDs +# GID_OFFSET - offset to be added to all tar GIDs . @SCRIPTDIR@/vps-functions vzcheckvar VE_PRVT PRIVATE_TEMPLATE +chown_preload_if_needed() +{ + [ -z $UID_OFFSET -o -z $GID_OFFSET ] return + + export LD_PRELOAD=libvzchown.so +} + create_prvt() { local TMP AVAIL NEEDED HEADER OPT @@ -75,6 +86,9 @@ create_prvt() [ $AVAIL -ge $NEEDED ] || vzerror Insufficient disk space in $VE_PRVT; available: $AVAIL, needed: $NEEDED ${VZ_FS_NO_DISK_SPACE} CAT=cat + + chown_preload_if_needed + # Use pv to show nice progress bar if we can pv -V /dev/null 21 CAT=pv chmod 700 $VE_PRVT diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am index b009d15..34c463b 100644 --- a/src/lib/Makefile.am +++ b/src/lib/Makefile.am @@ -72,6 +72,9 @@ libvzctl_la_LIBADD = $(XML_LIBS) $(CGROUP_LIBS) $(DL_LIBS) if HAVE_CGROUP libvzctl_la_SOURCES += cgroup.c hooks_ct.c + +lib_LTLIBRARIES += libvzchown.la +libvzchown_la_SOURCES = chown_preload.c endif if HAVE_VZ_KERNEL diff --git a/src/lib/chown_preload.c b/src/lib/chown_preload.c new file mode 100644 index 000..4e2be4a --- /dev/null +++ b/src/lib/chown_preload.c @@ -0,0 +1,93 @@ +/* + * Copyright (C) 2013, Parallels, Inc. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Authors: Kir Kolyshkin and Glauber Costa + */ + +#include unistd.h +#include stdio.h +#include stdlib.h + +#include dlfcn.h + +static int (*real_chown)(const char *path, uid_t owner, gid_t group) = NULL; +static int (*real_fchown)(int fd, uid_t owner, gid_t group) = NULL; +static int (*real_lchown)(const char *path, uid_t owner, gid_t group) = NULL; +static int (*real_fchownat)(int dirfd, const char *pathname, + uid_t owner, gid_t group, int flags) = NULL; + +uid_t uid_offset = 0; +gid_t gid_offset = 0; + +static void __init(void) +{ + char *uidstr, *gidstr; + + uidstr = getenv(UID_OFFSET); + gidstr
[Devel] [PATCH v4 6/7] automatically add bridge venet0 when needed
From: Glauber Costa glom...@parallels.com The chosen architecture to deal with --ipadd with upstream containers is to create a veth pair and add the host side information to a bridge called venet0. This way, all the code that expects venet0 to exist can still work without modifications, (or with just a few). Our intention to do that was actually already stated in the comments, but the code was removed before merging because --ipadd would not work without full unshare support anyway. This patch implements that. Signed-off-by: Glauber Costa glom...@parallels.com --- scripts/vps-functions.in | 7 +++ src/lib/hooks_ct.c | 37 +++-- 2 files changed, 42 insertions(+), 2 deletions(-) diff --git a/scripts/vps-functions.in b/scripts/vps-functions.in index 826c0a1..37b2de5 100755 --- a/scripts/vps-functions.in +++ b/scripts/vps-functions.in @@ -170,6 +170,13 @@ vzadjustmacs() # other setups, the bridge is expected to already exist and be valid. vzconfbridge() { + if [ x$BRIDGE == xvenet0 ]; then + if [ $(brctl show venet0 2/dev/null | tail -n+2 | wc -l) == 0 ]; then + brctl addbr venet0 + ${IP_CMD} link set venet0 up + fi + fi + if [ x$BRIDGE != x ]; then brctl addif $BRIDGE $HNAME /dev/null 21 fi diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index 9b2e929..8695985 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -17,6 +17,7 @@ #include logger.h #include script.h #include cgroup.h +#include linux/vzctl_venet.h #define NETNS_RUN_DIR /var/run/netns @@ -725,8 +726,40 @@ static int ct_netdev_ctl(vps_handler *h, envid_t veid, int op, char *name) static int ct_ip_ctl(vps_handler *h, envid_t veid, int op, const char *ipstr) { - logger(-1, 0, %s not yet supported upstream, __func__); - return 0; + int ret = -1; + char *envp[5]; + char buf[STR_SIZE]; + int i = 0; + + if (!h-can_join_pidns) { + logger(-1, 0, Cannot join pid namespace: + --ipadd is not supported in kernels without full pidns support); + return VZ_BAD_KERNEL; + } + envp[i++] = strdup(VNAME=venet0); + envp[i++] = strdup(BRIDGE=venet0); + + snprintf(buf, sizeof(buf), HNAME=venet0.%d, veid); + envp[i++] = strdup(buf); + + snprintf(buf, sizeof(buf), VEID=%d, veid); + envp[i++] = strdup(buf); + + envp[i] = NULL; + + if (op == VE_IP_ADD) { + char *argv[] = { VPS_NETNS_DEV_ADD, NULL }; + + ret = run_script(VPS_NETNS_DEV_ADD, argv, envp, 0); + } else { + char *argv[] = { VPS_NETNS_DEV_DEL, NULL }; + + ret = run_script(VPS_NETNS_DEV_DEL, argv, envp, 0); + } + free_arg(envp); + + return ret; + } /* -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v4 7/7] allow for distro-specific fix ups at creation time.
From: Glauber Costa glom...@parallels.com We will need that infrastucture when running with Linux upstream, since some support is very unlikely to ever land in the Kernel. We need to do things like account for the fact that udev may kick in and destroy all the setup we have done for /dev. Since preemptively preventing those actions to happen is very hard ( as an example, centos6 init binary will issue mount syscalls itself), the most robust approach is to insert a script that will be run shortly after rc.sysinit. Although this can't guarantee that it will run *immediately* after, it should be early enough to undo every harmful operation done by /sbin/init and rc.sysinit, therefore allowing operation to continue freely. Signed-off-by: Glauber Costa glom...@parallels.com --- etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 +++ include/dist.h | 2 ++ include/env.h | 3 ++- src/lib/dist.c | 10 +- src/lib/env.c | 10 ++ src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 38 ++ 8 files changed, 102 insertions(+), 7 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf index 727461d..49226c5 100644 --- a/etc/dists/redhat.conf +++ b/etc/dists/redhat.conf @@ -25,3 +25,4 @@ SET_DNS=set_dns.sh SET_USERPASS=set_userpass.sh SET_UGID_QUOTA=set_ugid_quota.sh POST_CREATE=postcreate.sh +CT_FIXUP=fixups.sh diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh new file mode 100755 index 000..da3fd24 --- /dev/null +++ b/etc/dists/scripts/fixups.sh @@ -0,0 +1,43 @@ +#!/bin/bash +# Copyright (C) 2000-2009, Parallels, Inc. All rights reserved. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +# +# +# Run-time fixup for legacy distributions that will rely on udevd to populate +# their /dev. Those distributions will effectively mount an empty tmpfs on top +# of /dev, defeating any setup we may have done. This includes RHEL-based +# distributions up to RHEL6. We are better off not allowing that to run at all + +function fixup_udev() +{ + if [ -f /etc/fedora-release ]; then + return; + fi + + # All the first two may fail if the distribution didn't actually mount + # then. + umount /dev/pts + umount /dev/shm + # still may hold references to files open by rc.sysinit. Even if this is the + # case those will be simple things like /dev/null and should go away shortly. + # let us not fail because of that. + umount /dev -l +} + +fixup_udev + +exit 0 + diff --git a/include/dist.h b/include/dist.h index 1f6a5a6..4faa452 100644 --- a/include/dist.h +++ b/include/dist.h @@ -29,6 +29,7 @@ #defineSET_USERPASS5 #defineSET_UGID_QUOTA 6 #definePOST_CREATE 7 +#defineCT_FIXUP8 typedef struct { char *def_ostmpl; @@ -46,6 +47,7 @@ typedef struct dist_actions { char *set_userpass; /** setup user password. */ char *set_ugid_quota; /** setup 2level quota. */ char *post_create; /** sostcreate actions. */ + char *ct_fixup; /** run-time fixups. */ } dist_actions; /* Read distribution specific actions configuration file. diff --git a/include/env.h b/include/env.h index 35d427f..223c3a3 100644 --- a/include/env.h +++ b/include/env.h @@ -106,7 +106,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param *param, int stop_mode, int vz_chroot(const char *root); int vz_env_create(vps_handler *h, envid_t veid, vps_res *res, int wait_p[2], int old_wait_p[2], int err_p[2], - env_create_FN fn, void *data); + env_create_FN fn, dist_actions *actions, void *data); struct vps_res; struct arg_start { @@ -118,6 +118,7 @@ struct arg_start { vps_handler *h; void *data; env_create_FN fn; + dist_actions *actions; int userns_p; /* while running in userns, there's extra sync needed */ }; diff --git a/src/lib/dist.c b/src/lib/dist.c index bde94ab..8047cf7 100644 --- a/src/lib/dist.c +++ b/src/lib/dist.c @@ -36,7 +36,8
[Devel] [PATCH v4 1/7] user namespace support for upstream containers
From: Glauber Costa glom...@parallels.com This patch allows the execution of unprivileged containers running ontop of an upstream Linux Kernel. We will run at whatever UID is found in the configuration file (so far empty, thus disabled). Signed-off-by: Glauber Costa glom...@parallels.com --- include/env.h | 1 + include/types.h| 1 + src/lib/hooks_ct.c | 256 +++-- 3 files changed, 249 insertions(+), 9 deletions(-) diff --git a/include/env.h b/include/env.h index 06729f6..4fef438 100644 --- a/include/env.h +++ b/include/env.h @@ -116,6 +116,7 @@ struct arg_start { vps_handler *h; void *data; env_create_FN fn; + int userns_p; /* while running in userns, there's extra sync needed */ }; struct env_create_param3; diff --git a/include/types.h b/include/types.h index a4bce73..fa511aa 100644 --- a/include/types.h +++ b/include/types.h @@ -95,6 +95,7 @@ typedef struct vps_handler { int vzfd; /** /dev/vzctl file descriptor. */ int stdfd; int can_join_pidns; /* can't enter otherwise */ + int can_join_userns; /* can't run non privileged otherwise */ int (*is_run)(struct vps_handler *h, envid_t veid); int (*enter)(struct vps_handler *h, envid_t veid, const char *root, int flags); int (*destroy)(struct vps_handler *h, envid_t veid); diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index e46c1df..8738eeb 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -66,6 +66,8 @@ static int sys_setns(int fd, int nstype) # define MS_PRIVATE (1 18) #endif +#define UID_GID_RANGE 10 /* how many users per container */ + /* This function is there in GLIBC, but not in headers */ extern int pivot_root(const char * new_root, const char * put_old); @@ -203,16 +205,154 @@ static int ct_setlimits(vps_handler *h, envid_t veid, struct ub_struct *ub) } #undef add_value +static int write_uid_gid_mapping(vps_handler *h, unsigned long uid, unsigned long gid, pid_t pid) +{ + char buf[STR_SIZE]; + char map[STR_SIZE]; + int fd; + int len; + int ret; + + len = snprintf(map, sizeof(map), 0 %ld %d, uid, UID_GID_RANGE); + snprintf(buf, sizeof(buf), /proc/%d/uid_map, pid); + if ((fd = open(buf, O_WRONLY)) 0) + goto out; + + if ((write(fd, map, len) != len)) + goto out; + + close(fd); + + len = snprintf(map, sizeof(map), 0 %ld %d, gid, UID_GID_RANGE); + snprintf(buf, sizeof(map), /proc/%d/gid_map, pid); + if ((fd = open(buf, O_WRONLY)) 0) + goto out; + + if ((write(fd, map, len) != len)) + goto out; + ret = 0; +out: + if (fd 0) + close(fd); + return ret; +} + +/* + * Those devices should exist in the container, and be valid device nodes with + * user access permission. But we need to be absolutely sure this is the case, + * so we will provide our own versions. That could actually happen since some + * distributions may come with emptied /dev's, waiting for udev to populate them. + * That won't happen, we do it ourselves. + */ +static void create_devices(vps_handler *h, envid_t veid, const char *root) +{ + unsigned int i; + char *devices[] = { + /dev/null, + /dev/zero, + /dev/random, + /dev/urandom, + }; + + /* +* We will tolerate errors, and keep the container running, because it is +* likely we will be able to boot it to a barely functional state. But +* be vocal about it +*/ + for (i = 0; i ARRAY_SIZE(devices); i++) { + char ct_devname[STR_SIZE]; + int ret; + + ret = snprintf(ct_devname, sizeof(ct_devname), %s/%s, root, devices[i]); + if (ret 0) { + logger(-1, errno, Could not allocate device string); + continue; + } + + /* +* No need to be crazy about file flags. When we bind mount, the +* source permissions will be inherited. +*/ + ret = open(ct_devname, O_RDWR|O_CREAT, 0); + if (ret 0) { + logger(-1, errno, Could not touch device %s, devices[i]); + continue; + } + close(ret); + + ret = mount(devices[i], ct_devname, , MS_BIND, 0); + if (ret 0) + logger(-1, errno, Could not bind mount device %s, devices[i]); + } +} + static int _env_create(void *data) { struct arg_start *arg = data; struct env_create_param3 create_param; int ret; - if ((ret = ct_chroot(arg-res-fs.root))) + if ((arg-userns_p != -1) (read(arg-userns_p, ret, sizeof(ret)) != sizeof(ret))) { + logger(-1, errno, Cannot read
[Devel] [PATCH v4 2/7] add user mismatch test
From: Glauber Costa glom...@parallels.com In theory, we won't be able to run if our private area is not owned by ourselves. We could, if it have very wide open security permissions, but we should never set up a container like that. Aside from a basic sanity check, this is intended to catch problems for the few people who may have already created containers that will be owned by root:root, and will now try to run it unprivileged. Signed-off-by: Glauber Costa glom...@parallels.com --- src/lib/env.c | 16 1 file changed, 16 insertions(+) diff --git a/src/lib/env.c b/src/lib/env.c index 898bfc8..96400c0 100644 --- a/src/lib/env.c +++ b/src/lib/env.c @@ -30,6 +30,7 @@ #include linux/reboot.h #include sys/mount.h #include sys/utsname.h +#include sys/stat.h #include vzerror.h #include res.h @@ -554,6 +555,21 @@ int vps_start_custom(vps_handler *h, envid_t veid, vps_param *param, logger(-1, 0, Container is already running); return VZ_VE_RUNNING; } + if (!is_vz_kernel(h) h-can_join_userns) { + struct stat private_stat; + unsigned long *local_uid = res-misc.local_uid; + unsigned long *local_gid = res-misc.local_gid; + + stat(res-fs.private, private_stat); + if ((local_uid (private_stat.st_uid != *local_uid)) || + (local_gid (private_stat.st_gid != *local_gid))) { + logger(-1, 0, Container private area is owned by %d:%d + , but configuration file says we should run with %lu:%lu.\n + Refusing to run., private_stat.st_uid, private_stat.st_gid, + *res-misc.local_uid, *res-misc.local_gid); + return VZ_FS_BAD_TMPL; + } + } if ((ret = check_ub(h, res-ub))) return ret; -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v4 0/7] User namespace support for upstream containers
Kir, In this patchset, I hope to be addressing all your concerns. It is a bit massive code, so if is there still anything that you would me to change, don't shy away. I will go back to it ASAP. That said, I am incorporating your comments on the previous PATCH 4 (now 1), and adding another one, that also passes cmd_p to open. This is being sent as an incremental patch since you said you have already merged up to PATCH 3 (I still don't see that in git). But if you haven't feel free to fold it. With this patch, we are fully able to get rid of the default value during create, and yet test for userns presence very early during creation time. I think this very much clarifies our handling of the command line option. The documentation is also changed as you requested, for consistency. Glauber Costa (7): user namespace support for upstream containers add user mismatch test Also pass cmd_p pointer to container open allow local uid and gid to be specified at container creation modify tar extraction to account for user namespace automatically add bridge venet0 when needed allow for distro-specific fix ups at creation time. etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 ++ include/dist.h | 2 + include/env.h | 12 +- include/res.h | 6 + include/types.h | 1 + man/vzctl.8.in | 17 +++ scripts/vps-create.in | 14 ++ scripts/vps-functions.in| 7 + src/lib/Makefile.am | 3 + src/lib/chown_preload.c | 93 src/lib/create.c| 25 +++- src/lib/dist.c | 10 +- src/lib/env.c | 33 - src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 338 ++-- src/lib/hooks_vz.c | 2 +- src/vzctl-actions.c | 4 +- src/vzctl.c | 1 + vzctl.spec | 2 +- 20 files changed, 583 insertions(+), 33 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh create mode 100644 src/lib/chown_preload.c -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v3 7/9] modify tar extraction to account for user namespace
On 05/11/2013 03:53 AM, Igor M Podlesny wrote: On 30 April 2013 13:16, Glauber Costa glom...@openvz.org wrote: From: Glauber Costa glom...@parallels.com To work around that, we can employ a trick to allow container creation right now, as well as to avoid compatibility problems: we will resort to LD_PRELOAD to load a schim that captures calls to the chown family of system calls and applies the offset manually. Signed-off-by: Glauber Costa glom...@parallels.com [...] +chown_preload_if_needed() +{ + [ -z $UID_OFFSET -o -z $GID_OFFSET ] return + + export LD_PRELOAD=libvzchown.so LD_PRELOAD can be stacked (say, LD_PRELOAD=lib1:lib2:lib3) and in case it was already set, it's better be kept. Indeed, it is better. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 3/3] hooks_ct: fix ub limits setting for upstream containers
On 05/11/2013 05:07 AM, Igor M Podlesny wrote: On 30 April 2013 11:46, Glauber Costa glom...@openvz.org wrote: Currently, vps_setup_res() have an explicit test for state != VPS_STARTING before applying beancounter limits. This means that we can set limits without further problems when the container is running, but will fail to do so when the container is starting. In fact, hooks_vz have an explicit call for it in its startup function. We should do the same, and call our version of setlimits ourselves when the container is coming up. Signed-off-by: Glauber Costa glom...@openvz.org --- src/lib/hooks_ct.c | 147 +++-- 1 file changed, 76 insertions(+), 71 deletions(-) diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index 63af536..3cfb6e9 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -153,6 +153,77 @@ static int _env_create(void *data) return exec_container_init(arg, create_param); } +#define add_value(val, var, mult) do { if (val) { var = *val * mult; } } while (0) Aren't macroses supposed to be UPPER CASE named? Yes, and for a reason: macros behave very differently than functions, and a convention to allow us to spot them and identify them as macros very easily is beneficial. This particular macro is defined right before the function, and undefined right after it, so it should be extremely clear to every reader that this is a macro. In this case, I don't see a reason for being so strict. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v3 4/9] user namespace support for upstream containers
On 05/11/2013 04:14 AM, Igor M Podlesny wrote: On 30 April 2013 13:16, Glauber Costa glom...@openvz.org wrote: @@ -576,7 +765,9 @@ int ct_do_open(vps_handler *h, vps_param *param) { int ret; char path[STR_SIZE]; + char upath[STR_SIZE]; struct stat st; + unsigned long *local_uid = param-res.misc.local_uid; ret = container_init(); if (ret) { @@ -592,6 +783,9 @@ int ct_do_open(vps_handler *h, vps_param *param) if (snprintf(path, sizeof(path), /proc/%d/ns/pid, getpid()) 0) return VZ_RESOURCE_ERROR; + if (snprintf(upath, sizeof(upath), /proc/%d/ns/user, getpid()) 0) + return VZ_RESOURCE_ERROR; + It might have sense to compare return code not with 0, but sizeof(upath), since ... The functions snprintf() and vsnprintf() do not write more than size bytes (including the terminating null byte ('\0')). If the output was truncated due to this limit then the return value is the number of characters (excluding the terminating null byte) which would have been written to the final string if enough space had been available. Thus, a return value of size or more means that the output was truncated. ... Doesn't matter that much, because this situation is impossible, given that getpid() will always return a bounded value and we know the rest of the string. So it either fails, or it succeeds in its fullest. But that is all the same to me. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 3/3] hooks_ct: fix ub limits setting for upstream containers
On 05/13/2013 12:11 PM, Igor M Podlesny wrote: On 13 May 2013 16:11, Glauber Costa glom...@parallels.com wrote: On 05/13/2013 12:08 PM, Igor M Podlesny wrote: On 13 May 2013 15:50, Glauber Costa glom...@parallels.com wrote: [...] Aren't macroses supposed to be UPPER CASE named? Yes, and for a reason: macros behave very differently than functions, and a convention to allow us to spot them and identify them as macros very easily is beneficial. This particular macro is defined right before the function, and undefined right after it, so it should be extremely clear to every reader that this is a macro. In this case, I don't see a reason for being so strict. As to me it's a nice opportunity to fix this along. Fix what? Fix lower_cases exception for MACRO putting CC back The way I see it, there is absolutely nothing to fix here. As I explained above, I believe coding style exists for a reason, and call for consistency to help one derive things about unknown code easier. This is a local macro, only exist within the scope of this function, it is use-once-and-forget, and won't be available anywhere else in the code. In this particular case, the only difference between the existing and proposed version is that one is easier to type, and the other is harder. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 0/3] Fixes for upstream containers
Kir, Please find attached simple fixes for upstream containers. The first one is a compiler warning for recently introduced code, while the last two together fixes bugs with parsing and applying beancounter configuration. Patch #2 fixes a bug only present in versions compiled with the --without-vz option, while Patch #3 fixes a bug present in all versions. Thanks Glauber Costa (3): hooks_ct: fix gcc warning ub: compile ub support for non vz kernels as well hooks_ct: fix ub limits setting for upstream containers include/ub.h| 40 -- src/lib/Makefile.am | 3 +- src/lib/hooks_ct.c | 149 +++- src/lib/ub.c| 13 +++-- 4 files changed, 87 insertions(+), 118 deletions(-) -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 1/3] hooks_ct: fix gcc warning
Signed-off-by: Glauber Costa glom...@openvz.org --- src/lib/hooks_ct.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index 03a18e7..63af536 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -247,7 +247,7 @@ static int ct_enter(vps_handler *h, envid_t veid, const char *root, int flags) if (dp == NULL) return VZ_RESOURCE_ERROR; - if (err = container_add_task(veid)) { + if ((err = container_add_task(veid))) { logger(-1, 0, Can't add task creator to container: %s, container_error(err)); goto out; } -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 2/3] ub: compile ub support for non vz kernels as well
From: Glauber Costa glom...@parallels.com Commit c9d9170b0 fixed a bug by not including the ub functions if VZ support was not present in the running kernel, and replacing them with empty stubs. This approach, however, proved to be too aggressive. We need to at least be able to read and write the ub configuration from our conf files. The solution proposed in this patch is to rework that so only the actual beancounter operations are compiled out. Signed-off-by: Glauber Costa glom...@openvz.org --- include/ub.h| 40 src/lib/Makefile.am | 3 ++- src/lib/ub.c| 13 - 3 files changed, 10 insertions(+), 46 deletions(-) diff --git a/include/ub.h b/include/ub.h index d2779f8..9ffc66b 100644 --- a/include/ub.h +++ b/include/ub.h @@ -140,7 +140,6 @@ struct ub_struct { }; typedef struct ub_struct ub_param; -#ifdef VZ_KERNEL_SUPPORTED /** Apply UBC resources. * * @param hCT handler. @@ -178,43 +177,4 @@ void add_ub_limit(struct ub_struct *ub, int res_id, unsigned long *limit); void free_ub_param(ub_param *ub); void merge_ub(ub_param *dst, ub_param *src); int is_vswap_config(const ub_param *param); -#else /* ! VZ_KERNEL_SUPPORTED */ -static inline int vps_set_ublimit(vps_handler *h, envid_t veid, ub_param *ubc) -{ - return VZ_BAD_KERNEL; -} -static inline int set_ublimit(vps_handler *h, envid_t veid, ub_param *ubc) -{ - return VZ_BAD_KERNEL; -} -static inline int check_ub(vps_handler *h, ub_param *ub) -{ - return VZ_BAD_KERNEL; -} -static inline int add_ub_param(ub_param *ub, ub_res *res) -{ - return VZ_BAD_KERNEL; -} -static inline int vps_read_ubc(envid_t veid, ub_param *ub) -{ - return VZ_BAD_KERNEL; -} -static inline int get_ub_resid(char *name) -{ - return VZ_BAD_KERNEL; -} -static inline void add_ub_limit(struct ub_struct *ub, int res_id, unsigned long *limit) -{ -} -static inline void free_ub_param(ub_param *ub) -{ -} -static inline void merge_ub(ub_param *dst, ub_param *src) -{ -} -static inline int is_vswap_config(const ub_param *param) -{ - return 0; -} -#endif /* VZ_KERNEL_SUPPORTED */ #endif diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am index 630e957..b009d15 100644 --- a/src/lib/Makefile.am +++ b/src/lib/Makefile.am @@ -60,6 +60,7 @@ libvzctl_la_SOURCES = bitmap.c \ util.c \ veth.c \ vps_configure.c \ + ub.c \ vzfeatures.c if HAVE_PLOOP @@ -74,5 +75,5 @@ libvzctl_la_SOURCES += cgroup.c hooks_ct.c endif if HAVE_VZ_KERNEL -libvzctl_la_SOURCES += ub.c hooks_vz.c +libvzctl_la_SOURCES += hooks_vz.c endif diff --git a/src/lib/ub.c b/src/lib/ub.c index f850ee3..0f9e153 100644 --- a/src/lib/ub.c +++ b/src/lib/ub.c @@ -33,11 +33,6 @@ #include vzsyscalls.h #include util.h -static inline int setublimit(uid_t uid, unsigned long resource, - const unsigned long *rlim) -{ - return syscall(__NR_setublimit, uid, resource, rlim); -} static struct ubname2id { char *name; @@ -164,6 +159,13 @@ int get_ub_resid(char *name) return -1; } +#ifdef VZ_KERNEL_SUPPORTED +static inline int setublimit(uid_t uid, unsigned long resource, + const unsigned long *rlim) +{ + return syscall(__NR_setublimit, uid, resource, rlim); +} + static const char *get_ub_name(unsigned int res_id) { int i; @@ -226,6 +228,7 @@ if (res != NULL) { \ return 0; } +#endif /** Apply UBC resources. * -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 3/3] hooks_ct: fix ub limits setting for upstream containers
Currently, vps_setup_res() have an explicit test for state != VPS_STARTING before applying beancounter limits. This means that we can set limits without further problems when the container is running, but will fail to do so when the container is starting. In fact, hooks_vz have an explicit call for it in its startup function. We should do the same, and call our version of setlimits ourselves when the container is coming up. Signed-off-by: Glauber Costa glom...@openvz.org --- src/lib/hooks_ct.c | 147 +++-- 1 file changed, 76 insertions(+), 71 deletions(-) diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index 63af536..3cfb6e9 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -153,6 +153,77 @@ static int _env_create(void *data) return exec_container_init(arg, create_param); } +#define add_value(val, var, mult) do { if (val) { var = *val * mult; } } while (0) + +static int ct_setlimits(vps_handler *h, envid_t veid, struct ub_struct *ub) +{ + unsigned long tcp = 0; + unsigned long kmem = 0; + unsigned long kmemall = 0; + unsigned long mem = 0; + unsigned long swap = 0; + int pagesize = sysconf(_SC_PAGESIZE); + + add_value(ub-physpages, mem, pagesize); + add_value(ub-tcpsndbuf, tcp, 1); + add_value(ub-tcprcvbuf, tcp, 1); + add_value(ub-swappages, swap, pagesize); + + /* +* OpenVZ beancounters traditionally acconted objects. Also, we could +* always get a very high granularity about which objects we are +* tracking. Our attempt in this implementation is to translate the +* historical beancounters into something that makes sense given the +* underlying Linux infrastructure, and provide something that would +* allow for more or less the kind of protection the user asked for. A +* 1:1 mapping, however, is not possible - and will never be. +* +* Upstream Linux cgroup controllers went in a very different +* direction. First, resources tend to be viewed in its entirety. We +* have entities like memory, or kernel memory, instead of a list +* of all internal structures like dentry, siginfo, etc. For network +* buffers, we can specify the total buffer memory instead of send and +* receive buffers, etc. +* +* Also, all accounting is done in pages, not in objects - which is the +* only thing that makes sense if the accounting is done in an +* aggregate manner. We don't really know the size of those +* structures, so we use an estimate to get a value in pages. This is +* not a stable API of the kernel, so it is bound to change. +* +* Here is the size in bytes of the following structs, in Linux 3.4: +* +* dcache: 248, siginfo: 128, sock: 1072, task 8128 +*/ + #define DCACHE 248 + #define SIGINFO 128 + #define SOCK 1072 + + add_value(ub-kmemsize, kmem, 1); + add_value(ub-dcachesize, kmemall, DCACHE); + add_value(ub-numtcpsock, kmemall, SOCK); + add_value(ub-numsiginfo, kmemall, SIGINFO); + add_value(ub-numothersock, kmemall, SOCK); + add_value(ub-othersockbuf, kmemall, 1); + add_value(ub-numproc, kmemall, 2 * pagesize); + add_value(ub-dgramrcvbuf, kmemall, SOCK); + + if (mem) + container_apply_config(veid, MEMORY, mem); + if (tcp) + container_apply_config(veid, TCP, tcp); + + kmem = max_ul(kmem, kmemall); + if (kmem) + container_apply_config(veid, KMEMORY, kmem); + + if (swap) + container_apply_config(veid, SWAP, swap); + + return 0; +} +#undef add_value + static int ct_env_create(struct arg_start *arg) { @@ -187,6 +258,11 @@ static int ct_env_create(struct arg_start *arg) return VZ_RESOURCE_ERROR; } + if ((ret = ct_setlimits(arg-h, arg-veid, arg-res-ub))) { + logger(-1, 0, Could not apply container limits: %s, container_error(ret)); + return VZ_RESOURCE_ERROR; + } + if ((ret = container_add_task(arg-veid))) { logger(-1, 0, Can't add task creator to container: %s, container_error(ret)); return VZ_RESOURCE_ERROR; @@ -289,77 +365,6 @@ out: return ret; } -#define add_value(val, var, mult) do { if (val) { var = *val * mult; } } while (0) - -static int ct_setlimits(vps_handler *h, envid_t veid, struct ub_struct *ub) -{ - unsigned long tcp = 0; - unsigned long kmem = 0; - unsigned long kmemall = 0; - unsigned long mem = 0; - unsigned long swap = 0; - int pagesize = sysconf(_SC_PAGESIZE); - - add_value(ub-physpages, mem, pagesize); - add_value(ub-tcpsndbuf, tcp, 1); - add_value(ub-tcprcvbuf, tcp, 1); - add_value(ub-swappages, swap, pagesize
[Devel] [PATCH] cgroups: fix set command with beancounters upstream
The kernel memory controller cannot flip states from unlimited to limited if there are already tasks in it. Therefore, we always have to run with *some* value of kmem enabled. If we don't do it, we can't start unlimited and then use the set command to set any beancounters. We write the maximum amount minus two pages, which should effectively mean accounting turned on, but unlimited Signed-off-by: Glauber Costa glom...@openvz.org --- src/lib/cgroup.c | 13 + 1 file changed, 13 insertions(+) diff --git a/src/lib/cgroup.c b/src/lib/cgroup.c index 9185d46..ae7fe5c 100644 --- a/src/lib/cgroup.c +++ b/src/lib/cgroup.c @@ -56,6 +56,19 @@ static int controller_apply_config(struct cgroup *ct, struct cgroup *parent, } else if (!strcmp(memory, name)) { if ((ret = cgroup_set_value_string(controller, memory.use_hierarchy, 1))) return ret; + /* +* The kernel memory controller cannot flip states from +* unlimited to limited if there are already tasks in it. +* Therefore, we always have to run with *some* value of kmem +* enabled. If we don't do it, we can't start unlimited and +* then use the set command to set any beancounters. We write +* The maximum amount minus two pages, which should effectively +* mean accounting turned on, but unlimited. This will fail +* if the kmem controller is not present, but that is okay. +*/ + cgroup_set_value_string(controller, memory.kmem.limit_in_bytes, + 9223372036854767712); + } else if (!strcmp(devices, name)) { if ((ret = cgroup_set_value_string(controller, devices.deny, a))) return ret; -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v3 0/9] Upstream Linux support for userns
Kir, Please review the following patchset. The main difference from last version is that we support running with userns disabled even if it is present. This effectively means that containers that were already created and owned by root will keep working. It is also possible to explicitly disable it at container creation by setting local_uid to 0. There are also some bugfixes and changes according to the review you provided. Running a container works, and vzctl enter works as well. Most pressing, is the fact that although this patchset finally implements --ipadd (now all infrastructure is in place). Glauber Costa (9): host uid and gid parameters adjust fs_create parameter pass parameters to open user namespace support for upstream containers add user mismatch test allow local uid and gid to be specified at container creation modify tar extraction to account for user namespace automatically add bridge venet0 when needed allow for distro-specific fix ups at creation time. etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 +++ etc/vz.conf.in | 4 + include/dist.h | 2 + include/env.h | 10 +- include/res.h | 8 ++ include/types.h | 1 + include/vzctl_param.h | 3 + man/vzctl.8.in | 16 +++ scripts/vps-create.in | 14 +++ scripts/vps-functions.in| 7 ++ src/lib/Makefile.am | 3 + src/lib/chown_preload.c | 93 ++ src/lib/config.c| 21 src/lib/create.c| 37 -- src/lib/dist.c | 10 +- src/lib/env.c | 33 +++-- src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 293 ++-- src/lib/hooks_vz.c | 2 +- src/vzctl-actions.c | 4 +- src/vzctl.c | 1 + vzctl.spec | 2 +- 23 files changed, 577 insertions(+), 33 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh create mode 100644 src/lib/chown_preload.c -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v3 1/9] host uid and gid parameters
From: Glauber Costa glom...@parallels.com When running with an upstream Linux kernel that supports user namespaces, we will run the container using an unprivileged user in the system. That can be any user, and it serves as base to a 1:1 mapping between users in the container and users in the host. By default, the value 10 will be used for both uid and gid. Signed-off-by: Glauber Costa glom...@parallels.com --- etc/vz.conf.in| 4 include/res.h | 2 ++ include/vzctl_param.h | 3 +++ src/lib/config.c | 21 + 4 files changed, 30 insertions(+) diff --git a/etc/vz.conf.in b/etc/vz.conf.in index 07da9c7..de240ac 100644 --- a/etc/vz.conf.in +++ b/etc/vz.conf.in @@ -39,6 +39,10 @@ DEF_OSTEMPLATE=centos-5 ## Filesystem layout for new CTs: either simfs (default) or ploop #VE_LAYOUT=ploop +# User namespace configuration +LOCAL_UID=10 +LOCAL_GID=10 + ## Load vzwdog module VZWDOG=no diff --git a/include/res.h b/include/res.h index 7060aef..0dfacf7 100644 --- a/include/res.h +++ b/include/res.h @@ -56,6 +56,8 @@ typedef struct { int onboot; unsigned long *bootorder; int wait; + unsigned long *local_uid; + unsigned long *local_gid; } misc_param; struct mod_action; diff --git a/include/vzctl_param.h b/include/vzctl_param.h index 99eb655..cb5ef70 100644 --- a/include/vzctl_param.h +++ b/include/vzctl_param.h @@ -150,5 +150,8 @@ #define PARAM_MOUNT_OPTS 396 +#define PARAM_LOCAL_UID397 +#define PARAM_LOCAL_GID398 + #define PARAM_LINE e:p:f:t:i:l:k:a:b:n:x:h #endif diff --git a/src/lib/config.c b/src/lib/config.c index 617df6a..cb0bfe0 100644 --- a/src/lib/config.c +++ b/src/lib/config.c @@ -128,6 +128,10 @@ static vps_config config[] = { {CONFIGFILE, NULL, PARAM_CONFIG}, {ORIGIN_SAMPLE,NULL,PARAM_CONFIG_SAMPLE}, {DISABLED, NULL, PARAM_DISABLED}, +#ifdef HAVE_UPSTREAM +{LOCAL_UID, NULL, PARAM_LOCAL_UID}, +{LOCAL_GID, NULL, PARAM_LOCAL_GID}, +#endif /* quota */ {DISK_QUOTA, NULL, PARAM_DISK_QUOTA}, {DISKSPACE, NULL, PARAM_DISKSPACE}, @@ -1364,6 +1368,12 @@ static int store_misc(vps_param *old_p, vps_param *vps_p, vps_config *conf, ret = conf_store_strlist(conf_h, conf-name, misc-searchdomain, 0); break; + case PARAM_LOCAL_UID: + ret = conf_store_ulong(conf_h, conf-name, misc-local_uid); + break; + case PARAM_LOCAL_GID: + ret = conf_store_ulong(conf_h, conf-name, misc-local_gid); + break; } return ret; } @@ -1989,6 +1999,13 @@ static int parse(envid_t veid, vps_param *vps_p, char *val, int id) case PARAM_IPTABLES: ret = parse_iptables(vps_p-res.env, val); break; + + case PARAM_LOCAL_UID: + conf_parse_ulong(vps_p-res.misc.local_uid, val); + break; + case PARAM_LOCAL_GID: + conf_parse_ulong(vps_p-res.misc.local_gid, val); + break; case PARAM_LOCKEDPAGES: case PARAM_PRIVVMPAGES: case PARAM_SHMPAGES: @@ -2694,6 +2711,8 @@ static void free_misc(misc_param *misc) FREE_P(misc-hostname) FREE_P(misc-description) FREE_P(misc-bootorder) + FREE_P(misc-local_uid) + FREE_P(misc-local_gid) } static void free_net(net_param *net) @@ -2862,6 +2881,8 @@ static void merge_misc(misc_param *dst, misc_param *src) MERGE_STR(hostname) MERGE_STR(description) MERGE_INT(onboot) + MERGE_P(local_uid) + MERGE_P(local_gid) MERGE_P(bootorder) MERGE_INT(wait) } -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v3 2/9] adjust fs_create parameter
From: Glauber Costa glom...@parallels.com We need to pass more information to fs_create. Instead of adding arguments, it is preferred to pass the whole vps_p structure and unfold it inside the callee. Signed-off-by: Glauber Costa glom...@parallels.com --- src/lib/create.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/src/lib/create.c b/src/lib/create.c index 9a0d43e..2fd9314 100644 --- a/src/lib/create.c +++ b/src/lib/create.c @@ -103,8 +103,7 @@ static void cleanup_destroy_ve(void *data) vps_destroy_dir(d-veid, d-private); } -static int fs_create(envid_t veid, fs_param *fs, tmpl_param *tmpl, - dq_param *dq, int layout, int ploop_mode) +static int fs_create(envid_t veid, vps_handler *h, vps_param *vps_p) { char tarball[PATH_LEN]; char tmp_dir[PATH_LEN]; @@ -117,6 +116,10 @@ static int fs_create(envid_t veid, fs_param *fs, tmpl_param *tmpl, char *dst; const char *ext[] = { , .gz, .bz2, .xz, NULL }; const char *errmsg_ext = [.gz|.bz2|.xz]; + dq_param *dq = vps_p-res.dq; + int layout = vps_p-opt.layout; + fs_param *fs = vps_p-res.fs; + tmpl_param *tmpl = vps_p-res.tmpl; int ploop = (layout == VE_LAYOUT_PLOOP); struct destroy_ve ddata; struct vzctl_cleanup_handler *ch; @@ -173,6 +176,7 @@ find: /* Create and mount ploop image */ struct vzctl_create_image_param param = {}; struct vzctl_mount_param mount_param = {}; + int ploop_mode = vps_p-opt.mode; if (ploop_mode 0) ploop_mode = PLOOP_EXPANDED_MODE; @@ -388,9 +392,7 @@ int vps_create(vps_handler *h, envid_t veid, vps_param *vps_p, vps_param *cmd_p, tmpl-ostmpl = full_ostmpl; } } - ret = fs_create(veid, fs, tmpl, vps_p-res.dq, - vps_p-opt.layout, - vps_p-opt.mode); + ret = fs_create(veid, h, vps_p); if (ret) goto err_root; } -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v3 3/9] pass parameters to open
Upstream containers running on kernels that support user namespaces would benefit from being able to have early access to the container configuration. This is because we would like user namespaces support to be either enabled or disabled, and a lot of the actions we take can be potentially affected by this configuration switch. We do that by bundling it into vz_open, where we traditionally are setting all the namespace related flags (based solely on the presence of it in the system). The open handler in hooks_vz technically does not need to be modified. But I am doing it out of consistency. Signed-off-by: Glauber Costa glom...@openvz.org --- include/env.h | 6 +++--- src/lib/env.c | 7 --- src/lib/hooks_ct.c | 2 +- src/lib/hooks_vz.c | 2 +- src/vzctl-actions.c | 2 +- 5 files changed, 10 insertions(+), 9 deletions(-) diff --git a/include/env.h b/include/env.h index 1628bbf..06729f6 100644 --- a/include/env.h +++ b/include/env.h @@ -50,7 +50,7 @@ enum { * @param veid CT ID. * @return handler or NULL on error. */ -vps_handler *vz_open(envid_t veid); +vps_handler *vz_open(envid_t veid, vps_param *param); /** Close CT handler. * @@ -125,6 +125,6 @@ int exec_container_init(struct arg_start *arg, struct env_create_param3 *create_param); int set_personality32(); -int vz_do_open(vps_handler *h); -int ct_do_open(vps_handler *h); +int vz_do_open(vps_handler *h, vps_param *param); +int ct_do_open(vps_handler *h, vps_param *param); #endif diff --git a/src/lib/env.c b/src/lib/env.c index 7e9a923..898bfc8 100644 --- a/src/lib/env.c +++ b/src/lib/env.c @@ -72,9 +72,10 @@ static int reset_std() /** Allocate and initialize CT handler. * * @param veid CT ID. + * @param paramCT parameters. * @return handler or NULL on error. */ -vps_handler *vz_open(envid_t veid) +vps_handler *vz_open(envid_t veid, vps_param *param) { vps_handler *h = NULL; int ret = -1; @@ -88,7 +89,7 @@ vps_handler *vz_open(envid_t veid) #ifdef VZ_KERNEL_SUPPORTED /* Find out if we are under OpenVZ or upstream kernel */ if (stat_file(/proc/vz)) - ret = vz_do_open(h); + ret = vz_do_open(h, param); else #endif { @@ -96,7 +97,7 @@ vps_handler *vz_open(envid_t veid) non-OpenVZ kernel); h-vzfd = -1; #ifdef HAVE_UPSTREAM - ret = ct_do_open(h); + ret = ct_do_open(h, param); #else logger(-1, 0, No suitable kernel support compiled in); #endif diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index 3cfb6e9..f24d91e 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -572,7 +572,7 @@ static int ct_setcontext(envid_t veid) return 0; } -int ct_do_open(vps_handler *h) +int ct_do_open(vps_handler *h, vps_param *param) { int ret; char path[STR_SIZE]; diff --git a/src/lib/hooks_vz.c b/src/lib/hooks_vz.c index afbcd0f..50770c0 100644 --- a/src/lib/hooks_vz.c +++ b/src/lib/hooks_vz.c @@ -466,7 +466,7 @@ static int vz_veth_ctl(vps_handler *h, envid_t veid, int op, veth_dev *dev) return ret; } -int vz_do_open(vps_handler *h) +int vz_do_open(vps_handler *h, vps_param *param) { if ((h-vzfd = open(VZCTLDEV, O_RDWR)) 0) { logger(-1, errno, Unable to open %s, VZCTLDEV); diff --git a/src/vzctl-actions.c b/src/vzctl-actions.c index e0cd261..4627043 100644 --- a/src/vzctl-actions.c +++ b/src/vzctl-actions.c @@ -1471,7 +1471,7 @@ int run_action(envid_t veid, act_t action, vps_param *g_p, vps_param *vps_p, int ret = 0, lock_id = -1; struct sigaction act; - if ((h = vz_open(veid)) == NULL) { + if ((h = vz_open(veid, g_p)) == NULL) { /* Accept to run set --save --force on any kernel, * otherwise error out if initialization failed */ -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v3 4/9] user namespace support for upstream containers
From: Glauber Costa glom...@parallels.com This patch allows the execution of unprivileged containers running ontop of an upstream Linux Kernel. We will run at whatever UID is found in the configuration file (so far empty, thus disabled). Signed-off-by: Glauber Costa glom...@parallels.com --- include/env.h | 1 + include/types.h| 1 + src/lib/hooks_ct.c | 216 - 3 files changed, 214 insertions(+), 4 deletions(-) diff --git a/include/env.h b/include/env.h index 06729f6..4fef438 100644 --- a/include/env.h +++ b/include/env.h @@ -116,6 +116,7 @@ struct arg_start { vps_handler *h; void *data; env_create_FN fn; + int userns_p; /* while running in userns, there's extra sync needed */ }; struct env_create_param3; diff --git a/include/types.h b/include/types.h index a4bce73..fa511aa 100644 --- a/include/types.h +++ b/include/types.h @@ -95,6 +95,7 @@ typedef struct vps_handler { int vzfd; /** /dev/vzctl file descriptor. */ int stdfd; int can_join_pidns; /* can't enter otherwise */ + int can_join_userns; /* can't run non privileged otherwise */ int (*is_run)(struct vps_handler *h, envid_t veid); int (*enter)(struct vps_handler *h, envid_t veid, const char *root, int flags); int (*destroy)(struct vps_handler *h, envid_t veid); diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index f24d91e..b0cb359 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -66,6 +66,8 @@ static int sys_setns(int fd, int nstype) # define MS_PRIVATE (1 18) #endif +#define UID_GID_RANGE 10 /* how many users per container */ + /* This function is there in GLIBC, but not in headers */ extern int pivot_root(const char * new_root, const char * put_old); @@ -138,10 +140,41 @@ static int _env_create(void *data) struct env_create_param3 create_param; int ret; - if ((ret = ct_chroot(arg-res-fs.root))) + if ((arg-userns_p != -1) (read(arg-userns_p, ret, sizeof(ret)) == 0)) + return -1; + + ret = ct_chroot(arg-res-fs.root); + close(arg-userns_p); + /* Probably means chroot failed */ + if (ret) return ret; - if ((ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 1))) + if (arg-h-can_join_userns) { + int fd; + setuid(0); + setgid(0); + /* +* We need the special flag newinstance. This is a requirement +* of the userns-aware implementation of devpts as of Linux 3.9. +* Because of that special requirement, we do it here rather than +* later. +*/ + mount(devpts, /dev/pts, devpts, 0, newinstance); + /* /dev/ptmx, if it even exists, would refer to the root ptmx. +* We don't want that, we want our newly created instance to contain +* all ptys. So we bind mount the root device here +*/ + fd = open(/dev/ptmx, O_RDWR|O_CREAT, 0); + close(fd); + mount(/dev/pts/ptmx, /dev/ptmx, , MS_BIND, 0); + } + + /* +* If we are using the user namespace, we will have the full capability +* set in the target namespace. So we don't need any of that. +*/ + if (!arg-h-can_join_userns + (ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 1))) return ret; fill_container_param(arg, create_param); @@ -224,6 +257,88 @@ static int ct_setlimits(vps_handler *h, envid_t veid, struct ub_struct *ub) } #undef add_value +static int write_uid_gid_mapping(vps_handler *h, unsigned long uid, unsigned long gid, pid_t pid) +{ + char buf[STR_SIZE]; + char map[STR_SIZE]; + int fd; + int len; + int ret; + + len = snprintf(map, sizeof(map), 0 %ld %d, uid, UID_GID_RANGE); + snprintf(buf, sizeof(buf), /proc/%d/uid_map, pid); + if ((fd = open(buf, O_WRONLY)) 0) + goto out; + + if ((write(fd, map, len) 0)) + goto out; + + close(fd); + + len = snprintf(map, sizeof(map), 0 %ld %d, gid, UID_GID_RANGE); + snprintf(buf, sizeof(map), /proc/%d/gid_map, pid); + if ((fd = open(buf, O_WRONLY)) 0) + goto out; + + if ((write(fd, map, len) 0)) + goto out; + ret = 0; +out: + if (fd 0) + close(fd); + return ret; +} + +/* + * Those devices should exist in the container, and be valid device nodes with + * user access permission. But we need to be absolutely sure this is the case, + * so we will provide our own versions. That could actually happen since some + * distributions may come with emptied /dev's, waiting for udev to populate them. + * That won't happen, we do it ourselves. + */ +static void create_devices
[Devel] [PATCH v3 5/9] add user mismatch test
From: Glauber Costa glom...@parallels.com In theory, we won't be able to run if our private area is not owned by ourselves. We could, if it have very wide open security permissions, but we should never set up a container like that. Aside from a basic sanity check, this is intended to catch problems for the few people who may have already created containers that will be owned by root:root, and will now try to run it unprivileged. Signed-off-by: Glauber Costa glom...@parallels.com --- src/lib/env.c | 16 1 file changed, 16 insertions(+) diff --git a/src/lib/env.c b/src/lib/env.c index 898bfc8..96400c0 100644 --- a/src/lib/env.c +++ b/src/lib/env.c @@ -30,6 +30,7 @@ #include linux/reboot.h #include sys/mount.h #include sys/utsname.h +#include sys/stat.h #include vzerror.h #include res.h @@ -554,6 +555,21 @@ int vps_start_custom(vps_handler *h, envid_t veid, vps_param *param, logger(-1, 0, Container is already running); return VZ_VE_RUNNING; } + if (!is_vz_kernel(h) h-can_join_userns) { + struct stat private_stat; + unsigned long *local_uid = res-misc.local_uid; + unsigned long *local_gid = res-misc.local_gid; + + stat(res-fs.private, private_stat); + if ((local_uid (private_stat.st_uid != *local_uid)) || + (local_gid (private_stat.st_gid != *local_gid))) { + logger(-1, 0, Container private area is owned by %d:%d + , but configuration file says we should run with %lu:%lu.\n + Refusing to run., private_stat.st_uid, private_stat.st_gid, + *res-misc.local_uid, *res-misc.local_gid); + return VZ_FS_BAD_TMPL; + } + } if ((ret = check_ub(h, res-ub))) return ret; -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v3 7/9] modify tar extraction to account for user namespace
From: Glauber Costa glom...@parallels.com If we are running upstream with user namespaces, we need to create the container filesystem not with the ownership preserved, but reflecting the mapping we need to apply. Note that according to our documentation, we should ignore this if the user explicitly requested an uid mapping of 0 (gid is ignored in this case). Our tooling doesn't allow any easy way to unpack a whol distribution with offsets mechanically applied like this. We could do the whole unpacking in a user namespace itself, but that does not come without problems on its own (for instance, we won't be able to create any device files, we have to carefully adjust permissions in the root directory, etc) To work around that, we can employ a trick to allow container creation right now, as well as to avoid compatibility problems: we will resort to LD_PRELOAD to load a schim that captures calls to the chown family of system calls and applies the offset manually. Signed-off-by: Glauber Costa glom...@parallels.com --- include/res.h | 6 scripts/vps-create.in | 14 src/lib/Makefile.am | 3 ++ src/lib/chown_preload.c | 93 + src/lib/create.c| 25 ++--- vzctl.spec | 2 +- 6 files changed, 137 insertions(+), 6 deletions(-) create mode 100644 src/lib/chown_preload.c diff --git a/include/res.h b/include/res.h index 0dfacf7..047593a 100644 --- a/include/res.h +++ b/include/res.h @@ -47,6 +47,12 @@ struct env_param { }; typedef struct env_param env_param_t; +/* + * When running upstream kernels, we will need host-side UID and GID. Those + * are configurable, but if none is specified, those are used. + */ +#define VZ_DEFAULT_UID 10 +#define VZ_DEFAULT_GID 10 typedef struct { list_head_t userpw; list_head_t nameserver; diff --git a/scripts/vps-create.in b/scripts/vps-create.in index 126f048..a1b3382 100755 --- a/scripts/vps-create.in +++ b/scripts/vps-create.in @@ -22,11 +22,22 @@ # Required parameters: # VE_PRVT- path to root of CT private areas # PRIVATE_TEMPLATE - path to private template used as a source for copying +# +# Optional parameters: +# UID_OFFSET - offset to be added to all tar UIDs +# GID_OFFSET - offset to be added to all tar GIDs . @SCRIPTDIR@/vps-functions vzcheckvar VE_PRVT PRIVATE_TEMPLATE +chown_preload_if_needed() +{ + [ -z $UID_OFFSET -o -z $GID_OFFSET ] return + + export LD_PRELOAD=libvzchown.so +} + create_prvt() { local TMP AVAIL NEEDED HEADER OPT @@ -75,6 +86,9 @@ create_prvt() [ $AVAIL -ge $NEEDED ] || vzerror Insufficient disk space in $VE_PRVT; available: $AVAIL, needed: $NEEDED ${VZ_FS_NO_DISK_SPACE} CAT=cat + + chown_preload_if_needed + # Use pv to show nice progress bar if we can pv -V /dev/null 21 CAT=pv chmod 700 $VE_PRVT diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am index b009d15..34c463b 100644 --- a/src/lib/Makefile.am +++ b/src/lib/Makefile.am @@ -72,6 +72,9 @@ libvzctl_la_LIBADD = $(XML_LIBS) $(CGROUP_LIBS) $(DL_LIBS) if HAVE_CGROUP libvzctl_la_SOURCES += cgroup.c hooks_ct.c + +lib_LTLIBRARIES += libvzchown.la +libvzchown_la_SOURCES = chown_preload.c endif if HAVE_VZ_KERNEL diff --git a/src/lib/chown_preload.c b/src/lib/chown_preload.c new file mode 100644 index 000..4e2be4a --- /dev/null +++ b/src/lib/chown_preload.c @@ -0,0 +1,93 @@ +/* + * Copyright (C) 2013, Parallels, Inc. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Authors: Kir Kolyshkin and Glauber Costa + */ + +#include unistd.h +#include stdio.h +#include stdlib.h + +#include dlfcn.h + +static int (*real_chown)(const char *path, uid_t owner, gid_t group) = NULL; +static int (*real_fchown)(int fd, uid_t owner, gid_t group) = NULL; +static int (*real_lchown)(const char *path, uid_t owner, gid_t group) = NULL; +static int (*real_fchownat)(int dirfd, const char *pathname, + uid_t owner, gid_t group, int flags) = NULL; + +uid_t uid_offset = 0; +gid_t gid_offset = 0; + +static void __init(void) +{ + char *uidstr, *gidstr; + + uidstr = getenv(UID_OFFSET); + gidstr
[Devel] [PATCH v3 8/9] automatically add bridge venet0 when needed
From: Glauber Costa glom...@parallels.com The chosen architecture to deal with --ipadd with upstream containers is to create a veth pair and add the host side information to a bridge called venet0. This way, all the code that expects venet0 to exist can still work without modifications, (or with just a few). Our intention to do that was actually already stated in the comments, but the code was removed before merging because --ipadd would not work without full unshare support anyway. This patch implements that. Signed-off-by: Glauber Costa glom...@parallels.com --- scripts/vps-functions.in | 7 +++ src/lib/hooks_ct.c | 37 +++-- 2 files changed, 42 insertions(+), 2 deletions(-) diff --git a/scripts/vps-functions.in b/scripts/vps-functions.in index 826c0a1..37b2de5 100755 --- a/scripts/vps-functions.in +++ b/scripts/vps-functions.in @@ -170,6 +170,13 @@ vzadjustmacs() # other setups, the bridge is expected to already exist and be valid. vzconfbridge() { + if [ x$BRIDGE == xvenet0 ]; then + if [ $(brctl show venet0 2/dev/null | tail -n+2 | wc -l) == 0 ]; then + brctl addbr venet0 + ${IP_CMD} link set venet0 up + fi + fi + if [ x$BRIDGE != x ]; then brctl addif $BRIDGE $HNAME /dev/null 21 fi diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index a791934..2d195a5 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -17,6 +17,7 @@ #include logger.h #include script.h #include cgroup.h +#include linux/vzctl_venet.h #define NETNS_RUN_DIR /var/run/netns @@ -695,8 +696,40 @@ static int ct_netdev_ctl(vps_handler *h, envid_t veid, int op, char *name) static int ct_ip_ctl(vps_handler *h, envid_t veid, int op, const char *ipstr) { - logger(-1, 0, %s not yet supported upstream, __func__); - return 0; + int ret = -1; + char *envp[5]; + char buf[STR_SIZE]; + int i = 0; + + if (!h-can_join_pidns) { + logger(-1, 0, Cannot join pid namespace: + --ipadd is not supported in kernels without full pidns support); + return VZ_BAD_KERNEL; + } + envp[i++] = strdup(VNAME=venet0); + envp[i++] = strdup(BRIDGE=venet0); + + snprintf(buf, sizeof(buf), HNAME=venet0.%d, veid); + envp[i++] = strdup(buf); + + snprintf(buf, sizeof(buf), VEID=%d, veid); + envp[i++] = strdup(buf); + + envp[i] = NULL; + + if (op == VE_IP_ADD) { + char *argv[] = { VPS_NETNS_DEV_ADD, NULL }; + + ret = run_script(VPS_NETNS_DEV_ADD, argv, envp, 0); + } else { + char *argv[] = { VPS_NETNS_DEV_DEL, NULL }; + + ret = run_script(VPS_NETNS_DEV_DEL, argv, envp, 0); + } + free_arg(envp); + + return ret; + } /* -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v3 9/9] allow for distro-specific fix ups at creation time.
From: Glauber Costa glom...@parallels.com We will need that infrastucture when running with Linux upstream, since some support is very unlikely to ever land in the Kernel. We need to do things like account for the fact that udev may kick in and destroy all the setup we have done for /dev. Since preemptively preventing those actions to happen is very hard ( as an example, centos6 init binary will issue mount syscalls itself), the most robust approach is to insert a script that will be run shortly after rc.sysinit. Although this can't guarantee that it will run *immediately* after, it should be early enough to undo every harmful operation done by /sbin/init and rc.sysinit, therefore allowing operation to continue freely. Signed-off-by: Glauber Costa glom...@parallels.com --- etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 +++ include/dist.h | 2 ++ include/env.h | 3 ++- src/lib/dist.c | 10 +- src/lib/env.c | 10 ++ src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 38 ++ 8 files changed, 102 insertions(+), 7 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf index 727461d..49226c5 100644 --- a/etc/dists/redhat.conf +++ b/etc/dists/redhat.conf @@ -25,3 +25,4 @@ SET_DNS=set_dns.sh SET_USERPASS=set_userpass.sh SET_UGID_QUOTA=set_ugid_quota.sh POST_CREATE=postcreate.sh +CT_FIXUP=fixups.sh diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh new file mode 100755 index 000..da3fd24 --- /dev/null +++ b/etc/dists/scripts/fixups.sh @@ -0,0 +1,43 @@ +#!/bin/bash +# Copyright (C) 2000-2009, Parallels, Inc. All rights reserved. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +# +# +# Run-time fixup for legacy distributions that will rely on udevd to populate +# their /dev. Those distributions will effectively mount an empty tmpfs on top +# of /dev, defeating any setup we may have done. This includes RHEL-based +# distributions up to RHEL6. We are better off not allowing that to run at all + +function fixup_udev() +{ + if [ -f /etc/fedora-release ]; then + return; + fi + + # All the first two may fail if the distribution didn't actually mount + # then. + umount /dev/pts + umount /dev/shm + # still may hold references to files open by rc.sysinit. Even if this is the + # case those will be simple things like /dev/null and should go away shortly. + # let us not fail because of that. + umount /dev -l +} + +fixup_udev + +exit 0 + diff --git a/include/dist.h b/include/dist.h index 1f6a5a6..4faa452 100644 --- a/include/dist.h +++ b/include/dist.h @@ -29,6 +29,7 @@ #defineSET_USERPASS5 #defineSET_UGID_QUOTA 6 #definePOST_CREATE 7 +#defineCT_FIXUP8 typedef struct { char *def_ostmpl; @@ -46,6 +47,7 @@ typedef struct dist_actions { char *set_userpass; /** setup user password. */ char *set_ugid_quota; /** setup 2level quota. */ char *post_create; /** sostcreate actions. */ + char *ct_fixup; /** run-time fixups. */ } dist_actions; /* Read distribution specific actions configuration file. diff --git a/include/env.h b/include/env.h index 4fef438..b10dfd9 100644 --- a/include/env.h +++ b/include/env.h @@ -104,7 +104,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param *param, int stop_mode, int vz_chroot(const char *root); int vz_env_create(vps_handler *h, envid_t veid, vps_res *res, int wait_p[2], int old_wait_p[2], int err_p[2], - env_create_FN fn, void *data); + env_create_FN fn, dist_actions *actions, void *data); struct vps_res; struct arg_start { @@ -116,6 +116,7 @@ struct arg_start { vps_handler *h; void *data; env_create_FN fn; + dist_actions *actions; int userns_p; /* while running in userns, there's extra sync needed */ }; diff --git a/src/lib/dist.c b/src/lib/dist.c index bde94ab..8047cf7 100644 --- a/src/lib/dist.c +++ b/src/lib/dist.c @@ -36,7 +36,8
Re: [Devel] [PATCH] cgroups: fix set command with beancounters upstream
On 04/30/2013 01:48 PM, Kir Kolyshkin wrote: On 04/29/2013 10:12 PM, Glauber Costa wrote: The kernel memory controller cannot flip states from unlimited to limited if there are already tasks in it. Therefore, we always have to run with *some* value of kmem enabled. If we don't do it, we can't start unlimited and then use the set command to set any beancounters. We write the maximum amount minus two pages, which should effectively mean accounting turned on, but unlimited Alternatively, we can require CT restart for the change to take effect. We do that for a few options, see check_set_mode(). But this is probably OK too. Being all the same to you, I prefer this way. This is not really about container running, but rather cgroups have tasks. Doing it on restart would still mandate us to always be careful about when to set up cgroup limits and always do that before we put any tasks into it. Of course we can do that, but not having that limitation makes us more robust IMHO. We also already have a lot of non-limit related stuff that we have to do during creation anyway (like setting up cpusets and memory hierarchy), so it just fits together nicely ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v2 2/8] adjust fs_create parameter
On 04/01/2013 07:44 PM, Kir Kolyshkin wrote: On 04/01/2013 08:37 AM, Glauber Costa wrote: On 04/01/2013 07:13 PM, Dmitry Guryanov wrote: On 130401 19:04:37, Glauber Costa wrote: On 04/01/2013 06:58 PM, Dmitry Guryanov wrote: On 130322 14:48:16, Glauber Costa wrote: We need to pass more information to fs_create. Instead of adding arguments, it is preferred to pass the whole vps_p structure and unfold it inside the callee. Can't apply this patch :( This patch is rather old. Isn't it already present in Kir's ? Kir's tree certainly have all the lastest upstream linux work, so it should either have it, or a modified version of it. No, it isn't. Latest commit there is two weeks ago. Could you please tell, where i can find up-to-date vzctl repository ? I am checking here, and this patch never made its way to the repository. We settled in a slightly different solution for it. Why do you need to apply it ? I guess Dmitry wants to test userns support... which is not in yet. Yes, but it has nothing to do with that patch. Dmitry, maybe you are looking at the wrong series ? ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 0/8] upstream Linux support for userns
Kir, Please review the following patchset. It implements support for userns inside containers. Running a container works, and vzctl enter works as well. There are still some caveats that I intend to tackle in the upcoming weeks. Most pressing, is the fact that although this patchset finally implements --ipadd (now all infrastructure is in place), we can ssh into containers due to issues related to the proc filesystem. Let me know if there are any issues, I'll happily fix them. Glauber Costa (8): host uid and gid parameters adjust fs_create parameter user namespace support for upstream containers modify tar extraction to account for user namespace add user mismatch test allow local uid and gid to be specified at container creation automatically add bridge venet0 when needed allow for distro-specific fix ups at creation time. etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 +++ include/dist.h | 2 + include/env.h | 4 +- include/res.h | 8 ++ include/types.h | 1 + include/vzctl_param.h | 3 + man/vzctl.8.in | 14 +++ scripts/vps-create.in | 19 scripts/vps-functions.in| 7 ++ src/lib/Makefile.am | 3 + src/lib/chown_preload.c | 93 +++ src/lib/config.c| 32 ++ src/lib/create.c| 30 +++-- src/lib/dist.c | 10 +- src/lib/env.c | 23 +++- src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 269 ++-- src/vzctl-actions.c | 2 + src/vzctl.c | 1 + vzctl.spec | 2 +- 21 files changed, 544 insertions(+), 25 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh create mode 100644 src/lib/chown_preload.c -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 1/8] host uid and gid parameters
When running with an upstream Linux kernel that supports user namespaces, we will run the container using an unprivileged user in the system. That can be any user, and it serves as base to a 1:1 mapping between users in the container and users in the host. By default, the value 10 will be used for both uid and gid. Signed-off-by: Glauber Costa glom...@parallels.com --- include/res.h | 8 include/vzctl_param.h | 3 +++ src/lib/config.c | 32 3 files changed, 43 insertions(+) diff --git a/include/res.h b/include/res.h index e07060d..47c36da 100644 --- a/include/res.h +++ b/include/res.h @@ -47,6 +47,12 @@ struct env_param { }; typedef struct env_param env_param_t; +/* + * When running upstream kernels, we will need host-side UID and GID. Those + * are configurable, but if none is specified, those are used. + */ +#define VZ_DEFAULT_UID 10 +#define VZ_DEFAULT_GID 10 typedef struct { list_head_t userpw; list_head_t nameserver; @@ -56,6 +62,8 @@ typedef struct { int onboot; unsigned long *bootorder; int wait; + unsigned long *local_uid; + unsigned long *local_gid; } misc_param; struct mod_action; diff --git a/include/vzctl_param.h b/include/vzctl_param.h index 3034a13..e583540 100644 --- a/include/vzctl_param.h +++ b/include/vzctl_param.h @@ -149,5 +149,8 @@ #define PARAM_MOUNT_OPTS 396 +#define PARAM_LOCAL_UID397 +#define PARAM_LOCAL_GID398 + #define PARAM_LINE e:p:f:t:i:l:k:a:b:n:x:h #endif diff --git a/src/lib/config.c b/src/lib/config.c index cf5f352..cc4c1cc 100644 --- a/src/lib/config.c +++ b/src/lib/config.c @@ -128,6 +128,10 @@ static vps_config config[] = { {CONFIGFILE, NULL, PARAM_CONFIG}, {ORIGIN_SAMPLE,NULL,PARAM_CONFIG_SAMPLE}, {DISABLED, NULL, PARAM_DISABLED}, +#ifdef HAVE_UPSTREAM +{LOCAL_UID, NULL, PARAM_LOCAL_UID}, +{LOCAL_GID, NULL, PARAM_LOCAL_GID}, +#endif /* quota */ {DISK_QUOTA, NULL, PARAM_DISK_QUOTA}, {DISKSPACE, NULL, PARAM_DISKSPACE}, @@ -1363,6 +1367,12 @@ static int store_misc(vps_param *old_p, vps_param *vps_p, vps_config *conf, ret = conf_store_strlist(conf_h, conf-name, misc-searchdomain, 0); break; + case PARAM_LOCAL_UID: + ret = conf_store_ulong(conf_h, conf-name, misc-local_uid); + break; + case PARAM_LOCAL_GID: + ret = conf_store_ulong(conf_h, conf-name, misc-local_gid); + break; } return ret; } @@ -1996,6 +2006,13 @@ static int parse(envid_t veid, vps_param *vps_p, char *val, int id) case PARAM_IPTABLES: ret = parse_iptables(vps_p-res.env, val); break; + + case PARAM_LOCAL_UID: + conf_parse_ulong(vps_p-res.misc.local_uid, val); + break; + case PARAM_LOCAL_GID: + conf_parse_ulong(vps_p-res.misc.local_gid, val); + break; case PARAM_LOCKEDPAGES: case PARAM_PRIVVMPAGES: case PARAM_SHMPAGES: @@ -2697,6 +2714,8 @@ static void free_misc(misc_param *misc) FREE_P(misc-hostname) FREE_P(misc-description) FREE_P(misc-bootorder) + FREE_P(misc-local_uid) + FREE_P(misc-local_gid) } static void free_net(net_param *net) @@ -2789,6 +2808,17 @@ void free_vps_param(vps_param *param) *dst-x = *src-x; \ } +#define MERGE_P_DEFAULT(x, dfl) \ + if ((src-x) != NULL) { \ + if ((dst-x) == NULL) \ + dst-x = malloc(sizeof(*(dst-x))); \ + *dst-x = *src-x; \ + } \ + if ((dst-x) == NULL) { \ + dst-x = malloc(sizeof(*(dst-x))); \ + *dst-x = (unsigned long)(dfl); \ + } + #define MERGE_P2(x)\ if ((src-x) != NULL) { \ if ((dst-x) == NULL) \ @@ -2865,6 +2895,8 @@ static void merge_misc(misc_param *dst, misc_param *src) MERGE_STR(hostname) MERGE_STR(description) MERGE_INT(onboot) + MERGE_P_DEFAULT(local_uid, VZ_DEFAULT_UID) + MERGE_P_DEFAULT(local_gid, VZ_DEFAULT_GID) MERGE_P(bootorder) MERGE_INT(wait) } -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 3/8] user namespace support for upstream containers
This patch allows the execution of unprivileged containers running ontop of an upstream Linux Kernel. We will run at whatever UID is found in the configuration file. Signed-off-by: Glauber Costa glom...@parallels.com --- include/env.h | 1 + include/types.h| 1 + src/lib/hooks_ct.c | 194 +++-- 3 files changed, 191 insertions(+), 5 deletions(-) diff --git a/include/env.h b/include/env.h index 1628bbf..d41df2e 100644 --- a/include/env.h +++ b/include/env.h @@ -116,6 +116,7 @@ struct arg_start { vps_handler *h; void *data; env_create_FN fn; + int userns_p; /* while running in userns, there's extra sync needed */ }; struct env_create_param3; diff --git a/include/types.h b/include/types.h index ceecb93..54eb1f4 100644 --- a/include/types.h +++ b/include/types.h @@ -95,6 +95,7 @@ typedef struct vps_handler { int vzfd; /** /dev/vzctl file descriptor. */ int stdfd; int can_join_pidns; /* can't enter otherwise */ + int can_join_userns; /* can't run non privileged otherwise */ int (*is_run)(struct vps_handler *h, envid_t veid); int (*enter)(struct vps_handler *h, envid_t veid, const char *root, int flags); int (*destroy)(struct vps_handler *h, envid_t veid); diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index 29d7eea..6bd27c1 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -66,6 +66,8 @@ static int sys_setns(int fd, int nstype) # define MS_PRIVATE (1 18) #endif +#define UID_GID_RANGE 10 /* how many users per container */ + /* This function is there in GLIBC, but not in headers */ extern int pivot_root(const char * new_root, const char * put_old); @@ -138,10 +140,39 @@ static int _env_create(void *data) struct env_create_param3 create_param; int ret; - if ((ret = ct_chroot(arg-res-fs.root))) + if ((arg-userns_p != -1) (read(arg-userns_p, ret, sizeof(ret)) == 0)) + return -1; + + ret = ct_chroot(arg-res-fs.root); + close(arg-userns_p); + /* Probably means chroot failed */ + if (ret) return ret; - if ((ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 1))) + if (arg-h-can_join_userns) { + setuid(0); + setgid(0); + /* +* We need the special flag newinstance. This is a requirement +* of the userns-aware implementation of devpts as of Linux 3.9. +* Because of that special requirement, we do it here rather than +* later. +*/ + mount(devpts, /dev/pts, devpts, 0, newinstance); + /* /dev/ptmx, if it even exists, would refer to the root ptmx. +* We don't want that, we want our newly created instance to contain +* all ptys. So we bind mount the root device here +*/ + open(/dev/ptmx, O_RDWR|O_CREAT, 0); + mount(/dev/pts/ptmx, /dev/ptmx, , MS_BIND, 0); + } + + /* +* If we are using the user namespace, we will have the full capability +* set in the target namespace. So we don't need any of that. +*/ + if (!arg-h-can_join_userns + (ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 1))) return ret; fill_container_param(arg, create_param); @@ -153,6 +184,79 @@ static int _env_create(void *data) return exec_container_init(arg, create_param); } +static int write_uid_gid_mapping(vps_handler *h, unsigned long uid, unsigned long gid, pid_t pid) +{ + char buf[STR_SIZE]; + char map[STR_SIZE]; + int fd; + + snprintf(map, sizeof(map), 0 %ld %d, uid, UID_GID_RANGE); + snprintf(buf, sizeof(buf), /proc/%d/uid_map, pid); + if ((fd = open(buf, O_WRONLY)) 0) + return -1; + + if ((write(fd, map, sizeof(map)) 0)) + return -1; + + snprintf(map, sizeof(map), 0 %ld %d, gid, UID_GID_RANGE); + snprintf(buf, sizeof(map), /proc/%d/gid_map, pid); + if ((fd = open(buf, O_WRONLY)) 0) + return -1; + + if ((write(fd, map, sizeof(map)) 0)) + return -1; + + return 0; +} + +/* + * Those devices should exist in the container, and be valid device nodes with + * user access permission. But we need to be absolutely sure this is the case, + * so we will provide our own versions. That could actually happen since some + * distributions may come with emptied /dev's, waiting for udev to populate them. + * That won't happen, we do it ourselves. + */ +static void create_devices(vps_handler *h, envid_t veid, const char *root) +{ + unsigned int i; + char *devices[] = { + /dev/null, + /dev/zero, + /dev/random, + /dev/urandom, + }; + + /* +* We
[Devel] [PATCH v2 4/8] modify tar extraction to account for user namespace
If we are running upstream with user namespaces, we need to create the container filesystem not with the ownership preserved, but reflecting the mapping we need to apply. The long term goal is to patch the tar utility to allow us to specify an offset that will be applied during extraction. However, even if tar is modified, the legacy base of old tar utilities is still too big. We can employ a trick to allow container creation right now, as well as to avoid compatibility problems: we will resort to LD_PRELOAD to load a schim that captures calls to the chown family of system calls and applies the offset manually. Signed-off-by: Glauber Costa glom...@parallels.com --- scripts/vps-create.in | 19 ++ src/lib/Makefile.am | 3 ++ src/lib/chown_preload.c | 93 + src/lib/create.c| 21 +++ vzctl.spec | 2 +- 5 files changed, 130 insertions(+), 8 deletions(-) create mode 100644 src/lib/chown_preload.c diff --git a/scripts/vps-create.in b/scripts/vps-create.in index 126f048..53f6643 100755 --- a/scripts/vps-create.in +++ b/scripts/vps-create.in @@ -22,11 +22,27 @@ # Required parameters: # VE_PRVT- path to root of CT private areas # PRIVATE_TEMPLATE - path to private template used as a source for copying +# +# Optional parameters: +# UID_OFFSET - offset to be added to all tar UIDs +# GID_OFFSET - offset to be added to all tar GIDs . @SCRIPTDIR@/vps-functions vzcheckvar VE_PRVT PRIVATE_TEMPLATE +chown_preload_if_needed() +{ + [ -z $UID_OFFSET -o -z $GID_OFFSET ] return + + # TODO use tar with appropriate option when it becomes available + # + # The goal is to get this merged into tar so we no longer + # need to do this preload. Whenever it happens, we will include + # a version check here as well + export LD_PRELOAD=libvzchown.so +} + create_prvt() { local TMP AVAIL NEEDED HEADER OPT @@ -75,6 +91,9 @@ create_prvt() [ $AVAIL -ge $NEEDED ] || vzerror Insufficient disk space in $VE_PRVT; available: $AVAIL, needed: $NEEDED ${VZ_FS_NO_DISK_SPACE} CAT=cat + + chown_preload_if_needed + # Use pv to show nice progress bar if we can pv -V /dev/null 21 CAT=pv chmod 700 $VE_PRVT diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am index 42a7ed6..44c8a69 100644 --- a/src/lib/Makefile.am +++ b/src/lib/Makefile.am @@ -70,6 +70,9 @@ libvzctl_la_LIBADD = $(XML_LIBS) $(CGROUP_LIBS) $(DL_LIBS) if HAVE_CGROUP libvzctl_la_SOURCES += cgroup.c hooks_ct.c + +lib_LTLIBRARIES += libvzchown.la +libvzchown_la_SOURCES = chown_preload.c endif if HAVE_VZ_KERNEL diff --git a/src/lib/chown_preload.c b/src/lib/chown_preload.c new file mode 100644 index 000..4e2be4a --- /dev/null +++ b/src/lib/chown_preload.c @@ -0,0 +1,93 @@ +/* + * Copyright (C) 2013, Parallels, Inc. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Authors: Kir Kolyshkin and Glauber Costa + */ + +#include unistd.h +#include stdio.h +#include stdlib.h + +#include dlfcn.h + +static int (*real_chown)(const char *path, uid_t owner, gid_t group) = NULL; +static int (*real_fchown)(int fd, uid_t owner, gid_t group) = NULL; +static int (*real_lchown)(const char *path, uid_t owner, gid_t group) = NULL; +static int (*real_fchownat)(int dirfd, const char *pathname, + uid_t owner, gid_t group, int flags) = NULL; + +uid_t uid_offset = 0; +gid_t gid_offset = 0; + +static void __init(void) +{ + char *uidstr, *gidstr; + + uidstr = getenv(UID_OFFSET); + gidstr = getenv(GID_OFFSET); + if (!uidstr || !gidstr) { + fprintf(stderr, Environment variables UID_OFFSET + and GID_OFFSET are required -- aborting\n); + exit(33); + } + uid_offset = strtol(uidstr, NULL, 10); + gid_offset = strtol(gidstr, NULL, 10); + + real_chown = dlsym(RTLD_NEXT, chown); + real_fchown = dlsym(RTLD_NEXT, fchown); + real_lchown = dlsym(RTLD_NEXT, lchown); + real_fchownat = dlsym(RTLD_NEXT, fchownat); + + if (!real_chown || !real_fchown || !real_lchown || !real_fchownat) { + fprintf(stderr
[Devel] [PATCH v2 5/8] add user mismatch test
In theory, we won't be able to run if our private area is not owned by ourselves. We could, if it have very wide open security permissions, but we should never set up a container like that. Aside from a basic sanity check, this is intended to catch problems for the few people who may have already created containers that will be owned by root:root, and will now try to run it unprivileged. Signed-off-by: Glauber Costa glom...@parallels.com --- src/lib/env.c | 13 + 1 file changed, 13 insertions(+) diff --git a/src/lib/env.c b/src/lib/env.c index 2da848d..ff4dad2 100644 --- a/src/lib/env.c +++ b/src/lib/env.c @@ -30,6 +30,7 @@ #include linux/reboot.h #include sys/mount.h #include sys/utsname.h +#include sys/stat.h #include vzerror.h #include res.h @@ -551,6 +552,18 @@ int vps_start_custom(vps_handler *h, envid_t veid, vps_param *param, logger(-1, 0, Container is already running); return VZ_VE_RUNNING; } + if (!is_vz_kernel(h) h-can_join_userns) { + struct stat private_stat; + stat(res-fs.private, private_stat); + if ((private_stat.st_uid != *res-misc.local_uid) || + (private_stat.st_gid != *res-misc.local_gid)) { + logger(-1, 0, Container private area is owned by %d:%d + , but configuration file says we should run with %lu:%lu.\n + Refusing to run., private_stat.st_uid, private_stat.st_gid, + *res-misc.local_uid, *res-misc.local_gid); + return VZ_FS_BAD_TMPL; + } + } if ((ret = check_ub(h, res-ub))) return ret; -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 6/8] allow local uid and gid to be specified at container creation
It is a valid use case to run a container with host uid and gid different than the default. This patch provides and documents a way to do so. Signed-off-by: Glauber Costa glom...@parallels.com --- man/vzctl.8.in | 14 ++ src/vzctl-actions.c | 2 ++ src/vzctl.c | 1 + 3 files changed, 17 insertions(+) diff --git a/man/vzctl.8.in b/man/vzctl.8.in index 5efd702..07162ea 100644 --- a/man/vzctl.8.in +++ b/man/vzctl.8.in @@ -852,6 +852,8 @@ List of available fields can be obtained using \fB-L\fR option. .OP --ipadd addr .OP --hostname name .OP --name name +.OP --local_uid uid +.OP --local_gid gid .YS .IP 4 Creates a new container area. This operation should be done once, before @@ -903,6 +905,18 @@ a container. Note that this option can be used multiple times. You can use \fB--hostname\fR \fIname\fR option to set a host name for a container. + +When running with an upstream Linux Kernel that supports user namespaces +(= 3.8), the parameters \fB--local_uid\fR and \fB--local_gid\fR can be used to select +which \fIuid\fR and \fIgid\fR respectively will be used as a base user in the +host system. Note that user namespaces provide a 1:1 mapping between container +users and host users. If these options are not specified, the value 10 is +used. + +\fBWarning:\fR use \fB--local_uid\fR and \fB--local_gid\fR with care, specially +when migrating containers. In all situations, the container's files in the +filesystem needs to be correctly owned by the host-side users. + .IP \fBdestroy\fR | \fBdelete\fR \fICTID\fR 4 Removes a container private area by deleting all files, directories and the configuration file of this container. diff --git a/src/vzctl-actions.c b/src/vzctl-actions.c index be22265..63d93aa 100644 --- a/src/vzctl-actions.c +++ b/src/vzctl-actions.c @@ -391,6 +391,8 @@ static int parse_create_opt(envid_t veid, int argc, char **argv, {ve_layout, required_argument, NULL, PARAM_VE_LAYOUT}, {velayout,required_argument, NULL, PARAM_VE_LAYOUT}, {diskspace, required_argument, NULL, PARAM_DISKSPACE}, + {local_uid, required_argument, NULL, PARAM_LOCAL_UID}, + {local_gid, required_argument, NULL, PARAM_LOCAL_GID}, { NULL, 0, NULL, 0 } }; diff --git a/src/vzctl.c b/src/vzctl.c index d9bba7d..a72ab39 100644 --- a/src/vzctl.c +++ b/src/vzctl.c @@ -65,6 +65,7 @@ static void usage(int rc) vzctl create ctid [--ostemplate name] [--config name]\n [--layout ploop|simfs] [--hostname name] [--name name] [--ipadd addr]\n [--diskspace kbytes] [--private path] [--root path]\n + [--local_uid UID] [--local_gid GID]\n vzctl start ctid [--force] [--wait]\n vzctl destroy | mount | umount | stop | restart | status ctid\n #ifdef HAVE_PLOOP -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 7/8] automatically add bridge venet0 when needed
The chosen architecture to deal with --ipadd with upstream containers is to create a veth pair and add the host side information to a bridge called venet0. This way, all the code that expects venet0 to exist can still work without modifications, (or with just a few). Our intention to do that was actually already stated in the comments, but the code was removed before merging because --ipadd would not work without full unshare support anyway. This patch implements that. Signed-off-by: Glauber Costa glom...@parallels.com --- scripts/vps-functions.in | 7 +++ src/lib/hooks_ct.c | 37 +++-- 2 files changed, 42 insertions(+), 2 deletions(-) diff --git a/scripts/vps-functions.in b/scripts/vps-functions.in index 826c0a1..ab05aa0 100755 --- a/scripts/vps-functions.in +++ b/scripts/vps-functions.in @@ -170,6 +170,13 @@ vzadjustmacs() # other setups, the bridge is expected to already exist and be valid. vzconfbridge() { + if [ x$BRIDGE == xvenet0 ]; then + if [ `brctl show venet0 2/dev/null | tail -n+2 | wc -l` == 0 ]; then + brctl addbr venet0 + ${IP_CMD} link set venet0 up + fi + fi + if [ x$BRIDGE != x ]; then brctl addif $BRIDGE $HNAME /dev/null 21 fi diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index 6bd27c1..d5b15dc 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -17,6 +17,7 @@ #include logger.h #include script.h #include cgroup.h +#include linux/vzctl_venet.h #define NETNS_RUN_DIR /var/run/netns @@ -665,8 +666,40 @@ static int ct_netdev_ctl(vps_handler *h, envid_t veid, int op, char *name) static int ct_ip_ctl(vps_handler *h, envid_t veid, int op, const char *ipstr) { - logger(-1, 0, %s not yet supported upstream, __func__); - return 0; + int ret = -1; + char *envp[5]; + char buf[STR_SIZE]; + int i = 0; + + if (!h-can_join_pidns) { + logger(-1, 0, Cannot join pid namespace: + --ipadd is not supported in kernels without full pidns support); + return VZ_BAD_KERNEL; + } + envp[i++] = strdup(VNAME=venet0); + envp[i++] = strdup(BRIDGE=venet0); + + snprintf(buf, sizeof(buf), HNAME=venet0.%d, veid); + envp[i++] = strdup(buf); + + snprintf(buf, sizeof(buf), VEID=%d, veid); + envp[i++] = strdup(buf); + + envp[i] = NULL; + + if (op == VE_IP_ADD) { + char *argv[] = { VPS_NETNS_DEV_ADD, NULL }; + + ret = run_script(VPS_NETNS_DEV_ADD, argv, envp, 0); + } else { + char *argv[] = { VPS_NETNS_DEV_DEL, NULL }; + + ret = run_script(VPS_NETNS_DEV_DEL, argv, envp, 0); + } + free_arg(envp); + + return ret; + } /* -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 8/8] allow for distro-specific fix ups at creation time.
We will need that infrastucture when running with Linux upstream, since some support is very unlikely to ever land in the Kernel. We need to do things like account for the fact that udev may kick in and destroy all the setup we have done for /dev. Since preemptively preventing those actions to happen is very hard ( as an example, centos6 init binary will issue mount syscalls itself), the most robust approach is to insert a script that will be run shortly after rc.sysinit. Although this can't guarantee that it will run *immediately* after, it should be early enough to undo every harmful operation done by /sbin/init and rc.sysinit, therefore allowing operation to continue freely. Signed-off-by: Glauber Costa glom...@parallels.com --- etc/dists/redhat.conf | 1 + etc/dists/scripts/fixups.sh | 43 +++ include/dist.h | 2 ++ include/env.h | 3 ++- src/lib/dist.c | 10 +- src/lib/env.c | 10 ++ src/lib/exec.c | 2 +- src/lib/hooks_ct.c | 38 ++ 8 files changed, 102 insertions(+), 7 deletions(-) create mode 100755 etc/dists/scripts/fixups.sh diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf index 727461d..49226c5 100644 --- a/etc/dists/redhat.conf +++ b/etc/dists/redhat.conf @@ -25,3 +25,4 @@ SET_DNS=set_dns.sh SET_USERPASS=set_userpass.sh SET_UGID_QUOTA=set_ugid_quota.sh POST_CREATE=postcreate.sh +CT_FIXUP=fixups.sh diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh new file mode 100755 index 000..da3fd24 --- /dev/null +++ b/etc/dists/scripts/fixups.sh @@ -0,0 +1,43 @@ +#!/bin/bash +# Copyright (C) 2000-2009, Parallels, Inc. All rights reserved. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +# +# +# Run-time fixup for legacy distributions that will rely on udevd to populate +# their /dev. Those distributions will effectively mount an empty tmpfs on top +# of /dev, defeating any setup we may have done. This includes RHEL-based +# distributions up to RHEL6. We are better off not allowing that to run at all + +function fixup_udev() +{ + if [ -f /etc/fedora-release ]; then + return; + fi + + # All the first two may fail if the distribution didn't actually mount + # then. + umount /dev/pts + umount /dev/shm + # still may hold references to files open by rc.sysinit. Even if this is the + # case those will be simple things like /dev/null and should go away shortly. + # let us not fail because of that. + umount /dev -l +} + +fixup_udev + +exit 0 + diff --git a/include/dist.h b/include/dist.h index 1f6a5a6..4faa452 100644 --- a/include/dist.h +++ b/include/dist.h @@ -29,6 +29,7 @@ #defineSET_USERPASS5 #defineSET_UGID_QUOTA 6 #definePOST_CREATE 7 +#defineCT_FIXUP8 typedef struct { char *def_ostmpl; @@ -46,6 +47,7 @@ typedef struct dist_actions { char *set_userpass; /** setup user password. */ char *set_ugid_quota; /** setup 2level quota. */ char *post_create; /** sostcreate actions. */ + char *ct_fixup; /** run-time fixups. */ } dist_actions; /* Read distribution specific actions configuration file. diff --git a/include/env.h b/include/env.h index d41df2e..acea123 100644 --- a/include/env.h +++ b/include/env.h @@ -104,7 +104,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param *param, int stop_mode, int vz_chroot(const char *root); int vz_env_create(vps_handler *h, envid_t veid, vps_res *res, int wait_p[2], int old_wait_p[2], int err_p[2], - env_create_FN fn, void *data); + env_create_FN fn, dist_actions *actions, void *data); struct vps_res; struct arg_start { @@ -116,6 +116,7 @@ struct arg_start { vps_handler *h; void *data; env_create_FN fn; + dist_actions *actions; int userns_p; /* while running in userns, there's extra sync needed */ }; diff --git a/src/lib/dist.c b/src/lib/dist.c index bde94ab..8047cf7 100644 --- a/src/lib/dist.c +++ b/src/lib/dist.c @@ -36,7 +36,8 @@ static struct distr_conf { {SET_DNS
[Devel] [PATCH v2 0/6] Unprivileged containers with user namespaces
Kir, Please take a look at the following patches. They implement support for unprivileged containers using user namespaces, and should work, modulo bugs. v2: * use conf_parse_ulong to simplify uid/gid parsing. We do need to provide a default value for creation, though. * allow 0 to be specified as uid/gid offset. It simplifies the code if conf_parse_ulong is used, and well, if anyone *really* wants to run privileged... We will apply the default value now only if the fields are unset. Glauber Costa (6): host uid and gid parameters adjust fs_create parameter user namespace support for upstream containers modify tar extraction to account for user namespace add user mismatch test allow local uid and gid to be specified at container creation include/res.h | 8 + include/types.h | 1 + include/vzctl_param.h | 3 ++ man/vzctl.8.in | 14 scripts/vps-create.in | 19 ++ src/lib/Makefile.am | 3 ++ src/lib/chown_preload.c | 93 + src/lib/config.c| 32 + src/lib/create.c| 30 ++-- src/lib/env.c | 29 +++ src/lib/hooks_ct.c | 93 +++-- src/vzctl-actions.c | 2 ++ src/vzctl.c | 1 + vzctl.spec | 2 +- 14 files changed, 316 insertions(+), 14 deletions(-) create mode 100644 src/lib/chown_preload.c -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 1/6] host uid and gid parameters
When running with an upstream Linux kernel that supports user namespaces, we will run the container using an unprivileged user in the system. That can be any user, and it serves as base to a 1:1 mapping between users in the container and users in the host. By default, the value 10 will be used for both uid and gid. Signed-off-by: Glauber Costa glom...@parallels.com --- include/res.h | 8 include/vzctl_param.h | 3 +++ src/lib/config.c | 32 3 files changed, 43 insertions(+) diff --git a/include/res.h b/include/res.h index e07060d..47c36da 100644 --- a/include/res.h +++ b/include/res.h @@ -47,6 +47,12 @@ struct env_param { }; typedef struct env_param env_param_t; +/* + * When running upstream kernels, we will need host-side UID and GID. Those + * are configurable, but if none is specified, those are used. + */ +#define VZ_DEFAULT_UID 10 +#define VZ_DEFAULT_GID 10 typedef struct { list_head_t userpw; list_head_t nameserver; @@ -56,6 +62,8 @@ typedef struct { int onboot; unsigned long *bootorder; int wait; + unsigned long *local_uid; + unsigned long *local_gid; } misc_param; struct mod_action; diff --git a/include/vzctl_param.h b/include/vzctl_param.h index 3034a13..e583540 100644 --- a/include/vzctl_param.h +++ b/include/vzctl_param.h @@ -149,5 +149,8 @@ #define PARAM_MOUNT_OPTS 396 +#define PARAM_LOCAL_UID397 +#define PARAM_LOCAL_GID398 + #define PARAM_LINE e:p:f:t:i:l:k:a:b:n:x:h #endif diff --git a/src/lib/config.c b/src/lib/config.c index cf5f352..cc4c1cc 100644 --- a/src/lib/config.c +++ b/src/lib/config.c @@ -128,6 +128,10 @@ static vps_config config[] = { {CONFIGFILE, NULL, PARAM_CONFIG}, {ORIGIN_SAMPLE,NULL,PARAM_CONFIG_SAMPLE}, {DISABLED, NULL, PARAM_DISABLED}, +#ifdef HAVE_UPSTREAM +{LOCAL_UID, NULL, PARAM_LOCAL_UID}, +{LOCAL_GID, NULL, PARAM_LOCAL_GID}, +#endif /* quota */ {DISK_QUOTA, NULL, PARAM_DISK_QUOTA}, {DISKSPACE, NULL, PARAM_DISKSPACE}, @@ -1363,6 +1367,12 @@ static int store_misc(vps_param *old_p, vps_param *vps_p, vps_config *conf, ret = conf_store_strlist(conf_h, conf-name, misc-searchdomain, 0); break; + case PARAM_LOCAL_UID: + ret = conf_store_ulong(conf_h, conf-name, misc-local_uid); + break; + case PARAM_LOCAL_GID: + ret = conf_store_ulong(conf_h, conf-name, misc-local_gid); + break; } return ret; } @@ -1996,6 +2006,13 @@ static int parse(envid_t veid, vps_param *vps_p, char *val, int id) case PARAM_IPTABLES: ret = parse_iptables(vps_p-res.env, val); break; + + case PARAM_LOCAL_UID: + conf_parse_ulong(vps_p-res.misc.local_uid, val); + break; + case PARAM_LOCAL_GID: + conf_parse_ulong(vps_p-res.misc.local_gid, val); + break; case PARAM_LOCKEDPAGES: case PARAM_PRIVVMPAGES: case PARAM_SHMPAGES: @@ -2697,6 +2714,8 @@ static void free_misc(misc_param *misc) FREE_P(misc-hostname) FREE_P(misc-description) FREE_P(misc-bootorder) + FREE_P(misc-local_uid) + FREE_P(misc-local_gid) } static void free_net(net_param *net) @@ -2789,6 +2808,17 @@ void free_vps_param(vps_param *param) *dst-x = *src-x; \ } +#define MERGE_P_DEFAULT(x, dfl) \ + if ((src-x) != NULL) { \ + if ((dst-x) == NULL) \ + dst-x = malloc(sizeof(*(dst-x))); \ + *dst-x = *src-x; \ + } \ + if ((dst-x) == NULL) { \ + dst-x = malloc(sizeof(*(dst-x))); \ + *dst-x = (unsigned long)(dfl); \ + } + #define MERGE_P2(x)\ if ((src-x) != NULL) { \ if ((dst-x) == NULL) \ @@ -2865,6 +2895,8 @@ static void merge_misc(misc_param *dst, misc_param *src) MERGE_STR(hostname) MERGE_STR(description) MERGE_INT(onboot) + MERGE_P_DEFAULT(local_uid, VZ_DEFAULT_UID) + MERGE_P_DEFAULT(local_gid, VZ_DEFAULT_GID) MERGE_P(bootorder) MERGE_INT(wait) } -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 2/6] adjust fs_create parameter
We need to pass more information to fs_create. Instead of adding arguments, it is preferred to pass the whole vps_p structure and unfold it inside the callee. Signed-off-by: Glauber Costa glom...@parallels.com --- src/lib/create.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/src/lib/create.c b/src/lib/create.c index 39e..0a0330f 100644 --- a/src/lib/create.c +++ b/src/lib/create.c @@ -90,8 +90,8 @@ static int download_template(char *tmpl) return run_script(VPS_DOWNLOAD, arg, env, 0); } -static int fs_create(envid_t veid, fs_param *fs, tmpl_param *tmpl, - dq_param *dq, int layout, int ploop_mode) +static int fs_create(envid_t veid, vps_handler *h, fs_param *fs, + tmpl_param *tmpl, vps_param *vps_p) { char tarball[PATH_LEN]; char tmp_dir[PATH_LEN]; @@ -104,6 +104,10 @@ static int fs_create(envid_t veid, fs_param *fs, tmpl_param *tmpl, char *dst; const char *ext[] = { , .gz, .bz2, .xz, NULL }; const char *errmsg_ext = [.gz|.bz2|.xz]; + dq_param *dq = vps_p-res.dq; + int layout = vps_p-opt.layout; + unsigned int uid_offset = vps_p-res.misc.local_uid; + unsigned int gid_offset = vps_p-res.misc.local_gid; int ploop = (layout == VE_LAYOUT_PLOOP); if (ploop (!dq-diskspace || dq-diskspace[1] = 0)) { @@ -153,6 +157,7 @@ find: /* Create and mount ploop image */ struct vzctl_create_image_param param = {}; struct vzctl_mount_param mount_param = {}; + int ploop_mode = vps_p-opt.mode; if (ploop_mode 0) ploop_mode = PLOOP_EXPANDED_MODE; @@ -360,9 +365,7 @@ int vps_create(vps_handler *h, envid_t veid, vps_param *vps_p, vps_param *cmd_p, tmpl-ostmpl = full_ostmpl; } } - if ((ret = fs_create(veid, fs, tmpl, vps_p-res.dq, - vps_p-opt.layout, - vps_p-opt.mode))) + if ((ret = fs_create(veid, h, fs, tmpl, vps_p))) goto err_root; } -- 1.7.11.7 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 3/6] user namespace support for upstream containers
This patch allows the execution of unprivileged containers running ontop of an upstream Linux Kernel. We will run at whatever UID is found in the configuration file. Signed-off-by: Glauber Costa glom...@parallels.com --- include/types.h| 1 + src/lib/env.c | 16 ++ src/lib/hooks_ct.c | 93 -- 3 files changed, 107 insertions(+), 3 deletions(-) diff --git a/include/types.h b/include/types.h index ceecb93..54eb1f4 100644 --- a/include/types.h +++ b/include/types.h @@ -95,6 +95,7 @@ typedef struct vps_handler { int vzfd; /** /dev/vzctl file descriptor. */ int stdfd; int can_join_pidns; /* can't enter otherwise */ + int can_join_userns; /* can't run non privileged otherwise */ int (*is_run)(struct vps_handler *h, envid_t veid); int (*enter)(struct vps_handler *h, envid_t veid, const char *root, int flags); int (*destroy)(struct vps_handler *h, envid_t veid); diff --git a/src/lib/env.c b/src/lib/env.c index 2da848d..75e2dee 100644 --- a/src/lib/env.c +++ b/src/lib/env.c @@ -280,6 +280,22 @@ int exec_container_init(struct arg_start *arg, if (read(arg-wait_p, ret, sizeof(ret)) == 0) return -1; + /* +* If we are running on upstream Linux Kernel, we will arrive here as +* the default unprivileged user. This needs to be setup by the +* container thread (pid 1), but only after the mapping is established. +* +* Since the mapping can only ever be established by a process that has +* CAP_SETUID in the parent namespace, this has to be done from the +* process who called clone, not by the cloned children. We need some sort +* of synchronization to make sure the mappings are already in place, so +* we do it after the read of wait_p above. +*/ + if (!is_vz_kernel(arg-h) arg-h-can_join_userns) { + setuid(0); + setgid(0); + } + if ((fd = open(/dev/null, O_RDWR)) != -1) { dup2(fd, 0); dup2(fd, 1); diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c index 29d7eea..1ed3b84 100644 --- a/src/lib/hooks_ct.c +++ b/src/lib/hooks_ct.c @@ -66,6 +66,8 @@ static int sys_setns(int fd, int nstype) # define MS_PRIVATE (1 18) #endif +#define UID_GID_RANGE 10 /* how many users per container */ + /* This function is there in GLIBC, but not in headers */ extern int pivot_root(const char * new_root, const char * put_old); @@ -141,7 +143,12 @@ static int _env_create(void *data) if ((ret = ct_chroot(arg-res-fs.root))) return ret; - if ((ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 1))) + /* +* If we are using the user namespace, we will have the full capability +* set in the target namespace. So we don't need any of that. +*/ + if (!arg-h-can_join_userns + (ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 1))) return ret; fill_container_param(arg, create_param); @@ -153,6 +160,31 @@ static int _env_create(void *data) return exec_container_init(arg, create_param); } +static int write_uid_gid_mapping(vps_handler *h, unsigned long uid, unsigned long gid, pid_t pid) +{ + char buf[STR_SIZE]; + char map[STR_SIZE]; + int fd; + + snprintf(map, sizeof(map), 0 %ld %d, uid, UID_GID_RANGE); + snprintf(buf, sizeof(buf), /proc/%d/uid_map, pid); + if ((fd = open(buf, O_WRONLY)) 0) + return -1; + + if ((write(fd, map, sizeof(map)) 0)) + return -1; + + snprintf(map, sizeof(map), 0 %ld %d, gid, UID_GID_RANGE); + snprintf(buf, sizeof(map), /proc/%d/gid_map, pid); + if ((fd = open(buf, O_WRONLY)) 0) + return -1; + + if ((write(fd, map, sizeof(map)) 0)) + return -1; + + return 0; +} + static int ct_env_create(struct arg_start *arg) { @@ -190,16 +222,30 @@ static int ct_env_create(struct arg_start *arg) * Belong in the setup phase */ clone_flags = SIGCHLD; - /* FIXME: USERNS is still work in progress */ clone_flags |= CLONE_NEWUTS|CLONE_NEWPID|CLONE_NEWIPC; clone_flags |= CLONE_NEWNET|CLONE_NEWNS; + if (arg-h-can_join_userns) + clone_flags |= CLONE_NEWUSER; + else + logger(-1, 0, WARNING: Running container unprivileged. USER_NS not supported); + ret = clone(_env_create, child_stack, clone_flags, arg); if (ret 0) { logger(-1, errno, Unable to clone); /* FIXME: remove ourselves from container first */ destroy_container(arg-veid); return VZ_RESOURCE_ERROR; + } else if (arg-h-can_join_userns) { + /* +* Now we need to write to the mapping file. It has to be us
[Devel] [PATCH v2 4/6] modify tar extraction to account for user namespace
If we are running upstream with user namespaces, we need to create the container filesystem not with the ownership preserved, but reflecting the mapping we need to apply. The long term goal is to patch the tar utility to allow us to specify an offset that will be applied during extraction. However, even if tar is modified, the legacy base of old tar utilities is still too big. We can employ a trick to allow container creation right now, as well as to avoid compatibility problems: we will resort to LD_PRELOAD to load a schim that captures calls to the chown family of system calls and applies the offset manually. Signed-off-by: Glauber Costa glom...@parallels.com --- scripts/vps-create.in | 19 ++ src/lib/Makefile.am | 3 ++ src/lib/chown_preload.c | 93 + src/lib/create.c| 21 +++ vzctl.spec | 2 +- 5 files changed, 130 insertions(+), 8 deletions(-) create mode 100644 src/lib/chown_preload.c diff --git a/scripts/vps-create.in b/scripts/vps-create.in index 126f048..53f6643 100755 --- a/scripts/vps-create.in +++ b/scripts/vps-create.in @@ -22,11 +22,27 @@ # Required parameters: # VE_PRVT- path to root of CT private areas # PRIVATE_TEMPLATE - path to private template used as a source for copying +# +# Optional parameters: +# UID_OFFSET - offset to be added to all tar UIDs +# GID_OFFSET - offset to be added to all tar GIDs . @SCRIPTDIR@/vps-functions vzcheckvar VE_PRVT PRIVATE_TEMPLATE +chown_preload_if_needed() +{ + [ -z $UID_OFFSET -o -z $GID_OFFSET ] return + + # TODO use tar with appropriate option when it becomes available + # + # The goal is to get this merged into tar so we no longer + # need to do this preload. Whenever it happens, we will include + # a version check here as well + export LD_PRELOAD=libvzchown.so +} + create_prvt() { local TMP AVAIL NEEDED HEADER OPT @@ -75,6 +91,9 @@ create_prvt() [ $AVAIL -ge $NEEDED ] || vzerror Insufficient disk space in $VE_PRVT; available: $AVAIL, needed: $NEEDED ${VZ_FS_NO_DISK_SPACE} CAT=cat + + chown_preload_if_needed + # Use pv to show nice progress bar if we can pv -V /dev/null 21 CAT=pv chmod 700 $VE_PRVT diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am index 42a7ed6..44c8a69 100644 --- a/src/lib/Makefile.am +++ b/src/lib/Makefile.am @@ -70,6 +70,9 @@ libvzctl_la_LIBADD = $(XML_LIBS) $(CGROUP_LIBS) $(DL_LIBS) if HAVE_CGROUP libvzctl_la_SOURCES += cgroup.c hooks_ct.c + +lib_LTLIBRARIES += libvzchown.la +libvzchown_la_SOURCES = chown_preload.c endif if HAVE_VZ_KERNEL diff --git a/src/lib/chown_preload.c b/src/lib/chown_preload.c new file mode 100644 index 000..4e2be4a --- /dev/null +++ b/src/lib/chown_preload.c @@ -0,0 +1,93 @@ +/* + * Copyright (C) 2013, Parallels, Inc. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + * Authors: Kir Kolyshkin and Glauber Costa + */ + +#include unistd.h +#include stdio.h +#include stdlib.h + +#include dlfcn.h + +static int (*real_chown)(const char *path, uid_t owner, gid_t group) = NULL; +static int (*real_fchown)(int fd, uid_t owner, gid_t group) = NULL; +static int (*real_lchown)(const char *path, uid_t owner, gid_t group) = NULL; +static int (*real_fchownat)(int dirfd, const char *pathname, + uid_t owner, gid_t group, int flags) = NULL; + +uid_t uid_offset = 0; +gid_t gid_offset = 0; + +static void __init(void) +{ + char *uidstr, *gidstr; + + uidstr = getenv(UID_OFFSET); + gidstr = getenv(GID_OFFSET); + if (!uidstr || !gidstr) { + fprintf(stderr, Environment variables UID_OFFSET + and GID_OFFSET are required -- aborting\n); + exit(33); + } + uid_offset = strtol(uidstr, NULL, 10); + gid_offset = strtol(gidstr, NULL, 10); + + real_chown = dlsym(RTLD_NEXT, chown); + real_fchown = dlsym(RTLD_NEXT, fchown); + real_lchown = dlsym(RTLD_NEXT, lchown); + real_fchownat = dlsym(RTLD_NEXT, fchownat); + + if (!real_chown || !real_fchown || !real_lchown || !real_fchownat) { + fprintf(stderr