Re: [Devel] [PATCH 3/8] memcg, slab: never try to merge memcg caches

2014-02-04 Thread Glauber Costa
On Tue, Feb 4, 2014 at 7:27 PM, Vladimir Davydov vdavy...@parallels.com wrote:
 On 02/04/2014 07:11 PM, Michal Hocko wrote:
 On Tue 04-02-14 18:59:23, Vladimir Davydov wrote:
 On 02/04/2014 06:52 PM, Michal Hocko wrote:
 On Sun 02-02-14 20:33:48, Vladimir Davydov wrote:
 Suppose we are creating memcg cache A that could be merged with cache B
 of the same memcg. Since any memcg cache has the same parameters as its
 parent cache, parent caches PA and PB of memcg caches A and B must be
 mergeable too. That means PA was merged with PB on creation or vice
 versa, i.e. PA = PB. From that it follows that A = B, and we couldn't
 even try to create cache B, because it already exists - a contradiction.
 I cannot tell I understand the above but I am totally not sure about the
 statement bellow.

 So let's remove unused code responsible for merging memcg caches.
 How come the code was unused? find_mergeable called cache_match_memcg...
 Oh, sorry for misleading comment. I mean the code handling merging of
 per-memcg caches is useless, AFAIU: if we find an alias for a per-memcg
 cache on kmem_cache_create_memcg(), the parent of the found alias must
 be the same as the parent_cache passed to kmem_cache_create_memcg(), but
 if it were so, we would never proceed to the memcg cache creation,
 because the cache we want to create already exists.
 I am still not sure I understand this correctly. So the outcome of this
 patch is that compatible caches of different memcgs can be merged
 together? Sorry if this is a stupid question but I am not that familiar
 with this area much I am just seeing that cache_match_memcg goes away
 and my understanding of the function is that it should prevent from
 different memcg's caches merging.

 Let me try to explain how I understand it.

 What is cache merging/aliasing? When we create a cache
 (kmem_cache_create()), we first try to find a compatible cache that
 already exists and can handle requests from the new cache. If it is, we
 do not create any new caches, instead we simply increment the old cache
 refcount and return it.

 What about memcgs? Currently, it operates in the same way, i.e. on memcg
 cache creation we also try to find a compatible cache of the same memcg
 first. But if there were such a cache, they parents would have been
 merged (i.e. it would be the same cache). That means we would not even
 get to this memcg cache creation, because it already exists. That's why
 the code handling memcg caches merging seems pointless to me.


IIRC, this may not always hold. Some of the properties are configurable via
sysfs, and it might be that you haven't merged two parent caches because they
properties differ, but would be fine merging the child caches.

If all properties we check are compile-time parameters, then it should be okay.


-- 
E Mare, Libertas
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 1/8] memcg: export kmemcg cache id via cgroup fs

2014-02-03 Thread Glauber Costa
On Mon, Feb 3, 2014 at 10:57 AM, Vladimir Davydov
vdavy...@parallels.com wrote:
 On 02/03/2014 10:21 AM, David Rientjes wrote:
 On Sun, 2 Feb 2014, Vladimir Davydov wrote:

 Per-memcg kmem caches are named as follows:

   global-cache-name(cgroup-kmem-id:cgroup-name)

 where cgroup-kmem-id is the unique id of the memcg the cache belongs
 to, cgroup-name is the relative name of the memcg on the cgroup fs.
 Cache names are exposed to userspace for debugging purposes (e.g. via
 sysfs in case of slub or via dmesg).

 Using relative names makes it impossible in general (in case the cgroup
 hierarchy is not flat) to find out which memcg a particular cache
 belongs to, because cgroup-kmem-id is not known to the user. Since
 using absolute cgroup names would be an overkill, let's fix this by
 exporting the id of kmem-active memcg via cgroup fs file
 memory.kmem.id.

 Hmm, I'm not sure exporting additional information is the best way to do
 it only for this purpose.  I do understand the problem in naming
 collisions if the hierarchy isn't flat and we typically work around that
 by ensuring child memcgs still have a unique memcg.  This isn't only a
 problem in slab cache naming, me also avoid printing the entire absolute
 names for things like the oom killer.

 AFAIU, cgroup identifiers dumped on oom (cgroup paths, currently) and
 memcg slab cache names serve for different purposes. The point is oom is
 a perfectly normal situation for the kernel, and info dumped to dmesg is
 for admin to find out the cause of the problem (a greedy user or
 cgroup). On the other hand, slab cache names are dumped to dmesg only on
 extraordinary situations - like bugs in slab implementation, or double
 free, or detected memory leaks - where we usually do not need the name
 of the memcg that triggered the problem, because the bug is likely to be
 in the kernel subsys using the cache. Plus, the names are exported to
 sysfs in case of slub, again for debugging purposes, AFAIK. So IMO the
 use cases for oom vs slab names are completely different - information
 vs debugging - and I want to export kmem.id only for the ability of
 debugging kmemcg and slab subsystems.


Then maybe it is better to wrap it into some kind of CONFIG_DEBUG wrap.
We already have other files like that.

 So it would be nice to have
 consensus on how people are supposed to identify memcgs with a hierarchy:
 either by exporting information like the id like you do here (but leave
 the oom killer still problematic) or by insisting people name their memcgs
 with unique names if they care to differentiate them.

 Anyway, I agree with you that this needs a consensus, because this is a
 functional change.

 Thanks.



-- 
E Mare, Libertas
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v14 16/18] vmpressure: in-kernel notifications

2013-12-20 Thread Glauber Costa
 I have the exact problem described above for a project I'm working on
 and this solution seems to solve it well.

 However, I had a few issues while trying to use this interface. I'll
 comment on them below, but please take this more as advice seeking
 than patch review.

 This patch extends that to also support in-kernel users. Events that
 should be generated for in-kernel consumption will be marked as such,
 and for those, we will call a registered function instead of triggering
 an eventfd notification.

 Please note that due to my lack of understanding of each shrinker user,
 I will stay away from converting the actual users, you are all welcome
 to do so.

 Signed-off-by: Glauber Costa glom...@openvz.org
 Signed-off-by: Vladimir Davydov vdavy...@parallels.com
 Acked-by: Anton Vorontsov an...@enomsg.org
 Acked-by: Pekka Enberg penb...@kernel.org
 Reviewed-by: Greg Thelen gthe...@google.com
 Cc: Dave Chinner dchin...@redhat.com
 Cc: John Stultz john.stu...@linaro.org
 Cc: Andrew Morton a...@linux-foundation.org
 Cc: Joonsoo Kim iamjoonsoo@lge.com
 Cc: Michal Hocko mho...@suse.cz
 Cc: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 Cc: Johannes Weiner han...@cmpxchg.org
 ---
  include/linux/vmpressure.h |5 +
  mm/vmpressure.c|   53 
 +---
  2 files changed, 55 insertions(+), 3 deletions(-)

 diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
 index 3f3788d..9102e53 100644
 --- a/include/linux/vmpressure.h
 +++ b/include/linux/vmpressure.h
 @@ -19,6 +19,9 @@ struct vmpressure {
   /* Have to grab the lock on events traversal or modifications. */
   struct mutex events_lock;

 + /* False if only kernel users want to be notified, true otherwise. */
 + bool notify_userspace;
 +
   struct work_struct work;
  };

 @@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct 
 cgroup_subsys_state *css,
struct cftype *cft,
struct eventfd_ctx *eventfd,
const char *args);
 +extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
 + void (*fn)(void));
  extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
   struct cftype *cft,
   struct eventfd_ctx *eventfd);
 diff --git a/mm/vmpressure.c b/mm/vmpressure.c
 index e0f6283..730e7c1 100644
 --- a/mm/vmpressure.c
 +++ b/mm/vmpressure.c
 @@ -130,8 +130,12 @@ static enum vmpressure_levels 
 vmpressure_calc_level(unsigned long scanned,
  }

  struct vmpressure_event {
 - struct eventfd_ctx *efd;
 + union {
 + struct eventfd_ctx *efd;
 + void (*fn)(void);

 How does the callback access its private data?

 + };
   enum vmpressure_levels level;
 + bool kernel_event;
   struct list_head node;
  };

 @@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
   mutex_lock(vmpr-events_lock);

   list_for_each_entry(ev, vmpr-events, node) {
 - if (level = ev-level) {
 + if (ev-kernel_event) {
 + ev-fn();

 I think it would be interesting to pass 'level' to the callback (I'll
 probably use it myself), but we could wait for a in-tree user before
 adding it.

 + } else if (vmpr-notify_userspace  level = ev-level) {
   eventfd_signal(ev-efd, 1);
   signalled = true;
   }
   }

 + vmpr-notify_userspace = false;
   mutex_unlock(vmpr-events_lock);

   return signalled;
 @@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
* we account it too.
*/
   if (!(gfp  (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
 - return;
 + goto schedule;

   /*
* If we got here with no pages scanned, then that is an indicator
 @@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
   vmpr-scanned += scanned;
   vmpr-reclaimed += reclaimed;
   scanned = vmpr-scanned;
 + /*
 +  * If we didn't reach this point, only kernel events will be triggered.
 +  * It is the job of the worker thread to clean this up once the
 +  * notifications are all delivered.
 +  */
 + vmpr-notify_userspace = true;
   spin_unlock(vmpr-sr_lock);

 +schedule:
   if (scanned  vmpressure_win)
   return;
   schedule_work(vmpr-work);
 @@ -324,6 +338,39 @@ int vmpressure_register_event(struct 
 cgroup_subsys_state *css,
  }

  /**
 + * vmpressure_register_kernel_event() - Register kernel-side notification
 + * @css: css that is interested in vmpressure notifications
 + * @fn:  function to be called when pressure happens
 + *
 + * This function register in-kernel users interested in receiving 
 notifications

Re: [Devel] [PATCH v14 16/18] vmpressure: in-kernel notifications

2013-12-20 Thread Glauber Costa
One correction:

  int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
 - void (*fn)(void))
 +void (*fn)(void *data, int level), void 
 *data)
  {
 -   struct vmpressure *vmpr = css_to_vmpressure(css);
 +   struct vmpressure *vmpr;
 struct vmpressure_event *ev;

 +   vmpr = css ? css_to_vmpressure(css) : memcg_to_vmpressure(NULL);
 +

This looks like it could be improved. Better not to have that memcg
specific thing
here.

Other than that it makes sense.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v14 16/18] vmpressure: in-kernel notifications

2013-12-20 Thread Glauber Costa
On Fri, Dec 20, 2013 at 8:44 PM, Luiz Capitulino lcapitul...@redhat.com wrote:
 On Fri, 20 Dec 2013 10:03:32 -0500
 Luiz Capitulino lcapitul...@redhat.com wrote:

  The answer for all of your questions above can be summarized by noting
  that for the lack of other users (at the time), this patch does the bare 
  minimum
  for memcg needs. I agree, for instance, that it would be good to pass the 
  level
  but since memcg won't do anything with thta, I didn't pass it.
 
  That should be extended if you need to.

 That works for me. That is, including this minimal version first and
 extending it when we get in-tree users.

 Btw, there's something I was thinking just right now. If/when we
 convert shrink functions to use this API, they will come to depend
 on CONFIG_MEMCG=y. IOW, they won't work if CONFIG_MEMCG=n.

 Is this acceptable (this is an honest question)? Because today, they
 do work when CONFIG_MEMCG=n. Should those shrink functions use the
 shrinker API as a fallback?

If you have a non-memcg user, that should obviously be available for
CONFIG_MEMCG=n
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v14 16/18] vmpressure: in-kernel notifications

2013-12-20 Thread Glauber Costa
On Fri, Dec 20, 2013 at 8:53 PM, Luiz Capitulino lcapitul...@redhat.com wrote:
 On Fri, 20 Dec 2013 20:46:05 +0400
 Glauber Costa glom...@gmail.com wrote:

 On Fri, Dec 20, 2013 at 8:44 PM, Luiz Capitulino lcapitul...@redhat.com 
 wrote:
  On Fri, 20 Dec 2013 10:03:32 -0500
  Luiz Capitulino lcapitul...@redhat.com wrote:
 
   The answer for all of your questions above can be summarized by noting
   that for the lack of other users (at the time), this patch does the 
   bare minimum
   for memcg needs. I agree, for instance, that it would be good to pass 
   the level
   but since memcg won't do anything with thta, I didn't pass it.
  
   That should be extended if you need to.
 
  That works for me. That is, including this minimal version first and
  extending it when we get in-tree users.
 
  Btw, there's something I was thinking just right now. If/when we
  convert shrink functions to use this API, they will come to depend
  on CONFIG_MEMCG=y. IOW, they won't work if CONFIG_MEMCG=n.
 
  Is this acceptable (this is an honest question)? Because today, they
  do work when CONFIG_MEMCG=n. Should those shrink functions use the
  shrinker API as a fallback?

 If you have a non-memcg user, that should obviously be available for
 CONFIG_MEMCG=n

 OK, which means we'll have to change it, right? Because, if I'm not
 missing something, today vmpressure does depend on CONFIG_MEMCG=y.

You mean the main vmpressure mechanism?
Sorry, this was out of my mental cachelines. Yes, vmpressure depends
on MEMCG, because
the pressure interface is memcg-specific (global == root memcg)

You might want to change that so you can reuse the mechanism and let
only the user interface
depend on memcg.


-- 
E Mare, Libertas
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 4/6] memcg, slab: check and init memcg_cahes under slab_mutex

2013-12-19 Thread Glauber Costa
On Thu, Dec 19, 2013 at 11:07 AM, Vladimir Davydov
vdavy...@parallels.com wrote:
 On 12/18/2013 09:41 PM, Michal Hocko wrote:
 On Wed 18-12-13 17:16:55, Vladimir Davydov wrote:
 The memcg_params::memcg_caches array can be updated concurrently from
 memcg_update_cache_size() and memcg_create_kmem_cache(). Although both
 of these functions take the slab_mutex during their operation, the
 latter checks if memcg's cache has already been allocated w/o taking the
 mutex. This can result in a race as described below.

 Asume two threads schedule kmem_cache creation works for the same
 kmem_cache of the same memcg from __memcg_kmem_get_cache(). One of the
 works successfully creates it. Another work should fail then, but if it
 interleaves with memcg_update_cache_size() as follows, it does not:
 I am not sure I understand the race. memcg_update_cache_size is called
 when we start accounting a new memcg or a child is created and it
 inherits accounting from the parent. memcg_create_kmem_cache is called
 when a new cache is first allocated from, right?

 memcg_update_cache_size() is called when kmem accounting is activated
 for a memcg, no matter how.

 memcg_create_kmem_cache() is scheduled from __memcg_kmem_get_cache().
 It's OK to have a bunch of such methods trying to create the same memcg
 cache concurrently, but only one of them should succeed.

 Why cannot we simply take slab_mutex inside memcg_create_kmem_cache?
 it is running from the workqueue context so it should clash with other
 locks.

 Hmm, Glauber's code never takes the slab_mutex inside memcontrol.c. I
 have always been wondering why, because it could simplify flow paths
 significantly (e.g. update_cache_sizes() - update_all_caches() -
 update_cache_size() - from memcontrol.c to slab_common.c and back again
 just to take the mutex).


Because that is a layering violation and exposes implementation
details of the slab to
the outside world. I agree this would make things a lot simpler, but
please check with Christoph
if this is acceptable before going forward.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH] memcg: remove KMEM_ACCOUNTED_ACTIVATED

2013-12-17 Thread Glauber Costa
On Mon, Dec 2, 2013 at 10:15 PM, Michal Hocko mho...@suse.cz wrote:
 [CCing Glauber - please do so in other posts for kmem related changes]

 On Mon 02-12-13 17:08:13, Vladimir Davydov wrote:
 The KMEM_ACCOUNTED_ACTIVATED was introduced by commit a8964b9b (memcg:
 use static branches when code not in use) in order to guarantee that
 static_key_slow_inc(memcg_kmem_enabled_key) will be called only once
 for each memory cgroup when its kmem limit is set. The point is that at
 that time the memcg_update_kmem_limit() function's workflow looked like
 this:

   bool must_inc_static_branch = false;

   cgroup_lock();
   mutex_lock(set_limit_mutex);
   if (!memcg-kmem_account_flags  val != RESOURCE_MAX) {
   /* The kmem limit is set for the first time */
   ret = res_counter_set_limit(memcg-kmem, val);

   memcg_kmem_set_activated(memcg);
   must_inc_static_branch = true;
   } else
   ret = res_counter_set_limit(memcg-kmem, val);
   mutex_unlock(set_limit_mutex);
   cgroup_unlock();

   if (must_inc_static_branch) {
   /* We can't do this under cgroup_lock */
   static_key_slow_inc(memcg_kmem_enabled_key);
   memcg_kmem_set_active(memcg);
   }

 Today, we don't use cgroup_lock in memcg_update_kmem_limit(), and
 static_key_slow_inc() is called under the set_limit_mutex, but the
 leftover from the above-mentioned commit is still here. Let's remove it.

 OK, so I have looked there again and 692e89abd154b (memcg: increment
 static branch right after limit set) which went in after cgroup_mutex
 has been removed. It came along with the following comment.
 /*
  * setting the active bit after the inc will guarantee no one
  * starts accounting before all call sites are patched
  */

 This suggests that the flag is needed after all because we have
 to be sure that _all_ the places have to be patched. AFAIU
 memcg_kmem_newpage_charge might see the static key already patched so
 it would do a charge but memcg_kmem_commit_charge would still see it
 unpatched and so the charge won't be committed.

 Or am I missing something?

You are correct. This flag is there due to the way we are using static branches.
The patching of one call site is atomic, but the patching of all of
them are not.
Therefore we need to use a two-flag scheme to guarantee that in the first time
we turn the static branches on, there will be a clear point after
which we're going
to start accounting.





-- 
E Mare, Libertas
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH] memcg: remove KMEM_ACCOUNTED_ACTIVATED

2013-12-17 Thread Glauber Costa
On Mon, Dec 2, 2013 at 10:51 PM, Michal Hocko mho...@suse.cz wrote:
 On Mon 02-12-13 22:26:48, Glauber Costa wrote:
 On Mon, Dec 2, 2013 at 10:15 PM, Michal Hocko mho...@suse.cz wrote:
  [CCing Glauber - please do so in other posts for kmem related changes]
 
  On Mon 02-12-13 17:08:13, Vladimir Davydov wrote:
  The KMEM_ACCOUNTED_ACTIVATED was introduced by commit a8964b9b (memcg:
  use static branches when code not in use) in order to guarantee that
  static_key_slow_inc(memcg_kmem_enabled_key) will be called only once
  for each memory cgroup when its kmem limit is set. The point is that at
  that time the memcg_update_kmem_limit() function's workflow looked like
  this:
 
bool must_inc_static_branch = false;
 
cgroup_lock();
mutex_lock(set_limit_mutex);
if (!memcg-kmem_account_flags  val != RESOURCE_MAX) {
/* The kmem limit is set for the first time */
ret = res_counter_set_limit(memcg-kmem, val);
 
memcg_kmem_set_activated(memcg);
must_inc_static_branch = true;
} else
ret = res_counter_set_limit(memcg-kmem, val);
mutex_unlock(set_limit_mutex);
cgroup_unlock();
 
if (must_inc_static_branch) {
/* We can't do this under cgroup_lock */
static_key_slow_inc(memcg_kmem_enabled_key);
memcg_kmem_set_active(memcg);
}
 
  Today, we don't use cgroup_lock in memcg_update_kmem_limit(), and
  static_key_slow_inc() is called under the set_limit_mutex, but the
  leftover from the above-mentioned commit is still here. Let's remove it.
 
  OK, so I have looked there again and 692e89abd154b (memcg: increment
  static branch right after limit set) which went in after cgroup_mutex
  has been removed. It came along with the following comment.
  /*
   * setting the active bit after the inc will guarantee no 
  one
   * starts accounting before all call sites are patched
   */
 
  This suggests that the flag is needed after all because we have
  to be sure that _all_ the places have to be patched. AFAIU
  memcg_kmem_newpage_charge might see the static key already patched so
  it would do a charge but memcg_kmem_commit_charge would still see it
  unpatched and so the charge won't be committed.
 
  Or am I missing something?

 You are correct. This flag is there due to the way we are using static 
 branches.
 The patching of one call site is atomic, but the patching of all of
 them are not.
 Therefore we need to use a two-flag scheme to guarantee that in the first 
 time
 we turn the static branches on, there will be a clear point after
 which we're going
 to start accounting.

 So http://lkml.org/lkml/2013/11/27/314 is correct then, right?

It looks correct.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH] memcg: remove KMEM_ACCOUNTED_ACTIVATED

2013-12-17 Thread Glauber Costa
On Mon, Dec 2, 2013 at 11:21 PM, Vladimir Davydov
vdavy...@parallels.com wrote:
 On 12/02/2013 10:26 PM, Glauber Costa wrote:

 On Mon, Dec 2, 2013 at 10:15 PM, Michal Hocko mho...@suse.cz wrote:

 [CCing Glauber - please do so in other posts for kmem related changes]

 On Mon 02-12-13 17:08:13, Vladimir Davydov wrote:

 The KMEM_ACCOUNTED_ACTIVATED was introduced by commit a8964b9b (memcg:
 use static branches when code not in use) in order to guarantee that
 static_key_slow_inc(memcg_kmem_enabled_key) will be called only once
 for each memory cgroup when its kmem limit is set. The point is that at
 that time the memcg_update_kmem_limit() function's workflow looked like
 this:

bool must_inc_static_branch = false;

cgroup_lock();
mutex_lock(set_limit_mutex);
if (!memcg-kmem_account_flags  val != RESOURCE_MAX) {
/* The kmem limit is set for the first time */
ret = res_counter_set_limit(memcg-kmem, val);

memcg_kmem_set_activated(memcg);
must_inc_static_branch = true;
} else
ret = res_counter_set_limit(memcg-kmem, val);
mutex_unlock(set_limit_mutex);
cgroup_unlock();

if (must_inc_static_branch) {
/* We can't do this under cgroup_lock */
static_key_slow_inc(memcg_kmem_enabled_key);
memcg_kmem_set_active(memcg);
}

 Today, we don't use cgroup_lock in memcg_update_kmem_limit(), and
 static_key_slow_inc() is called under the set_limit_mutex, but the
 leftover from the above-mentioned commit is still here. Let's remove it.

 OK, so I have looked there again and 692e89abd154b (memcg: increment
 static branch right after limit set) which went in after cgroup_mutex
 has been removed. It came along with the following comment.
  /*
   * setting the active bit after the inc will guarantee
 no one
   * starts accounting before all call sites are patched
   */

 This suggests that the flag is needed after all because we have
 to be sure that _all_ the places have to be patched. AFAIU
 memcg_kmem_newpage_charge might see the static key already patched so
 it would do a charge but memcg_kmem_commit_charge would still see it
 unpatched and so the charge won't be committed.

 Or am I missing something?

 You are correct. This flag is there due to the way we are using static
 branches.
 The patching of one call site is atomic, but the patching of all of
 them are not.
 Therefore we need to use a two-flag scheme to guarantee that in the first
 time
 we turn the static branches on, there will be a clear point after
 which we're going
 to start accounting.


 Hi, Glauber

 Sorry, but I don't understand why we need two flags. Isn't checking the flag
 set after all call sites have been patched (I mean KMEM_ACCOUNTED_ACTIVE)
 not enough?

Take a look at net/ipv4/tcp_memcontrol.c. There are comprehensive comments there
for a mechanism that basically achieves the same thing. The idea is
that one flag is used
at all times and means it is enabled. The second flags is a one time
only flag to indicate
that the patching process is complete. With one flag it seems to work,
but it is racy.

-- 
E Mare, Libertas
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH] memcg: remove KMEM_ACCOUNTED_ACTIVATED

2013-12-17 Thread Glauber Costa
 Hi, Glauber

Hi.


 In memcg_update_kmem_limit() we do the whole process of limit
 initialization under a mutex so the situation we need protection from in
 tcp_update_limit() is impossible. BTW once set, the 'activated' flag is
 never cleared and never checked alone, only along with the 'active'
 flag, that's why I doubt we need it at all.

The updates are protected by a mutex, but the readers are not, and should not.
So we can still be patching the readers, and the double-flag was
initially used so
we can make sure that both flags are only set after the static branches are in.

Note that one of the flags is set inside memcg_update_cache_sizes(). After that,
we call static_key_slow_inc(). At this point, someone whose code is
patched in could
start accounting, but it shouldn't - because not all sites are patched in.

So after the update is done, we set the other flag, and now everybody
will start going
through.

Could you do something clever with just one flag? Probably yes. But I
doubt it would
be that much cleaner, this is just the way that patching sites work.

-- 
E Mare, Libertas
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH] memcg: remove KMEM_ACCOUNTED_ACTIVATED

2013-12-17 Thread Glauber Costa

 Could you do something clever with just one flag? Probably yes. But I
 doubt it would
 be that much cleaner, this is just the way that patching sites work.

 Thank you for spending your time to listen to me.

Don't worry! I thank you for carrying this forward.

 Let me try to explain what is bothering me.

 We have two state bits for each memcg, 'active' and 'activated'. There
 are two scenarios where the bits can be modified:

 1) The kmem limit is set on a memcg for the first time -
 memcg_update_kmem_limit(). Here we call memcg_update_cache_sizes(),
 which sets the 'activated' bit on success, then update static branching,
 then set the 'active' bit. All three actions are done atomically in
 respect to other tasks setting the limit due to the set_limit_mutex.
 After both bits are set, they never get cleared for the memcg.

So far so good. But again, note how you yourself describe it:
the cations are done atomically  *in respect to other tasks setting the limit*

But there are also tasks that are running its courses naturally and
just allocating
memory. For those, some call sites will be on, some will be off. We need to make
sure that *none* of them uses the patched site until *all* of them are patched.
This has nothing to do with updates, this is all about the readers.

 2) When a subgroup of a kmem-active cgroup is created -
 memcg_propagate_kmem(). Here we copy kmem_account_flags from the parent,
 then increase static branching refcounter, then call
 memcg_update_cache_sizes() for the new memcg, which may clear the
 'activated' bit on failure. After successful execution, the state bits
 never get cleared for the new memcg.

 In scenario 2 there is no need bothering about the flags setting order,
 because we don't have any tasks in the cgroup yet - the tasks can be
 moved in only after css_online finishes when we have both of the bits
 set and the static branching enabled. Actually, we already do not bother
 about it, because we have both bits set before the cgroup is fully
 initialized (memcg_update_cache_sizes() is called).

Yes, after the first cgroup is set none of that matters. But it is just easier
and less error prone just to follow the same path every time. As I have said,
if you can come up with a more clever way to deal with the problem above
that doesn't involve the double flag - and you can prove it works - I
am definitely
fine with it. But this is subtle code, and in the past - Michal can
attest this - we've
changed this being sure it would work just to see it explode in our faces.

So although I am willing to review every patch for correctness on that
front (I never
said I liked the 2-flags scheme...), unless you have a bug or real
problem on it,
I would advise against changing it if its just to make it more readable.

But again, don't take me too seriously on this. If you and Michal think you can
come up with something better, I'm all for it.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v13 00/16] kmemcg shrinkers

2013-12-17 Thread Glauber Costa
 Please note that in contrast to previous versions this patch-set implements
 slab shrinking only when we hit the user memory limit so that kmem allocations
 will still fail if we are below the user memory limit, but close to the kmem
 limit. This is, because the implementation of kmem-only reclaim was rather
 incomplete - we had to fail GFP_NOFS allocations since everything we could
 reclaim was only FS data. I will try to improve this and send in a separate
 patch-set, but currently it is only worthwhile setting the kmem limit to be
 greater than the user mem limit just to enable per-memcg slab accounting and
 reclaim.

That is unfortunate, but it makes sense as a first step.



-- 
E Mare, Libertas
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v13 04/16] memcg: move memcg_caches_array_size() function

2013-12-17 Thread Glauber Costa
On Mon, Dec 9, 2013 at 12:05 PM, Vladimir Davydov
vdavy...@parallels.com wrote:
 I need to move this up a bit, and I am doing in a separate patch just to
 reduce churn in the patch that needs it.

 Signed-off-by: Vladimir Davydov vdavy...@parallels.com
Reviewed-by: Glauber Costa glom...@openvz.org
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v13 05/16] vmscan: move call to shrink_slab() to shrink_zones()

2013-12-17 Thread Glauber Costa
On Mon, Dec 9, 2013 at 12:05 PM, Vladimir Davydov
vdavy...@parallels.com wrote:
 This reduces the indentation level of do_try_to_free_pages() and removes
 extra loop over all eligible zones counting the number of on-LRU pages.


Looks correct to me.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v13 14/16] vmpressure: in-kernel notifications

2013-12-17 Thread Glauber Costa
On Mon, Dec 9, 2013 at 12:05 PM, Vladimir Davydov
vdavy...@parallels.com wrote:
 From: Glauber Costa glom...@openvz.org

 During the past weeks, it became clear to us that the shrinker interface

It has been more than a few weeks by now =)
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v13 13/16] vmscan: take at least one pass with shrinkers

2013-12-17 Thread Glauber Costa
On Tue, Dec 10, 2013 at 3:50 PM, Vladimir Davydov
vdavy...@parallels.com wrote:
 On 12/10/2013 08:18 AM, Dave Chinner wrote:
 On Mon, Dec 09, 2013 at 12:05:54PM +0400, Vladimir Davydov wrote:
 From: Glauber Costa glom...@openvz.org

 In very low free kernel memory situations, it may be the case that we
 have less objects to free than our initial batch size. If this is the
 case, it is better to shrink those, and open space for the new workload
 then to keep them and fail the new allocations.

 In particular, we are concerned with the direct reclaim case for memcg.
 Although this same technique can be applied to other situations just as
 well, we will start conservative and apply it for that case, which is
 the one that matters the most.
 This should be at the start of the series.

 Since Glauber wanted to introduce this only for memcg-reclaim first,
 this can't be at the start of the series, but I'll move it to go
 immediately after per-memcg shrinking core in the next iteration.

 Thanks.

So, the reason for that being memcg only, is that the reclaim for
small objects triggered
a bunch of XFS regressions (I am sure the regressions are general, but
I've tested them using
ZFS).

In theory they shouldn't, so we can try to make it global again, so
long as it comes together
with benchmarks demonstrating that it is a safe change.

I am not sure the filesystem people would benefit from that directly,
though. So it may not be worth the hassle...
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] Race in memcg kmem?

2013-12-17 Thread Glauber Costa
On Tue, Dec 10, 2013 at 5:59 PM, Vladimir Davydov
vdavy...@parallels.com wrote:
 Hi,

 Looking through the per-memcg kmem_cache initialization code, I have a
 bad feeling that it is prone to a race. Before getting to fixing it, I'd
 like to ensure this race is not only in my imagination. Here it goes.

 We keep per-memcg kmem caches in the memcg_params field of each root
 cache. The memcg_params array is grown dynamically by
 memcg_update_cache_size(). I guess that if this function is executed
 concurrently with memcg_create_kmem_cache() we can get a race resulting
 in a memory leak.

Ok, let's see.

 -- memcg_create_kmem_cache(memcg, cachep) --
 creates a new kmem_cache corresponding to a memcg and assigns it to the
 root cache; called in the background - it is OK to have several such
 functions trying to create a cache for the same memcg concurrently, but
 only one of them should succeed.

Yes.

 @cachep is the root cache
 @memcg is the memcg we want to create a cache for.

 The function:

 A1) assures there is no cache corresponding to the memcg (if it is we
 have nothing to do):
 idx = memcg_cache_id(memcg);
 if (cachep-memcg_params[idx])
 goto out;

 A2) creates and assigns a new cache:
 new_cachep = kmem_cache_dup(memcg, cachep);
Please note this cannot proceed in parallel with memcg_update_cache_size(),
because they both take the slab mutex.

 // init new_cachep
 cachep-memcg_params-memcg_caches[idx] = new_cachep;


 -- memcg_update_cache_size(s, num_groups) --
 grows s-memcg_params to accomodate data for num_groups memcg's
 @s is the root cache whose memcg_params we want to grow
 @num_groups is the new number of kmem-active cgroups (defines the new
 size of memcg_params array).

 The function:

 B1) allocates and assigns a new cache:
 cur_params = s-memcg_params;
 s-memcg_params = kzalloc(size, GFP_KERNEL);

 B2) copies per-memcg cache ptrs from the old memcg_params array to the
 new one:
 for (i = 0; i  memcg_limited_groups_array_size; i++) {
 if (!cur_params-memcg_caches[i])
 continue;
 s-memcg_params-memcg_caches[i] =
 cur_params-memcg_caches[i];
 }

 B3) frees the old array:
 kfree(cur_params);


 Since these two functions do not share any mutexes, we can get the

They do share a mutex, the slab mutex.

 following race:

 Assume, by the time Cpu0 gets to memcg_create_kmem_cache(), the memcg
 cache has already been created by another thread, so this function
 should do nothing.

 Cpu0Cpu1
 
 B1
 A1  we haven't initialized memcg_params yet so Cpu0 will
 proceed to A2 to alloc and assign a new cache
 A2
 B2  Cpu1 rewrites the memcg cache ptr set by Cpu0 at A2
 - a memory leak?
 B3

 I'd like to add that even if I'm right about the race, this is rather
 not critical, because memcg_update_cache_sizes() is called very rarely.

Every race is critical.

So, I am a bit lost by your description. Get back to me if I got anything wrong,
but I am think that the point that you're missing is that all heavy
slab operations
take the slab_mutex underneath, and that includes cache creation and update.



 BTW, it seems to me that the way we update memcg_params in
 memcg_update_cache_size() make cache_from_memcg_idx() prone to
 use-after-free:

 static inline struct kmem_cache *
 cache_from_memcg_idx(struct kmem_cache *s, int idx)
 {
 if (!s-memcg_params)
 return NULL;
 return s-memcg_params-memcg_caches[idx];
 }

 This is equivalent to

 1) struct memcg_cache_params *params = s-memcg_params;
 2) return params-memcg_caches[idx];

 If memcg_update_cache_size() is executed between steps 1 and 2 on
 another CPU, at step 2 we will dereference memcg_params that has already
 been freed. This is very unlikely, but still possible. Perhaps, we
 should free old memcg params only after a sync_rcu()?


You seem to be right in this one. Indeed, if my mind does not betray
me, That is how I freed
the LRUs. (with synchronize_rcus).
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v13 11/16] mm: list_lru: add per-memcg lists

2013-12-17 Thread Glauber Costa
 OK, as far as I can tell, this is introducing a per-node, per-memcg
 LRU lists. Is that correct?

 If so, then that is not what Glauber and I originally intended for
 memcg LRUs. per-node LRUs are expensive in terms of memory and cross
 multiplying them by the number of memcgs in a system was not a good
 use of memory.

 According to Glauber, most memcgs are small and typically confined
 to a single node or two by external means and therefore don't need the
 scalability numa aware LRUs provide. Hence the idea was that the
 memcg LRUs would just be a single LRU list, just like a non-numa
 aware list_lru instantiation. IOWs, this is the structure that we
 had decided on as the best compromise between memory usage,
 complexity and memcg awareness:

Sorry for jumping late into this particular e-mail.

I just wanted to point out that the reason I adopted such matrix in my
design was that
it actually uses less memory this way. My reasoning for this was
explained in the original
patch that I posted that contained that implementation.

This is because whenever an object would go on a memcg list, it *would
not* go on
the global list. Therefore, to keep information about nodes for global
reclaim, you
have to put them in node-lists.

memcg reclaim, however, would reclaim regardless of node information.

In global reclaim, the memcg lists would be scanned obeying the node structure
in the lists.

Because that has a fixed cost, it ends up using less memory that having a second
list pointer in the objects, which is something that scale with the
number of objects.
Not to mention, that cost would be incurred even with memcg not being in use,
which is something that we would like to avoid.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 1/2] memcg: fix memcg_size() calculation

2013-12-17 Thread Glauber Costa
On Mon, Dec 16, 2013 at 8:47 PM, Michal Hocko mho...@suse.cz wrote:
 On Sat 14-12-13 12:15:33, Vladimir Davydov wrote:
 The mem_cgroup structure contains nr_node_ids pointers to
 mem_cgroup_per_node objects, not the objects themselves.

 Ouch! This is 2k per node which is wasted. What a shame I haven't
 noticed this back then when reviewing 45cf7ebd5a033 (memcg: reduce the
 size of struct memcg 244-fold)

IIRC, they weren't pointers back then. I think they were embedded in
the structure, and I let
them embedded.
My mind may be tricking me, but I think I recall that Johannes changed
them to pointers
in a later time. No ?

In any case, this is correct.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v7 00/11] per-cgroup cpu-stat

2013-05-29 Thread Glauber Costa
Peter et. al,

I am coming again with this series, hoping this is a better time for you all to
look at it.

I am *not* going as far as marking cpuacct deprecated, because I think it
deserves a special discussion (even though my position in this matter is widely
known), but all the infrastructure to make it happen is here. But after this,
it should be a matter of setting a flag (or not).

Through this patchset I am making cpu cgroup provide the same functionality of
cpuacct, and now with a more clear semantics, I attempt to provide userspace
with enough information to reconstruct per-container version of files like
/proc/stat. In particular, we are interested in knowing the per-cgroup slices
of user time, system time, wait time, number of processes, and a variety of
statistics.

To make sure we can count nr of switches correctly, I am ressurecting one of
Peter's patches that apparently got nowhere in the past.

Glauber Costa (8):
  don't call cpuacct_charge in stop_task.c
  sched: adjust exec_clock to use it as cpu usage metric
  cpuacct: don't actually do anything.
  sched: document the cpu cgroup.
  sched: account guest time per-cgroup as well.
  sched: record per-cgroup number of context switches
  sched: change nr_context_switches calculation.
  sched: introduce cgroup file stat_percpu

Peter Zijlstra (1):
  sched: Push put_prev_task() into pick_next_task()

Tejun Heo (2):
  cgroup: implement CFTYPE_NO_PREFIX
  cgroup, sched: let cpu serve the same files as cpuacct

 Documentation/cgroups/cpu.txt |  99 +++
 include/linux/cgroup.h|   1 +
 kernel/cgroup.c   |  16 +-
 kernel/sched/core.c   | 385 --
 kernel/sched/cputime.c|  33 +++-
 kernel/sched/fair.c   |  39 -
 kernel/sched/idle_task.c  |   9 +-
 kernel/sched/rt.c |  42 +++--
 kernel/sched/sched.h  |  35 +++-
 kernel/sched/stop_task.c  |   8 +-
 10 files changed, 626 insertions(+), 41 deletions(-)
 create mode 100644 Documentation/cgroups/cpu.txt

-- 
1.8.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v7 01/11] don't call cpuacct_charge in stop_task.c

2013-05-29 Thread Glauber Costa
Commit 8f618968 changed stop_task to do the same bookkeping as the
other classes. However, the call to cpuacct_charge() doesn't affect
the scheduler decisions at all, and doesn't need to be moved over.

Moreover, being a kthread, the migration thread won't belong to any
cgroup anyway, rendering this call quite useless.

Signed-off-by: Glauber Costa glom...@openvz.org
CC: Mike Galbraith mgalbra...@suse.de
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/stop_task.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index e08fbee..4142d7e 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -68,7 +68,6 @@ static void put_prev_task_stop(struct rq *rq, struct 
task_struct *prev)
account_group_exec_runtime(curr, delta_exec);
 
curr-se.exec_start = rq_clock_task(rq);
-   cpuacct_charge(curr, delta_exec);
 }
 
 static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
-- 
1.8.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v7 04/11] sched: adjust exec_clock to use it as cpu usage metric

2013-05-29 Thread Glauber Costa
exec_clock already provides per-group cpu usage metrics, and can be
reused by cpuacct in case cpu and cpuacct are comounted.

However, it is only provided by tasks in fair class. Doing the same for
rt is easy, and can be done in an already existing hierarchy loop. This
is an improvement over the independent hierarchy walk executed by
cpuacct.

Signed-off-by: Glauber Costa glom...@openvz.org
CC: Dave Jones da...@redhat.com
CC: Ben Hutchings b...@decadent.org.uk
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Paul Turner p...@google.com
CC: Lennart Poettering lenn...@poettering.net
CC: Kay Sievers kay.siev...@vrfy.org
CC: Tejun Heo t...@kernel.org
---
 kernel/sched/rt.c| 1 +
 kernel/sched/sched.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e9f8dcd..4a21045 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -907,6 +907,7 @@ static void update_curr_rt(struct rq *rq)
 
for_each_sched_rt_entity(rt_se) {
rt_rq = rt_rq_of_se(rt_se);
+   schedstat_add(rt_rq, exec_clock, delta_exec);
 
if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
raw_spin_lock(rt_rq-rt_runtime_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 765c687..b05dd84 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -254,6 +254,7 @@ struct cfs_rq {
unsigned int nr_running, h_nr_running;
 
u64 exec_clock;
+   u64 prev_exec_clock;
u64 min_vruntime;
 #ifndef CONFIG_64BIT
u64 min_vruntime_copy;
@@ -356,6 +357,8 @@ struct rt_rq {
struct plist_head pushable_tasks;
 #endif
int rt_throttled;
+   u64 exec_clock;
+   u64 prev_exec_clock;
u64 rt_time;
u64 rt_runtime;
/* Nests inside the rq lock: */
-- 
1.8.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v7 06/11] sched: document the cpu cgroup.

2013-05-29 Thread Glauber Costa
The CPU cgroup is so far, undocumented. Although data exists in the
Documentation directory about its functioning, it is usually spread,
and/or presented in the context of something else. This file
consolidates all cgroup-related information about it.

Signed-off-by: Glauber Costa glom...@openvz.org
---
 Documentation/cgroups/cpu.txt | 81 +++
 1 file changed, 81 insertions(+)
 create mode 100644 Documentation/cgroups/cpu.txt

diff --git a/Documentation/cgroups/cpu.txt b/Documentation/cgroups/cpu.txt
new file mode 100644
index 000..072fd58
--- /dev/null
+++ b/Documentation/cgroups/cpu.txt
@@ -0,0 +1,81 @@
+CPU Controller
+--
+
+The CPU controller is responsible for grouping tasks together that will be
+viewed by the scheduler as a single unit. The CFS scheduler will first divide
+CPU time equally between all entities in the same level, and then proceed by
+doing the same in the next level. Basic use cases for that are described in the
+main cgroup documentation file, cgroups.txt.
+
+Users of this functionality should be aware that deep hierarchies will of
+course impose scheduler overhead, since the scheduler will have to take extra
+steps and look up additional data structures to make its final decision.
+
+Through the CPU controller, the scheduler is also able to cap the CPU
+utilization of a particular group. This is particularly useful in environments
+in which CPU is paid for by the hour, and one values predictability over
+performance.
+
+CPU Accounting
+--
+
+The CPU cgroup will also provide additional files under the prefix cpuacct.
+Those files provide accounting statistics and were previously provided by the
+separate cpuacct controller. Although the cpuacct controller will still be kept
+around for compatibility reasons, its usage is discouraged. If both the CPU and
+cpuacct controllers are present in the system, distributors are encouraged to
+always mount them together.
+
+Files
+-
+
+The CPU controller exposes the following files to the user:
+
+ - cpu.shares: The weight of each group living in the same hierarchy, that
+ translates into the amount of CPU it is expected to get. Upon cgroup creation,
+ each group gets assigned a default of 1024. The percentage of CPU assigned to
+ the cgroup is the value of shares divided by the sum of all shares in all
+ cgroups in the same level.
+
+ - cpu.cfs_period_us: The duration in microseconds of each scheduler period, 
for
+ bandwidth decisions. This defaults to 10us or 100ms. Larger periods will
+ improve throughput at the expense of latency, since the scheduler will be able
+ to sustain a cpu-bound workload for longer. The opposite of true for smaller
+ periods. Note that this only affects non-RT tasks that are scheduled by the
+ CFS scheduler.
+
+- cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us
+  in for the current group will be allowed to run. For instance, if it is set 
to
+  half of cpu_period_us, the cgroup will only be able to peak run for 50 % of
+  the time. One should note that this represents aggregate time over all CPUs
+  in the system. Therefore, in order to allow full usage of two CPUs, for
+  instance, one should set this value to twice the value of cfs_period_us.
+
+- cpu.stat: statistics about the bandwidth controls. No data will be presented
+  if cpu.cfs_quota_us is not set. The file presents three
+  numbers:
+   nr_periods: how many full periods have been elapsed.
+   nr_throttled: number of times we exausted the full allowed bandwidth
+   throttled_time: total time the tasks were not run due to being overquota
+
+ - cpu.rt_runtime_us and cpu.rt_period_us: Those files are the RT-tasks
+   analogous to the CFS files cfs_quota_us and cfs_period_us. One important
+   difference, though, is that while the cfs quotas are upper bounds that
+   won't necessarily be met, the rt runtimes form a stricter guarantee.
+   Therefore, no overlap is allowed. Implications of that are that given a
+   hierarchy with multiple children, the sum of all rt_runtime_us may not 
exceed
+   the runtime of the parent. Also, a rt_runtime_us of 0, means that no rt 
tasks
+   can ever be run in this cgroup. For more information about rt tasks runtime
+   assignments, see scheduler/sched-rt-group.txt
+
+ - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks
+   in this group.
+
+ - cpuacct.usage_percpu: The CPU time, in nanoseconds, consumed by all tasks in
+   this group, separated by CPU. The format is an space-separated array of time
+   values, one for each present CPU.
+
+ - cpuacct.stat: aggregate user and system time consumed by tasks in this 
group.
+   The format is
+   user: x
+   system: y
-- 
1.8.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v7 07/11] sched: account guest time per-cgroup as well.

2013-05-29 Thread Glauber Costa
We already track multiple tick statistics per-cgroup, using
the task_group_account_field facility. This patch accounts
guest_time in that manner as well.

Signed-off-by: Glauber Costa glom...@openvz.org
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Paul Turner p...@google.com
---
 kernel/sched/cputime.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 74a44ef..e653e52 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -183,8 +183,6 @@ void account_user_time(struct task_struct *p, cputime_t 
cputime,
 static void account_guest_time(struct task_struct *p, cputime_t cputime,
   cputime_t cputime_scaled)
 {
-   u64 *cpustat = kcpustat_this_cpu-cpustat;
-
/* Add guest time to process. */
p-utime += cputime;
p-utimescaled += cputime_scaled;
@@ -193,11 +191,11 @@ static void account_guest_time(struct task_struct *p, 
cputime_t cputime,
 
/* Add guest time to cpustat. */
if (TASK_NICE(p)  0) {
-   cpustat[CPUTIME_NICE] += (__force u64) cputime;
-   cpustat[CPUTIME_GUEST_NICE] += (__force u64) cputime;
+   task_group_account_field(p, CPUTIME_NICE, (__force u64) 
cputime);
+   task_group_account_field(p, CPUTIME_GUEST, (__force u64) 
cputime);
} else {
-   cpustat[CPUTIME_USER] += (__force u64) cputime;
-   cpustat[CPUTIME_GUEST] += (__force u64) cputime;
+   task_group_account_field(p, CPUTIME_USER, (__force u64) 
cputime);
+   task_group_account_field(p, CPUTIME_GUEST, (__force u64) 
cputime);
}
 }
 
-- 
1.8.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v7 08/11] sched: Push put_prev_task() into pick_next_task()

2013-05-29 Thread Glauber Costa
From: Peter Zijlstra a.p.zijls...@chello.nl

In order to avoid having to do put/set on a whole cgroup hierarchy
when we context switch, push the put into pick_next_task() so that
both operations are in the same function. Further changes then allow
us to possibly optimize away redundant work.

[ glom...@openvz.org: incorporated mailing list feedback ]

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Signed-off-by: Glauber Costa glom...@openvz.org
---
 kernel/sched/core.c  | 20 +++-
 kernel/sched/fair.c  |  6 +-
 kernel/sched/idle_task.c |  6 +-
 kernel/sched/rt.c| 27 ---
 kernel/sched/sched.h |  8 +++-
 kernel/sched/stop_task.c |  5 -
 6 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0fa0f87..3b1441f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2295,18 +2295,11 @@ static inline void schedule_debug(struct task_struct 
*prev)
schedstat_inc(this_rq(), sched_count);
 }
 
-static void put_prev_task(struct rq *rq, struct task_struct *prev)
-{
-   if (prev-on_rq || rq-skip_clock_update  0)
-   update_rq_clock(rq);
-   prev-sched_class-put_prev_task(rq, prev);
-}
-
 /*
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq)
+pick_next_task(struct rq *rq, struct task_struct *prev)
 {
const struct sched_class *class;
struct task_struct *p;
@@ -2316,13 +2309,13 @@ pick_next_task(struct rq *rq)
 * the fair class we can call that function directly:
 */
if (likely(rq-nr_running == rq-cfs.h_nr_running)) {
-   p = fair_sched_class.pick_next_task(rq);
+   p = fair_sched_class.pick_next_task(rq, prev);
if (likely(p))
return p;
}
 
for_each_class(class) {
-   p = class-pick_next_task(rq);
+   p = class-pick_next_task(rq, prev);
if (p)
return p;
}
@@ -2417,8 +2410,9 @@ need_resched:
if (unlikely(!rq-nr_running))
idle_balance(cpu, rq);
 
-   put_prev_task(rq, prev);
-   next = pick_next_task(rq);
+   if (prev-on_rq || rq-skip_clock_update  0)
+   update_rq_clock(rq);
+   next = pick_next_task(rq, prev);
clear_tsk_need_resched(prev);
rq-skip_clock_update = 0;
 
@@ -4395,7 +4389,7 @@ static void migrate_tasks(unsigned int dead_cpu)
if (rq-nr_running == 1)
break;
 
-   next = pick_next_task(rq);
+   next = pick_next_task(rq, NULL);
BUG_ON(!next);
next-sched_class-put_prev_task(rq, next);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a7f5c39..414cf1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3628,7 +3628,8 @@ preempt:
set_last_buddy(se);
 }
 
-static struct task_struct *pick_next_task_fair(struct rq *rq)
+static struct task_struct *
+pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 {
struct task_struct *p;
struct cfs_rq *cfs_rq = rq-cfs;
@@ -3637,6 +3638,9 @@ static struct task_struct *pick_next_task_fair(struct rq 
*rq)
if (!cfs_rq-nr_running)
return NULL;
 
+   if (prev)
+   prev-sched_class-put_prev_task(rq, prev);
+
do {
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d8da010..4f0a332 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -33,8 +33,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct 
task_struct *p, int fl
resched_task(rq-idle);
 }
 
-static struct task_struct *pick_next_task_idle(struct rq *rq)
+static struct task_struct *
+pick_next_task_idle(struct rq *rq, struct task_struct *prev)
 {
+   if (prev)
+   prev-sched_class-put_prev_task(rq, prev);
+
schedstat_inc(rq, sched_goidle);
 #ifdef CONFIG_SMP
/* Trigger the post schedule to do an idle_enter for CFS */
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4a21045..6c41fa1 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1330,15 +1330,7 @@ static struct task_struct *_pick_next_task_rt(struct rq 
*rq)
 {
struct sched_rt_entity *rt_se;
struct task_struct *p;
-   struct rt_rq *rt_rq;
-
-   rt_rq = rq-rt;
-
-   if (!rt_rq-rt_nr_running)
-   return NULL;
-
-   if (rt_rq_throttled(rt_rq))
-   return NULL;
+   struct rt_rq *rt_rq  = rq-rt;
 
do {
rt_se = pick_next_rt_entity(rq, rt_rq);
@@ -1352,9 +1344,22 @@ static struct task_struct *_pick_next_task_rt(struct rq 
*rq)
return p;
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct

[Devel] [PATCH v7 09/11] sched: record per-cgroup number of context switches

2013-05-29 Thread Glauber Costa
Context switches are, to this moment, a property of the runqueue. When
running containers, we would like to be able to present a separate
figure for each container (or cgroup, in this context).

The chosen way to accomplish this is to increment a per cfs_rq or
rt_rq, depending on the task, for each of the sched entities involved,
up to the parent. It is trivial to note that for the parent cgroup, we
always add 1 by doing this. Also, we are not introducing any hierarchy
walks in here. An already existent walk is reused.
There are, however, two main issues:

 1. the traditional context switch code only increment nr_switches when
 a different task is being inserted in the rq. Eventually, albeit not
 likely, we will pick the same task as before. Since for cfq and rt we
 only now which task will be next after the walk, we need to do the walk
 again, decrementing 1. Since this is by far not likely, it seems a fair
 price to pay.

 2. Those figures do not include switches from and to the idle or stop
 task. Those need to be recorded separately, which will happen in a
 follow up patch.

Signed-off-by: Glauber Costa glom...@openvz.org
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Paul Turner p...@google.com
---
 kernel/sched/fair.c  | 18 ++
 kernel/sched/rt.c| 15 +--
 kernel/sched/sched.h |  3 +++
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 414cf1c..b29e5dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3642,6 +3642,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
*prev)
prev-sched_class-put_prev_task(rq, prev);
 
do {
+   if (likely(prev))
+   cfs_rq-nr_switches++;
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
@@ -3651,6 +3653,22 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
*prev)
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
 
+   /*
+* This condition is extremely unlikely, and most of the time will just
+* consist of this unlikely branch, which is extremely cheap. But we
+* still need to have it, because when we first loop through cfs_rq's,
+* we can't possibly know which task we will pick. The call to
+* set_next_entity above is not meant to mess up the tree in this case,
+* so this should give us the same chain, in the same order.
+*/
+   if (unlikely(p == prev)) {
+   se = p-se;
+   for_each_sched_entity(se) {
+   cfs_rq = cfs_rq_of(se);
+   cfs_rq-nr_switches--;
+   }
+   }
+
return p;
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6c41fa1..894fc0b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1326,13 +1326,16 @@ static struct sched_rt_entity 
*pick_next_rt_entity(struct rq *rq,
return next;
 }
 
-static struct task_struct *_pick_next_task_rt(struct rq *rq)
+static struct task_struct *
+_pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 {
struct sched_rt_entity *rt_se;
struct task_struct *p;
struct rt_rq *rt_rq  = rq-rt;
 
do {
+   if (likely(prev))
+   rt_rq-rt_nr_switches++;
rt_se = pick_next_rt_entity(rq, rt_rq);
BUG_ON(!rt_se);
rt_rq = group_rt_rq(rt_se);
@@ -1341,6 +1344,14 @@ static struct task_struct *_pick_next_task_rt(struct rq 
*rq)
p = rt_task_of(rt_se);
p-se.exec_start = rq_clock_task(rq);
 
+   /* See fair.c for an explanation on this */
+   if (unlikely(p == prev)) {
+   for_each_sched_rt_entity(rt_se) {
+   rt_rq = rt_rq_of_se(rt_se);
+   rt_rq-rt_nr_switches--;
+   }
+   }
+
return p;
 }
 
@@ -1359,7 +1370,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
if (prev)
prev-sched_class-put_prev_task(rq, prev);
 
-   p = _pick_next_task_rt(rq);
+   p = _pick_next_task_rt(rq, prev);
 
/* The running task is never eligible for pushing */
if (p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a39eb9b..39d3284 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -273,6 +273,7 @@ struct cfs_rq {
unsigned int nr_spread_over;
 #endif
 
+   u64 nr_switches;
 #ifdef CONFIG_SMP
 /*
  * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
@@ -342,6 +343,8 @@ static inline int rt_bandwidth_enabled(void)
 struct rt_rq {
struct rt_prio_array active;
unsigned int rt_nr_running;
+   u64 rt_nr_switches;
+
 #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
struct {
int curr; /* highest queued rt task prio */
-- 
1.8.1.4

[Devel] [PATCH v7 10/11] sched: change nr_context_switches calculation.

2013-05-29 Thread Glauber Costa
This patch changes the calculation of nr_context_switches. The variable
nr_switches is now used to account for the number of transition to the
idle task, or stop task. It is removed from the schedule() path.

The total calculation can be made using the fact that the transitions to
fair and rt classes are recorded in the root_task_group. One can easily
derive the total figure by adding those quantities together.

Signed-off-by: Glauber Costa glom...@openvz.org
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Paul Turner p...@google.com
---
 kernel/sched/core.c  | 17 +++--
 kernel/sched/idle_task.c |  3 +++
 kernel/sched/stop_task.c |  2 ++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b1441f..892c828 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2031,13 +2031,27 @@ unsigned long nr_running(void)
return sum;
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define cfs_nr_switches(tg, cpu) (tg)-cfs_rq[cpu]-nr_switches
+#else
+#define cfs_nr_switches(tg, cpu) cpu_rq(cpu)-cfs.nr_switches
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+#define rt_nr_switches(tg, cpu) (tg)-rt_rq[cpu]-rt_nr_switches
+#else
+#define rt_nr_switches(tg, cpu) cpu_rq(cpu)-rt.rt_nr_switches
+#endif
+
 unsigned long long nr_context_switches(void)
 {
int i;
unsigned long long sum = 0;
 
-   for_each_possible_cpu(i)
+   for_each_possible_cpu(i) {
+   sum += cfs_nr_switches(root_task_group, i);
+   sum += rt_nr_switches(root_task_group, i);
sum += cpu_rq(i)-nr_switches;
+   }
 
return sum;
 }
@@ -2417,7 +2431,6 @@ need_resched:
rq-skip_clock_update = 0;
 
if (likely(prev != next)) {
-   rq-nr_switches++;
rq-curr = next;
++*switch_count;
 
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 4f0a332..923286c 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -39,6 +39,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev)
if (prev)
prev-sched_class-put_prev_task(rq, prev);
 
+   if (prev != rq-idle)
+   rq-nr_switches++;
+
schedstat_inc(rq, sched_goidle);
 #ifdef CONFIG_SMP
/* Trigger the post schedule to do an idle_enter for CFS */
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 6adf6a9..48c572a 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -32,6 +32,8 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev)
stop-se.exec_start = rq_clock_task(rq);
if (prev)
prev-sched_class-put_prev_task(rq, prev);
+   if (prev != rq-stop)
+   rq-nr_switches++;
return stop;
}
 
-- 
1.8.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v7 11/11] sched: introduce cgroup file stat_percpu

2013-05-29 Thread Glauber Costa
The file cpu.stat_percpu will show various scheduler related
information, that are usually available to the top level through other
files.

For instance, most of the meaningful data in /proc/stat is presented
here. Given this file, a container can easily construct a local copy of
/proc/stat for internal consumption.

The data we export is comprised of:
* all the tick information, previously available only through cpuacct,
  like user time, system time, etc.

* wait time, which can be used to construct analogous information to
  steal time in hypervisors,

* nr_switches and nr_running, which are cgroup-local versions of
  their global counterparts.

The file format consists of a one-line header that describes the fields
being listed.  No guarantee is given that the fields will be kept the
same between kernel releases, and readers should always check the header
in order to introspect it.

Each of the following lines will show the respective field value for
each of the possible cpus in the system. All values are show in
nanoseconds.

One example output for this file is:

cpu user nice system irq softirq guest guest_nice wait nr_switches nr_running
cpu0 47100 0 1500 0 0 0 0 1996534 7205 1
cpu1 58800 0 1700 0 0 0 0 2848680 6510 1
cpu2 50500 0 1400 0 0 0 0 2350771 6183 1
cpu3 47200 0 1600 0 0 0 0 19766345 6277 2

Signed-off-by: Glauber Costa glom...@openvz.org
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Paul Turner p...@google.com
---
 Documentation/cgroups/cpu.txt |  18 +++
 kernel/sched/core.c   | 109 ++
 kernel/sched/fair.c   |  14 ++
 kernel/sched/sched.h  |  10 +++-
 4 files changed, 150 insertions(+), 1 deletion(-)

diff --git a/Documentation/cgroups/cpu.txt b/Documentation/cgroups/cpu.txt
index 072fd58..d40e500 100644
--- a/Documentation/cgroups/cpu.txt
+++ b/Documentation/cgroups/cpu.txt
@@ -68,6 +68,24 @@ The CPU controller exposes the following files to the user:
can ever be run in this cgroup. For more information about rt tasks runtime
assignments, see scheduler/sched-rt-group.txt
 
+ - cpu.stat_percpu: Various scheduler statistics for the current group. The
+   information provided in this file is akin to the one displayed in 
/proc/stat,
+   except for the fact that it is cgroup-aware. The file format consists of a
+   one-line header that describes the fields being listed.  No guarantee is
+   given that the fields will be kept the same between kernel releases, and
+   readers should always check the header in order to introspect it.
+
+   Each of the following lines will show the respective field value for
+   each of the possible cpus in the system. All values are show in
+   nanoseconds. One example output for this file is:
+
+   cpu user nice system irq softirq guest guest_nice wait nr_switches 
nr_running
+   cpu0 47100 0 1500 0 0 0 0 1996534 7205 1
+   cpu1 58800 0 1700 0 0 0 0 2848680 6510 1
+   cpu2 50500 0 1400 0 0 0 0 2350771 6183 1
+   cpu3 47200 0 1600 0 0 0 0 19766345 6277 2
+
+
  - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks
in this group.
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 892c828..e2963ec 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7239,6 +7239,7 @@ static inline void cfs_exec_clock_reset(struct task_group 
*tg, int cpu)
 #else
 static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
 {
+   return 0;
 }
 
 static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
@@ -7670,6 +7671,108 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, 
struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHEDSTATS
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define fair_rq(field, tg, i)  (tg)-cfs_rq[i]-field
+#else
+#define fair_rq(field, tg, i)  0
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+#define rt_rq(field, tg, i)  (tg)-rt_rq[i]-field
+#else
+#define rt_rq(field, tg, i)  0
+#endif
+
+static u64 tg_nr_switches(struct task_group *tg, int cpu)
+{
+   /* nr_switches, which counts idle and stop task, is added to all tgs */
+   return cpu_rq(cpu)-nr_switches +
+   cfs_nr_switches(tg, cpu) + rt_nr_switches(tg, cpu);
+}
+
+static u64 tg_nr_running(struct task_group *tg, int cpu)
+{
+   /*
+* because of autogrouped groups in root_task_group, the
+* following does not hold.
+*/
+   if (tg != root_task_group)
+   return rt_rq(rt_nr_running, tg, cpu) + fair_rq(nr_running, tg, 
cpu);
+
+   return cpu_rq(cpu)-nr_running;
+}
+
+static u64 tg_wait(struct task_group *tg, int cpu)
+{
+   u64 val;
+
+   if (tg != root_task_group)
+   val = cfs_read_wait(tg, cpu);
+   else
+   /*
+* There are many errors here that we are accumulating.
+* However, we only provide this in the interest of having

[Devel] [PATCH v7 03/11] cgroup, sched: let cpu serve the same files as cpuacct

2013-05-29 Thread Glauber Costa
From: Tejun Heo t...@kernel.org

cpuacct being on a separate hierarchy is one of the main cgroup
related complaints from scheduler side and the consensus seems to be

* Allowing cpuacct to be a separate controller was a mistake.  In
  general multiple controllers on the same type of resource should be
  avoided, especially accounting-only ones.

* Statistics provided by cpuacct are useful and should instead be
  served by cpu.

This patch makes cpu maintain and serve all cpuacct.* files and make
cgroup core ignore cpuacct if it's co-mounted with cpu.  This is a
step in deprecating cpuacct.  The next patch will allow disabling or
dropping cpuacct without affecting userland too much.

Note that this creates some discrepancies in /proc/cgroups and
/proc/PID/cgroup.  The co-mounted cpuacct won't be reflected correctly
there.  cpuacct will eventually be removed completely probably except
for the statistics filenames and I'd like to keep the amount of
compatbility hackery to minimum as much as possible.

The cpu statistics implementation isn't optimized in any way.  It's
mostly verbatim copy from cpuacct.  The goal is allowing quick
disabling and removal of CONFIG_CGROUP_CPUACCT and creating a base on
top of which cpu can implement proper optimization.

[ glommer: don't call *_charge in stop_task.c ]

Signed-off-by: Tejun Heo t...@kernel.org
Signed-off-by: Glauber Costa glom...@openvz.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Michal Hocko mho...@suse.cz
Cc: Kay Sievers kay.siev...@vrfy.org
Cc: Lennart Poettering mzxre...@0pointer.de
Cc: Dave Jones da...@redhat.com
Cc: Ben Hutchings b...@decadent.org.uk
Cc: Paul Turner p...@google.com
---
 kernel/cgroup.c|  13 
 kernel/sched/core.c| 173 +
 kernel/sched/cputime.c |  23 +++
 kernel/sched/fair.c|   1 +
 kernel/sched/rt.c  |   1 +
 kernel/sched/sched.h   |   7 ++
 6 files changed, 218 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 8bbeb4d..01c3ed0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1256,6 +1256,19 @@ static int parse_cgroupfs_options(char *data, struct 
cgroup_sb_opts *opts)
}
 
/*
+* cpuacct is deprecated and cpu will serve the same stat files.
+* If co-mount with cpu is requested, ignore cpuacct.  Note that
+* this creates some discrepancies in /proc/cgroups and
+* /proc/PID/cgroup.
+*
+* https://lkml.org/lkml/2012/9/13/542
+*/
+#if IS_ENABLED(CONFIG_CGROUP_SCHED)  IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+   if ((opts-subsys_mask  (1  cpu_cgroup_subsys_id)) 
+   (opts-subsys_mask  (1  cpuacct_subsys_id)))
+   opts-subsys_mask = ~(1  cpuacct_subsys_id);
+#endif
+   /*
 * Option noprefix was introduced just for backward compatibility
 * with the old cpuset, so we allow noprefix only if mounting just
 * the cpuset subsystem.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 36f85be..5ae1adf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6327,6 +6327,7 @@ int in_sched_functions(unsigned long addr)
  */
 struct task_group root_task_group;
 LIST_HEAD(task_groups);
+static DEFINE_PER_CPU(u64, root_tg_cpuusage);
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -6385,6 +6386,8 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_CGROUP_SCHED
+   root_task_group.cpustat = kernel_cpustat;
+   root_task_group.cpuusage = root_tg_cpuusage;
list_add(root_task_group.list, task_groups);
INIT_LIST_HEAD(root_task_group.children);
INIT_LIST_HEAD(root_task_group.siblings);
@@ -6665,6 +6668,8 @@ static void free_sched_group(struct task_group *tg)
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
+   free_percpu(tg-cpuusage);
+   free_percpu(tg-cpustat);
kfree(tg);
 }
 
@@ -6677,6 +6682,11 @@ struct task_group *sched_create_group(struct task_group 
*parent)
if (!tg)
return ERR_PTR(-ENOMEM);
 
+   tg-cpuusage = alloc_percpu(u64);
+   tg-cpustat = alloc_percpu(struct kernel_cpustat);
+   if (!tg-cpuusage || !tg-cpustat)
+   goto err;
+
if (!alloc_fair_sched_group(tg, parent))
goto err;
 
@@ -6776,6 +6786,24 @@ void sched_move_task(struct task_struct *tsk)
 
task_rq_unlock(rq, tsk, flags);
 }
+
+void task_group_charge(struct task_struct *tsk, u64 cputime)
+{
+   struct task_group *tg;
+   int cpu = task_cpu(tsk);
+
+   rcu_read_lock();
+
+   tg = container_of(task_subsys_state(tsk, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+   for (; tg; tg = tg-parent) {
+   u64 *cpuusage = per_cpu_ptr(tg-cpuusage, cpu);
+   *cpuusage += cputime;
+   }
+
+   rcu_read_unlock();
+}
 #endif /* CONFIG_CGROUP_SCHED */
 
 #if defined

[Devel] [PATCH v7 05/11] cpuacct: don't actually do anything.

2013-05-29 Thread Glauber Costa
All the information we have that is needed for cpuusage (and
cpuusage_percpu) is present in schedstats. It is already recorded
in a sane hierarchical way.

If we have CONFIG_SCHEDSTATS, we don't really need to do any extra
work. All former functions become empty inlines.

Signed-off-by: Glauber Costa glom...@openvz.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Michal Hocko mho...@suse.cz
Cc: Kay Sievers kay.siev...@vrfy.org
Cc: Lennart Poettering mzxre...@0pointer.de
Cc: Dave Jones da...@redhat.com
Cc: Ben Hutchings b...@decadent.org.uk
Cc: Paul Turner p...@google.com
---
 kernel/sched/core.c  | 102 ++-
 kernel/sched/sched.h |  10 +++--
 2 files changed, 90 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ae1adf..0fa0f87 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6787,6 +6787,7 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, flags);
 }
 
+#ifndef CONFIG_SCHEDSTATS
 void task_group_charge(struct task_struct *tsk, u64 cputime)
 {
struct task_group *tg;
@@ -6804,6 +6805,7 @@ void task_group_charge(struct task_struct *tsk, u64 
cputime)
 
rcu_read_unlock();
 }
+#endif
 #endif /* CONFIG_CGROUP_SCHED */
 
 #if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
@@ -7199,22 +7201,92 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup 
*old_cgrp,
sched_move_task(task);
 }
 
-static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+/*
+ * Take rq-lock to make 64-bit write safe on 32-bit platforms.
+ */
+static inline void lock_rq_dword(int cpu)
 {
-   u64 *cpuusage = per_cpu_ptr(tg-cpuusage, cpu);
-   u64 data;
-
 #ifndef CONFIG_64BIT
-   /*
-* Take rq-lock to make 64-bit read safe on 32-bit platforms.
-*/
raw_spin_lock_irq(cpu_rq(cpu)-lock);
-   data = *cpuusage;
+#endif
+}
+
+static inline void unlock_rq_dword(int cpu)
+{
+#ifndef CONFIG_64BIT
raw_spin_unlock_irq(cpu_rq(cpu)-lock);
+#endif
+}
+
+#ifdef CONFIG_SCHEDSTATS
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+   return tg-cfs_rq[cpu]-exec_clock - tg-cfs_rq[cpu]-prev_exec_clock;
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+   tg-cfs_rq[cpu]-prev_exec_clock = tg-cfs_rq[cpu]-exec_clock;
+}
 #else
-   data = *cpuusage;
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+   return tg-rt_rq[cpu]-exec_clock - tg-rt_rq[cpu]-prev_exec_clock;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+   tg-rt_rq[cpu]-prev_exec_clock = tg-rt_rq[cpu]-exec_clock;
+}
+#else
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+   return 0;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
 #endif
 
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+   u64 ret = 0;
+
+   lock_rq_dword(cpu);
+   ret = cfs_exec_clock(tg, cpu) + rt_exec_clock(tg, cpu);
+   unlock_rq_dword(cpu);
+
+   return ret;
+}
+
+static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
+{
+   lock_rq_dword(cpu);
+   cfs_exec_clock_reset(tg, cpu);
+   rt_exec_clock_reset(tg, cpu);
+   unlock_rq_dword(cpu);
+}
+#else
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+   u64 *cpuusage = per_cpu_ptr(tg-cpuusage, cpu);
+   u64 data;
+
+   lock_rq_dword(cpu);
+   data = *cpuusage;
+   unlock_rq_dword(cpu);
+
return data;
 }
 
@@ -7222,17 +7294,11 @@ static void task_group_cpuusage_write(struct task_group 
*tg, int cpu, u64 val)
 {
u64 *cpuusage = per_cpu_ptr(tg-cpuusage, cpu);
 
-#ifndef CONFIG_64BIT
-   /*
-* Take rq-lock to make 64-bit write safe on 32-bit platforms.
-*/
-   raw_spin_lock_irq(cpu_rq(cpu)-lock);
+   lock_rq_dword(cpu);
*cpuusage = val;
-   raw_spin_unlock_irq(cpu_rq(cpu)-lock);
-#else
-   *cpuusage = val;
-#endif
+   unlock_rq_dword(cpu);
 }
+#endif
 
 /* return total cpu usage (in nanoseconds) of a group */
 static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b05dd84..0e5e795 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -710,8 +710,6 @@ static inline void set_task_rq(struct task_struct *p, 
unsigned int cpu)
 #endif
 }
 
-extern void task_group_charge(struct task_struct *tsk, u64 cputime);
-
 #else /* CONFIG_CGROUP_SCHED */
 
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -719,10 +717,14 @@ static inline struct task_group *task_group(struct

[Devel] [PATCH v7 02/11] cgroup: implement CFTYPE_NO_PREFIX

2013-05-29 Thread Glauber Costa
From: Tejun Heo t...@kernel.org

When cgroup files are created, cgroup core automatically prepends the
name of the subsystem as prefix.  This patch adds CFTYPE_NO_PREFIX
which disables the automatic prefix.

This will be used to deprecate cpuacct which will make cpu create and
serve the cpuacct files.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Glauber Costa glom...@openvz.org
---
 include/linux/cgroup.h | 1 +
 kernel/cgroup.c| 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5047355..5a8a093 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -397,6 +397,7 @@ struct cgroup_map_cb {
 #define CFTYPE_ONLY_ON_ROOT(1U  0)   /* only create on root cg */
 #define CFTYPE_NOT_ON_ROOT (1U  1)   /* don't create on root cg */
 #define CFTYPE_INSANE  (1U  2)   /* don't create if 
sane_behavior */
+#define CFTYPE_NO_PREFIX   (1U  3)   /* skip subsys prefix */
 
 #define MAX_CFTYPE_NAME64
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2a99262..8bbeb4d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2681,7 +2681,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct 
cgroup_subsys *subsys,
umode_t mode;
char name[MAX_CGROUP_TYPE_NAMELEN + MAX_CFTYPE_NAME + 2] = { 0 };
 
-   if (subsys  !(cgrp-root-flags  CGRP_ROOT_NOPREFIX)) {
+   if (subsys  !(cft-flags  CFTYPE_NO_PREFIX) 
+   !(cgrp-root-flags  CGRP_ROOT_NOPREFIX)) {
strcpy(name, subsys-name);
strcat(name, .);
}
-- 
1.8.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 0/3] Container fixups

2013-05-20 Thread Glauber Costa
Kir,

This is a slightly modified shot at the container fixups.  First patch makes us
more resilient, since by being isolated by the mount namespaces, we no longer
have problems with leaking mounts. I identified one of those problems today,
and patch #1 in this series fixes it.

Patch #2 is the rebase of the fixups scripts patches taking this into account.

With patch #3, we provide a host-side fixup (so no guest scripts) for all
PAM-based distros, overriding PAM loginuid session decisions.  That module is
only used for the audit subsystem, which is not present in containers (and it
is not clear if it will ever be)

With this patches, I can successfully run vzctl enter and ssh into containers
running totally unmodified kernels for: centos, ubuntu and suse.

Please comment

Glauber Costa (3):
  hooks_ct: create devices inside container
  allow for distro-specific fix ups at creation time.
  hooks_ct: trick PAM to not bail out in loginuid failures

 etc/dists/redhat.conf   |  1 +
 etc/dists/scripts/fixups.sh | 43 ++
 include/dist.h  |  2 ++
 include/env.h   |  3 +-
 src/lib/dist.c  | 10 +-
 src/lib/env.c   | 10 +++---
 src/lib/exec.c  |  2 +-
 src/lib/hooks_ct.c  | 87 ++---
 8 files changed, 147 insertions(+), 11 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh

-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/3] allow for distro-specific fix ups at creation time.

2013-05-20 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

We will need that infrastucture when running with Linux upstream, since some
support is very unlikely to ever land in the Kernel.  We need to do things like
account for the fact that udev may kick in and destroy all the setup we have
done for /dev.  Since preemptively preventing those actions to happen is very
hard ( as an example, centos6 init binary will issue mount syscalls itself),
the most robust approach is to insert a script that will be run shortly after
rc.sysinit. Although this can't guarantee that it will run *immediately* after,
it should be early enough to undo every harmful operation done by /sbin/init
and rc.sysinit, therefore allowing operation to continue freely.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 etc/dists/redhat.conf   |  1 +
 etc/dists/scripts/fixups.sh | 43 +++
 include/dist.h  |  2 ++
 include/env.h   |  3 ++-
 src/lib/dist.c  | 10 +-
 src/lib/env.c   | 10 ++
 src/lib/exec.c  |  2 +-
 src/lib/hooks_ct.c  | 35 +++
 8 files changed, 99 insertions(+), 7 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh

diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf
index 727461d..49226c5 100644
--- a/etc/dists/redhat.conf
+++ b/etc/dists/redhat.conf
@@ -25,3 +25,4 @@ SET_DNS=set_dns.sh
 SET_USERPASS=set_userpass.sh
 SET_UGID_QUOTA=set_ugid_quota.sh
 POST_CREATE=postcreate.sh
+CT_FIXUP=fixups.sh
diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh
new file mode 100755
index 000..da3fd24
--- /dev/null
+++ b/etc/dists/scripts/fixups.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+#  Copyright (C) 2000-2009, Parallels, Inc. All rights reserved.
+#
+#  This program is free software; you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation; either version 2 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program; if not, write to the Free Software
+#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+#
+#
+#  Run-time fixup for legacy distributions that will rely on udevd to populate
+#  their /dev. Those distributions will effectively mount an empty tmpfs on top
+#  of /dev, defeating any setup we may have done. This includes RHEL-based
+#  distributions up to RHEL6. We are better off not allowing that to run at all
+
+function fixup_udev()
+{
+   if [ -f /etc/fedora-release ]; then
+   return;
+   fi
+
+   # All the first two may fail if the distribution didn't actually mount
+   # then.
+   umount /dev/pts
+   umount /dev/shm
+   # still may hold references to files open by rc.sysinit. Even if this 
is the
+   # case those will be simple things like /dev/null and should go away 
shortly.
+   # let us not fail because of that.
+   umount /dev -l
+}
+
+fixup_udev
+
+exit 0
+
diff --git a/include/dist.h b/include/dist.h
index 1f6a5a6..4faa452 100644
--- a/include/dist.h
+++ b/include/dist.h
@@ -29,6 +29,7 @@
 #defineSET_USERPASS5
 #defineSET_UGID_QUOTA  6
 #definePOST_CREATE 7
+#defineCT_FIXUP8
 
 typedef struct {
char *def_ostmpl;
@@ -46,6 +47,7 @@ typedef struct dist_actions {
char *set_userpass; /** setup user password. */
char *set_ugid_quota;   /** setup 2level quota. */
char *post_create;  /** sostcreate actions. */
+   char *ct_fixup; /** run-time fixups. */
 } dist_actions;
 
 /* Read distribution specific actions configuration file.
diff --git a/include/env.h b/include/env.h
index 4fef438..b10dfd9 100644
--- a/include/env.h
+++ b/include/env.h
@@ -104,7 +104,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param 
*param, int stop_mode,
 int vz_chroot(const char *root);
 int vz_env_create(vps_handler *h, envid_t veid, vps_res *res,
int wait_p[2], int old_wait_p[2], int err_p[2],
-   env_create_FN fn, void *data);
+   env_create_FN fn, dist_actions *actions, void *data);
 
 struct vps_res;
 struct arg_start {
@@ -116,6 +116,7 @@ struct arg_start {
vps_handler *h;
void *data;
env_create_FN fn;
+   dist_actions *actions;
int userns_p; /* while running in userns, there's extra sync needed */
 };
 
diff --git a/src/lib/dist.c b/src/lib/dist.c
index bde94ab..8047cf7 100644
--- a/src/lib/dist.c
+++ b/src/lib/dist.c
@@ -36,7 +36,8

[Devel] [PATCH 1/3] hooks_ct: create devices inside container

2013-05-20 Thread Glauber Costa
Our devices were being created from the parent of container's init, because we
need to be still outside container context to do it. However, this creates
quite an annoyance, because those bind mounts will show up in the host
/proc/mounts.

Turns out, we don't really need to do it from the root side. We can do it from
the container side provided we do it before we chroot - and then the host side
fs is still visible.

The fact that we join a mount namespace will act to keep those mounts totally
private, and exempt us from cleaning it up.

Signed-off-by: Glauber Costa glom...@openvz.org
---
 src/lib/hooks_ct.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index daa85ed..7bc9814 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -306,6 +306,10 @@ static int _env_create(void *data)
 */
close(arg-userns_p);
 
+   if (arg-h-can_join_userns) {
+   create_devices(arg-h, arg-veid, arg-res-fs.root);
+   }
+
ret = ct_chroot(arg-res-fs.root);
/* Probably means chroot failed */
if (ret)
@@ -438,10 +442,6 @@ static int ct_env_create(struct arg_start *arg)
}
arg-userns_p = userns_p[0];
 
-   if (arg-h-can_join_userns) {
-   create_devices(arg-h, arg-veid, arg-res-fs.root);
-   }
-
ret = clone(_env_create, child_stack, clone_flags, arg);
close(userns_p[0]);
if (ret  0) {
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 3/3] hooks_ct: trick PAM to not bail out in loginuid failures

2013-05-20 Thread Glauber Costa
From pam_loginuid manpage: The pam_loginuid module sets the loginuid process
attribute for the process that was authenticated.  This is necessary for
applications to be correctly audited

The audit subsystem is not namespace aware, and it is likely to be a long time
before it becomes - if ever. There isn't still consensus in the comunity about
whether or not it makes sense, and if it does, how to implement it.

When authenticating, however, writes to the loginuid will fail under user
namespaces, and login programs like ssh will fail to work. One solution that
does not involve recompiling the kernel with audit disabled, or rebooting into
a command line disabled state, is to bind mount the permit module - that
basically permits anything - into the loginuid module.

This is not distribution specific - aside from the right path - so we apply for
all distros, before we go do the fixup scripts.

Signed-off-by: Glauber Costa glom...@openvz.org
---
 src/lib/hooks_ct.c | 44 
 1 file changed, 44 insertions(+)

diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index f769865..a4f9425 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -288,6 +288,50 @@ static int apply_fixups(vps_handler *h, envid_t veid, 
const char *root, dist_act
char buf[STR_SIZE];
int runlevels[] = { 3, 5 }; /* where to apply the fixups */
unsigned int i;
+   char *pampaths[] = { /lib/security/, /lib64/security,  /* centos, 
suse, debian */
+/lib/x86_64-linux-gnu/security/, /* ubuntu */
+/lib/i386-linux-gnu/security/ /* ubuntu */
+  };
+
+   /*
+* From pam_loginuid manpage: The pam_loginuid module sets the
+* loginuid process attribute for the process that was authenticated.
+* This is necessary for applications to be correctly audited
+*
+* The audit subsystem is not namespace aware, and it is likely to be a
+* long time before it becomes - if ever. There isn't still consensus
+* in the comunity about whether or not it makes sense, and if it does,
+* how to implement it.
+*
+* When authenticating, however, writes to the loginuid will fail under
+* user namespaces, and login programs like ssh will fail to work. One
+* solution that does not involve recompiling the kernel with audit
+* disabled, or rebooting into a command line disabled state, is to
+* bind mount the permit module - that basically permits anything -
+* into the loginuid module.
+*
+* This is not distribution specific - aside from the right path - so
+* we apply for all distros, before we go do the fixup scripts.
+*/
+   for (i = 0; i  ARRAY_SIZE(pampaths); i++) {
+   struct stat st;
+   char buf_orig[STR_SIZE];
+
+   snprintf(buf, sizeof(buf), %s%s/pam_loginuid.so,
+root, pampaths[i]);
+
+   if (stat(buf, st))
+   continue;
+
+   snprintf(buf_orig, sizeof(buf_orig), %s%s/pam_permit.so,
+root, pampaths[i]);
+
+   if ((stat(buf_orig, st)) || (mount(buf_orig, buf, , MS_BIND, 
0)  0))
+   logger(-1, errno,
+   Cannot overwrite pam_loginuid.so. Container will 
start, but
+   login-based programs like ssh may not work);
+   break;
+   }
 
/* Distributions that don't need the fixup will can stop right here */
if (!actions || !actions-ct_fixup)
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 2/3] allow for distro-specific fix ups at creation time.

2013-05-20 Thread Glauber Costa
On 05/21/2013 12:32 AM, Kir Kolyshkin wrote:
 On 05/20/2013 05:49 AM, Glauber Costa wrote:
 From: Glauber Costa glom...@parallels.com

 We will need that infrastucture when running with Linux upstream,
 since some
 support is very unlikely to ever land in the Kernel.  We need to do
 things like
 account for the fact that udev may kick in and destroy all the setup
 we have
 done for /dev.  Since preemptively preventing those actions to happen
 is very
 hard ( as an example, centos6 init binary will issue mount syscalls
 itself),
 the most robust approach is to insert a script that will be run
 shortly after
 rc.sysinit. Although this can't guarantee that it will run
 *immediately* after,
 it should be early enough to undo every harmful operation done by
 /sbin/init
 and rc.sysinit, therefore allowing operation to continue freely.
 
 I understand that
 
 1 this is something redhat/centos-specific?
 2 there is no need to do that for debian/ubuntu etc?
 
No, thi is old-udev specific.
In more up to date distributions, udev is already aware that it might be
being containerized and won't do all those initialization things.

I haven't tested *all* distros we support, but among the most up to date
in each variant, only centos6 needs it (ubuntu 12 runs fine, suse 12
runs fine)

 Am I right?
Yes and no, see above.

 

 Signed-off-by: Glauber Costa glom...@parallels.com
 ---
   etc/dists/redhat.conf   |  1 +
   etc/dists/scripts/fixups.sh | 43
 +++
   include/dist.h  |  2 ++
   include/env.h   |  3 ++-
   src/lib/dist.c  | 10 +-
   src/lib/env.c   | 10 ++
   src/lib/exec.c  |  2 +-
   src/lib/hooks_ct.c  | 35 +++
   8 files changed, 99 insertions(+), 7 deletions(-)
   create mode 100755 etc/dists/scripts/fixups.sh

 diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf
 index 727461d..49226c5 100644
 --- a/etc/dists/redhat.conf
 +++ b/etc/dists/redhat.conf
 @@ -25,3 +25,4 @@ SET_DNS=set_dns.sh
   SET_USERPASS=set_userpass.sh
   SET_UGID_QUOTA=set_ugid_quota.sh
   POST_CREATE=postcreate.sh
 +CT_FIXUP=fixups.sh
 diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh
 new file mode 100755
 index 000..da3fd24
 --- /dev/null
 +++ b/etc/dists/scripts/fixups.sh
 @@ -0,0 +1,43 @@
 +#!/bin/bash
 
 Why bash? It is not available in some distros by default (notably Debian).
 So, unless we do have a reason to use bash-specific features we'd rather
 not to.
 
Will change to /bin/sh

 +#  Copyright (C) 2000-2009, Parallels, Inc. All rights reserved.
 
 :))

That was written by a past me.

 
 +#
 +#  This program is free software; you can redistribute it and/or modify
 +#  it under the terms of the GNU General Public License as published by
 +#  the Free Software Foundation; either version 2 of the License, or
 +#  (at your option) any later version.
 +#
 +#  This program is distributed in the hope that it will be useful,
 +#  but WITHOUT ANY WARRANTY; without even the implied warranty of
 +#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 +#  GNU General Public License for more details.
 +#
 +#  You should have received a copy of the GNU General Public License
 +#  along with this program; if not, write to the Free Software
 +#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 
 02111-1307  USA
 +#
 +#
 +#  Run-time fixup for legacy distributions that will rely on udevd to
 populate
 +#  their /dev. Those distributions will effectively mount an empty
 tmpfs on top
 +#  of /dev, defeating any setup we may have done. This includes
 RHEL-based
 +#  distributions up to RHEL6. We are better off not allowing that to
 run at all
 +
 +function fixup_udev()
 
 s/function //
 
 as this is bashism
 
ok

 index bde94ab..8047cf7 100644
 --- a/src/lib/dist.c
 +++ b/src/lib/dist.c
 @@ -36,7 +36,8 @@ static struct distr_conf {
   {SET_DNS, SET_DNS},
   {SET_USERPASS, SET_USERPASS},
   {SET_UGID_QUOTA, SET_UGID_QUOTA},
 -{POST_CREATE, POST_CREATE}
 +{POST_CREATE, POST_CREATE},
 +{CT_FIXUP, CT_FIXUP }
 
 1. Please add comma after } so the next author won't have to patch your
 line.
 I.e.
 
 +{CT_FIXUP, CT_FIXUP },
 
 
 2. Maybe we can give it more concrete name? FIXUP is, well, anything.
 
 This is very unfortunate. All the other distro scripts we maintain have
 very
 well defined names, associated actions and ways to run. They are described
 in details in etc/dists/scripts/distribution.conf-template file.
 
 In this case we have a script with a name that doesn't tell us anything,
 is only run for redhat* distros, only when user ns is available/enabled,
 and in an extremely strange way -- not by executing it in a context of CT
 like all other scripts do, but by putting it to /etc/rc{3,5}d/S00fixups.sh
 before CT start. More to say, the logic of putting it this way is built
 into
 vzctl C code.
 
 Distro scripts meant

[Devel] [PATCH v2 0/2] distro fixups

2013-05-20 Thread Glauber Costa
Kir,

Here is a new attempt at implementing fixups scripts.
They look nicer than the last version, and rely on a more
generic and configurable script instead, that should make
our lives a lot easier in the future.

Please let me know what you think

Glauber Costa (2):
  allow for distro-specific fix ups at creation time.
  prestart: fixup legacy udev effects

 etc/dists/debian.conf |  1 +
 etc/dists/redhat.conf |  1 +
 etc/dists/scripts/prestart.sh | 74 +++
 etc/dists/suse.conf   |  1 +
 include/dist.h|  2 ++
 src/lib/dist.c| 10 +-
 src/lib/env.c | 15 +
 7 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100755 etc/dists/scripts/prestart.sh

-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 2/2] prestart: fixup legacy udev effects

2013-05-20 Thread Glauber Costa
Legacy udev will do a couple of operations that will destroy all the setup we
have done for /dev. This is because, unaware that it is that it is being
containerized, it will mount a tmpfs on /dev, and then setup it all again.
Since preemptively preventing those actions to happen is very hard ( as an
example, centos6 init binary will issue mount syscalls itself), the most robust
approach is to insert a script that will be run shortly after rc.sysinit.

It is hard to detect if rc.sysinit was already ran, but we notice that at the
end of each run (for both versions 4, 5 and 6), it will touch /.autofsck. We
can check if the file was modified (non-existent - existent, or different
modification time) and run our fixups after this.

Signed-off-by: Glauber Costa glom...@openvz.org
---
 etc/dists/scripts/prestart.sh | 36 
 1 file changed, 36 insertions(+)

diff --git a/etc/dists/scripts/prestart.sh b/etc/dists/scripts/prestart.sh
index 1f28b13..52047a0 100755
--- a/etc/dists/scripts/prestart.sh
+++ b/etc/dists/scripts/prestart.sh
@@ -21,6 +21,41 @@
 # is, so far, meaningless inside a container. This script will apply various
 # fixups if needed.
 
+# Legacy udev will try to mount its own /dev in tmpfs, which will in turn
+# destroy all our hand crafted setup. We need to undo it here.
+fixup_udev()
+{
+   if [ -f /etc/fedora-release ]; then
+   return
+   fi
+
+   if [ ! -f /etc/centos-release -a ! -f /etc/redhat-release ]; then
+   return
+   fi
+
+   # rc.sysinit will touch this file after it finishes.
+   timestamp=$(stat -c %x /.autofsck 2/dev/null)
+   i=0
+   while true; do
+   newstamp=$(stat -c %x /.autofsck 2/dev/null)
+   if [ x$newstamp = x$timestamp ]; then
+   sleep 0.5
+   i=$((i+1))
+   [ $i -gt 10 ]  return
+   continue
+   fi
+   break
+   done
+   # All the first two may fail if the distribution didn't actually mount
+   # then.
+   umount /dev/pts
+   umount /dev/shm
+   # still may hold references to files open by rc.sysinit. Even if this 
is the
+   # case those will be simple things like /dev/null and should go away 
shortly.
+   # let us not fail because of that.
+   umount /dev -l
+}
+
 fixup_loginuid()
 {
local pam_permit=security/pam_permit.so
@@ -36,6 +71,7 @@ fixup_loginuid()
 [ x$VZ_KERNEL = xyes ]  exit 0
 [ x$USERNS = xno ]  exit 0
 
+fixup_udev 
 fixup_loginuid
 
 exit 0
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v5 4/6] modify tar extraction to account for user namespace

2013-05-19 Thread Glauber Costa
On 05/19/2013 09:41 PM, Kir Kolyshkin wrote:
 + */
 +#define VZ_DEFAULT_UID10
 +#define VZ_DEFAULT_GID10
 
 I assume these are no longer used, right?
right

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH] allow for distro-specific fix ups at creation time.

2013-05-19 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

We will need that infrastucture when running with Linux upstream, since some
support is very unlikely to ever land in the Kernel.  We need to do things like
account for the fact that udev may kick in and destroy all the setup we have
done for /dev.  Since preemptively preventing those actions to happen is very
hard ( as an example, centos6 init binary will issue mount syscalls itself),
the most robust approach is to insert a script that will be run shortly after
rc.sysinit. Although this can't guarantee that it will run *immediately* after,
it should be early enough to undo every harmful operation done by /sbin/init
and rc.sysinit, therefore allowing operation to continue freely.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 etc/dists/redhat.conf   |  1 +
 etc/dists/scripts/fixups.sh | 43 +++
 include/dist.h  |  2 ++
 include/env.h   |  3 ++-
 src/lib/dist.c  | 10 +-
 src/lib/env.c   | 10 ++
 src/lib/exec.c  |  2 +-
 src/lib/hooks_ct.c  | 35 +++
 8 files changed, 99 insertions(+), 7 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh

diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf
index 727461d..49226c5 100644
--- a/etc/dists/redhat.conf
+++ b/etc/dists/redhat.conf
@@ -25,3 +25,4 @@ SET_DNS=set_dns.sh
 SET_USERPASS=set_userpass.sh
 SET_UGID_QUOTA=set_ugid_quota.sh
 POST_CREATE=postcreate.sh
+CT_FIXUP=fixups.sh
diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh
new file mode 100755
index 000..da3fd24
--- /dev/null
+++ b/etc/dists/scripts/fixups.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+#  Copyright (C) 2000-2009, Parallels, Inc. All rights reserved.
+#
+#  This program is free software; you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation; either version 2 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program; if not, write to the Free Software
+#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+#
+#
+#  Run-time fixup for legacy distributions that will rely on udevd to populate
+#  their /dev. Those distributions will effectively mount an empty tmpfs on top
+#  of /dev, defeating any setup we may have done. This includes RHEL-based
+#  distributions up to RHEL6. We are better off not allowing that to run at all
+
+function fixup_udev()
+{
+   if [ -f /etc/fedora-release ]; then
+   return;
+   fi
+
+   # All the first two may fail if the distribution didn't actually mount
+   # then.
+   umount /dev/pts
+   umount /dev/shm
+   # still may hold references to files open by rc.sysinit. Even if this 
is the
+   # case those will be simple things like /dev/null and should go away 
shortly.
+   # let us not fail because of that.
+   umount /dev -l
+}
+
+fixup_udev
+
+exit 0
+
diff --git a/include/dist.h b/include/dist.h
index 1f6a5a6..4faa452 100644
--- a/include/dist.h
+++ b/include/dist.h
@@ -29,6 +29,7 @@
 #defineSET_USERPASS5
 #defineSET_UGID_QUOTA  6
 #definePOST_CREATE 7
+#defineCT_FIXUP8
 
 typedef struct {
char *def_ostmpl;
@@ -46,6 +47,7 @@ typedef struct dist_actions {
char *set_userpass; /** setup user password. */
char *set_ugid_quota;   /** setup 2level quota. */
char *post_create;  /** sostcreate actions. */
+   char *ct_fixup; /** run-time fixups. */
 } dist_actions;
 
 /* Read distribution specific actions configuration file.
diff --git a/include/env.h b/include/env.h
index 4fef438..b10dfd9 100644
--- a/include/env.h
+++ b/include/env.h
@@ -104,7 +104,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param 
*param, int stop_mode,
 int vz_chroot(const char *root);
 int vz_env_create(vps_handler *h, envid_t veid, vps_res *res,
int wait_p[2], int old_wait_p[2], int err_p[2],
-   env_create_FN fn, void *data);
+   env_create_FN fn, dist_actions *actions, void *data);
 
 struct vps_res;
 struct arg_start {
@@ -116,6 +116,7 @@ struct arg_start {
vps_handler *h;
void *data;
env_create_FN fn;
+   dist_actions *actions;
int userns_p; /* while running in userns, there's extra sync needed */
 };
 
diff --git a/src/lib/dist.c b/src/lib/dist.c
index bde94ab..8047cf7 100644
--- a/src/lib/dist.c
+++ b/src/lib/dist.c
@@ -36,7 +36,8

Re: [Devel] [PATCH v5 6/6] allow for distro-specific fix ups at creation time.

2013-05-18 Thread Glauber Costa
 +{
 +char buf[STR_SIZE];
 +
 +/* Distributions that don't need the fixup will can stop right
 here */
 +if (!actions || !actions-ct_fixup)
 +return 0;
 +
 +if (snprintf(buf, sizeof(buf), %s/%s, root,
 /etc/rc3.d/S00vz-fixups.sh)  0)
 
 Again and again :(
 How this snprintf can return negative value?
 

Will change.

 +goto out;
 +
 +if (open(buf, O_RDWR|O_CREAT, 0)  0)
 +goto out;
 +
 +if (mount(actions-ct_fixup, buf, , MS_BIND, 0)  0)
 +goto out;
 
 Please remind me why are you bind-mounting rather than just copying the
 file?

Because I believe it to be more robust this way. The user won't be able
to accidentally delete it, mess with permissions, etc. IOW, this is to
to make it all temporary and avoid changing the container permanently.

 +
 +/*
 + * Being safe never killed anyone...
 + * We bind mount instead of symlinking the original rc3 so it
 won't appear in the original
 + * private mount.
 + */
 +if (snprintf(buf, sizeof(buf), %s/%s, root,
 /etc/rc5.d/S00vz-fixups.sh)  0)
 +goto out;
 +
 +if (open(buf, O_RDWR|O_CREAT, 0)  0)
 +goto out;
 +
 +if (mount(actions-ct_fixup, buf, , MS_BIND, 0)  0)
 +goto out;
 
 This is copy-pasting. Can you at least redo this with something like
 
 char *dirs[] = {/etc/rc3.d, /etc/rc5.d};
 
 for(i = 0; i  ARRAY_SIZE(dirs); i++) {
 bla bla bla
 }
 
Yes.

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v4 2/7] add user mismatch test

2013-05-17 Thread Glauber Costa
On 05/17/2013 04:18 AM, Kir Kolyshkin wrote:
 +stat(res-fs.private, private_stat);
 +if ((local_uid  (private_stat.st_uid != *local_uid)) ||
 +(local_gid  (private_stat.st_gid != *local_gid))) {
 
 As I just commented at the very end of a previous patch,
 indeed it does make sense to check for local_uid and local_gid being
 not-NULL,
 since here you dereference both without checking (relying on
 can_join_userns
 which only checks local_uid but not local_gid).
 
 So that comment was right (and taking it into account this code is
 correct).
Look again, we check both variables here.

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v4 1/7] user namespace support for upstream containers

2013-05-17 Thread Glauber Costa
On 05/17/2013 04:11 AM, Kir Kolyshkin wrote:
 +if ((arg-userns_p != -1)  (read(arg-userns_p, ret,
 sizeof(ret)) != sizeof(ret))) {
 +logger(-1, errno, Cannot read from user namespace pipe);
 
 We don't close arg-userns_p in case of error here.
 And it seems we do not close the other end of it here.
 I had that concern before, you said we're closing all fds somewhere,
 I'd rather close it explicitly here to be on the safe side
 (and pass both fds to the child to be able to do so).
 

No need. We also don't pass the other side of the pipe to the child for
the other 2 pipes that we use. We don't close all fds somewhere.

We close all fds in close_fds right before we call exec_container_init.

As for the error here, I changed it, even though it is not strictly
needed, since we always exit on errors here.

 +}
   }
 +close(userns_p[0]);
 +close(userns_p[1]);
 snprintf(procpath, STR_SIZE, /proc/%d/ns/net, ret);
   ret = symlink(procpath, ctpath);
   if (ret) {
   logger(-1, errno, Can't symlink into netns file %s, ctpath);
 -destroy_container(arg-veid);
 -return VZ_RESOURCE_ERROR;
 +goto err_noclose;
   }
 return 0;
 +err:
 +close(userns_p[0]);
 +close(userns_p[1]);
 +/* FIXME: remove ourselves from container first */
 +err_noclose:
 +destroy_container(arg-veid);
 +return VZ_RESOURCE_ERROR;
   }
 
 I don't really enjoy this mess of close()'s, goto's and returns,
 it can be done a bit simple IMHO.
 
 But looks like there are no logic errors in here, so OK.

Well, I changed this to avoid the goto's. We have more duplicate code,
but less confusion.


   char path[STR_SIZE];
 +char upath[STR_SIZE];
   struct stat st;
 +unsigned long *local_uid = param-res.misc.local_uid;
 ret = container_init();
   if (ret) {
 @@ -592,6 +813,9 @@ int ct_do_open(vps_handler *h, vps_param *param)
   if (snprintf(path, sizeof(path), /proc/%d/ns/pid, getpid())  0)
   return VZ_RESOURCE_ERROR;
   +if (snprintf(upath, sizeof(upath), /proc/%d/ns/user,
 getpid())  0)
 +return VZ_RESOURCE_ERROR;
 +
 
 Same, I fail to see how snprintf() can ever return negative value here.
 
I fail as well, but in case it does, we will return an error. I don't
expect this branch to be ever taken, but it is there as a safeguard. I
will remove it if you want me to. I removed the other.

   ret = mkdir(NETNS_RUN_DIR,
 S_IRWXU|S_IRGRP|S_IXGRP|S_IROTH|S_IXOTH);
 if (ret  (errno != EEXIST)) {
 @@ -601,6 +825,20 @@ int ct_do_open(vps_handler *h, vps_param *param)
   }
 h-can_join_pidns = !stat(path, st);
 +/*
 + * Being able to join the user namespace is a good indication
 that the
 + * user namespace is complete. For a long time, the user namespace
 + * existed, but were far away from being feature complete.  When
 + * running in such a kernel, joining the user namespace will just
 + * cripple our container, since we won't be able to do anything.
 It is
 + * only good for people who are okay running containers as root.
 + *
 + * It is not enough, however, for user namespaces to be present
 in the
 + * kernel. The container must have been setup to use it (we need the
 + * mapped user to own the files, etc. So we also need to find
 suitable
 + * configuration in the config files.
 + */
 +h-can_join_userns = !stat(upath, st)  local_uid;
 
 Maybe  local_gid as well?
 
Here we are just testing if it is enable or disabled. It will be clear
later on when we document it, but local_uid is the only thing we need to
test. If it is disabled, we should not even look at local_gid, since we
are, already, essentially root.

local_uid enabled and local_gid disabled is a valid, though weird,
configuration.

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v4 3/7] Also pass cmd_p pointer to container open

2013-05-17 Thread Glauber Costa
On 05/17/2013 05:35 AM, Kir Kolyshkin wrote:
 
 This is kinda becoming over-engineered (and now I realized I should have
 said it earlier,
 when reviewing the patch that added the first param).
 
 I understand well why you need local_uid and local_gid from config and
 cmdline
 in ct_open(), but maybe this non-zero checks can be done later in the
 code, so
 instead of using just h-can_join_userns you can call a function,
 something like

I only need cmd_p at one moment: When we create the container. I think
it is better than to leave can_join_userns with the vps_p test only,
then. We can test it manually during fs_create, because there is the
place in which we really need it.

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v5 0/6] User namespace support for upstream containers

2013-05-17 Thread Glauber Costa
Kir,

In this patchset, I hope to be addressing all your concerns.  I don't use cmd_p
any longer, as you requested. I have also tried to merge most of your comments
at the main userns patch.  It is a bit massive code, so if is there still
anything that you would me to change, don't shy away. I will go back to it
ASAP.


Glauber Costa (6):
  user namespace support for upstream containers
  add user mismatch test
  allow local uid and gid to be specified at container creation
  modify tar extraction to account for user namespace
  automatically add bridge venet0 when needed
  allow for distro-specific fix ups at creation time.

 etc/dists/redhat.conf   |   1 +
 etc/dists/scripts/fixups.sh |  43 ++
 include/dist.h  |   2 +
 include/env.h   |   4 +-
 include/res.h   |   6 +
 include/types.h |   1 +
 man/vzctl.8.in  |  17 +++
 scripts/vps-create.in   |  14 ++
 scripts/vps-functions.in|   7 +
 src/lib/Makefile.am |   3 +
 src/lib/chown_preload.c |  93 +
 src/lib/create.c|  38 -
 src/lib/dist.c  |  10 +-
 src/lib/env.c   |  26 +++-
 src/lib/exec.c  |   2 +-
 src/lib/hooks_ct.c  | 329 +++-
 src/vzctl-actions.c |   2 +
 src/vzctl.c |   1 +
 vzctl.spec  |   2 +-
 19 files changed, 581 insertions(+), 20 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh
 create mode 100644 src/lib/chown_preload.c

-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v5 1/6] user namespace support for upstream containers

2013-05-17 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

This patch allows the execution of unprivileged containers running ontop
of an upstream Linux Kernel. We will run at whatever UID is found in the
configuration file (so far empty, thus disabled).

Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/env.h  |   1 +
 include/types.h|   1 +
 src/lib/hooks_ct.c | 254 +++--
 3 files changed, 251 insertions(+), 5 deletions(-)

diff --git a/include/env.h b/include/env.h
index 06729f6..4fef438 100644
--- a/include/env.h
+++ b/include/env.h
@@ -116,6 +116,7 @@ struct arg_start {
vps_handler *h;
void *data;
env_create_FN fn;
+   int userns_p; /* while running in userns, there's extra sync needed */
 };
 
 struct env_create_param3;
diff --git a/include/types.h b/include/types.h
index a4bce73..fa511aa 100644
--- a/include/types.h
+++ b/include/types.h
@@ -95,6 +95,7 @@ typedef struct vps_handler {
int vzfd;   /** /dev/vzctl file descriptor. */
int stdfd;
int can_join_pidns; /* can't enter otherwise */
+   int can_join_userns; /* can't run non privileged otherwise */
int (*is_run)(struct vps_handler *h, envid_t veid);
int (*enter)(struct vps_handler *h, envid_t veid, const char *root, int 
flags);
int (*destroy)(struct vps_handler *h, envid_t veid);
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index e46c1df..e71f116 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -66,6 +66,8 @@ static int sys_setns(int fd, int nstype)
 # define MS_PRIVATE (1  18)
 #endif
 
+#define UID_GID_RANGE 10 /* how many users per container */
+
 /* This function is there in GLIBC, but not in headers */
 extern int pivot_root(const char * new_root, const char * put_old);
 
@@ -203,16 +205,159 @@ static int ct_setlimits(vps_handler *h, envid_t veid, 
struct ub_struct *ub)
 }
 #undef add_value
 
+static int write_uid_gid_mapping(vps_handler *h, unsigned long uid, unsigned 
long gid, pid_t pid)
+{
+   char buf[STR_SIZE];
+   char map[STR_SIZE];
+   int fd;
+   int len;
+   int ret = VZ_RESOURCE_ERROR;
+
+   len = snprintf(map, sizeof(map), 0 %ld %d, uid, UID_GID_RANGE);
+   snprintf(buf, sizeof(buf), /proc/%d/uid_map, pid);
+   if ((fd = open(buf, O_WRONLY))  0)
+   goto out;
+
+   if ((write(fd, map, len) != len))
+   goto out;
+
+   close(fd);
+
+   len = snprintf(map, sizeof(map), 0 %ld %d, gid, UID_GID_RANGE);
+   snprintf(buf, sizeof(map), /proc/%d/gid_map, pid);
+   if ((fd = open(buf, O_WRONLY))  0)
+   goto out;
+
+   if ((write(fd, map, len) != len))
+   goto out;
+   ret = 0;
+out:
+   if (fd = 0)
+   close(fd);
+   return ret;
+}
+
+/*
+ * Those devices should exist in the container, and be valid device nodes with
+ * user access permission. But we need to be absolutely sure this is the case,
+ * so we will provide our own versions. That could actually happen since some
+ * distributions may come with emptied /dev's, waiting for udev to populate 
them.
+ * That won't happen, we do it ourselves.
+ */
+static void create_devices(vps_handler *h, envid_t veid, const char *root)
+{
+   unsigned int i;
+   char *devices[] = {
+   /dev/null,
+   /dev/zero,
+   /dev/random,
+   /dev/urandom,
+   };
+
+   /*
+* We will tolerate errors, and keep the container running, because it 
is
+* likely we will be able to boot it to a barely functional state. But
+* be vocal about it
+*/
+   for (i = 0; i  ARRAY_SIZE(devices); i++) {
+   char ct_devname[STR_SIZE];
+   int ret;
+
+   snprintf(ct_devname, sizeof(ct_devname), %s%s, root, 
devices[i]);
+
+   /*
+* No need to be crazy about file flags. When we bind mount, the
+* source permissions will be inherited.
+*/
+   ret = open(ct_devname, O_RDWR|O_CREAT, 0);
+   if (ret  0) {
+   logger(-1, errno, Could not touch device %s, 
devices[i]);
+   continue;
+   }
+   close(ret);
+
+   ret = mount(devices[i], ct_devname, , MS_BIND, 0);
+   if (ret  0)
+   logger(-1, errno, Could not bind mount device %s, 
devices[i]);
+   }
+}
+
 static int _env_create(void *data)
 {
struct arg_start *arg = data;
struct env_create_param3 create_param;
int ret;
 
-   if ((ret = ct_chroot(arg-res-fs.root)))
+   if ((arg-userns_p != -1)  (read(arg-userns_p, ret, sizeof(ret)) != 
sizeof(ret))) {
+   logger(-1, errno, Cannot read from user namespace pipe);
+   close(arg-userns_p);
+   return VZ_RESOURCE_ERROR;
+   }
+
+   ret = ct_chroot(arg

[Devel] [PATCH v5 2/6] add user mismatch test

2013-05-17 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

In theory, we won't be able to run if our private area is not owned by
ourselves.  We could, if it have very wide open security permissions, but we
should never set up a container like that.

Aside from a basic sanity check, this is intended to catch problems for the few
people who may have already created containers that will be owned by root:root,
and will now try to run it unprivileged.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 src/lib/env.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/lib/env.c b/src/lib/env.c
index 898bfc8..96400c0 100644
--- a/src/lib/env.c
+++ b/src/lib/env.c
@@ -30,6 +30,7 @@
 #include linux/reboot.h
 #include sys/mount.h
 #include sys/utsname.h
+#include sys/stat.h
 
 #include vzerror.h
 #include res.h
@@ -554,6 +555,21 @@ int vps_start_custom(vps_handler *h, envid_t veid, 
vps_param *param,
logger(-1, 0, Container is already running);
return VZ_VE_RUNNING;
}
+   if (!is_vz_kernel(h)  h-can_join_userns) {
+   struct stat private_stat;
+   unsigned long *local_uid = res-misc.local_uid;
+   unsigned long *local_gid = res-misc.local_gid;
+
+   stat(res-fs.private, private_stat);
+   if ((local_uid  (private_stat.st_uid != *local_uid)) ||
+   (local_gid  (private_stat.st_gid != *local_gid))) {
+   logger(-1, 0, Container private area is owned by %d:%d
+   , but configuration file says we should run with 
%lu:%lu.\n
+   Refusing to run., private_stat.st_uid, 
private_stat.st_gid,
+   *res-misc.local_uid, *res-misc.local_gid);
+   return VZ_FS_BAD_TMPL;
+   }
+   }
if ((ret = check_ub(h, res-ub)))
return ret;
 
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v5 3/6] allow local uid and gid to be specified at container creation

2013-05-17 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

It is a valid use case to run a container with host uid and gid different than
the default. In particular, already deployed versions of vzctl are expected to
have this value unset, effectively meaning they are not expecting user
namespaces to be present. We also deem as a valid use case to run a fully
privileged container, in which case we will explicitly disable user namespaces.

This patch provides and documents a way to do so.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 man/vzctl.8.in  | 17 +
 src/lib/hooks_ct.c  |  4 ++--
 src/vzctl-actions.c |  2 ++
 src/vzctl.c |  1 +
 4 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/man/vzctl.8.in b/man/vzctl.8.in
index 20a1856..e233696 100644
--- a/man/vzctl.8.in
+++ b/man/vzctl.8.in
@@ -871,6 +871,8 @@ List of available fields can be obtained using \fB-L\fR 
option.
 .OP --ipadd addr
 .OP --hostname name
 .OP --name name
+.OP --local_uid uid
+.OP --local_gid gid
 .YS
 .IP  4
 Creates a new container area. This operation should be done once, before
@@ -922,6 +924,21 @@ a container. Note that this option can be used multiple 
times.
 
 You can use \fB--hostname\fR \fIname\fR option to set a host name for
 a container.
+
+When running with an upstream Linux Kernel that supports user namespaces (=
+3.8), the parameters \fB--local_uid\fR and \fB--local_gid\fR can be used to
+select which \fIuid\fR and \fIgid\fR respectively will be used as a base user
+in the host system. Note that user namespaces provide a 1:1 mapping between
+container users and host users. If these options are not specified, the values
+\fBLOCAL_UID\fR and \fBLOCAL_GID\fR from global configuration file
+\fBvz.conf\fR(5) are used. An explicit \fB--local_uid\fR value of 0 will
+disable user namespace support, and run the container as a privileged user. In
+this case, \fB--local_gid\fR is ignored.
+
+\fBWarning:\fR use \fB--local_uid\fR and \fB--local_gid\fR with care, specially
+when migrating containers. In all situations, the container's files in the
+filesystem needs to be correctly owned by the host-side users.
+
 .IP \fBdestroy\fR | \fBdelete\fR \fICTID\fR 4
 Removes a container private area by deleting all files, directories and
 the configuration file of this container.
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index e71f116..299d43d 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -423,7 +423,7 @@ static int ct_env_create(struct arg_start *arg)
clone_flags |= CLONE_NEWNET|CLONE_NEWNS;
 
if (!arg-h-can_join_userns) {
-   logger(-1, 0, WARNING: Running container unprivileged. USER_NS 
not supported);
+   logger(-1, 0, WARNING: Running container unprivileged. USER_NS 
not supported, or runtime disabled);
 
userns_p[0] = userns_p[1] = -1;
} else {
@@ -844,7 +844,7 @@ int ct_do_open(vps_handler *h, vps_param *param)
 * mapped user to own the files, etc. So we also need to find suitable
 * configuration in the config files.
 */
-   h-can_join_userns = !stat(upath, st)  local_uid;
+   h-can_join_userns = !stat(upath, st)  local_uid  (*local_uid != 
0);
h-is_run = ct_is_run;
h-enter = ct_enter;
h-destroy = ct_destroy;
diff --git a/src/vzctl-actions.c b/src/vzctl-actions.c
index 0dd2ae7..1ef61d1 100644
--- a/src/vzctl-actions.c
+++ b/src/vzctl-actions.c
@@ -391,6 +391,8 @@ static int parse_create_opt(envid_t veid, int argc, char 
**argv,
{ve_layout,   required_argument, NULL, PARAM_VE_LAYOUT},
{velayout,required_argument, NULL, PARAM_VE_LAYOUT},
{diskspace,   required_argument, NULL, PARAM_DISKSPACE},
+   {local_uid,   required_argument, NULL, PARAM_LOCAL_UID},
+   {local_gid,   required_argument, NULL, PARAM_LOCAL_GID},
{ NULL, 0, NULL, 0 }
 };
 
diff --git a/src/vzctl.c b/src/vzctl.c
index 359bcde..54d66d1 100644
--- a/src/vzctl.c
+++ b/src/vzctl.c
@@ -65,6 +65,7 @@ static void usage(int rc)
 vzctl create ctid [--ostemplate name] [--config name]\n
[--layout ploop|simfs] [--hostname name] [--name name] [--ipadd 
addr]\n
[--diskspace kbytes] [--private path] [--root path]\n
+   [--local_uid UID] [--local_gid GID]\n
 vzctl start ctid [--force] [--wait]\n
 vzctl destroy | mount | umount | stop | restart | status ctid\n
 #ifdef HAVE_PLOOP
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v5 4/6] modify tar extraction to account for user namespace

2013-05-17 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

If we are running upstream with user namespaces, we need to create the
container filesystem not with the ownership preserved, but reflecting the
mapping we need to apply. Note that according to our documentation, we should
ignore this if the user explicitly requested an uid mapping of 0 (gid is
ignored in this case).

Our tooling doesn't allow any easy way to unpack a whol distribution with
offsets mechanically applied like this. We could do the whole unpacking in a
user namespace itself, but that does not come without problems on its own (for
instance, we won't be able to create any device files, we have to carefully
adjust permissions in the root directory, etc)

To work around that, we can employ a trick to allow container creation right
now, as well as to avoid compatibility problems: we will resort to LD_PRELOAD
to load a schim that captures calls to the chown family of system calls and
applies the offset manually.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/res.h   |  6 
 scripts/vps-create.in   | 14 
 src/lib/Makefile.am |  3 ++
 src/lib/chown_preload.c | 93 +
 src/lib/create.c| 38 +---
 vzctl.spec  |  2 +-
 6 files changed, 150 insertions(+), 6 deletions(-)
 create mode 100644 src/lib/chown_preload.c

diff --git a/include/res.h b/include/res.h
index 0dfacf7..047593a 100644
--- a/include/res.h
+++ b/include/res.h
@@ -47,6 +47,12 @@ struct env_param {
 };
 typedef struct env_param env_param_t;
 
+/*
+ * When running upstream kernels, we will need host-side UID and GID. Those
+ * are configurable, but if none is specified, those are used.
+ */
+#define VZ_DEFAULT_UID 10
+#define VZ_DEFAULT_GID 10
 typedef struct {
list_head_t userpw;
list_head_t nameserver;
diff --git a/scripts/vps-create.in b/scripts/vps-create.in
index 126f048..a1b3382 100755
--- a/scripts/vps-create.in
+++ b/scripts/vps-create.in
@@ -22,11 +22,22 @@
 # Required parameters:
 #   VE_PRVT- path to root of CT private areas
 #   PRIVATE_TEMPLATE   - path to private template used as a source for copying
+#
+# Optional parameters:
+#   UID_OFFSET - offset to be added to all tar UIDs
+#   GID_OFFSET - offset to be added to all tar GIDs
 
 . @SCRIPTDIR@/vps-functions
 
 vzcheckvar VE_PRVT PRIVATE_TEMPLATE
 
+chown_preload_if_needed()
+{
+   [ -z $UID_OFFSET -o -z $GID_OFFSET ]  return 
+
+   export LD_PRELOAD=libvzchown.so
+}
+
 create_prvt()
 {
local TMP AVAIL NEEDED HEADER OPT
@@ -75,6 +86,9 @@ create_prvt()
[ $AVAIL -ge $NEEDED ] ||
vzerror Insufficient disk space in $VE_PRVT; available: 
$AVAIL, needed: $NEEDED ${VZ_FS_NO_DISK_SPACE}
CAT=cat
+
+   chown_preload_if_needed
+
# Use pv to show nice progress bar if we can
pv -V /dev/null 21  CAT=pv
chmod 700 $VE_PRVT
diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am
index b009d15..34c463b 100644
--- a/src/lib/Makefile.am
+++ b/src/lib/Makefile.am
@@ -72,6 +72,9 @@ libvzctl_la_LIBADD = $(XML_LIBS) $(CGROUP_LIBS) $(DL_LIBS)
 
 if HAVE_CGROUP
 libvzctl_la_SOURCES += cgroup.c hooks_ct.c
+
+lib_LTLIBRARIES += libvzchown.la
+libvzchown_la_SOURCES = chown_preload.c
 endif
 
 if HAVE_VZ_KERNEL
diff --git a/src/lib/chown_preload.c b/src/lib/chown_preload.c
new file mode 100644
index 000..4e2be4a
--- /dev/null
+++ b/src/lib/chown_preload.c
@@ -0,0 +1,93 @@
+/*
+ *  Copyright (C) 2013, Parallels, Inc. All rights reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *  Authors: Kir Kolyshkin and Glauber Costa
+ */
+
+#include unistd.h
+#include stdio.h
+#include stdlib.h
+
+#include dlfcn.h
+
+static int (*real_chown)(const char *path, uid_t owner, gid_t group) = NULL;
+static int (*real_fchown)(int fd, uid_t owner, gid_t group) = NULL;
+static int (*real_lchown)(const char *path, uid_t owner, gid_t group) = NULL;
+static int (*real_fchownat)(int dirfd, const char *pathname,
+   uid_t owner, gid_t group, int flags) = NULL;
+
+uid_t uid_offset = 0;
+gid_t gid_offset = 0;
+
+static void __init(void)
+{
+   char *uidstr, *gidstr;
+
+   uidstr = getenv(UID_OFFSET

[Devel] [PATCH v5 5/6] automatically add bridge venet0 when needed

2013-05-17 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

The chosen architecture to deal with --ipadd with upstream containers is to
create a veth pair and add the host side information to a bridge called venet0.
This way, all the code that expects venet0 to exist can still work without
modifications, (or with just a few).

Our intention to do that was actually already stated in the comments, but the
code was removed before merging because --ipadd would not work without full
unshare support anyway.

This patch implements that.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 scripts/vps-functions.in |  7 +++
 src/lib/hooks_ct.c   | 37 +++--
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/scripts/vps-functions.in b/scripts/vps-functions.in
index 826c0a1..37b2de5 100755
--- a/scripts/vps-functions.in
+++ b/scripts/vps-functions.in
@@ -170,6 +170,13 @@ vzadjustmacs()
 # other setups, the bridge is expected to already exist and be valid.
 vzconfbridge()
 {
+   if [ x$BRIDGE == xvenet0 ]; then
+   if [ $(brctl show venet0 2/dev/null | tail -n+2 | wc -l) == 0 
]; then
+   brctl addbr venet0
+   ${IP_CMD} link set venet0 up
+   fi
+   fi
+   
if [ x$BRIDGE != x ]; then
brctl addif $BRIDGE $HNAME /dev/null 21
fi
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index 299d43d..f002fa0 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -17,6 +17,7 @@
 #include logger.h
 #include script.h
 #include cgroup.h
+#include linux/vzctl_venet.h
 
 #define NETNS_RUN_DIR /var/run/netns
 
@@ -731,8 +732,40 @@ static int ct_netdev_ctl(vps_handler *h, envid_t veid, int 
op, char *name)
 
 static int ct_ip_ctl(vps_handler *h, envid_t veid, int op, const char *ipstr)
 {
-   logger(-1, 0, %s not yet supported upstream, __func__);
-   return 0;
+   int ret = -1;
+   char *envp[5];
+   char buf[STR_SIZE];
+   int i = 0;
+
+   if (!h-can_join_pidns) {
+   logger(-1, 0, Cannot join pid namespace: 
+   --ipadd is not supported in kernels without full pidns 
support);
+   return VZ_BAD_KERNEL;
+   }
+   envp[i++] = strdup(VNAME=venet0);
+   envp[i++] = strdup(BRIDGE=venet0);
+
+   snprintf(buf, sizeof(buf), HNAME=venet0.%d, veid);
+   envp[i++] = strdup(buf);
+
+   snprintf(buf, sizeof(buf), VEID=%d, veid);
+   envp[i++] = strdup(buf);
+
+   envp[i] = NULL;
+
+   if (op == VE_IP_ADD) {
+   char *argv[] = { VPS_NETNS_DEV_ADD, NULL };
+
+   ret = run_script(VPS_NETNS_DEV_ADD, argv, envp, 0);
+   } else  {
+   char *argv[] = { VPS_NETNS_DEV_DEL, NULL };
+
+   ret = run_script(VPS_NETNS_DEV_DEL, argv, envp, 0);
+   }
+   free_arg(envp);
+
+   return ret;
+
 }
 
 /*
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v5 6/6] allow for distro-specific fix ups at creation time.

2013-05-17 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

We will need that infrastucture when running with Linux upstream, since some
support is very unlikely to ever land in the Kernel.  We need to do things like
account for the fact that udev may kick in and destroy all the setup we have
done for /dev.  Since preemptively preventing those actions to happen is very
hard ( as an example, centos6 init binary will issue mount syscalls itself),
the most robust approach is to insert a script that will be run shortly after
rc.sysinit. Although this can't guarantee that it will run *immediately* after,
it should be early enough to undo every harmful operation done by /sbin/init
and rc.sysinit, therefore allowing operation to continue freely.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 etc/dists/redhat.conf   |  1 +
 etc/dists/scripts/fixups.sh | 43 +++
 include/dist.h  |  2 ++
 include/env.h   |  3 ++-
 src/lib/dist.c  | 10 +-
 src/lib/env.c   | 10 ++
 src/lib/exec.c  |  2 +-
 src/lib/hooks_ct.c  | 38 ++
 8 files changed, 102 insertions(+), 7 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh

diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf
index 727461d..49226c5 100644
--- a/etc/dists/redhat.conf
+++ b/etc/dists/redhat.conf
@@ -25,3 +25,4 @@ SET_DNS=set_dns.sh
 SET_USERPASS=set_userpass.sh
 SET_UGID_QUOTA=set_ugid_quota.sh
 POST_CREATE=postcreate.sh
+CT_FIXUP=fixups.sh
diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh
new file mode 100755
index 000..da3fd24
--- /dev/null
+++ b/etc/dists/scripts/fixups.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+#  Copyright (C) 2000-2009, Parallels, Inc. All rights reserved.
+#
+#  This program is free software; you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation; either version 2 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program; if not, write to the Free Software
+#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+#
+#
+#  Run-time fixup for legacy distributions that will rely on udevd to populate
+#  their /dev. Those distributions will effectively mount an empty tmpfs on top
+#  of /dev, defeating any setup we may have done. This includes RHEL-based
+#  distributions up to RHEL6. We are better off not allowing that to run at all
+
+function fixup_udev()
+{
+   if [ -f /etc/fedora-release ]; then
+   return;
+   fi
+
+   # All the first two may fail if the distribution didn't actually mount
+   # then.
+   umount /dev/pts
+   umount /dev/shm
+   # still may hold references to files open by rc.sysinit. Even if this 
is the
+   # case those will be simple things like /dev/null and should go away 
shortly.
+   # let us not fail because of that.
+   umount /dev -l
+}
+
+fixup_udev
+
+exit 0
+
diff --git a/include/dist.h b/include/dist.h
index 1f6a5a6..4faa452 100644
--- a/include/dist.h
+++ b/include/dist.h
@@ -29,6 +29,7 @@
 #defineSET_USERPASS5
 #defineSET_UGID_QUOTA  6
 #definePOST_CREATE 7
+#defineCT_FIXUP8
 
 typedef struct {
char *def_ostmpl;
@@ -46,6 +47,7 @@ typedef struct dist_actions {
char *set_userpass; /** setup user password. */
char *set_ugid_quota;   /** setup 2level quota. */
char *post_create;  /** sostcreate actions. */
+   char *ct_fixup; /** run-time fixups. */
 } dist_actions;
 
 /* Read distribution specific actions configuration file.
diff --git a/include/env.h b/include/env.h
index 4fef438..b10dfd9 100644
--- a/include/env.h
+++ b/include/env.h
@@ -104,7 +104,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param 
*param, int stop_mode,
 int vz_chroot(const char *root);
 int vz_env_create(vps_handler *h, envid_t veid, vps_res *res,
int wait_p[2], int old_wait_p[2], int err_p[2],
-   env_create_FN fn, void *data);
+   env_create_FN fn, dist_actions *actions, void *data);
 
 struct vps_res;
 struct arg_start {
@@ -116,6 +116,7 @@ struct arg_start {
vps_handler *h;
void *data;
env_create_FN fn;
+   dist_actions *actions;
int userns_p; /* while running in userns, there's extra sync needed */
 };
 
diff --git a/src/lib/dist.c b/src/lib/dist.c
index bde94ab..8047cf7 100644
--- a/src/lib/dist.c
+++ b/src/lib/dist.c
@@ -36,7 +36,8

Re: [Devel] [PATCH 1/6] vzctl: split ct_env_create

2013-05-16 Thread Glauber Costa
On 05/16/2013 04:14 PM, Andrey Vagin wrote:
 + ret = ct_env_create_real(arg);
 + if (ret  0)
   return VZ_RESOURCE_ERROR;
 - }

Isn't it better to just keep the return values intact in create_real,
and then return them as is if ret != 0 ?

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 2/6] vzctl: save PID of init in a state file

2013-05-16 Thread Glauber Costa
On 05/16/2013 04:14 PM, Andrey Vagin wrote:
 CRIU requires a pid of the init.
 
 Signed-off-by: Andrey Vagin ava...@openvz.org

The way you coded it, it seems to me that we will always overwrite the
pid file, which is fine: this way we won't run into the usual pid file
already exists kinds of problem. When we reach this point, we already
tested if the container is running... My only fear is that we do that by
verifying if there is any tasks already in the cgroup. So can we race
against other container creation?

We put ourselves in the container before we call clone, so *at this
point* the vps_is_run test would fail, but by now we have already tested
it a while ago... Maybe vzctl has other mechanism against such races
that I am not aware of, but could you explain me why we don't have this
race? It is my only concern, the rest looks fine. Having the init pid is
quite handy for other situations as well, like stop --fast.
So far we hack that by killing everybody until no more tasks are found
in the cgroup, but what I always wanted was really a mechanism to kill init.

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v3 4/9] user namespace support for upstream containers

2013-05-14 Thread Glauber Costa
On 05/14/2013 07:02 AM, Kir Kolyshkin wrote:
 Oh my, four cases of whitespace-at-eol which I had to fixed manually.
 
Sorry about that. I think I got to used to checkpatch and the such that
I forget to verify all of those manually.

 Wanted to apply it nevertheless and fix some things myself, but then the
 list grew out too big.
 
 Plus please see inline.
 
Here we go.

   /* This function is there in GLIBC, but not in headers */
   extern int pivot_root(const char * new_root, const char * put_old);
   @@ -138,10 +140,41 @@ static int _env_create(void *data)
   struct env_create_param3 create_param;
   int ret;
   -if ((ret = ct_chroot(arg-res-fs.root)))
 +if ((arg-userns_p != -1)  (read(arg-userns_p, ret,
 sizeof(ret)) == 0))
 
 1 I'd suggest comparing with sizeof(ret)
As you wish.

 2 We can't print any errors here, can we? If we can I'd rather do it.
 
I believe logger works in here, yes.
 +return -1;
 
 You're supposed to return VZ_* exit codes from here.
 

ok.

 +
 +ret = ct_chroot(arg-res-fs.root);
 +close(arg-userns_p);
 +/* Probably means chroot failed */
 +if (ret)
   return ret;
   -if ((ret = vps_set_cap(arg-veid, arg-res-env,
 arg-res-cap, 1)))
 +if (arg-h-can_join_userns) {
 +int fd;
 +setuid(0);
 +setgid(0);
 +/*
 + * We need the special flag newinstance. This is a requirement
 + * of the userns-aware implementation of devpts as of Linux 3.9.
 + * Because of that special requirement, we do it here rather
 than
 + * later.
 + */
 +mount(devpts, /dev/pts, devpts, 0, newinstance);
 
 What if mount point does not exist, or is not a directory?
 
 I think it makes sense to
 1. try creating it, ignoring EEXIST
 2. check return from mount()
 
Check return and then what? It is not clear for me if we should abort or
not when things go wrong. It seems perfectly valid that the container
continues. I have the same dilemma with the other devices.


 +/* /dev/ptmx, if it even exists, would refer to the root ptmx.
 + * We don't want that, we want our newly created instance to
 contain
 + * all ptys. So we bind mount the root device here
 + */
 +fd = open(/dev/ptmx, O_RDWR|O_CREAT, 0);
 
 You do open() to create the /dev/ptmx file in case it does not exist,
 right?
 Why RDWR then?
 
No special reason.


 +/*
 + * Those devices should exist in the container, and be valid device
 nodes with
 + * user access permission. But we need to be absolutely sure this is
 the case,
 + * so we will provide our own versions. That could actually happen
 since some
 + * distributions may come with emptied /dev's, waiting for udev to
 populate them.
 + * That won't happen, we do it ourselves.
 + */
 +static void create_devices(vps_handler *h, envid_t veid, const char
 *root)
 +{
 +unsigned int i;
 +char *devices[] = {
 +/dev/null,
 +/dev/zero,
 +/dev/random,
 +/dev/urandom,
 +};
 +
 +/*
 + * We will tolerate errors, and keep the container running,
 because it is
 + * likely we will be able to boot it to a barely functional
 state. But
 + * be vocal about it
 + */
 +for (i = 0; i  ARRAY_SIZE(devices); i++) {
 +char ct_devname[STR_SIZE];
 +int ret;
 +
 +ret = snprintf(ct_devname, sizeof(ct_devname), %s/%s, root,
 devices[i]);
 +if (ret  0) {
 +logger(-1, errno, Could not allocate device string\n);
 
 we don't need \n for logger() calls.
 
Will remove, but the most important question here is the behavioral
question. Is it okay to proceed ?

   static int ct_env_create(struct arg_start *arg)
   {
   @@ -233,6 +348,8 @@ static int ct_env_create(struct arg_start *arg)
   int ret;
   char procpath[STR_SIZE];
   char ctpath[STR_SIZE];
 +int userns_p[2];
 +int err;
 stack_size = get_pagesize();
   if (stack_size  0)
 @@ -272,17 +389,58 @@ static int ct_env_create(struct arg_start *arg)
* Belong in the setup phase
*/
   clone_flags = SIGCHLD;
 -/* FIXME: USERNS is still work in progress */
   clone_flags |= CLONE_NEWUTS|CLONE_NEWPID|CLONE_NEWIPC;
   clone_flags |= CLONE_NEWNET|CLONE_NEWNS;
   +if (!arg-h-can_join_userns) {
 +logger(-1, 0, WARNING: Running container unprivileged.
 USER_NS not supported);
 +
 +userns_p[0] = userns_p[1] = -1;
 +} else {
 +clone_flags |= CLONE_NEWUSER;
 +if (pipe(userns_p)  0) {
 +logger(-1, errno, Can not create userns pipe);
 +return VZ_RESOURCE_ERROR;
 +}
 +}
 +arg-userns_p = userns_p[0];
 +
 +if (arg-h-can_join_userns) {
 +create_devices(arg-h, arg-veid, arg-res-fs.root);
 +}
 +
   ret = clone(_env_create, child_stack, clone_flags, arg);
   if (ret   0) {
   logger(-1, errno, Unable to clone);
   /* FIXME: remove ourselves from container 

Re: [Devel] [PATCH v3 0/9] Upstream Linux support for userns

2013-05-14 Thread Glauber Costa
On 05/14/2013 07:03 AM, Kir Kolyshkin wrote:
 On 04/29/2013 10:16 PM, Glauber Costa wrote:
 Kir,

 Please review the following patchset. The main difference from last
 version is that
 we support running with userns disabled even if it is present. This
 effectively means
 that containers that were already created and owned by root will keep
 working. It is
 also possible to explicitly disable it at container creation by
 setting local_uid to 0.

 There are also some bugfixes and changes according to the review you
 provided.

 Running a container works, and vzctl enter works as well. Most
 pressing, is
 the fact that although this patchset finally implements --ipadd (now all
 infrastructure is in place).


 
 First three patches merged, patch 4 needs more love, and the rest of the
 series
 depends on #4.
Thanks, will fix.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v3 7/9] modify tar extraction to account for user namespace

2013-05-14 Thread Glauber Costa
On 05/14/2013 07:17 AM, Kir Kolyshkin wrote:
 
 Hmm...
 
 If I understand it correctly, in case LOCAL_UID/LOCAL_GID is not set in
 the global configuration file,
 and not supplied from command line, here you apply the default values of
 1.
 
 The problem I see these values are not saved into configuration file and
 therefore not applied
 during the start.
 
 Can you please check if I'm right here?
I can double check, but I remember having tested this exactly scenario:
create container, (with multiple LOCAL_UID values), and then checking
the result. It seems to work, but I will recheck that.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v3 6/9] allow local uid and gid to be specified at container creation

2013-05-14 Thread Glauber Costa
On 05/14/2013 07:09 AM, Kir Kolyshkin wrote:
 +When running with an upstream Linux Kernel that supports user
 namespaces (=
 +3.8), the parameters \fB--local_uid\fR and \fB--local_gid\fR can be
 used to
 +select which \fIuid\fR and \fIgid\fR respectively will be used as a
 base user
 +in the host system. Note that user namespaces provide a 1:1 mapping
 between
 +container users and host users. If these options are not specified,
 the value
 +10 is used.
 
 This probably comes from an older version of a patchset. We don't have
 compiled-in defaults, do we?
 
 I'd say
 
 If these options are not specified, the values \fBLOCAL_UID\fR and
 \fBLOCAL_GID\fR
 from global configuration file \fBvz.conf\fR(5) are used.

Huummm, this is not my understanding. Please note that --local_uid and
--local_uid are creation time switches. This means that they will only
apply to newly created containers.

Upon execution, these are read from the local configuration file. In
older containers, this will be unset and we will run with userns
disabled. In newly containers, the only way to disable it is to specify
--local_uid = 0, but by default we run with 10. What is the problem
with this?


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v3 7/9] modify tar extraction to account for user namespace

2013-05-14 Thread Glauber Costa
On 05/14/2013 07:17 AM, Kir Kolyshkin wrote:
 
 Hmm...
 
 If I understand it correctly, in case LOCAL_UID/LOCAL_GID is not set in
 the global configuration file,
 and not supplied from command line, here you apply the default values of
 1.
 
 The problem I see these values are not saved into configuration file and
 therefore not applied
 during the start.
 
 Can you please check if I'm right here?

I actually think it is better to remove this excerpt. I will assume that
if we are running a new installation, the global configuration file will
be a new one as well.

So I will just default it to 0, which means userns disabled.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v4 3/7] Also pass cmd_p pointer to container open

2013-05-14 Thread Glauber Costa
We would like to know early on in the container lifetime (before creation) if
the container should use user namespaces. At this point, we not yet have a ct
configuration file, just the global one. It may very well be that the user have
overriden the global configuration file information through the command line.
If that is the case, this value should take precedence. To achieve this, we
should also pass the cmd_p information to the open functions.

Signed-off-by: Glauber Costa glom...@openvz.org
---
 include/env.h   | 8 +---
 src/lib/env.c   | 7 ---
 src/lib/hooks_ct.c  | 2 +-
 src/lib/hooks_vz.c  | 2 +-
 src/vzctl-actions.c | 2 +-
 5 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/env.h b/include/env.h
index 4fef438..35d427f 100644
--- a/include/env.h
+++ b/include/env.h
@@ -49,8 +49,10 @@ enum {
  *
  * @param veid CT ID.
  * @return handler or NULL on error.
+ * @param  global ct parameters.
+ * @cmdcommand line ct parameters.
  */
-vps_handler *vz_open(envid_t veid, vps_param *param);
+vps_handler *vz_open(envid_t veid, vps_param *param, vps_param *cmd);
 
 /** Close CT handler.
  *
@@ -126,6 +128,6 @@ int exec_container_init(struct arg_start *arg,
struct env_create_param3 *create_param);
 
 int set_personality32();
-int vz_do_open(vps_handler *h, vps_param *param);
-int ct_do_open(vps_handler *h, vps_param *param);
+int vz_do_open(vps_handler *h, vps_param *param, vps_param *cmd);
+int ct_do_open(vps_handler *h, vps_param *param, vps_param *cmd);
 #endif
diff --git a/src/lib/env.c b/src/lib/env.c
index 96400c0..166c83b 100644
--- a/src/lib/env.c
+++ b/src/lib/env.c
@@ -74,9 +74,10 @@ static int reset_std()
  *
  * @param veid CT ID.
  * @param paramCT parameters.
+ * @param cmd  CT command line parameters.
  * @return handler or NULL on error.
  */
-vps_handler *vz_open(envid_t veid, vps_param *param)
+vps_handler *vz_open(envid_t veid, vps_param *param, vps_param *cmd)
 {
vps_handler *h = NULL;
int ret = -1;
@@ -90,7 +91,7 @@ vps_handler *vz_open(envid_t veid, vps_param *param)
 #ifdef VZ_KERNEL_SUPPORTED
/* Find out if we are under OpenVZ or upstream kernel */
if (stat_file(/proc/vz))
-   ret = vz_do_open(h, param);
+   ret = vz_do_open(h, param, cmd);
else
 #endif
{
@@ -98,7 +99,7 @@ vps_handler *vz_open(envid_t veid, vps_param *param)
non-OpenVZ kernel);
h-vzfd = -1;
 #ifdef HAVE_UPSTREAM
-   ret = ct_do_open(h, param);
+   ret = ct_do_open(h, param, cmd);
 #else
logger(-1, 0, No suitable kernel support compiled in);
 #endif
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index 8738eeb..f4d8f48 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -791,7 +791,7 @@ static int ct_setcontext(envid_t veid)
return 0;
 }
 
-int ct_do_open(vps_handler *h, vps_param *param)
+int ct_do_open(vps_handler *h, vps_param *param, vps_param *cmd)
 {
int ret;
char path[STR_SIZE];
diff --git a/src/lib/hooks_vz.c b/src/lib/hooks_vz.c
index 50770c0..c373b13 100644
--- a/src/lib/hooks_vz.c
+++ b/src/lib/hooks_vz.c
@@ -466,7 +466,7 @@ static int vz_veth_ctl(vps_handler *h, envid_t veid, int 
op, veth_dev *dev)
return ret;
 }
 
-int vz_do_open(vps_handler *h, vps_param *param)
+int vz_do_open(vps_handler *h, vps_param *param, vps_param *cmd)
 {
if ((h-vzfd = open(VZCTLDEV, O_RDWR))  0) {
logger(-1, errno, Unable to open %s, VZCTLDEV);
diff --git a/src/vzctl-actions.c b/src/vzctl-actions.c
index 0dd2ae7..7bf15c9 100644
--- a/src/vzctl-actions.c
+++ b/src/vzctl-actions.c
@@ -1470,7 +1470,7 @@ int run_action(envid_t veid, act_t action, vps_param 
*g_p, vps_param *vps_p,
int ret = 0, lock_id = -1;
struct sigaction act;
 
-   if ((h = vz_open(veid, g_p)) == NULL) {
+   if ((h = vz_open(veid, g_p, cmd_p)) == NULL) {
/* Accept to run set --save --force on any kernel,
 * otherwise error out if initialization failed
 */
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v4 4/7] allow local uid and gid to be specified at container creation

2013-05-14 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

It is a valid use case to run a container with host uid and gid different than
the default. In particular, already deployed versions of vzctl are expected to
have this value unset, effectively meaning they are not expecting user
namespaces to be present. We also deem as a valid use case to run a fully
privileged container, in which case we will explicitly disable user namespaces.

This patch provides and documents a way to do so.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 man/vzctl.8.in  | 17 +
 src/lib/hooks_ct.c  | 11 ---
 src/vzctl-actions.c |  2 ++
 src/vzctl.c |  1 +
 4 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/man/vzctl.8.in b/man/vzctl.8.in
index 20a1856..e233696 100644
--- a/man/vzctl.8.in
+++ b/man/vzctl.8.in
@@ -871,6 +871,8 @@ List of available fields can be obtained using \fB-L\fR 
option.
 .OP --ipadd addr
 .OP --hostname name
 .OP --name name
+.OP --local_uid uid
+.OP --local_gid gid
 .YS
 .IP  4
 Creates a new container area. This operation should be done once, before
@@ -922,6 +924,21 @@ a container. Note that this option can be used multiple 
times.
 
 You can use \fB--hostname\fR \fIname\fR option to set a host name for
 a container.
+
+When running with an upstream Linux Kernel that supports user namespaces (=
+3.8), the parameters \fB--local_uid\fR and \fB--local_gid\fR can be used to
+select which \fIuid\fR and \fIgid\fR respectively will be used as a base user
+in the host system. Note that user namespaces provide a 1:1 mapping between
+container users and host users. If these options are not specified, the values
+\fBLOCAL_UID\fR and \fBLOCAL_GID\fR from global configuration file
+\fBvz.conf\fR(5) are used. An explicit \fB--local_uid\fR value of 0 will
+disable user namespace support, and run the container as a privileged user. In
+this case, \fB--local_gid\fR is ignored.
+
+\fBWarning:\fR use \fB--local_uid\fR and \fB--local_gid\fR with care, specially
+when migrating containers. In all situations, the container's files in the
+filesystem needs to be correctly owned by the host-side users.
+
 .IP \fBdestroy\fR | \fBdelete\fR \fICTID\fR 4
 Removes a container private area by deleting all files, directories and
 the configuration file of this container.
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index f4d8f48..9b2e929 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -418,7 +418,7 @@ static int ct_env_create(struct arg_start *arg)
clone_flags |= CLONE_NEWNET|CLONE_NEWNS;
 
if (!arg-h-can_join_userns) {
-   logger(-1, 0, WARNING: Running container unprivileged. USER_NS 
not supported);
+   logger(-1, 0, WARNING: Running container unprivileged. USER_NS 
not supported, or runtime disabled);
 
userns_p[0] = userns_p[1] = -1;
} else {
@@ -797,7 +797,12 @@ int ct_do_open(vps_handler *h, vps_param *param, vps_param 
*cmd)
char path[STR_SIZE];
char upath[STR_SIZE];
struct stat st;
-   unsigned long *local_uid = param-res.misc.local_uid;
+   unsigned long *local_uid; 
+ 
+   /* Command line takes precedence, but if it is unset, check global file 
*/ 
+   local_uid = cmd-res.misc.local_uid; 
+   if (!local_uid) 
+   local_uid = param-res.misc.local_uid; 
 
ret = container_init();
if (ret) {
@@ -838,7 +843,7 @@ int ct_do_open(vps_handler *h, vps_param *param, vps_param 
*cmd)
 * mapped user to own the files, etc. So we also need to find suitable
 * configuration in the config files.
 */
-   h-can_join_userns = !stat(upath, st)  local_uid;
+   h-can_join_userns = !stat(upath, st)  local_uid  (*local_uid != 
0);
h-is_run = ct_is_run;
h-enter = ct_enter;
h-destroy = ct_destroy;
diff --git a/src/vzctl-actions.c b/src/vzctl-actions.c
index 7bf15c9..044a0ad 100644
--- a/src/vzctl-actions.c
+++ b/src/vzctl-actions.c
@@ -391,6 +391,8 @@ static int parse_create_opt(envid_t veid, int argc, char 
**argv,
{ve_layout,   required_argument, NULL, PARAM_VE_LAYOUT},
{velayout,required_argument, NULL, PARAM_VE_LAYOUT},
{diskspace,   required_argument, NULL, PARAM_DISKSPACE},
+   {local_uid,   required_argument, NULL, PARAM_LOCAL_UID},
+   {local_gid,   required_argument, NULL, PARAM_LOCAL_GID},
{ NULL, 0, NULL, 0 }
 };
 
diff --git a/src/vzctl.c b/src/vzctl.c
index 359bcde..54d66d1 100644
--- a/src/vzctl.c
+++ b/src/vzctl.c
@@ -65,6 +65,7 @@ static void usage(int rc)
 vzctl create ctid [--ostemplate name] [--config name]\n
[--layout ploop|simfs] [--hostname name] [--name name] [--ipadd 
addr]\n
[--diskspace kbytes] [--private path] [--root path]\n
+   [--local_uid UID] [--local_gid GID]\n
 vzctl start ctid [--force] [--wait]\n
 vzctl destroy | mount | umount | stop | restart | status ctid\n
 #ifdef HAVE_PLOOP
-- 
1.7.11.7

[Devel] [PATCH v4 5/7] modify tar extraction to account for user namespace

2013-05-14 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

If we are running upstream with user namespaces, we need to create the
container filesystem not with the ownership preserved, but reflecting the
mapping we need to apply. Note that according to our documentation, we should
ignore this if the user explicitly requested an uid mapping of 0 (gid is
ignored in this case).

Our tooling doesn't allow any easy way to unpack a whol distribution with
offsets mechanically applied like this. We could do the whole unpacking in a
user namespace itself, but that does not come without problems on its own (for
instance, we won't be able to create any device files, we have to carefully
adjust permissions in the root directory, etc)

To work around that, we can employ a trick to allow container creation right
now, as well as to avoid compatibility problems: we will resort to LD_PRELOAD
to load a schim that captures calls to the chown family of system calls and
applies the offset manually.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/res.h   |  6 
 scripts/vps-create.in   | 14 
 src/lib/Makefile.am |  3 ++
 src/lib/chown_preload.c | 93 +
 src/lib/create.c| 25 ++---
 vzctl.spec  |  2 +-
 6 files changed, 137 insertions(+), 6 deletions(-)
 create mode 100644 src/lib/chown_preload.c

diff --git a/include/res.h b/include/res.h
index 0dfacf7..047593a 100644
--- a/include/res.h
+++ b/include/res.h
@@ -47,6 +47,12 @@ struct env_param {
 };
 typedef struct env_param env_param_t;
 
+/*
+ * When running upstream kernels, we will need host-side UID and GID. Those
+ * are configurable, but if none is specified, those are used.
+ */
+#define VZ_DEFAULT_UID 10
+#define VZ_DEFAULT_GID 10
 typedef struct {
list_head_t userpw;
list_head_t nameserver;
diff --git a/scripts/vps-create.in b/scripts/vps-create.in
index 126f048..a1b3382 100755
--- a/scripts/vps-create.in
+++ b/scripts/vps-create.in
@@ -22,11 +22,22 @@
 # Required parameters:
 #   VE_PRVT- path to root of CT private areas
 #   PRIVATE_TEMPLATE   - path to private template used as a source for copying
+#
+# Optional parameters:
+#   UID_OFFSET - offset to be added to all tar UIDs
+#   GID_OFFSET - offset to be added to all tar GIDs
 
 . @SCRIPTDIR@/vps-functions
 
 vzcheckvar VE_PRVT PRIVATE_TEMPLATE
 
+chown_preload_if_needed()
+{
+   [ -z $UID_OFFSET -o -z $GID_OFFSET ]  return 
+
+   export LD_PRELOAD=libvzchown.so
+}
+
 create_prvt()
 {
local TMP AVAIL NEEDED HEADER OPT
@@ -75,6 +86,9 @@ create_prvt()
[ $AVAIL -ge $NEEDED ] ||
vzerror Insufficient disk space in $VE_PRVT; available: 
$AVAIL, needed: $NEEDED ${VZ_FS_NO_DISK_SPACE}
CAT=cat
+
+   chown_preload_if_needed
+
# Use pv to show nice progress bar if we can
pv -V /dev/null 21  CAT=pv
chmod 700 $VE_PRVT
diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am
index b009d15..34c463b 100644
--- a/src/lib/Makefile.am
+++ b/src/lib/Makefile.am
@@ -72,6 +72,9 @@ libvzctl_la_LIBADD = $(XML_LIBS) $(CGROUP_LIBS) $(DL_LIBS)
 
 if HAVE_CGROUP
 libvzctl_la_SOURCES += cgroup.c hooks_ct.c
+
+lib_LTLIBRARIES += libvzchown.la
+libvzchown_la_SOURCES = chown_preload.c
 endif
 
 if HAVE_VZ_KERNEL
diff --git a/src/lib/chown_preload.c b/src/lib/chown_preload.c
new file mode 100644
index 000..4e2be4a
--- /dev/null
+++ b/src/lib/chown_preload.c
@@ -0,0 +1,93 @@
+/*
+ *  Copyright (C) 2013, Parallels, Inc. All rights reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *  Authors: Kir Kolyshkin and Glauber Costa
+ */
+
+#include unistd.h
+#include stdio.h
+#include stdlib.h
+
+#include dlfcn.h
+
+static int (*real_chown)(const char *path, uid_t owner, gid_t group) = NULL;
+static int (*real_fchown)(int fd, uid_t owner, gid_t group) = NULL;
+static int (*real_lchown)(const char *path, uid_t owner, gid_t group) = NULL;
+static int (*real_fchownat)(int dirfd, const char *pathname,
+   uid_t owner, gid_t group, int flags) = NULL;
+
+uid_t uid_offset = 0;
+gid_t gid_offset = 0;
+
+static void __init(void)
+{
+   char *uidstr, *gidstr;
+
+   uidstr = getenv(UID_OFFSET);
+   gidstr

[Devel] [PATCH v4 6/7] automatically add bridge venet0 when needed

2013-05-14 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

The chosen architecture to deal with --ipadd with upstream containers is to
create a veth pair and add the host side information to a bridge called venet0.
This way, all the code that expects venet0 to exist can still work without
modifications, (or with just a few).

Our intention to do that was actually already stated in the comments, but the
code was removed before merging because --ipadd would not work without full
unshare support anyway.

This patch implements that.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 scripts/vps-functions.in |  7 +++
 src/lib/hooks_ct.c   | 37 +++--
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/scripts/vps-functions.in b/scripts/vps-functions.in
index 826c0a1..37b2de5 100755
--- a/scripts/vps-functions.in
+++ b/scripts/vps-functions.in
@@ -170,6 +170,13 @@ vzadjustmacs()
 # other setups, the bridge is expected to already exist and be valid.
 vzconfbridge()
 {
+   if [ x$BRIDGE == xvenet0 ]; then
+   if [ $(brctl show venet0 2/dev/null | tail -n+2 | wc -l) == 0 
]; then
+   brctl addbr venet0
+   ${IP_CMD} link set venet0 up
+   fi
+   fi
+   
if [ x$BRIDGE != x ]; then
brctl addif $BRIDGE $HNAME /dev/null 21
fi
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index 9b2e929..8695985 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -17,6 +17,7 @@
 #include logger.h
 #include script.h
 #include cgroup.h
+#include linux/vzctl_venet.h
 
 #define NETNS_RUN_DIR /var/run/netns
 
@@ -725,8 +726,40 @@ static int ct_netdev_ctl(vps_handler *h, envid_t veid, int 
op, char *name)
 
 static int ct_ip_ctl(vps_handler *h, envid_t veid, int op, const char *ipstr)
 {
-   logger(-1, 0, %s not yet supported upstream, __func__);
-   return 0;
+   int ret = -1;
+   char *envp[5];
+   char buf[STR_SIZE];
+   int i = 0;
+
+   if (!h-can_join_pidns) {
+   logger(-1, 0, Cannot join pid namespace: 
+   --ipadd is not supported in kernels without full pidns 
support);
+   return VZ_BAD_KERNEL;
+   }
+   envp[i++] = strdup(VNAME=venet0);
+   envp[i++] = strdup(BRIDGE=venet0);
+
+   snprintf(buf, sizeof(buf), HNAME=venet0.%d, veid);
+   envp[i++] = strdup(buf);
+
+   snprintf(buf, sizeof(buf), VEID=%d, veid);
+   envp[i++] = strdup(buf);
+
+   envp[i] = NULL;
+
+   if (op == VE_IP_ADD) {
+   char *argv[] = { VPS_NETNS_DEV_ADD, NULL };
+
+   ret = run_script(VPS_NETNS_DEV_ADD, argv, envp, 0);
+   } else  {
+   char *argv[] = { VPS_NETNS_DEV_DEL, NULL };
+
+   ret = run_script(VPS_NETNS_DEV_DEL, argv, envp, 0);
+   }
+   free_arg(envp);
+
+   return ret;
+
 }
 
 /*
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v4 7/7] allow for distro-specific fix ups at creation time.

2013-05-14 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

We will need that infrastucture when running with Linux upstream, since some
support is very unlikely to ever land in the Kernel.  We need to do things like
account for the fact that udev may kick in and destroy all the setup we have
done for /dev.  Since preemptively preventing those actions to happen is very
hard ( as an example, centos6 init binary will issue mount syscalls itself),
the most robust approach is to insert a script that will be run shortly after
rc.sysinit. Although this can't guarantee that it will run *immediately* after,
it should be early enough to undo every harmful operation done by /sbin/init
and rc.sysinit, therefore allowing operation to continue freely.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 etc/dists/redhat.conf   |  1 +
 etc/dists/scripts/fixups.sh | 43 +++
 include/dist.h  |  2 ++
 include/env.h   |  3 ++-
 src/lib/dist.c  | 10 +-
 src/lib/env.c   | 10 ++
 src/lib/exec.c  |  2 +-
 src/lib/hooks_ct.c  | 38 ++
 8 files changed, 102 insertions(+), 7 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh

diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf
index 727461d..49226c5 100644
--- a/etc/dists/redhat.conf
+++ b/etc/dists/redhat.conf
@@ -25,3 +25,4 @@ SET_DNS=set_dns.sh
 SET_USERPASS=set_userpass.sh
 SET_UGID_QUOTA=set_ugid_quota.sh
 POST_CREATE=postcreate.sh
+CT_FIXUP=fixups.sh
diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh
new file mode 100755
index 000..da3fd24
--- /dev/null
+++ b/etc/dists/scripts/fixups.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+#  Copyright (C) 2000-2009, Parallels, Inc. All rights reserved.
+#
+#  This program is free software; you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation; either version 2 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program; if not, write to the Free Software
+#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+#
+#
+#  Run-time fixup for legacy distributions that will rely on udevd to populate
+#  their /dev. Those distributions will effectively mount an empty tmpfs on top
+#  of /dev, defeating any setup we may have done. This includes RHEL-based
+#  distributions up to RHEL6. We are better off not allowing that to run at all
+
+function fixup_udev()
+{
+   if [ -f /etc/fedora-release ]; then
+   return;
+   fi
+
+   # All the first two may fail if the distribution didn't actually mount
+   # then.
+   umount /dev/pts
+   umount /dev/shm
+   # still may hold references to files open by rc.sysinit. Even if this 
is the
+   # case those will be simple things like /dev/null and should go away 
shortly.
+   # let us not fail because of that.
+   umount /dev -l
+}
+
+fixup_udev
+
+exit 0
+
diff --git a/include/dist.h b/include/dist.h
index 1f6a5a6..4faa452 100644
--- a/include/dist.h
+++ b/include/dist.h
@@ -29,6 +29,7 @@
 #defineSET_USERPASS5
 #defineSET_UGID_QUOTA  6
 #definePOST_CREATE 7
+#defineCT_FIXUP8
 
 typedef struct {
char *def_ostmpl;
@@ -46,6 +47,7 @@ typedef struct dist_actions {
char *set_userpass; /** setup user password. */
char *set_ugid_quota;   /** setup 2level quota. */
char *post_create;  /** sostcreate actions. */
+   char *ct_fixup; /** run-time fixups. */
 } dist_actions;
 
 /* Read distribution specific actions configuration file.
diff --git a/include/env.h b/include/env.h
index 35d427f..223c3a3 100644
--- a/include/env.h
+++ b/include/env.h
@@ -106,7 +106,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param 
*param, int stop_mode,
 int vz_chroot(const char *root);
 int vz_env_create(vps_handler *h, envid_t veid, vps_res *res,
int wait_p[2], int old_wait_p[2], int err_p[2],
-   env_create_FN fn, void *data);
+   env_create_FN fn, dist_actions *actions, void *data);
 
 struct vps_res;
 struct arg_start {
@@ -118,6 +118,7 @@ struct arg_start {
vps_handler *h;
void *data;
env_create_FN fn;
+   dist_actions *actions;
int userns_p; /* while running in userns, there's extra sync needed */
 };
 
diff --git a/src/lib/dist.c b/src/lib/dist.c
index bde94ab..8047cf7 100644
--- a/src/lib/dist.c
+++ b/src/lib/dist.c
@@ -36,7 +36,8

[Devel] [PATCH v4 1/7] user namespace support for upstream containers

2013-05-14 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

This patch allows the execution of unprivileged containers running ontop
of an upstream Linux Kernel. We will run at whatever UID is found in the
configuration file (so far empty, thus disabled).

Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/env.h  |   1 +
 include/types.h|   1 +
 src/lib/hooks_ct.c | 256 +++--
 3 files changed, 249 insertions(+), 9 deletions(-)

diff --git a/include/env.h b/include/env.h
index 06729f6..4fef438 100644
--- a/include/env.h
+++ b/include/env.h
@@ -116,6 +116,7 @@ struct arg_start {
vps_handler *h;
void *data;
env_create_FN fn;
+   int userns_p; /* while running in userns, there's extra sync needed */
 };
 
 struct env_create_param3;
diff --git a/include/types.h b/include/types.h
index a4bce73..fa511aa 100644
--- a/include/types.h
+++ b/include/types.h
@@ -95,6 +95,7 @@ typedef struct vps_handler {
int vzfd;   /** /dev/vzctl file descriptor. */
int stdfd;
int can_join_pidns; /* can't enter otherwise */
+   int can_join_userns; /* can't run non privileged otherwise */
int (*is_run)(struct vps_handler *h, envid_t veid);
int (*enter)(struct vps_handler *h, envid_t veid, const char *root, int 
flags);
int (*destroy)(struct vps_handler *h, envid_t veid);
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index e46c1df..8738eeb 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -66,6 +66,8 @@ static int sys_setns(int fd, int nstype)
 # define MS_PRIVATE (1  18)
 #endif
 
+#define UID_GID_RANGE 10 /* how many users per container */
+
 /* This function is there in GLIBC, but not in headers */
 extern int pivot_root(const char * new_root, const char * put_old);
 
@@ -203,16 +205,154 @@ static int ct_setlimits(vps_handler *h, envid_t veid, 
struct ub_struct *ub)
 }
 #undef add_value
 
+static int write_uid_gid_mapping(vps_handler *h, unsigned long uid, unsigned 
long gid, pid_t pid)
+{
+   char buf[STR_SIZE];
+   char map[STR_SIZE];
+   int fd;
+   int len;
+   int ret;
+
+   len = snprintf(map, sizeof(map), 0 %ld %d, uid, UID_GID_RANGE);
+   snprintf(buf, sizeof(buf), /proc/%d/uid_map, pid);
+   if ((fd = open(buf, O_WRONLY))  0)
+   goto out;
+
+   if ((write(fd, map, len) != len))
+   goto out;
+
+   close(fd);
+
+   len = snprintf(map, sizeof(map), 0 %ld %d, gid, UID_GID_RANGE);
+   snprintf(buf, sizeof(map), /proc/%d/gid_map, pid);
+   if ((fd = open(buf, O_WRONLY))  0)
+   goto out;
+
+   if ((write(fd, map, len) != len))
+   goto out;
+   ret = 0;
+out:
+   if (fd  0)
+   close(fd);
+   return ret;
+}
+
+/*
+ * Those devices should exist in the container, and be valid device nodes with
+ * user access permission. But we need to be absolutely sure this is the case,
+ * so we will provide our own versions. That could actually happen since some
+ * distributions may come with emptied /dev's, waiting for udev to populate 
them.
+ * That won't happen, we do it ourselves.
+ */
+static void create_devices(vps_handler *h, envid_t veid, const char *root)
+{
+   unsigned int i;
+   char *devices[] = {
+   /dev/null,
+   /dev/zero,
+   /dev/random,
+   /dev/urandom,
+   };
+
+   /*
+* We will tolerate errors, and keep the container running, because it 
is
+* likely we will be able to boot it to a barely functional state. But
+* be vocal about it
+*/
+   for (i = 0; i  ARRAY_SIZE(devices); i++) {
+   char ct_devname[STR_SIZE];
+   int ret;
+
+   ret = snprintf(ct_devname, sizeof(ct_devname), %s/%s, root, 
devices[i]);
+   if (ret  0) {
+   logger(-1, errno, Could not allocate device string);
+   continue;
+   }
+
+   /*
+* No need to be crazy about file flags. When we bind mount, the
+* source permissions will be inherited.
+*/
+   ret = open(ct_devname, O_RDWR|O_CREAT, 0);
+   if (ret  0) {
+   logger(-1, errno, Could not touch device %s, 
devices[i]);
+   continue;
+   }
+   close(ret);
+
+   ret = mount(devices[i], ct_devname, , MS_BIND, 0);
+   if (ret  0)
+   logger(-1, errno, Could not bind mount device %s, 
devices[i]);
+   }
+}
+
 static int _env_create(void *data)
 {
struct arg_start *arg = data;
struct env_create_param3 create_param;
int ret;
 
-   if ((ret = ct_chroot(arg-res-fs.root)))
+   if ((arg-userns_p != -1)  (read(arg-userns_p, ret, sizeof(ret)) != 
sizeof(ret))) {
+   logger(-1, errno, Cannot read

[Devel] [PATCH v4 2/7] add user mismatch test

2013-05-14 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

In theory, we won't be able to run if our private area is not owned by
ourselves.  We could, if it have very wide open security permissions, but we
should never set up a container like that.

Aside from a basic sanity check, this is intended to catch problems for the few
people who may have already created containers that will be owned by root:root,
and will now try to run it unprivileged.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 src/lib/env.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/lib/env.c b/src/lib/env.c
index 898bfc8..96400c0 100644
--- a/src/lib/env.c
+++ b/src/lib/env.c
@@ -30,6 +30,7 @@
 #include linux/reboot.h
 #include sys/mount.h
 #include sys/utsname.h
+#include sys/stat.h
 
 #include vzerror.h
 #include res.h
@@ -554,6 +555,21 @@ int vps_start_custom(vps_handler *h, envid_t veid, 
vps_param *param,
logger(-1, 0, Container is already running);
return VZ_VE_RUNNING;
}
+   if (!is_vz_kernel(h)  h-can_join_userns) {
+   struct stat private_stat;
+   unsigned long *local_uid = res-misc.local_uid;
+   unsigned long *local_gid = res-misc.local_gid;
+
+   stat(res-fs.private, private_stat);
+   if ((local_uid  (private_stat.st_uid != *local_uid)) ||
+   (local_gid  (private_stat.st_gid != *local_gid))) {
+   logger(-1, 0, Container private area is owned by %d:%d
+   , but configuration file says we should run with 
%lu:%lu.\n
+   Refusing to run., private_stat.st_uid, 
private_stat.st_gid,
+   *res-misc.local_uid, *res-misc.local_gid);
+   return VZ_FS_BAD_TMPL;
+   }
+   }
if ((ret = check_ub(h, res-ub)))
return ret;
 
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v4 0/7] User namespace support for upstream containers

2013-05-14 Thread Glauber Costa
Kir,

In this patchset, I hope to be addressing all your concerns. It is a bit
massive code, so if is there still anything that you would me to change,
don't shy away. I will go back to it ASAP.

That said, I am incorporating your comments on the previous PATCH 4 (now 1),
and adding another one, that also passes cmd_p to open. This is being sent as
an incremental patch since you said you have already merged up to PATCH 3 (I
still don't see that in git). But if you haven't feel free to fold it.

With this patch, we are fully able to get rid of the default value during
create, and yet test for userns presence very early during creation time. I
think this very much clarifies our handling of the command line option. The
documentation is also changed as you requested, for consistency.

Glauber Costa (7):
  user namespace support for upstream containers
  add user mismatch test
  Also pass cmd_p pointer to container open
  allow local uid and gid to be specified at container creation
  modify tar extraction to account for user namespace
  automatically add bridge venet0 when needed
  allow for distro-specific fix ups at creation time.

 etc/dists/redhat.conf   |   1 +
 etc/dists/scripts/fixups.sh |  43 ++
 include/dist.h  |   2 +
 include/env.h   |  12 +-
 include/res.h   |   6 +
 include/types.h |   1 +
 man/vzctl.8.in  |  17 +++
 scripts/vps-create.in   |  14 ++
 scripts/vps-functions.in|   7 +
 src/lib/Makefile.am |   3 +
 src/lib/chown_preload.c |  93 
 src/lib/create.c|  25 +++-
 src/lib/dist.c  |  10 +-
 src/lib/env.c   |  33 -
 src/lib/exec.c  |   2 +-
 src/lib/hooks_ct.c  | 338 ++--
 src/lib/hooks_vz.c  |   2 +-
 src/vzctl-actions.c |   4 +-
 src/vzctl.c |   1 +
 vzctl.spec  |   2 +-
 20 files changed, 583 insertions(+), 33 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh
 create mode 100644 src/lib/chown_preload.c

-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v3 7/9] modify tar extraction to account for user namespace

2013-05-13 Thread Glauber Costa
On 05/11/2013 03:53 AM, Igor M Podlesny wrote:
 On 30 April 2013 13:16, Glauber Costa glom...@openvz.org wrote:
 From: Glauber Costa glom...@parallels.com
 To work around that, we can employ a trick to allow container creation right
 now, as well as to avoid compatibility problems: we will resort to LD_PRELOAD
 to load a schim that captures calls to the chown family of system calls and
 applies the offset manually.

 Signed-off-by: Glauber Costa glom...@parallels.com
 [...]
 +chown_preload_if_needed()
 +{
 +   [ -z $UID_OFFSET -o -z $GID_OFFSET ]  return
 +
 +   export LD_PRELOAD=libvzchown.so
 
LD_PRELOAD can be stacked (say, LD_PRELOAD=lib1:lib2:lib3) and in
 case it was already set, it's better be kept.
 
Indeed, it is better.

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 3/3] hooks_ct: fix ub limits setting for upstream containers

2013-05-13 Thread Glauber Costa
On 05/11/2013 05:07 AM, Igor M Podlesny wrote:
 On 30 April 2013 11:46, Glauber Costa glom...@openvz.org wrote:
 Currently, vps_setup_res() have an explicit test for state != VPS_STARTING
 before applying beancounter limits. This means that we can set limits without
 further problems when the container is running, but will fail to do so when 
 the
 container is starting. In fact, hooks_vz have an explicit call for it in its
 startup function.

 We should do the same, and call our version of setlimits ourselves when the
 container is coming up.

 Signed-off-by: Glauber Costa glom...@openvz.org
 ---
  src/lib/hooks_ct.c | 147 
 +++--
  1 file changed, 76 insertions(+), 71 deletions(-)

 diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
 index 63af536..3cfb6e9 100644
 --- a/src/lib/hooks_ct.c
 +++ b/src/lib/hooks_ct.c
 @@ -153,6 +153,77 @@ static int _env_create(void *data)
 return exec_container_init(arg, create_param);
  }

 +#define add_value(val, var, mult) do { if (val) { var = *val * mult; } } 
 while (0)
 
Aren't macroses supposed to be UPPER CASE named?
 
Yes, and for a reason: macros behave very differently than functions,
and a convention to allow us to spot them and identify them as macros
very easily is beneficial.

This particular macro is defined right before the function, and
undefined right after it, so it should be extremely clear to every
reader that this is a macro. In this case, I don't see a reason for
being so strict.

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v3 4/9] user namespace support for upstream containers

2013-05-13 Thread Glauber Costa
On 05/11/2013 04:14 AM, Igor M Podlesny wrote:
 On 30 April 2013 13:16, Glauber Costa glom...@openvz.org wrote:
 @@ -576,7 +765,9 @@ int ct_do_open(vps_handler *h, vps_param *param)
  {
 int ret;
 char path[STR_SIZE];
 +   char upath[STR_SIZE];
 struct stat st;
 +   unsigned long *local_uid = param-res.misc.local_uid;

 ret = container_init();
 if (ret) {
 @@ -592,6 +783,9 @@ int ct_do_open(vps_handler *h, vps_param *param)
 if (snprintf(path, sizeof(path), /proc/%d/ns/pid, getpid())  0)
 return VZ_RESOURCE_ERROR;

 +   if (snprintf(upath, sizeof(upath), /proc/%d/ns/user, getpid())  0)
 +   return VZ_RESOURCE_ERROR;
 +
 
It might have sense to compare return code not with 0, but
 sizeof(upath), since ... The functions snprintf() and vsnprintf() do
 not write more than size bytes (including the terminating null byte
 ('\0')). If the output was truncated due to this limit then the return
 value is the number of characters (excluding the terminating null
 byte) which would have been written to the final string if enough
 space had been available. Thus, a return value of size or more means
 that the output was truncated. ...
 

Doesn't matter that much, because this situation is impossible, given
that getpid() will always return a bounded value and we know the rest of
the string. So it either fails, or it succeeds in its fullest.

But that is all the same to me.



___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 3/3] hooks_ct: fix ub limits setting for upstream containers

2013-05-13 Thread Glauber Costa
On 05/13/2013 12:11 PM, Igor M Podlesny wrote:
 On 13 May 2013 16:11, Glauber Costa glom...@parallels.com wrote:
 On 05/13/2013 12:08 PM, Igor M Podlesny wrote:
 On 13 May 2013 15:50, Glauber Costa glom...@parallels.com wrote:
 [...]
Aren't macroses supposed to be UPPER CASE named?

 Yes, and for a reason: macros behave very differently than functions,
 and a convention to allow us to spot them and identify them as macros
 very easily is beneficial.

 This particular macro is defined right before the function, and
 undefined right after it, so it should be extremely clear to every
 reader that this is a macro. In this case, I don't see a reason for
 being so strict.

   As to me it's a nice opportunity to fix this along.

 Fix what?
 
Fix lower_cases exception for MACRO

putting CC back

The way I see it, there is absolutely nothing to fix here.
As I explained above, I believe coding style exists for a reason, and
call for consistency to help one derive things about unknown code easier.

This is a local macro, only exist within the scope of this function, it
is use-once-and-forget, and won't be available anywhere else in the code.

In this particular case, the only difference between the existing and
proposed version is that one is easier to type, and the other is harder.


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 0/3] Fixes for upstream containers

2013-05-10 Thread Glauber Costa
Kir,

Please find attached simple fixes for upstream containers. The first one is a
compiler warning for recently introduced code, while the last two together
fixes bugs with parsing and applying beancounter configuration. Patch #2 fixes
a bug only present in versions compiled with the --without-vz option, while
Patch #3 fixes a bug present in all versions.

Thanks

Glauber Costa (3):
  hooks_ct: fix gcc warning
  ub: compile ub support for non vz kernels as well
  hooks_ct: fix ub limits setting for upstream containers

 include/ub.h|  40 --
 src/lib/Makefile.am |   3 +-
 src/lib/hooks_ct.c  | 149 +++-
 src/lib/ub.c|  13 +++--
 4 files changed, 87 insertions(+), 118 deletions(-)

-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/3] hooks_ct: fix gcc warning

2013-05-10 Thread Glauber Costa
Signed-off-by: Glauber Costa glom...@openvz.org
---
 src/lib/hooks_ct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index 03a18e7..63af536 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -247,7 +247,7 @@ static int ct_enter(vps_handler *h, envid_t veid, const 
char *root, int flags)
if (dp == NULL)
return VZ_RESOURCE_ERROR;
 
-   if (err = container_add_task(veid)) {
+   if ((err = container_add_task(veid))) {
logger(-1, 0, Can't add task creator to container: %s, 
container_error(err));
goto out;
}
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/3] ub: compile ub support for non vz kernels as well

2013-05-10 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

Commit c9d9170b0 fixed a bug by not including the ub functions if VZ support
was not present in the running kernel, and replacing them with empty stubs.

This approach, however, proved to be too aggressive. We need to at least be
able to read and write the ub configuration from our conf files. The solution
proposed in this patch is to rework that so only the actual beancounter
operations are compiled out.

Signed-off-by: Glauber Costa glom...@openvz.org
---
 include/ub.h| 40 
 src/lib/Makefile.am |  3 ++-
 src/lib/ub.c| 13 -
 3 files changed, 10 insertions(+), 46 deletions(-)

diff --git a/include/ub.h b/include/ub.h
index d2779f8..9ffc66b 100644
--- a/include/ub.h
+++ b/include/ub.h
@@ -140,7 +140,6 @@ struct ub_struct {
 };
 typedef struct ub_struct ub_param;
 
-#ifdef VZ_KERNEL_SUPPORTED
 /** Apply UBC resources.
  *
  * @param hCT handler.
@@ -178,43 +177,4 @@ void add_ub_limit(struct ub_struct *ub, int res_id, 
unsigned long *limit);
 void free_ub_param(ub_param *ub);
 void merge_ub(ub_param *dst, ub_param *src);
 int is_vswap_config(const ub_param *param);
-#else /* ! VZ_KERNEL_SUPPORTED */
-static inline int vps_set_ublimit(vps_handler *h, envid_t veid, ub_param *ubc)
-{
-   return VZ_BAD_KERNEL;
-}
-static inline int set_ublimit(vps_handler *h, envid_t veid, ub_param *ubc)
-{
-   return VZ_BAD_KERNEL;
-}
-static inline int check_ub(vps_handler *h, ub_param *ub)
-{
-   return VZ_BAD_KERNEL;
-}
-static inline int add_ub_param(ub_param *ub, ub_res *res)
-{
-   return VZ_BAD_KERNEL;
-}
-static inline int vps_read_ubc(envid_t veid, ub_param *ub)
-{
-   return VZ_BAD_KERNEL;
-}
-static inline int get_ub_resid(char *name)
-{
-   return VZ_BAD_KERNEL;
-}
-static inline void add_ub_limit(struct ub_struct *ub, int res_id, unsigned 
long *limit)
-{
-}
-static inline void free_ub_param(ub_param *ub)
-{
-}
-static inline void merge_ub(ub_param *dst, ub_param *src)
-{
-}
-static inline int is_vswap_config(const ub_param *param)
-{
-   return 0;
-}
-#endif /* VZ_KERNEL_SUPPORTED */
 #endif
diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am
index 630e957..b009d15 100644
--- a/src/lib/Makefile.am
+++ b/src/lib/Makefile.am
@@ -60,6 +60,7 @@ libvzctl_la_SOURCES = bitmap.c \
   util.c \
   veth.c \
   vps_configure.c \
+  ub.c \
   vzfeatures.c
 
 if HAVE_PLOOP
@@ -74,5 +75,5 @@ libvzctl_la_SOURCES += cgroup.c hooks_ct.c
 endif
 
 if HAVE_VZ_KERNEL
-libvzctl_la_SOURCES += ub.c hooks_vz.c
+libvzctl_la_SOURCES += hooks_vz.c
 endif
diff --git a/src/lib/ub.c b/src/lib/ub.c
index f850ee3..0f9e153 100644
--- a/src/lib/ub.c
+++ b/src/lib/ub.c
@@ -33,11 +33,6 @@
 #include vzsyscalls.h
 #include util.h
 
-static inline int setublimit(uid_t uid, unsigned long resource,
-   const unsigned long *rlim)
-{
-   return syscall(__NR_setublimit, uid, resource, rlim);
-}
 
 static struct ubname2id {
char *name;
@@ -164,6 +159,13 @@ int get_ub_resid(char *name)
return -1;
 }
 
+#ifdef VZ_KERNEL_SUPPORTED
+static inline int setublimit(uid_t uid, unsigned long resource,
+   const unsigned long *rlim)
+{
+   return syscall(__NR_setublimit, uid, resource, rlim);
+}
+
 static const char *get_ub_name(unsigned int res_id)
 {
int i;
@@ -226,6 +228,7 @@ if (res != NULL) {  
\
 
return 0;
 }
+#endif
 
 /** Apply UBC resources.
  *
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 3/3] hooks_ct: fix ub limits setting for upstream containers

2013-05-10 Thread Glauber Costa
Currently, vps_setup_res() have an explicit test for state != VPS_STARTING
before applying beancounter limits. This means that we can set limits without
further problems when the container is running, but will fail to do so when the
container is starting. In fact, hooks_vz have an explicit call for it in its
startup function.

We should do the same, and call our version of setlimits ourselves when the
container is coming up.

Signed-off-by: Glauber Costa glom...@openvz.org
---
 src/lib/hooks_ct.c | 147 +++--
 1 file changed, 76 insertions(+), 71 deletions(-)

diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index 63af536..3cfb6e9 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -153,6 +153,77 @@ static int _env_create(void *data)
return exec_container_init(arg, create_param);
 }
 
+#define add_value(val, var, mult) do { if (val) { var = *val * mult; } } while 
(0)
+
+static int ct_setlimits(vps_handler *h, envid_t veid, struct ub_struct *ub)
+{
+   unsigned long tcp = 0;
+   unsigned long kmem = 0;
+   unsigned long kmemall = 0;
+   unsigned long mem = 0;
+   unsigned long swap = 0;
+   int pagesize = sysconf(_SC_PAGESIZE);
+
+   add_value(ub-physpages, mem, pagesize);
+   add_value(ub-tcpsndbuf, tcp, 1);
+   add_value(ub-tcprcvbuf, tcp, 1);
+   add_value(ub-swappages, swap, pagesize);
+
+   /*
+* OpenVZ beancounters traditionally acconted objects. Also, we could
+* always get a very high granularity about which objects we are
+* tracking. Our attempt in this implementation is to translate the
+* historical beancounters into something that makes sense given the
+* underlying Linux infrastructure, and provide something that would
+* allow for more or less the kind of protection the user asked for.  A
+* 1:1 mapping, however, is not possible - and will never be.
+*
+* Upstream Linux cgroup controllers went in a very different
+* direction. First, resources tend to be viewed in its entirety. We
+* have entities like memory, or kernel memory, instead of a list
+* of all internal structures like dentry, siginfo, etc. For network
+* buffers, we can specify the total buffer memory instead of send and
+* receive buffers, etc.
+*
+* Also, all accounting is done in pages, not in objects - which is the
+* only thing that makes sense if the accounting is done in an
+* aggregate manner.  We don't really know the size of those
+* structures, so we use an estimate to get a value in pages. This is
+* not a stable API of the kernel, so it is bound to change.
+*
+* Here is the size in bytes of the following structs, in Linux 3.4:
+*
+* dcache: 248, siginfo: 128, sock: 1072, task 8128
+*/
+   #define DCACHE 248
+   #define SIGINFO 128
+   #define SOCK 1072
+
+   add_value(ub-kmemsize, kmem, 1);
+   add_value(ub-dcachesize, kmemall, DCACHE);
+   add_value(ub-numtcpsock, kmemall, SOCK);
+   add_value(ub-numsiginfo, kmemall, SIGINFO);
+   add_value(ub-numothersock, kmemall, SOCK);
+   add_value(ub-othersockbuf, kmemall, 1);
+   add_value(ub-numproc, kmemall, 2 * pagesize);
+   add_value(ub-dgramrcvbuf, kmemall, SOCK);
+
+   if (mem)
+   container_apply_config(veid, MEMORY, mem);
+   if (tcp)
+   container_apply_config(veid, TCP, tcp);
+
+   kmem = max_ul(kmem, kmemall);
+   if (kmem)
+   container_apply_config(veid, KMEMORY, kmem);
+
+   if (swap)
+   container_apply_config(veid, SWAP, swap);
+
+   return 0;
+}
+#undef add_value
+
 static int ct_env_create(struct arg_start *arg)
 {
 
@@ -187,6 +258,11 @@ static int ct_env_create(struct arg_start *arg)
return VZ_RESOURCE_ERROR;
}
 
+   if ((ret = ct_setlimits(arg-h, arg-veid, arg-res-ub))) {
+   logger(-1, 0, Could not apply container limits: %s, 
container_error(ret));
+   return VZ_RESOURCE_ERROR;
+   }
+
if ((ret = container_add_task(arg-veid))) {
logger(-1, 0, Can't add task creator to container: %s, 
container_error(ret));
return VZ_RESOURCE_ERROR;
@@ -289,77 +365,6 @@ out:
return ret;
 }
 
-#define add_value(val, var, mult) do { if (val) { var = *val * mult; } } while 
(0)
-
-static int ct_setlimits(vps_handler *h, envid_t veid, struct ub_struct *ub)
-{
-   unsigned long tcp = 0;
-   unsigned long kmem = 0;
-   unsigned long kmemall = 0;
-   unsigned long mem = 0;
-   unsigned long swap = 0;
-   int pagesize = sysconf(_SC_PAGESIZE);
-
-   add_value(ub-physpages, mem, pagesize);
-   add_value(ub-tcpsndbuf, tcp, 1);
-   add_value(ub-tcprcvbuf, tcp, 1);
-   add_value(ub-swappages, swap, pagesize

[Devel] [PATCH] cgroups: fix set command with beancounters upstream

2013-05-10 Thread Glauber Costa
The kernel memory controller cannot flip states from unlimited to limited if
there are already tasks in it.  Therefore, we always have to run with *some*
value of kmem enabled. If we don't do it, we can't start unlimited and then use
the set command to set any beancounters. We write the maximum amount minus two
pages, which should effectively mean accounting turned on, but unlimited

Signed-off-by: Glauber Costa glom...@openvz.org
---
 src/lib/cgroup.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/src/lib/cgroup.c b/src/lib/cgroup.c
index 9185d46..ae7fe5c 100644
--- a/src/lib/cgroup.c
+++ b/src/lib/cgroup.c
@@ -56,6 +56,19 @@ static int controller_apply_config(struct cgroup *ct, struct 
cgroup *parent,
} else if (!strcmp(memory, name)) {
if ((ret = cgroup_set_value_string(controller, 
memory.use_hierarchy, 1)))
return ret;
+   /*
+* The kernel memory controller cannot flip states from
+* unlimited to limited if there are already tasks in it.
+* Therefore, we always have to run with *some* value of kmem
+* enabled. If we don't do it, we can't start unlimited and
+* then use the set command to set any beancounters. We write
+* The maximum amount minus two pages, which should effectively
+* mean accounting turned on, but unlimited. This will fail
+* if the kmem controller is not present, but that is okay.
+*/
+   cgroup_set_value_string(controller, 
memory.kmem.limit_in_bytes,
+   9223372036854767712);
+
} else if (!strcmp(devices, name)) {
if ((ret = cgroup_set_value_string(controller, devices.deny, 
a)))
return ret;
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v3 0/9] Upstream Linux support for userns

2013-05-10 Thread Glauber Costa
Kir,

Please review the following patchset. The main difference from last version is 
that
we support running with userns disabled even if it is present. This effectively 
means
that containers that were already created and owned by root will keep working. 
It is
also possible to explicitly disable it at container creation by setting 
local_uid to 0.

There are also some bugfixes and changes according to the review you provided.

Running a container works, and vzctl enter works as well. Most pressing, is
the fact that although this patchset finally implements --ipadd (now all
infrastructure is in place).

Glauber Costa (9):
  host uid and gid parameters
  adjust fs_create parameter
  pass parameters to open
  user namespace support for upstream containers
  add user mismatch test
  allow local uid and gid to be specified at container creation
  modify tar extraction to account for user namespace
  automatically add bridge venet0 when needed
  allow for distro-specific fix ups at creation time.

 etc/dists/redhat.conf   |   1 +
 etc/dists/scripts/fixups.sh |  43 +++
 etc/vz.conf.in  |   4 +
 include/dist.h  |   2 +
 include/env.h   |  10 +-
 include/res.h   |   8 ++
 include/types.h |   1 +
 include/vzctl_param.h   |   3 +
 man/vzctl.8.in  |  16 +++
 scripts/vps-create.in   |  14 +++
 scripts/vps-functions.in|   7 ++
 src/lib/Makefile.am |   3 +
 src/lib/chown_preload.c |  93 ++
 src/lib/config.c|  21 
 src/lib/create.c|  37 --
 src/lib/dist.c  |  10 +-
 src/lib/env.c   |  33 +++--
 src/lib/exec.c  |   2 +-
 src/lib/hooks_ct.c  | 293 ++--
 src/lib/hooks_vz.c  |   2 +-
 src/vzctl-actions.c |   4 +-
 src/vzctl.c |   1 +
 vzctl.spec  |   2 +-
 23 files changed, 577 insertions(+), 33 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh
 create mode 100644 src/lib/chown_preload.c

-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v3 1/9] host uid and gid parameters

2013-05-10 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

When running with an upstream Linux kernel that supports user namespaces,
we will run the container using an unprivileged user in the system. That
can be any user, and it serves as base to a 1:1 mapping between users in
the container and users in the host.

By default, the value 10 will be used for both uid and gid.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 etc/vz.conf.in|  4 
 include/res.h |  2 ++
 include/vzctl_param.h |  3 +++
 src/lib/config.c  | 21 +
 4 files changed, 30 insertions(+)

diff --git a/etc/vz.conf.in b/etc/vz.conf.in
index 07da9c7..de240ac 100644
--- a/etc/vz.conf.in
+++ b/etc/vz.conf.in
@@ -39,6 +39,10 @@ DEF_OSTEMPLATE=centos-5
 ## Filesystem layout for new CTs: either simfs (default) or ploop
 #VE_LAYOUT=ploop
 
+# User namespace configuration
+LOCAL_UID=10
+LOCAL_GID=10
+
 ## Load vzwdog module
 VZWDOG=no
 
diff --git a/include/res.h b/include/res.h
index 7060aef..0dfacf7 100644
--- a/include/res.h
+++ b/include/res.h
@@ -56,6 +56,8 @@ typedef struct {
int onboot;
unsigned long *bootorder;
int wait;
+   unsigned long *local_uid;
+   unsigned long *local_gid;
 } misc_param;
 
 struct mod_action;
diff --git a/include/vzctl_param.h b/include/vzctl_param.h
index 99eb655..cb5ef70 100644
--- a/include/vzctl_param.h
+++ b/include/vzctl_param.h
@@ -150,5 +150,8 @@
 #define PARAM_MOUNT_OPTS   396
 
 
+#define PARAM_LOCAL_UID397
+#define PARAM_LOCAL_GID398
+
 #define PARAM_LINE e:p:f:t:i:l:k:a:b:n:x:h
 #endif
diff --git a/src/lib/config.c b/src/lib/config.c
index 617df6a..cb0bfe0 100644
--- a/src/lib/config.c
+++ b/src/lib/config.c
@@ -128,6 +128,10 @@ static vps_config config[] = {
 {CONFIGFILE, NULL, PARAM_CONFIG},
 {ORIGIN_SAMPLE,NULL,PARAM_CONFIG_SAMPLE},
 {DISABLED,   NULL, PARAM_DISABLED},
+#ifdef HAVE_UPSTREAM
+{LOCAL_UID,  NULL, PARAM_LOCAL_UID},
+{LOCAL_GID,  NULL, PARAM_LOCAL_GID},
+#endif
 /* quota */
 {DISK_QUOTA, NULL, PARAM_DISK_QUOTA},
 {DISKSPACE,  NULL, PARAM_DISKSPACE},
@@ -1364,6 +1368,12 @@ static int store_misc(vps_param *old_p, vps_param 
*vps_p, vps_config *conf,
ret = conf_store_strlist(conf_h, conf-name,
misc-searchdomain, 0);
break;
+   case PARAM_LOCAL_UID:
+   ret = conf_store_ulong(conf_h, conf-name, misc-local_uid);
+   break;
+   case PARAM_LOCAL_GID:
+   ret = conf_store_ulong(conf_h, conf-name, misc-local_gid);
+   break;
}
return ret;
 }
@@ -1989,6 +1999,13 @@ static int parse(envid_t veid, vps_param *vps_p, char 
*val, int id)
case PARAM_IPTABLES:
ret = parse_iptables(vps_p-res.env, val);
break;
+
+   case PARAM_LOCAL_UID:
+   conf_parse_ulong(vps_p-res.misc.local_uid, val);
+   break;
+   case PARAM_LOCAL_GID:
+   conf_parse_ulong(vps_p-res.misc.local_gid, val);
+   break;
case PARAM_LOCKEDPAGES:
case PARAM_PRIVVMPAGES:
case PARAM_SHMPAGES:
@@ -2694,6 +2711,8 @@ static void free_misc(misc_param *misc)
FREE_P(misc-hostname)
FREE_P(misc-description)
FREE_P(misc-bootorder)
+   FREE_P(misc-local_uid)
+   FREE_P(misc-local_gid)
 }
 
 static void free_net(net_param *net)
@@ -2862,6 +2881,8 @@ static void merge_misc(misc_param *dst, misc_param *src)
MERGE_STR(hostname)
MERGE_STR(description)
MERGE_INT(onboot)
+   MERGE_P(local_uid)
+   MERGE_P(local_gid)
MERGE_P(bootorder)
MERGE_INT(wait)
 }
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v3 2/9] adjust fs_create parameter

2013-05-10 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

We need to pass more information to fs_create. Instead of adding arguments, it
is preferred to pass the whole vps_p structure and unfold it inside the callee.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 src/lib/create.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/lib/create.c b/src/lib/create.c
index 9a0d43e..2fd9314 100644
--- a/src/lib/create.c
+++ b/src/lib/create.c
@@ -103,8 +103,7 @@ static void cleanup_destroy_ve(void *data)
vps_destroy_dir(d-veid, d-private);
 }
 
-static int fs_create(envid_t veid, fs_param *fs, tmpl_param *tmpl,
-   dq_param *dq, int layout, int ploop_mode)
+static int fs_create(envid_t veid, vps_handler *h, vps_param *vps_p)
 {
char tarball[PATH_LEN];
char tmp_dir[PATH_LEN];
@@ -117,6 +116,10 @@ static int fs_create(envid_t veid, fs_param *fs, 
tmpl_param *tmpl,
char *dst;
const char *ext[] = { , .gz, .bz2, .xz, NULL };
const char *errmsg_ext = [.gz|.bz2|.xz];
+   dq_param *dq = vps_p-res.dq;
+   int layout = vps_p-opt.layout;
+   fs_param *fs = vps_p-res.fs;
+   tmpl_param *tmpl = vps_p-res.tmpl;
int ploop = (layout == VE_LAYOUT_PLOOP);
struct destroy_ve ddata;
struct vzctl_cleanup_handler *ch;
@@ -173,6 +176,7 @@ find:
/* Create and mount ploop image */
struct vzctl_create_image_param param = {};
struct vzctl_mount_param mount_param = {};
+   int ploop_mode = vps_p-opt.mode;
 
if (ploop_mode  0)
ploop_mode = PLOOP_EXPANDED_MODE;
@@ -388,9 +392,7 @@ int vps_create(vps_handler *h, envid_t veid, vps_param 
*vps_p, vps_param *cmd_p,
tmpl-ostmpl = full_ostmpl;
}
}
-   ret = fs_create(veid, fs, tmpl, vps_p-res.dq,
-   vps_p-opt.layout,
-   vps_p-opt.mode);
+   ret = fs_create(veid, h, vps_p);
if (ret)
goto err_root;
}
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v3 3/9] pass parameters to open

2013-05-10 Thread Glauber Costa
Upstream containers running on kernels that support user namespaces would
benefit from being able to have early access to the container configuration.
This is because we would like user namespaces support to be either enabled or
disabled, and a lot of the actions we take can be potentially affected by this
configuration switch.

We do that by bundling it into vz_open, where we traditionally are setting all
the namespace related flags (based solely on the presence of it in the system).

The open handler in hooks_vz technically does not need to be modified. But I am
doing it out of consistency.

Signed-off-by: Glauber Costa glom...@openvz.org
---
 include/env.h   | 6 +++---
 src/lib/env.c   | 7 ---
 src/lib/hooks_ct.c  | 2 +-
 src/lib/hooks_vz.c  | 2 +-
 src/vzctl-actions.c | 2 +-
 5 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/env.h b/include/env.h
index 1628bbf..06729f6 100644
--- a/include/env.h
+++ b/include/env.h
@@ -50,7 +50,7 @@ enum {
  * @param veid CT ID.
  * @return handler or NULL on error.
  */
-vps_handler *vz_open(envid_t veid);
+vps_handler *vz_open(envid_t veid, vps_param *param);
 
 /** Close CT handler.
  *
@@ -125,6 +125,6 @@ int exec_container_init(struct arg_start *arg,
struct env_create_param3 *create_param);
 
 int set_personality32();
-int vz_do_open(vps_handler *h);
-int ct_do_open(vps_handler *h);
+int vz_do_open(vps_handler *h, vps_param *param);
+int ct_do_open(vps_handler *h, vps_param *param);
 #endif
diff --git a/src/lib/env.c b/src/lib/env.c
index 7e9a923..898bfc8 100644
--- a/src/lib/env.c
+++ b/src/lib/env.c
@@ -72,9 +72,10 @@ static int reset_std()
 /** Allocate and initialize CT handler.
  *
  * @param veid CT ID.
+ * @param paramCT parameters.
  * @return handler or NULL on error.
  */
-vps_handler *vz_open(envid_t veid)
+vps_handler *vz_open(envid_t veid, vps_param *param)
 {
vps_handler *h = NULL;
int ret = -1;
@@ -88,7 +89,7 @@ vps_handler *vz_open(envid_t veid)
 #ifdef VZ_KERNEL_SUPPORTED
/* Find out if we are under OpenVZ or upstream kernel */
if (stat_file(/proc/vz))
-   ret = vz_do_open(h);
+   ret = vz_do_open(h, param);
else
 #endif
{
@@ -96,7 +97,7 @@ vps_handler *vz_open(envid_t veid)
non-OpenVZ kernel);
h-vzfd = -1;
 #ifdef HAVE_UPSTREAM
-   ret = ct_do_open(h);
+   ret = ct_do_open(h, param);
 #else
logger(-1, 0, No suitable kernel support compiled in);
 #endif
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index 3cfb6e9..f24d91e 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -572,7 +572,7 @@ static int ct_setcontext(envid_t veid)
return 0;
 }
 
-int ct_do_open(vps_handler *h)
+int ct_do_open(vps_handler *h, vps_param *param)
 {
int ret;
char path[STR_SIZE];
diff --git a/src/lib/hooks_vz.c b/src/lib/hooks_vz.c
index afbcd0f..50770c0 100644
--- a/src/lib/hooks_vz.c
+++ b/src/lib/hooks_vz.c
@@ -466,7 +466,7 @@ static int vz_veth_ctl(vps_handler *h, envid_t veid, int 
op, veth_dev *dev)
return ret;
 }
 
-int vz_do_open(vps_handler *h)
+int vz_do_open(vps_handler *h, vps_param *param)
 {
if ((h-vzfd = open(VZCTLDEV, O_RDWR))  0) {
logger(-1, errno, Unable to open %s, VZCTLDEV);
diff --git a/src/vzctl-actions.c b/src/vzctl-actions.c
index e0cd261..4627043 100644
--- a/src/vzctl-actions.c
+++ b/src/vzctl-actions.c
@@ -1471,7 +1471,7 @@ int run_action(envid_t veid, act_t action, vps_param 
*g_p, vps_param *vps_p,
int ret = 0, lock_id = -1;
struct sigaction act;
 
-   if ((h = vz_open(veid)) == NULL) {
+   if ((h = vz_open(veid, g_p)) == NULL) {
/* Accept to run set --save --force on any kernel,
 * otherwise error out if initialization failed
 */
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v3 4/9] user namespace support for upstream containers

2013-05-10 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

This patch allows the execution of unprivileged containers running ontop
of an upstream Linux Kernel. We will run at whatever UID is found in the
configuration file (so far empty, thus disabled).

Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/env.h  |   1 +
 include/types.h|   1 +
 src/lib/hooks_ct.c | 216 -
 3 files changed, 214 insertions(+), 4 deletions(-)

diff --git a/include/env.h b/include/env.h
index 06729f6..4fef438 100644
--- a/include/env.h
+++ b/include/env.h
@@ -116,6 +116,7 @@ struct arg_start {
vps_handler *h;
void *data;
env_create_FN fn;
+   int userns_p; /* while running in userns, there's extra sync needed */
 };
 
 struct env_create_param3;
diff --git a/include/types.h b/include/types.h
index a4bce73..fa511aa 100644
--- a/include/types.h
+++ b/include/types.h
@@ -95,6 +95,7 @@ typedef struct vps_handler {
int vzfd;   /** /dev/vzctl file descriptor. */
int stdfd;
int can_join_pidns; /* can't enter otherwise */
+   int can_join_userns; /* can't run non privileged otherwise */
int (*is_run)(struct vps_handler *h, envid_t veid);
int (*enter)(struct vps_handler *h, envid_t veid, const char *root, int 
flags);
int (*destroy)(struct vps_handler *h, envid_t veid);
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index f24d91e..b0cb359 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -66,6 +66,8 @@ static int sys_setns(int fd, int nstype)
 # define MS_PRIVATE (1  18)
 #endif
 
+#define UID_GID_RANGE 10 /* how many users per container */
+
 /* This function is there in GLIBC, but not in headers */
 extern int pivot_root(const char * new_root, const char * put_old);
 
@@ -138,10 +140,41 @@ static int _env_create(void *data)
struct env_create_param3 create_param;
int ret;
 
-   if ((ret = ct_chroot(arg-res-fs.root)))
+   if ((arg-userns_p != -1)  (read(arg-userns_p, ret, sizeof(ret)) == 
0))
+   return -1;
+
+   ret = ct_chroot(arg-res-fs.root);
+   close(arg-userns_p);
+   /* Probably means chroot failed */
+   if (ret)
return ret;
 
-   if ((ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 1)))
+   if (arg-h-can_join_userns) {
+   int fd;
+   setuid(0);
+   setgid(0);
+   /* 
+* We need the special flag newinstance. This is a requirement
+* of the userns-aware implementation of devpts as of Linux 3.9.
+* Because of that special requirement, we do it here rather 
than
+* later.
+*/
+   mount(devpts, /dev/pts, devpts, 0, newinstance);
+   /* /dev/ptmx, if it even exists, would refer to the root ptmx.
+* We don't want that, we want our newly created instance to 
contain
+* all ptys. So we bind mount the root device here
+*/
+   fd = open(/dev/ptmx, O_RDWR|O_CREAT, 0);
+   close(fd);
+   mount(/dev/pts/ptmx, /dev/ptmx, , MS_BIND, 0);
+   }
+
+   /*
+* If we are using the user namespace, we will have the full capability
+* set in the target namespace. So we don't need any of that.
+*/
+   if (!arg-h-can_join_userns 
+   (ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 
1)))
return ret;
 
fill_container_param(arg, create_param);
@@ -224,6 +257,88 @@ static int ct_setlimits(vps_handler *h, envid_t veid, 
struct ub_struct *ub)
 }
 #undef add_value
 
+static int write_uid_gid_mapping(vps_handler *h, unsigned long uid, unsigned 
long gid, pid_t pid)
+{
+   char buf[STR_SIZE];
+   char map[STR_SIZE];
+   int fd;
+   int len;
+   int ret;
+
+   len = snprintf(map, sizeof(map), 0 %ld %d, uid, UID_GID_RANGE);
+   snprintf(buf, sizeof(buf), /proc/%d/uid_map, pid);
+   if ((fd = open(buf, O_WRONLY))  0)
+   goto out;
+
+   if ((write(fd, map, len)  0))
+   goto out;
+
+   close(fd);
+
+   len = snprintf(map, sizeof(map), 0 %ld %d, gid, UID_GID_RANGE);
+   snprintf(buf, sizeof(map), /proc/%d/gid_map, pid);
+   if ((fd = open(buf, O_WRONLY))  0)
+   goto out;
+
+   if ((write(fd, map, len)  0))
+   goto out;
+   ret = 0;
+out:
+   if (fd  0)
+   close(fd);
+   return ret;
+}
+
+/*
+ * Those devices should exist in the container, and be valid device nodes with
+ * user access permission. But we need to be absolutely sure this is the case,
+ * so we will provide our own versions. That could actually happen since some
+ * distributions may come with emptied /dev's, waiting for udev to populate 
them.
+ * That won't happen, we do it ourselves.
+ */
+static void create_devices

[Devel] [PATCH v3 5/9] add user mismatch test

2013-05-10 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

In theory, we won't be able to run if our private area is not owned by
ourselves.  We could, if it have very wide open security permissions, but we
should never set up a container like that.

Aside from a basic sanity check, this is intended to catch problems for the few
people who may have already created containers that will be owned by root:root,
and will now try to run it unprivileged.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 src/lib/env.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/lib/env.c b/src/lib/env.c
index 898bfc8..96400c0 100644
--- a/src/lib/env.c
+++ b/src/lib/env.c
@@ -30,6 +30,7 @@
 #include linux/reboot.h
 #include sys/mount.h
 #include sys/utsname.h
+#include sys/stat.h
 
 #include vzerror.h
 #include res.h
@@ -554,6 +555,21 @@ int vps_start_custom(vps_handler *h, envid_t veid, 
vps_param *param,
logger(-1, 0, Container is already running);
return VZ_VE_RUNNING;
}
+   if (!is_vz_kernel(h)  h-can_join_userns) {
+   struct stat private_stat;
+   unsigned long *local_uid = res-misc.local_uid;
+   unsigned long *local_gid = res-misc.local_gid;
+
+   stat(res-fs.private, private_stat);
+   if ((local_uid  (private_stat.st_uid != *local_uid)) ||
+   (local_gid  (private_stat.st_gid != *local_gid))) {
+   logger(-1, 0, Container private area is owned by %d:%d
+   , but configuration file says we should run with 
%lu:%lu.\n
+   Refusing to run., private_stat.st_uid, 
private_stat.st_gid,
+   *res-misc.local_uid, *res-misc.local_gid);
+   return VZ_FS_BAD_TMPL;
+   }
+   }
if ((ret = check_ub(h, res-ub)))
return ret;
 
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v3 7/9] modify tar extraction to account for user namespace

2013-05-10 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

If we are running upstream with user namespaces, we need to create the
container filesystem not with the ownership preserved, but reflecting the
mapping we need to apply. Note that according to our documentation, we should
ignore this if the user explicitly requested an uid mapping of 0 (gid is
ignored in this case).

Our tooling doesn't allow any easy way to unpack a whol distribution with
offsets mechanically applied like this. We could do the whole unpacking in a
user namespace itself, but that does not come without problems on its own (for
instance, we won't be able to create any device files, we have to carefully
adjust permissions in the root directory, etc)

To work around that, we can employ a trick to allow container creation right
now, as well as to avoid compatibility problems: we will resort to LD_PRELOAD
to load a schim that captures calls to the chown family of system calls and
applies the offset manually.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/res.h   |  6 
 scripts/vps-create.in   | 14 
 src/lib/Makefile.am |  3 ++
 src/lib/chown_preload.c | 93 +
 src/lib/create.c| 25 ++---
 vzctl.spec  |  2 +-
 6 files changed, 137 insertions(+), 6 deletions(-)
 create mode 100644 src/lib/chown_preload.c

diff --git a/include/res.h b/include/res.h
index 0dfacf7..047593a 100644
--- a/include/res.h
+++ b/include/res.h
@@ -47,6 +47,12 @@ struct env_param {
 };
 typedef struct env_param env_param_t;
 
+/*
+ * When running upstream kernels, we will need host-side UID and GID. Those
+ * are configurable, but if none is specified, those are used.
+ */
+#define VZ_DEFAULT_UID 10
+#define VZ_DEFAULT_GID 10
 typedef struct {
list_head_t userpw;
list_head_t nameserver;
diff --git a/scripts/vps-create.in b/scripts/vps-create.in
index 126f048..a1b3382 100755
--- a/scripts/vps-create.in
+++ b/scripts/vps-create.in
@@ -22,11 +22,22 @@
 # Required parameters:
 #   VE_PRVT- path to root of CT private areas
 #   PRIVATE_TEMPLATE   - path to private template used as a source for copying
+#
+# Optional parameters:
+#   UID_OFFSET - offset to be added to all tar UIDs
+#   GID_OFFSET - offset to be added to all tar GIDs
 
 . @SCRIPTDIR@/vps-functions
 
 vzcheckvar VE_PRVT PRIVATE_TEMPLATE
 
+chown_preload_if_needed()
+{
+   [ -z $UID_OFFSET -o -z $GID_OFFSET ]  return 
+
+   export LD_PRELOAD=libvzchown.so
+}
+
 create_prvt()
 {
local TMP AVAIL NEEDED HEADER OPT
@@ -75,6 +86,9 @@ create_prvt()
[ $AVAIL -ge $NEEDED ] ||
vzerror Insufficient disk space in $VE_PRVT; available: 
$AVAIL, needed: $NEEDED ${VZ_FS_NO_DISK_SPACE}
CAT=cat
+
+   chown_preload_if_needed
+
# Use pv to show nice progress bar if we can
pv -V /dev/null 21  CAT=pv
chmod 700 $VE_PRVT
diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am
index b009d15..34c463b 100644
--- a/src/lib/Makefile.am
+++ b/src/lib/Makefile.am
@@ -72,6 +72,9 @@ libvzctl_la_LIBADD = $(XML_LIBS) $(CGROUP_LIBS) $(DL_LIBS)
 
 if HAVE_CGROUP
 libvzctl_la_SOURCES += cgroup.c hooks_ct.c
+
+lib_LTLIBRARIES += libvzchown.la
+libvzchown_la_SOURCES = chown_preload.c
 endif
 
 if HAVE_VZ_KERNEL
diff --git a/src/lib/chown_preload.c b/src/lib/chown_preload.c
new file mode 100644
index 000..4e2be4a
--- /dev/null
+++ b/src/lib/chown_preload.c
@@ -0,0 +1,93 @@
+/*
+ *  Copyright (C) 2013, Parallels, Inc. All rights reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *  Authors: Kir Kolyshkin and Glauber Costa
+ */
+
+#include unistd.h
+#include stdio.h
+#include stdlib.h
+
+#include dlfcn.h
+
+static int (*real_chown)(const char *path, uid_t owner, gid_t group) = NULL;
+static int (*real_fchown)(int fd, uid_t owner, gid_t group) = NULL;
+static int (*real_lchown)(const char *path, uid_t owner, gid_t group) = NULL;
+static int (*real_fchownat)(int dirfd, const char *pathname,
+   uid_t owner, gid_t group, int flags) = NULL;
+
+uid_t uid_offset = 0;
+gid_t gid_offset = 0;
+
+static void __init(void)
+{
+   char *uidstr, *gidstr;
+
+   uidstr = getenv(UID_OFFSET);
+   gidstr

[Devel] [PATCH v3 8/9] automatically add bridge venet0 when needed

2013-05-10 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

The chosen architecture to deal with --ipadd with upstream containers is to
create a veth pair and add the host side information to a bridge called venet0.
This way, all the code that expects venet0 to exist can still work without
modifications, (or with just a few).

Our intention to do that was actually already stated in the comments, but the
code was removed before merging because --ipadd would not work without full
unshare support anyway.

This patch implements that.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 scripts/vps-functions.in |  7 +++
 src/lib/hooks_ct.c   | 37 +++--
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/scripts/vps-functions.in b/scripts/vps-functions.in
index 826c0a1..37b2de5 100755
--- a/scripts/vps-functions.in
+++ b/scripts/vps-functions.in
@@ -170,6 +170,13 @@ vzadjustmacs()
 # other setups, the bridge is expected to already exist and be valid.
 vzconfbridge()
 {
+   if [ x$BRIDGE == xvenet0 ]; then
+   if [ $(brctl show venet0 2/dev/null | tail -n+2 | wc -l) == 0 
]; then
+   brctl addbr venet0
+   ${IP_CMD} link set venet0 up
+   fi
+   fi
+   
if [ x$BRIDGE != x ]; then
brctl addif $BRIDGE $HNAME /dev/null 21
fi
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index a791934..2d195a5 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -17,6 +17,7 @@
 #include logger.h
 #include script.h
 #include cgroup.h
+#include linux/vzctl_venet.h
 
 #define NETNS_RUN_DIR /var/run/netns
 
@@ -695,8 +696,40 @@ static int ct_netdev_ctl(vps_handler *h, envid_t veid, int 
op, char *name)
 
 static int ct_ip_ctl(vps_handler *h, envid_t veid, int op, const char *ipstr)
 {
-   logger(-1, 0, %s not yet supported upstream, __func__);
-   return 0;
+   int ret = -1;
+   char *envp[5];
+   char buf[STR_SIZE];
+   int i = 0;
+
+   if (!h-can_join_pidns) {
+   logger(-1, 0, Cannot join pid namespace: 
+   --ipadd is not supported in kernels without full pidns 
support);
+   return VZ_BAD_KERNEL;
+   }
+   envp[i++] = strdup(VNAME=venet0);
+   envp[i++] = strdup(BRIDGE=venet0);
+
+   snprintf(buf, sizeof(buf), HNAME=venet0.%d, veid);
+   envp[i++] = strdup(buf);
+
+   snprintf(buf, sizeof(buf), VEID=%d, veid);
+   envp[i++] = strdup(buf);
+
+   envp[i] = NULL;
+
+   if (op == VE_IP_ADD) {
+   char *argv[] = { VPS_NETNS_DEV_ADD, NULL };
+
+   ret = run_script(VPS_NETNS_DEV_ADD, argv, envp, 0);
+   } else  {
+   char *argv[] = { VPS_NETNS_DEV_DEL, NULL };
+
+   ret = run_script(VPS_NETNS_DEV_DEL, argv, envp, 0);
+   }
+   free_arg(envp);
+
+   return ret;
+
 }
 
 /*
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v3 9/9] allow for distro-specific fix ups at creation time.

2013-05-10 Thread Glauber Costa
From: Glauber Costa glom...@parallels.com

We will need that infrastucture when running with Linux upstream, since some
support is very unlikely to ever land in the Kernel.  We need to do things like
account for the fact that udev may kick in and destroy all the setup we have
done for /dev.  Since preemptively preventing those actions to happen is very
hard ( as an example, centos6 init binary will issue mount syscalls itself),
the most robust approach is to insert a script that will be run shortly after
rc.sysinit. Although this can't guarantee that it will run *immediately* after,
it should be early enough to undo every harmful operation done by /sbin/init
and rc.sysinit, therefore allowing operation to continue freely.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 etc/dists/redhat.conf   |  1 +
 etc/dists/scripts/fixups.sh | 43 +++
 include/dist.h  |  2 ++
 include/env.h   |  3 ++-
 src/lib/dist.c  | 10 +-
 src/lib/env.c   | 10 ++
 src/lib/exec.c  |  2 +-
 src/lib/hooks_ct.c  | 38 ++
 8 files changed, 102 insertions(+), 7 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh

diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf
index 727461d..49226c5 100644
--- a/etc/dists/redhat.conf
+++ b/etc/dists/redhat.conf
@@ -25,3 +25,4 @@ SET_DNS=set_dns.sh
 SET_USERPASS=set_userpass.sh
 SET_UGID_QUOTA=set_ugid_quota.sh
 POST_CREATE=postcreate.sh
+CT_FIXUP=fixups.sh
diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh
new file mode 100755
index 000..da3fd24
--- /dev/null
+++ b/etc/dists/scripts/fixups.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+#  Copyright (C) 2000-2009, Parallels, Inc. All rights reserved.
+#
+#  This program is free software; you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation; either version 2 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program; if not, write to the Free Software
+#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+#
+#
+#  Run-time fixup for legacy distributions that will rely on udevd to populate
+#  their /dev. Those distributions will effectively mount an empty tmpfs on top
+#  of /dev, defeating any setup we may have done. This includes RHEL-based
+#  distributions up to RHEL6. We are better off not allowing that to run at all
+
+function fixup_udev()
+{
+   if [ -f /etc/fedora-release ]; then
+   return;
+   fi
+
+   # All the first two may fail if the distribution didn't actually mount
+   # then.
+   umount /dev/pts
+   umount /dev/shm
+   # still may hold references to files open by rc.sysinit. Even if this 
is the
+   # case those will be simple things like /dev/null and should go away 
shortly.
+   # let us not fail because of that.
+   umount /dev -l
+}
+
+fixup_udev
+
+exit 0
+
diff --git a/include/dist.h b/include/dist.h
index 1f6a5a6..4faa452 100644
--- a/include/dist.h
+++ b/include/dist.h
@@ -29,6 +29,7 @@
 #defineSET_USERPASS5
 #defineSET_UGID_QUOTA  6
 #definePOST_CREATE 7
+#defineCT_FIXUP8
 
 typedef struct {
char *def_ostmpl;
@@ -46,6 +47,7 @@ typedef struct dist_actions {
char *set_userpass; /** setup user password. */
char *set_ugid_quota;   /** setup 2level quota. */
char *post_create;  /** sostcreate actions. */
+   char *ct_fixup; /** run-time fixups. */
 } dist_actions;
 
 /* Read distribution specific actions configuration file.
diff --git a/include/env.h b/include/env.h
index 4fef438..b10dfd9 100644
--- a/include/env.h
+++ b/include/env.h
@@ -104,7 +104,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param 
*param, int stop_mode,
 int vz_chroot(const char *root);
 int vz_env_create(vps_handler *h, envid_t veid, vps_res *res,
int wait_p[2], int old_wait_p[2], int err_p[2],
-   env_create_FN fn, void *data);
+   env_create_FN fn, dist_actions *actions, void *data);
 
 struct vps_res;
 struct arg_start {
@@ -116,6 +116,7 @@ struct arg_start {
vps_handler *h;
void *data;
env_create_FN fn;
+   dist_actions *actions;
int userns_p; /* while running in userns, there's extra sync needed */
 };
 
diff --git a/src/lib/dist.c b/src/lib/dist.c
index bde94ab..8047cf7 100644
--- a/src/lib/dist.c
+++ b/src/lib/dist.c
@@ -36,7 +36,8

Re: [Devel] [PATCH] cgroups: fix set command with beancounters upstream

2013-04-30 Thread Glauber Costa
On 04/30/2013 01:48 PM, Kir Kolyshkin wrote:
 On 04/29/2013 10:12 PM, Glauber Costa wrote:
 The kernel memory controller cannot flip states from unlimited to
 limited if
 there are already tasks in it.  Therefore, we always have to run with
 *some*
 value of kmem enabled. If we don't do it, we can't start unlimited and
 then use
 the set command to set any beancounters. We write the maximum amount
 minus two
 pages, which should effectively mean accounting turned on, but
 unlimited
 
 Alternatively, we can require CT restart for the change to take effect.
 We do that for a few options, see check_set_mode().
 
 But this is probably OK too.

Being all the same to you, I prefer this way. This is not really about
container running, but rather cgroups have tasks. Doing it on
restart would still mandate us to always be careful about when to set up
cgroup limits and always do that before we put any tasks into it. Of
course we can do that, but not having that limitation makes us more
robust IMHO. We also already have a lot of non-limit related stuff that
we have to do during creation anyway (like setting up cpusets and memory
hierarchy), so it just fits together nicely

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v2 2/8] adjust fs_create parameter

2013-04-01 Thread Glauber Costa
On 04/01/2013 07:44 PM, Kir Kolyshkin wrote:
 On 04/01/2013 08:37 AM, Glauber Costa wrote:
 On 04/01/2013 07:13 PM, Dmitry Guryanov wrote:
 On 130401 19:04:37, Glauber Costa wrote:
 On 04/01/2013 06:58 PM, Dmitry Guryanov wrote:
 On 130322 14:48:16, Glauber Costa wrote:
 We need to pass more information to fs_create. Instead of adding
 arguments, it
 is preferred to pass the whole vps_p structure and unfold it
 inside the callee.

 Can't apply this patch :(

 This patch is rather old. Isn't it already present in Kir's ?
 Kir's tree certainly have all the lastest upstream linux work, so it
 should either have it, or a modified version of it.

 No, it isn't. Latest commit there is two weeks ago. Could you please
 tell,
 where i can find up-to-date vzctl repository ?

 I am checking here, and this patch never made its way to the repository.
 We settled in a slightly different solution for it. Why do you need to
 apply it ?
 
 I guess Dmitry wants to test userns support... which is not in yet.
Yes, but it has nothing to do with that patch. Dmitry, maybe you are
looking at the wrong series ?
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 0/8] upstream Linux support for userns

2013-03-22 Thread Glauber Costa
Kir,

Please review the following patchset. It implements support for userns inside
containers. Running a container works, and vzctl enter works as well. There
are still some caveats that I intend to tackle in the upcoming weeks. Most
pressing, is the fact that although this patchset finally implements --ipadd
(now all infrastructure is in place), we can ssh into containers due to issues
related to the proc filesystem.


Let me know if there are any issues, I'll happily fix them.

Glauber Costa (8):
  host uid and gid parameters
  adjust fs_create parameter
  user namespace support for upstream containers
  modify tar extraction to account for user namespace
  add user mismatch test
  allow local uid and gid to be specified at container creation
  automatically add bridge venet0 when needed
  allow for distro-specific fix ups at creation time.

 etc/dists/redhat.conf   |   1 +
 etc/dists/scripts/fixups.sh |  43 +++
 include/dist.h  |   2 +
 include/env.h   |   4 +-
 include/res.h   |   8 ++
 include/types.h |   1 +
 include/vzctl_param.h   |   3 +
 man/vzctl.8.in  |  14 +++
 scripts/vps-create.in   |  19 
 scripts/vps-functions.in|   7 ++
 src/lib/Makefile.am |   3 +
 src/lib/chown_preload.c |  93 +++
 src/lib/config.c|  32 ++
 src/lib/create.c|  30 +++--
 src/lib/dist.c  |  10 +-
 src/lib/env.c   |  23 +++-
 src/lib/exec.c  |   2 +-
 src/lib/hooks_ct.c  | 269 ++--
 src/vzctl-actions.c |   2 +
 src/vzctl.c |   1 +
 vzctl.spec  |   2 +-
 21 files changed, 544 insertions(+), 25 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh
 create mode 100644 src/lib/chown_preload.c

-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 1/8] host uid and gid parameters

2013-03-22 Thread Glauber Costa
When running with an upstream Linux kernel that supports user namespaces,
we will run the container using an unprivileged user in the system. That
can be any user, and it serves as base to a 1:1 mapping between users in
the container and users in the host.

By default, the value 10 will be used for both uid and gid.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/res.h |  8 
 include/vzctl_param.h |  3 +++
 src/lib/config.c  | 32 
 3 files changed, 43 insertions(+)

diff --git a/include/res.h b/include/res.h
index e07060d..47c36da 100644
--- a/include/res.h
+++ b/include/res.h
@@ -47,6 +47,12 @@ struct env_param {
 };
 typedef struct env_param env_param_t;
 
+/*
+ * When running upstream kernels, we will need host-side UID and GID. Those
+ * are configurable, but if none is specified, those are used.
+ */
+#define VZ_DEFAULT_UID 10
+#define VZ_DEFAULT_GID 10
 typedef struct {
list_head_t userpw;
list_head_t nameserver;
@@ -56,6 +62,8 @@ typedef struct {
int onboot;
unsigned long *bootorder;
int wait;
+   unsigned long *local_uid;
+   unsigned long *local_gid;
 } misc_param;
 
 struct mod_action;
diff --git a/include/vzctl_param.h b/include/vzctl_param.h
index 3034a13..e583540 100644
--- a/include/vzctl_param.h
+++ b/include/vzctl_param.h
@@ -149,5 +149,8 @@
 #define PARAM_MOUNT_OPTS   396
 
 
+#define PARAM_LOCAL_UID397
+#define PARAM_LOCAL_GID398
+
 #define PARAM_LINE e:p:f:t:i:l:k:a:b:n:x:h
 #endif
diff --git a/src/lib/config.c b/src/lib/config.c
index cf5f352..cc4c1cc 100644
--- a/src/lib/config.c
+++ b/src/lib/config.c
@@ -128,6 +128,10 @@ static vps_config config[] = {
 {CONFIGFILE, NULL, PARAM_CONFIG},
 {ORIGIN_SAMPLE,NULL,PARAM_CONFIG_SAMPLE},
 {DISABLED,   NULL, PARAM_DISABLED},
+#ifdef HAVE_UPSTREAM
+{LOCAL_UID,  NULL, PARAM_LOCAL_UID},
+{LOCAL_GID,  NULL, PARAM_LOCAL_GID},
+#endif
 /* quota */
 {DISK_QUOTA, NULL, PARAM_DISK_QUOTA},
 {DISKSPACE,  NULL, PARAM_DISKSPACE},
@@ -1363,6 +1367,12 @@ static int store_misc(vps_param *old_p, vps_param 
*vps_p, vps_config *conf,
ret = conf_store_strlist(conf_h, conf-name,
misc-searchdomain, 0);
break;
+   case PARAM_LOCAL_UID:
+   ret = conf_store_ulong(conf_h, conf-name, misc-local_uid);
+   break;
+   case PARAM_LOCAL_GID:
+   ret = conf_store_ulong(conf_h, conf-name, misc-local_gid);
+   break;
}
return ret;
 }
@@ -1996,6 +2006,13 @@ static int parse(envid_t veid, vps_param *vps_p, char 
*val, int id)
case PARAM_IPTABLES:
ret = parse_iptables(vps_p-res.env, val);
break;
+
+   case PARAM_LOCAL_UID:
+   conf_parse_ulong(vps_p-res.misc.local_uid, val);
+   break;
+   case PARAM_LOCAL_GID:
+   conf_parse_ulong(vps_p-res.misc.local_gid, val);
+   break;
case PARAM_LOCKEDPAGES:
case PARAM_PRIVVMPAGES:
case PARAM_SHMPAGES:
@@ -2697,6 +2714,8 @@ static void free_misc(misc_param *misc)
FREE_P(misc-hostname)
FREE_P(misc-description)
FREE_P(misc-bootorder)
+   FREE_P(misc-local_uid)
+   FREE_P(misc-local_gid)
 }
 
 static void free_net(net_param *net)
@@ -2789,6 +2808,17 @@ void free_vps_param(vps_param *param)
*dst-x = *src-x;  \
}
 
+#define MERGE_P_DEFAULT(x, dfl)
\
+   if ((src-x) != NULL) { \
+   if ((dst-x) == NULL)   \
+   dst-x = malloc(sizeof(*(dst-x))); \
+   *dst-x = *src-x;  \
+   }   \
+   if ((dst-x) == NULL) { \
+   dst-x = malloc(sizeof(*(dst-x))); \
+   *dst-x = (unsigned long)(dfl); \
+   }
+
 #define MERGE_P2(x)\
if ((src-x) != NULL) { \
if ((dst-x) == NULL)   \
@@ -2865,6 +2895,8 @@ static void merge_misc(misc_param *dst, misc_param *src)
MERGE_STR(hostname)
MERGE_STR(description)
MERGE_INT(onboot)
+   MERGE_P_DEFAULT(local_uid, VZ_DEFAULT_UID)
+   MERGE_P_DEFAULT(local_gid, VZ_DEFAULT_GID)
MERGE_P(bootorder)
MERGE_INT(wait)
 }
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 3/8] user namespace support for upstream containers

2013-03-22 Thread Glauber Costa
This patch allows the execution of unprivileged containers running ontop
of an upstream Linux Kernel. We will run at whatever UID is found in the
configuration file.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/env.h  |   1 +
 include/types.h|   1 +
 src/lib/hooks_ct.c | 194 +++--
 3 files changed, 191 insertions(+), 5 deletions(-)

diff --git a/include/env.h b/include/env.h
index 1628bbf..d41df2e 100644
--- a/include/env.h
+++ b/include/env.h
@@ -116,6 +116,7 @@ struct arg_start {
vps_handler *h;
void *data;
env_create_FN fn;
+   int userns_p; /* while running in userns, there's extra sync needed */
 };
 
 struct env_create_param3;
diff --git a/include/types.h b/include/types.h
index ceecb93..54eb1f4 100644
--- a/include/types.h
+++ b/include/types.h
@@ -95,6 +95,7 @@ typedef struct vps_handler {
int vzfd;   /** /dev/vzctl file descriptor. */
int stdfd;
int can_join_pidns; /* can't enter otherwise */
+   int can_join_userns; /* can't run non privileged otherwise */
int (*is_run)(struct vps_handler *h, envid_t veid);
int (*enter)(struct vps_handler *h, envid_t veid, const char *root, int 
flags);
int (*destroy)(struct vps_handler *h, envid_t veid);
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index 29d7eea..6bd27c1 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -66,6 +66,8 @@ static int sys_setns(int fd, int nstype)
 # define MS_PRIVATE (1  18)
 #endif
 
+#define UID_GID_RANGE 10 /* how many users per container */
+
 /* This function is there in GLIBC, but not in headers */
 extern int pivot_root(const char * new_root, const char * put_old);
 
@@ -138,10 +140,39 @@ static int _env_create(void *data)
struct env_create_param3 create_param;
int ret;
 
-   if ((ret = ct_chroot(arg-res-fs.root)))
+   if ((arg-userns_p != -1)  (read(arg-userns_p, ret, sizeof(ret)) == 
0))
+   return -1;
+
+   ret = ct_chroot(arg-res-fs.root);
+   close(arg-userns_p);
+   /* Probably means chroot failed */
+   if (ret)
return ret;
 
-   if ((ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 1)))
+   if (arg-h-can_join_userns) {
+   setuid(0);
+   setgid(0);
+   /* 
+* We need the special flag newinstance. This is a requirement
+* of the userns-aware implementation of devpts as of Linux 3.9.
+* Because of that special requirement, we do it here rather 
than
+* later.
+*/
+   mount(devpts, /dev/pts, devpts, 0, newinstance);
+   /* /dev/ptmx, if it even exists, would refer to the root ptmx.
+* We don't want that, we want our newly created instance to 
contain
+* all ptys. So we bind mount the root device here
+*/
+   open(/dev/ptmx, O_RDWR|O_CREAT, 0);
+   mount(/dev/pts/ptmx, /dev/ptmx, , MS_BIND, 0);
+   }
+
+   /*
+* If we are using the user namespace, we will have the full capability
+* set in the target namespace. So we don't need any of that.
+*/
+   if (!arg-h-can_join_userns 
+   (ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 
1)))
return ret;
 
fill_container_param(arg, create_param);
@@ -153,6 +184,79 @@ static int _env_create(void *data)
return exec_container_init(arg, create_param);
 }
 
+static int write_uid_gid_mapping(vps_handler *h, unsigned long uid, unsigned 
long gid, pid_t pid)
+{
+   char buf[STR_SIZE];
+   char map[STR_SIZE];
+   int fd;
+
+   snprintf(map, sizeof(map), 0 %ld %d, uid, UID_GID_RANGE);
+   snprintf(buf, sizeof(buf), /proc/%d/uid_map, pid);
+   if ((fd = open(buf, O_WRONLY))  0)
+   return -1;
+
+   if ((write(fd, map, sizeof(map))  0))
+   return -1;
+
+   snprintf(map, sizeof(map), 0 %ld %d, gid, UID_GID_RANGE);
+   snprintf(buf, sizeof(map), /proc/%d/gid_map, pid);
+   if ((fd = open(buf, O_WRONLY))  0)
+   return -1;
+
+   if ((write(fd, map, sizeof(map))  0))
+   return -1;
+
+   return 0;
+}
+
+/*
+ * Those devices should exist in the container, and be valid device nodes with
+ * user access permission. But we need to be absolutely sure this is the case,
+ * so we will provide our own versions. That could actually happen since some
+ * distributions may come with emptied /dev's, waiting for udev to populate 
them.
+ * That won't happen, we do it ourselves.
+ */
+static void create_devices(vps_handler *h, envid_t veid, const char *root)
+{
+   unsigned int i;
+   char *devices[] = {
+   /dev/null,
+   /dev/zero,
+   /dev/random,
+   /dev/urandom,
+   };
+
+   /*
+* We

[Devel] [PATCH v2 4/8] modify tar extraction to account for user namespace

2013-03-22 Thread Glauber Costa
If we are running upstream with user namespaces, we need to create the
container filesystem not with the ownership preserved, but reflecting the
mapping we need to apply.

The long term goal is to patch the tar utility to allow us to specify an offset
that will be applied during extraction. However, even if tar is modified, the
legacy base of old tar utilities is still too big. We can employ a trick to
allow container creation right now, as well as to avoid compatibility problems:
we will resort to LD_PRELOAD to load a schim that captures calls to the chown
family of system calls and applies the offset manually.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 scripts/vps-create.in   | 19 ++
 src/lib/Makefile.am |  3 ++
 src/lib/chown_preload.c | 93 +
 src/lib/create.c| 21 +++
 vzctl.spec  |  2 +-
 5 files changed, 130 insertions(+), 8 deletions(-)
 create mode 100644 src/lib/chown_preload.c

diff --git a/scripts/vps-create.in b/scripts/vps-create.in
index 126f048..53f6643 100755
--- a/scripts/vps-create.in
+++ b/scripts/vps-create.in
@@ -22,11 +22,27 @@
 # Required parameters:
 #   VE_PRVT- path to root of CT private areas
 #   PRIVATE_TEMPLATE   - path to private template used as a source for copying
+#
+# Optional parameters:
+#   UID_OFFSET - offset to be added to all tar UIDs
+#   GID_OFFSET - offset to be added to all tar GIDs
 
 . @SCRIPTDIR@/vps-functions
 
 vzcheckvar VE_PRVT PRIVATE_TEMPLATE
 
+chown_preload_if_needed()
+{
+   [ -z $UID_OFFSET -o -z $GID_OFFSET ]  return 
+
+   # TODO use tar with appropriate option when it becomes available 
+   #
+   # The goal is to get this merged into tar so we no longer
+   # need to do this preload. Whenever it happens, we will include
+   # a version check here as well
+   export LD_PRELOAD=libvzchown.so
+}
+
 create_prvt()
 {
local TMP AVAIL NEEDED HEADER OPT
@@ -75,6 +91,9 @@ create_prvt()
[ $AVAIL -ge $NEEDED ] ||
vzerror Insufficient disk space in $VE_PRVT; available: 
$AVAIL, needed: $NEEDED ${VZ_FS_NO_DISK_SPACE}
CAT=cat
+
+   chown_preload_if_needed
+
# Use pv to show nice progress bar if we can
pv -V /dev/null 21  CAT=pv
chmod 700 $VE_PRVT
diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am
index 42a7ed6..44c8a69 100644
--- a/src/lib/Makefile.am
+++ b/src/lib/Makefile.am
@@ -70,6 +70,9 @@ libvzctl_la_LIBADD = $(XML_LIBS) $(CGROUP_LIBS) $(DL_LIBS)
 
 if HAVE_CGROUP
 libvzctl_la_SOURCES += cgroup.c hooks_ct.c
+
+lib_LTLIBRARIES += libvzchown.la
+libvzchown_la_SOURCES = chown_preload.c
 endif
 
 if HAVE_VZ_KERNEL
diff --git a/src/lib/chown_preload.c b/src/lib/chown_preload.c
new file mode 100644
index 000..4e2be4a
--- /dev/null
+++ b/src/lib/chown_preload.c
@@ -0,0 +1,93 @@
+/*
+ *  Copyright (C) 2013, Parallels, Inc. All rights reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *  Authors: Kir Kolyshkin and Glauber Costa
+ */
+
+#include unistd.h
+#include stdio.h
+#include stdlib.h
+
+#include dlfcn.h
+
+static int (*real_chown)(const char *path, uid_t owner, gid_t group) = NULL;
+static int (*real_fchown)(int fd, uid_t owner, gid_t group) = NULL;
+static int (*real_lchown)(const char *path, uid_t owner, gid_t group) = NULL;
+static int (*real_fchownat)(int dirfd, const char *pathname,
+   uid_t owner, gid_t group, int flags) = NULL;
+
+uid_t uid_offset = 0;
+gid_t gid_offset = 0;
+
+static void __init(void)
+{
+   char *uidstr, *gidstr;
+
+   uidstr = getenv(UID_OFFSET);
+   gidstr = getenv(GID_OFFSET);
+   if (!uidstr || !gidstr) {
+   fprintf(stderr, Environment variables UID_OFFSET 
+   and GID_OFFSET are required -- aborting\n);
+   exit(33);
+   }
+   uid_offset = strtol(uidstr, NULL, 10);
+   gid_offset = strtol(gidstr, NULL, 10);
+
+   real_chown = dlsym(RTLD_NEXT, chown);
+   real_fchown = dlsym(RTLD_NEXT, fchown);
+   real_lchown = dlsym(RTLD_NEXT, lchown);
+   real_fchownat = dlsym(RTLD_NEXT, fchownat);
+
+   if (!real_chown || !real_fchown || !real_lchown || !real_fchownat) {
+   fprintf(stderr

[Devel] [PATCH v2 5/8] add user mismatch test

2013-03-22 Thread Glauber Costa
In theory, we won't be able to run if our private area is not owned by
ourselves.  We could, if it have very wide open security permissions, but we
should never set up a container like that.

Aside from a basic sanity check, this is intended to catch problems for the few
people who may have already created containers that will be owned by root:root,
and will now try to run it unprivileged.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 src/lib/env.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/src/lib/env.c b/src/lib/env.c
index 2da848d..ff4dad2 100644
--- a/src/lib/env.c
+++ b/src/lib/env.c
@@ -30,6 +30,7 @@
 #include linux/reboot.h
 #include sys/mount.h
 #include sys/utsname.h
+#include sys/stat.h
 
 #include vzerror.h
 #include res.h
@@ -551,6 +552,18 @@ int vps_start_custom(vps_handler *h, envid_t veid, 
vps_param *param,
logger(-1, 0, Container is already running);
return VZ_VE_RUNNING;
}
+   if (!is_vz_kernel(h)  h-can_join_userns) {
+   struct stat private_stat;
+   stat(res-fs.private, private_stat);
+   if ((private_stat.st_uid != *res-misc.local_uid) ||
+   (private_stat.st_gid != *res-misc.local_gid)) {
+   logger(-1, 0, Container private area is owned by %d:%d
+   , but configuration file says we should run with 
%lu:%lu.\n
+   Refusing to run., private_stat.st_uid, 
private_stat.st_gid,
+   *res-misc.local_uid, *res-misc.local_gid);
+   return VZ_FS_BAD_TMPL;
+   }
+   }
if ((ret = check_ub(h, res-ub)))
return ret;
 
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 6/8] allow local uid and gid to be specified at container creation

2013-03-22 Thread Glauber Costa
It is a valid use case to run a container with host uid and gid different
than the default. This patch provides and documents a way to do so.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 man/vzctl.8.in  | 14 ++
 src/vzctl-actions.c |  2 ++
 src/vzctl.c |  1 +
 3 files changed, 17 insertions(+)

diff --git a/man/vzctl.8.in b/man/vzctl.8.in
index 5efd702..07162ea 100644
--- a/man/vzctl.8.in
+++ b/man/vzctl.8.in
@@ -852,6 +852,8 @@ List of available fields can be obtained using \fB-L\fR 
option.
 .OP --ipadd addr
 .OP --hostname name
 .OP --name name
+.OP --local_uid uid
+.OP --local_gid gid
 .YS
 .IP  4
 Creates a new container area. This operation should be done once, before
@@ -903,6 +905,18 @@ a container. Note that this option can be used multiple 
times.
 
 You can use \fB--hostname\fR \fIname\fR option to set a host name for
 a container.
+
+When running with an upstream Linux Kernel that supports user namespaces
+(= 3.8), the parameters \fB--local_uid\fR and \fB--local_gid\fR can be used 
to select
+which \fIuid\fR and \fIgid\fR respectively will be used as a base user in the
+host system. Note that user namespaces provide a 1:1 mapping between container
+users and host users. If these options are not specified, the value 10 is
+used.
+
+\fBWarning:\fR use \fB--local_uid\fR and \fB--local_gid\fR with care, specially
+when migrating containers. In all situations, the container's files in the
+filesystem needs to be correctly owned by the host-side users.
+
 .IP \fBdestroy\fR | \fBdelete\fR \fICTID\fR 4
 Removes a container private area by deleting all files, directories and
 the configuration file of this container.
diff --git a/src/vzctl-actions.c b/src/vzctl-actions.c
index be22265..63d93aa 100644
--- a/src/vzctl-actions.c
+++ b/src/vzctl-actions.c
@@ -391,6 +391,8 @@ static int parse_create_opt(envid_t veid, int argc, char 
**argv,
{ve_layout,   required_argument, NULL, PARAM_VE_LAYOUT},
{velayout,required_argument, NULL, PARAM_VE_LAYOUT},
{diskspace,   required_argument, NULL, PARAM_DISKSPACE},
+   {local_uid,   required_argument, NULL, PARAM_LOCAL_UID},
+   {local_gid,   required_argument, NULL, PARAM_LOCAL_GID},
{ NULL, 0, NULL, 0 }
 };
 
diff --git a/src/vzctl.c b/src/vzctl.c
index d9bba7d..a72ab39 100644
--- a/src/vzctl.c
+++ b/src/vzctl.c
@@ -65,6 +65,7 @@ static void usage(int rc)
 vzctl create ctid [--ostemplate name] [--config name]\n
[--layout ploop|simfs] [--hostname name] [--name name] [--ipadd 
addr]\n
[--diskspace kbytes] [--private path] [--root path]\n
+   [--local_uid UID] [--local_gid GID]\n
 vzctl start ctid [--force] [--wait]\n
 vzctl destroy | mount | umount | stop | restart | status ctid\n
 #ifdef HAVE_PLOOP
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 7/8] automatically add bridge venet0 when needed

2013-03-22 Thread Glauber Costa
The chosen architecture to deal with --ipadd with upstream containers is to
create a veth pair and add the host side information to a bridge called venet0.
This way, all the code that expects venet0 to exist can still work without
modifications, (or with just a few).

Our intention to do that was actually already stated in the comments, but the
code was removed before merging because --ipadd would not work without full
unshare support anyway.

This patch implements that.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 scripts/vps-functions.in |  7 +++
 src/lib/hooks_ct.c   | 37 +++--
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/scripts/vps-functions.in b/scripts/vps-functions.in
index 826c0a1..ab05aa0 100755
--- a/scripts/vps-functions.in
+++ b/scripts/vps-functions.in
@@ -170,6 +170,13 @@ vzadjustmacs()
 # other setups, the bridge is expected to already exist and be valid.
 vzconfbridge()
 {
+   if [ x$BRIDGE == xvenet0 ]; then
+   if [ `brctl show venet0 2/dev/null | tail -n+2 | wc -l` == 0 
]; then
+   brctl addbr venet0
+   ${IP_CMD} link set venet0 up
+   fi
+   fi
+   
if [ x$BRIDGE != x ]; then
brctl addif $BRIDGE $HNAME /dev/null 21
fi
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index 6bd27c1..d5b15dc 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -17,6 +17,7 @@
 #include logger.h
 #include script.h
 #include cgroup.h
+#include linux/vzctl_venet.h
 
 #define NETNS_RUN_DIR /var/run/netns
 
@@ -665,8 +666,40 @@ static int ct_netdev_ctl(vps_handler *h, envid_t veid, int 
op, char *name)
 
 static int ct_ip_ctl(vps_handler *h, envid_t veid, int op, const char *ipstr)
 {
-   logger(-1, 0, %s not yet supported upstream, __func__);
-   return 0;
+   int ret = -1;
+   char *envp[5];
+   char buf[STR_SIZE];
+   int i = 0;
+
+   if (!h-can_join_pidns) {
+   logger(-1, 0, Cannot join pid namespace: 
+   --ipadd is not supported in kernels without full pidns 
support);
+   return VZ_BAD_KERNEL;
+   }
+   envp[i++] = strdup(VNAME=venet0);
+   envp[i++] = strdup(BRIDGE=venet0);
+
+   snprintf(buf, sizeof(buf), HNAME=venet0.%d, veid);
+   envp[i++] = strdup(buf);
+
+   snprintf(buf, sizeof(buf), VEID=%d, veid);
+   envp[i++] = strdup(buf);
+
+   envp[i] = NULL;
+
+   if (op == VE_IP_ADD) {
+   char *argv[] = { VPS_NETNS_DEV_ADD, NULL };
+
+   ret = run_script(VPS_NETNS_DEV_ADD, argv, envp, 0);
+   } else  {
+   char *argv[] = { VPS_NETNS_DEV_DEL, NULL };
+
+   ret = run_script(VPS_NETNS_DEV_DEL, argv, envp, 0);
+   }
+   free_arg(envp);
+
+   return ret;
+
 }
 
 /*
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 8/8] allow for distro-specific fix ups at creation time.

2013-03-22 Thread Glauber Costa
We will need that infrastucture when running with Linux upstream, since some
support is very unlikely to ever land in the Kernel.  We need to do things like
account for the fact that udev may kick in and destroy all the setup we have
done for /dev.  Since preemptively preventing those actions to happen is very
hard ( as an example, centos6 init binary will issue mount syscalls itself),
the most robust approach is to insert a script that will be run shortly after
rc.sysinit. Although this can't guarantee that it will run *immediately* after,
it should be early enough to undo every harmful operation done by /sbin/init
and rc.sysinit, therefore allowing operation to continue freely.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 etc/dists/redhat.conf   |  1 +
 etc/dists/scripts/fixups.sh | 43 +++
 include/dist.h  |  2 ++
 include/env.h   |  3 ++-
 src/lib/dist.c  | 10 +-
 src/lib/env.c   | 10 ++
 src/lib/exec.c  |  2 +-
 src/lib/hooks_ct.c  | 38 ++
 8 files changed, 102 insertions(+), 7 deletions(-)
 create mode 100755 etc/dists/scripts/fixups.sh

diff --git a/etc/dists/redhat.conf b/etc/dists/redhat.conf
index 727461d..49226c5 100644
--- a/etc/dists/redhat.conf
+++ b/etc/dists/redhat.conf
@@ -25,3 +25,4 @@ SET_DNS=set_dns.sh
 SET_USERPASS=set_userpass.sh
 SET_UGID_QUOTA=set_ugid_quota.sh
 POST_CREATE=postcreate.sh
+CT_FIXUP=fixups.sh
diff --git a/etc/dists/scripts/fixups.sh b/etc/dists/scripts/fixups.sh
new file mode 100755
index 000..da3fd24
--- /dev/null
+++ b/etc/dists/scripts/fixups.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+#  Copyright (C) 2000-2009, Parallels, Inc. All rights reserved.
+#
+#  This program is free software; you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation; either version 2 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program; if not, write to the Free Software
+#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+#
+#
+#  Run-time fixup for legacy distributions that will rely on udevd to populate
+#  their /dev. Those distributions will effectively mount an empty tmpfs on top
+#  of /dev, defeating any setup we may have done. This includes RHEL-based
+#  distributions up to RHEL6. We are better off not allowing that to run at all
+
+function fixup_udev()
+{
+   if [ -f /etc/fedora-release ]; then
+   return;
+   fi
+
+   # All the first two may fail if the distribution didn't actually mount
+   # then.
+   umount /dev/pts
+   umount /dev/shm
+   # still may hold references to files open by rc.sysinit. Even if this 
is the
+   # case those will be simple things like /dev/null and should go away 
shortly.
+   # let us not fail because of that.
+   umount /dev -l
+}
+
+fixup_udev
+
+exit 0
+
diff --git a/include/dist.h b/include/dist.h
index 1f6a5a6..4faa452 100644
--- a/include/dist.h
+++ b/include/dist.h
@@ -29,6 +29,7 @@
 #defineSET_USERPASS5
 #defineSET_UGID_QUOTA  6
 #definePOST_CREATE 7
+#defineCT_FIXUP8
 
 typedef struct {
char *def_ostmpl;
@@ -46,6 +47,7 @@ typedef struct dist_actions {
char *set_userpass; /** setup user password. */
char *set_ugid_quota;   /** setup 2level quota. */
char *post_create;  /** sostcreate actions. */
+   char *ct_fixup; /** run-time fixups. */
 } dist_actions;
 
 /* Read distribution specific actions configuration file.
diff --git a/include/env.h b/include/env.h
index d41df2e..acea123 100644
--- a/include/env.h
+++ b/include/env.h
@@ -104,7 +104,7 @@ int vps_stop(vps_handler *h, envid_t veid, vps_param 
*param, int stop_mode,
 int vz_chroot(const char *root);
 int vz_env_create(vps_handler *h, envid_t veid, vps_res *res,
int wait_p[2], int old_wait_p[2], int err_p[2],
-   env_create_FN fn, void *data);
+   env_create_FN fn, dist_actions *actions, void *data);
 
 struct vps_res;
 struct arg_start {
@@ -116,6 +116,7 @@ struct arg_start {
vps_handler *h;
void *data;
env_create_FN fn;
+   dist_actions *actions;
int userns_p; /* while running in userns, there's extra sync needed */
 };
 
diff --git a/src/lib/dist.c b/src/lib/dist.c
index bde94ab..8047cf7 100644
--- a/src/lib/dist.c
+++ b/src/lib/dist.c
@@ -36,7 +36,8 @@ static struct distr_conf {
{SET_DNS

[Devel] [PATCH v2 0/6] Unprivileged containers with user namespaces

2013-03-12 Thread Glauber Costa
Kir,

Please take a look at the following patches. They implement support for
unprivileged containers using user namespaces, and should work, modulo bugs.

v2:

* use conf_parse_ulong to simplify uid/gid parsing. We do need to provide a
  default value for creation, though.

* allow 0 to be specified as uid/gid offset. It simplifies the code if
  conf_parse_ulong is used, and well, if anyone *really* wants to run
  privileged... We will apply the default value now only if the fields are
  unset.

Glauber Costa (6):
  host uid and gid parameters
  adjust fs_create parameter
  user namespace support for upstream containers
  modify tar extraction to account for user namespace
  add user mismatch test
  allow local uid and gid to be specified at container creation

 include/res.h   |  8 +
 include/types.h |  1 +
 include/vzctl_param.h   |  3 ++
 man/vzctl.8.in  | 14 
 scripts/vps-create.in   | 19 ++
 src/lib/Makefile.am |  3 ++
 src/lib/chown_preload.c | 93 +
 src/lib/config.c| 32 +
 src/lib/create.c| 30 ++--
 src/lib/env.c   | 29 +++
 src/lib/hooks_ct.c  | 93 +++--
 src/vzctl-actions.c |  2 ++
 src/vzctl.c |  1 +
 vzctl.spec  |  2 +-
 14 files changed, 316 insertions(+), 14 deletions(-)
 create mode 100644 src/lib/chown_preload.c

-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 1/6] host uid and gid parameters

2013-03-12 Thread Glauber Costa
When running with an upstream Linux kernel that supports user namespaces,
we will run the container using an unprivileged user in the system. That
can be any user, and it serves as base to a 1:1 mapping between users in
the container and users in the host.

By default, the value 10 will be used for both uid and gid.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/res.h |  8 
 include/vzctl_param.h |  3 +++
 src/lib/config.c  | 32 
 3 files changed, 43 insertions(+)

diff --git a/include/res.h b/include/res.h
index e07060d..47c36da 100644
--- a/include/res.h
+++ b/include/res.h
@@ -47,6 +47,12 @@ struct env_param {
 };
 typedef struct env_param env_param_t;
 
+/*
+ * When running upstream kernels, we will need host-side UID and GID. Those
+ * are configurable, but if none is specified, those are used.
+ */
+#define VZ_DEFAULT_UID 10
+#define VZ_DEFAULT_GID 10
 typedef struct {
list_head_t userpw;
list_head_t nameserver;
@@ -56,6 +62,8 @@ typedef struct {
int onboot;
unsigned long *bootorder;
int wait;
+   unsigned long *local_uid;
+   unsigned long *local_gid;
 } misc_param;
 
 struct mod_action;
diff --git a/include/vzctl_param.h b/include/vzctl_param.h
index 3034a13..e583540 100644
--- a/include/vzctl_param.h
+++ b/include/vzctl_param.h
@@ -149,5 +149,8 @@
 #define PARAM_MOUNT_OPTS   396
 
 
+#define PARAM_LOCAL_UID397
+#define PARAM_LOCAL_GID398
+
 #define PARAM_LINE e:p:f:t:i:l:k:a:b:n:x:h
 #endif
diff --git a/src/lib/config.c b/src/lib/config.c
index cf5f352..cc4c1cc 100644
--- a/src/lib/config.c
+++ b/src/lib/config.c
@@ -128,6 +128,10 @@ static vps_config config[] = {
 {CONFIGFILE, NULL, PARAM_CONFIG},
 {ORIGIN_SAMPLE,NULL,PARAM_CONFIG_SAMPLE},
 {DISABLED,   NULL, PARAM_DISABLED},
+#ifdef HAVE_UPSTREAM
+{LOCAL_UID,  NULL, PARAM_LOCAL_UID},
+{LOCAL_GID,  NULL, PARAM_LOCAL_GID},
+#endif
 /* quota */
 {DISK_QUOTA, NULL, PARAM_DISK_QUOTA},
 {DISKSPACE,  NULL, PARAM_DISKSPACE},
@@ -1363,6 +1367,12 @@ static int store_misc(vps_param *old_p, vps_param 
*vps_p, vps_config *conf,
ret = conf_store_strlist(conf_h, conf-name,
misc-searchdomain, 0);
break;
+   case PARAM_LOCAL_UID:
+   ret = conf_store_ulong(conf_h, conf-name, misc-local_uid);
+   break;
+   case PARAM_LOCAL_GID:
+   ret = conf_store_ulong(conf_h, conf-name, misc-local_gid);
+   break;
}
return ret;
 }
@@ -1996,6 +2006,13 @@ static int parse(envid_t veid, vps_param *vps_p, char 
*val, int id)
case PARAM_IPTABLES:
ret = parse_iptables(vps_p-res.env, val);
break;
+
+   case PARAM_LOCAL_UID:
+   conf_parse_ulong(vps_p-res.misc.local_uid, val);
+   break;
+   case PARAM_LOCAL_GID:
+   conf_parse_ulong(vps_p-res.misc.local_gid, val);
+   break;
case PARAM_LOCKEDPAGES:
case PARAM_PRIVVMPAGES:
case PARAM_SHMPAGES:
@@ -2697,6 +2714,8 @@ static void free_misc(misc_param *misc)
FREE_P(misc-hostname)
FREE_P(misc-description)
FREE_P(misc-bootorder)
+   FREE_P(misc-local_uid)
+   FREE_P(misc-local_gid)
 }
 
 static void free_net(net_param *net)
@@ -2789,6 +2808,17 @@ void free_vps_param(vps_param *param)
*dst-x = *src-x;  \
}
 
+#define MERGE_P_DEFAULT(x, dfl)
\
+   if ((src-x) != NULL) { \
+   if ((dst-x) == NULL)   \
+   dst-x = malloc(sizeof(*(dst-x))); \
+   *dst-x = *src-x;  \
+   }   \
+   if ((dst-x) == NULL) { \
+   dst-x = malloc(sizeof(*(dst-x))); \
+   *dst-x = (unsigned long)(dfl); \
+   }
+
 #define MERGE_P2(x)\
if ((src-x) != NULL) { \
if ((dst-x) == NULL)   \
@@ -2865,6 +2895,8 @@ static void merge_misc(misc_param *dst, misc_param *src)
MERGE_STR(hostname)
MERGE_STR(description)
MERGE_INT(onboot)
+   MERGE_P_DEFAULT(local_uid, VZ_DEFAULT_UID)
+   MERGE_P_DEFAULT(local_gid, VZ_DEFAULT_GID)
MERGE_P(bootorder)
MERGE_INT(wait)
 }
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 2/6] adjust fs_create parameter

2013-03-12 Thread Glauber Costa
We need to pass more information to fs_create. Instead of adding arguments, it
is preferred to pass the whole vps_p structure and unfold it inside the callee.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 src/lib/create.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/lib/create.c b/src/lib/create.c
index 39e..0a0330f 100644
--- a/src/lib/create.c
+++ b/src/lib/create.c
@@ -90,8 +90,8 @@ static int download_template(char *tmpl)
return run_script(VPS_DOWNLOAD, arg, env, 0);
 }
 
-static int fs_create(envid_t veid, fs_param *fs, tmpl_param *tmpl,
-   dq_param *dq, int layout, int ploop_mode)
+static int fs_create(envid_t veid, vps_handler *h, fs_param *fs,
+   tmpl_param *tmpl, vps_param *vps_p)
 {
char tarball[PATH_LEN];
char tmp_dir[PATH_LEN];
@@ -104,6 +104,10 @@ static int fs_create(envid_t veid, fs_param *fs, 
tmpl_param *tmpl,
char *dst;
const char *ext[] = { , .gz, .bz2, .xz, NULL };
const char *errmsg_ext = [.gz|.bz2|.xz];
+   dq_param *dq = vps_p-res.dq;
+   int layout = vps_p-opt.layout;
+   unsigned int uid_offset = vps_p-res.misc.local_uid;
+   unsigned int gid_offset = vps_p-res.misc.local_gid;
int ploop = (layout == VE_LAYOUT_PLOOP);
 
if (ploop  (!dq-diskspace || dq-diskspace[1] = 0)) {
@@ -153,6 +157,7 @@ find:
/* Create and mount ploop image */
struct vzctl_create_image_param param = {};
struct vzctl_mount_param mount_param = {};
+   int ploop_mode = vps_p-opt.mode;
 
if (ploop_mode  0)
ploop_mode = PLOOP_EXPANDED_MODE;
@@ -360,9 +365,7 @@ int vps_create(vps_handler *h, envid_t veid, vps_param 
*vps_p, vps_param *cmd_p,
tmpl-ostmpl = full_ostmpl;
}
}
-   if ((ret = fs_create(veid, fs, tmpl, vps_p-res.dq,
-   vps_p-opt.layout,
-   vps_p-opt.mode)))
+   if ((ret = fs_create(veid, h, fs, tmpl, vps_p)))
goto err_root;
}
 
-- 
1.7.11.7

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 3/6] user namespace support for upstream containers

2013-03-12 Thread Glauber Costa
This patch allows the execution of unprivileged containers running ontop
of an upstream Linux Kernel. We will run at whatever UID is found in the
configuration file.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/types.h|  1 +
 src/lib/env.c  | 16 ++
 src/lib/hooks_ct.c | 93 --
 3 files changed, 107 insertions(+), 3 deletions(-)

diff --git a/include/types.h b/include/types.h
index ceecb93..54eb1f4 100644
--- a/include/types.h
+++ b/include/types.h
@@ -95,6 +95,7 @@ typedef struct vps_handler {
int vzfd;   /** /dev/vzctl file descriptor. */
int stdfd;
int can_join_pidns; /* can't enter otherwise */
+   int can_join_userns; /* can't run non privileged otherwise */
int (*is_run)(struct vps_handler *h, envid_t veid);
int (*enter)(struct vps_handler *h, envid_t veid, const char *root, int 
flags);
int (*destroy)(struct vps_handler *h, envid_t veid);
diff --git a/src/lib/env.c b/src/lib/env.c
index 2da848d..75e2dee 100644
--- a/src/lib/env.c
+++ b/src/lib/env.c
@@ -280,6 +280,22 @@ int exec_container_init(struct arg_start *arg,
if (read(arg-wait_p, ret, sizeof(ret)) == 0)
return -1;
 
+   /*
+* If we are running on upstream Linux Kernel, we will arrive here as
+* the default unprivileged user. This needs to be setup by the
+* container thread (pid 1), but only after the mapping is established.
+*
+* Since the mapping can only ever be established by a process that has
+* CAP_SETUID in the parent namespace, this has to be done from the
+* process who called clone, not by the cloned children. We need some 
sort
+* of synchronization to make sure the mappings are already in place, so
+* we do it after the read of wait_p above.
+*/
+   if (!is_vz_kernel(arg-h)  arg-h-can_join_userns) {
+   setuid(0);
+   setgid(0);
+   }
+
if ((fd = open(/dev/null, O_RDWR)) != -1) {
dup2(fd, 0);
dup2(fd, 1);
diff --git a/src/lib/hooks_ct.c b/src/lib/hooks_ct.c
index 29d7eea..1ed3b84 100644
--- a/src/lib/hooks_ct.c
+++ b/src/lib/hooks_ct.c
@@ -66,6 +66,8 @@ static int sys_setns(int fd, int nstype)
 # define MS_PRIVATE (1  18)
 #endif
 
+#define UID_GID_RANGE 10 /* how many users per container */
+
 /* This function is there in GLIBC, but not in headers */
 extern int pivot_root(const char * new_root, const char * put_old);
 
@@ -141,7 +143,12 @@ static int _env_create(void *data)
if ((ret = ct_chroot(arg-res-fs.root)))
return ret;
 
-   if ((ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 1)))
+   /*
+* If we are using the user namespace, we will have the full capability
+* set in the target namespace. So we don't need any of that.
+*/
+   if (!arg-h-can_join_userns 
+   (ret = vps_set_cap(arg-veid, arg-res-env, arg-res-cap, 
1)))
return ret;
 
fill_container_param(arg, create_param);
@@ -153,6 +160,31 @@ static int _env_create(void *data)
return exec_container_init(arg, create_param);
 }
 
+static int write_uid_gid_mapping(vps_handler *h, unsigned long uid, unsigned 
long gid, pid_t pid)
+{
+   char buf[STR_SIZE];
+   char map[STR_SIZE];
+   int fd;
+
+   snprintf(map, sizeof(map), 0 %ld %d, uid, UID_GID_RANGE);
+   snprintf(buf, sizeof(buf), /proc/%d/uid_map, pid);
+   if ((fd = open(buf, O_WRONLY))  0)
+   return -1;
+
+   if ((write(fd, map, sizeof(map))  0))
+   return -1;
+
+   snprintf(map, sizeof(map), 0 %ld %d, gid, UID_GID_RANGE);
+   snprintf(buf, sizeof(map), /proc/%d/gid_map, pid);
+   if ((fd = open(buf, O_WRONLY))  0)
+   return -1;
+
+   if ((write(fd, map, sizeof(map))  0))
+   return -1;
+
+   return 0;
+}
+
 static int ct_env_create(struct arg_start *arg)
 {
 
@@ -190,16 +222,30 @@ static int ct_env_create(struct arg_start *arg)
 * Belong in the setup phase
 */
clone_flags = SIGCHLD;
-   /* FIXME: USERNS is still work in progress */
clone_flags |= CLONE_NEWUTS|CLONE_NEWPID|CLONE_NEWIPC;
clone_flags |= CLONE_NEWNET|CLONE_NEWNS;
 
+   if (arg-h-can_join_userns)
+   clone_flags |= CLONE_NEWUSER;
+   else
+   logger(-1, 0, WARNING: Running container unprivileged. USER_NS 
not supported);
+
ret = clone(_env_create, child_stack, clone_flags, arg);
if (ret   0) {
logger(-1, errno, Unable to clone);
/* FIXME: remove ourselves from container first */
destroy_container(arg-veid);
return VZ_RESOURCE_ERROR;
+   } else if (arg-h-can_join_userns) {
+   /*
+* Now we need to write to the mapping file. It has to be us

[Devel] [PATCH v2 4/6] modify tar extraction to account for user namespace

2013-03-12 Thread Glauber Costa
If we are running upstream with user namespaces, we need to create the
container filesystem not with the ownership preserved, but reflecting the
mapping we need to apply.

The long term goal is to patch the tar utility to allow us to specify an offset
that will be applied during extraction. However, even if tar is modified, the
legacy base of old tar utilities is still too big. We can employ a trick to
allow container creation right now, as well as to avoid compatibility problems:
we will resort to LD_PRELOAD to load a schim that captures calls to the chown
family of system calls and applies the offset manually.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 scripts/vps-create.in   | 19 ++
 src/lib/Makefile.am |  3 ++
 src/lib/chown_preload.c | 93 +
 src/lib/create.c| 21 +++
 vzctl.spec  |  2 +-
 5 files changed, 130 insertions(+), 8 deletions(-)
 create mode 100644 src/lib/chown_preload.c

diff --git a/scripts/vps-create.in b/scripts/vps-create.in
index 126f048..53f6643 100755
--- a/scripts/vps-create.in
+++ b/scripts/vps-create.in
@@ -22,11 +22,27 @@
 # Required parameters:
 #   VE_PRVT- path to root of CT private areas
 #   PRIVATE_TEMPLATE   - path to private template used as a source for copying
+#
+# Optional parameters:
+#   UID_OFFSET - offset to be added to all tar UIDs
+#   GID_OFFSET - offset to be added to all tar GIDs
 
 . @SCRIPTDIR@/vps-functions
 
 vzcheckvar VE_PRVT PRIVATE_TEMPLATE
 
+chown_preload_if_needed()
+{
+   [ -z $UID_OFFSET -o -z $GID_OFFSET ]  return 
+
+   # TODO use tar with appropriate option when it becomes available 
+   #
+   # The goal is to get this merged into tar so we no longer
+   # need to do this preload. Whenever it happens, we will include
+   # a version check here as well
+   export LD_PRELOAD=libvzchown.so
+}
+
 create_prvt()
 {
local TMP AVAIL NEEDED HEADER OPT
@@ -75,6 +91,9 @@ create_prvt()
[ $AVAIL -ge $NEEDED ] ||
vzerror Insufficient disk space in $VE_PRVT; available: 
$AVAIL, needed: $NEEDED ${VZ_FS_NO_DISK_SPACE}
CAT=cat
+
+   chown_preload_if_needed
+
# Use pv to show nice progress bar if we can
pv -V /dev/null 21  CAT=pv
chmod 700 $VE_PRVT
diff --git a/src/lib/Makefile.am b/src/lib/Makefile.am
index 42a7ed6..44c8a69 100644
--- a/src/lib/Makefile.am
+++ b/src/lib/Makefile.am
@@ -70,6 +70,9 @@ libvzctl_la_LIBADD = $(XML_LIBS) $(CGROUP_LIBS) $(DL_LIBS)
 
 if HAVE_CGROUP
 libvzctl_la_SOURCES += cgroup.c hooks_ct.c
+
+lib_LTLIBRARIES += libvzchown.la
+libvzchown_la_SOURCES = chown_preload.c
 endif
 
 if HAVE_VZ_KERNEL
diff --git a/src/lib/chown_preload.c b/src/lib/chown_preload.c
new file mode 100644
index 000..4e2be4a
--- /dev/null
+++ b/src/lib/chown_preload.c
@@ -0,0 +1,93 @@
+/*
+ *  Copyright (C) 2013, Parallels, Inc. All rights reserved.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *  Authors: Kir Kolyshkin and Glauber Costa
+ */
+
+#include unistd.h
+#include stdio.h
+#include stdlib.h
+
+#include dlfcn.h
+
+static int (*real_chown)(const char *path, uid_t owner, gid_t group) = NULL;
+static int (*real_fchown)(int fd, uid_t owner, gid_t group) = NULL;
+static int (*real_lchown)(const char *path, uid_t owner, gid_t group) = NULL;
+static int (*real_fchownat)(int dirfd, const char *pathname,
+   uid_t owner, gid_t group, int flags) = NULL;
+
+uid_t uid_offset = 0;
+gid_t gid_offset = 0;
+
+static void __init(void)
+{
+   char *uidstr, *gidstr;
+
+   uidstr = getenv(UID_OFFSET);
+   gidstr = getenv(GID_OFFSET);
+   if (!uidstr || !gidstr) {
+   fprintf(stderr, Environment variables UID_OFFSET 
+   and GID_OFFSET are required -- aborting\n);
+   exit(33);
+   }
+   uid_offset = strtol(uidstr, NULL, 10);
+   gid_offset = strtol(gidstr, NULL, 10);
+
+   real_chown = dlsym(RTLD_NEXT, chown);
+   real_fchown = dlsym(RTLD_NEXT, fchown);
+   real_lchown = dlsym(RTLD_NEXT, lchown);
+   real_fchownat = dlsym(RTLD_NEXT, fchownat);
+
+   if (!real_chown || !real_fchown || !real_lchown || !real_fchownat) {
+   fprintf(stderr

  1   2   3   4   5   6   7   8   9   10   >