The commit is pushed to "branch-rh8-4.18.0-240.1.1.vz8.5.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-240.1.1.vz8.5.5
------>
commit 50459ce9472b7a9c24861bde7ab9e43a56cd3ff3
Author: Valeriy Vdovin <[email protected]>
Date:   Wed Mar 3 20:21:14 2021 +0300

    ve/cgroup: Set release_agent_path for root cgroups separately
    
    This is done so that each container could set it's own release agent.
    Release agent information is now stored in per-cgroup-root data
    structure in ve.
    
    https://jira.sw.ru/browse/PSBM-83887
    
    Signed-off-by: Valeriy Vdovin <[email protected]>
    
    +++
    ve/cgroup: change resource release order in ve_drop_context
    
    This fixes 87cb5fdb5b5c77ac617b46a0fe118a7d50a77b1c
    In the mentioned patch in cgroup_show_options ve->ve_ns is checked to
    ensure that ve->root_css_set is usable. But in ve_drop_context
    root_css_set is being released before ve_ns, which is a bug.
    root_css_set will now be set to NULL after ve_ns is released.
    This reordering only affects the described piece of code in
    cgroup_show_options.
    
    https://jira.sw.ru/browse/PSBM-121438
    Signed-off-by: Valeriy Vdovin <[email protected]>
    
    Reviewed-by: Kirill Tkhai <[email protected]>
    
    +++
    cgroup: do not use cgroup_mutex in cgroup_show_options
    
    In 87cb5fdb5b5c77ac617b46a0fe118a7d50a77b1c function cgroup_show_options
    started to lock cgroup_mutex, which introduced new deadlock possibility,
    described below:
    
    Thread A:
      m_start() --> down_read(&namespace_sem);
        cgroup_show_options() --> mutex_lock(&cgroup_mutex);
    
    Thread B:
    attach_task_by_pid()
      cgroup_lock_live_group -->  mutex_lock(&cgroup_mutex);
      threadgroup_lock() -->  down_write(&tsk->signal->group_rwsem);
    
    Thread C:
     copy_process
      threadgroup_change_begin() --> down_read(&tsk->signal->group_rwsem);
      copy_namespaces
       create_new_namespaces
         copy_mnt_ns
          namespace_lock() --> down_write(&namespace_sem)
    
    Clearly cgroup_mutex can not be locked right after locking
    namespace_sem, because opposite locking order is also present in
    the code and should be removed from cgroup_show_options.
    After reviewing cgroup_show_options, it was established that
    cgroup_mutex is not absolutely needed to guarantee safe access to root_cgrp.
    It was used in combination with a call to task_cgroup_from_root to
    ensure that root_cgrp lived long enough to access it's value of
    release_agent path.
    But in this funciton we know that root_cgrp is part of ve->root_css_set,
    which holds reference to it. In turn root_css_set is referenced while
    ve->ve_ns is not NULL, the check of which we already have in the code.
    This means that root_cgrp is valid until ve->ve_ns is valid.
    ve->ve_ns is valid until the point of rcu_synchronize in ve_drop_context,
    that's why rcu_read_lock should be maintained all the time when root_cgrp
    is being accessed.
    The patch also removes BUG_ON from css_cgroup_from_root, because all 3 calls
    to this function pass ve->root_css_set as an argument and the above logic
    applies.
    
    https://jira.sw.ru/browse/PSBM-121438
    Signed-off-by: Valeriy Vdovin <[email protected]>
    
    Reviewed-by: Kirill Tkhai <[email protected]>
    
    +++
    ve: cleanup in function ve_get_release_agent_path
    
    Cherry-picked from vz7 commit
    f1199bd9589b ("ve/cgroup: set release_agent_path for root cgroups 
separately for
    each ve.")
    
    Signed-off-by: Valeriy Vdovin <[email protected]>
    Reviewed-by: Kirill Tkhai <[email protected]>
    
    =====================
    Patchset description:
    
    ve/cgroup: Port release_agent virtualization from vz7
    
    This patchset ports virtualization of cgroup release_agent
    virtualization from vz7.
    
    Major challanges of porting are differences between vz7 and vz8 cgroup
    implementations:
    - transition of cgroups to kernfs
    - slightly changed locking scheme, which relies on css_set_lock in
      places, previously relied on cgroup_mutex.
    
    There is a small number of patches that have been ported without
    modifications, but most of the patches had suffered a lot of
    modification due to the factors described above.
    
    v1:
      - original patchset
    v2:
      - removed port of CGRP_REMOVED due to the use of CSS_ONLINE in VZ8 for
        same reason
      - changed ve_set(get)_release_agent_path signature for more optimal
      - added ve->is_running check before calling userspace executable
    v3:
      - use goto after check for ve->is_running in last patch
---
 include/linux/cgroup-defs.h     |  3 --
 include/linux/ve.h              |  6 +++
 kernel/cgroup/cgroup-internal.h |  4 +-
 kernel/cgroup/cgroup-v1.c       | 86 ++++++++++++++++++++++++++++-------------
 kernel/cgroup/cgroup.c          |  9 +++--
 kernel/ve/ve.c                  | 76 ++++++++++++++++++++++++++++++++++++
 6 files changed, 150 insertions(+), 34 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 22d84aa0778e..57ee48874404 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -569,9 +569,6 @@ struct cgroup_root {
        /* IDs for cgroups in this hierarchy */
        struct idr cgroup_idr;
 
-       /* The path to use for release notifications. */
-       char release_agent_path[PATH_MAX];
-
        /* The name for this hierarchy - may be empty */
        char name[MAX_CGROUP_ROOT_NAMELEN];
 };
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 44369dddeb24..65c19f2b9b98 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -144,6 +144,12 @@ extern int nr_ve;
 #ifdef CONFIG_VE
 void ve_add_to_release_list(struct cgroup *cgrp);
 void ve_rm_from_release_list(struct cgroup *cgrp);
+
+int ve_set_release_agent_path(struct ve_struct *ve, struct cgroup *cgroot,
+       const char *release_agent);
+
+const char *ve_get_release_agent_path(struct cgroup *cgrp_root);
+
 extern struct ve_struct *get_ve(struct ve_struct *ve);
 extern void put_ve(struct ve_struct *ve);
 
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 4de66630d456..be0cd157d4dc 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -160,6 +160,9 @@ static inline bool notify_on_release(const struct cgroup 
*cgrp)
        return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
 }
 
+struct cgroup *cset_cgroup_from_root(struct css_set *cset,
+                                           struct cgroup_root *root);
+
 bool cgroup_ssid_enabled(int ssid);
 bool cgroup_on_dfl(const struct cgroup *cgrp);
 bool cgroup_is_thread_root(struct cgroup *cgrp);
@@ -180,7 +183,6 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 
ss_mask);
 struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
                               struct cgroup_root *root, unsigned long magic,
                               struct cgroup_namespace *ns);
-
 int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp);
 void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
 void cgroup_migrate_add_src(struct css_set *src_cset, struct cgroup *dst_cgrp,
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index c1891317ae3a..31585ab907a3 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -37,12 +37,6 @@ static bool cgroup_no_v1_named;
  */
 static struct workqueue_struct *cgroup_pidlist_destroy_wq;
 
-/*
- * Protects cgroup_subsys->release_agent_path.  Modifying it also requires
- * cgroup_mutex.  Reading requires either cgroup_mutex or this spinlock.
- */
-static DEFINE_SPINLOCK(release_agent_path_lock);
-
 bool cgroup1_ssid_disabled(int ssid)
 {
        return cgroup_no_v1_mask & (1 << ssid);
@@ -559,27 +553,36 @@ static ssize_t cgroup_release_agent_write(struct 
kernfs_open_file *of,
                                          char *buf, size_t nbytes, loff_t off)
 {
        struct cgroup *cgrp;
-
-       BUILD_BUG_ON(sizeof(cgrp->root->release_agent_path) < PATH_MAX);
+       int ret = 0;
+       struct cgroup *root_cgrp;
 
        cgrp = cgroup_kn_lock_live(of->kn, false);
-       if (!cgrp)
-               return -ENODEV;
-       spin_lock(&release_agent_path_lock);
-       strlcpy(cgrp->root->release_agent_path, strstrip(buf),
-               sizeof(cgrp->root->release_agent_path));
-       spin_unlock(&release_agent_path_lock);
+       root_cgrp = cgroup_get_local_root(cgrp);
+       BUG_ON(!root_cgrp);
+       if (root_cgrp->ve_owner)
+               ret = ve_set_release_agent_path(root_cgrp, buffer);
+       else
+               ret = -ENODEV;
+
        cgroup_kn_unlock(of->kn);
-       return nbytes;
+       return ret;
 }
 
 static int cgroup_release_agent_show(struct seq_file *seq, void *v)
 {
+       const char *release_agent;
+       struct cgroup *root_cgrp;
        struct cgroup *cgrp = seq_css(seq)->cgroup;
 
-       spin_lock(&release_agent_path_lock);
-       seq_puts(seq, cgrp->root->release_agent_path);
-       spin_unlock(&release_agent_path_lock);
+       root_cgrp = cgroup_get_local_root(cgrp);
+       if (root_cgrp->ve_owner) {
+               rcu_read_lock();
+               release_agent = ve_get_release_agent_path(root_cgrp);
+
+               if (release_agent)
+                       seq_puts(seq, release_agent);
+               rcu_read_unlock();
+       }
        seq_putc(seq, '\n');
        return 0;
 }
@@ -950,8 +953,10 @@ static int cgroup1_rename(struct kernfs_node *kn, struct 
kernfs_node *new_parent
 
 static int cgroup1_show_options(struct seq_file *seq, struct kernfs_root 
*kf_root)
 {
+       const char *release_agent;
        struct cgroup_root *root = cgroup_root_from_kf(kf_root);
        struct cgroup_subsys *ss;
+       struct cgroup *root_cgrp = &root->cgrp;
        int ssid;
 
        for_each_subsys(ss, ssid)
@@ -964,12 +969,36 @@ static int cgroup1_show_options(struct seq_file *seq, 
struct kernfs_root *kf_roo
        if (root->flags & CGRP_ROOT_CPUSET_V2_MODE)
                seq_puts(seq, ",cpuset_v2_mode");
 
-       spin_lock(&release_agent_path_lock);
-       if (strlen(root->release_agent_path))
-               seq_show_option(seq, "release_agent",
-                               root->release_agent_path);
-       spin_unlock(&release_agent_path_lock);
-
+       rcu_read_lock();
+#ifdef CONFIG_VE
+       {
+               struct ve_struct *ve = get_exec_env();
+               struct css_set *cset;
+               struct nsproxy *ve_ns;
+
+               if (!ve_is_super(ve)) {
+                       /*
+                        * ve->init_task is NULL in case when cgroup is accessed
+                        * before ve_start_container has been called.
+                        *
+                        * ve->init_task is synchronized via ve->ve_ns rcu, see
+                        * ve_grab_context/drop_context.
+                        */
+                       ve_ns = rcu_dereference(ve->ve_ns);
+                       if (ve_ns) {
+                               spin_lock_irq(&css_set_lock);
+                               cset = ve_ns->cgroup_ns->root_cset;
+                               BUG_ON(!cset);
+                               root_cgrp = cset_cgroup_from_root(cset, root);
+                               spin_unlock_irq(&css_set_lock);
+                       }
+               }
+       }
+#endif
+       release_agent = ve_get_release_agent_path(root_cgrp);
+       if (release_agent && release_agent[0])
+               seq_show_option(seq, "release_agent", release_agent);
+       rcu_read_unlock();
        if (test_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags))
                seq_puts(seq, ",clone_children");
        if (strlen(root->name))
@@ -1160,9 +1189,12 @@ static int cgroup1_remount(struct kernfs_root *kf_root, 
int *flags, char *data)
        WARN_ON(rebind_subsystems(&cgrp_dfl_root, removed_mask));
 
        if (opts.release_agent) {
-               spin_lock(&release_agent_path_lock);
-               strcpy(root->release_agent_path, opts.release_agent);
-               spin_unlock(&release_agent_path_lock);
+               struct cgroup *root_cgrp;
+
+               root_cgrp = cgroup_get_local_root(&root->cgrp);
+               if (root_cgrp->ve_owner)
+                       ret = ve_set_release_agent_path(root_cgrp,
+                               opts.release_agent);
        }
 
        trace_cgroup_remount(root);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index abba370eded0..7b4d303b9d7e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1434,7 +1434,7 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
 }
 
 /* look up cgroup associated with given css_set on the specified hierarchy */
-static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
+struct cgroup *cset_cgroup_from_root(struct css_set *cset,
                                            struct cgroup_root *root)
 {
        struct cgroup *res = NULL;
@@ -2048,8 +2048,11 @@ void init_cgroup_root(struct cgroup_root *root, struct 
cgroup_sb_opts *opts)
        idr_init(&root->cgroup_idr);
 
        root->flags = opts->flags;
-       if (opts->release_agent)
-               strscpy(root->release_agent_path, opts->release_agent, 
PATH_MAX);
+       if (opts.release_agent) {
+               ret = ve_set_release_agent_path(root_cgrp,
+                       opts.release_agent);
+       }
+
        if (opts->name)
                strscpy(root->name, opts->name, MAX_CGROUP_ROOT_NAMELEN);
        if (opts->cpuset_clone_children)
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index a108cb63bc9f..d704c77f9bc7 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -36,6 +36,12 @@ struct per_cgroot_data {
         * data is related to this cgroup
         */
        struct cgroup *cgroot;
+
+       /*
+        * path to release agent binaray, that should
+        * be spawned for all cgroups under this cgroup root
+        */
+       struct cgroup_rcu_string __rcu *release_agent_path;
 };
 
 extern struct kmapset_set sysfs_ve_perms_set;
@@ -257,6 +263,71 @@ static inline struct per_cgroot_data 
*per_cgroot_get_or_create(
        return data;
 }
 
+int ve_set_release_agent_path(struct cgroup *cgroot,
+       const char *release_agent)
+{
+       struct ve_struct *ve;
+       unsigned long flags;
+       struct per_cgroot_data *data;
+       struct cgroup_rcu_string *new_path, *old_path;
+       int nbytes;
+
+       /*
+        * caller should grab cgroup_mutex to safely use
+        * ve_owner field
+        */
+       ve = cgroot->ve_owner;
+       BUG_ON(!ve);
+
+       nbytes = strlen(release_agent);
+       new_path = cgroup_rcu_strdup(release_agent, nbytes);
+       if (IS_ERR(new_path))
+               return PTR_ERR(new_path);
+
+       data = per_cgroot_get_or_create(ve, cgroot);
+       if (IS_ERR(data)) {
+               kfree(new_path);
+               return PTR_ERR(data);
+       }
+
+       spin_lock_irqsave(&ve->per_cgroot_list_lock, flags);
+
+       old_path = rcu_dereference_protected(data->release_agent_path,
+               lockdep_is_held(&ve->per_cgroot_list_lock));
+
+       rcu_assign_pointer(data->release_agent_path, new_path);
+       spin_unlock_irqrestore(&ve->per_cgroot_list_lock, flags);
+
+       if (old_path)
+               kfree_rcu(old_path, rcu_head);
+
+       return 0;
+}
+
+const char *ve_get_release_agent_path(struct cgroup *cgroot)
+{
+       /* caller must grab rcu_read_lock */
+       const char *result = NULL;
+       struct per_cgroot_data *data;
+       struct cgroup_rcu_string *str;
+       struct ve_struct *ve;
+
+       ve = rcu_dereference(cgroot->ve_owner);
+       if (!ve)
+               return NULL;
+
+       raw_spin_lock(&ve->per_cgroot_list_lock);
+
+       data = per_cgroot_data_find_locked(&ve->per_cgroot_list, cgroot);
+       if (data) {
+               str = rcu_dereference(data->release_agent_path);
+               if (str)
+                       result = str->val;
+       }
+       raw_spin_unlock(&ve->per_cgroot_list_lock);
+       return result;
+}
+
 struct cgroup_subsys_state *ve_get_init_css(struct ve_struct *ve, int 
subsys_id)
 {
        struct cgroup_subsys_state *css;
@@ -595,9 +666,14 @@ static void ve_per_cgroot_free(struct ve_struct *ve)
 {
        struct per_cgroot_data *data, *saved;
        unsigned long flags;
+       struct cgroup_rcu_string *release_agent;
 
        spin_lock_irqsave(&ve->per_cgroot_list_lock, flags);
        list_for_each_entry_safe(data, saved, &ve->per_cgroot_list, list) {
+               release_agent = data->release_agent_path;
+               RCU_INIT_POINTER(data->release_agent_path, NULL);
+               if (release_agent)
+                       kfree_rcu(release_agent, rcu_head);
                list_del_init(&data->list);
                kfree(data);
        }
_______________________________________________
Devel mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/devel

Reply via email to