Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup

2016-01-06 Thread Parav Pandit
On Wed, Jan 6, 2016 at 3:31 AM, Tejun Heo  wrote:
> Hello,
>
> On Wed, Jan 06, 2016 at 12:28:03AM +0530, Parav Pandit wrote:
>> +/* hash table to keep map of tgid to owner cgroup */
>> +DEFINE_HASHTABLE(pid_cg_map_tbl, 7);
>> +DEFINE_SPINLOCK(pid_cg_map_lock);/* lock to protect hash table access */
>> +
>> +/* Keeps mapping of a pid to its owning cgroup at rdma level.
>> + * This mapping doesn't change, even if the process migrates from one
>> + * rdma cgroup to another.
>> + */
>> +struct pid_cg_map {
>> + struct pid *pid;/* hash key */
>> + struct rdma_cgroup *cg;
>> +
>> + struct hlist_node hlist;/* pid to cgroup hash table link */
>> + atomic_t refcnt;/* count active user tasks to figure out
>> +  * when to free the memory
>> +  */
>> +};
>
> Ugh, there's something clearly wrong here.  Why does the rdma
> controller need to keep track of pid cgroup membership?
>
An rdma resource can be allocated by a parent process and then used and
freed by a child process.
The child process could belong to a different rdma cgroup.
The parent process might have been terminated after creation of the
rdma cgroup (and the cgroup itself might have been deleted too).
This is discussed in https://lkml.org/lkml/2015/11/2/307

In a nutshell, there is no single process that clearly owns the rdma
resource throughout its lifetime. So, to keep the design simple, the
rdma resource is owned by the creator process and its cgroup, without
modifying task_struct.
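
For illustration, the free-path lookup of the owning cgroup would be
roughly along these lines (a minimal sketch built on the pid_cg_map
table quoted above; get_owner_cg() is only an illustrative helper name,
and using the struct pid pointer cast as the hash key is an assumption):

static struct rdma_cgroup *get_owner_cg(struct pid *pid)
{
	struct pid_cg_map *map;
	struct rdma_cgroup *cg = NULL;

	spin_lock(&pid_cg_map_lock);
	/* the hash key is the struct pid pointer, as in the table above */
	hash_for_each_possible(pid_cg_map_tbl, map, hlist, (unsigned long)pid) {
		if (map->pid == pid) {
			/* pin the entry while the freeing task uses it */
			atomic_inc(&map->refcnt);
			cg = map->cg;
			break;
		}
	}
	spin_unlock(&pid_cg_map_lock);
	return cg;
}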

>> +static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
>> +   struct cg_resource_pool *rpool)
>> +{
>> + spin_lock(&cg->cg_list_lock);
>> +
>> + /* if its started getting used by other task,
>> +  * before we take the spin lock, then skip,
>> +  * freeing it.
>> +  */
>
> Please follow CodingStyle.
>
>> + if (atomic_read(&rpool->refcnt) == 0) {
>> + list_del_init(&rpool->cg_list);
>> + spin_unlock(&cg->cg_list_lock);
>> +
>> + _free_cg_rpool(rpool);
>> + return;
>> + }
>> + spin_unlock(&cg->cg_list_lock);
>> +}
>> +
>> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
>> +  struct cg_resource_pool *rpool)
>> +{
>> + /* Don't free the resource pool which is created by the
>> +  * user, otherwise we miss the configured limits. We don't
>> +  * gain much either by splitting storage of limit and usage.
>> +  * So keep it around until user deletes the limits.
>> +  */
>> + if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
>> + _dealloc_cg_rpool(cg, rpool);
>
> I'm pretty sure you can get away with a fixed-length array of
> counters.  Please keep it simple.  It's a simple hard limit enforcer.
> There's no need to create a massive dynamic infrastructure.
>
Every resource pool for verbs resources is a fixed-length array; the
length of the array is defined by the IB stack modules.
This array is per cgroup, per device.
It's per device because we agreed that we want to address the
requirement of controlling/configuring these resources on a per-device
basis.
Devices appear and disappear; therefore the pools are allocated
dynamically. Otherwise this array could have been static in the cgroup
structure.
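
As a rough sketch of what that looks like (reusing the cg_resource and
cg_resource_pool definitions from the patch; alloc_cg_rpool(), the
num_resources parameter and the cg->rpool_head list head are
illustrative names, not necessarily the ones in the code):

static struct cg_resource_pool *alloc_cg_rpool(struct rdma_cgroup *cg,
					       struct ib_device *device,
					       int num_resources,
					       enum rdmacg_resource_pool_type type)
{
	struct cg_resource_pool *rpool;

	rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
	if (!rpool)
		return NULL;

	/* fixed-length counter array; the length is supplied by the IB
	 * stack (or by the HCA driver for hw-specific resources)
	 */
	rpool->resources = kcalloc(num_resources, sizeof(*rpool->resources),
				   GFP_KERNEL);
	if (!rpool->resources) {
		kfree(rpool);
		return NULL;
	}

	rpool->device = device;
	rpool->type = type;
	atomic_set(&rpool->creator, RDMACG_RPOOL_CREATOR_DEFAULT);
	atomic_set(&rpool->refcnt, 0);

	/* hang the pool off the cgroup's list of per-device pools */
	list_add_tail(&rpool->cg_list, &cg->rpool_head);
	return rpool;
}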



> Thanks.
>
> --
> tejun


[PATCHv1 3/6] rdmacg: implements rdma cgroup

2016-01-05 Thread Parav Pandit
Adds an RDMA controller to limit the number of RDMA resources that can
be consumed by processes of an rdma cgroup.

RDMA resources are global resources that can be exhausted without
hitting any kmemcg or other policy limit. The RDMA cgroup implementation
allows well-defined RDMA/IB resources to be limited on a per-cgroup
basis.

RDMA resources are tracked using resource pools. A resource pool is a
per-device, per-cgroup, per-pool-type entity, which allows setting up
accounting limits on a per-device basis.

The RDMA cgroup returns an error when user space applications try to
allocate more resources than the configured limit.

Rdma cgroup implements resource accounting for two types of resource
pools.
(a) RDMA IB specification level verb resources defined by IB stack
(b) HCA vendor device specific resources defined by vendor device driver

Resources are not defined by the RDMA cgroup itself; instead they are
defined by external modules, typically the IB stack and optionally HCA
drivers for those RDMA devices which don't have a one-to-one mapping of
IB verb resources to hardware resources.
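
For illustration, the enforcement described above (failing an allocation
once the configured limit would be exceeded) boils down to a check along
these lines (a sketch built on the cg_resource counters introduced
below; try_charge_resource() is an illustrative name, and the sketch
ignores the "no limit configured" case):

static int try_charge_resource(struct cg_resource *res, int num)
{
	/* charge first, then back out if the configured hard limit
	 * would be exceeded
	 */
	if (atomic_add_return(num, &res->usage) > res->limit) {
		atomic_sub(num, &res->usage);
		atomic_inc(&res->failcnt);
		return -EAGAIN;
	}
	return 0;
}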

Signed-off-by: Parav Pandit 
---
 include/linux/cgroup_subsys.h |4 +
 init/Kconfig  |   12 +
 kernel/Makefile   |1 +
 kernel/cgroup_rdma.c  | 1220 +
 4 files changed, 1237 insertions(+)
 create mode 100644 kernel/cgroup_rdma.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..d0e597c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -56,6 +56,10 @@ SUBSYS(hugetlb)
 SUBSYS(pids)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_RDMA)
+SUBSYS(rdma)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f8754f5..f8055f5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1070,6 +1070,18 @@ config CGROUP_PIDS
  since the PIDs limit only affects a process's ability to fork, not to
  attach to a cgroup.
 
+config CGROUP_RDMA
+   bool "RDMA controller"
+   help
+ Provides enforcement of RDMA resources at the RDMA/IB verb level and
+ enforcement of any RDMA/IB capable hardware advertised resources.
+ It is fairly easy for applications to exhaust RDMA resources, which
+ can result in kernel consumers or other application consumers of
+ RDMA resources being left with no resources. The RDMA controller is
+ designed to stop this from happening.
+ Attaching existing processes with active RDMA resources to the cgroup
+ hierarchy is allowed even if doing so crosses the hierarchy's limit.
+
 config CGROUP_FREEZER
bool "Freezer controller"
help
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf00..26e413c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
+obj-$(CONFIG_CGROUP_RDMA) += cgroup_rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_rdma.c b/kernel/cgroup_rdma.c
new file mode 100644
index 000..14c6fab
--- /dev/null
+++ b/kernel/cgroup_rdma.c
@@ -0,0 +1,1220 @@
+/*
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+enum rdmacg_file_type {
+   RDMACG_VERB_RESOURCE_LIMIT,
+   RDMACG_VERB_RESOURCE_USAGE,
+   RDMACG_VERB_RESOURCE_FAILCNT,
+   RDMACG_VERB_RESOURCE_LIST,
+   RDMACG_HW_RESOURCE_LIMIT,
+   RDMACG_HW_RESOURCE_USAGE,
+   RDMACG_HW_RESOURCE_FAILCNT,
+   RDMACG_HW_RESOURCE_LIST,
+};
+
+#define RDMACG_USR_CMD_REMOVE "remove"
+
+/* resource tracker per resource for rdma cgroup */
+struct cg_resource {
+   atomic_t usage;
+   int limit;
+   atomic_t failcnt;
+};
+
+/**
+ * Pool creator type: indicates whether the pool got created as part of
+ * the default operation or because the user configured the cgroup.
+ * Depending on the creator of the pool, it is decided whether to free
+ * it up later or not.
+ */
+enum rpool_creator {
+   RDMACG_RPOOL_CREATOR_DEFAULT,
+   RDMACG_RPOOL_CREATOR_USR,
+};
+
+/**
+ * Resource pool object which represents per-cgroup, per-device,
+ * per-pool-type resources.
+ */
+struct cg_resource_pool {
+   struct list_head cg_list;
+   struct ib_device *device;
+   enum rdmacg_resource_pool_type type;
+
+   struct cg_resource *resources;
+
+   atomic_t refcnt;/* count active user tasks of this pool */
+   atomic_t creator;   /* user created or default type */
+};
+
+struct 

Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup

2016-01-05 Thread Tejun Heo
Hello,

On Wed, Jan 06, 2016 at 12:28:03AM +0530, Parav Pandit wrote:
> +/* hash table to keep map of tgid to owner cgroup */
> +DEFINE_HASHTABLE(pid_cg_map_tbl, 7);
> +DEFINE_SPINLOCK(pid_cg_map_lock);/* lock to protect hash table access */
> +
> +/* Keeps mapping of a pid to its owning cgroup at rdma level.
> + * This mapping doesn't change, even if the process migrates from one
> + * rdma cgroup to another.
> + */
> +struct pid_cg_map {
> + struct pid *pid;/* hash key */
> + struct rdma_cgroup *cg;
> +
> + struct hlist_node hlist;/* pid to cgroup hash table link */
> + atomic_t refcnt;/* count active user tasks to figure out
> +  * when to free the memory
> +  */
> +};

Ugh, there's something clearly wrong here.  Why does the rdma
controller need to keep track of pid cgroup membership?

> +static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
> +   struct cg_resource_pool *rpool)
> +{
> + spin_lock(&cg->cg_list_lock);
> +
> + /* if its started getting used by other task,
> +  * before we take the spin lock, then skip,
> +  * freeing it.
> +  */

Please follow CodingStyle.

> + if (atomic_read(&rpool->refcnt) == 0) {
> + list_del_init(&rpool->cg_list);
> + spin_unlock(&cg->cg_list_lock);
> +
> + _free_cg_rpool(rpool);
> + return;
> + }
> + spin_unlock(&cg->cg_list_lock);
> +}
> +
> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
> +  struct cg_resource_pool *rpool)
> +{
> + /* Don't free the resource pool which is created by the
> +  * user, otherwise we miss the configured limits. We don't
> +  * gain much either by splitting storage of limit and usage.
> +  * So keep it around until user deletes the limits.
> +  */
> + if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
> + _dealloc_cg_rpool(cg, rpool);

I'm pretty sure you can get away with a fixed-length array of
counters.  Please keep it simple.  It's a simple hard limit enforcer.
There's no need to create a massive dynamic infrastructure.

Thanks.

-- 
tejun