Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.

2016-01-06 Thread Parav Pandit
On Thu, Jan 7, 2016 at 4:27 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
>
> On Thu, Jan 07, 2016 at 04:14:26AM +0530, Parav Pandit wrote:
>> Yes. I read through. I can see two changes to be made in the V2 version
>> of this patch.
>> 1. rdma.resource.verb.usage and rdma.resource.verb.limit change
>> respectively to rdma.resource.verb.stat and rdma.resource.verb.max.
>> 2. rdma.resource.verb.failcnt indicates failure events, which I think
>> should go to events.
>
> What's up with the ".resource" part?

I can remove "resource" key word. If just that if something other than
resource comes up to limit to in future, it will be hard to define at
that time.

> Also can't the .max file list
> the available resources?  Why does it need a separate list file?
>
The max file does list them, but only after limits are configured for that
device. That's when the rpool (array of max and usage counts) is allocated.

If the user wants to know which knobs are available, the list file exposes
them on a per-device basis without mentioning an actual limit and without
allocating the rpool arrays.

If you are hinting that I should allocate the rpool array when the rdma
cgroup is created, that can be done for already discovered devices.
For new devices discovered after the cgroup is created, we anyway have to
allocate/free when they appear/disappear.

In a different implementation, a list of all the rdma cgroups could be
maintained, and rpool arrays could be allocated for all of them when a new
device appears/disappears. This would move the complexity of dynamic
allocation from try_charge/uncharge to the device addition and removal
APIs, i.e. at the ib_register_ib_device() level.
However this comes with a memory cost: even if a device doesn't
participate in a cgroup, rpool memory will be allocated for it in
each such rdma cgroup.

The list file looks like below for two device entries:
mlx4_0 ah qp mr pd srq flow
ocrdma0 ah qp mr pd

The max file looks like below:
mlx4_0 ah=100 qp=40 mr=10 pd=90 srq=10 flow=10
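
For illustration only (this is not from the posted patches): one key=value
token of such a line could be parsed against the match_table_t that patch 4
publishes roughly as below; the parse_one_limit() helper name is made up.

#include <linux/parser.h>
#include <linux/errno.h>

/* Illustrative helper: map one "name=value" token (e.g. "qp=40") to the
 * resource index from the verbs token table and its integer value.
 */
static int parse_one_limit(match_table_t table, char *token,
			   int *resource_index, int *value)
{
	substring_t args[MAX_OPT_ARGS];
	int idx;

	idx = match_token(token, table, args);
	if (idx < 0)
		return -EINVAL;		/* unknown resource name */

	if (match_int(&args[0], value))
		return -EINVAL;		/* malformed integer */

	*resource_index = idx;
	return 0;
}

The device name at the start of the line is matched separately against the
registered device names.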


>> I will roll out a new patch for events after this patch as an additional
>> feature and remove this feature in V2.
>>
>> rdma.resource.verb.list file is unique to rdma cgroup, so I believe
>> this is fine.
>
> Please see above.
>
>> We will conclude whether to have rdma.resource.hw. or not in
>> other patches.
>> I am of the opinion to keep the "resource" and "verb" or "hw" tags around
>> to keep it verbose enough to know what we are trying to control.
>
> What does that achieve?  I feel that it's getting overengineered
> constantly.

Please see above for "resource". I guess we are not losing anything
by having "rdma.resource" vs. just having "rdma".
But if that sounds like too much, we can remove "resource".

>
> Thanks.
>
> --
> tejun


Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.

2016-01-06 Thread Parav Pandit
On Wed, Jan 6, 2016 at 3:23 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
>
> On Wed, Jan 06, 2016 at 12:28:06AM +0530, Parav Pandit wrote:
>> +5-4-1. RDMA Interface Files
>> +
>> +  rdma.resource.verb.list
>> +  rdma.resource.verb.limit
>> +  rdma.resource.verb.usage
>> +  rdma.resource.verb.failcnt
>> +  rdma.resource.hw.list
>> +  rdma.resource.hw.limit
>> +  rdma.resource.hw.usage
>> +  rdma.resource.hw.failcnt
>
> Can you please read the rest of cgroup.txt and put the interface in
> line with the common conventions followed by other controllers?
>

Yes. I read through. I can see two changes to be made in the V2 version
of this patch.
1. rdma.resource.verb.usage and rdma.resource.verb.limit change
respectively to rdma.resource.verb.stat and rdma.resource.verb.max.
2. rdma.resource.verb.failcnt indicates failure events, which I think
should go to events.
I will roll out a new patch for events after this patch as an additional
feature and remove this feature in V2.

rdma.resource.verb.list file is unique to rdma cgroup, so I believe
this is fine.

We will conclude whether to have rdma.resource.hw. or not in
other patches.
I am of the opinion to keep the "resource" and "verb" or "hw" tags around
to keep it verbose enough to know what we are trying to control.

Is that ok?

> Thanks.
>
> --
> tejun


Re: [PATCHv1 0/6] rdma controller support

2016-01-06 Thread Parav Pandit
Hi Tejun,

On Wed, Jan 6, 2016 at 3:26 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
>
> On Wed, Jan 06, 2016 at 12:28:00AM +0530, Parav Pandit wrote:
>> Resources are not defined by the RDMA cgroup. Resources are defined
>> by RDMA/IB stack & optionally by HCA vendor device drivers.
>
> As I wrote before, I don't think this is a good idea.  Drivers will
> inevitably add non-sensical "resources" which don't make any sense
> without much scrutiny.

In our last discussion on the v0 patch,
http://lkml.iu.edu/hypermail/linux/kernel/1509.1/04331.html

the direction was that vendors should be able to define their own resources.
> If different controllers can't agree upon the
> same set of resources, which probably is a pretty good sign that this
> isn't too well thought out to begin with,

When you said "different controller", did you mean "different hw vendors"?
Or did you mean rdma, mem, cpu as controllers here?

> at least make all resource
> types defined by the controller itself and let the controllers enable
> them selectively.
>
In this V1 patch, resources are defined by the IB stack and the rdma cgroup
is the facilitator for them.
By doing so, IB stack modules can define new resources without really
making changes to the cgroup.
This design also allows hw vendors to define their own resources, which
will be reviewed on the rdma mailing list anyway.
The idea is that different hw versions can have different resource support,
so the whole intention is not about defining different resources but
rather enabling them.
But yes, I equally agree that by doing so, different hw controller
vendors can define different hw resources.


> Thanks.
>
> --
> tejun


Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup

2016-01-06 Thread Parav Pandit
On Wed, Jan 6, 2016 at 3:31 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello,
>
> On Wed, Jan 06, 2016 at 12:28:03AM +0530, Parav Pandit wrote:
>> +/* hash table to keep map of tgid to owner cgroup */
>> +DEFINE_HASHTABLE(pid_cg_map_tbl, 7);
>> +DEFINE_SPINLOCK(pid_cg_map_lock);/* lock to protect hash table access */
>> +
>> +/* Keeps mapping of pid to its owning cgroup at rdma level,
>> + * This mapping doesn't change, even if process migrates from one to other
>> + * rdma cgroup.
>> + */
>> +struct pid_cg_map {
>> + struct pid *pid;/* hash key */
>> + struct rdma_cgroup *cg;
>> +
>> + struct hlist_node hlist;/* pid to cgroup hash table link */
>> + atomic_t refcnt;/* count active user tasks to figure out
>> +  * when to free the memory
>> +  */
>> +};
>
> Ugh, there's something clearly wrong here.  Why does the rdma
> controller need to keep track of pid cgroup membership?
>
An rdma resource can be allocated by a parent process and used and freed by
a child process.
The child process could belong to a different rdma cgroup.
The parent process might have been terminated after creation of the rdma
cgroup. (Following that, the cgroup might have been deleted too.)
This is discussed in https://lkml.org/lkml/2015/11/2/307

In a nutshell, there is no single process that clearly owns the rdma
resource. So to keep the design simple, the rdma resource is owned by the
creator process and its cgroup, without modifying the task_struct.
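
For reference, a minimal sketch of that ownership model as the uverbs layer
applies it in patch 5: the charge is made against the thread group leader's
pid, so the same pid (and hence the creator's cgroup) is used for the
matching uncharge even if another task frees the resource. The wrapper name
below is made up for illustration.

#include <linux/pid.h>
#include <linux/sched.h>
#include <linux/cgroup_rdma.h>
#include <rdma/ib_verbs.h>

/* Illustrative wrapper: charge one QP to the creator's rdma cgroup. */
static int charge_one_qp(struct ib_device *ib_dev, struct pid **owner)
{
	struct pid *tgid;
	int ret;

	rcu_read_lock();
	tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
	rcu_read_unlock();

	ret = rdmacg_try_charge_resource(ib_dev, tgid,
					 RDMACG_RESOURCE_POOL_VERB,
					 RDMA_VERB_RESOURCE_QP, 1);
	if (ret) {
		put_pid(tgid);		/* over the configured limit */
		return ret;
	}

	/* kept and passed to rdmacg_uncharge_resource() at free time */
	*owner = tgid;
	return 0;
}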

>> +static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
>> +   struct cg_resource_pool *rpool)
>> +{
>> + spin_lock(&cg->cg_list_lock);
>> +
>> + /* if its started getting used by other task,
>> +  * before we take the spin lock, then skip,
>> +  * freeing it.
>> +  */
>
> Please follow CodingStyle.
>
>> + if (atomic_read(&rpool->refcnt) == 0) {
>> + list_del_init(&rpool->cg_list);
>> + spin_unlock(&cg->cg_list_lock);
>> +
>> + _free_cg_rpool(rpool);
>> + return;
>> + }
>> + spin_unlock(&cg->cg_list_lock);
>> +}
>> +
>> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
>> +  struct cg_resource_pool *rpool)
>> +{
>> + /* Don't free the resource pool which is created by the
>> +  * user, otherwise we miss the configured limits. We don't
>> +  * gain much either by splitting storage of limit and usage.
>> +  * So keep it around until user deletes the limits.
>> +  */
>> + if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
>> + _dealloc_cg_rpool(cg, rpool);
>
> I'm pretty sure you can get away with an fixed length array of
> counters.  Please keep it simple.  It's a simple hard limit enforcer.
> There's no need to create a massive dynamic infrastrucure.
>
Every resource pool for verbs resources is a fixed-length array. The length
of the array is defined by the IB stack modules.
This array is per cgroup, per device.
It is per device because we agreed that we want to address the requirement
of controlling/configuring them on a per-device basis.
Devices appear and disappear, therefore the arrays are allocated dynamically.
Otherwise this array could be static in the cgroup structure.
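
As an illustration (not the exact patch code), the array length comes
straight from the rdmacg_pool_info that the IB stack publishes, so the
cgroup core never hard-codes a resource count; the alloc_cg_rpool() helper
below is made up and the field names follow patch 3.

#include <linux/slab.h>
#include <linux/cgroup_rdma.h>

static struct cg_resource_pool *alloc_cg_rpool(struct ib_device *device,
					       struct rdmacg_pool_info *info)
{
	struct cg_resource_pool *rpool;

	rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
	if (!rpool)
		return NULL;

	/* one {usage, limit, failcnt} counter per resource of this device */
	rpool->resources = kcalloc(info->resource_count,
				   sizeof(*rpool->resources), GFP_KERNEL);
	if (!rpool->resources) {
		kfree(rpool);
		return NULL;
	}

	rpool->device = device;
	return rpool;
}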



> Thanks.
>
> --
> tejun


Re: [PATCHv1 2/6] IB/core: Added members to support rdma cgroup

2016-01-06 Thread Parav Pandit
On Wed, Jan 6, 2016 at 3:26 AM, Tejun Heo <t...@kernel.org> wrote:
> On Wed, Jan 06, 2016 at 12:28:02AM +0530, Parav Pandit wrote:
>> Added function pointer table to store resource pool specific
>> operation for each resource type (verb and hw).
>> Added list node to link device to rdma cgroup so that it can
>> participate in resource accounting and limit configuration.
>
> Is there any point in splitting patches 1 and 2 from 3?
>
Patch 2 is in the IB stack, so I separated that patch out from 1. That
makes it 3 patches.
If you think a single patch is easier to review, let me know; I can
respin to have one patch for these 3 smaller patches.

> --
> tejun


[PATCHv1 0/6] rdma controller support

2016-01-05 Thread Parav Pandit
This patchset adds support for the RDMA cgroup by addressing the review
comments on [1] and implementing the published RFC [2].

Overview:
Currently user space applications can easily take away all the rdma
device specific resources such as AH, CQ, QP, MR, etc. Because of this, other
applications in another cgroup or kernel space ULPs may not even get a chance
to allocate any rdma resources. This results in service unavailability.

RDMA cgroup addresses this issue by allowing resource accounting and
limit enforcement on a per cgroup, per rdma device basis.

Resources are not defined by the RDMA cgroup. Resources are defined
by the RDMA/IB stack and optionally by HCA vendor device drivers.
This allows the rdma cgroup to remain constant while the RDMA/IB
stack can evolve without the need for rdma cgroup updates. A new
resource can be easily added by the RDMA/IB stack without touching
the rdma cgroup.

RDMA uverbs layer will enforce limits on well defined RDMA verb
resources without any HCA vendor device driver involvement.

The RDMA uverbs layer will not do accounting of hw vendor specific resources.
Instead the rdma cgroup provides a set of APIs through which vendor specific
drivers can define their own resources (up to 64) that can be accounted by
the rdma cgroup.
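
As an illustration of that vendor hook (the "foo" driver and its resource
names below are made up; only the rdma cgroup APIs from patches 1 and 4 are
from the series), an HCA driver could publish its own token table and
register it for the hw pool type before calling ib_register_device():

#include <linux/parser.h>
#include <linux/cgroup_rdma.h>
#include <rdma/ib_verbs.h>

enum foo_hw_resource_type {
	FOO_HW_RESOURCE_HW_QP,
	FOO_HW_RESOURCE_HW_MR,
	FOO_HW_RESOURCE_MAX,	/* must stay within RDMACG_MAX_RESOURCE_INDEX (64) */
};

static match_table_t foo_hw_tokens = {
	{FOO_HW_RESOURCE_HW_QP, "hw_qp=%d"},
	{FOO_HW_RESOURCE_HW_MR, "hw_mr=%d"},
	{-1, NULL}
};

static struct rdmacg_pool_info foo_hw_pool_info = {
	.resource_table = foo_hw_tokens,
	.resource_count = FOO_HW_RESOURCE_MAX,
};

static struct rdmacg_pool_info *
foo_get_hw_pool_tokens(struct ib_device *device)
{
	return &foo_hw_pool_info;
}

static struct rdmacg_resource_pool_ops foo_hw_pool_ops = {
	.get_resource_pool_tokens = &foo_get_hw_pool_tokens,
};

/* to be called by the driver before ib_register_device() */
static void foo_register_hw_resources(struct ib_device *device)
{
	rdmacg_set_rpool_ops(device, RDMACG_RESOURCE_POOL_HW, &foo_hw_pool_ops);
}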

Resource limit enforcement is hierarchical.

When a process is migrated with active RDMA resources, the rdma cgroup
continues to charge the original cgroup.

Changes from v0:
(To address comments from Haggai, Doug, Liran, Tejun, Sean, Jason)
 * Redesigned to support per device per cgroup limit settings by bringing
   concept of resource pool.
 * Redesigned to let IB stack define the resources instead of rdma controller
   using resource template.
 * Redesigned to support hw vendor specific limits setting (optional to 
drivers).
 * Created new rdma controller instead of piggyback on device cgroup.
 * Fixed race conditions for multiple tasks sharing rdma resources.
 * Removed dependency on the task_struct.

[1] https://lkml.org/lkml/2015/9/7/476
[2] https://lkml.org/lkml/2015/10/28/144

This patchset is for Tejun's for-4.5 branch.
It is not attempted on Doug's rdma tree yet, which I will do once I receive
comments for this patchset.

Parav Pandit (6):
  rdmacg: Added rdma cgroup header file
  IB/core: Added members to support rdma cgroup
  rdmacg: implements rdma cgroup
  IB/core: rdmacg support infrastructure APIs
  IB/core: use rdma cgroup for resource accounting
  rdmacg: Added documentation for rdma controller.

 Documentation/cgroup-legacy/rdma.txt  |  129 
 Documentation/cgroup.txt  |   79 +++
 drivers/infiniband/core/Makefile  |1 +
 drivers/infiniband/core/cgroup.c  |   80 +++
 drivers/infiniband/core/core_priv.h   |5 +
 drivers/infiniband/core/device.c  |8 +
 drivers/infiniband/core/uverbs_cmd.c  |  244 ++-
 drivers/infiniband/core/uverbs_main.c |   30 +
 include/linux/cgroup_rdma.h   |   91 +++
 include/linux/cgroup_subsys.h |4 +
 include/rdma/ib_verbs.h   |   20 +
 init/Kconfig  |   12 +
 kernel/Makefile   |1 +
 kernel/cgroup_rdma.c  | 1220 +
 14 files changed, 1907 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/cgroup-legacy/rdma.txt
 create mode 100644 drivers/infiniband/core/cgroup.c
 create mode 100644 include/linux/cgroup_rdma.h
 create mode 100644 kernel/cgroup_rdma.c

-- 
1.8.3.1



[PATCHv1 3/6] rdmacg: implements rdma cgroup

2016-01-05 Thread Parav Pandit
Adds RDMA controller to limit the number of RDMA resources that can be
consumed by processes of a rdma cgroup.

RDMA resources are global resources that can be exhausted without
reaching any kmemcg or other policy limit. The RDMA cgroup implementation
allows limiting well defined RDMA/IB resources per cgroup.

RDMA resources are tracked using a resource pool. A resource pool is a per
device, per cgroup, per resource pool_type entity which allows setting
up accounting and limits on a per device basis.

The RDMA cgroup returns an error when user space applications try to
allocate more resources than the configured limit.

The rdma cgroup implements resource accounting for two types of resource
pools:
(a) RDMA IB specification level verb resources defined by the IB stack
(b) HCA vendor device specific resources defined by the vendor device driver

Resources are not defined by the RDMA cgroup; instead they are defined
by an external module, typically the IB stack and optionally HCA drivers
for those RDMA devices which don't have a one-to-one mapping of IB verb
resources to hardware resources.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 include/linux/cgroup_subsys.h |4 +
 init/Kconfig  |   12 +
 kernel/Makefile   |1 +
 kernel/cgroup_rdma.c  | 1220 +
 4 files changed, 1237 insertions(+)
 create mode 100644 kernel/cgroup_rdma.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..d0e597c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -56,6 +56,10 @@ SUBSYS(hugetlb)
 SUBSYS(pids)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_RDMA)
+SUBSYS(rdma)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f8754f5..f8055f5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1070,6 +1070,18 @@ config CGROUP_PIDS
  since the PIDs limit only affects a process's ability to fork, not to
  attach to a cgroup.
 
+config CGROUP_RDMA
+   bool "RDMA controller"
+   help
+ Provides enforcement of RDMA resources at RDMA/IB verb level and
+ enforcement of any RDMA/IB capable hardware advertised resources.
+ It is fairly easy for applications to exhaust RDMA resources, which
+ can result in kernel consumers or other application consumers of
+ RDMA resources being left with no resources. The RDMA controller is
+ designed to stop this from happening.
+ Attaching existing processes with active RDMA resources to the cgroup
+ hierarchy will be allowed even if it crosses the hierarchy's limit.
+
 config CGROUP_FREEZER
bool "Freezer controller"
help
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf00..26e413c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
+obj-$(CONFIG_CGROUP_RDMA) += cgroup_rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_rdma.c b/kernel/cgroup_rdma.c
new file mode 100644
index 000..14c6fab
--- /dev/null
+++ b/kernel/cgroup_rdma.c
@@ -0,0 +1,1220 @@
+/*
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+enum rdmacg_file_type {
+   RDMACG_VERB_RESOURCE_LIMIT,
+   RDMACG_VERB_RESOURCE_USAGE,
+   RDMACG_VERB_RESOURCE_FAILCNT,
+   RDMACG_VERB_RESOURCE_LIST,
+   RDMACG_HW_RESOURCE_LIMIT,
+   RDMACG_HW_RESOURCE_USAGE,
+   RDMACG_HW_RESOURCE_FAILCNT,
+   RDMACG_HW_RESOURCE_LIST,
+};
+
+#define RDMACG_USR_CMD_REMOVE "remove"
+
+/* resource tracker per resource for rdma cgroup */
+struct cg_resource {
+   atomic_t usage;
+   int limit;
+   atomic_t failcnt;
+};
+
+/**
+ * pool type indicating either it got created as part of default
+ * operation or user has configured the group.
+ * Depends on the creator of the pool, its decided to free up
+ * later or not.
+ */
+enum rpool_creator {
+   RDMACG_RPOOL_CREATOR_DEFAULT,
+   RDMACG_RPOOL_CREATOR_USR,
+};
+
+/**
+ * resource pool object which represents, per cgroup, per device,
+ * per resource pool_type resources.
+ */
+struct cg_resource_pool {
+   struct list_head cg_list;
+   struct ib_device *device;
+   enum rdmacg_resource_pool_type type;
+
+   struct cg_resource *resources;
+
+   atomic_t refcnt;/* count active user tasks of this pool */
+   atomic_t creator;   /* user crea

[PATCHv1 2/6] IB/core: Added members to support rdma cgroup

2016-01-05 Thread Parav Pandit
Added function pointer table to store resource pool specific
operation for each resource type (verb and hw).
Added list node to link device to rdma cgroup so that it can
participate in resource accounting and limit configuration.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 include/rdma/ib_verbs.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9a68a19..1a17249 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -51,6 +51,7 @@
 #include 
 #include 
 
+#include <linux/cgroup_rdma.h>
 #include 
 #include 
 #include 
@@ -1823,6 +1824,12 @@ struct ib_device {
u8   node_type;
u8   phys_port_cnt;
 
+#ifdef CONFIG_CGROUP_RDMA
+   struct rdmacg_resource_pool_ops
+   *rpool_ops[RDMACG_RESOURCE_POOL_TYPE_MAX];
+   struct list_head rdmacg_list;
+#endif
+
/**
 * The following mandatory functions are used only at device
 * registration.  Keep functions such as these at the end of this
-- 
1.8.3.1



[PATCHv1 1/6] rdmacg: Added rdma cgroup header file

2016-01-05 Thread Parav Pandit
Added the rdma cgroup header file which defines its APIs to perform
charging/uncharging functionality.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 include/linux/cgroup_rdma.h | 91 +
 1 file changed, 91 insertions(+)
 create mode 100644 include/linux/cgroup_rdma.h

diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
new file mode 100644
index 000..01d220f
--- /dev/null
+++ b/include/linux/cgroup_rdma.h
@@ -0,0 +1,91 @@
+#ifndef _CGROUP_RDMA_H
+#define _CGROUP_RDMA_H
+
+/*
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+enum rdmacg_resource_pool_type {
+   RDMACG_RESOURCE_POOL_VERB,
+   RDMACG_RESOURCE_POOL_HW,
+   RDMACG_RESOURCE_POOL_TYPE_MAX,
+};
+
+struct ib_device;
+struct pid;
+struct match_token;
+
+#ifdef CONFIG_CGROUP_RDMA
+#define RDMACG_MAX_RESOURCE_INDEX (64)
+
+struct rdmacg_pool_info {
+   struct match_token *resource_table;
+   int resource_count;
+};
+
+struct rdmacg_resource_pool_ops {
+   struct rdmacg_pool_info*
+   (*get_resource_pool_tokens)(struct ib_device *);
+};
+
+/* APIs for RDMA/IB subsystem to publish when a device wants to
+ * participate in resource accounting
+ */
+void rdmacg_register_ib_device(struct ib_device *device);
+void rdmacg_unregister_ib_device(struct ib_device *device);
+
+/* APIs for RDMA/IB subsystem to charge/uncharge pool specific resources */
+int rdmacg_try_charge_resource(struct ib_device *device,
+  struct pid *pid,
+  enum rdmacg_resource_pool_type type,
+  int resource_index,
+  int num);
+void rdmacg_uncharge_resource(struct ib_device *device,
+ struct pid *pid,
+ enum rdmacg_resource_pool_type type,
+ int resource_index,
+ int num);
+
+void rdmacg_set_rpool_ops(struct ib_device *device,
+ enum rdmacg_resource_pool_type pool_type,
+ struct rdmacg_resource_pool_ops *ops);
+void rdmacg_clear_rpool_ops(struct ib_device *device,
+   enum rdmacg_resource_pool_type pool_type);
+int rdmacg_query_resource_limit(struct ib_device *device,
+   struct pid *pid,
+   enum rdmacg_resource_pool_type type,
+   int *limits, int max_count);
+#else
+/* APIs for RDMA/IB subsystem to charge/uncharge device specific resources */
+static inline
+int rdmacg_try_charge_resource(struct ib_device *device,
+  struct pid *pid,
+  enum rdmacg_resource_pool_type type,
+  int resource_index,
+  int num)
+{ return 0; }
+
+static inline void rdmacg_uncharge_resource(struct ib_device *device,
+   struct pid *pid,
+   enum rdmacg_resource_pool_type type,
+   int resource_index,
+   int num)
+{ }
+
+static inline
+int rdmacg_query_resource_limit(struct ib_device *device,
+   struct pid *pid,
+   enum rdmacg_resource_pool_type type,
+   int *limits, int max_count)
+{
+   int i;
+
+   for (i = 0; i < max_count; i++)
+   limits[i] = S32_MAX;
+
+   return 0;
+}
+#endif /* CONFIG_CGROUP_RDMA */
+#endif /* _CGROUP_RDMA_H */
-- 
1.8.3.1



[PATCHv1 5/6] IB/core: use rdma cgroup for resource accounting

2016-01-05 Thread Parav Pandit
It uses the charge API to perform resource charging before allocating the
low level resource. It continues to link the resource to the owning
thread group leader task.
It uncharges the resource after successful deallocation of the resource.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 drivers/infiniband/core/device.c  |   8 ++
 drivers/infiniband/core/uverbs_cmd.c  | 244 +++---
 drivers/infiniband/core/uverbs_main.c |  30 +
 3 files changed, 265 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 179e813..59cab6b 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -352,6 +352,10 @@ int ib_register_device(struct ib_device *device,
goto out;
}
 
+#ifdef CONFIG_CGROUP_RDMA
+   ib_device_register_rdmacg(device);
+#endif
+
ret = ib_device_register_sysfs(device, port_callback);
if (ret) {
printk(KERN_WARNING "Couldn't register device %s with driver 
model\n",
@@ -405,6 +409,10 @@ void ib_unregister_device(struct ib_device *device)
 
mutex_unlock(&device_mutex);
 
+#ifdef CONFIG_CGROUP_RDMA
+   ib_device_unregister_rdmacg(device);
+#endif
+
ib_device_unregister_sysfs(device);
ib_cache_cleanup_one(device);
 
diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index 94816ae..1b3d60b 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -294,6 +294,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 #endif
struct ib_ucontext   *ucontext;
struct file  *filp;
+   struct pid   *tgid;
int ret;
 
if (out_len < sizeof resp)
@@ -313,10 +314,20 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
   (unsigned long) cmd.response + sizeof resp,
   in_len - sizeof cmd, out_len - sizeof resp);
 
+   rcu_read_lock();
+   tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
+   rcu_read_unlock();
+
+   ret = rdmacg_try_charge_resource(ib_dev, tgid,
+RDMACG_RESOURCE_POOL_VERB,
+RDMA_VERB_RESOURCE_UCTX, 1);
+   if (ret)
+   goto err_charge;
+
ucontext = ib_dev->alloc_ucontext(ib_dev, &udata);
if (IS_ERR(ucontext)) {
ret = PTR_ERR(ucontext);
-   goto err;
+   goto err_alloc;
}
 
ucontext->device = ib_dev;
@@ -330,7 +341,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
INIT_LIST_HEAD(&ucontext->xrcd_list);
INIT_LIST_HEAD(&ucontext->rule_list);
rcu_read_lock();
-   ucontext->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
+   ucontext->tgid = tgid;
rcu_read_unlock();
ucontext->closing = 0;
 
@@ -383,9 +394,15 @@ err_fd:
put_unused_fd(resp.async_fd);
 
 err_free:
-   put_pid(ucontext->tgid);
ib_dev->dealloc_ucontext(ucontext);
 
+err_alloc:
+   rdmacg_uncharge_resource(ib_dev, tgid, RDMACG_RESOURCE_POOL_VERB,
+RDMA_VERB_RESOURCE_UCTX, 1);
+
+err_charge:
+   put_pid(tgid);
+
 err:
mutex_unlock(&file->mutex);
return ret;
@@ -394,7 +411,8 @@ err:
 static void copy_query_dev_fields(struct ib_uverbs_file *file,
  struct ib_device *ib_dev,
  struct ib_uverbs_query_device_resp *resp,
- struct ib_device_attr *attr)
+ struct ib_device_attr *attr,
+ int *limits)
 {
resp->fw_ver= attr->fw_ver;
resp->node_guid = ib_dev->node_guid;
@@ -405,14 +423,19 @@ static void copy_query_dev_fields(struct ib_uverbs_file 
*file,
resp->vendor_part_id= attr->vendor_part_id;
resp->hw_ver= attr->hw_ver;
resp->max_qp= attr->max_qp;
+   resp->max_qp= min_t(int, attr->max_qp,
+   limits[RDMA_VERB_RESOURCE_QP]);
resp->max_qp_wr = attr->max_qp_wr;
resp->device_cap_flags  = attr->device_cap_flags;
resp->max_sge   = attr->max_sge;
resp->max_sge_rd= attr->max_sge_rd;
-   resp->max_cq= attr->max_cq;
+   resp->max_cq= min_t(int, attr->max_cq,
+   limits[RDMA_VERB_RESOURCE_CQ]);
resp->max_cqe   = attr->max_cqe;
-   resp->max_mr= attr->max_mr;
-   resp->max_pd= attr->max_pd;
+   resp->max_mr= min_t(

[PATCHv1 4/6] IB/core: rdmacg support infrastructure APIs

2016-01-05 Thread Parav Pandit
It defines verb RDMA resources that will be registered with
RDMA cgroup. It defines new APIs to register device with
RDMA cgroup and defines resource token table access interface.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 drivers/infiniband/core/Makefile|  1 +
 drivers/infiniband/core/cgroup.c| 80 +
 drivers/infiniband/core/core_priv.h |  5 +++
 include/rdma/ib_verbs.h | 13 ++
 4 files changed, 99 insertions(+)
 create mode 100644 drivers/infiniband/core/cgroup.c

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index d43a899..df40cee 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,6 +13,7 @@ ib_core-y :=  packer.o ud_header.o verbs.o 
sysfs.o \
roce_gid_mgmt.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
+ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
 
 ib_mad-y :=mad.o smi.o agent.o mad_rmpp.o
 
diff --git a/drivers/infiniband/core/cgroup.c b/drivers/infiniband/core/cgroup.c
new file mode 100644
index 000..8d80add
--- /dev/null
+++ b/drivers/infiniband/core/cgroup.c
@@ -0,0 +1,80 @@
+#include 
+#include 
+#include 
+
+#include "core_priv.h"
+
+/**
+ * resource table definition as to be seen by the user.
+ * Need to add entries to it when more resources are
+ * added/defined at IB verb/core layer.
+ */
+static match_table_t resource_tokens = {
+   {RDMA_VERB_RESOURCE_UCTX, "uctx=%d"},
+   {RDMA_VERB_RESOURCE_AH, "ah=%d"},
+   {RDMA_VERB_RESOURCE_PD, "pd=%d"},
+   {RDMA_VERB_RESOURCE_CQ, "cq=%d"},
+   {RDMA_VERB_RESOURCE_MR, "mr=%d"},
+   {RDMA_VERB_RESOURCE_MW, "mw=%d"},
+   {RDMA_VERB_RESOURCE_SRQ, "srq=%d"},
+   {RDMA_VERB_RESOURCE_QP, "qp=%d"},
+   {RDMA_VERB_RESOURCE_FLOW, "flow=%d"},
+   {-1, NULL}
+};
+
+/**
+ * setup table pointers for RDMA cgroup to access.
+ */
+static struct rdmacg_pool_info verbs_token_info = {
+   .resource_table = resource_tokens,
+   .resource_count =
+   (sizeof(resource_tokens) / sizeof(struct match_token)) - 1,
+};
+
+static struct rdmacg_pool_info*
+   rdmacg_get_resource_pool_tokens(struct ib_device *device)
+{
+   return &verbs_token_info;
+}
+
+static struct rdmacg_resource_pool_ops verbs_pool_ops = {
+   .get_resource_pool_tokens = &rdmacg_get_resource_pool_tokens,
+};
+
+/**
+ * ib_device_register_rdmacg - register with rdma cgroup.
+ * @device: device to register to participate in resource
+ *  accounting by rdma cgroup.
+ *
+ * Register with the rdma cgroup. Should be called before
+ * exposing rdma device to user space applications to avoid
+ * resource accounting leak.
+ * HCA drivers should set resource pool ops first if they wish
+ * to support hw specific resource accounting before IB core
+ * registers with rdma cgroup.
+ */
+void ib_device_register_rdmacg(struct ib_device *device)
+{
+   rdmacg_set_rpool_ops(device,
+RDMACG_RESOURCE_POOL_VERB,
+&verbs_pool_ops);
+   rdmacg_register_ib_device(device);
+}
+
+/**
+ * ib_device_unregister_rdmacg - unregister with rdma cgroup.
+ * @device: device to unregister.
+ *
+ * Unregister with the rdma cgroup. Should be called after
+ * all the resources are deallocated, and after a stage when any
+ * other resource allocation of user application cannot be done
+ * for this device to avoid any leak in accounting.
+ * HCA drivers should clear resource pool ops after ib stack
+ * unregisters with rdma cgroup.
+ */
+void ib_device_unregister_rdmacg(struct ib_device *device)
+{
+   rdmacg_unregister_ib_device(device);
+   rdmacg_clear_rpool_ops(device,
+  RDMACG_RESOURCE_POOL_VERB);
+}
diff --git a/drivers/infiniband/core/core_priv.h 
b/drivers/infiniband/core/core_priv.h
index 5cf6eb7..29bdfe2 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -92,4 +92,9 @@ int ib_cache_setup_one(struct ib_device *device);
 void ib_cache_cleanup_one(struct ib_device *device);
 void ib_cache_release_one(struct ib_device *device);
 
+#ifdef CONFIG_CGROUP_RDMA
+void ib_device_register_rdmacg(struct ib_device *device);
+void ib_device_unregister_rdmacg(struct ib_device *device);
+#endif
+
 #endif /* _CORE_PRIV_H */
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 1a17249..f44b884 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -96,6 +96,19 @@ enum rdma_protocol_type {
RDMA_PROTOCOL_USNIC_UDP
 };
 
+enum rdma_resource_type {
+   RDMA_VERB_RESOURCE_UCTX,
+   RDMA_VERB_RESOURCE_AH,
+   RDMA_VERB_RESOURCE_PD,
+   RDMA_VERB_RESOURCE_CQ,
+   RDMA_VERB

[PATCHv1 6/6] rdmacg: Added documentation for rdma controller.

2016-01-05 Thread Parav Pandit
Added documentation for the rdma controller, for use in legacy mode and
with the new unified hierarchy.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 Documentation/cgroup-legacy/rdma.txt | 129 +++
 Documentation/cgroup.txt |  79 +
 2 files changed, 208 insertions(+)
 create mode 100644 Documentation/cgroup-legacy/rdma.txt

diff --git a/Documentation/cgroup-legacy/rdma.txt 
b/Documentation/cgroup-legacy/rdma.txt
new file mode 100644
index 000..70626c5
--- /dev/null
+++ b/Documentation/cgroup-legacy/rdma.txt
@@ -0,0 +1,129 @@
+   RDMA Resource Controller
+   
+
+Contents
+
+
+1. Overview
+  1-1. What is RDMA resource controller?
+  1-2. Why is RDMA resource controller needed?
+  1-3. How is RDMA resource controller implemented?
+2. Usage Examples
+
+1. Overview
+
+1-1. What is RDMA resource controller?
+-
+
+RDMA resource controller allows user to limit RDMA/IB specific resources
+that a given set of processes can use. These processes are grouped using
+RDMA resource controller.
+
+RDMA resource controller currently allows two different types of resource
+pools.
+(a) RDMA IB specification level verb resources defined by IB stack
+(b) HCA vendor device specific resources
+
+RDMA resource controller controller allows maximum of upto 64 resources in
+a resource pool which is the internal construct of rdma cgroup explained
+at later part of this document.
+
+1-2. Why is RDMA resource controller needed?
+
+
+Currently user space applications can easily take away all the rdma device
+specific resources such as AH, CQ, QP, MR etc. Because of this, other
+applications in another cgroup or kernel space ULPs may not even get a chance
+to allocate any rdma resources. This leads to service unavailability.
+
+Therefore the RDMA resource controller is needed, through which resource
+consumption of processes can be limited. Through this controller various
+rdma resources described by the IB uverbs layer and any HCA vendor driver
+can be accounted.
+
+1-3. How is RDMA resource controller implemented?
+
+
+The rdma cgroup allows limit configuration of resources. These resources are
+not defined by the rdma controller. Instead they are defined by the IB stack
+and HCA device drivers (optionally).
+This provides great flexibility, allowing the IB stack to define new resources
+without any changes to the rdma cgroup.
+The rdma cgroup maintains resource accounting per cgroup, per device, per
+resource type using a resource pool structure. Each such resource pool is
+limited to 64 resources by the rdma cgroup, which can be extended later if
+required.
+
+This resource pool object is linked to the cgroup css. Typically there
+are 0 to 4 resource pool instances per cgroup, per device in most use cases.
+But nothing prevents having more. At present hundreds of RDMA devices per
+single cgroup may not be handled optimally, however there is no known use case
+for such a configuration either.
+
+Since RDMA resources can be allocated from any process and can be freed by any
+of the child processes which share the address space, rdma resources are
+always owned by the creator cgroup css. This allows process migration from one
+cgroup to another without the major complexity of transferring resource
+ownership, because such ownership is not really present due to the shared
+nature of rdma resources. Linking resources to the css also ensures that
+cgroups can be deleted after processes have migrated. This allows process
+migration even with active resources, though that is not the primary use case.
+
+Finally, the mapping of the resource owner pid to its cgroup is maintained
+using a simple hash table to perform quick look-up during resource
+charging/uncharging time.
+
+A resource pool object is created in the following situations.
+(a) The user sets a limit and no previous resource pool exists for the device
+of interest for the cgroup.
+(b) No resource limits were configured, but the IB/RDMA stack tries to
+charge the resource. This ensures resources are correctly uncharged when
+applications run without limits and limits are enforced later on; otherwise
+the usage count would drop to negative during uncharging. This is done using
+the default resource pool. Instead of implementing any sort of time markers,
+the default pool simplifies the design.
+
+A resource pool is destroyed if it was of default type (not created
+by an administrative operation) and it's the last resource getting
+deallocated. A resource pool created by an administrative operation is not
+deleted, as it's expected to be used in the near future.
+
+If a user setting tries to delete all the resource limits
+with active resources per device, the RDMA cgroup just marks the pool as the
+default pool with maximum limits for each resource; otherwise it deletes the
+d

Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-06 Thread Parav Pandit
On Mon, Dec 7, 2015 at 11:32 AM, Jason Gunthorpe
<jguntho...@obsidianresearch.com> wrote:
> On Thu, Dec 03, 2015 at 04:20:50PM +, Liran Liss wrote:
>> > From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>>
>> > Subject: Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to
>> > wc
>> >
>> > Bloating the WC with a field that's not really useful for the ULPs seems 
>> > pretty
>> > sad..
>>
>> You need per packet (read per-WC) network type both for handling incoming 
>> connections over QP1 and in UD QPs.
>> It looks like this patch doesn't extend the structure size due to alignment, 
>> so no real harm in any case...
>
> Why? The sgid index must tell you the network type.
>

User space might not have access to internal properties of the gid entry
the way the kernel does. I am not sure if more plumbing needs to be added in
user space to get such a property and async updates of it.
So Liran might be thinking of a unified way to report the required fields via
the wc?


Re: RFC rdma cgroup

2015-11-04 Thread Parav Pandit
On Wed, Nov 4, 2015 at 5:28 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 03/11/2015 21:11, Parav Pandit wrote:
>> So it looks like below,
>> #cat rdma.resources.verbs.list
>> Output:
>> mlx4_0 uctx ah pd cq mr mw srq qp flow
>> mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq
> What happens if you set a limit of rss_wq to mlx4_0 in this example?
> Would it fail?
Yes. In the above example, the mlx4_0 device didn't have support for rss_wq,
so it didn't advertise in the list file that it supports rss_wq.

> I think it would be simpler for administrators if they
> can configure every resource supported by uverbs. If a resource is not
> supported by a specific device, you can never go over the limit anyway.
>
Exactly. That's the implementation today.


Re: RFC rdma cgroup

2015-11-03 Thread Parav Pandit
>> Resource are defined as index and as match_table_t.
>>
>> enum rdma_resource_type {
>> RDMA_VERB_RESOURCE_UCTX,
>> RDMA_VERB_RESOURCE_AH,
>> RDMA_VERB_RESOURCE_PD,
>> RDMA_VERB_RESOURCE_CQ,
>> RDMA_VERB_RESOURCE_MR,
>> RDMA_VERB_RESOURCE_MW,
>> RDMA_VERB_RESOURCE_SRQ,
>> RDMA_VERB_RESOURCE_QP,
>> RDMA_VERB_RESOURCE_FLOW,
>> RDMA_VERB_RESOURCE_MAX,
>> };
>> So UAPI RDMA resources can evolve by just adding more entries here.
> Are the names that appear in userspace also controlled by uverbs? What
> about the vendor specific resources?

I am not sure I followed your question.
Basically any RDMA resource that is allocated through the uverbs API can be
tracked; uverbs makes the call to charge/uncharge.
There is a list file, rdma.resources.verbs.list. This file lists all the verbs
resource names of all the devices which have registered themselves with the
rdma cgroup.
Similarly there is rdma.resources.hw.list. This file lists all hw
specific resource names, which means they are defined at run time and are
potentially different for each vendor.

So it looks like below,
#cat rdma.resources.verbs.list
Output:
mlx4_0 uctx ah pd cq mr mw srq qp flow
mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq

#cat rdma.resources.hw.list
hfi1 hw_qp hw_mr sw_pd
(This particular one is a hypothetical example; I haven't actually coded
this, unlike uverbs which is real.)

 (c) When process migrate from one to other cgroup, resource is
 continue to be owned by the creator cgroup (rather css).
 After process migration, whenever new resource is created in new
 cgroup, it will be owned by new cgroup.
>>> It sounds a little different from how other cgroups behave. I agree that
>>> mostly processes will create the resources in their cgroup and won't
>>> migrate, but why not move the charge during migration?
>>>
>> With fork() process doesn't really own the resource (unlike other file
>> and socket descriptors).
>> Parent process might have died also.
>> There is possibly no clear way to transfer resource to right child.
>> Child that cgroup picks might not even want to own RDMA resources.
>> RDMA resources might be allocated by one process and freed by other
>> process (though this might not be the way they use it).
>> Its pretty similar to other cgroups with exception in migration area,
>> such exception comes from different behavior of how RDMA resources are
>> owned, created and used.
>> Recent unified hierarchy patch from Tejun equally highlights to not
>> frequently migrate processes among cgroups.
>>
>> So in current implementation, (like other),
>> if process created a RDMA resource, forked a child.
>> child and parent both can allocate and free more resources.
>> child moved to different cgroup. But resource is shared among them.
>> child can free also the resource. All crazy combinations are possible
>> in theory (without much use cases).
>> So at best they are charged to the first cgroup css in which
>> parent/child are created and reference is hold to CSS.
>> cgroup, process can die, but css remains until RDMA resources are freed.
>> This is similar to process behavior where task struct is release but
>> id is hold up for a while.
>
> I guess there aren't a lot of options when the resources can belong to
> multiple cgroups. So after migrating, new resources will belong to the
> new cgroup or the old one?
A resource always belongs to the cgroup in which it was created, regardless
of process migration.
Again, it is owned at the css level instead of the cgroup. Therefore the
original cgroup can also be deleted; an internal reference to the data
structure is held, and it is freed when the last rdma resource is freed.
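
A minimal sketch of that css-level ownership (assuming, as for other
controllers, that struct rdma_cgroup embeds its cgroup_subsys_state as a
"css" member; the helper names are made up): the charge path pins the
owning css and the pin is dropped only on the last uncharge, so the cgroup
directory itself can be removed earlier.

#include <linux/cgroup.h>

/* Illustrative only: pin/unpin the owning css for the resource lifetime. */
static void rdmacg_pin_owner(struct rdma_cgroup *cg)
{
	css_get(&cg->css);	/* survives cgroup rmdir */
}

static void rdmacg_unpin_owner(struct rdma_cgroup *cg)
{
	css_put(&cg->css);	/* css is finally freed after the last uncharge */
}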

>
> When I was talking about limiting to MAC/VLAN pairs I only meant
> limiting an RDMA device's ability to use that pair (e.g. use a GID that
> uses the specific MAC VLAN pair). I don't understand how that makes the
> RDMA cgroup any more generic than it is.
>
Oh ok. That doesn't. I meant that I wanted to limit how many vlans a
given container can create.
We have just high level capabilities (7) to enable such creation, but
not the count.

>>  or
>>> only a subset of P_Keys and GIDs it has. Do you see such limitations
>>> also as part of this cgroup?
>>>
>> At present no. Because GID, P_key resources are created from the
>> bottom up, either by stack or by network. They are kind of not tied to
>> the user processes, unlike mac, vlan, qp which are more application
>> driven or administrative driven.
> They are created from the network, after the network administrator
> configured them this way.
>
>> For applications that doesn't use RDMA-CM, query_device and query_port
>> will filter out the GID entries based on the network namespace in
>> which caller process is running.
> This could work well for RoCE, as each entry in the GID table is
> associated with a net device and a network namespace. However, in
> InfiniBand, the GID table isn't directly related to the network
> namespace. As 

Re: RFC rdma cgroup

2015-10-29 Thread Parav Pandit
Hi Haggai,

On Thu, Oct 29, 2015 at 8:27 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 28/10/2015 10:29, Parav Pandit wrote:
>> 3. Resources are not defined by the RDMA cgroup. Resources are defined
>> by RDMA/IB subsystem and optionally by HCA vendor device drivers.
>> Rationale: This allows rdma cgroup to remain constant while RDMA/IB
>> subsystem can evolve without the need of rdma cgroup update. A new
>> resource can be easily added by the RDMA/IB subsystem without touching
>> rdma cgroup.
> Resources exposed by the cgroup are basically a UAPI, so we have to be
> careful to make it stable when it evolves. I understand the need for
> vendor specific resources, following the discussion on the previous
> proposal, but could you write on how you plan to allow these set of
> resources to evolve?

It's fairly simple.
Here is the code snippet showing how resources are defined in my tree.
It doesn't have the RSS work queues yet, but they can be added right after
this patch.

Resource are defined as index and as match_table_t.

enum rdma_resource_type {
RDMA_VERB_RESOURCE_UCTX,
RDMA_VERB_RESOURCE_AH,
RDMA_VERB_RESOURCE_PD,
RDMA_VERB_RESOURCE_CQ,
RDMA_VERB_RESOURCE_MR,
RDMA_VERB_RESOURCE_MW,
RDMA_VERB_RESOURCE_SRQ,
RDMA_VERB_RESOURCE_QP,
RDMA_VERB_RESOURCE_FLOW,
RDMA_VERB_RESOURCE_MAX,
};
So UAPI RDMA resources can evolve by just adding more entries here.
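
For example, the RSS work queue mentioned above would be one more enum entry
plus one token string; nothing in the rdma cgroup core changes. The rss_wq
entry below is hypothetical and not part of the posted code.

enum rdma_resource_type {
	RDMA_VERB_RESOURCE_UCTX,
	RDMA_VERB_RESOURCE_AH,
	RDMA_VERB_RESOURCE_PD,
	RDMA_VERB_RESOURCE_CQ,
	RDMA_VERB_RESOURCE_MR,
	RDMA_VERB_RESOURCE_MW,
	RDMA_VERB_RESOURCE_SRQ,
	RDMA_VERB_RESOURCE_QP,
	RDMA_VERB_RESOURCE_FLOW,
	RDMA_VERB_RESOURCE_RSS_WQ,	/* new: RSS work queue */
	RDMA_VERB_RESOURCE_MAX,
};

/* plus one matching token in the uverbs table, just before {-1, NULL}:
 *	{RDMA_VERB_RESOURCE_RSS_WQ, "rss_wq=%d"},
 */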

>
>> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
>> each cgroup will have 0 to 4 verbs resource pool and optionally 0 to 4
>> hw resource pool per such device.
>> (Nothing stops to have more devices and pools, but design is around
>> this use case).
> In what way does the design depend on this assumption?

When the current code performs resource charging/uncharging, it needs to
identify which resource pool to charge to.
The resource pools are maintained on a list_head, so it is a linear search
per device.
If we are thinking of 100s of RDMA devices per container, then a linear
search will not be a good way and a different data structure needs to be
deployed.
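
For reference, a sketch of that per-charge lookup; the cg_rpool_head list
name is made up, while the cg_list/device/type members match the
cg_resource_pool structure in the V1 patch (patch 3/6 above).

#include <linux/list.h>

/* Illustrative: walk the cgroup's pools to find the one for this device. */
static struct cg_resource_pool *
find_cg_rpool(struct rdma_cgroup *cg, struct ib_device *device,
	      enum rdmacg_resource_pool_type type)
{
	struct cg_resource_pool *rpool;

	list_for_each_entry(rpool, &cg->cg_rpool_head, cg_list)
		if (rpool->device == device && rpool->type == type)
			return rpool;

	return NULL;	/* the charge path then creates a default pool */
}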


>
>> 9. Resource pool object is created in following situations.
>> (a) administrative operation is done to set the limit and no previous
>> resource pool exist for the device of interest for the cgroup.
>> (b) no resource limits were configured, but IB/RDMA subsystem tries to
>> charge the resource. so that when applications are running without
>> limits and later on when limits are enforced, during uncharging, it
>> correctly uncharges them, otherwise usage count will drop to negative.
>> This is done using default resource pool.
>> Instead of implementing any sort of time markers, default pool
>> simplifies the design.
> Having a default resource pool kind of implies there is a non-default
> one. Is the only difference between the default and non-default the fact
> that the second was created with an administrative operation and has
> specified limits or is there some other difference?
>
You described it correctly.

>> (c) When process migrate from one to other cgroup, resource is
>> continue to be owned by the creator cgroup (rather css).
>> After process migration, whenever new resource is created in new
>> cgroup, it will be owned by new cgroup.
> It sounds a little different from how other cgroups behave. I agree that
> mostly processes will create the resources in their cgroup and won't
> migrate, but why not move the charge during migration?
>
With fork() process doesn't really own the resource (unlike other file
and socket descriptors).
Parent process might have died also.
There is possibly no clear way to transfer resource to right child.
Child that cgroup picks might not even want to own RDMA resources.
RDMA resources might be allocated by one process and freed by other
process (though this might not be the way they use it).
Its pretty similar to other cgroups with exception in migration area,
such exception comes from different behavior of how RDMA resources are
owned, created and used.
Recent unified hierarchy patch from Tejun equally highlights to not
frequently migrate processes among cgroups.

So in current implementation, (like other),
if process created a RDMA resource, forked a child.
child and parent both can allocate and free more resources.
child moved to different cgroup. But resource is shared among them.
child can free also the resource. All crazy combinations are possible
in theory (without much use cases).
So at best they are charged to the first cgroup css in which
parent/child are created and reference is hold to CSS.
cgroup, process can die, but css remains until RDMA resources are freed.
This is similar to process behavior where the task struct is released but
the id is held up for a while.

RFC rdma cgroup

2015-10-28 Thread Parav Pandit
of the resource limit is configured, that particular resource will be
enforced, the rest will enjoy up to their maximum limit.

8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
each cgroup will have 0 to 4 verbs resource pool and optionally 0 to 4
hw resource pool per such device.
(Nothing stops to have more devices and pools, but design is around
this use case).

9. Resource pool object is created in following situations.
(a) administrative operation is done to set the limit and no previous
resource pool exist for the device of interest for the cgroup.
(b) no resource limits were configured, but IB/RDMA subsystem tries to
charge the resource. so that when applications are running without
limits and later on when limits are enforced, during uncharging, it
correctly uncharges them, otherwise usage count will drop to negative.
This is done using default resource pool.
Instead of implementing any sort of time markers, default pool
simplifies the design.
(c) When process migrate from one to other cgroup, resource is
continue to be owned by the creator cgroup (rather css).
After process migration, whenever new resource is created in new
cgroup, it will be owned by new cgroup.

10. Resource pool is destroyed if it was of default type (not created
by administrative operation) and it's the last resource getting
deallocated. Resource pool created as administrative operation is not
deleted, as it's expected to be used in near future.

13. If an administrative command tries to delete all the resource limits
with active resources per device, RDMA cgroup just marks the pool as a
default pool with maximum limits.


Examples:
#configure resource limit:
echo mlx4_0 mr=100 qp=10 ah=2 cq=10 >
/sys/fs/cgroup/rdma/1/rdma.resource.verb.limit
echo ocrdma1 mr=120 qp=20 ah=2 cq=10 >
/sys/fs/cgroup/rdma/2/rdma.resource.verb.limit

#query resource limit:
cat /sys/fs/cgroup/rdma/2/rdma.resource.verb.limit
#output:
mlx4_0 mr=100 qp=10 ah=2 cq=10
ocrdma1 mr=120 qp=20 cq=10

#delete resource limit:
echo mlx4_0 del > /sys/fs/cgroup/rdma/1/rdma.resource.verb.limit

#query resource list:
cat /sys/fs/cgroup/rdma/1/rdma.resource.verb.list
mlx4_0 mr qp ah pd cq

cat /sys/fs/cgroup/rdma/1/rdma.hw.verb.list
vendor1 hw_qp hw_cq hw_timer

#configure hw specific resource limit
echo vendor1 hw_qp=56 > /sys/fs/cgroup/rdma/2/rdma.resource.hw.limit

-

I have completed initial development of above design. I am currently
testing this design.
I will post the patch soon once I am done validating it.

Let me know if there are any design comments.

Regards,
Parav Pandit


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-10-28 Thread Parav Pandit
Hi,

I finally got some time and made progress on redesigning the rdma cgroup
controller for most of the use cases that we discussed in this email
chain.
I am posting the RFC, and soon the code, in a new email.

Parav


On Sun, Sep 20, 2015 at 4:05 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 15/09/2015 06:45, Jason Gunthorpe wrote:
>> No, I'm saying the resource pool is *well defined* and *fixed* by each
>> hardware.
>>
>> The only question is how do we expose the N resource limits, the list
>> of which is totally vendor specific.
>
> I don't see why you say the limits are vendor specific. It is true that
> different RDMA devices have different implementations and capabilities,
> but they all use the expose the same set of RDMA objects with their
> limitations. Whether those limitations come from hardware limitations,
> from the driver, or just because the address space is limited, they can
> still be exhausted.
>
>> Yes, using a % scheme fixes the ratios, 1% is going to be a certain
>> number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver
>> configuration. That is the trade off for API simplicity.
>>
>>
>> Yes, this results in some resources being over provisioned.
>
> I agree that such a scheme will be easy to configure, but I don't think
> it can work well in all situations. Imagine you want to let one
> container use almost all RC QPs as you want it to connect to the entire
> cluster through RC. Other containers can still use a single datagram QP
> to connect to the entire cluster, but they would require many address
> handles. If you force a fixed ratio of resources given to each container
> it would be hard to describe such a partitioning.
>
> I think it would be better to expose different controls for the
> different RDMA resources.
>
> Regards,
> Haggai


Re: [PATCH for-next 2/7] IB: Introduce Work Queue object and its verbs

2015-10-18 Thread Parav Pandit
On Sun, Oct 18, 2015 at 8:38 PM, Yishai Hadas
<yish...@dev.mellanox.co.il> wrote:
> On 10/15/2015 7:49 PM, Parav Pandit wrote:
>
>> If there is stateless WQ being used by multiple QPs in multiplexed
>
>
> The WQ is not stateless and always has its own PD.
>
>> way, it should be able to multiplex between QP's of different PD as
>> well.
>> Otherwise for every PD being created, there will have to be one WQ needed
>> to service all the QPs belonging to that PD.
>
>
> As mentioned, same WQ can serve multiple QPs, from PD point of view it
> behaves similarly to SRQ that may be associated with many QPs with different
> PDs.
>
> See IB SPEC, Release 1.3, o10-2.2.1:
> "SRQ may be associated with the same PD as used by one or more of its
> associated QPs or a different PD."
>
> As part of coming V1 will improve the commit message to better clarify the
> WQ's PD behavior, thanks.

Ok. Got it. Thanks.


Re: [PATCH for-next 2/7] IB: Introduce Work Queue object and its verbs

2015-10-15 Thread Parav Pandit
On Thu, Oct 15, 2015 at 7:42 PM, Yishai Hadas
<yish...@dev.mellanox.co.il> wrote:
> On 10/15/2015 12:13 PM, Parav Pandit wrote:
>>
>> Just curious, why does WQ need to be bind to PD?
>> Isn't ucontext sufficient?
>> Or because kcontext doesn't exist, PD serves that role?
>> Or Is this just manifestation of how hardware behave?
>
>
> PD is an attribute of a work queue (i.e. send/receive queue), it's used by
> the hardware for security validation before scattering to a memory region.
> For that, an external WQ object needs a PD, letting the
> hardware makes that validation.
>
>> Since you mentioned, "QP can be configured to use "external" WQ
>> object", it might be worth to reuse the WQ across multiple QPs of
>> different PD?
>
>
> Correct, external WQ can be used across multiple QPs, in that case its PD is
> used by the hardware for security validation when it accesses to the MR, in
> that case the QP's PD is not in use.
>
I think I get it, just confirming with below example.

So I think below is possible.
WQ_A having PD=1.
QP_A having PD=2 bound to WQ_A.
QP_B having PD=3 bound to WQ_A.
MR_X having PD=2.
And checks are done between MR and QP.

In other use case,
MR is not at all used. (only physical addresses are used)
WQ_A having PD=1.
QP_A having PD=2 bound to WQ_A.
QP_B having PD=3 bound to WQ_A.

WQ entries fail as no MR is associated and the QPs are bound to a different
PD than the PD of WQ_A.
Because at the time the QP is bound to the WQ, it's unknown whether it will
use an MR or not in the WQE at run time.
Right?


>> Because MR and QP validation check has to happen among MR and actual
>> QP and might not require that check against WQ.
>
>
> No, in that case of an external WQ its PD is used and the QP's PD is not in
> use.
>


Re: [PATCH for-next 2/7] IB: Introduce Work Queue object and its verbs

2015-10-15 Thread Parav Pandit
On Thu, Oct 15, 2015 at 9:55 PM, Yishai Hadas
<yish...@dev.mellanox.co.il> wrote:
> On 10/15/2015 6:17 PM, Parav Pandit wrote:
>>
>> On Thu, Oct 15, 2015 at 7:42 PM, Yishai Hadas
>> <yish...@dev.mellanox.co.il> wrote:
>>>
>>> On 10/15/2015 12:13 PM, Parav Pandit wrote:
>>>>
>>>>
>>>> Just curious, why does WQ need to be bind to PD?
>>>> Isn't ucontext sufficient?
>>>> Or because kcontext doesn't exist, PD serves that role?
>>>> Or Is this just manifestation of how hardware behave?
>>>
>>>
>>>
>>> PD is an attribute of a work queue (i.e. send/receive queue), it's used
>>> by
>>> the hardware for security validation before scattering to a memory
>>> region.
>>> For that, an external WQ object needs a PD, letting the
>>> hardware makes that validation.
>>>
>>>> Since you mentioned, "QP can be configured to use "external" WQ
>>>> object", it might be worth to reuse the WQ across multiple QPs of
>>>> different PD?
>>>
>>>
>>>
>>> Correct, external WQ can be used across multiple QPs, in that case its PD
>>> is
>>> used by the hardware for security validation when it accesses to the MR,
>>> in
>>> that case the QP's PD is not in use.
>>>
>> I think I get it, just confirming with below example.
>
>
> .
>
>> So I think below is possible.
>> WQ_A having PD=1.
>> QP_A having PD=2 bound to WQ_A.
>> QP_B having PD=3 bound to WQ_A.
>> MR_X having PD=2.
>> And checks are done between MR and QP.
>
> No, please follow above description, in that case PD=1 of WQ_A is used for
> the checks.
>
This appears to me to be a manifestation of the hardware implementation
surfacing at the verbs layer.
There may be nothing wrong with it, but it is worth knowing how to
actually do the verbs programming.

If a stateless WQ is being used by multiple QPs in a multiplexed way,
it should be able to multiplex between QPs of different PDs as well.
Otherwise, for every PD that is created, one WQ will be needed to
service all the QPs belonging to that PD.

>> In other use case,
>> MR is not at all used. (only physical addresses are used)
>> WQ_A having PD=1.
>> QP_A having PD=2 bound to WQ_A.
>> QP_B having PD=3 bound to WQ_A.
>>
>> WQ entries fail as MR is not associated and QP are bound to different
>> PD than the PD of WQ_A.
>> Because at QP bound time with WQ, its unknown whether it will use MR
>> or not in the WQE at run time.
>> Right?
>
>
> In case there is MR for physical addresses it has a PD and the WQ's PD is
> used, in case there is no MR the PD is not applicable.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-15 Thread Parav Pandit
Hi Jason, Sean, Tejun,

I am in the process of defining a new approach and design for the new
RDMA cgroup, based on the feedback given here by all of you.
I have also collected feedback from Liran and the ORNL folks yesterday.

I will soon post the new approach, high-level APIs and functionality
for review before submitting the actual implementation.

Regards,
Parav Pandit

On Tue, Sep 15, 2015 at 9:15 AM, Jason Gunthorpe
<jguntho...@obsidianresearch.com> wrote:
> On Tue, Sep 15, 2015 at 08:38:54AM +0530, Parav Pandit wrote:
>
>> As you precisely described, about wild ratio,
>> we are asking vendor driver (bottom most layer) to statically define
>> what the resource pool is, without telling him which application are
>> we going to run to use those pool.
>> Therefore vendor layer cannot ever define "right" resource pool.
>
> No, I'm saying the resource pool is *well defined* and *fixed* by each
> hardware.
>
> The only question is how do we expose the N resource limits, the list
> of which is totally vendor specific.
>


>> rdma cgroup will allow us to run post 512 or 1024 containers without
>> using PCIe SR-IOV, without creating any vendor specific resource
>> pools.
>
> If you ignore any vendor specific resource limits then you've just
> left open a hole, a wayward container can exhaust all others - so what
> was the point of doing all this work?
>
> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-14 Thread Parav Pandit
On Sat, Sep 12, 2015 at 1:36 AM, Hefty, Sean <sean.he...@intel.com> wrote:
>> > Trying to limit the number of QPs that an app can allocate,
>> > therefore, just limits how much of the address space an app can use.
>> > There's no clear link between QP limits and HW resource limits,
>> > unless you assume a very specific underlying implementation.
>>
>> Isn't that the point though? We have several vendors with hardware
>> that does impose hard limits on specific resources. There is no way to
>> avoid that, and ultimately, those exact HW resources need to be
>> limited.
>
> My point is that limiting the number of QPs that an app can allocate doesn't 
> necessarily mean anything.  Is allocating 1000 QPs with 1 entry each better 
> or worse than 1 QP with 10,000 entries?  Who knows?

I think it means that, if it is an RDMA RC QP, whether you can talk to
1000 nodes or 1 node in the network.
When we deploy an MPI application, it knows the rank of the
application, we know the cluster size of the deployment, and based on
that the resource allocation can be done.
If you meant it from a performance point of view, then resource count
is possibly not the right measure.

Just because we have not defined those interfaces for performance
today in this patch set doesn't mean that we won't do it.
I could easily see number_of_messages/sec as one interface to be added
in the future.
But that won't stop process hoarders from taking away all the QPs,
just the way we needed the PID controller.

Now, when it comes to the Intel implementation, with new APIs in the
future the driver layer could know whether 10 or 100 user QPs should
map to fewer or more hw-QPs (uSNIC), so that the hw-QPs exposed to one
cgroup are isolated from the hw-QPs exposed to another cgroup.
If the hw implementation doesn't require isolation, it could just
continue allocating from a single pool; it is left to the vendor
implementation how to use this information (this API is not present in
the patch).

So the cgroup also provides a control point for the vendor layer to
tune internal resource allocation based on the provided metrics, which
cannot be done by just providing "memory usage by RDMA structures".

If I have to compare it with other cgroup knobs, a low level
individual knob by itself doesn't serve any meaningful purpose either.
Just defining how much CPU to use or how much memory to use cannot
define the application performance either.
I am not sure whether the io controller can achieve 10 million IOPS by
defining a single CPU and 64KB of memory; all the knobs need to be set
in the right way to reach the desired number.

Along similar lines, the RDMA resource knobs as individual knobs are
not a definition of performance; each is just another knob.

>
>> If we want to talk about abstraction, then I'd suggest something very
>> general and simple - two limits:
>>  '% of the RDMA hardware resource pool' (per device or per ep?)
>>  'bytes of kernel memory for RDMA structures' (all devices)
>
> Yes - this makes more sense to me.
>

Sean, Jason,
Help me to understand this scheme.

1. How is a % of a resource different from an absolute number? With
the rest of the cgroup subsystems we define absolute numbers in most
places, to my knowledge.
Such as (a) number_of_tcp_bytes, (b) IOPS of a block device, (c) cpu
cycles, etc.
20% of QPs = 20 QPs when the hw has 100 QPs.
I prefer to keep the resource scheme consistent with other resource
control points - i.e. an absolute number.

2. Bytes of kernel memory for RDMA structures
One vendor's QP might consume X bytes and another's Y bytes. How does
the application know how much memory to give?
An application can allocate 100 QPs each 1 entry deep, or 1 QP 100
entries deep, as in Sean's example.
Both might consume almost the same memory.
An application doing 100 QP allocations, still within the cgroup's
memory limit, leaves other applications without any QP.
I don't see the point of a memory footprint based scheme, as memory
limits are well addressed by the much smarter memory controller anyway.

I do agree with Tejun and Sean on the point that the abstraction level
for using RDMA has to be different, and that's why libfabrics and
other interfaces are emerging, which will take their own time to get
stabilized and integrated.

As long as the pure IB style RDMA programming model exists - based on
an RDMA resource scheme - I think the control point also has to be on
those resources.
Once a stable abstraction level is on the table (possibly across
fabrics, not just RDMA), then the right resource controller can be
implemented.
Even when an RDMA abstraction layer arrives, as Jason mentioned, in
the end it would consume some hw resources anyway, and those need to
be controlled too.

Jason,
If the hardware vendor defines the resource pool without saying whether
the resource is a QP or an MR, how would the management/control point
actually decide what should be controlled to what limit?
We would need an additional user space library component to decode it;
after that it needs to be abstracted out as QP or MR so that it can be
dealt with in a vendor agnostic way at the application layer,
and then it would look 

Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-14 Thread Parav Pandit
Hi Tejun,

I missed acknowledging your point that we need both - a hard limit and
a soft limit/weight. The current patchset is based only on hard limits.
I see the weight as another helpful layer in the chain that we can
implement after this as an incremental step, which keeps review and
debugging manageable.

Parav



On Mon, Sep 14, 2015 at 4:39 PM, Parav Pandit <pandit.pa...@gmail.com> wrote:
> On Sat, Sep 12, 2015 at 1:36 AM, Hefty, Sean <sean.he...@intel.com> wrote:
>>> > Trying to limit the number of QPs that an app can allocate,
>>> > therefore, just limits how much of the address space an app can use.
>>> > There's no clear link between QP limits and HW resource limits,
>>> > unless you assume a very specific underlying implementation.
>>>
>>> Isn't that the point though? We have several vendors with hardware
>>> that does impose hard limits on specific resources. There is no way to
>>> avoid that, and ultimately, those exact HW resources need to be
>>> limited.
>>
>> My point is that limiting the number of QPs that an app can allocate doesn't 
>> necessarily mean anything.  Is allocating 1000 QPs with 1 entry each better 
>> or worse than 1 QP with 10,000 entries?  Who knows?
>
> I think it means if its RDMA RC QP, than whether you can talk to 1000
> nodes or 1 node in network.
> When we deploy MPI application, it know the rank of the application,
> we know the cluster size of the deployment and based on that resource
> allocation can be done.
> If you meant to say from performance point of view, than resource
> count is possibly not the right measure.
>
> Just because we have not defined those interface for performance today
> in this patch set, doesn't mean that we won't do it.
> I could easily see a number_of_messages/sec as one interface to be
> added in future.
> But that won't stop process hoarders to stop taking away all the QPs,
> just the way we needed PID controller.
>
> Now when it comes to Intel implementation, if it driver layer knows
> (in future we new APIs) that whether 10 or 100 user QPs should map to
> few hw-QPs or more hw-QPs (uSNIC).
> so that hw-QP exposed to one cgroup is isolated from hw-QP exposed to
> other cgroup.
> If hw- implementation doesn't require isolation, it could just
> continue from single pool, its left to the vendor implementation on
> how to use this information (this API is not present in the patch).
>
> So cgroup can also provides a control point for vendor layer to tune
> internal resource allocation based on provided matrix, which cannot be
> done by just providing "memory usage by RDMA structures".
>
> If I have to compare it with other cgroup knobs, low level individual
> knobs by itself, doesn't serve any meaningful purpose either.
> Just by defined how much CPU to use or how much memory to use, it
> cannot define the application performance either.
> I am not sure, whether iocontroller can achieve 10 million IOPs by
> defining single CPU and 64KB of memory.
> all the knobs needs to be set in right way to reach desired number.
>
> In similar line RDMA resource knobs as individual knobs are not
> definition of performance, its just another knob.
>
>>
>>> If we want to talk about abstraction, then I'd suggest something very
>>> general and simple - two limits:
>>>  '% of the RDMA hardware resource pool' (per device or per ep?)
>>>  'bytes of kernel memory for RDMA structures' (all devices)
>>
>> Yes - this makes more sense to me.
>>
>
> Sean, Jason,
> Help me to understand this scheme.
>
> 1. How does the % of resource, is different than absolute number? With
> rest of the cgroups systems we define absolute number at most places
> to my knowledge.
> Such as (a) number_of_tcp_bytes, (b) IOPs of block device, (c) cpu cycles etc.
> 20% of QP = 20 QPs when 100 QPs are with hw.
> I prefer to keep the resource scheme consistent with other resource
> control points - i.e. absolute number.
>
> 2. bytes of  kernel memory for RDMA structures
> One QP of one vendor might consume X bytes and other Y bytes. How does
> the application knows how much memory to give.
> application can allocate 100 QP of each 1 entry deep or 1 QP of 100
> entries deep as in Sean's example.
> Both might consume almost same memory.
> Application doing 100 QP allocation, still within limit of memory of
> cgroup leaves other applications without any QP.
> I don't see a point of memory footprint based scheme, as memory limits
> are well addressed by more smarter memory controller anyway.
>
> I do agree with Tejun, Sean on the point that abstraction level has to
> be different for using RDMA and thats why libfabrics

Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-14 Thread Parav Pandit
> Because actual hardware resources *ARE* the limit. We cannot abstract
> it away. The hardware/driver has real, fixed, immutable limits. No API
> abstraction can possibly change that.
>
> The limits are such there *IS NO* API boundary that can bundle them
> into something simpler. There will always be apps that require wildly
> different ratios of the basic verbs resources (PD/QP/CQ/AH/MR)
>
> Either we control each and every vendor's limited resource directly
> (which is where you started), or we just roll them up into a 'all
> resource' bundle and control them indirectly. There just isn't a
> mythical third 'better API' choice with the hardware we have today.
>

As you precisely described about the wild ratios,
we are asking the vendor driver (the bottom-most layer) to statically
define what the resource pool is, without telling it which applications
we are going to run against that pool.
Therefore the vendor layer cannot ever define the "right" resource pool.

If we try to fix defining the "right" resource pool, we will have to
come up with an API to modify/tune individual elements of the pool.
Once we bring in that complexity, it becomes what is proposed in this
patchset.

Instead of bringing in such a complex solution, which affects all the
layers and solves the same problem as this patch,
it is better to keep the definition of a "bundle" in the user
library/application deployment engine,
where a bundle is a set of those resources.

Maybe instead of having individual files for each resource at the user
interface level, we can have an rdma.bundle file.
This bundle cgroup file defines these resources, such as:
"ah 100
mr 100
qp 10"

> So? I don't think it is really important to have an exact, precise,
> limit. The HW pools are pretty big, unless you plan to run tens of
> thousands of containers eacg with tiny RDMA limits, it is fine to talk
> in broader terms (ie 10% of all HW limited resource) which is totally
> adaquate to hard-prevent run away or exhaustion scenarios.
>

The rdma cgroup will allow us to run more than 512 or 1024 containers
without using PCIe SR-IOV and without creating any vendor specific
resource pools.


> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-14 Thread Parav Pandit
On Mon, Sep 14, 2015 at 10:58 PM, Jason Gunthorpe
<jguntho...@obsidianresearch.com> wrote:
> On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:
>
>> 1. How does the % of resource, is different than absolute number? With
>> rest of the cgroups systems we define absolute number at most places
>> to my knowledge.
>
> There isn't really much choice if the abstraction is a bundle of all
> resources. You can't use an absolute number unless every possible
> hardware limited resource is defined, which doesn't seem smart to me
> either.

An absolute number or a percentage is a representation of a given
property. That property needs a definition, doesn't it?
How can we say "give a certain amount of some undefined resource",
which the user doesn't know how to administer or configure?
It has to be a quantifiable entity.

> It is not abstract enough, and doesn't match our universe of
> hardware very well.
>
Why does the user need to know the actual hardware resource limits or
define hardware based resources?

RDMA verbs is the abstraction point.
We could well define
(a) how many RDMA connections are allowed, instead of QPs, CQs or AHs, or
(b) how many data transfer buffers to use.

The fact is that we have so many mid layers which use these resources
differently that the above abstraction does not fit the bill.
But we do know how the mid layers operate and how they use the RDMA
resources.
So if we deploy an MPI application for a given cluster of containers,
we can accurately configure the RDMA resources, can't we?

Another example would be: if we want only 50% of the resources to be
given to all containers and the remaining 50% to kernel consumers such
as NFS, all containers can reside in a single rdma cgroup constrained
to the given limits.


>> 2. bytes of  kernel memory for RDMA structures
>> One QP of one vendor might consume X bytes and other Y bytes. How does
>> the application knows how much memory to give.
>
> I don't see this distinction being useful at such a fine granularity
> where the control side needs to distinguish between 1 and 2 QPs.
>
> The majority use for control groups has been along with containers to
> prevent a container for exhausting resources in a way that impacts
> another.
>
Right. That's the intention.

> In that use model limiting each container to N MB of kernel memory
> makes it straightforward to reason about resource exhaustion in a
> multi-tennant environment. We have other controllers that do this,
> just more indirectly (ie limiting the number of inotifies, or the
> number of fds indirectly cap kernel memory consumption)
>
> ie Presumably some fairly small limitation like 10MB is enough for
> most non-MPI jobs.

A container application can always write a simple for loop to take
away the majority of the QPs while staying within a 10MB limit.
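
A minimal illustration of that loop (a sketch only; error handling
omitted): every queue is 1 entry deep, so kernel memory barely moves
while the device's QP pool is drained.

/* Sketch only: drain the device's QP pool while staying well inside a
 * small kernel memory budget by keeping every queue 1 entry deep. */
#include <infiniband/verbs.h>

static int hoard_qps(struct ibv_context *ctx)
{
	struct ibv_pd *pd = ibv_alloc_pd(ctx);
	struct ibv_cq *cq = ibv_create_cq(ctx, 1, NULL, NULL, 0);
	struct ibv_qp_init_attr attr = {
		.send_cq = cq,
		.recv_cq = cq,
		.cap = { .max_send_wr = 1, .max_recv_wr = 1,
			 .max_send_sge = 1, .max_recv_sge = 1 },
		.qp_type = IBV_QPT_RC,
	};
	int count = 0;

	while (ibv_create_qp(pd, &attr))	/* never destroyed: hoarded */
		count++;
	return count;		/* QPs no longer available to anyone else */
}
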
>
>> Application doing 100 QP allocation, still within limit of memory of
>> cgroup leaves other applications without any QP.
>
> No, if the HW has a fixed QP pool then it would hit #1 above. Both are
> active at once. For example you'd say a container cannot use more than
> 10% of the device's hardware resources, or more than 10MB of kernel
> memory.
>
Right, we need to define this resource pool, right?
Why can it not be the verbs abstraction?
How many resources are really used to implement the verbs layer is left
to the hardware vendor.
An abstract pool just adds confusion instead of clarity.

Imagine that instead of tcp_bytes or kmem bytes it were "some memory
resource"; how would someone debug/tune a system with such abstract
knobs?

> If on an mlx card, you probably hit the 10% of QP resources first. If
> on an qib card there is no HW QP pool (well, almost, QPNs are always
> limited), so you'd hit the memory limit instead.
>
> In either case, we don't want to see a container able to exhaust
> either all of kernel memory or all of the HW resources to deny other
> containers.
>
> If you have a non-container use case in mind I'd be curious to hear
> it..

Containers are the prime case, but the non-container use case is
equally important.
Today an application, being a first class citizen, can take up all the
resources, and then an NFS mount will fail.
So even without containers we should be able to restrict the resources
available to a user mode app.


>
>> I don't see a point of memory footprint based scheme, as memory limits
>> are well addressed by more smarter memory controller anyway.
>
> I don't thing #1 is controlled but another controller. This is long
> lived kernel-side memory allocations to support RDMA resource
> allocation - we certainly have nothing in the rdma layer that is
> tracking this stuff.
>
Some drivers perform mmap() of kernel memory to user space, and some
drivers do user space page allocation and map it to the device.
Putting in tracking for all of those is just so intrusive, with changes spreading

Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-14 Thread Parav Pandit
On Sat, Sep 12, 2015 at 12:52 AM, Hefty, Sean <sean.he...@intel.com> wrote:
>> So, the existence of resource limitations is fine.  That's what we
>> deal with all the time.  The problem usually with this sort of
>> interfaces which expose implementation details to users directly is
>> that it severely limits engineering manuevering space.  You usually
>> want your users to express their intentions and a mechanism to
>> arbitrate resources to satisfy those intentions (and in a way more
>> graceful than "we can't, maybe try later?"); otherwise, implementing
>> any sort of high level resource distribution scheme becomes painful
>> and usually the only thing possible is preventing runaway disasters -
>> you don't wanna pin unused resource permanently if there actually is
>> contention around it, so usually all you can do with hard limits is
>> overcommiting limits so that it at least prevents disasters.
>
> I agree with Tejun that this proposal is at the wrong level of abstraction.
>
> If you look at just trying to limit QPs, it's not clear what that attempts to 
> accomplish.  Conceptually, a QP is little more than an addressable endpoint.  
> It may or may not map to HW resources (for Intel NICs it does not).  Even 
> when HW resources do back the QP, the hardware is limited by how many QPs can 
> realistically be active at any one time, based on how much caching is 
> available in the NIC.
>

cgroups as it stands today provides resource controls in an effective
manner for existing, well defined resources, such as cpu cycles, memory
in user and kernel space, tcp bytes, IOPS, etc.
Similarly, the RDMA programming model defines its own set of resources,
which is used by applications that access those resources directly.

What we are debating here is whether RDMA exposing hardware resources
is correct, and therefore whether a cgroup controller is needed or
not.
There are two points here.
1. Whether the RDMA programming model, which works on the resources
defined by the IB spec, is correct or not.
2. Assuming that the programming model is fine (because we have an
actively maintained IB stack in the kernel and adoption of the user
space components in the OS),
whether we need to control those resources via cgroup or not.

Tejun is trying to say that because point_1 doesn't seem to be the
right way to solve the problem, point_2 should not be done, or should
be done at a different level of abstraction.
More questions/comments are in the Jason and Sean thread.

Sean,
Even though there is no one to one mapping of verb-QP to hw-QP, in
order for the driver or lower layer to effectively map the right
verb-QP to a hw-QP, that vendor specific layer needs to know how it is
going to be used. Otherwise two applications contending for a QP may
not get the right number of hw-QPs to use.

> Trying to limit the number of QPs that an app can allocate, therefore, just 
> limits how much of the address space an app can use.  There's no clear link 
> between QP limits and HW resource limits, unless you assume a very specific 
> underlying implementation.
>
> - Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-14 Thread Parav Pandit
On Sat, Sep 12, 2015 at 12:55 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote:
>> > If you're planning on following what the existing memcg did in this
>> > area, it's unlikely to go well.  Would you mind sharing what you have
>> > on mind in the long term?  Where do you see this going?
>>
>> At least current thoughts are: central entity authority monitors fail
>> count and new threashold count.
>> Fail count - as similar to other indicates how many time resource
>> failure occured
>> threshold count - indicates upto what this resource has gone upto in
>> usage. (application might not be able to poll on thousands of such
>> resources entries).
>> So based on fail count and threshold count, it can tune it further.
>
> So, regardless of the specific resource in question, implementing
> adaptive resource distribution requires more than simple thresholds
> and failcnts.

Maybe yes, but it is difficult to go through the whole design to shape
it up right now.
This is the infrastructure being built with a few capabilities.
I see this as a starting point rather than an end point.

> The very minimum would be a way to exert reclaim
> pressure and then a way to measure how much lack of a given resource
> is affecting the workload.  Maybe it can adaptively lower the limits
> and then watch how often allocation fails but that's highly unlikely
> to be an effective measure as it can't do anything to hoarders and the
> frequency of allocation failure doesn't necessarily correlate with the
> amount of impact the workload is getting (it's not a measure of
> usage).

It can always kill the hoarding process(es) which are holding up
resources without using them.
Such processes will eventually get restarted but will not be able to
hoard as much, because they have been on the radar for hoarding and
their limits have been reduced.

>
> This is what I'm awry about.  The kernel-userland interface here is
> cut pretty low in the stack leaving most of arbitration and management
> logic in the userland, which seems to be what people wanted and that's
> fine, but then you're trying to implement an intelligent resource
> control layer which straddles across kernel and userland with those
> low level primitives which inevitably would increase the required
> interface surface as nobody has enough information.
>
We might be able to get the information as we go along.
Such an arbitration and management layer outside (instead of inside)
has more visibility into the multiple systems which are part of a
single cluster, with processes spread across cgroups in each such
system, while logic inside can manage only the processes of a single
node which are using multiple cgroups.

> Just to illustrate the point, please think of the alsa interface.  We
> expose hardware capabilities pretty much as-is leaving management and
> multiplexing to userland and there's nothing wrong with it.  It fits
> better that way; however, we don't then go try to implement cgroup
> controller for PCM channels.  To do any high-level resource
> management, you gotta do it where the said resource is actually
> managed and arbitrated.
>
> What's the allocation frequency you're expecting?  It might be better
> to just let allocations themselves go through the agent that you're
> planning.
In that case we might need to build FUSE style infrastructure.
The frequency of RDMA resource allocation is certainly lower than that
of read/write calls.

> You sure can use cgroup membership to identify who's asking
> tho.  Given how the whole thing is architectured, I'd suggest thinking
> more about how the whole thing should turn out eventually.
>
Yes, I agree.
At this point it is a software solution to provide resource isolation
in a simple manner, with scope to become adaptive in the future.

> Thanks.
>
> --
> tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Parav Pandit
> cpuset is a special case but think of cpu, memory or io controllers.
> Their resource distribution schemes are a lot more developed than
> what's proposed in this patchset and that's a necessity because nobody
> wants to cripple their machines for resource control.

The IO controller and its applications are mature in nature.
When the IO controller throttles the IO, the applications are mature
enough that, if the IO takes longer to complete, there is almost no
way to cancel the system call, or rather the application might not
want to cancel the IO, at least not the non-asynchronous kind.
So the application just notices lower performance when throttled.
It is really not possible at the RDMA level to hold up a resource
creation call for a long time, because reusing an existing resource
after a failed status is likely to give better performance.
As Doug explained in his example, many RDMA resources, as they are used
by applications, are relatively long lived. So holding up resource
creation while the resource is taken by another process will certainly
look bad on the application performance front compared to returning
failure and reusing an existing resource once it is available, or a new
one once that is available.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Parav Pandit
On Fri, Sep 11, 2015 at 10:04 PM, Tejun Heo <t...@kernel.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
>> Resource run away by application can lead to (a) kernel and (b) other
>> applications left out with no resources situation.
>
> Yeap, that this controller would be able to prevent to a reasonable
> extent.
>
>> Both the problems are the target of this patch set by accounting via cgroup.
>>
>> Performance contention can be resolved with higher level user space,
>> which will tune it.
>
> If individual applications are gonna be allowed to do that, what's to
> prevent them from jacking up their limits?
I should have been more explicit. I didn't mean that the application
which is allocating the resources is the one that controls the limits.
> So, I assume you're
> thinking of a central authority overseeing distribution and enforcing
> the policy through cgroups?
>
Exactly.



>> Threshold and fail counters are on the way in follow on patch.
>
> If you're planning on following what the existing memcg did in this
> area, it's unlikely to go well.  Would you mind sharing what you have
> on mind in the long term?  Where do you see this going?
>
At least the current thoughts are: a central authority entity monitors
the fail count and a new threshold count.
Fail count - similar to the others, indicates how many times resource
allocation failed.
Threshold count - indicates how high the usage of this resource has
gone (the application might not be able to poll on thousands of such
resource entries).
So based on the fail count and the threshold count, it can tune things
further.




> Thanks.
>
> --
> tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-11 Thread Parav Pandit
> If the resource isn't and the main goal is preventing runaway
> hogs, it'll be able to do that but is that the goal here?  For this to
> be actually useful for performance contended cases, it'd need higher
> level abstractions.
>

Resource runaway by an application can lead to a situation where (a)
the kernel and (b) other applications are left with no resources.
Both problems are targets of this patch set, through accounting via
cgroup.

Performance contention can be resolved by higher level user space,
which will tune it.
Threshold and fail counters are on the way in a follow-on patch.

> Thanks.
>
> --
> tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-10 Thread Parav Pandit
On Thu, Sep 10, 2015 at 10:19 PM, Tejun Heo <t...@kernel.org> wrote:
> Hello, Parav.
>
> On Wed, Sep 09, 2015 at 09:27:40AM +0530, Parav Pandit wrote:
>> This is one old white paper, but most of the reasoning still holds true on 
>> RDMA.
>> http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf
>
> Just read it.  Much appreciated.
>
> ...
>> These resources include are-  QP (queue pair) to transfer data, CQ
>> (Completion queue) to indicate completion of data transfer operation,
>> MR (memory region) to represent user application memory as source or
>> destination for data transfer.
>> Common resources are QP, SRQ (shared received queue), CQ, MR, AH
>> (Address handle), FLOW, PD (protection domain), user context etc.
>
> It's kinda bothering that all these are disparate resources.

Actually not. They are linked resources. Every QP needs one or two
associated CQs and one PD.
Every QP will use a few MRs for data transfer.
Here is a good programming guide to the RDMA APIs exposed to user
space applications.

http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
So the first version of the cgroups patch will address the control
operations for section 3.4.
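
To show how these linked resources hang together in practice, here is a
minimal verbs sketch of the usual allocation chain (sizes are arbitrary
and error handling is omitted); each object in it is exactly the kind of
thing the proposed controller would account.

/* Sketch only: the typical chain of linked verbs resources that the
 * cgroup controller would account - PD -> CQ -> QP -> MR. */
#include <stdlib.h>
#include <infiniband/verbs.h>

static int setup_rc_channel(struct ibv_context *ctx)
{
	struct ibv_pd *pd = ibv_alloc_pd(ctx);		 /* protection domain */
	struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
	struct ibv_qp_init_attr qp_attr = {
		.send_cq = cq,				 /* QP depends on CQ(s)... */
		.recv_cq = cq,
		.cap = { .max_send_wr = 128, .max_recv_wr = 128,
			 .max_send_sge = 1,  .max_recv_sge = 1 },
		.qp_type = IBV_QPT_RC,
	};
	struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr); /* ...and on the PD */
	void *buf = malloc(4096);
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,	 /* MR shares the PD */
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_WRITE);

	return (pd && cq && qp && mr) ? 0 : -1;
}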


> I suppose that each restriction comes from the underlying hardware and
> there's no accepted higher level abstraction for these things?
>
There is a higher level abstraction, currently through the verbs
layer, which does actually expose the hardware resources but in a
vendor agnostic way.
There are many vendors who support this verbs layer; some that I know
of are Mellanox, Intel, Chelsio and Avago/Emulex, whose drivers
supporting these verbs are in the kernel tree.

There are higher level APIs above the verbs layer, such as MPI,
libfabric, rsocket, rds, pgas and dapl, which use the underlying verbs
layer.
They all rely on the hardware resources. All of these higher level
abstractions are accepted and well used by certain application classes.
It would be a long discussion to go over them here.


>> >> This patch-set allows limiting rdma resources to set of processes.
>> >> It extend device cgroup controller for limiting rdma device limits.
>> >
>> > I don't think this belongs to devcg.  If these make sense as a set of
>> > resources to be controlled via cgroup, the right way prolly would be a
>> > separate controller.
>> >
>>
>> In past there has been similar comment to have dedicated cgroup
>> controller for RDMA instead of merging with device cgroup.
>> I am ok with both the approach, however I prefer to utilize device
>> controller instead of spinning of new controller for new devices
>> category.
>> I anticipate more such need would arise and for new device category,
>> it might not be worth to have new cgroup controller.
>> RapidIO though very less popular and upcoming PCIe are on horizon to
>> offer similar benefits as that of RDMA and in future having one
>> controller for each of them again would not be right approach.
>>
>> I certainly seek your and others inputs in this email thread here whether
>> (a) to continue to extend device cgroup (which support character,
>> block devices white list) and now RDMA devices
>> or
>> (b) to spin of new controller, if so what are the compelling reasons
>> that it can provide compare to extension.
>
> I'm doubtful that these things are gonna be mainstream w/o building up
> higher level abstractions on top and if we ever get there we won't be
> talking about MR or CQ or whatever.

Some of the higher level examples I gave above will adapt to resource
allocation failure. Some are actually already adaptive to a few
resource allocation failures; they do query the resources. But it is
not completely there yet. Once we have this notion of a limited
resource in place, the abstraction layer would adapt to relatively
smaller values of such a resource.
These higher level abstractions are mainstream. They are shipped at
least in Red Hat Enterprise Linux.

> Also, whatever next-gen is
> unlikely to have enough commonalities when the proposed resource knobs
> are this low level,

I agree that the resources won't be common in next-gen transports,
whenever they arrive.
But from my background working on some of those transports, they
appear similar in nature and they might want similar knobs.

> so let's please keep it separate, so that if/when
> this goes out of fashion for one reason or another, the controller can
> silently wither away too.
>
>> Current scope of the patch is limited to RDMA resources as first
>> patch, but for fact I am sure that there are more functionality in
>> pipe to support via this cgroup by me and others.
>> So keeping atleast these two aspects in mind, I need input on
>> d

Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-10 Thread Parav Pandit
On Fri, Sep 11, 2015 at 1:52 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello, Parav.
>
> On Thu, Sep 10, 2015 at 11:16:49PM +0530, Parav Pandit wrote:
>> >> These resources include are-  QP (queue pair) to transfer data, CQ
>> >> (Completion queue) to indicate completion of data transfer operation,
>> >> MR (memory region) to represent user application memory as source or
>> >> destination for data transfer.
>> >> Common resources are QP, SRQ (shared received queue), CQ, MR, AH
>> >> (Address handle), FLOW, PD (protection domain), user context etc.
>> >
>> > It's kinda bothering that all these are disparate resources.
>>
>> Actually not. They are linked resources. Every QP needs associated one
>> or two CQ, one PD.
>> Every QP will use few MRs for data transfer.
>
> So, if that's the case, let's please implement something higher level.
> The goal is providing reasonable isolation or protection.  If that can
> be achieved at a higher level of abstraction, please do that.
>
>> Here is the good programming guide of the RDMA APIs exposed to the
>> user space application.
>>
>> http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
>> So first version of the cgroups patch will address the control
>> operation for section 3.4.
>>
>> > I suppose that each restriction comes from the underlying hardware and
>> > there's no accepted higher level abstraction for these things?
>>
>> There is higher level abstraction which is through the verbs layer
>> currently which does actually expose the hardware resource but in
>> vendor agnostic way.
>> There are many vendors who support these verbs layer, some of them
>> which I know are Mellanox, Intel, Chelsio, Avago/Emulex whose drivers
>> which support these verbs are in  kernel tree.
>>
>> There is higher level APIs above the verb layer, such as MPI,
>> libfabric, rsocket, rds, pgas, dapl which uses underlying verbs layer.
>> They all rely on the hardware resource. All of these higher level
>> abstraction is accepted and well used by certain application class. It
>> would be long discussion to go over them here.
>
> Well, the programming interface that userland builds on top doesn't
> matter too much here but if there is a common resource abstraction
> which can be made in terms of constructs that consumers of the
> facility would care about, that likely is a better choice than
> exposing whatever hardware exposes.
>

Tejun,
The fact is that user level applications use hardware resources.
The verbs layer is the software abstraction for them. Drivers hide how
they implement the QP or CQ or whatever hardware resource they project
via the API layer.
For all of the userland on top of the verbs layer I mentioned above,
the common resource abstraction is these resources: AH, QP, CQ, MR,
etc.
The hardware (and driver) might have a different view of these
resources in their real implementation.
For example, the verbs layer can say that it has 100 QPs, but the
hardware might actually have 20 QPs which the driver decides how to
use efficiently.

>> > I'm doubtful that these things are gonna be mainstream w/o building up
>> > higher level abstractions on top and if we ever get there we won't be
>> > talking about MR or CQ or whatever.
>>
>> Some of the higher level examples I gave above will adapt to resource
>> allocation failure. Some are actually adaptive to few resource
>> allocation failure, they do query resources. But its not completely
>> there yet. Once we have this notion of limited resource in place,
>> abstraction layer would adapt to relatively smaller value of such
>> resource.
>>
>> These higher level abstraction is mainstream. Its shipped at least in
>> Redhat Enterprise Linux.
>
> Again, I was talking more about resource abstraction - e.g. something
> along the line of "I want N command buffers".
>

Yes. We are still talking about resource abstraction here.
RDMA and the IBTA define these resources. On top of these resources
various frameworks are built.
So, for example,
when userland is tuning an environment for deploying an MPI
application, it would configure:
10 processes from the PID controller,
10 CPUs in the cpuset controller,
1 PD, 20 CQs, 10 QPs, 100 MRs in the rdma controller.

Say userland is tuning an environment for deploying an rsocket
application for 100 connections; it would configure 100 PDs, 100 QPs,
200 MRs.
When the verbs layer sees failures with that, the applications will
adapt to live with what they have, at lower performance.
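
As a rough illustration of that deployment step, here is a sketch of
how a tuning/deployment engine might pin the rsocket example's limits
into an rdma cgroup interface file; the cgroup path, file name and
key=value syntax are hypothetical placeholders, not a defined
interface.

/* Sketch only: write per-device RDMA limits for one container's cgroup.
 * The path, file name and line format are hypothetical. */
#include <stdio.h>

static int set_rdma_limits(const char *cgroup_dir)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/rdma.max", cgroup_dir);
	f = fopen(path, "w");
	if (!f)
		return -1;
	/* the rsocket tuning example: 100 PDs, 100 QPs, 200 MRs */
	fprintf(f, "mlx4_0 pd=100 qp=100 mr=200\n");
	fclose(f);
	return 0;
}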

Since every higher level I mentioned differs in the way it uses RDMA
resources, we cannot generalize them as "N command buffers".
That generalizatio

Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-10 Thread Parav Pandit
On Fri, Sep 11, 2015 at 9:34 AM, Tejun Heo <t...@kernel.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote:
>> The fact is that user level application uses hardware resources.
>> Verbs layer is software abstraction for it. Drivers are hiding how
>> they implement this QP or CQ or whatever hardware resource they
>> project via API layer.
>> For all of the userland on top of verb layer I mentioned above, the
>> common resource abstraction is these resources AH, QP, CQ, MR etc.
>> Hardware (and driver) might have different view of this resource in
>> their real implementation.
>> For example, verb layer can say that it has 100 QPs, but hardware
>> might actually have 20 QPs that driver decide how to efficiently use
>> it.
>
> My uneducated suspicion is that the abstraction is just not developed
> enough.  It should be possible to virtualize these resources through,
> most likely, time-sharing to the level where userland simply says "I
> want this chunk transferred there" and OS schedules the transfer
> prioritizing competing requests.

Tejun,
That would be a perfect abstraction to have at the OS level, but I am
not sure how close it can stay to bare metal RDMA.
I have started a discussion on that front as well as part of another
thread, but it is certainly a long way to go.
Most users want to enjoy the performance benefit of the bare metal
interfaces it provides.

The kind of abstraction you mention does exist; the only difference is
that instead of the OS being the central entity, it is the higher
level libraries, drivers and hw together that do it today for the
applications.

>
> It could be that given the use cases rdma might not need such level of
> abstraction - e.g. most users want to be and are pretty close to bare
> metal, but, if that's true, it also kinda is weird to build
> hierarchical resource distribution scheme on top of such bare
> abstraction.
>
> ...
>> > I don't know.  What's proposed in this thread seems way too low level
>> > to be useful anywhere else.  Also, what if there are multiple devices?
>> > Is that a problem to worry about?
>>
>> o.k. It doesn't have to be useful anywhere else. If it suffice the
>> need of RDMA applications, its fine for near future.
>> This patch allows limiting resources across multiple devices.
>> As we go along the path, and if requirement come up to have knob on
>> per device basis, thats something we can extend in future.
>
> You kinda have to decide that upfront cuz it gets baked into the
> interface.

Well, not all the interfaces are defined yet. Except for the test and
benchmark utilities, real world applications wouldn't really bother
much about which device they are going through.
So I expect that per device level control would be nice for very
specific applications, but I don't anticipate it in the first place.
If others have a different view, I would be happy to hear it.

Even if we extend to per device control, I would expect per cgroup
control at the top level, without which there is uncontrolled access.

>
>> > I'm kinda doubtful we're gonna have too many of these.  Hardware
>> > details being exposed to userland this directly isn't common.
>>
>> Its common in RDMA applications. Again they may not be real hardware
>> resource, its just API layer which defines those RDMA constructs.
>
> It's still a very low level of abstraction which pretty much gets
> decided by what the hardware and driver decide to do.
>
>> > I'd say keep it simple and do the minimum. :)
>>
>> o.k. In that case new rdma cgroup controller which does rdma resource
>> accounting is possibly the most simplest form?
>> Make sense?
>
> So, this fits cgroup's purpose to certain level but it feels like
> we're trying to build too much on top of something which hasn't
> developed sufficiently.  I suppose it could be that this is the level
> of development that rdma is gonna reach and dumb cgroup controller can
> be useful for some use cases.  I don't know, so, yeah, let's keep it
> simple and avoid doing crazy stuff.
>

o.k. thanks. I will wait for some more time to collect more feedback.
In the absence of that,

I will send an updated patch V1 which will include:
(a) the functionality of this patch in a new rdma cgroup, as you
recommended,
(b) fixes for Haggai's comments on this patch,
(c) more fixes which I have made in the meantime.

> Thanks.
>
> --
> tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task

2015-09-08 Thread Parav Pandit
On Tue, Sep 8, 2015 at 1:54 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 08/09/2015 10:04, Parav Pandit wrote:
>> On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran <hagg...@mellanox.com> wrote:
>>> On 07/09/2015 23:38, Parav Pandit wrote:
>>>> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct 
>>>> task_struct *p)
>>>>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>>>>   * subscriptions and synchronises with wait4().  Also used in procfs.  
>>>> Also
>>>>   * pins the final release of task.io_context.  Also protects ->cpuset and
>>>> - * ->cgroup.subsys[]. And ->vfork_done.
>>>> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
>>> s/projtects/protects/
>>>>   *
>>>>   * Nests both inside and outside of read_lock(&tasklist_lock).
>>>>   * It must not be nested with write_lock_irq(&tasklist_lock),
>>>
>>
>> Hi Haggai Eran,
>> Did you miss to put comments or I missed something?
>
> Yes, I wrote "s/projtects/protects/" to tell you that you have a typo in
> your comment. You should change the word "projtects" to "protects".
>
> Haggai
>
ah. ok. Right. Will correct it.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task

2015-09-08 Thread Parav Pandit
On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct 
>> task_struct *p)
>>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>>   * pins the final release of task.io_context.  Also protects ->cpuset and
>> - * ->cgroup.subsys[]. And ->vfork_done.
>> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
> s/projtects/protects/
>>   *
>>   * Nests both inside and outside of read_lock(&tasklist_lock).
>>   * It must not be nested with write_lock_irq(&tasklist_lock),
>

Hi Haggai Eran,
Did you forget to add your comments, or did I miss something?

Parav
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.

2015-09-08 Thread Parav Pandit
On Tue, Sep 8, 2015 at 11:01 AM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
>> index 8b64221..cdbdd60 100644
>> --- a/include/linux/device_cgroup.h
>> +++ b/include/linux/device_cgroup.h
>> @@ -1,6 +1,57 @@
>> +#ifndef _DEVICE_CGROUP
>> +#define _DEVICE_CGROUP
>> +
>>  #include 
>> +#include 
>> +#include 
>
> You cannot add this include line before adding the device_rdma_cgroup.h
> (added in patch 5). You should reorder the patches so that after each
> patch the kernel builds correctly.
>
o.k. got it. I will send V1 with these suggested changes.

> I also noticed in patch 2 you add device_rdma_cgroup.o to the Makefile
> before it was added to the kernel.
>
o.k.

> Regards,
> Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Parav Pandit
On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>> +   enum devcgroup_rdma_rt type, int num)
>> +{
>> + struct dev_cgroup *dev_cg, *p;
>> + struct task_struct *ctx_task;
>> +
>> + if (!num)
>> + return;
>> +
>> + /* get cgroup of ib_ucontext it belong to, to uncharge
>> +  * so that when its called from any worker tasks or any
>> +  * other tasks to which this resource doesn't belong to,
>> +  * it can be uncharged correctly.
>> +  */
>> + if (ucontext)
>> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>> + else
>> + ctx_task = current;
> So what happens if a process creates a ucontext, forks, and then the
> child creates and destroys a CQ? If I understand correctly, created
> resources are always charged to the current process (the child), but
> when it is destroyed the owner of the ucontext (the parent) will be
> uncharged.
>
> Since ucontexts are not meant to be used by multiple processes, I think
> it would be okay to always charge the owner process (the one that
> created the ucontext).

I need to think about it. I would like to avoid keeping per task
resource counters, for two reasons.
For a while I thought that native fork() doesn't take care to share the
RDMA resources and all the CQ/QP dmaable memory from the PID namespace
perspective.

1. It could well happen that a process and its child process are
created in PID namespace_A, after which the child is migrated to a new
PID namespace_B, and after that the parent in namespace_A is
terminated. I am not sure how the ucontext ownership changes from the
parent to the child process at that point today.
I prefer to keep this complexity out, if it exists at all, as process
migration across namespaces is not a frequent event to optimize the
code for.

2. Having a per task counter (at the cost of some memory) allows us to
avoid using atomics during charge() and uncharge().

The intent was for each task (process and thread) to have its own
resource counter instance, but I can see that it is broken in that it
currently charges the parent process without atomics.
Since you say it is ok to always charge the owner process, I have to
relax the 2nd requirement and fall back to using atomics for charge()
and uncharge(), or I have to get rid of the ucontext from the
uncharge() API, which is difficult due to fput() being in a worker
thread context.
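
A rough kernel-style sketch of that direction, reusing helper names
from this patch set (not a final API): charge and uncharge always
resolve the ucontext owner and touch the tracker's atomic usage
counter, so no per-task counter or lock is needed.

/* Sketch only: uncharge against the ucontext owner with atomics,
 * dropping the per-task counter.  Helper names follow this patch set. */
static void rdmacg_owner_uncharge(struct ib_ucontext *ucontext,
				  enum devcgroup_rdma_rt type, int num)
{
	struct task_struct *owner = NULL;
	struct dev_cgroup *p;

	if (ucontext)
		owner = get_pid_task(ucontext->tgid, PIDTYPE_PID);
	if (!owner)
		owner = current;	/* fall back if the creator has exited */

	for (p = task_devcgroup(owner); p; p = parent_devcgroup(p))
		atomic_sub(num, &p->rdma.tracker[type].usage);

	if (owner != current)
		put_task_struct(owner);
}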

>
>> + dev_cg = task_devcgroup(ctx_task);
>> +
>> + spin_lock(&ctx_task->rdma_res_counter->lock);
>> + ctx_task->rdma_res_counter->usage[type] -= num;
>> +
>> + for (p = dev_cg; p; p = parent_devcgroup(p))
>> + uncharge_resource(p, type, num);
>> +
>> + spin_unlock(&ctx_task->rdma_res_counter->lock);
>> +
>> + if (type == DEVCG_RDMA_RES_TYPE_UCTX)
>> + rdma_free_res_counter(ctx_task);
>> +}
>> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup.

2015-09-08 Thread Parav Pandit
On Tue, Sep 8, 2015 at 2:10 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> +static void init_ucontext_lists(struct ib_ucontext *ucontext)
>> +{
>> + INIT_LIST_HEAD(&ucontext->pd_list);
>> + INIT_LIST_HEAD(&ucontext->mr_list);
>> + INIT_LIST_HEAD(&ucontext->mw_list);
>> + INIT_LIST_HEAD(&ucontext->cq_list);
>> + INIT_LIST_HEAD(&ucontext->qp_list);
>> + INIT_LIST_HEAD(&ucontext->srq_list);
>> + INIT_LIST_HEAD(&ucontext->ah_list);
>> + INIT_LIST_HEAD(&ucontext->xrcd_list);
>> + INIT_LIST_HEAD(&ucontext->rule_list);
>> +}
>
> I don't see how this change is related to the patch.

It is not, but the code I added makes this function grow longer, so to
keep it at the same readability level I did the cleanup.
Maybe I can send a separate patch for the cleanup?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Parav Pandit
On Tue, Sep 8, 2015 at 1:52 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> +/* RDMA resources from device cgroup perspective */
>> +enum devcgroup_rdma_rt {
>> + DEVCG_RDMA_RES_TYPE_UCTX,
>> + DEVCG_RDMA_RES_TYPE_CQ,
>> + DEVCG_RDMA_RES_TYPE_PD,
>> + DEVCG_RDMA_RES_TYPE_AH,
>> + DEVCG_RDMA_RES_TYPE_MR,
>> + DEVCG_RDMA_RES_TYPE_MW,
> I didn't see memory windows in dev_cgroup_files in patch 3. Is it used?

ib_uverbs_dereg_mr() needs a fix in my patch for MW, and alloc_mw()
also needs to use it.
I will fix it.

>> + DEVCG_RDMA_RES_TYPE_SRQ,
>> + DEVCG_RDMA_RES_TYPE_QP,
>> + DEVCG_RDMA_RES_TYPE_FLOW,
>> + DEVCG_RDMA_RES_TYPE_MAX,
>> +};
>
>> +struct devcgroup_rdma_tracker {
>> + int limit;
>> + atomic_t usage;
>> + int failcnt;
>> +};
> Have you considered using struct res_counter?

No. I will look into the structure and see if it fits or not.

>
>> + * RDMA resource limits are hierarchical, so the highest configured limit of
>> + * the hierarchy is enforced. Allowing resource limit configuration to 
>> default
>> + * cgroup allows fair share to kernel space ULPs as well.
> In what way is the highest configured limit of the hierarchy enforced? I
> would expect all the limits along the hierarchy to be enforced.
>
In a hierarchy of, say, 3 cgroups, the smallest limit in the hierarchy
is applied.

Let's take an example to clarify.
Say cg_A, cg_B, cg_C:
Role            name                       limit
Parent          cg_A                       100
Child_level1    cg_B (child of cg_A)       20
Child_level2    cg_C (child of cg_B)       50

If the process allocating the rdma resource belongs to cg_C, the lowest
limit in the hierarchy is applied during the charge() stage.
If the cg_A limit happened to be 10, then since 10 is the lowest, its
limit would be applicable, as you expected.
This is similar in functionality to the newly added PID subsystem.
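
A tiny standalone model of that charge-time walk (illustrative only,
not the patch code): a charge succeeds only if every cgroup from the
leaf up to the root stays within its own limit, so the smallest limit
along the path is what effectively governs.

/* Sketch only: a user-space model of hierarchical charging.  A charge
 * succeeds only if every cgroup from the leaf up to the root stays
 * within its own limit, so the smallest limit on the path governs. */
#include <stdio.h>

struct cg { const char *name; int limit, usage; struct cg *parent; };

static int try_charge(struct cg *leaf, int num)
{
	struct cg *p;

	for (p = leaf; p; p = p->parent)
		if (p->usage + num > p->limit)
			return -1;		/* fail: some ancestor is full */
	for (p = leaf; p; p = p->parent)
		p->usage += num;		/* commit at every level */
	return 0;
}

int main(void)
{
	struct cg a = { "cg_A", 100, 0, NULL };
	struct cg b = { "cg_B",  20, 0, &a  };
	struct cg c = { "cg_C",  50, 0, &b  };

	/* A process in cg_C asking for 30 QPs fails: cg_B allows only 20. */
	printf("charge 30 -> %d\n", try_charge(&c, 30));
	printf("charge 15 -> %d\n", try_charge(&c, 15));
	return 0;
}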

>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>> +{
>> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>> + int type = seq_cft(sf)->private;
>> + u32 usage;
>> +
>> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>> + } else {
>> + usage = dev_cg->rdma.tracker[type].limit;
> If this is the resource limit, don't name it 'usage'.
>
OK. This is a typo I made when copying from the usage show function. I will change it.

>> + seq_printf(sf, "%u\n", usage);
>> + }
>> + return 0;
>> +}
>
>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>> +{
>> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>> + int type = seq_cft(sf)->private;
>> + u32 usage;
>> +
>> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
> I'm not sure hiding the actual number is good, especially in the
> show_usage case.

This follows other controllers, the newly added PID subsystem in particular,
in showing "max" for an unlimited value.

>
>> + } else {
>> + usage = dev_cg->rdma.tracker[type].limit;
>> + seq_printf(sf, "%u\n", usage);
>> + }
>> + return 0;
>> +}
>
>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>> +   enum devcgroup_rdma_rt type, int num)
>> +{
>> + struct dev_cgroup *dev_cg, *p;
>> + struct task_struct *ctx_task;
>> +
>> + if (!num)
>> + return;
>> +
>> + /* get cgroup of ib_ucontext it belong to, to uncharge
>> +  * so that when its called from any worker tasks or any
>> +  * other tasks to which this resource doesn't belong to,
>> +  * it can be uncharged correctly.
>> +  */
>> + if (ucontext)
>> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>> + else
>> + ctx_task = current;
>> + dev_cg = task_devcgroup(ctx_task);
>> +
>> + spin_lock(&ctx_task->rdma_res_counter->lock);
> Don't you need an rcu read lock and rcu_dereference to access
> rdma_res_counter?

I believe it's not required, because uncharge() can happen only from 3 contexts:
(a) from the caller task context, which made the allocation call, so no
synchronizing is needed.
(b) from the dealloc resource context, which again is the same task
context that allocated it, so this is single threaded and no synchronization is needed.

Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Parav Pandit
On Tue, Sep 8, 2015 at 7:20 PM, Haggai Eran <hagg...@mellanox.com> wrote:
> On 08/09/2015 13:18, Parav Pandit wrote:
>>> >
>>>> >> + * RDMA resource limits are hierarchical, so the highest configured 
>>>> >> limit of
>>>> >> + * the hierarchy is enforced. Allowing resource limit configuration to 
>>>> >> default
>>>> >> + * cgroup allows fair share to kernel space ULPs as well.
>>> > In what way is the highest configured limit of the hierarchy enforced? I
>>> > would expect all the limits along the hierarchy to be enforced.
>>> >
>> In  hierarchy, of say 3 cgroups, the smallest limit of the cgroup is applied.
>>
>> Lets take example to clarify.
>> Say cg_A, cg_B, cg_C
>> Role  name   limit
>> Parent   cg_A   100
>> Child_level1  cg_B (child of cg_A)20
>> Child_level2: cg_C (child of cg_B)50
>>
>> If the process allocating rdma resource belongs to cg_C, limit lowest
>> limit in the hierarchy is applied during charge() stage.
>> If cg_A limit happens to be 10, since 10 is lowest, its limit would be
>> applicable as you expected.
>
> Looking at the code, the usage in every level is charged. This is what I
> would expect. I just think the comment is a bit misleading.
>
>>>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>>>> +{
>>>> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>>>> + int type = seq_cft(sf)->private;
>>>> + u32 usage;
>>>> +
>>>> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>>>> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>>> I'm not sure hiding the actual number is good, especially in the
>>> show_usage case.
>>
>> This is similar to following other controller same as newly added PID
>> subsystem in showing max limit.
>
> Okay.
>
>>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>>> +   enum devcgroup_rdma_rt type, int num)
>>>> +{
>>>> + struct dev_cgroup *dev_cg, *p;
>>>> + struct task_struct *ctx_task;
>>>> +
>>>> + if (!num)
>>>> + return;
>>>> +
>>>> + /* get cgroup of ib_ucontext it belong to, to uncharge
>>>> +  * so that when its called from any worker tasks or any
>>>> +  * other tasks to which this resource doesn't belong to,
>>>> +  * it can be uncharged correctly.
>>>> +  */
>>>> + if (ucontext)
>>>> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>>> + else
>>>> + ctx_task = current;
>>>> + dev_cg = task_devcgroup(ctx_task);
>>>> +
>>>> + spin_lock(&ctx_task->rdma_res_counter->lock);
>>> Don't you need an rcu read lock and rcu_dereference to access
>>> rdma_res_counter?
>>
>> I believe, its not required because when uncharge() is happening, it
>> can happen only from 3 contexts.
>> (a) from the caller task context, who has made allocation call, so no
>> synchronizing needed.
>> (b) from the dealloc resource context, again this is from the same
>> task context which allocated, it so this is single threaded, no need
>> to syncronize.
> I don't think it is true. You can access uverbs from multiple threads.
Yes, that's right. Though I designed the counter structure allocation on a per-task
basis for individual thread access, I totally missed ucontext
sharing among threads. I replied in the other thread that the counters used
during charge/uncharge should become atomic to cover that case.
Therefore I need the rcu lock and rcu_dereference() as well.

> What may help your case here I think is the fact that only when the last
> ucontext is released you can change the rdma_res_counter field, and
> ucontext release takes the ib_uverbs_file->mutex.
>
> Still, I think it would be best to use rcu_dereference(), if only for
> documentation and sparse.

yes.

>
>> (c) from the fput() context when process is terminated abruptly or as
>> part of differed cleanup, when this is happening there cannot be
>> allocator task anyway.
>
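
To summarize the direction agreed above, here is a minimal sketch of the uncharge
side (it assumes the per-task counter pointer becomes RCU-protected and usage[]
becomes atomic_t; the names follow this patch set but the code is illustrative, not
the final patch):

static void rdma_uncharge_task(struct task_struct *task, int type, int num)
{
	struct task_rdma_res_counter *res;

	rcu_read_lock();
	/* the counter may be freed once the last ucontext goes away,
	 * so dereference it under RCU as suggested above
	 */
	res = rcu_dereference(task->rdma_res_counter);
	if (res)
		atomic_sub(num, &res->usage[type]);
	rcu_read_unlock();
}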


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-08 Thread Parav Pandit
On Tue, Sep 8, 2015 at 8:53 PM, Tejun Heo <t...@kernel.org> wrote:
> Hello, Parav.
>
> On Tue, Sep 08, 2015 at 02:08:16AM +0530, Parav Pandit wrote:
>> Currently user space applications can easily take away all the rdma
>> device specific resources such as AH, CQ, QP, MR etc. Due to which other
>> applications in other cgroup or kernel space ULPs may not even get chance
>> to allocate any rdma resources.
>
> Is there something simple I can read up on what each resource is?
> What's the usual access control mechanism?
>
Hi Tejun,
This is an old white paper, but most of the reasoning still holds true for RDMA.
http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf

More notes on RDMA resources, as a summary:
RDMA allows data transport from one system to another, where the RDMA
device implements OSI layers 4 to 1, typically in hardware and drivers.
The RDMA device provides data path semantics to perform data transfer in a
zero-copy manner from one host to the other, very similar to a local DMA
controller.
It also allows data transfer operations directly from a user space application on
one system to the other system.
In order to do so, all the resources are created through trusted kernel
space, which also provides isolation among applications.
These resources include a QP (queue pair) to transfer data, a CQ
(completion queue) to indicate completion of data transfer operations, and an
MR (memory region) to represent user application memory as the source or
destination of a data transfer.
The common resources are QP, SRQ (shared receive queue), CQ, MR, AH
(address handle), FLOW, PD (protection domain), user context, etc. A rough
user space allocation sequence is sketched below.
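
For reference, a typical user space sequence that consumes these resources looks
roughly like this with libibverbs (illustrative only, error handling omitted); each
call is a point where the proposed controller would charge one resource:

#include <infiniband/verbs.h>

/* Rough libibverbs sequence: user context, PD, CQ, MR and QP are each
 * one of the resource types listed above.  Error handling omitted.
 */
static int allocate_basic_resources(struct ibv_device *dev, void *buf, size_t len)
{
	struct ibv_context *ctx = ibv_open_device(dev);		/* user context */
	struct ibv_pd *pd = ibv_alloc_pd(ctx);			/* protection domain */
	struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);	/* completion queue */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
	struct ibv_qp_init_attr attr = {
		.send_cq = cq, .recv_cq = cq,
		.cap = { .max_send_wr = 16, .max_recv_wr = 16,
			 .max_send_sge = 1, .max_recv_sge = 1 },
		.qp_type = IBV_QPT_RC,
	};
	struct ibv_qp *qp = ibv_create_qp(pd, &attr);		/* queue pair */

	return (ctx && pd && cq && mr && qp) ? 0 : -1;
}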

>> This patch-set allows limiting rdma resources to set of processes.
>> It extend device cgroup controller for limiting rdma device limits.
>
> I don't think this belongs to devcg.  If these make sense as a set of
> resources to be controlled via cgroup, the right way prolly would be a
> separate controller.
>

In the past there has been a similar comment to have a dedicated cgroup
controller for RDMA instead of merging it with the device cgroup.
I am OK with both approaches; however, I prefer to utilize the device
controller instead of spinning off a new controller for a new device
category.
I anticipate more such needs will arise, and for each new device category
it might not be worth having a new cgroup controller.
RapidIO, though much less popular, and upcoming PCIe fabrics are on the horizon
to offer benefits similar to those of RDMA, and in future having one
controller for each of them again would not be the right approach.

I certainly seek your and others' input in this email thread on whether
(a) to continue to extend the device cgroup (which supports the character and
block device white list) to now cover RDMA devices,
or
(b) to spin off a new controller, and if so, what the compelling advantages are
that it provides compared to the extension.

The current scope of the patch is limited to RDMA resources as a first
step, but I am sure there is more functionality in the pipeline, from me and
others, to support via this cgroup.
So keeping at least these two aspects in mind, I need input on the
direction: extend the existing controller or create a dedicated one.

In future, I anticipate that we might have a sub-directory under the device
cgroup for each individual device class to control,
such as,
 Thanks.
>
> --
> tejun


[PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.

2015-09-07 Thread Parav Pandit
1. Moved the necessary functions and data structures to a header file so they
can be reused by the device cgroup white list functionality and by the rdma
functionality.
2. Added infrastructure to invoke RDMA-specific routines for resource
configuration, query, and fork handling.
3. Added sysfs interface files for configuring the max limit of each rdma
resource and one file for querying the controller's current resource usage.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 include/linux/device_cgroup.h |  53 +++
 security/device_cgroup.c  | 119 +-
 2 files changed, 136 insertions(+), 36 deletions(-)

diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 8b64221..cdbdd60 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -1,6 +1,57 @@
+#ifndef _DEVICE_CGROUP
+#define _DEVICE_CGROUP
+
 #include 
+#include 
+#include 
 
 #ifdef CONFIG_CGROUP_DEVICE
+
+enum devcg_behavior {
+   DEVCG_DEFAULT_NONE,
+   DEVCG_DEFAULT_ALLOW,
+   DEVCG_DEFAULT_DENY,
+};
+
+/*
+ * exception list locking rules:
+ * hold devcgroup_mutex for update/read.
+ * hold rcu_read_lock() for read.
+ */
+
+struct dev_exception_item {
+   u32 major, minor;
+   short type;
+   short access;
+   struct list_head list;
+   struct rcu_head rcu;
+};
+
+struct dev_cgroup {
+   struct cgroup_subsys_state css;
+   struct list_head exceptions;
+   enum devcg_behavior behavior;
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+   struct devcgroup_rdma rdma;
+#endif
+};
+
+static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state 
*s)
+{
+   return s ? container_of(s, struct dev_cgroup, css) : NULL;
+}
+
+static inline struct dev_cgroup *parent_devcgroup(struct dev_cgroup *dev_cg)
+{
+   return css_to_devcgroup(dev_cg->css.parent);
+}
+
+static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
+{
+   return css_to_devcgroup(task_css(task, devices_cgrp_id));
+}
+
 extern int __devcgroup_inode_permission(struct inode *inode, int mask);
 extern int devcgroup_inode_mknod(int mode, dev_t dev);
 static inline int devcgroup_inode_permission(struct inode *inode, int mask)
@@ -17,3 +68,5 @@ static inline int devcgroup_inode_permission(struct inode 
*inode, int mask)
 static inline int devcgroup_inode_mknod(int mode, dev_t dev)
 { return 0; }
 #endif
+
+#endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 188c1d2..a0b3239 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -25,42 +25,6 @@
 
 static DEFINE_MUTEX(devcgroup_mutex);
 
-enum devcg_behavior {
-   DEVCG_DEFAULT_NONE,
-   DEVCG_DEFAULT_ALLOW,
-   DEVCG_DEFAULT_DENY,
-};
-
-/*
- * exception list locking rules:
- * hold devcgroup_mutex for update/read.
- * hold rcu_read_lock() for read.
- */
-
-struct dev_exception_item {
-   u32 major, minor;
-   short type;
-   short access;
-   struct list_head list;
-   struct rcu_head rcu;
-};
-
-struct dev_cgroup {
-   struct cgroup_subsys_state css;
-   struct list_head exceptions;
-   enum devcg_behavior behavior;
-};
-
-static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state 
*s)
-{
-   return s ? container_of(s, struct dev_cgroup, css) : NULL;
-}
-
-static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
-{
-   return css_to_devcgroup(task_css(task, devices_cgrp_id));
-}
-
 /*
  * called under devcgroup_mutex
  */
@@ -223,6 +187,9 @@ devcgroup_css_alloc(struct cgroup_subsys_state *parent_css)
	INIT_LIST_HEAD(&dev_cgroup->exceptions);
dev_cgroup->behavior = DEVCG_DEFAULT_NONE;
 
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+   init_devcgroup_rdma_tracker(dev_cgroup);
+#endif
	return &dev_cgroup->css;
 }
 
@@ -234,6 +201,25 @@ static void devcgroup_css_free(struct cgroup_subsys_state 
*css)
kfree(dev_cgroup);
 }
 
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+static int devcgroup_can_attach(struct cgroup_subsys_state *dst_css,
+   struct cgroup_taskset *tset)
+{
+   return devcgroup_rdma_can_attach(dst_css, tset);
+}
+
+static void devcgroup_cancel_attach(struct cgroup_subsys_state *dst_css,
+   struct cgroup_taskset *tset)
+{
+	devcgroup_rdma_cancel_attach(dst_css, tset);
+}
+
+static void devcgroup_fork(struct task_struct *task, void *priv)
+{
+   devcgroup_rdma_fork(task, priv);
+}
+#endif
+
 #define DEVCG_ALLOW 1
 #define DEVCG_DENY 2
 #define DEVCG_LIST 3
@@ -788,6 +774,62 @@ static struct cftype dev_cgroup_files[] = {
.seq_show = devcgroup_seq_show,
.private = DEVCG_LIST,
},
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+   {
+   .name = "rdma.resource.uctx.max",
+   .write = devcgroup_rdma_set_max_resource,
+	.seq_show = devcgroup_rdma_get_max_resource,

[PATCH 7/7] devcg: Added Documentation of RDMA device cgroup.

2015-09-07 Thread Parav Pandit
Modified the device cgroup documentation to reflect its dual purpose
without creating a new cgroup subsystem for rdma.

Added documentation describing the functionality and usage of the device cgroup
extension for RDMA.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 Documentation/cgroups/devices.txt | 32 +---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/devices.txt 
b/Documentation/cgroups/devices.txt
index 3c1095c..eca5b70 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -1,9 +1,12 @@
-Device Whitelist Controller
+Device Controller
 
 1. Description:
 
-Implement a cgroup to track and enforce open and mknod restrictions
-on device files.  A device cgroup associates a device access
+Device controller implements a cgroup for two purposes.
+
+1.1 Device white list controller
+It implements a cgroup to track and enforce open and mknod
+restrictions on device files.  A device cgroup associates a device access
 whitelist with each cgroup.  A whitelist entry has 4 fields.
 'type' is a (all), c (char), or b (block).  'all' means it applies
 to all types and all major and minor numbers.  Major and minor are
@@ -15,8 +18,15 @@ cgroup gets a copy of the parent.  Administrators can then 
remove
 devices from the whitelist or add new entries.  A child cgroup can
 never receive a device access which is denied by its parent.
 
+1.2 RDMA device resource controller
+It implements a cgroup to limit various RDMA device resources for
+a cgroup. Such resources include RDMA PD, CQ, AH, MR, SRQ, QP, and FLOW.
+It limits RDMA resource access for the tasks of the cgroup across multiple
+RDMA devices.
+
 2. User Interface
 
+2.1 Device white list controller
 An entry is added using devices.allow, and removed using
 devices.deny.  For instance
 
@@ -33,6 +43,22 @@ will remove the default 'a *:* rwm' entry. Doing
 
 will add the 'a *:* rwm' entry to the whitelist.
 
+2.2 RDMA device controller
+
+RDMA resources are limited using the devices.rdma.resource.max_<resource> files.
+Doing
+   echo 200 > /sys/fs/cgroup/1/rdma.resource.max_qp
+will limit the maximum number of QPs across all the processes of the cgroup to 200.
+
+More examples:
+   echo 200 > /sys/fs/cgroup/1/rdma.resource.max_flow
+   echo 10  > /sys/fs/cgroup/1/rdma.resource.max_pd
+   echo 15  > /sys/fs/cgroup/1/rdma.resource.max_srq
+   echo 1   > /sys/fs/cgroup/1/rdma.resource.max_uctx
+
+RDMA resource current usage can be tracked using devices.rdma.resource.usage
+   cat /sys/fs/cgroup/1/devices.rdma.resource.usage
+
 3. Security
 
 Any task can move itself between cgroups.  This clearly won't
-- 
1.8.3.1



Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-07 Thread Parav Pandit
Hi Doug, Tejun,

This is based on the cgroups for-4.3 branch.
The linux-rdma trunk will face a compilation error as it is behind Tejun's
for-4.3 branch.
The patch depends on some of the cgroup subsystem functionality
for fork().
Therefore those changes need to be merged into the linux-rdma trunk first.

Parav


On Tue, Sep 8, 2015 at 2:08 AM, Parav Pandit <pandit.pa...@gmail.com> wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources.
>
> This patch-set allows limiting rdma resources to set of processes.
> It extend device cgroup controller for limiting rdma device limits.
>
> With this patch, user verbs module queries rdma device cgroup controller
> to query process's limit to consume such resource. It uncharge resource
> counter after resource is being freed.
>
> It extends the task structure to hold the statistic information about 
> process's
> rdma resource usage so that when process migrates from one to other 
> controller,
> right amount of resources can be migrated from one to other cgroup.
>
> Future patches will support RDMA flows resource and will be enhanced further
> to enforce limit of other resources and capabilities.
>
> Parav Pandit (7):
>   devcg: Added user option to rdma resource tracking.
>   devcg: Added rdma resource tracking module.
>   devcg: Added infrastructure for rdma device cgroup.
>   devcg: Added rdma resource tracker object per task
>   devcg: device cgroup's extension for RDMA resource.
>   devcg: Added support to use RDMA device cgroup.
>   devcg: Added Documentation of RDMA device cgroup.
>
>  Documentation/cgroups/devices.txt |  32 ++-
>  drivers/infiniband/core/uverbs_cmd.c  | 139 +--
>  drivers/infiniband/core/uverbs_main.c |  39 +++-
>  include/linux/device_cgroup.h |  53 +
>  include/linux/device_rdma_cgroup.h|  83 +++
>  include/linux/sched.h |  12 +-
>  init/Kconfig  |  12 +
>  security/Makefile |   1 +
>  security/device_cgroup.c  | 119 +++---
>  security/device_rdma_cgroup.c | 422 
> ++
>  10 files changed, 850 insertions(+), 62 deletions(-)
>  create mode 100644 include/linux/device_rdma_cgroup.h
>  create mode 100644 security/device_rdma_cgroup.c
>
> --
> 1.8.3.1
>


[PATCH 1/7] devcg: Added user option to rdma resource tracking.

2015-09-07 Thread Parav Pandit
Added a user configuration option to enable/disable the RDMA resource tracking
feature of the device cgroup as a sub-module.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 init/Kconfig | 12 
 1 file changed, 12 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 2184b34..089db85 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -977,6 +977,18 @@ config CGROUP_DEVICE
  Provides a cgroup implementing whitelists for devices which
  a process in the cgroup can mknod or open.
 
+config CGROUP_RDMA_RESOURCE
+   bool "RDMA Resource Controller for cgroups"
+   depends on CGROUP_DEVICE
+   default n
+   help
+ This option enables limiting rdma resources for a device cgroup.
+ Using this option, user space processes can be restricted to a
+ limited number of RDMA resources such as MR, PD, QP, AH, FLOW,
+ CQ, etc.
+
+ Say N if unsure.
+
 config CPUSETS
bool "Cpuset support"
help
-- 
1.8.3.1



[PATCH 2/7] devcg: Added rdma resource tracking module.

2015-09-07 Thread Parav Pandit
Added RDMA resource tracking object of device cgroup.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 security/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/security/Makefile b/security/Makefile
index c9bfbc8..c9ad56d 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_SECURITY_TOMOYO) += tomoyo/
 obj-$(CONFIG_SECURITY_APPARMOR)+= apparmor/
 obj-$(CONFIG_SECURITY_YAMA)+= yama/
 obj-$(CONFIG_CGROUP_DEVICE)+= device_cgroup.o
+obj-$(CONFIG_CGROUP_RDMA_RESOURCE) += device_rdma_cgroup.o
 
 # Object integrity file lists
 subdir-$(CONFIG_INTEGRITY) += integrity
-- 
1.8.3.1



[PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-07 Thread Parav Pandit
Extension of the device cgroup for RDMA device resources.
This implements an RDMA resource tracker to limit RDMA resources such as
AH, CQ, PD, QP, MR, and SRQ for the processes of the cgroup.
It implements an RDMA resource limit module to limit the consumption of RDMA
resources by the processes of the cgroup.
RDMA resources are tracked on a per-task basis.
RDMA resources across multiple such devices are limited among the multiple
processes of the owning device cgroup.

The RDMA device cgroup extension returns an error when user space applications
try to allocate more resources than the configured limit.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 include/linux/device_rdma_cgroup.h |  83 
 security/device_rdma_cgroup.c  | 422 +
 2 files changed, 505 insertions(+)
 create mode 100644 include/linux/device_rdma_cgroup.h
 create mode 100644 security/device_rdma_cgroup.c

diff --git a/include/linux/device_rdma_cgroup.h 
b/include/linux/device_rdma_cgroup.h
new file mode 100644
index 000..a2c261b
--- /dev/null
+++ b/include/linux/device_rdma_cgroup.h
@@ -0,0 +1,83 @@
+#ifndef _DEVICE_RDMA_CGROUP_H
+#define _DEVICE_RDMA_CGROUP_H
+
+#include 
+
+/* RDMA resources from device cgroup perspective */
+enum devcgroup_rdma_rt {
+   DEVCG_RDMA_RES_TYPE_UCTX,
+   DEVCG_RDMA_RES_TYPE_CQ,
+   DEVCG_RDMA_RES_TYPE_PD,
+   DEVCG_RDMA_RES_TYPE_AH,
+   DEVCG_RDMA_RES_TYPE_MR,
+   DEVCG_RDMA_RES_TYPE_MW,
+   DEVCG_RDMA_RES_TYPE_SRQ,
+   DEVCG_RDMA_RES_TYPE_QP,
+   DEVCG_RDMA_RES_TYPE_FLOW,
+   DEVCG_RDMA_RES_TYPE_MAX,
+};
+
+struct ib_ucontext;
+
+#define DEVCG_RDMA_MAX_RESOURCES S32_MAX
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+
+#define DEVCG_RDMA_MAX_RESOURCE_STR "max"
+
+enum devcgroup_rdma_access_files {
+   DEVCG_RDMA_LIST_USAGE,
+};
+
+struct task_rdma_res_counter {
+   /* allows atomic increment of task and cgroup counters
+*  to avoid race with migration task.
+*/
+   spinlock_t lock;
+   u32 usage[DEVCG_RDMA_RES_TYPE_MAX];
+};
+
+struct devcgroup_rdma_tracker {
+   int limit;
+   atomic_t usage;
+   int failcnt;
+};
+
+struct devcgroup_rdma {
+   struct devcgroup_rdma_tracker tracker[DEVCG_RDMA_RES_TYPE_MAX];
+};
+
+struct dev_cgroup;
+
+void init_devcgroup_rdma_tracker(struct dev_cgroup *dev_cg);
+ssize_t devcgroup_rdma_set_max_resource(struct kernfs_open_file *of,
+   char *buf,
+   size_t nbytes, loff_t off);
+int devcgroup_rdma_get_max_resource(struct seq_file *m, void *v);
+int devcgroup_rdma_show_usage(struct seq_file *m, void *v);
+
+int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num);
+void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
+ enum devcgroup_rdma_rt type, int num);
+void devcgroup_rdma_fork(struct task_struct *task, void *priv);
+
+int devcgroup_rdma_can_attach(struct cgroup_subsys_state *css,
+ struct cgroup_taskset *tset);
+void devcgroup_rdma_cancel_attach(struct cgroup_subsys_state *css,
+ struct cgroup_taskset *tset);
+int devcgroup_rdma_query_resource_limit(enum devcgroup_rdma_rt type);
+#else
+
+static inline int devcgroup_rdma_try_charge_resource(
+   enum devcgroup_rdma_rt type, int num)
+{ return 0; }
+static inline void devcgroup_rdma_uncharge_resource(
+   struct ib_ucontext *ucontext,
+   enum devcgroup_rdma_rt type, int num)
+{ }
+static inline int devcgroup_rdma_query_resource_limit(
+   enum devcgroup_rdma_rt type)
+{ return DEVCG_RDMA_MAX_RESOURCES; }
+#endif
+
+#endif
diff --git a/security/device_rdma_cgroup.c b/security/device_rdma_cgroup.c
new file mode 100644
index 000..fb4cc59
--- /dev/null
+++ b/security/device_rdma_cgroup.c
@@ -0,0 +1,422 @@
+/*
+ * RDMA device cgroup controller of device controller cgroup.
+ *
+ * Provides a cgroup hierarchy to limit various RDMA resource allocation to a
+ * configured limit of the cgroup.
+ *
+ * It is easy for user space applications to consume RDMA device-specific
+ * hardware resources. Such resource exhaustion should be prevented so that
+ * user space applications and other kernel consumers get a chance to allocate
+ * and effectively use the hardware resources.
+ *
+ * In order to use the device rdma controller, set the maximum resource count
+ * per cgroup, which ensures that total rdma resources for processes belonging
+ * to a cgroup doesn't exceed configured limit.
+ *
+ * RDMA resource limits are hierarchical, so the highest configured limit of
+ * the hierarchy is enforced. Allowing resource limit configuration to default
+ * cgroup allows fair share to kernel space ULPs as well.
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.

[PATCH 6/7] devcg: Added support to use RDMA device cgroup.

2015-09-07 Thread Parav Pandit
The RDMA uverbs module now queries the associated device cgroup rdma controller
before allocating device resources and uncharges them while freeing
rdma device resources.
Since the fput() sequence can free the resources from workqueue
context (instead of the task context which allocated the resource),
the associated ucontext pointer is passed during uncharge, so that
the rdma cgroup controller can correctly uncharge the resource for the right
task and the right cgroup.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 drivers/infiniband/core/uverbs_cmd.c  | 139 +-
 drivers/infiniband/core/uverbs_main.c |  39 +-
 2 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index bbb02ff..c080374 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -281,6 +282,19 @@ static void put_xrcd_read(struct ib_uobject *uobj)
put_uobj_read(uobj);
 }
 
+static void init_ucontext_lists(struct ib_ucontext *ucontext)
+{
+   INIT_LIST_HEAD(>pd_list);
+   INIT_LIST_HEAD(>mr_list);
+   INIT_LIST_HEAD(>mw_list);
+   INIT_LIST_HEAD(>cq_list);
+   INIT_LIST_HEAD(>qp_list);
+   INIT_LIST_HEAD(>srq_list);
+   INIT_LIST_HEAD(>ah_list);
+   INIT_LIST_HEAD(>xrcd_list);
+   INIT_LIST_HEAD(>rule_list);
+}
+
 ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
  const char __user *buf,
  int in_len, int out_len)
@@ -313,22 +327,18 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
   (unsigned long) cmd.response + sizeof resp,
   in_len - sizeof cmd, out_len - sizeof resp);
 
+   ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_UCTX, 1);
+   if (ret)
+   goto err;
+
	ucontext = ibdev->alloc_ucontext(ibdev, &udata);
if (IS_ERR(ucontext)) {
ret = PTR_ERR(ucontext);
-   goto err;
+   goto err_alloc;
}
 
ucontext->device = ibdev;
-	INIT_LIST_HEAD(&ucontext->pd_list);
-	INIT_LIST_HEAD(&ucontext->mr_list);
-	INIT_LIST_HEAD(&ucontext->mw_list);
-	INIT_LIST_HEAD(&ucontext->cq_list);
-	INIT_LIST_HEAD(&ucontext->qp_list);
-	INIT_LIST_HEAD(&ucontext->srq_list);
-	INIT_LIST_HEAD(&ucontext->ah_list);
-	INIT_LIST_HEAD(&ucontext->xrcd_list);
-	INIT_LIST_HEAD(&ucontext->rule_list);
+   init_ucontext_lists(ucontext);
rcu_read_lock();
ucontext->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
rcu_read_unlock();
@@ -395,6 +405,8 @@ err_free:
put_pid(ucontext->tgid);
ibdev->dealloc_ucontext(ucontext);
 
+err_alloc:
+   devcgroup_rdma_uncharge_resource(NULL, DEVCG_RDMA_RES_TYPE_UCTX, 1);
 err:
	mutex_unlock(&file->mutex);
return ret;
@@ -412,15 +424,23 @@ static void copy_query_dev_fields(struct ib_uverbs_file 
*file,
resp->vendor_id = attr->vendor_id;
resp->vendor_part_id= attr->vendor_part_id;
resp->hw_ver= attr->hw_ver;
-   resp->max_qp= attr->max_qp;
+   resp->max_qp= min_t(int, attr->max_qp,
+   devcgroup_rdma_query_resource_limit(
+   DEVCG_RDMA_RES_TYPE_QP));
resp->max_qp_wr = attr->max_qp_wr;
resp->device_cap_flags  = attr->device_cap_flags;
resp->max_sge   = attr->max_sge;
resp->max_sge_rd= attr->max_sge_rd;
-   resp->max_cq= attr->max_cq;
+   resp->max_cq= min_t(int, attr->max_cq,
+   devcgroup_rdma_query_resource_limit(
+   DEVCG_RDMA_RES_TYPE_CQ));
resp->max_cqe   = attr->max_cqe;
-   resp->max_mr= attr->max_mr;
-   resp->max_pd= attr->max_pd;
+   resp->max_mr= min_t(int, attr->max_mr,
+   devcgroup_rdma_query_resource_limit(
+   DEVCG_RDMA_RES_TYPE_MR));
+   resp->max_pd= min_t(int, attr->max_pd,
+   devcgroup_rdma_query_resource_limit(
+   DEVCG_RDMA_RES_TYPE_PD));
resp->max_qp_rd_atom= attr->max_qp_rd_atom;
resp->max_ee_rd_atom= attr->max_ee_rd_atom;
resp->max_res_rd_atom   = attr->max_res_rd_atom;
@@ -429,16 +449,22 @@ static void copy_query_dev_fields(struct ib_uverbs_file 
*file,
resp->atomic_cap= attr->atomic_cap;
resp->max_ee  

[PATCH 4/7] devcg: Added rdma resource tracker object per task

2015-09-07 Thread Parav Pandit
Added RDMA device resource tracking object per task.
Added comments to capture usage of task lock by device cgroup
for rdma.

Signed-off-by: Parav Pandit <pandit.pa...@gmail.com>
---
 include/linux/sched.h | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f15..a5f79b6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1334,6 +1334,8 @@ union rcu_special {
 };
 struct rcu_node;
 
+struct task_rdma_res_counter;
+
 enum perf_event_task_context {
perf_invalid_context = -1,
perf_hw_context = 0,
@@ -1637,6 +1639,14 @@ struct task_struct {
struct css_set __rcu *cgroups;
/* cg_list protected by css_set_lock and tsk->alloc_lock */
struct list_head cg_list;
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+   /* RDMA resource accounting counters, allocated only
+* when RDMA resources are created by a task.
+*/
+   struct task_rdma_res_counter *rdma_res_counter;
+#endif
+
 #endif
 #ifdef CONFIG_FUTEX
struct robust_list_head __user *robust_list;
@@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct 
*p)
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
  * pins the final release of task.io_context.  Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. Also protects ->vfork_done and ->rdma_res_counter.
  *
  * Nests both inside and outside of read_lock(_lock).
  * It must not be nested with write_lock_irq(_lock),
-- 
1.8.3.1



Re: RDMA/CM and multiple QPs

2015-09-06 Thread Parav Pandit
Hi Christoph,

Establishing multiple QPs is just one part of it.
The bigger challenge is how we distribute work requests among
multiple QPs, especially when STag advertisement and invalidation are
agnostic to the verbs layer (they are not part of the IB spec, and every
ULP has its own method, possibly for good reason).

A few months back, when I was working on this problem, the solution we
considered was similar to what the networking stack currently does,
as below:

1. Instead of having only the raw ib send, write, read, and invalidate verbs, we
need higher-level verbs for data transport,
such as send_data, receive_data, advertise_buffers, etc., of course
keeping zero-copy semantics in mind.

2. Perform device aggregation similar to Ethernet netdev link aggregation,
so that two ib_device instances form a pair on which one or more QPs are created.
This virtual device provides higher-level data transfer APIs than just the
raw IB semantics.
By doing so, this layer decides how to advertise memory, when to
invalidate, and which QP to use for transport (load balancing or failover).

3. I have not thought through how we can port existing ULPs, whose
specifications are IB-driven, to migrate to this newly defined interface.

4. Accelio is one framework that comes close to this design philosophy;
however, its current implementation brings resource overhead for MRs,
and as we go along there is scope to optimize it.

5. Since this layer is located above the raw IB verbs layer and above
RDMA/CM, the core is untouched by this functionality. Once we have it, many
of the migration-related issues can be solved, where a node can
disconnect and reconnect in a stateful way.

6. This way the pure hardware resource is detached from transport
acceleration, which gives the flexibility to implement services that are
often difficult to build at the raw IB verbs level. A rough sketch of such an
interface follows below.
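
As a purely hypothetical illustration of points 1-6 (none of these types or
callbacks exist anywhere today; every name below is made up for this sketch), the
mid-layer interface could be shaped roughly like this:

/* Hypothetical sketch only: a transport mid-layer sitting above verbs
 * and RDMA/CM that owns QP selection, memory advertisement and
 * invalidation on behalf of ULPs.
 */
struct rdma_xport;			/* aggregates a pair of ib_device + N QPs */

struct rdma_xport_ops {
	int  (*send_data)(struct rdma_xport *xp, void *buf, size_t len);
	int  (*receive_data)(struct rdma_xport *xp, void *buf, size_t len);
	int  (*advertise_buffers)(struct rdma_xport *xp,
				  struct scatterlist *sg, int nents);
	void (*invalidate_buffers)(struct rdma_xport *xp, u32 key);
};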

Parav






On Sun, Sep 6, 2015 at 12:15 PM, Christoph Hellwig  wrote:
> Hi All,
>
> right now RDMA/CM works on a QP basis, but seems very awakward if you
> want multiple QPs as part of a single logical device, which will be
> useful for a lot of modern protocols.  For example we will need to check
> in the CM handler that we're not getting a different ib_device if we
> want to apply the device limit in any sort of global scope, and it's
> generally very hard to get a struct ib_device that can be used as
> a driver model parent.
>
> Is there any interest in trying to add an API to the CM to do a single
> address resolution and allocate multiple QPs with these checks in
> place?
> --


Re: RDMA/CM and multiple QPs

2015-09-06 Thread Parav Pandit
On Sun, Sep 6, 2015 at 1:20 PM, Christoph Hellwig <h...@infradead.org> wrote:
> On Sun, Sep 06, 2015 at 01:12:56PM +0530, Parav Pandit wrote:
>> Hi Christoph,
>>
>> Establishing multiple QP is just one part of it.
>> Bigger challenge is how do we distribute the work request among
>> multiple QPs
>
> For my case I simply rely on the blk-mq layer to have cpu-local queues,
> so that's a somewhat solved issue as long as you are fine with the
> usage model.  If your usage is skewed heavily towards certain CPUs
> it might be a little suboptimal.
>
> Note that the SRP driver already in tree is a good example for this,
> although it doesn't use RDMA/CM and thus already operates on a
> per-ib_device level.

Yes, SRP is a good example. The point I am trying to make is that SRP
implements failover and request spreading itself: when one QP fails, it
delivers on another QP.
So one session spans multiple transport QP connections.
Similarly, every ULP needs to implement such functionality.
Instead there could be a single transport mid-layer that does it, for example
along the lines of the sketch below.
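
A toy illustration of the kind of policy such a mid-layer would hide from ULPs
(the function and its arguments are made up for this example):

/* Toy example: round-robin QP selection with failover across the QPs
 * of one session, so each ULP does not reimplement this itself.
 */
static struct ib_qp *pick_qp(struct ib_qp **qps, const bool *alive, int nr_qps)
{
	static atomic_t next;
	unsigned int i, start = (unsigned int)atomic_inc_return(&next) % nr_qps;

	for (i = 0; i < nr_qps; i++) {
		unsigned int idx = (start + i) % nr_qps;

		if (alive[idx])
			return qps[idx];	/* healthy path */
	}
	return NULL;				/* all paths down */
}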


Re: [PATCH for-next 1/2] IB/core: Add support for RX/TX checksum offload capabilities report

2015-08-06 Thread Parav Pandit
On Thu, Aug 6, 2015 at 4:30 PM, Haggai Eran hagg...@mellanox.com wrote:
 On Wednesday, August 5, 2015 8:16 PM, Jason Gunthorpe 
 jguntho...@obsidianresearch.com wrote:
 On Wed, Aug 05, 2015 at 06:34:26PM +0300, Amir Vadai wrote:
  struct ib_uverbs_ex_query_device {
   __u32 comp_mask;
 + __u32 csum_caps;
   __u32 reserved;
  };

 Uh no.
 This is the struct of the command, not the response. There is no need to 
 extend it. The command is designed to always return as much information as 
 possible, so the user space code doesn't need to pass anything for it to work.

 Even if you did want to extend it, you would need to replace the reserved 
 word. The structs in this header file must be made in such way that they have 
 the same size on 32-bit systems and on 64-bit systems (see the comment at the 
 beginning of the header file). This is why the reserved word is there.


 @@ -221,6 +222,7 @@ struct ib_uverbs_odp_caps {
  struct ib_uverbs_ex_query_device_resp {
   struct ib_uverbs_query_device_resp base;
   __u32 comp_mask;
 + __u32 csum_caps;
   __u32 response_length;
   struct ib_uverbs_odp_caps odp_caps;
   __u64 timestamp_mask;

 Also totally wrong.

 The response struct must maintain backward compatibility. You cannot change 
 the order of the existing fields. The only valid way of extending it is at 
 the end. Here too, you must make sure that the struct has the same size on 
 32-bit systems, so you would need to add a 32-bit reserved word at the end.


As struct ib_uverbs_ex_query_device_resp captures extended
capabilities, does it make sense to define a few more reserved words
as part of this patch,
so that those reserved words can later be given meaning for
additional features?
This way, for every new feature we don't need to bump the structure size of
the ABI, nor do we need to define a new set of ABI calls.
It's hard to say how much more is sufficient, but I was thinking of 8 32-bit words.
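
For comparison, the existing extension mechanism already keeps old user space
working without reserved words, roughly like this (simplified sketch, not the exact
ib_uverbs_ex_query_device() code):

/* Simplified sketch: new fields are only ever appended, the kernel
 * records how much it filled in response_length, and copies out no
 * more than the user buffer can hold.
 */
static int copy_ex_response(struct ib_udata *ucore,
			    struct ib_uverbs_ex_query_device_resp *resp)
{
	size_t len = min_t(size_t, sizeof(*resp), ucore->outlen);

	resp->response_length = len;
	return ib_copy_to_udata(ucore, resp, len);
}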


Re: [RFC] split struct ib_send_wr

2015-08-06 Thread Parav Pandit
Do you see value in dividing ib_ud_wr into ib_ud_wr and ib_ud_gsi_wr
to save 4 bytes?

On Thu, Aug 6, 2015 at 9:54 PM, Christoph Hellwig h...@infradead.org wrote:
 I've pushed out a new version.  Updates:

  - the ib_recv_wr change Bart notices has been fixed.
  - iser and isert have been converted
  - the handling of the embedded WR in the qib software queue entry
has been fixed.

 Which means we're basically done now and the patch could use
 broader testing.

 The full patch will be too much for the list again, so here is the
 git commit:

 http://git.infradead.org/users/hch/scsi.git/commitdiff/a0027ed00fc3ae2686d8a843a724b50597115a71

 ib_vers.h diff below:

 diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
 index 0940051..2f2efdd 100644
 --- a/include/rdma/ib_verbs.h
 +++ b/include/rdma/ib_verbs.h
 @@ -1100,54 +1100,94 @@ struct ib_send_wr {
 __be32  imm_data;
 u32 invalidate_rkey;
 } ex;
 -   union {
 -   struct {
 -   u64 remote_addr;
 -   u32 rkey;
 -   } rdma;
 -   struct {
 -   u64 remote_addr;
 -   u64 compare_add;
 -   u64 swap;
 -   u64 compare_add_mask;
 -   u64 swap_mask;
 -   u32 rkey;
 -   } atomic;
 -   struct {
 -   struct ib_ah *ah;
 -   void   *header;
 -   int hlen;
 -   int mss;
 -   u32 remote_qpn;
 -   u32 remote_qkey;
 -   u16 pkey_index; /* valid for GSI only */
 -   u8  port_num;   /* valid for DR SMPs on switch 
 only */
 -   } ud;
 -   struct {
 -   u64 iova_start;
 -   struct ib_fast_reg_page_list   *page_list;
 -   unsigned intpage_shift;
 -   unsigned intpage_list_len;
 -   u32 length;
 -   int access_flags;
 -   u32 rkey;
 -   } fast_reg;
 -   struct {
 -   struct ib_mw*mw;
 -   /* The new rkey for the memory window. */
 -   u32  rkey;
 -   struct ib_mw_bind_info   bind_info;
 -   } bind_mw;
 -   struct {
 -   struct ib_sig_attrs*sig_attrs;
 -   struct ib_mr   *sig_mr;
 -   int access_flags;
 -   struct ib_sge  *prot;
 -   } sig_handover;
 -   } wr;
 u32 xrc_remote_srq_num; /* XRC TGT QPs only */
  };

 +struct ib_rdma_wr {
 +   struct ib_send_wr   wr;
 +   u64 remote_addr;
 +   u32 rkey;
 +};
 +
 +static inline struct ib_rdma_wr *rdma_wr(struct ib_send_wr *wr)
 +{
 +   return container_of(wr, struct ib_rdma_wr, wr);
 +}
 +
 +struct ib_atomic_wr {
 +   struct ib_send_wr   wr;
 +   u64 remote_addr;
 +   u64 compare_add;
 +   u64 swap;
 +   u64 compare_add_mask;
 +   u64 swap_mask;
 +   u32 rkey;
 +};
 +
 +static inline struct ib_atomic_wr *atomic_wr(struct ib_send_wr *wr)
 +{
 +   return container_of(wr, struct ib_atomic_wr, wr);
 +}
 +
 +struct ib_ud_wr {
 +   struct ib_send_wr   wr;
 +   struct ib_ah*ah;
 +   void*header;
 +   int hlen;
 +   int mss;
 +   u32 remote_qpn;
 +   u32 remote_qkey;
 +   u16 pkey_index; /* valid for GSI only */
 +   u8  port_num;   /* valid for DR SMPs on switch 
 only */
 +};
 +
 +static inline struct ib_ud_wr *ud_wr(struct ib_send_wr *wr)
 +{
 +   return container_of(wr, struct ib_ud_wr, wr);
 +}
 +
 +struct ib_fast_reg_wr {
 +   struct ib_send_wr   wr;
 +   u64 iova_start;
 +   struct ib_fast_reg_page_list *page_list;
 +   unsigned intpage_shift;
 +   unsigned intpage_list_len;
 +   u32 length;
 +   int access_flags;
 +   u32 rkey;
 +};
 +
 +static inline struct ib_fast_reg_wr *fast_reg_wr(struct ib_send_wr *wr)
 +{
 +   return container_of(wr, struct 

Re: [PATCH for-next 1/2] IB/core: Add support for RX/TX checksum offload capabilities report

2015-08-06 Thread Parav Pandit
On Thu, Aug 6, 2015 at 10:20 PM, Haggai Eran hagg...@mellanox.com wrote:
 On 08/06/2015 02:18 PM, Parav Pandit wrote:
 On Thu, Aug 6, 2015 at 4:30 PM, Haggai Eran hagg...@mellanox.com
 mailto:hagg...@mellanox.com wrote:

 On Wednesday, August 5, 2015 8:16 PM, Jason Gunthorpe
 jguntho...@obsidianresearch.com
 mailto:jguntho...@obsidianresearch.com wrote:
  On Wed, Aug 05, 2015 at 06:34:26PM +0300, Amir Vadai wrote:
   struct ib_uverbs_ex_query_device {
__u32 comp_mask;
  + __u32 csum_caps;
__u32 reserved;
   };
 
  Uh no.
 This is the struct of the command, not the response. There is no
 need to extend it. The command is designed to always return as much
 information as possible, so the user space code doesn't need to pass
 anything for it to work.

 Even if you did want to extend it, you would need to replace the
 reserved word. The structs in this header file must be made in such
 way that they have the same size on 32-bit systems and on 64-bit
 systems (see the comment at the beginning of the header file). This
 is why the reserved word is there.

 
  @@ -221,6 +222,7 @@ struct ib_uverbs_odp_caps {
   struct ib_uverbs_ex_query_device_resp {
struct ib_uverbs_query_device_resp base;
__u32 comp_mask;
  + __u32 csum_caps;
__u32 response_length;
struct ib_uverbs_odp_caps odp_caps;
__u64 timestamp_mask;
 
  Also totally wrong.

 The response struct must maintain backward compatibility. You cannot
 change the order of the existing fields. The only valid way of
 extending it is at the end. Here too, you must make sure that the
 struct has the same size on 32-bit systems, so you would need to add
 a 32-bit reserved word at the end.

 Haggai

 As struct ib_uverbs_ex_query_device_resp captures extended capabilities,
 does it make sense to have few more reserved words defined as part of
 this patch?
 So that later on those reserved can be defined in future for additional
 features.
 This way for every new feature we dont need to bump structure size of
 ABI, not we need to define new set of ABI calls.
 Its hard to say how much more is sufficient, but was thinking of 8
 32-bit words.


 I don't see how increasing the size now would get you anything that
 changing the returned response_length field wouldn't.

It won't. Eventually this code will need a switch-case over the various
response lengths to support backward compatibility. I was
trying to avoid adding such a switch-case; instead, based on the supported
kernel version, it would fill in the information.

 I'm not sure what
 you consider an ABI change. Doesn't adding new meaning to reserved
 fields count as a change? In any case, increasing the response length
 doesn't require adding new calls.

Yes, it doesn't. I don't see an issue with increasing the response length; it
solves the problem. I was just considering a solution where we don't have to keep
doing that.

 The kernel code will agree to fill only the fields that fit in the buffer 
 provided by the user-space caller.

 Haggai


[PATCH] RDMA/ocrdma: Fixed cqe expansion of unsignaled wqe

2012-08-17 Thread Parav Pandit
Fixed CQE expansion of unsignaled WQEs: stop expanding the CQE
when the wqe index of the completed CQE matches the last pending wqe (tail)
in the queue.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index cb5b7f7..b29a424 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -2219,7 +2219,6 @@ static bool ocrdma_poll_success_scqe(struct ocrdma_qp *qp,
u32 wqe_idx;
 
	if (!qp->wqe_wr_id_tbl[tail].signaled) {
-		expand = true;  /* CQE cannot be consumed yet */
		*polled = false;	/* WC cannot be consumed yet */
	} else {
		ibwc->status = IB_WC_SUCCESS;
@@ -2227,10 +2226,11 @@ static bool ocrdma_poll_success_scqe(struct ocrdma_qp *qp,
		ibwc->qp = &qp->ibqp;
		ocrdma_update_wc(qp, ibwc, tail);
		*polled = true;
-		wqe_idx = le32_to_cpu(cqe->wq.wqeidx) & OCRDMA_CQE_WQEIDX_MASK;
-		if (tail != wqe_idx)
-			expand = true; /* Coalesced CQE can't be consumed yet */
	}
+	wqe_idx = le32_to_cpu(cqe->wq.wqeidx) & OCRDMA_CQE_WQEIDX_MASK;
+	if (tail != wqe_idx)
+		expand = true; /* Coalesced CQE can't be consumed yet */
+
	ocrdma_hwq_inc_tail(&qp->sq);
return expand;
 }
-- 
1.6.0.2



[PATCHv1] RDMA/ocrdma: Fixed CONFIG_VLAN_8021Q.

2012-08-11 Thread Parav Pandit
Fixed the code to avoid checking for a real vlan device in the scenario
where VLAN is disabled and IPv6 is enabled.

Reported-by: Fengguang Wu fengguang...@intel.com
Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_main.c |   16 
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 5a04452..f4e3696 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -161,7 +161,7 @@ static void ocrdma_add_default_sgid(struct ocrdma_dev *dev)
	ocrdma_get_guid(dev, &sgid->raw[8]);
 }
 
-#if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
+#if IS_ENABLED(CONFIG_VLAN_8021Q)
 static void ocrdma_add_vlan_sgids(struct ocrdma_dev *dev)
 {
struct net_device *netdev, *tmp;
@@ -202,8 +202,16 @@ static int ocrdma_build_sgid_tbl(struct ocrdma_dev *dev)
return 0;
 }
 
-#if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_VLAN_8021Q)
+static struct net_device *ocrdma_get_real_netdev(struct net_device *netdev)
+{
+#if IS_ENABLED(CONFIG_VLAN_8021Q)
+   return vlan_dev_real_dev(netdev);
+#else
+   return netdev;
+#endif
+}
 
+#if IS_ENABLED(CONFIG_IPV6)
 static int ocrdma_inet6addr_event(struct notifier_block *notifier,
  unsigned long event, void *ptr)
 {
@@ -217,7 +225,7 @@ static int ocrdma_inet6addr_event(struct notifier_block 
*notifier,
bool is_vlan = false;
u16 vid = 0;
 
-   netdev = vlan_dev_real_dev(event_netdev);
+   netdev = ocrdma_get_real_netdev(event_netdev);
if (netdev != event_netdev) {
is_vlan = true;
vid = vlan_dev_vlan_id(event_netdev);
@@ -262,7 +270,7 @@ static struct notifier_block ocrdma_inet6addr_notifier = {
.notifier_call = ocrdma_inet6addr_event
 };
 
-#endif /* IPV6 and VLAN */
+#endif /* IPV6 */
 
 static enum rdma_link_layer ocrdma_link_layer(struct ib_device *device,
  u8 port_num)
-- 
1.6.0.2



[PATCH] RDMA/ocrdma: Fixed CONFIG_VLAN_8021Q.

2012-08-10 Thread Parav Pandit
Fixed the code to avoid checking for a real vlan device in the scenario
where VLAN is disabled and IPv6 is enabled.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_main.c |   16 
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 5a04452..7146ffd 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -161,7 +161,7 @@ static void ocrdma_add_default_sgid(struct ocrdma_dev *dev)
	ocrdma_get_guid(dev, &sgid->raw[8]);
 }
 
-#if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
+#if IS_ENABLED(CONFIG_VLAN_8021Q) || IS_ENABLED(CONFIG_VLAN_8021Q_MODULE)
 static void ocrdma_add_vlan_sgids(struct ocrdma_dev *dev)
 {
struct net_device *netdev, *tmp;
@@ -202,8 +202,16 @@ static int ocrdma_build_sgid_tbl(struct ocrdma_dev *dev)
return 0;
 }
 
-#if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_VLAN_8021Q)
+static struct net_device *ocrdma_get_real_netdev(struct net_device *netdev)
+{
+#if IS_ENABLED(CONFIG_VLAN_8021Q) || IS_ENABLED(CONFIG_VLAN_8021Q_MODULE)
+   return vlan_dev_real_dev(netdev);
+#else
+   return netdev;
+#endif
+}
 
+#if IS_ENABLED(CONFIG_IPV6)
 static int ocrdma_inet6addr_event(struct notifier_block *notifier,
  unsigned long event, void *ptr)
 {
@@ -217,7 +225,7 @@ static int ocrdma_inet6addr_event(struct notifier_block 
*notifier,
bool is_vlan = false;
u16 vid = 0;
 
-   netdev = vlan_dev_real_dev(event_netdev);
+   netdev = ocrdma_get_real_netdev(event_netdev);
if (netdev != event_netdev) {
is_vlan = true;
vid = vlan_dev_vlan_id(event_netdev);
@@ -262,7 +270,7 @@ static struct notifier_block ocrdma_inet6addr_notifier = {
.notifier_call = ocrdma_inet6addr_event
 };
 
-#endif /* IPV6 and VLAN */
+#endif /* IPV6 */
 
 static enum rdma_link_layer ocrdma_link_layer(struct ib_device *device,
  u8 port_num)
-- 
1.6.0.2



[PATCH] RDMA/ocrdma: Fixed polling RQ error CQE polling.

2012-06-11 Thread Parav Pandit
Fixed RQ/SRQ error CQE polling.
The error CQE is now returned to the consumer for an error case in which it was not
returned previously.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index d16d172..0ec44d5 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -2301,8 +2301,10 @@ static bool ocrdma_poll_err_rcqe(struct ocrdma_qp *qp, 
struct ocrdma_cqe *cqe,
*stop = true;
expand = false;
}
-   } else
+   } else {
+   *polled = true;
expand = ocrdma_update_err_rcqe(ibwc, cqe, qp, status);
+   }
return expand;
 }
 
-- 
1.6.0.2



[PATCH] RDMA/ocrdma: fixed gid table for vlan and events.

2012-06-08 Thread Parav Pandit
1. Fixed reporting of the GID table addition event as well.
2. Enable vlan-based GID entries only when VLAN is enabled at compile time via
CONFIG_VLAN_8021Q or CONFIG_VLAN_8021Q_MODULE.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_main.c |   63 +++-
 1 files changed, 34 insertions(+), 29 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 04fef3d..b050e62 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -97,13 +97,11 @@ static void ocrdma_build_sgid_mac(union ib_gid *sgid, 
unsigned char *mac_addr,
	sgid->raw[15] = mac_addr[5];
 }
 
-static void ocrdma_add_sgid(struct ocrdma_dev *dev, unsigned char *mac_addr,
+static bool ocrdma_add_sgid(struct ocrdma_dev *dev, unsigned char *mac_addr,
bool is_vlan, u16 vlan_id)
 {
int i;
-   bool found = false;
union ib_gid new_sgid;
-   int free_idx = OCRDMA_MAX_SGID;
unsigned long flags;
 
	memset(&ocrdma_zero_sgid, 0, sizeof(union ib_gid));
@@ -115,23 +113,19 @@ static void ocrdma_add_sgid(struct ocrdma_dev *dev, unsigned char *mac_addr,
		if (!memcmp(&dev->sgid_tbl[i], &ocrdma_zero_sgid,
			    sizeof(union ib_gid))) {
			/* found free entry */
-			if (!found) {
-				free_idx = i;
-				found = true;
-				break;
-			}
+			memcpy(&dev->sgid_tbl[i], &new_sgid,
+			       sizeof(union ib_gid));
+			spin_unlock_irqrestore(&dev->sgid_lock, flags);
+			return true;
		} else if (!memcmp(&dev->sgid_tbl[i], &new_sgid,
				   sizeof(union ib_gid))) {
			/* entry already present, no addition is required. */
			spin_unlock_irqrestore(&dev->sgid_lock, flags);
-			return;
+			return false;
		}
	}
-	/* if entry doesn't exist and if table has some space, add entry */
-	if (found)
-		memcpy(&dev->sgid_tbl[free_idx], &new_sgid,
-		       sizeof(union ib_gid));
	spin_unlock_irqrestore(&dev->sgid_lock, flags);
+   return false;
 }
 
 static bool ocrdma_del_sgid(struct ocrdma_dev *dev, unsigned char *mac_addr,
@@ -167,7 +161,8 @@ static void ocrdma_add_default_sgid(struct ocrdma_dev *dev)
	ocrdma_get_guid(dev, &sgid->raw[8]);
 }
 
-static int ocrdma_build_sgid_tbl(struct ocrdma_dev *dev)
+#if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
+static void ocrdma_add_vlan_sgids(struct ocrdma_dev *dev)
 {
struct net_device *netdev, *tmp;
u16 vlan_id;
@@ -175,8 +170,6 @@ static int ocrdma_build_sgid_tbl(struct ocrdma_dev *dev)
 
	netdev = dev->nic_info.netdev;
 
-   ocrdma_add_default_sgid(dev);
-
rcu_read_lock();
for_each_netdev_rcu(init_net, tmp) {
if (netdev == tmp || vlan_dev_real_dev(tmp) == netdev) {
@@ -194,10 +187,23 @@ static int ocrdma_build_sgid_tbl(struct ocrdma_dev *dev)
}
}
rcu_read_unlock();
+}
+#else
+static void ocrdma_add_vlan_sgids(struct ocrdma_dev *dev)
+{
+
+}
+#endif /* VLAN */
+
+static int ocrdma_build_sgid_tbl(struct ocrdma_dev *dev)
+{
+   ocrdma_add_default_sgid(dev);
+   ocrdma_add_vlan_sgids(dev);
return 0;
 }
 
-#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) || \
+defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
 
 static int ocrdma_inet6addr_event(struct notifier_block *notifier,
  unsigned long event, void *ptr)
@@ -208,6 +214,7 @@ static int ocrdma_inet6addr_event(struct notifier_block 
*notifier,
struct ib_event gid_event;
struct ocrdma_dev *dev;
bool found = false;
+   bool updated = false;
bool is_vlan = false;
u16 vid = 0;
 
@@ -233,23 +240,21 @@ static int ocrdma_inet6addr_event(struct notifier_block 
*notifier,
	mutex_lock(&dev->dev_lock);
	switch (event) {
	case NETDEV_UP:
-		ocrdma_add_sgid(dev, netdev->dev_addr, is_vlan, vid);
+		updated = ocrdma_add_sgid(dev, netdev->dev_addr, is_vlan, vid);
		break;
	case NETDEV_DOWN:
-		found = ocrdma_del_sgid(dev, netdev->dev_addr, is_vlan, vid);
-		if (found) {
-			/* found the matching entry, notify
-			 * the consumers about it
-			 */
-			gid_event.device = &dev->ibdev;
-			gid_event.element.port_num = 1;
-			gid_event.event = IB_EVENT_GID_CHANGE;

[PATCH] RDMA/ocrdma: Corrected Queue max values.

2012-06-08 Thread Parav Pandit
From: Mahesh Vardhamanaiah mahesh.vardhamana...@emulex.com

Fixed the code to read the max WQE and max RQE values from the mailbox
response.
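
For reference, the new extraction is a plain shift/mask of one mailbox
response word; a minimal standalone sketch of that pattern (the bit positions
here are assumed for illustration only, not taken from the SLI definitions):

#include <stdint.h>

/* Illustrative layout: WQE depth assumed in bits 16..31, RQE depth in
 * bits 0..15 of the same response word.
 */
#define WQES_PER_WQ_OFFSET	16
#define RQES_PER_RQ_MASK	0xFFFFu

static void decode_queue_depths(uint32_t max_wqes_rqes_per_q,
				uint32_t *max_wqe, uint32_t *max_rqe)
{
	*max_wqe = max_wqes_rqes_per_q >> WQES_PER_WQ_OFFSET;
	*max_rqe = max_wqes_rqes_per_q & RQES_PER_RQ_MASK;
}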

Signed-off-by: Mahesh Vardhamanaiah mahesh.vardhamana...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c  |   15 +--
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h |2 +-
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
index f26314f..5704bb9 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
@@ -990,8 +990,6 @@ static void ocrdma_get_attr(struct ocrdma_dev *dev,
  struct ocrdma_dev_attr *attr,
  struct ocrdma_mbx_query_config *rsp)
 {
-   int max_q_mem;
-
 	attr->max_pd =
 	    (rsp->max_pd_ca_ack_delay & OCRDMA_MBX_QUERY_CFG_MAX_PD_MASK) >>
 	    OCRDMA_MBX_QUERY_CFG_MAX_PD_SHIFT;
@@ -1037,18 +1035,15 @@ static void ocrdma_get_attr(struct ocrdma_dev *dev,
 	attr->max_inline_data =
 	    attr->wqe_size - (sizeof(struct ocrdma_hdr_wqe) +
 			      sizeof(struct ocrdma_sge));
-	max_q_mem = OCRDMA_Q_PAGE_BASE_SIZE << (OCRDMA_MAX_Q_PAGE_SIZE_CNT - 1);
-	/* hw can queue one less then the configured size,
-	 * so publish less by one to stack.
-	 */
 	if (dev->nic_info.dev_family == OCRDMA_GEN2_FAMILY) {
-		dev->attr.max_wqe = max_q_mem / dev->attr.wqe_size;
 		attr->ird = 1;
 		attr->ird_page_size = OCRDMA_MIN_Q_PAGE_SIZE;
 		attr->num_ird_pages = MAX_OCRDMA_IRD_PAGES;
-	} else
-		dev->attr.max_wqe = (max_q_mem / dev->attr.wqe_size) - 1;
-	dev->attr.max_rqe = (max_q_mem / dev->attr.rqe_size) - 1;
+	}
+	dev->attr.max_wqe = rsp->max_wqes_rqes_per_q >>
+	    OCRDMA_MBX_QUERY_CFG_MAX_WQES_PER_WQ_OFFSET;
+	dev->attr.max_rqe = rsp->max_wqes_rqes_per_q &
+	    OCRDMA_MBX_QUERY_CFG_MAX_RQES_PER_RQ_MASK;
 }
 
 static int ocrdma_check_fw_config(struct ocrdma_dev *dev,
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h 
b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
index 7fd80cc..8411441 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
@@ -458,7 +458,7 @@ enum {
 		OCRDMA_MBX_QUERY_CFG_MAX_WQES_PER_WQ_OFFSET,
 	OCRDMA_MBX_QUERY_CFG_MAX_RQES_PER_RQ_OFFSET	= 0,
 	OCRDMA_MBX_QUERY_CFG_MAX_RQES_PER_RQ_MASK	= 0xFFFF <<
-		OCRDMA_MBX_QUERY_CFG_MAX_WQES_PER_WQ_OFFSET,
+		OCRDMA_MBX_QUERY_CFG_MAX_RQES_PER_RQ_OFFSET,
 
OCRDMA_MBX_QUERY_CFG_MAX_CQ_OFFSET  = 16,
OCRDMA_MBX_QUERY_CFG_MAX_CQ_MASK= 0x 
-- 
1.6.0.2



[PATCH] RDMA/ocrdma: Corrected queue SGE calculation.

2012-06-08 Thread Parav Pandit
From: Mahesh Vardhamanaiah mahesh.vardhamana...@emulex.com

Fixed the max SGE calculation for SQ, RQ, and SRQ for all hardware types.
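
In effect, the reporting rule after this fix is to advertise the smaller of
the send-SGE and SRQ-SGE limits (see the min() in the ocrdma_verbs.c hunk
below); a standalone sketch of that rule, with illustrative names:

static inline int reported_max_sge(int max_send_sge, int max_srq_sge)
{
	/* advertise a value that both QP and SRQ creation can honour */
	return max_send_sge < max_srq_sge ? max_send_sge : max_srq_sge;
}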

Signed-off-by: Mahesh Vardhamanaiah mahesh.vardhamana...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma.h   |1 +
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c|3 +++
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h   |3 +++
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |6 +++---
 4 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h 
b/drivers/infiniband/hw/ocrdma/ocrdma.h
index 037f5ce..48970af 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -61,6 +61,7 @@ struct ocrdma_dev_attr {
u32 max_inline_data;
int max_send_sge;
int max_recv_sge;
+   int max_srq_sge;
int max_mr;
u64 max_mr_size;
u32 max_num_mr_pbl;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
index 5704bb9..ea94caf 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
@@ -1002,6 +1002,9 @@ static void ocrdma_get_attr(struct ocrdma_dev *dev,
 	attr->max_recv_sge = (rsp->max_write_send_sge &
 			      OCRDMA_MBX_QUERY_CFG_MAX_SEND_SGE_MASK) >>
 	    OCRDMA_MBX_QUERY_CFG_MAX_SEND_SGE_SHIFT;
+	attr->max_srq_sge = (rsp->max_srq_rqe_sge &
+			     OCRDMA_MBX_QUERY_CFG_MAX_SRQ_SGE_MASK) >>
+	    OCRDMA_MBX_QUERY_CFG_MAX_SRQ_SGE_OFFSET;
 	attr->max_ord_per_qp = (rsp->max_ird_ord_per_qp &
 				OCRDMA_MBX_QUERY_CFG_MAX_ORD_PER_QP_MASK) >>
 	    OCRDMA_MBX_QUERY_CFG_MAX_ORD_PER_QP_SHIFT;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h 
b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
index 8411441..c75cbdf 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
@@ -418,6 +418,9 @@ enum {
 
 	OCRDMA_MBX_QUERY_CFG_MAX_SEND_SGE_SHIFT		= 0,
 	OCRDMA_MBX_QUERY_CFG_MAX_SEND_SGE_MASK		= 0xFFFF,
+	OCRDMA_MBX_QUERY_CFG_MAX_WRITE_SGE_SHIFT	= 16,
+	OCRDMA_MBX_QUERY_CFG_MAX_WRITE_SGE_MASK		= 0xFFFF <<
+		OCRDMA_MBX_QUERY_CFG_MAX_WRITE_SGE_SHIFT,
 
 	OCRDMA_MBX_QUERY_CFG_MAX_ORD_PER_QP_SHIFT	= 0,
 	OCRDMA_MBX_QUERY_CFG_MAX_ORD_PER_QP_MASK	= 0xFFFF,
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index d16d172..0e88088 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -83,8 +83,8 @@ int ocrdma_query_device(struct ib_device *ibdev, struct ib_device_attr *attr)
 		IB_DEVICE_SHUTDOWN_PORT |
 		IB_DEVICE_SYS_IMAGE_GUID |
 		IB_DEVICE_LOCAL_DMA_LKEY;
-	attr->max_sge = dev->attr.max_send_sge;
-	attr->max_sge_rd = dev->attr.max_send_sge;
+	attr->max_sge = min(dev->attr.max_send_sge, dev->attr.max_srq_sge);
+	attr->max_sge_rd = 0;
 	attr->max_cq = dev->attr.max_cq;
 	attr->max_cqe = dev->attr.max_cqe;
 	attr->max_mr = dev->attr.max_mr;
@@ -97,7 +97,7 @@ int ocrdma_query_device(struct ib_device *ibdev, struct ib_device_attr *attr)
 	    min(dev->attr.max_ord_per_qp, dev->attr.max_ird_per_qp);
 	attr->max_qp_init_rd_atom = dev->attr.max_ord_per_qp;
 	attr->max_srq = (dev->attr.max_qp - 1);
-	attr->max_srq_sge = attr->max_sge;
+	attr->max_srq_sge = dev->attr.max_srq_sge;
 	attr->max_srq_wr = dev->attr.max_rqe;
 	attr->local_ca_ack_delay = dev->attr.local_ca_ack_delay;
 	attr->max_fast_reg_page_list_len = 0;
-- 
1.6.0.2



[PATCH] RDMA/ocrdma: corrected queue free count math

2012-05-23 Thread Parav Pandit
Corrected the queue free count math for SQ and RQ for all hardware types.
Updated the user-kernel ABI interface.
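
The free-count computation this patch simplifies is ordinary circular-queue
arithmetic on the head (producer) and tail (consumer) indices; a standalone
sketch of the same math as in ocrdma_hwq_free_cnt() below, with the struct
fields passed in directly:

#include <stdint.h>

static uint32_t hwq_free_cnt(uint32_t head, uint32_t tail, uint32_t max_cnt)
{
	/* slots the producer may still use before catching the consumer */
	if (head >= tail)
		return (max_cnt - head) + tail;
	return tail - head;
}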

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma.h   |1 -
 drivers/infiniband/hw/ocrdma/ocrdma_abi.h   |5 +
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c|7 ---
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |5 -
 4 files changed, 1 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h 
b/drivers/infiniband/hw/ocrdma/ocrdma.h
index 85a69c9..037f5ce 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -231,7 +231,6 @@ struct ocrdma_qp_hwq_info {
u32 entry_size;
u32 max_cnt;
u32 max_wqe_idx;
-   u32 free_delta;
u16 dbid;   /* qid, where to ring the doorbell. */
u32 len;
dma_addr_t pa;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_abi.h 
b/drivers/infiniband/hw/ocrdma/ocrdma_abi.h
index a411a4e..517ab20 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_abi.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_abi.h
@@ -101,8 +101,6 @@ struct ocrdma_create_qp_uresp {
u32 rsvd1;
u32 num_wqe_allocated;
u32 num_rqe_allocated;
-   u32 free_wqe_delta;
-   u32 free_rqe_delta;
u32 db_sq_offset;
u32 db_rq_offset;
u32 db_shift;
@@ -126,8 +124,7 @@ struct ocrdma_create_srq_uresp {
u32 db_rq_offset;
u32 db_shift;
 
-   u32 free_rqe_delta;
-   u32 rsvd2;
+   u64 rsvd2;
u64 rsvd3;
 } __packed;
 
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
index 9b204b1..f26314f 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
@@ -1990,19 +1990,12 @@ static void ocrdma_get_create_qp_rsp(struct ocrdma_create_qp_rsp *rsp,
 	max_wqe_allocated = 1 << max_wqe_allocated;
 	max_rqe_allocated = 1 << ((u16)rsp->max_wqe_rqe);
 
-	if (qp->dev->nic_info.dev_family == OCRDMA_GEN2_FAMILY) {
-		qp->sq.free_delta = 0;
-		qp->rq.free_delta = 1;
-	} else
-		qp->sq.free_delta = 1;
-
 	qp->sq.max_cnt = max_wqe_allocated;
 	qp->sq.max_wqe_idx = max_wqe_allocated - 1;
 
 	if (!attrs->srq) {
 		qp->rq.max_cnt = max_rqe_allocated;
 		qp->rq.max_wqe_idx = max_rqe_allocated - 1;
-		qp->rq.free_delta = 1;
 	}
 }
 
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index e9f74d1..d16d172 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -940,8 +940,6 @@ static int ocrdma_copy_qp_uresp(struct ocrdma_qp *qp,
 		uresp.db_rq_offset = OCRDMA_DB_RQ_OFFSET;
 		uresp.db_shift = 16;
 	}
-	uresp.free_wqe_delta = qp->sq.free_delta;
-	uresp.free_rqe_delta = qp->rq.free_delta;
 
 	if (qp->dpp_enabled) {
 		uresp.dpp_credit = dpp_credit_lmt;
@@ -1307,8 +1305,6 @@ static int ocrdma_hwq_free_cnt(struct ocrdma_qp_hwq_info *q)
 		free_cnt = (q->max_cnt - q->head) + q->tail;
 	else
 		free_cnt = q->tail - q->head;
-	if (q->free_delta)
-		free_cnt -= q->free_delta;
 	return free_cnt;
 }
 
@@ -1501,7 +1497,6 @@ static int ocrdma_copy_srq_uresp(struct ocrdma_srq *srq, struct ib_udata *udata)
 	    (srq->pd->id * srq->dev->nic_info.db_page_size);
 	uresp.db_page_size = srq->dev->nic_info.db_page_size;
 	uresp.num_rqe_allocated = srq->rq.max_cnt;
-	uresp.free_rqe_delta = 1;
 	if (srq->dev->nic_info.dev_family == OCRDMA_GEN2_FAMILY) {
 		uresp.db_rq_offset = OCRDMA_DB_GEN2_RQ1_OFFSET;
 		uresp.db_shift = 24;
-- 
1.6.0.2



[PATCH] RDMA/ocrdma: fixed enum for SRQ_LIMIT_REACHED.

2012-05-23 Thread Parav Pandit
Fixed enum value for SRQ_LIMIT_REACHED async event.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
index f26314f..9343a15 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
@@ -732,7 +732,7 @@ static void ocrdma_dispatch_ibevent(struct ocrdma_dev *dev,
break;
case OCRDMA_SRQ_LIMIT_EVENT:
 		ib_evt.element.srq = &qp->srq->ibsrq;
-   ib_evt.event = IB_EVENT_QP_LAST_WQE_REACHED;
+   ib_evt.event = IB_EVENT_SRQ_LIMIT_REACHED;
srq_event = 1;
qp_event = 0;
break;
-- 
1.6.0.2



[PATCH 0/2] be2net: Added functionality to support RoCE driver

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

This patch series is for the netdev net-next tree. It is based on the previous
RFC; those review comments have been addressed.
This patch series adds functionality to support the RoCE (RDMA over Converged
Ethernet) driver:
- Detecting RoCE-capable adapters and creating a linked list of them.
- Enabling 5 more MSIX vectors for RoCE functionality.
- Calling the registered callback functions of the RoCE driver
  whenever a new RoCE-capable device is added or removed.
- Notifying the RoCE driver of events when the interface goes up or down.
- Providing device-specific details to the RoCE driver for each RoCE device.
- Providing a low-level mailbox command that the RoCE driver can issue
  before it has its own MQ (see the sketch below).
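
As a rough illustration of the callback flow described above (the struct and
function names below are assumptions made for this sketch; the real
definitions live in be_roce.h/be_roce.c of this series):

/* Hypothetical hook table a RoCE driver hands to the NIC driver. */
struct roce_hooks {
	void (*add)(void *adapter);		/* RoCE-capable adapter found */
	void (*remove)(void *adapter);		/* adapter going away */
	void (*link_change)(void *adapter, int link_up); /* ifup / ifdown */
};

static struct roce_hooks *registered_hooks;

/* The RoCE driver registers once; the NIC driver then replays add() for
 * every adapter already discovered and calls the hooks on later events.
 */
void roce_register_driver(struct roce_hooks *hooks)
{
	registered_hooks = hooks;
}

void roce_notify_link(void *adapter, int link_up)
{
	if (registered_hooks && registered_hooks->link_change)
		registered_hooks->link_change(adapter, link_up);
}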

Parav Pandit (2):
  be2net: Added function to issue mailbox cmd on MQ.
  be2net: Added functionality to support RoCE driver

 drivers/net/ethernet/emulex/benet/Makefile  |2 +-
 drivers/net/ethernet/emulex/benet/be.h  |   38 ++-
 drivers/net/ethernet/emulex/benet/be_cmds.c |   39 ++
 drivers/net/ethernet/emulex/benet/be_cmds.h |1 +
 drivers/net/ethernet/emulex/benet/be_hw.h   |4 +-
 drivers/net/ethernet/emulex/benet/be_main.c |   88 +++--
 drivers/net/ethernet/emulex/benet/be_roce.c |  182 +++
 drivers/net/ethernet/emulex/benet/be_roce.h |   75 +++
 8 files changed, 414 insertions(+), 15 deletions(-)
 create mode 100644 drivers/net/ethernet/emulex/benet/be_roce.c
 create mode 100644 drivers/net/ethernet/emulex/benet/be_roce.h



[PATCH 1/2] be2net: Added function to issue mailbox cmd on MQ.

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

- Added a generic function, exported for other modules, to issue a mailbox
  command on the MQ.
- The RoCE driver will use this before it sets up its own MQ (a usage sketch
  follows).
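
A hedged sketch of how a RoCE driver might consume this export before its own
MQ exists (the prototype matches the export added below; the caller itself is
hypothetical):

#include <linux/types.h>
#include <linux/errno.h>

/* Prototype as exported by this patch. */
int be_roce_mcc_cmd(void *netdev_handle, void *wrb_payload,
		    int wrb_payload_size, u16 *cmd_status, u16 *ext_status);

/* Hypothetical caller: 'req' must start with a be_cmd_req_hdr carrying the
 * subsystem/opcode; on success the response is copied back over 'req'.
 */
static int roce_issue_mbx(void *netdev_handle, void *req, int req_size)
{
	u16 cmd_status = 0, ext_status = 0;
	int rc;

	rc = be_roce_mcc_cmd(netdev_handle, req, req_size,
			     &cmd_status, &ext_status);
	if (rc)
		return rc;
	return cmd_status ? -EIO : 0;
}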

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/net/ethernet/emulex/benet/be_cmds.c |   39 +++
 1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c 
b/drivers/net/ethernet/emulex/benet/be_cmds.c
index 398fb5c..393ad05 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -15,6 +15,7 @@
  * Costa Mesa, CA 92626
  */
 
+#include <linux/module.h>
 #include "be.h"
 #include "be_cmds.h"
 
@@ -2418,3 +2419,41 @@ err:
spin_unlock_bh(adapter-mcc_lock);
return status;
 }
+
+int be_roce_mcc_cmd(void *netdev_handle, void *wrb_payload,
+		    int wrb_payload_size, u16 *cmd_status, u16 *ext_status)
+{
+	struct be_adapter *adapter = netdev_priv(netdev_handle);
+	struct be_mcc_wrb *wrb;
+	struct be_cmd_req_hdr *hdr = (struct be_cmd_req_hdr *) wrb_payload;
+	struct be_cmd_req_hdr *req;
+	struct be_cmd_resp_hdr *resp;
+	int status;
+
+	spin_lock_bh(&adapter->mcc_lock);
+
+	wrb = wrb_from_mccq(adapter);
+	if (!wrb) {
+		status = -EBUSY;
+		goto err;
+	}
+	req = embedded_payload(wrb);
+	resp = embedded_payload(wrb);
+
+	be_wrb_cmd_hdr_prepare(req, hdr->subsystem,
+			       hdr->opcode, wrb_payload_size, wrb, NULL);
+	memcpy(req, wrb_payload, wrb_payload_size);
+	be_dws_cpu_to_le(req, wrb_payload_size);
+
+	status = be_mcc_notify_wait(adapter);
+	if (cmd_status)
+		*cmd_status = (status & 0xffff);
+	if (ext_status)
+		*ext_status = 0;
+	memcpy(wrb_payload, resp, sizeof(*resp) + resp->response_length);
+	be_dws_le_to_cpu(wrb_payload, sizeof(*resp) + resp->response_length);
+err:
+	spin_unlock_bh(&adapter->mcc_lock);
+	return status;
+}
+EXPORT_SYMBOL(be_roce_mcc_cmd);
-- 
1.6.0.2



[PATCH 2/2] be2net: Added functionality to support RoCE driver

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

- Increased MSIX vectors by 5 for RoCE traffic.
- Added a macro to check RoCE support on a device.
- Added device-specific doorbell and MSIX vector fields shared with the NIC
  functionality.
- Provides RoCE driver registration and deregistration functions.
- Added support functions which are invoked on adapter
  add/remove and port up/down events.
- Traverses the list of adapters to invoke the registered callback functions
  (see the sketch below).
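
The add/remove dispatch in the last point boils down to walking a global
adapter list under a lock; a minimal kernel-style sketch (the list and lock
names are assumed here, the real ones live in be_roce.c):

#include <linux/list.h>
#include <linux/mutex.h>

struct roce_adapter {
	struct list_head entry;
};

static LIST_HEAD(roce_adapter_list);
static DEFINE_MUTEX(roce_adapter_list_lock);

/* Replay the driver's add() callback for every adapter already present. */
static void roce_replay_add(void (*add_cb)(struct roce_adapter *))
{
	struct roce_adapter *a;

	mutex_lock(&roce_adapter_list_lock);
	list_for_each_entry(a, &roce_adapter_list, entry)
		add_cb(a);
	mutex_unlock(&roce_adapter_list_lock);
}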

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/net/ethernet/emulex/benet/Makefile  |2 +-
 drivers/net/ethernet/emulex/benet/be.h  |   38 ++-
 drivers/net/ethernet/emulex/benet/be_cmds.h |1 +
 drivers/net/ethernet/emulex/benet/be_hw.h   |4 +-
 drivers/net/ethernet/emulex/benet/be_main.c |   88 +++--
 drivers/net/ethernet/emulex/benet/be_roce.c |  182 +++
 drivers/net/ethernet/emulex/benet/be_roce.h |   75 +++
 7 files changed, 375 insertions(+), 15 deletions(-)
 create mode 100644 drivers/net/ethernet/emulex/benet/be_roce.c
 create mode 100644 drivers/net/ethernet/emulex/benet/be_roce.h

diff --git a/drivers/net/ethernet/emulex/benet/Makefile 
b/drivers/net/ethernet/emulex/benet/Makefile
index a60cd80..1a91b27 100644
--- a/drivers/net/ethernet/emulex/benet/Makefile
+++ b/drivers/net/ethernet/emulex/benet/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_BE2NET) += be2net.o
 
-be2net-y :=  be_main.o be_cmds.o be_ethtool.o
+be2net-y :=  be_main.o be_cmds.o be_ethtool.o be_roce.o
diff --git a/drivers/net/ethernet/emulex/benet/be.h 
b/drivers/net/ethernet/emulex/benet/be.h
index ab24e46..15d0c88 100644
--- a/drivers/net/ethernet/emulex/benet/be.h
+++ b/drivers/net/ethernet/emulex/benet/be.h
@@ -32,6 +32,7 @@
 #include linux/u64_stats_sync.h
 
 #include be_hw.h
+#include be_roce.h
 
 #define DRV_VER			"4.2.116u"
 #define DRV_NAME		"be2net"
@@ -98,7 +99,8 @@ static inline char *nic_name(struct pci_dev *pdev)
 #define MAX_RX_QS  (MAX_RSS_QS + 1) /* RSS qs + 1 def Rx */
 
 #define MAX_TX_QS  8
-#define MAX_MSIX_VECTORS   MAX_RSS_QS
+#define MAX_ROCE_EQS   5
+#define MAX_MSIX_VECTORS   (MAX_RSS_QS + MAX_ROCE_EQS) /* RSS qs + RoCE */
 #define BE_TX_BUDGET   256
 #define BE_NAPI_WEIGHT 64
 #define MAX_RX_POST		BE_NAPI_WEIGHT /* Frags posted at a time */
@@ -376,6 +378,17 @@ struct be_adapter {
u8 transceiver;
u8 autoneg;
u8 generation;  /* BladeEngine ASIC generation */
+   u32 if_type;
+   struct {
+   u8 __iomem *base;   /* Door Bell */
+   u32 size;
+   u32 total_size;
+   u64 io_addr;
+   } roce_db;
+   u32 num_msix_roce_vec;
+   struct ocrdma_dev *ocrdma_dev;
+   struct list_head entry;
+
u32 flash_status;
struct completion flash_compl;
 
@@ -403,6 +416,10 @@ struct be_adapter {
 #define lancer_chip(adapter)   ((adapter-pdev-device == OC_DEVICE_ID3) || \
 (adapter-pdev-device == OC_DEVICE_ID4))
 
+#define be_roce_supported(adapter)	((adapter->if_type == SLI_INTF_TYPE_3 || \
+					adapter->sli_family == SKYHAWK_SLI_FAMILY) && \
+					(adapter->function_mode & RDMA_ENABLED))
+
 extern const struct ethtool_ops be_ethtool_ops;
 
 #define msix_enabled(adapter)  (adapter-num_msix_vec  0)
@@ -549,9 +566,28 @@ static inline bool be_error(struct be_adapter *adapter)
return adapter-eeh_err || adapter-ue_detected || adapter-fw_timeout;
 }
 
+static inline bool be_type_2_3(struct be_adapter *adapter)
+{
+	return (adapter->if_type == SLI_INTF_TYPE_2 ||
+		adapter->if_type == SLI_INTF_TYPE_3) ? true : false;
+}
+
 extern void be_cq_notify(struct be_adapter *adapter, u16 qid, bool arm,
u16 num_popped);
 extern void be_link_status_update(struct be_adapter *adapter, u8 link_status);
 extern void be_parse_stats(struct be_adapter *adapter);
 extern int be_load_fw(struct be_adapter *adapter, u8 *func);
+
+/*
+ * internal function to initialize-cleanup roce device.
+ */
+extern void be_roce_dev_add(struct be_adapter *);
+extern void be_roce_dev_remove(struct be_adapter *);
+
+/*
+ * internal function to open-close roce device during ifup-ifdown.
+ */
+extern void be_roce_dev_open(struct be_adapter *);
+extern void be_roce_dev_close(struct be_adapter *);
+
 #endif /* BE_H */
diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.h 
b/drivers/net/ethernet/emulex/benet/be_cmds.h
index 687c420..b457532 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.h
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.h
@@ -1054,6 +1054,7 @@ struct be_cmd_resp_modify_eq_delay {
 /* The HW can come up in either of the following multi-channel modes
  * based on the skew/IPL.
  */
+#define RDMA_ENABLED   0x4
 #define FLEX10_MODE

[PATCH v1 0/9] ocrdma: Driver for Emulex OneConnect RDMA adapter

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

Thanks a lot for the review comments given on the earlier posting.
I have addressed those review comments in this PATCH v1.

The Emulex OneConnect adapter is an RDMA (RoCE) capable, multi-function
PCI Express device.
This patch series enables RoCE support on such adapters.

The ocrdma driver depends on the be2net NIC driver.
This series depends on the previously submitted be2net NIC driver patches.

Code organization:
- ocrdma.h   : driver header file.
- ocrdma_main.c  : driver registration with stack.
- ocrdma_sli.h   : driver-adapter interface definitions.
- ocrdma_hw.*: hardware specific initialization, mailbox cmds.
- ocrdma_verbs.* : verbs interface functionality.
- ocrdma_ah.*: address handle related functionaliy.
- ocrdma_abi.h   : user space library interaction definitions.

This patch is made against the current git tree.
Thank you.

Parav Pandit (9):
  ocrdma: Driver for Emulex OneConnect RDMA adapter
  ocrdma: Driver for Emulex OneConnect RDMA adapter
  ocrdma: Driver for Emulex OneConnect RDMA adapter
  ocrdma: Driver for Emulex OneConnect RDMA adapter
  ocrdma: Driver for Emulex OneConnect RDMA adapter
  ocrdma: Driver for Emulex OneConnect RDMA adapter
  ocrdma: Driver for Emulex OneConnect RDMA adapter
  ocrdma: Driver for Emulex OneConnect RDMA adapter
  ocrdma: Driver for Emulex OneConnect RDMA adapter

 drivers/infiniband/Kconfig  |1 +
 drivers/infiniband/Makefile |1 +
 drivers/infiniband/hw/ocrdma/Kconfig|8 +
 drivers/infiniband/hw/ocrdma/Makefile   |5 +
 drivers/infiniband/hw/ocrdma/ocrdma.h   |  377 
 drivers/infiniband/hw/ocrdma/ocrdma_abi.h   |  136 ++
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c|  172 ++
 drivers/infiniband/hw/ocrdma/ocrdma_ah.h|   42 +
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c| 2556 +++
 drivers/infiniband/hw/ocrdma/ocrdma_hw.h|  132 ++
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  |  622 +++
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h   | 1658 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2536 ++
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |   94 +
 14 files changed, 8340 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/hw/ocrdma/Kconfig
 create mode 100644 drivers/infiniband/hw/ocrdma/Makefile
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma.h
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_abi.h
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_ah.c
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_ah.h
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_hw.c
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_hw.h
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_main.c
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_sli.h
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h



[PATCH v1 1/9] ocrdma: Driver for Emulex OneConnect RDMA adapter

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

- Header file for device and resource specific data structures.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma.h |  377 +
 1 files changed, 377 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma.h

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h 
b/drivers/infiniband/hw/ocrdma/ocrdma.h
new file mode 100644
index 000..596cf74
--- /dev/null
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -0,0 +1,377 @@
+/***
+ * This file is part of the Emulex RoCE Device Driver for  *
+ * RoCE (RDMA over Converged Ethernet) adapters.   *
+ * Copyright (C) 2008-2012 Emulex. All rights reserved.*
+ * EMULEX and SLI are trademarks of Emulex.*
+ * www.emulex.com  *
+ * *
+ * This program is free software; you can redistribute it and/or   *
+ * modify it under the terms of version 2 of the GNU General   *
+ * Public License as published by the Free Software Foundation.*
+ * This program is distributed in the hope that it will be useful. *
+ * ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND  *
+ * WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY,  *
+ * FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT, ARE  *
+ * DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD *
+ * TO BE LEGALLY INVALID.  See the GNU General Public License for  *
+ * more details, a copy of which can be found in the file COPYING  *
+ * included with this package. *
+ *
+ * Contact Information:
+ * linux-driv...@emulex.com
+ *
+ * Emulex
+ *  Susan Street
+ * Costa Mesa, CA 92626
+ ***/
+
+#ifndef __OCRDMA_H__
+#define __OCRDMA_H__
+
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "be_roce.h"
+#include "ocrdma_sli.h"
+
+#define OCRDMA_ROCE_DEV_VERSION "1.0.0"
+#define OCRDMA_NODE_DESC "Emulex OneConnect RoCE HCA"
+
+#define OCRDMA_MAX_AH 512
+
+#define OCRDMA_UVERBS(CMD_NAME) (1ull << IB_USER_VERBS_CMD_##CMD_NAME)
+
+struct ocrdma_dev_attr {
+   u8 fw_ver[32];
+   u32 vendor_id;
+   u32 device_id;
+   u16 max_pd;
+   u16 max_cq;
+   u16 max_cqe;
+   u16 max_qp;
+   u16 max_wqe;
+   u16 max_rqe;
+   u32 max_inline_data;
+   int max_send_sge;
+   int max_recv_sge;
+   int max_mr;
+   u64 max_mr_size;
+   u32 max_num_mr_pbl;
+   int max_fmr;
+   int max_map_per_fmr;
+   int max_pages_per_frmr;
+   u16 max_ord_per_qp;
+   u16 max_ird_per_qp;
+
+   int device_cap_flags;
+   u8 cq_overflow_detect;
+   u8 srq_supported;
+
+   u32 wqe_size;
+   u32 rqe_size;
+   u32 ird_page_size;
+   u8 local_ca_ack_delay;
+   u8 ird;
+   u8 num_ird_pages;
+};
+
+struct ocrdma_pbl {
+   void *va;
+   dma_addr_t pa;
+};
+
+struct ocrdma_queue_info {
+   void *va;
+   dma_addr_t dma;
+   u32 size;
+   u16 len;
+   u16 entry_size; /* Size of an element in the queue */
+   u16 id; /* qid, where to ring the doorbell. */
+   u16 head, tail;
+   bool created;
+};
+
+struct ocrdma_eq {
+   struct ocrdma_queue_info q;
+   u32 vector;
+   int cq_cnt;
+   struct ocrdma_dev *dev;
+   char irq_name[32];
+};
+
+struct ocrdma_mq {
+   struct ocrdma_queue_info sq;
+   struct ocrdma_queue_info cq;
+   bool rearm_cq;
+};
+
+struct mqe_ctx {
+   struct mutex lock; /* for serializing mailbox commands on MQ */
+   wait_queue_head_t cmd_wait;
+   u32 tag;
+   u16 cqe_status;
+   u16 ext_status;
+   bool cmd_done;
+};
+
+struct ocrdma_dev {
+   struct ib_device ibdev;
+   struct ocrdma_dev_attr attr;
+
+   struct mutex dev_lock; /* provides syncronise access to device data */
+	spinlock_t flush_q_lock ____cacheline_aligned;
+
+   struct ocrdma_cq **cq_tbl;
+   struct ocrdma_qp **qp_tbl;
+
+   struct ocrdma_eq meq;
+   struct ocrdma_eq *qp_eq_tbl;
+   int eq_cnt;
+   u16 base_eqid;
+   u16 max_eq;
+
+   union ib_gid *sgid_tbl;
+   /* provided synchronization to sgid table for
+* updating gid entries triggered by notifier.
+*/
+   spinlock_t sgid_lock;
+
+   int gsi_qp_created;
+   struct ocrdma_cq *gsi_sqcq;
+   struct ocrdma_cq *gsi_rqcq;
+
+   struct {
+   struct ocrdma_av *va;
+   dma_addr_t pa;
+   u32 size;
+   u32 num_ah;
+   /* provide synchronization for av

[PATCH v1 2/9] ocrdma: Driver for Emulex OneConnect RDMA adapter

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

- Header file for userspace library and kernel driver interface.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_abi.h |  136 +
 1 files changed, 136 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_abi.h

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_abi.h 
b/drivers/infiniband/hw/ocrdma/ocrdma_abi.h
new file mode 100644
index 000..bac01db
--- /dev/null
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_abi.h
@@ -0,0 +1,136 @@
+/***
+ * This file is part of the Emulex RoCE Device Driver for  *
+ * RoCE (RDMA over Converged Ethernet) adapters.   *
+ * Copyright (C) 2008-2012 Emulex. All rights reserved.*
+ * EMULEX and SLI are trademarks of Emulex.*
+ * www.emulex.com  *
+ * *
+ * This program is free software; you can redistribute it and/or   *
+ * modify it under the terms of version 2 of the GNU General   *
+ * Public License as published by the Free Software Foundation.*
+ * This program is distributed in the hope that it will be useful. *
+ * ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND  *
+ * WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY,  *
+ * FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT, ARE  *
+ * DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD *
+ * TO BE LEGALLY INVALID.  See the GNU General Public License for  *
+ * more details, a copy of which can be found in the file COPYING  *
+ * included with this package. *
+ *
+ * Contact Information:
+ * linux-driv...@emulex.com
+ *
+ * Emulex
+ *  Susan Street
+ * Costa Mesa, CA 92626
+ ***/
+
+#ifndef __OCRDMA_ABI_H__
+#define __OCRDMA_ABI_H__
+
+/* user kernel communication data structures. */
+
+struct ocrdma_alloc_ucontext_resp {
+   u32 dev_id;
+   u32 wqe_size;
+   u32 max_inline_data;
+   u32 dpp_wqe_size;
+   u64 ah_tbl_page;
+   u32 ah_tbl_len;
+   u32 rqe_size;
+   u8 fw_ver[32];
+   /* for future use/new features in progress */
+   u64 rsvd1;
+   u64 rsvd2;
+};
+
+struct ocrdma_alloc_pd_ureq {
+   u64 rsvd1;
+};
+
+struct ocrdma_alloc_pd_uresp {
+   u32 id;
+   u32 dpp_enabled;
+   u32 dpp_page_addr_hi;
+   u32 dpp_page_addr_lo;
+   u64 rsvd1;
+};
+
+struct ocrdma_create_cq_ureq {
+   u32 dpp_cq;
+   u32 rsvd;   /* pad */
+} __packed;
+
+#define MAX_CQ_PAGES 8
+struct ocrdma_create_cq_uresp {
+   u32 cq_id;
+   u32 page_size;
+   u32 num_pages;
+   u32 max_hw_cqe;
+   u64 page_addr[MAX_CQ_PAGES];
+   u64 db_page_addr;
+   u32 db_page_size;
+   u32 phase_change;
+   /* for future use/new features in progress */
+   u64 rsvd1;
+   u64 rsvd2;
+};
+
+#define MAX_QP_PAGES 8
+#define MAX_UD_AV_PAGES 8
+
+struct ocrdma_create_qp_ureq {
+   u8 enable_dpp_cq;
+   u8 rsvd;
+   u16 dpp_cq_id;
+   u32 rsvd1;  /* pad */
+};
+
+struct ocrdma_create_qp_uresp {
+   u16 qp_id;
+   u16 sq_dbid;
+   u16 rq_dbid;
+   u16 resv0;  /* pad */
+   u32 sq_page_size;
+   u32 rq_page_size;
+   u32 num_sq_pages;
+   u32 num_rq_pages;
+   u64 sq_page_addr[MAX_QP_PAGES];
+   u64 rq_page_addr[MAX_QP_PAGES];
+   u64 db_page_addr;
+   u32 db_page_size;
+   u32 dpp_credit;
+   u32 dpp_offset;
+   u32 num_wqe_allocated;
+   u32 num_rqe_allocated;
+   u32 free_wqe_delta;
+   u32 free_rqe_delta;
+   u32 db_sq_offset;
+   u32 db_rq_offset;
+   u32 db_shift;
+   u64 rsvd1;
+   u64 rsvd2;
+} __packed;
+
+struct ocrdma_create_srq_uresp {
+   u16 rq_dbid;
+   u16 resv0;  /* pad */
+   u32 resv1;
+
+   u32 rq_page_size;
+   u32 num_rq_pages;
+
+   u64 rq_page_addr[MAX_QP_PAGES];
+   u64 db_page_addr;
+
+   u32 db_page_size;
+   u32 num_rqe_allocated;
+   u32 db_rq_offset;
+   u32 db_shift;
+
+   u32 free_rqe_delta;
+   u32 rsvd2;
+   u64 rsvd3;
+};
+
+#endif /* __OCRDMA_ABI_H__ */
-- 
1.6.0.2



[PATCH v1 3/9] ocrdma: Driver for Emulex OneConnect RDMA adapter

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

- Header file for driver-adapter interface.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_sli.h | 1658 +
 1 files changed, 1658 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_sli.h

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_sli.h 
b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
new file mode 100644
index 000..82f2656
--- /dev/null
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_sli.h
@@ -0,0 +1,1658 @@
+/***
+ * This file is part of the Emulex RoCE Device Driver for  *
+ * RoCE (RDMA over Converged Ethernet) adapters.   *
+ * Copyright (C) 2008-2012 Emulex. All rights reserved.*
+ * EMULEX and SLI are trademarks of Emulex.*
+ * www.emulex.com  *
+ * *
+ * This program is free software; you can redistribute it and/or   *
+ * modify it under the terms of version 2 of the GNU General   *
+ * Public License as published by the Free Software Foundation.*
+ * This program is distributed in the hope that it will be useful. *
+ * ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND  *
+ * WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY,  *
+ * FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT, ARE  *
+ * DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD *
+ * TO BE LEGALLY INVALID.  See the GNU General Public License for  *
+ * more details, a copy of which can be found in the file COPYING  *
+ * included with this package. *
+ *
+ * Contact Information:
+ * linux-driv...@emulex.com
+ *
+ * Emulex
+ *  Susan Street
+ * Costa Mesa, CA 92626
+ ***/
+
+#ifndef __OCRDMA_SLI_H__
+#define __OCRDMA_SLI_H__
+
+#define Bit(_b) (1 << (_b))
+
+#define OCRDMA_GEN1_FAMILY 0xB
+#define OCRDMA_GEN2_FAMILY 0x2
+
+#define OCRDMA_SUBSYS_ROCE 10
+enum {
+   OCRDMA_CMD_QUERY_CONFIG = 1,
+   OCRDMA_CMD_ALLOC_PD,
+   OCRDMA_CMD_DEALLOC_PD,
+
+   OCRDMA_CMD_CREATE_AH_TBL,
+   OCRDMA_CMD_DELETE_AH_TBL,
+
+   OCRDMA_CMD_CREATE_QP,
+   OCRDMA_CMD_QUERY_QP,
+   OCRDMA_CMD_MODIFY_QP,
+   OCRDMA_CMD_DELETE_QP,
+
+   OCRDMA_CMD_RSVD1,
+   OCRDMA_CMD_ALLOC_LKEY,
+   OCRDMA_CMD_DEALLOC_LKEY,
+   OCRDMA_CMD_REGISTER_NSMR,
+   OCRDMA_CMD_REREGISTER_NSMR,
+   OCRDMA_CMD_REGISTER_NSMR_CONT,
+   OCRDMA_CMD_QUERY_NSMR,
+   OCRDMA_CMD_ALLOC_MW,
+   OCRDMA_CMD_QUERY_MW,
+
+   OCRDMA_CMD_CREATE_SRQ,
+   OCRDMA_CMD_QUERY_SRQ,
+   OCRDMA_CMD_MODIFY_SRQ,
+   OCRDMA_CMD_DELETE_SRQ,
+
+   OCRDMA_CMD_ATTACH_MCAST,
+   OCRDMA_CMD_DETACH_MCAST,
+
+   OCRDMA_CMD_MAX
+};
+
+#define OCRDMA_SUBSYS_COMMON 1
+enum {
+   OCRDMA_CMD_CREATE_CQ= 12,
+   OCRDMA_CMD_CREATE_EQ= 13,
+   OCRDMA_CMD_CREATE_MQ= 21,
+   OCRDMA_CMD_GET_FW_VER   = 35,
+   OCRDMA_CMD_DELETE_MQ= 53,
+   OCRDMA_CMD_DELETE_CQ= 54,
+   OCRDMA_CMD_DELETE_EQ= 55,
+   OCRDMA_CMD_GET_FW_CONFIG= 58,
+   OCRDMA_CMD_CREATE_MQ_EXT= 90
+};
+
+enum {
+   QTYPE_EQ= 1,
+   QTYPE_CQ= 2,
+   QTYPE_MCCQ  = 3
+};
+
+#define OCRDMA_MAX_SGID (8)
+
+#define OCRDMA_MAX_QP	2048
+#define OCRDMA_MAX_CQ	2048
+
+enum {
+   OCRDMA_DB_RQ_OFFSET = 0xE0,
+   OCRDMA_DB_GEN2_RQ1_OFFSET   = 0x100,
+   OCRDMA_DB_GEN2_RQ2_OFFSET   = 0xC0,
+   OCRDMA_DB_SQ_OFFSET = 0x60,
+   OCRDMA_DB_GEN2_SQ_OFFSET= 0x1C0,
+   OCRDMA_DB_SRQ_OFFSET= OCRDMA_DB_RQ_OFFSET,
+   OCRDMA_DB_GEN2_SRQ_OFFSET   = OCRDMA_DB_GEN2_RQ1_OFFSET,
+   OCRDMA_DB_CQ_OFFSET = 0x120,
+   OCRDMA_DB_EQ_OFFSET = OCRDMA_DB_CQ_OFFSET,
+   OCRDMA_DB_MQ_OFFSET = 0x140
+};
+
+#define OCRDMA_DB_CQ_RING_ID_MASK   0x3FF  /* bits 0 - 9 */
+#define OCRDMA_DB_CQ_RING_ID_EXT_MASK  0x0C00  /* bits 10-11 of qid at 12-11 */
+/* qid #2 msbits at 12-11 */
+#define OCRDMA_DB_CQ_RING_ID_EXT_MASK_SHIFT  0x1
+#define OCRDMA_DB_CQ_NUM_POPPED_SHIFT   (16)   /* bits 16 - 28 */
+/* Rearm bit */
+#define OCRDMA_DB_CQ_REARM_SHIFT(29)   /* bit 29 */
+/* solicited bit */
+#define OCRDMA_DB_CQ_SOLICIT_SHIFT   (31)  /* bit 31 */
+
+#define OCRDMA_EQ_ID_MASK  0x1FF   /* bits 0 - 8 */
+#define OCRDMA_EQ_ID_EXT_MASK  0x3e00  /* bits 9-13 */
+#define OCRDMA_EQ_ID_EXT_MASK_SHIFT(2) /* qid bits 9-13 at 11-15 */
+
+/* Clear the interrupt for this eq */
+#define OCRDMA_EQ_CLR_SHIFT(9) /* bit 9

[PATCH v1 5/9] ocrdma: Driver for Emulex OneConnect RDMA adapter

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

- main file registering with Infiniband stack.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_main.c |  622 
 1 files changed, 622 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_main.c

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
new file mode 100644
index 000..b7574bb
--- /dev/null
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -0,0 +1,622 @@
+/***
+ * This file is part of the Emulex RoCE Device Driver for  *
+ * RoCE (RDMA over Converged Ethernet) adapters.   *
+ * Copyright (C) 2008-2012 Emulex. All rights reserved.*
+ * EMULEX and SLI are trademarks of Emulex.*
+ * www.emulex.com  *
+ * *
+ * This program is free software; you can redistribute it and/or   *
+ * modify it under the terms of version 2 of the GNU General   *
+ * Public License as published by the Free Software Foundation.*
+ * This program is distributed in the hope that it will be useful. *
+ * ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND  *
+ * WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY,  *
+ * FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT, ARE  *
+ * DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD *
+ * TO BE LEGALLY INVALID.  See the GNU General Public License for  *
+ * more details, a copy of which can be found in the file COPYING  *
+ * included with this package. *
+ *
+ * Contact Information:
+ * linux-driv...@emulex.com
+ *
+ * Emulex
+ *  Susan Street
+ * Costa Mesa, CA 92626
+ ***/
+
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/idr.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_user_verbs.h>
+#include <rdma/ib_addr.h>
+
+#include <linux/netdevice.h>
+#include <net/addrconf.h>
+
+#include "ocrdma.h"
+#include "ocrdma_verbs.h"
+#include "ocrdma_ah.h"
+#include "be_roce.h"
+#include "ocrdma_hw.h"
+
+MODULE_VERSION(OCRDMA_ROCE_DEV_VERSION);
+MODULE_DESCRIPTION("Emulex RoCE HCA Driver");
+MODULE_AUTHOR("Emulex Corporation");
+MODULE_LICENSE("GPL");
+
+static LIST_HEAD(ocrdma_dev_list);
+static DEFINE_MUTEX(ocrdma_devlist_lock);
+static DEFINE_IDR(ocrdma_dev_id);
+
+static union ib_gid ocrdma_zero_sgid;
+static int ocrdma_inet6addr_event(struct notifier_block *,
+ unsigned long, void *);
+
+static struct notifier_block ocrdma_inet6addr_notifier = {
+   .notifier_call = ocrdma_inet6addr_event
+};
+
+static inline void ocrdma_check_size(void)
+{
+   BUILD_BUG_ON(sizeof(struct ocrdma_mbx_hdr) != 16);
+   BUILD_BUG_ON(sizeof(struct ocrdma_mbx_rsp) != 16);
+   BUILD_BUG_ON(sizeof(struct ocrdma_mqe_sge) != 12);
+   BUILD_BUG_ON(sizeof(struct ocrdma_mqe_hdr) != 20);
+   BUILD_BUG_ON(sizeof(struct ocrdma_mqe_emb_cmd) != 236);
+   BUILD_BUG_ON(sizeof(struct ocrdma_mqe) != 256);
+   BUILD_BUG_ON(sizeof(struct ocrdma_delete_q_req) != 20);
+   BUILD_BUG_ON(sizeof(struct ocrdma_pa) != 8);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_eq_req) != 100);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_eq_rsp) != 20);
+   BUILD_BUG_ON(sizeof(struct ocrdma_mcqe) != 16);
+   BUILD_BUG_ON(sizeof(struct ocrdma_ae_mcqe) != 16);
+   BUILD_BUG_ON(sizeof(struct ocrdma_ae_mpa_mcqe) != 16);
+   BUILD_BUG_ON(sizeof(struct ocrdma_ae_qp_mcqe) != 16);
+   BUILD_BUG_ON(sizeof(struct ocrdma_mbx_query_config) != 124);
+   BUILD_BUG_ON(sizeof(struct ocrdma_fw_ver_rsp) != 68);
+   BUILD_BUG_ON(sizeof(struct ocrdma_fw_conf_rsp) != 176);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_cq_cmd) != 68);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_cq) != 88);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_cq_cmd_rsp) != 20);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_mq_v0) != 84);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_mq_v1) != 88);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_mq_req) != 104);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_mq_rsp) != 20);
+   BUILD_BUG_ON(sizeof(struct ocrdma_destroy_cq) != 40);
+   BUILD_BUG_ON(sizeof(struct ocrdma_destroy_cq_rsp) != 36);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_qp_req) != 236);
+   BUILD_BUG_ON(sizeof(struct ocrdma_create_qp_rsp) != 64);
+   BUILD_BUG_ON(sizeof(struct ocrdma_destroy_qp) != 40);
+   BUILD_BUG_ON(sizeof(struct ocrdma_destroy_qp_rsp) != 36);
+   BUILD_BUG_ON(sizeof(struct ocrdma_qp_params) != 88);
+   BUILD_BUG_ON(sizeof(struct ocrdma_modify_qp) != 136);
+   BUILD_BUG_ON(sizeof(struct ocrdma_modify_qp_rsp) != 44

[PATCH v1 8/9] ocrdma: Driver for Emulex OneConnect RDMA adapter

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

- build files for building ocrdma driver

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/Kconfig  |8 
 drivers/infiniband/hw/ocrdma/Makefile |5 +
 2 files changed, 13 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/hw/ocrdma/Kconfig
 create mode 100644 drivers/infiniband/hw/ocrdma/Makefile

diff --git a/drivers/infiniband/hw/ocrdma/Kconfig 
b/drivers/infiniband/hw/ocrdma/Kconfig
new file mode 100644
index 000..cf99342
--- /dev/null
+++ b/drivers/infiniband/hw/ocrdma/Kconfig
@@ -0,0 +1,8 @@
+config INFINIBAND_OCRDMA
+	tristate "Emulex One Connect HCA support"
+	depends on ETHERNET && NETDEVICES && PCI
+	select NET_VENDOR_EMULEX
+	select BE2NET
+	---help---
+	  This driver provides low-level InfiniBand over Ethernet
+	  support for Emulex One Connect host channel adapters (HCAs).
diff --git a/drivers/infiniband/hw/ocrdma/Makefile 
b/drivers/infiniband/hw/ocrdma/Makefile
new file mode 100644
index 000..06a5bed
--- /dev/null
+++ b/drivers/infiniband/hw/ocrdma/Makefile
@@ -0,0 +1,5 @@
+ccflags-y := -Idrivers/net/ethernet/emulex/benet
+
+obj-$(CONFIG_INFINIBAND_OCRDMA)+= ocrdma.o
+
+ocrdma-y :=ocrdma_main.o ocrdma_verbs.o ocrdma_hw.o ocrdma_ah.o
-- 
1.6.0.2



[PATCH v1 7/9] ocrdma: Driver for Emulex OneConnect RDMA adapter

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

- address handle specific handling.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |  172 ++
 drivers/infiniband/hw/ocrdma/ocrdma_ah.h |   42 +++
 2 files changed, 214 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_ah.c
 create mode 100644 drivers/infiniband/hw/ocrdma/ocrdma_ah.h

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
new file mode 100644
index 000..cca8e38
--- /dev/null
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
@@ -0,0 +1,172 @@
+/***
+ * This file is part of the Emulex RoCE Device Driver for  *
+ * RoCE (RDMA over Converged Ethernet) adapters.   *
+ * Copyright (C) 2008-2012 Emulex. All rights reserved.*
+ * EMULEX and SLI are trademarks of Emulex.*
+ * www.emulex.com  *
+ * *
+ * This program is free software; you can redistribute it and/or   *
+ * modify it under the terms of version 2 of the GNU General   *
+ * Public License as published by the Free Software Foundation.*
+ * This program is distributed in the hope that it will be useful. *
+ * ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND  *
+ * WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY,  *
+ * FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT, ARE  *
+ * DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD *
+ * TO BE LEGALLY INVALID.  See the GNU General Public License for  *
+ * more details, a copy of which can be found in the file COPYING  *
+ * included with this package. *
+ *
+ * Contact Information:
+ * linux-driv...@emulex.com
+ *
+ * Emulex
+ *  Susan Street
+ * Costa Mesa, CA 92626
+ ***/
+
+#include <net/neighbour.h>
+#include <net/netevent.h>
+
+#include <rdma/ib_addr.h>
+#include <rdma/ib_cache.h>
+
+#include "ocrdma.h"
+#include "ocrdma_verbs.h"
+#include "ocrdma_ah.h"
+#include "ocrdma_hw.h"
+
+static inline int set_av_attr(struct ocrdma_dev *dev, struct ocrdma_ah *ah,
+ struct ib_ah_attr *attr, int pdid)
+{
+   int status = 0;
+   u16 vlan_tag; bool vlan_enabled = false;
+   struct ocrdma_eth_vlan eth;
+   struct ocrdma_grh grh;
+   int eth_sz;
+
+	memset(&eth, 0, sizeof(eth));
+	memset(&grh, 0, sizeof(grh));
+
+	ah->sgid_index = attr->grh.sgid_index;
+
+	vlan_tag = rdma_get_vlan_id(&attr->grh.dgid);
+	if (vlan_tag && (vlan_tag < 0x1000)) {
+		eth.eth_type = cpu_to_be16(0x8100);
+		eth.roce_eth_type = cpu_to_be16(OCRDMA_ROCE_ETH_TYPE);
+		vlan_tag |= (attr->sl & 7) << 13;
+   eth.vlan_tag = cpu_to_be16(vlan_tag);
+   eth_sz = sizeof(struct ocrdma_eth_vlan);
+   vlan_enabled = true;
+   } else {
+   eth.eth_type = cpu_to_be16(OCRDMA_ROCE_ETH_TYPE);
+   eth_sz = sizeof(struct ocrdma_eth_basic);
+   }
+	memcpy(&eth.smac[0], &dev->nic_info.mac_addr[0], ETH_ALEN);
+	status = ocrdma_resolve_dgid(dev, &attr->grh.dgid, &eth.dmac[0]);
+	if (status)
+		return status;
+	status = ocrdma_query_gid(&dev->ibdev, 1, attr->grh.sgid_index,
+				  (union ib_gid *)&grh.sgid[0]);
+   if (status)
+   return status;
+
+	grh.tclass_flow = cpu_to_be32((6 << 28) |
+			(attr->grh.traffic_class << 24) |
+			attr->grh.flow_label);
+	/* 0x1b is next header value in GRH */
+	grh.pdid_hoplimit = cpu_to_be32((pdid << 16) |
+			(0x1b << 8) | attr->grh.hop_limit);
+
+	memcpy(&grh.dgid[0], attr->grh.dgid.raw, sizeof(attr->grh.dgid.raw));
+	memcpy(&ah->av->eth_hdr, &eth, eth_sz);
+	memcpy((u8 *)ah->av + eth_sz, &grh, sizeof(struct ocrdma_grh));
+	if (vlan_enabled)
+		ah->av->valid |= OCRDMA_AV_VLAN_VALID;
+   return status;
+}
+
+struct ib_ah *ocrdma_create_ah(struct ib_pd *ibpd, struct ib_ah_attr *attr)
+{
+   u32 *ahid_addr;
+   int status;
+   struct ocrdma_ah *ah;
+	struct ocrdma_pd *pd = get_ocrdma_pd(ibpd);
+	struct ocrdma_dev *dev = get_ocrdma_dev(ibpd->device);
+
+	if (!(attr->ah_flags & IB_AH_GRH))
+		return ERR_PTR(-EINVAL);
+
+	ah = kzalloc(sizeof *ah, GFP_ATOMIC);
+	if (!ah)
+		return ERR_PTR(-ENOMEM);
+
+	status = ocrdma_alloc_av(dev, ah);
+	if (status)
+		goto av_err;
+	status = set_av_attr(dev, ah, attr, pd->id);
+	if (status)
+		goto av_conf_err;
+
+   /* if pd is for the user process, pass the ah_id

[PATCH v1 9/9] ocrdma: Driver for Emulex OneConnect RDMA adapter

2012-03-26 Thread Parav Pandit
From: Parav Pandit parav.pan...@emulex.com

- top level build files to build ocrdma driver.

Signed-off-by: Parav Pandit parav.pan...@emulex.com
---
 drivers/infiniband/Kconfig  |1 +
 drivers/infiniband/Makefile |1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index eb0add3..a0f29c1 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -51,6 +51,7 @@ source drivers/infiniband/hw/cxgb3/Kconfig
 source drivers/infiniband/hw/cxgb4/Kconfig
 source drivers/infiniband/hw/mlx4/Kconfig
 source drivers/infiniband/hw/nes/Kconfig
+source drivers/infiniband/hw/ocrdma/Kconfig
 
 source drivers/infiniband/ulp/ipoib/Kconfig
 
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index a3b2d8e..bf846a1 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_INFINIBAND_CXGB3)  += hw/cxgb3/
 obj-$(CONFIG_INFINIBAND_CXGB4) += hw/cxgb4/
 obj-$(CONFIG_MLX4_INFINIBAND)  += hw/mlx4/
 obj-$(CONFIG_INFINIBAND_NES)   += hw/nes/
+obj-$(CONFIG_INFINIBAND_OCRDMA)+= hw/ocrdma/
 obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)   += ulp/srp/
 obj-$(CONFIG_INFINIBAND_SRPT)  += ulp/srpt/
-- 
1.6.0.2



RE: Does SDP on Chelsio RNIC in ofed 1.5.2 work?

2011-04-06 Thread Parav Pandit
Hi,

I really appreciate such quick support. Thanks a lot. This gives me direction
to explore some other protocols.

On a side line,
I dug into the OFED 1.3 source code mentioned on that site, with which the SDP
test was done for Chelsio adapters.
The SDP ULP was not using the FMR pool at that point in time, and it might have
worked because of that.

Now I am using a SLES 11 2.27.x kernel version and am not sure whether 1.3.x
will work straightforwardly.

Regards,
Parav Pandit


--- On Wed, 4/6/11, Jaszcza, Andrzej andrzej.jasz...@intel.com wrote:

 From: Jaszcza, Andrzej andrzej.jasz...@intel.com
 Subject: RE: Does SDP on Chelsio RNIC in ofed 1.5.2 work?
 To: Andrea Gozzelino andrea.gozzel...@lnl.infn.it, Parav Pandit 
 paravpan...@yahoo.com
 Cc: linux-rdma@vger.kernel.org linux-rdma@vger.kernel.org
 Date: Wednesday, April 6, 2011, 2:43 PM
 Hi,
 
 SDP is not officially supported on Intel NetEffect adapters
 at this moment.
 
 Thanks,
 Andrzej
 
 -Original Message-
 From: linux-rdma-ow...@vger.kernel.org
 [mailto:linux-rdma-ow...@vger.kernel.org]
 On Behalf Of Andrea Gozzelino
 Sent: Wednesday, April 06, 2011 10:53 AM
 To: Parav Pandit
 Cc: linux-rdma@vger.kernel.org
 Subject: Re: Does SDP on Chelsio RNIC in ofed 1.5.2 work?
 
 Hi Parav,
 
 I study SDP on NE020 cards from IntelNetEffect in 2010
 without success.
 Please refer to bug 2027 and 2028 on bugzilla.
 You can search on the web discussions with my name in 2010
 related to SDP.
 
 I hope that can help you!
 Regards,
 Andrea
 
 Andrea Gozzelino
 
 PhD Università di Padova
 Dipartimento di Fisica Galileo Galilei
 Via Marzolo, 8 - I - 35131 - Padova (PD) - Italia Ufficio @
 Padova: 138
 Tel: +39 049 8277103
 
 INFN - Laboratori Nazionali di Legnaro   
 (LNL)
 Viale dell'Universita' 2 - I - 35020 - Legnaro (PD)- Italia
 Ufficio @ LNL: E-101
 Tel: +39 049 8068346
 Fax: +39 049 641925
         
 
 On Apr 06, 2011 10:06 AM, Parav Pandit paravpan...@yahoo.com
 wrote:
 
  Hi,
  
  Can anyone please help in running SDP on iWarp
 adapters?
  
  Sbould I use Neteffect adapter to run SDP or change
 some configuration 
  on Chelsio?
  
  Regards,
  Parav Pandit
  
  
  --- On Tue, 4/5/11, Parav Pandit paravpan...@yahoo.com
 wrote:
  
   From: Parav Pandit paravpan...@yahoo.com
   Subject: Does SDP on Chelsio RNIC in ofed 1.5.2
 work?
   To: linux-rdma@vger.kernel.org
   Date: Tuesday, April 5, 2011, 6:44 PM Hi,
   
   I am having Chelsio T310 adapters connected via
 10G switch in 
   servers using OFED 1.5.2.
   
   I am trying to measure SDP performance over iWarp
 on Chelsio 
   adapters.
   but I find below error message from the
 /var/log/messages.
   
   Apr  5 23:28:11 linux kernel: [ 2066.238932]
 eth8:
   link up, 10Gbps, full-duplex
   Apr  5 23:28:11 linux kernel: [ 2066.240016]
   ADDRCONF(NETDEV_CHANGE): eth8: link
   becomes ready
   Apr  5 23:28:15 linux kernel: [ 2070.629530]
 iw_cxgb3:
   Chelsio T3 RDMA Driver -
   version 1.1
   Apr  5 23:28:15 linux kernel: [ 2070.642134]
 fmr_pool:
   Device cxgb3_0 does not
   support FMRs
   Apr  5 23:28:15 linux kernel: [ 2070.642137]
 Error creating fmr pool 
   Apr  5 23:28:15 linux kernel: [ 2070.642142]
 iw_cxgb3:
   Initialized device
   :05:00.0
   Apr  5 23:28:18 linux kernel: [ 2073.551607]
   sdp_post_recv:236 sdp_sock( 3142:2
   41209:0): ib_post_recv failed. status -22
   
   I find some discussion on previous ofed list in
 2008 about broken 
   support for same.
  
   http://www.mail-archive.com/general@lists.openfabrics.org/msg10720.h

   tml
   
   Below link seem to calculate the SDP performance
 over Chelsio iWARP.
   http://hpc.ufl.edu/benchmarks/iwarp_sdp/

   
   Can anyone please tell me what am I missing? I
 have done SDP 
   configuration from the above link.
   
   Regards,
   Parav Pandit
   
   
  --
  To unsubscribe from this list: send the line
 unsubscribe linux-rdma
  in
  the body of a message to majord...@vger.kernel.org
 More majordomo info 
  at  http://vger.kernel.org/majordomo-info.html

  
 
 
            
             
 

Does SDP on Chelsio RNIC in ofed 1.5.2 work?

2011-04-05 Thread Parav Pandit
Hi,

I have Chelsio T310 adapters connected via a 10G switch in servers using
OFED 1.5.2.

I am trying to measure SDP performance over iWARP on Chelsio adapters,
but I see the below error messages in /var/log/messages.

Apr  5 23:28:11 linux kernel: [ 2066.238932] eth8: link up, 10Gbps, full-duplex
Apr  5 23:28:11 linux kernel: [ 2066.240016] ADDRCONF(NETDEV_CHANGE): eth8: link
becomes ready
Apr  5 23:28:15 linux kernel: [ 2070.629530] iw_cxgb3: Chelsio T3 RDMA Driver -
version 1.1
Apr  5 23:28:15 linux kernel: [ 2070.642134] fmr_pool: Device cxgb3_0 does not
support FMRs
Apr  5 23:28:15 linux kernel: [ 2070.642137] Error creating fmr pool
Apr  5 23:28:15 linux kernel: [ 2070.642142] iw_cxgb3: Initialized device
:05:00.0
Apr  5 23:28:18 linux kernel: [ 2073.551607] sdp_post_recv:236 sdp_sock( 3142:2
41209:0): ib_post_recv failed. status -22

I found some discussion on the OFED list from 2008 about broken support for the
same:
http://www.mail-archive.com/general@lists.openfabrics.org/msg10720.html

The link below seems to measure SDP performance over Chelsio iWARP:
http://hpc.ufl.edu/benchmarks/iwarp_sdp/

Can anyone please tell me what I am missing? I have done the SDP configuration
from the above link.

Regards,
Parav Pandit



GIT tree for OFED 1.5.x latest source code

2010-12-24 Thread Parav Pandit
Hi,

Can anyone please confirm which is the latest OFED 1.5.x tree against which
enhancements, fixes, and new functionality/module patches can be applied after
review?

Is it the same tree that will be used for merging with kernel.org?

Can you please confirm whether it is the one below?
http://git.openfabrics.org/git?p=ofed_1_5/linux-2.6.git;a=summary

Regards,
Parav Pandit



  


how new GIDs are notified to IB stack in OFED 1.5.1?

2010-09-27 Thread Parav Pandit
Hi,

As we know, GIDs are based on IPv6 addresses, and GID table entries are updated
on the fly when new IPv6 addresses are assigned to eth and VLAN-based eth
interfaces.

How does the IB stack get to know about new GIDs that are added to the table,
so that query_gid() can be called with the right index?
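
For context, the pattern the ocrdma patches above use is: an inet6addr
notifier updates the driver's own SGID table, and the driver then raises
IB_EVENT_GID_CHANGE so consumers know to re-read the table through
query_gid(). A minimal sketch of the dispatch side (kernel context assumed):

#include <rdma/ib_verbs.h>

/* After the low-level driver has updated its SGID table it tells the IB
 * core, which fans the event out to registered event handlers.
 */
static void notify_gid_change(struct ib_device *ibdev, u8 port_num)
{
	struct ib_event ev;

	ev.device = ibdev;
	ev.element.port_num = port_num;
	ev.event = IB_EVENT_GID_CHANGE;
	ib_dispatch_event(&ev);
}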

Regards,
Parav Pandit



  