Re: [RFC] split struct ib_send_wr

2015-08-13 Thread Doug Ledford
On 08/13/2015 01:54 AM, Christoph Hellwig wrote:
 On Wed, Aug 12, 2015 at 07:24:44PM -0700, Chuck Lever wrote:
 That makes sense, but you already Acked the change that breaks Lustre,
 and it's going in through the NFS tree. Are you changing that to a NAK?

No.  Lustre fits in my languishing in the staging tree category.

 It seems like Doug was mostly concerned about to-be-removed drivers.
 I definitely refuse to fix Lustre for anything I touch because it's
 such a giant mess which uses just about every major subsystem in
 an incorrect way.
 
 Doug:  was your mail a request to fix up the two de-staged drivers?
 I'm happy to do that if you're fine with the patch in general.  amso1100
 should be trivial anyway, while ipath is a mess, just like the new intel
 driver with the third copy of the soft ib stack.

Correct.


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD






Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names

2015-08-13 Thread Sagi Grimberg


Nic is silent...

Sagi, do you have an ETA on when you can have the recode ready for detailed 
review and test? If we can't make linux-4.3, can we be early in staging it for 
linux-4.4?


Hi Steve,

I have something, but it's not remotely close to being submission-ready.
This ended up being a rewrite of the registration path, which is pretty
convoluted at the moment. My aim is mostly to simplify it in a way that
makes iWARP support (almost) straightforward...

I can send you my WIP to test.

Sagi.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IB/hfi1: Remove inline from trace functions

2015-08-13 Thread Mike Marciniszyn
From: Dennis Dalessandro dennis.dalessan...@intel.com

inline in trace functions causes the following build error when
CONFIG_OPTIMIZE_INLINING is not defined in the kernel config:
error: function can never be inlined because it uses
variable argument lists

Reported by 0-day build:
https://lists.01.org/pipermail/kbuild-all/2015-August/011215.html

This patch converts the hfi1 trace functions to non-inline versions.

Reviewed-by: Jubin John jubin.j...@intel.com
Reviewed-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
---
 drivers/staging/hfi1/trace.c |   15 +++-
 drivers/staging/hfi1/trace.h |   51 --
 2 files changed, 32 insertions(+), 34 deletions(-)

diff --git a/drivers/staging/hfi1/trace.c b/drivers/staging/hfi1/trace.c
index afbb212..ea95591 100644
--- a/drivers/staging/hfi1/trace.c
+++ b/drivers/staging/hfi1/trace.c
@@ -48,7 +48,6 @@
  *
  */
 #define CREATE_TRACE_POINTS
-#define HFI1_TRACE_DO_NOT_CREATE_INLINES
 #include "trace.h"
 
 u8 ibhdr_exhdr_len(struct hfi1_ib_header *hdr)
@@ -208,4 +207,16 @@ const char *print_u64_array(
return ret;
 }
 
-#undef HFI1_TRACE_DO_NOT_CREATE_INLINES
+__hfi1_trace_fn(PKT);
+__hfi1_trace_fn(PROC);
+__hfi1_trace_fn(SDMA);
+__hfi1_trace_fn(LINKVERB);
+__hfi1_trace_fn(DEBUG);
+__hfi1_trace_fn(SNOOP);
+__hfi1_trace_fn(CNTR);
+__hfi1_trace_fn(PIO);
+__hfi1_trace_fn(DC8051);
+__hfi1_trace_fn(FIRMWARE);
+__hfi1_trace_fn(RCVCTRL);
+__hfi1_trace_fn(TID);
+
diff --git a/drivers/staging/hfi1/trace.h b/drivers/staging/hfi1/trace.h
index 5c34606..05c7ce8 100644
--- a/drivers/staging/hfi1/trace.h
+++ b/drivers/staging/hfi1/trace.h
@@ -1339,22 +1339,17 @@ DECLARE_EVENT_CLASS(hfi1_trace_template,
 
 /*
  * It may be nice to macroize the __hfi1_trace but the va_* stuff requires an
- * actual function to work and can not be in a macro. Also the fmt can not be a
- * constant char * because we need to be able to manipulate the \n if it is
- * present.
+ * actual function to work and can not be in a macro.
  */
-#define __hfi1_trace_event(lvl) \
+#define __hfi1_trace_def(lvl) \
+void __hfi1_trace_##lvl(const char *funct, char *fmt, ...);\
+   \
 DEFINE_EVENT(hfi1_trace_template, hfi1_ ##lvl, \
TP_PROTO(const char *function, struct va_format *vaf),  \
TP_ARGS(function, vaf))
 
-#ifdef HFI1_TRACE_DO_NOT_CREATE_INLINES
-#define __hfi1_trace_fn(fn) __hfi1_trace_event(fn)
-#else
-#define __hfi1_trace_fn(fn) \
-__hfi1_trace_event(fn); \
-__printf(2, 3) \
-static inline void __hfi1_trace_##fn(const char *func, char *fmt, ...) \
+#define __hfi1_trace_fn(lvl) \
+void __hfi1_trace_##lvl(const char *func, char *fmt, ...)  \
 {  \
struct va_format vaf = {\
.fmt = fmt, \
@@ -1363,36 +1358,28 @@ static inline void __hfi1_trace_##fn(const char *func, char *fmt, ...)  \
\
va_start(args, fmt);\
vaf.va = args; \
-   trace_hfi1_ ##fn(func, vaf);   \
+   trace_hfi1_ ##lvl(func, vaf);  \
va_end(args);   \
return; \
 }
-#endif
 
 /*
  * To create a new trace level simply define it as below. This will create all
  * the hooks for calling hfi1_cdbg(LVL, fmt, ...); as well as take care of all
  * the debugfs stuff.
  */
-__hfi1_trace_fn(RVPKT);
-__hfi1_trace_fn(INIT);
-__hfi1_trace_fn(VERB);
-__hfi1_trace_fn(PKT);
-__hfi1_trace_fn(PROC);
-__hfi1_trace_fn(MM);
-__hfi1_trace_fn(ERRPKT);
-__hfi1_trace_fn(SDMA);
-__hfi1_trace_fn(VPKT);
-__hfi1_trace_fn(LINKVERB);
-__hfi1_trace_fn(VERBOSE);
-__hfi1_trace_fn(DEBUG);
-__hfi1_trace_fn(SNOOP);
-__hfi1_trace_fn(CNTR);
-__hfi1_trace_fn(PIO);
-__hfi1_trace_fn(DC8051);
-__hfi1_trace_fn(FIRMWARE);
-__hfi1_trace_fn(RCVCTRL);
-__hfi1_trace_fn(TID);
+__hfi1_trace_def(PKT);
+__hfi1_trace_def(PROC);
+__hfi1_trace_def(SDMA);
+__hfi1_trace_def(LINKVERB);
+__hfi1_trace_def(DEBUG);
+__hfi1_trace_def(SNOOP);
+__hfi1_trace_def(CNTR);
+__hfi1_trace_def(PIO);
+__hfi1_trace_def(DC8051);
+__hfi1_trace_def(FIRMWARE);
+__hfi1_trace_def(RCVCTRL);
+__hfi1_trace_def(TID);
 
 #define hfi1_cdbg(which, fmt, ...) \
__hfi1_trace_##which(__func__, fmt, ##__VA_ARGS__)



Re: [RFC] split struct ib_send_wr

2015-08-13 Thread Jason Gunthorpe
On Thu, Aug 13, 2015 at 09:04:39AM -0700, Christoph Hellwig wrote:
 On Thu, Aug 13, 2015 at 09:07:14AM -0400, Doug Ledford wrote:
   Doug:  was your mail a request to fix up the two de-staged drivers?
   I'm happy to do that if you're fine with the patch in general.  amso1100
   should be trivial anyway, while ipath is a mess, just like the new intel
   driver with the third copy of the soft ib stack.
  
  Correct.
 
 http://git.infradead.org/users/hch/rdma.git/commitdiff/efb2b0f21645b9caabcce955481ab6966e52ad90
 
 contains the updates for ipath and amso1100, as well as the reviewed-by
 and tested-by tags.

The uverbs change needs to drop/move the original kmalloc:

next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
   user_wr->num_sge * sizeof (struct ib_sge),
   GFP_KERNEL);

It looks like it is leaking that allocation right now. Every path
replaces next with the result of alloc_mr..

Noticed a couple of trailing whitespaces too..

Jason


Re: [PATCH] IB/hfi1: Remove inline from trace functions

2015-08-13 Thread Jason Gunthorpe
On Thu, Aug 13, 2015 at 10:06:14AM -0400, Mike Marciniszyn wrote:
 From: Dennis Dalessandro dennis.dalessan...@intel.com
 
 inline in trace functions causes the following build error when
 CONFIG_OPTIMIZE_INLINING is not defined in the kernel config:
 error: function can never be inlined because it uses
 variable argument lists

There are all manner of tracing things in the kernel. Does this driver
really need a custom designed one?

Jason


Re: [RFC] split struct ib_send_wr

2015-08-13 Thread Jason Gunthorpe
On Thu, Aug 13, 2015 at 10:53:54AM -0700, Christoph Hellwig wrote:
 http://git.infradead.org/users/hch/rdma.git/commitdiff/5d7e6fa563dae32d4b6f63e29e3795717a545f11

For the core bits:

Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com

Jason


[ANNOUNCE] dapl-2.1.6-1 release

2015-08-13 Thread Davis, Arlin R
New release for uDAPL (2.1.6) is available at 
http://downloads.openfabrics.org/dapl/

Vlad, please pull into OFED 3.18-1 

md5sum: dce3ef7c943807d35bcb26dae72b1d88 dapl-2.1.6.tar.gz

For the v2.1 package, install the RPM packages as follows:

dapl-2.1.6-1
dapl-utils-2.1.6-1
dapl-devel-2.1.6-1
dapl-debuginfo-2.1.6-1

Release notes: 
http://downloads.openfabrics.org/dapl/documentation/uDAPL_release_notes.txt

Summary:

Release 2.1.6
ucm: add cluster size environments to adjust CM timers
mpxyd: proxy_in data transfers can improperly start before RTU received
mcm: forward open/query for MFO devices in query only mode
mpxyd: byte swap incorrect on WRC wr_len
dtest: remove ERR message from flush QP function
dapltest: Quit command with -n port number will core dump
config: update dat.conf for MFO qib devices, 2 adapters/ports
mpxyd: add MFO support on proxy side
mcm: add MFO proxy commands, device, and CM support
mcm: add MFO support to openib_common code base
mcm: add full offload (MFO) mode to provider to support qib on MIC
dtest: pre-allocated buffer too small for RMR, DTO ops timeout
mpxyd: fix buffer initialization when no-inline support is active
mpxyd: reduce log level on qp_flush to CM level
mcm: intra-node proxy missing LID setup on rejects
mcm: add intra-node support via ibscif device and mcm provider
mcm: provide MIC address info with proxy device open
mcm: add device info to non-debug log
common: add DAPL_DTO_TYPE_EXTENSION_IMM for rdma_write_imm DTO type checking
mpxyd: fix up some of the PI logging
dtest: modify rdma_write_with_msg to support uni-direction streaming
mcm,mpxyd: fix dreq processing to defer QP flush when proxy WRs still pending
mpxyd: update byte_len and comp_cnt for PO to remote HST communications
mcm: bug fixes for non-inline devices
mcm: return CM_rej with CM_req_in errors
mpxyd,mcm: RDMA write with immed data not signaled on request side
mcm: add WC opcode and wc_flags in debug log message
mpxyd: set options bug fix for mcm_ib_inline

Release commit:  
http://git.openfabrics.org/?p=~ardavis/dapl.git;a=commit;h=91febc42f0070b2b9eaa81c0c113c6ff7ab8ea60

Regards, Arlin


Re: [RFC] split struct ib_send_wr

2015-08-13 Thread Chuck Lever

On Aug 13, 2015, at 9:04 AM, Christoph Hellwig h...@infradead.org wrote:

 On Thu, Aug 13, 2015 at 09:07:14AM -0400, Doug Ledford wrote:
 Doug:  was your mail a request to fix up the two de-staged drivers?
 I'm happy to do that if you're fine with the patch in general.  amso1100
 should be trivial anyway, while ipath is a mess, just like the new intel
 driver with the third copy of the soft ib stack.
 
 Correct.
 
 http://git.infradead.org/users/hch/rdma.git/commitdiff/efb2b0f21645b9caabcce955481ab6966e52ad90
 
 contains the updates for ipath and amso1100, as well as the reviewed-by
 and tested-by tags.
 
 Note that for now I've skipped the new intel hfi1 driver as updating
 two of the soft ib codebases already was tiresome enough.

This looks like a straightforward mechanical change. For the
hunks under net/sunrpc/xprtrdma/ :

Reviewed-by: Chuck Lever chuck.le...@oracle.com

--
Chuck Lever





Re: [RFC] split struct ib_send_wr

2015-08-13 Thread Christoph Hellwig
On Thu, Aug 13, 2015 at 11:22:34AM -0600, Jason Gunthorpe wrote:
 The uverbs change needs to drop/move the original kmalloc:
 
   next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
  user_wr->num_sge * sizeof (struct ib_sge),
  GFP_KERNEL);
 
 It looks like it is leaking that allocation right now. Every path
 replaces next with the result of alloc_mr..

Thanks.  It should be gone, and indeed was in my first version.  Not
sure how it sneaked back in during a rebase.

 Noticed a couple of trailing whitespaces too..

checkpatch found two of them, which I've fixed now.

New version at:

http://git.infradead.org/users/hch/rdma.git/commitdiff/5d7e6fa563dae32d4b6f63e29e3795717a545f11



[RFC PATCH 5/8 v2] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM v3.

2015-08-13 Thread Jérôme Glisse
This adds new core InfiniBand structures and helpers to implement ODP (on
demand paging) on top of HMM. We need to retain the tree of ib_umem
structures because some hardware associates a unique identifier with each
umem (or MR) and only allows the hardware page table to be updated using
this unique id.

Changed since v1:
  - Adapt to new hmm_mirror lifetime rules.
  - Fix scan of existing mirror in ib_umem_odp_get().

Changed since v2:
  - Remove FIXME for empty umem as it is an invalid case.
  - Fix HMM version of ib_umem_odp_release()

Signed-off-by: Jérôme Glisse jgli...@redhat.com
Signed-off-by: John Hubbard jhubb...@nvidia.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/umem_odp.c| 145 ++
 drivers/infiniband/core/uverbs_cmd.c  |   1 +
 drivers/infiniband/core/uverbs_main.c |   6 ++
 include/rdma/ib_umem_odp.h|  27 +++
 include/rdma/ib_verbs.h   |  12 +++
 5 files changed, 191 insertions(+)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index d3b65d4..bcbc2c2 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -42,7 +42,152 @@
 #include <rdma/ib_umem_odp.h>
 
 #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+
+static void ib_mirror_destroy(struct kref *kref)
+{
+   struct ib_mirror *ib_mirror;
+   struct ib_device *ib_device;
+
+   ib_mirror = container_of(kref, struct ib_mirror, kref);
+
+   ib_device = ib_mirror->ib_device;
+   mutex_lock(&ib_device->hmm_mutex);
+   list_del_init(&ib_mirror->list);
+   mutex_unlock(&ib_device->hmm_mutex);
+
+   /* hmm_mirror_unregister() will free the structure. */
+   hmm_mirror_unregister(&ib_mirror->base);
+}
+
+void ib_mirror_unref(struct ib_mirror *ib_mirror)
+{
+   if (ib_mirror == NULL)
+   return;
+
+   kref_put(&ib_mirror->kref, ib_mirror_destroy);
+}
+EXPORT_SYMBOL(ib_mirror_unref);
+
+static inline struct ib_mirror *ib_mirror_ref(struct ib_mirror *ib_mirror)
+{
+   if (!ib_mirror || !kref_get_unless_zero(&ib_mirror->kref))
+   return NULL;
+   return ib_mirror;
+}
+
+int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
+{
+   struct mm_struct *mm = get_task_mm(current);
+   struct ib_device *ib_device = context->device;
+   struct ib_mirror *ib_mirror;
+   struct pid *our_pid;
+   int ret;
+
+   if (!mm || !ib_device->hmm_ready)
+   return -EINVAL;
+
+   /* This can not happen ! */
+   if (unlikely(ib_umem_start(umem) == ib_umem_end(umem)))
+   return -EINVAL;
+
+   /* Prevent creating ODP MRs in child processes */
+   rcu_read_lock();
+   our_pid = get_task_pid(current->group_leader, PIDTYPE_PID);
+   rcu_read_unlock();
+   put_pid(our_pid);
+   if (context->tgid != our_pid) {
+   mmput(mm);
+   return -EINVAL;
+   }
+
+   umem->hugetlb = 0;
+   umem->odp_data = kmalloc(sizeof(*umem->odp_data), GFP_KERNEL);
+   if (umem->odp_data == NULL) {
+   mmput(mm);
+   return -ENOMEM;
+   }
+   umem->odp_data->private = NULL;
+   umem->odp_data->umem = umem;
+
+   mutex_lock(&ib_device->hmm_mutex);
+   /* Is there an existing mirror for this process mm ? */
+   ib_mirror = ib_mirror_ref(context->ib_mirror);
+   if (!ib_mirror) {
+   struct ib_mirror *tmp;
+
+   list_for_each_entry(tmp, &ib_device->ib_mirrors, list) {
+   if (tmp->base.hmm->mm != mm)
+   continue;
+   ib_mirror = ib_mirror_ref(tmp);
+   break;
+   }
+   }
+
+   if (!ib_mirror) {
+   /* We need to create a new mirror. */
+   ib_mirror = kmalloc(sizeof(*ib_mirror), GFP_KERNEL);
+   if (!ib_mirror) {
+   mutex_unlock(&ib_device->hmm_mutex);
+   mmput(mm);
+   return -ENOMEM;
+   }
+   kref_init(&ib_mirror->kref);
+   init_rwsem(&ib_mirror->hmm_mr_rwsem);
+   ib_mirror->umem_tree = RB_ROOT;
+   ib_mirror->ib_device = ib_device;
+
+   ib_mirror->base.device = &ib_device->hmm_dev;
+   ret = hmm_mirror_register(&ib_mirror->base);
+   if (ret) {
+   mutex_unlock(&ib_device->hmm_mutex);
+   kfree(ib_mirror);
+   mmput(mm);
+   return ret;
+   }
+
+   list_add(&ib_mirror->list, &ib_device->ib_mirrors);
+   context->ib_mirror = ib_mirror_ref(ib_mirror);
+   }
+   mutex_unlock(&ib_device->hmm_mutex);
+   umem->odp_data.ib_mirror = ib_mirror;
+
+   down_write(&ib_mirror->umem_rwsem);
+   rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
+   up_write(&ib_mirror->umem_rwsem);
+
+  

[RFC PATCH 4/8 v2] IB/odp/hmm: prepare for HMM code path.

2015-08-13 Thread Jérôme Glisse
This is a preparatory patch for the HMM implementation of ODP (on demand
paging). It shuffles code around that will be shared between the current
ODP implementation and the HMM code path. It also converts many #ifdef
CONFIG_* blocks to #if IS_ENABLED().

Signed-off-by: Jérôme Glisse jgli...@redhat.com
---
 drivers/infiniband/core/umem_odp.c   |   3 +
 drivers/infiniband/core/uverbs_cmd.c |  24 --
 drivers/infiniband/hw/mlx5/main.c|  13 ++-
 drivers/infiniband/hw/mlx5/mem.c |  11 ++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  14 ++--
 drivers/infiniband/hw/mlx5/mr.c  |  19 +++--
 drivers/infiniband/hw/mlx5/odp.c | 118 ++-
 drivers/infiniband/hw/mlx5/qp.c  |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/qp.c |   8 +-
 include/rdma/ib_umem_odp.h   |  51 +++-
 include/rdma/ib_verbs.h  |   7 +-
 12 files changed, 159 insertions(+), 115 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 0541761..d3b65d4 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,6 +41,8 @@
 #include <rdma/ib_umem.h>
 #include <rdma/ib_umem_odp.h>
 
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 static void ib_umem_notifier_start_account(struct ib_umem *item)
 {
	mutex_lock(&item->odp_data->umem_mutex);
@@ -667,3 +669,4 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 virt,
	mutex_unlock(&umem->odp_data->umem_mutex);
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index bbb02ff..53163aa 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -289,9 +289,12 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
struct ib_uverbs_get_context_resp resp;
struct ib_udata   udata;
	struct ib_device *ibdev = file->device->ib_dev;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
struct ib_device_attr dev_attr;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
struct ib_ucontext   *ucontext;
struct file  *filp;
int ret;
@@ -334,7 +337,9 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
rcu_read_unlock();
ucontext-closing = 0;
 
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
	ucontext->umem_tree = RB_ROOT;
	init_rwsem(&ucontext->umem_rwsem);
	ucontext->odp_mrs_count = 0;
@@ -345,8 +350,8 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
goto err_free;
	if (!(dev_attr.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
	ucontext->invalidate_range = NULL;
-
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
	resp.num_comp_vectors = file->device->num_comp_vectors;
 
@@ -3438,7 +3443,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
	if (ucore->outlen < resp.response_length + sizeof(resp.odp_caps))
goto end;
 
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
resp.odp_caps.general_caps = attr.odp_caps.general_caps;
resp.odp_caps.per_transport_caps.rc_odp_caps =
attr.odp_caps.per_transport_caps.rc_odp_caps;
@@ -3447,9 +3454,10 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
resp.odp_caps.per_transport_caps.ud_odp_caps =
attr.odp_caps.per_transport_caps.ud_odp_caps;
resp.odp_caps.reserved = 0;
-#else
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
	memset(&resp.odp_caps, 0, sizeof(resp.odp_caps));
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
resp.response_length += sizeof(resp.odp_caps);
 
	if (ucore->outlen < resp.response_length + sizeof(resp.timestamp_mask))
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 085c24b..da31c70 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -293,11 +293,14 @@ static int mlx5_ib_query_device(struct ib_device 

[RFC PATCH 6/8 v2] IB/mlx5/hmm: add mlx5 HMM device initialization and callback v3.

2015-08-13 Thread Jérôme Glisse
This adds the core HMM callbacks for the mlx5 device driver and initializes
the HMM device for the mlx5 InfiniBand device driver.

Changed since v1:
  - Adapt to new hmm_mirror lifetime rules.
  - HMM_ISDIRTY no longer exist.

Changed since v2:
  - Adapt to HMM page table changes.

Signed-off-by: Jérôme Glisse jgli...@redhat.com
Signed-off-by: John Hubbard jhubb...@nvidia.com
---
 drivers/infiniband/core/umem_odp.c   |  10 +-
 drivers/infiniband/hw/mlx5/main.c|   5 +
 drivers/infiniband/hw/mlx5/mem.c |  38 
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  17 
 drivers/infiniband/hw/mlx5/mr.c  |   7 ++
 drivers/infiniband/hw/mlx5/odp.c | 174 +++
 include/rdma/ib_umem_odp.h   |  17 
 7 files changed, 264 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index bcbc2c2..b7dd8228 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -132,7 +132,7 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
return -ENOMEM;
}
	kref_init(&ib_mirror->kref);
-	init_rwsem(&ib_mirror->hmm_mr_rwsem);
+	init_rwsem(&ib_mirror->umem_rwsem);
	ib_mirror->umem_tree = RB_ROOT;
	ib_mirror->ib_device = ib_device;
 
@@ -149,10 +149,11 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
	context->ib_mirror = ib_mirror_ref(ib_mirror);
	}
	mutex_unlock(&ib_device->hmm_mutex);
-	umem->odp_data.ib_mirror = ib_mirror;
+	umem->odp_data->ib_mirror = ib_mirror;

	down_write(&ib_mirror->umem_rwsem);
-	rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
+	rbt_ib_umem_insert(&umem->odp_data->interval_tree,
+			   &ib_mirror->umem_tree);
	up_write(&ib_mirror->umem_rwsem);
 
mmput(mm);
@@ -178,7 +179,8 @@ void ib_umem_odp_release(struct ib_umem *umem)
 * range covered by one and only one umem while holding the umem rwsem.
 */
	down_write(&ib_mirror->umem_rwsem);
-	rbt_ib_umem_remove(&umem->odp_data->interval_tree, &mirror->umem_tree);
+	rbt_ib_umem_remove(&umem->odp_data->interval_tree,
+			   &ib_mirror->umem_tree);
	up_write(&ib_mirror->umem_rwsem);
 
ib_mirror_unref(ib_mirror);
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index da31c70..32ed2f1 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1530,6 +1530,9 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
if (err)
goto err_rsrc;
 
+   /* If HMM initialization fails we just do not enable odp. */
+	mlx5_dev_init_odp_hmm(&dev->ib_dev, &mdev->pdev->dev);
+
	err = ib_register_device(&dev->ib_dev, NULL);
if (err)
goto err_odp;
@@ -1554,6 +1557,7 @@ err_umrc:
 
 err_dev:
	ib_unregister_device(&dev->ib_dev);
+	mlx5_dev_fini_odp_hmm(&dev->ib_dev);
 
 err_odp:
mlx5_ib_odp_remove_one(dev);
@@ -1573,6 +1577,7 @@ static void mlx5_ib_remove(struct mlx5_core_dev *mdev, 
void *context)
 
	ib_unregister_device(&dev->ib_dev);
	destroy_umrc_res(dev);
+	mlx5_dev_fini_odp_hmm(&dev->ib_dev);
	mlx5_ib_odp_remove_one(dev);
	destroy_dev_resources(&dev->devr);
	ib_dealloc_device(&dev->ib_dev);
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 19354b6..0d74eac 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -154,6 +154,8 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct 
ib_umem *umem,
__be64 *pas, int access_flags, void *data)
 {
	unsigned long umem_page_shift = ilog2(umem->page_size);
+	unsigned long start = ib_umem_start(umem) + (offset << PAGE_SHIFT);
+	unsigned long end = start + (num_pages << PAGE_SHIFT);
	int shift = page_shift - umem_page_shift;
	int mask = (1 << shift) - 1;
int i, k;
@@ -164,6 +166,42 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, 
struct ib_umem *umem,
int entry;
 #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
 #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+   if (umem->odp_data) {
+   struct ib_mirror *ib_mirror = umem->odp_data->ib_mirror;
+   struct hmm_mirror *mirror = &ib_mirror->base;
+   struct hmm_pt_iter *iter = data, local_iter;
+   unsigned long addr;
+
+   if (iter == NULL) {
+   iter = &local_iter;
+   hmm_pt_iter_init(iter, &mirror->pt);
+   }
+
+   for (i = 0, addr = start; i < num_pages; ++i, addr += PAGE_SIZE) {
+   unsigned long next = end;
+   dma_addr_t *ptep, pte;
+
+   /* Get and 

Re: [GIT PULL] pull request: linux-firmware: Add Intel OPA hfi1 firmware

2015-08-13 Thread ira.weiny
On Tue, Aug 11, 2015 at 05:35:53PM -0600, Jason Gunthorpe wrote:
 On Tue, Aug 11, 2015 at 10:47:03PM +, Vogel, Steve wrote:
  The license terms allow anyone to distribute (but not sell) the firmware but
  only for use on Intel products. 
 
 Redistribution alone may be enough to be included in linux-firmware
 
 However, most of the additional terms (and there are lots of them)
 this imposes beyond the usual likely make it impossible to include in a
 distro, so pragmatically, there is no reason to push for inclusion in
 linux-firmware.
 
 This is going to be a hard road for you guys. Falling in line with
 every other Intel firmware blob's (i915, ibt, iwlwifi, SST2) license
 would be much easier on you and the distros.
 
 Frankly, I think the onus is on you to get statements from the
 licensing teams at Fedora, Debian, RH and SuSE on if they can include
 work under this license or not.
 
 I suspect Fedora and Debian will both say they can't, just based on
 their public policies and the additional restrictions in this
 license.. But hey, I'm not a licensing lawyer..
 

I just noticed that the email from Steve that Jason replied to did not make it
to the lists.

Here is the text from Steve for reference.

<quote>
Here is an interpretation of the grant language:

2.1 Grant.  Subject to Your compliance with the terms of this Agreement, and
the limitations set forth in Section 2.2, Intel hereby grants You, during the
term of this Agreement, a non-transferable, non-exclusive, non-sublicenseable
(except as expressly set forth below), limited right and license:
(A) under Intel’s copyrights, to:
(1) reproduce and execute the Software only for internal use with Intel
Products, including designing products for Intel Products; this license does
not include the right to sublicense, and may be exercised only within Your
facilities by Your employees;

[This allows anyone obtaining the software to
make copies and use the software, but not to re-license it.]


(2) distribute the unmodified Software only in Object Code, only for use with
Intel Products; this license includes the right to sublicense, but only the
rights to execute the Software and only under Intel’s End User Software License
Agreement attached as Attachment B, without the right to further sublicense;


[This allows anyone to re-distribute the software for use on Intel products and
requires them to re-distribute with the license in Attachment B.]


</quote>

Ira



RE: [PATCH v2] IB/hfi1: Remove inline from trace functions

2015-08-13 Thread Marciniszyn, Mike
 Subject: [PATCH v2] IB/hfi1: Remove inline from trace functions
 

v2 adjusts some of the comment text to clarify adding new traces.


[RFC PATCH 0/8 v2] Implement ODP using HMM v2

2015-08-13 Thread Jérôme Glisse
Posting just for comment, still waiting on HMM to be accepted before
this patchset can be considered for inclusion.

This patchset implements the on-demand paging feature using HMM. It
depends on the HMM patchset v10 (previously posted (1)). The long-term
plan is to replace ODP with HMM, allowing the same code infrastructure
to be shared across different classes of devices.

HMM (Heterogeneous Memory Management) is a helper layer for devices
that want to mirror a process address space into their own MMU. The
main target is GPUs, but other hardware, like network devices, can
also make use of HMM.

Tree with the patchset:
git://people.freedesktop.org/~glisse/linux hmm-v10 branch

(1) Previous patchset posting :
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423
v5 https://lkml.org/lkml/2014/11/3/759
v6 http://lwn.net/Articles/619737/
v7 http://lwn.net/Articles/627316/
v8 https://lwn.net/Articles/645515/
v9 https://lwn.net/Articles/651553/

Cheers,
Jérôme

To: linux-rdma@vger.kernel.org,
To: linux-ker...@vger.kernel.org,
Cc: Kevin E Martin k...@redhat.com,
Cc: Christophe Harle cha...@nvidia.com,
Cc: Duncan Poole dpo...@nvidia.com,
Cc: Sherry Cheung sche...@nvidia.com,
Cc: Subhash Gutti sgu...@nvidia.com,
Cc: John Hubbard jhubb...@nvidia.com,
Cc: Mark Hairgrove mhairgr...@nvidia.com,
Cc: Lucien Dunning ldunn...@nvidia.com,
Cc: Cameron Buschardt cabuscha...@nvidia.com,
Cc: Arvind Gopalakrishnan arvi...@nvidia.com,
Cc: Haggai Eran hagg...@mellanox.com,
Cc: Or Gerlitz ogerl...@mellanox.com,
Cc: Sagi Grimberg sa...@mellanox.com
Cc: Shachar Raindel rain...@mellanox.com,
Cc: Liran Liss lir...@mellanox.com,
Cc: Roland Dreier rol...@purestorage.com,
Cc: Sander, Ben ben.san...@amd.com,
Cc: Stoner, Greg greg.sto...@amd.com,
Cc: Bridgman, John john.bridg...@amd.com,
Cc: Mantor, Michael michael.man...@amd.com,
Cc: Blinzer, Paul paul.blin...@amd.com,
Cc: Morichetti, Laurent laurent.moriche...@amd.com,
Cc: Deucher, Alexander alexander.deuc...@amd.com,
Cc: Leonid Shamis leonid.sha...@amd.com,



[RFC PATCH 7/8 v2] IB/mlx5/hmm: add page fault support for ODP on HMM v2.

2015-08-13 Thread Jérôme Glisse
This patch adds HMM-specific support for hardware page faulting of
user memory regions.

Changed since v1:
  - Adapt to HMM page table changes.
  - Turn some sanity test to BUG_ON().

Signed-off-by: Jérôme Glisse jgli...@redhat.com
---
 drivers/infiniband/hw/mlx5/odp.c | 144 ++-
 1 file changed, 143 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 5ef31da..658bfca 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -54,6 +54,52 @@ static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct 
mlx5_ib_dev *dev,
 
 #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
 
+struct mlx5_hmm_pfault {
+   struct mlx5_ib_mr   *mlx5_ib_mr;
+   u64 start_idx;
+   dma_addr_t  access_mask;
+   unsignednpages;
+   struct hmm_eventevent;
+};
+
+static int mlx5_hmm_pfault(struct mlx5_ib_dev *mlx5_ib_dev,
+  struct hmm_mirror *mirror,
+  const struct hmm_event *event)
+{
+   struct mlx5_hmm_pfault *pfault;
+   struct hmm_pt_iter iter;
+   unsigned long addr, cnt;
+   int ret;
+
+	pfault = container_of(event, struct mlx5_hmm_pfault, event);
+	hmm_pt_iter_init(&iter, &mirror->pt);
+
+	for (addr = event->start, cnt = 0; addr < event->end;
+	     addr += PAGE_SIZE, ++cnt) {
+		unsigned long next = event->end;
+		dma_addr_t *ptep;
+
+		/* Get and lock pointer to mirror page table. */
+		ptep = hmm_pt_iter_lookup(&iter, addr, &next);
+		BUG_ON(!ptep);
+		for (; ptep && addr < next; addr += PAGE_SIZE, ptep++) {
+			/* This could be BUG_ON() as it can not happen. */
+			BUG_ON(!hmm_pte_test_valid_dma(ptep));
+			BUG_ON((pfault->access_mask & ODP_WRITE_ALLOWED_BIT) &&
+				!hmm_pte_test_write(ptep));
+			if (hmm_pte_test_write(ptep))
+				hmm_pte_set_bit(ptep, ODP_WRITE_ALLOWED_SHIFT);
+			hmm_pte_set_bit(ptep, ODP_READ_ALLOWED_SHIFT);
+			pfault->npages++;
+		}
+	}
+	ret = mlx5_ib_update_mtt(pfault->mlx5_ib_mr,
+				 pfault->start_idx,
+				 cnt, 0, &iter);
+	hmm_pt_iter_fini(&iter);
+	return ret;
+}
+
 int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start,
u64 end, void *cookie)
 {
@@ -179,12 +225,19 @@ static int mlx5_hmm_update(struct hmm_mirror *mirror,
struct hmm_event *event)
 {
 	struct device *device = mirror->device->dev;
+	struct mlx5_ib_dev *mlx5_ib_dev;
+	struct ib_device *ib_device;
 	int ret = 0;
 
+	ib_device = container_of(mirror->device, struct ib_device, hmm_dev);
+	mlx5_ib_dev = to_mdev(ib_device);
+
 	switch (event->etype) {
case HMM_DEVICE_RFAULT:
case HMM_DEVICE_WFAULT:
-   /* FIXME implement. */
+   ret = mlx5_hmm_pfault(mlx5_ib_dev, mirror, event);
+   if (ret)
+   return ret;
break;
case HMM_NONE:
default:
@@ -227,6 +280,95 @@ void mlx5_dev_fini_odp_hmm(struct ib_device *ib_device)
 	hmm_device_unregister(&ib_device->hmm_dev);
 }
 
+/*
+ * Handle a single data segment in a page-fault WQE.
+ *
+ * Returns number of pages retrieved on success. The caller will continue to
+ * the next data segment.
+ * Can return the following error codes:
+ * -EAGAIN to designate a temporary error. The caller will abort handling the
+ *  page fault and resolve it.
+ * -EFAULT when there's an error mapping the requested pages. The caller will
+ *  abort the page fault handling and possibly move the QP to an error state.
+ * On other errors the QP should also be closed with an error.
+ */
+static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
+struct mlx5_ib_pfault *pfault,
+u32 key, u64 io_virt, size_t bcnt,
+u32 *bytes_mapped)
+{
+	struct mlx5_ib_dev *mlx5_ib_dev = to_mdev(qp->ibqp.pd->device);
+	struct ib_mirror *ib_mirror;
+	struct mlx5_hmm_pfault hmm_pfault;
+	int srcu_key;
+	int ret = 0;
+
+	srcu_key = srcu_read_lock(&mlx5_ib_dev->mr_srcu);
+	hmm_pfault.mlx5_ib_mr = mlx5_ib_odp_find_mr_lkey(mlx5_ib_dev, key);
+	/*
+	 * If we didn't find the MR, it means the MR was closed while we were
+	 * handling the ODP event. In this case we return -EFAULT so that the
+	 * QP will be closed.
+	 */
+	if (!hmm_pfault.mlx5_ib_mr || !hmm_pfault.mlx5_ib_mr->ibmr.pd) {
+   pr_err(Failed to find relevant mr for 

[RFC PATCH 8/8 v2] IB/mlx5/hmm: enable ODP using HMM v2.

2015-08-13 Thread Jérôme Glisse
All pieces are in place for ODP (on demand paging) to work using HMM.
Add kernel option and final code to enable it.

Changed since v1:
  - Added kernel option in this last patch of the series.

Signed-off-by: Jérôme Glisse jgli...@redhat.com
---
 drivers/infiniband/Kconfig   | 10 ++
 drivers/infiniband/core/uverbs_cmd.c |  3 ---
 drivers/infiniband/hw/mlx5/main.c|  4 
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index b899531..764f524 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -49,6 +49,16 @@ config INFINIBAND_ON_DEMAND_PAGING
  memory regions without pinning their pages, fetching the
  pages on demand instead.
 
+config INFINIBAND_ON_DEMAND_PAGING_HMM
+	bool "InfiniBand on-demand paging support using HMM."
+   depends on HMM
+   depends on INFINIBAND_ON_DEMAND_PAGING
+   default n
+   ---help---
+ Use HMM (heterogeneous memory management) kernel API for
+ on demand paging. No userspace difference, this is just
+ an alternative implementation of the feature.
+
 config INFINIBAND_ADDR_TRANS
bool
depends on INFINIBAND
diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index 1db6a17..c3e14a8 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -3445,8 +3445,6 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
goto end;
 
 #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
-#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
-#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
resp.odp_caps.general_caps = attr.odp_caps.general_caps;
resp.odp_caps.per_transport_caps.rc_odp_caps =
attr.odp_caps.per_transport_caps.rc_odp_caps;
@@ -3455,7 +3453,6 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
resp.odp_caps.per_transport_caps.ud_odp_caps =
attr.odp_caps.per_transport_caps.ud_odp_caps;
resp.odp_caps.reserved = 0;
-#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 	memset(&resp.odp_caps, 0, sizeof(resp.odp_caps));
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 32ed2f1..c340c3a 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -295,6 +295,10 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 
 #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
 #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+	if (MLX5_CAP_GEN(mdev, pg) && ibdev->hmm_ready) {
+		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
+		props->odp_caps = dev->odp_caps;
+	}
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 	if (MLX5_CAP_GEN(mdev, pg))
 		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
-- 
1.9.3



RE: [PATCH] IB/hfi1: Remove inline from trace functions

2015-08-13 Thread Marciniszyn, Mike
  From: Dennis Dalessandro dennis.dalessan...@intel.com
 
  inline in trace functions causes the following build error when
  CONFIG_OPTIMIZE_INLINING is not defined in the kernel config:
  error: function can never be inlined because it uses variable argument
  lists
 
 There are all manner of tracing things in the kernel. Does this driver really
 need a custom designed one?
 

All of our trace infrastructure is built out of events/tracepoints, so we are 
not inventing anything new here.

The fast path traces are built out of tracepoints.

The *_cdbg() ones are intended for slow path code and, compared to the native 
trace points, make it easier to add new trace capabilities.

Mike


[PATCH] RDMA/cma: fix IPv6 address resolution

2015-08-13 Thread Spencer Baugh
Resolving a link-local IPv6 address with an unspecified source address
was broken by commit 5462eddd7a, which prevented the IPv6 stack from
learning the scope id of the link-local IPv6 address, causing random
failures as the IP stack chose a random link to resolve the address on.

This commit 5462eddd7a made us bail out of cma_check_linklocal early if
the address passed in was not an IPv6 link-local address. On the address
resolution path, the address passed in is the source address; if the
source address is the unspecified address, which is not link-local, we
will bail out early.

This is mostly correct, but if the destination address is a link-local
address, then we will be following a link-local route, and we'll need to
tell the IPv6 stack what the scope id of the destination address is.
This used to be done by the last line of cma_check_linklocal, which is
skipped when bailing out early:

dev_addr->bound_dev_if = sin6->sin6_scope_id;

(In cma_bind_addr, the sin6_scope_id of the source address is set to the
sin6_scope_id of the destination address, so this is correct)
This line is required in turn for the following line, L279 of
addr6_resolve, to actually inform the IPv6 stack of the scope id:

  fl6.flowi6_oif = addr->bound_dev_if;

Since we can only know we are in this failure case when we have access
to both the source IPv6 address and destination IPv6 address, we have to
deal with this further up the stack. So detect this failure case in
cma_bind_addr, and set bound_dev_if to the destination address scope id
to correct it.

Signed-off-by: Spencer Baugh sba...@catern.com
---
 drivers/infiniband/core/cma.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 6a6b60a..3b71154 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2188,8 +2188,11 @@ static int cma_bind_addr(struct rdma_cm_id *id, struct 
sockaddr *src_addr,
 	src_addr = (struct sockaddr *) &id->route.addr.src_addr;
 	src_addr->sa_family = dst_addr->sa_family;
 	if (dst_addr->sa_family == AF_INET6) {
-		((struct sockaddr_in6 *) src_addr)->sin6_scope_id =
-			((struct sockaddr_in6 *) dst_addr)->sin6_scope_id;
+		struct sockaddr_in6 *src_addr6 = (struct sockaddr_in6 *) src_addr;
+		struct sockaddr_in6 *dst_addr6 = (struct sockaddr_in6 *) dst_addr;
+		src_addr6->sin6_scope_id = dst_addr6->sin6_scope_id;
+		if (ipv6_addr_type(&dst_addr6->sin6_addr) & IPV6_ADDR_LINKLOCAL)
+			id->route.addr.dev_addr.bound_dev_if = dst_addr6->sin6_scope_id;
 	} else if (dst_addr->sa_family == AF_IB) {
 		((struct sockaddr_ib *) src_addr)->sib_pkey =
 			((struct sockaddr_ib *) dst_addr)->sib_pkey;
-- 
2.5.0.rc3



Re: [RFC PATCH 4/8 v2] IB/odp/hmm: prepare for HMM code path.

2015-08-13 Thread Jerome Glisse
On Thu, Aug 13, 2015 at 02:13:35PM -0600, Jason Gunthorpe wrote:
 On Thu, Aug 13, 2015 at 03:20:49PM -0400, Jérôme Glisse wrote:
   
  +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
  +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
 Yuk, what is wrong with
 
  #if !IS_ENABLED(...)
 
 ?

Just that later patches add code between the #if and the #else, and that
originally it was a bigger patch that added the #if, the code, and the
#else at the same time. Hence why this patch looks like this.

 
  -#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
  +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
  +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
  +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
 
 Double yuk
 
  #if !(IS_ENABLED(..)  IS_ENABLED(..))
 
 ?

Same reason as above.


 And the #ifdefs suck, as many as possible should be normal if
 statements, and one should think carefully if we really need to remove
 fields from structures..

My patch only adds #if; I am not responsible for the previous code
that used #ifdef. I was told to convert to #if and that's what
I am doing.

Regarding fields, yes, this is intentional. ODP is an infrastructure
that is private to InfiniBand and thus needs more fields inside the ib
structs, while HMM is intended to be a common infrastructure, not only
for IB devices but for other kinds of devices too.

Cheers,
Jérôme


[RFC PATCH 3/8 v2] IB/odp: export rbt_ib_umem_for_each_in_range()

2015-08-13 Thread Jérôme Glisse
The mlx5 driver will need this function for its driver specific bit
of ODP (on demand paging) on HMM (Heterogeneous Memory Management).

Signed-off-by: Jérôme Glisse jgli...@redhat.com
---
 drivers/infiniband/core/umem_rbtree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/core/umem_rbtree.c 
b/drivers/infiniband/core/umem_rbtree.c
index 727d788..f030ec0 100644
--- a/drivers/infiniband/core/umem_rbtree.c
+++ b/drivers/infiniband/core/umem_rbtree.c
@@ -92,3 +92,4 @@ int rbt_ib_umem_for_each_in_range(struct rb_root *root,
 
return ret_val;
 }
+EXPORT_SYMBOL(rbt_ib_umem_for_each_in_range);
-- 
1.9.3



[RFC PATCH 1/8 v2] IB/mlx5: add a new parameter to __mlx_ib_populated_pas for ODP with HMM.

2015-08-13 Thread Jérôme Glisse
When using HMM for ODP it will be useful to pass the current mirror
page table iterator to the __mlx5_ib_populate_pas() function. Add
a void * parameter for this.

Signed-off-by: Jérôme Glisse jgli...@redhat.com
---
 drivers/infiniband/hw/mlx5/mem.c | 8 +---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +-
 drivers/infiniband/hw/mlx5/mr.c  | 2 +-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 40df2cc..df56b7d 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -145,11 +145,13 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
  * num_pages - total number of pages to fill
  * pas - bus addresses array to fill
  * access_flags - access flags to set on all present pages.
- use enum mlx5_ib_mtt_access_flags for this.
+ *use enum mlx5_ib_mtt_access_flags for this.
+ * data - intended for odp with hmm, it should point to current mirror page
+ *table iterator.
  */
 void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
int page_shift, size_t offset, size_t num_pages,
-   __be64 *pas, int access_flags)
+   __be64 *pas, int access_flags, void *data)
 {
 	unsigned long umem_page_shift = ilog2(umem->page_size);
int shift = page_shift - umem_page_shift;
@@ -201,7 +203,7 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct 
ib_umem *umem,
 {
return __mlx5_ib_populate_pas(dev, umem, page_shift, 0,
  ib_umem_num_pages(umem), pas,
- access_flags);
+ access_flags, NULL);
 }
 int mlx5_ib_get_buf_offset(u64 addr, int page_shift, u32 *offset)
 {
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 7cae098..d4dbd8e 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -622,7 +622,7 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int 
*count, int *shift,
int *ncont, int *order);
 void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
int page_shift, size_t offset, size_t num_pages,
-   __be64 *pas, int access_flags);
+   __be64 *pas, int access_flags, void *data);
 void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
  int page_shift, __be64 *pas, int access_flags);
 void mlx5_ib_copy_pas(u64 *old, u64 *new, int step, int num);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index bc9a0de..ef63e5f 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -912,7 +912,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 
start_page_index, int npages,
if (!zap) {
__mlx5_ib_populate_pas(dev, umem, PAGE_SHIFT,
   start_page_index, npages, pas,
-  MLX5_IB_MTT_PRESENT);
+  MLX5_IB_MTT_PRESENT, NULL);
/* Clear padding after the pages brought from the
 * umem. */
memset(pas + npages, 0, size - npages * sizeof(u64));
-- 
1.9.3



[RFC PATCH 2/8 v2] IB/mlx5: add a new parameter to mlx5_ib_update_mtt() for ODP with HMM.

2015-08-13 Thread Jérôme Glisse
When using HMM for ODP it will be useful to pass the current mirror
page table iterator to the mlx5_ib_update_mtt() function. Add
a void * parameter for this.

Signed-off-by: Jérôme Glisse jgli...@redhat.com
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +-
 drivers/infiniband/hw/mlx5/mr.c  | 4 ++--
 drivers/infiniband/hw/mlx5/odp.c | 8 +---
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index d4dbd8e..79d1e7c 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -571,7 +571,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
  u64 virt_addr, int access_flags,
  struct ib_udata *udata);
 int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index,
-  int npages, int zap);
+  int npages, int zap, void *data);
 int mlx5_ib_dereg_mr(struct ib_mr *ibmr);
 int mlx5_ib_destroy_mr(struct ib_mr *ibmr);
 struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index ef63e5f..3ad371d 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -845,7 +845,7 @@ free_mr:
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
-  int zap)
+  int zap, void *data)
 {
 	struct mlx5_ib_dev *dev = mr->dev;
 	struct device *ddev = dev->ib_dev.dma_device;
@@ -912,7 +912,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 
start_page_index, int npages,
if (!zap) {
__mlx5_ib_populate_pas(dev, umem, PAGE_SHIFT,
   start_page_index, npages, pas,
-  MLX5_IB_MTT_PRESENT, NULL);
+  MLX5_IB_MTT_PRESENT, data);
/* Clear padding after the pages brought from the
 * umem. */
memset(pas + npages, 0, size - npages * sizeof(u64));
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index aa8391e..df86d05 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -91,14 +91,15 @@ void mlx5_ib_invalidate_range(struct ib_umem *umem, 
unsigned long start,
 
 			if (in_block && umr_offset == 0) {
mlx5_ib_update_mtt(mr, blk_start_idx,
-  idx - blk_start_idx, 1);
+  idx - blk_start_idx, 1,
+  NULL);
in_block = 0;
}
}
}
if (in_block)
mlx5_ib_update_mtt(mr, blk_start_idx, idx - blk_start_idx + 1,
-  1);
+  1, NULL);
 
/*
 * We are now sure that the device will not access the
@@ -249,7 +250,8 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp 
*qp,
 * this MR, since ib_umem_odp_map_dma_pages already
 * checks this.
 */
-   ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0);
+   ret = mlx5_ib_update_mtt(mr, start_idx,
+npages, 0, NULL);
} else {
ret = -EAGAIN;
}
-- 
1.9.3



[PATCH v2] IB/hfi1: Remove inline from trace functions

2015-08-13 Thread Mike Marciniszyn
From: Dennis Dalessandro dennis.dalessan...@intel.com

Inline in trace functions causes the following build error when
CONFIG_OPTIMIZE_INLINING is not defined in the kernel config:
error: function can never be inlined because it uses
variable argument lists

Reported by 0-day build:
https://lists.01.org/pipermail/kbuild-all/2015-August/011215.html

This patch converts to a non-inline version of the hfi1 trace functions

Reviewed-by: Jubin John jubin.j...@intel.com
Reviewed-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
---
 drivers/staging/hfi1/trace.c |   15 ++-
 drivers/staging/hfi1/trace.h |   56 +-
 2 files changed, 35 insertions(+), 36 deletions(-)

diff --git a/drivers/staging/hfi1/trace.c b/drivers/staging/hfi1/trace.c
index afbb212..ea95591 100644
--- a/drivers/staging/hfi1/trace.c
+++ b/drivers/staging/hfi1/trace.c
@@ -48,7 +48,6 @@
  *
  */
 #define CREATE_TRACE_POINTS
-#define HFI1_TRACE_DO_NOT_CREATE_INLINES
 #include trace.h
 
 u8 ibhdr_exhdr_len(struct hfi1_ib_header *hdr)
@@ -208,4 +207,16 @@ const char *print_u64_array(
return ret;
 }
 
-#undef HFI1_TRACE_DO_NOT_CREATE_INLINES
+__hfi1_trace_fn(PKT);
+__hfi1_trace_fn(PROC);
+__hfi1_trace_fn(SDMA);
+__hfi1_trace_fn(LINKVERB);
+__hfi1_trace_fn(DEBUG);
+__hfi1_trace_fn(SNOOP);
+__hfi1_trace_fn(CNTR);
+__hfi1_trace_fn(PIO);
+__hfi1_trace_fn(DC8051);
+__hfi1_trace_fn(FIRMWARE);
+__hfi1_trace_fn(RCVCTRL);
+__hfi1_trace_fn(TID);
+
diff --git a/drivers/staging/hfi1/trace.h b/drivers/staging/hfi1/trace.h
index 5c34606..d7851c0 100644
--- a/drivers/staging/hfi1/trace.h
+++ b/drivers/staging/hfi1/trace.h
@@ -1339,22 +1339,17 @@ DECLARE_EVENT_CLASS(hfi1_trace_template,
 
 /*
  * It may be nice to macroize the __hfi1_trace but the va_* stuff requires an
- * actual function to work and can not be in a macro. Also the fmt can not be a
- * constant char * because we need to be able to manipulate the \n if it is
- * present.
+ * actual function to work and can not be in a macro.
  */
-#define __hfi1_trace_event(lvl) \
+#define __hfi1_trace_def(lvl) \
+void __hfi1_trace_##lvl(const char *funct, char *fmt, ...);\
+   \
 DEFINE_EVENT(hfi1_trace_template, hfi1_ ##lvl, \
TP_PROTO(const char *function, struct va_format *vaf),  \
TP_ARGS(function, vaf))
 
-#ifdef HFI1_TRACE_DO_NOT_CREATE_INLINES
-#define __hfi1_trace_fn(fn) __hfi1_trace_event(fn)
-#else
-#define __hfi1_trace_fn(fn) \
-__hfi1_trace_event(fn); \
-__printf(2, 3) \
-static inline void __hfi1_trace_##fn(const char *func, char *fmt, ...) \
+#define __hfi1_trace_fn(lvl) \
+void __hfi1_trace_##lvl(const char *func, char *fmt, ...)  \
 {  \
struct va_format vaf = {\
.fmt = fmt, \
@@ -1363,36 +1358,29 @@ static inline void __hfi1_trace_##fn(const char *func, 
char *fmt, ...)  \
\
va_start(args, fmt);\
 	vaf.va = &args; \
-	trace_hfi1_ ##fn(func, &vaf);   \
+	trace_hfi1_ ##lvl(func, &vaf);  \
va_end(args);   \
return; \
 }
-#endif
 
 /*
- * To create a new trace level simply define it as below. This will create all
- * the hooks for calling hfi1_cdbg(LVL, fmt, ...); as well as take care of all
+ * To create a new trace level simply define it below and as a __hfi1_trace_fn
+ * in trace.c. This will create all the hooks for calling
+ * hfi1_cdbg(LVL, fmt, ...); as well as take care of all
  * the debugfs stuff.
  */
-__hfi1_trace_fn(RVPKT);
-__hfi1_trace_fn(INIT);
-__hfi1_trace_fn(VERB);
-__hfi1_trace_fn(PKT);
-__hfi1_trace_fn(PROC);
-__hfi1_trace_fn(MM);
-__hfi1_trace_fn(ERRPKT);
-__hfi1_trace_fn(SDMA);
-__hfi1_trace_fn(VPKT);
-__hfi1_trace_fn(LINKVERB);
-__hfi1_trace_fn(VERBOSE);
-__hfi1_trace_fn(DEBUG);
-__hfi1_trace_fn(SNOOP);
-__hfi1_trace_fn(CNTR);
-__hfi1_trace_fn(PIO);
-__hfi1_trace_fn(DC8051);
-__hfi1_trace_fn(FIRMWARE);
-__hfi1_trace_fn(RCVCTRL);
-__hfi1_trace_fn(TID);
+__hfi1_trace_def(PKT);
+__hfi1_trace_def(PROC);
+__hfi1_trace_def(SDMA);
+__hfi1_trace_def(LINKVERB);
+__hfi1_trace_def(DEBUG);
+__hfi1_trace_def(SNOOP);
+__hfi1_trace_def(CNTR);
+__hfi1_trace_def(PIO);
+__hfi1_trace_def(DC8051);
+__hfi1_trace_def(FIRMWARE);
+__hfi1_trace_def(RCVCTRL);
+__hfi1_trace_def(TID);
 
 #define hfi1_cdbg(which, fmt, ...) \
__hfi1_trace_##which(__func__, fmt, ##__VA_ARGS__)


Re: [RFC PATCH 4/8 v2] IB/odp/hmm: prepare for HMM code path.

2015-08-13 Thread Jason Gunthorpe
On Thu, Aug 13, 2015 at 03:20:49PM -0400, Jérôme Glisse wrote:
  
 +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
 +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */

Yuk, what is wrong with

 #if !IS_ENABLED(...)

?

 -#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
 +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
 +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */

Double yuk

 #if !(IS_ENABLED(..)  IS_ENABLED(..))

?

And the #ifdefs suck, as many as possible should be normal if
statements, and one should think carefully if we really need to remove
fields from structures..

Jason


Re: [PATCH v2 12/12] rds/ib: Remove ib_get_dma_mr calls

2015-08-13 Thread santosh shilimkar

On 7/30/2015 4:22 PM, Jason Gunthorpe wrote:

The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
  net/rds/ib.c  | 8 
  net/rds/ib.h  | 2 --
  net/rds/ib_cm.c   | 4 +---
  net/rds/ib_recv.c | 6 +++---
  net/rds/ib_send.c | 8 
  5 files changed, 8 insertions(+), 20 deletions(-)


I wanted to try this series earlier but couldn't because of
broken RDS RDMA. Now I have that fixed with a bunch of patches soon
to be posted, and have tried the series. It works as expected.

The RDS change also looks straightforward since ib_get_dma_mr()
is being used for local write.

So feel free to add below tag if you need one.

Tested-Acked-by: Santosh Shilimkar santosh.shilim...@oracle.com




[PATCH for-next 1/9] IB/core: Add gid_type to gid attribute

2015-08-13 Thread Matan Barak
In order to support multiple GID types, we need to store the gid_type
with each GID. This is also aligned with the RoCE v2 annex: "RoCEv2 PORT
GID table entries shall have a GID type attribute that denotes the L3
Address type". The currently supported GID type is IB_GID_TYPE_IB, which
is also the RoCE v1 GID type.

This implies that gid_type should be added to roce_gid_table meta-data.

Signed-off-by: Matan Barak mat...@mellanox.com
---
 drivers/infiniband/core/cache.c   | 127 +-
 drivers/infiniband/core/cm.c  |   2 +-
 drivers/infiniband/core/cma.c |   3 +-
 drivers/infiniband/core/core_priv.h   |   4 +
 drivers/infiniband/core/device.c  |   9 ++-
 drivers/infiniband/core/multicast.c   |   2 +-
 drivers/infiniband/core/roce_gid_mgmt.c   |  60 --
 drivers/infiniband/core/sa_query.c|   5 +-
 drivers/infiniband/core/uverbs_marshall.c |   1 +
 drivers/infiniband/core/verbs.c   |   1 +
 include/rdma/ib_cache.h   |   4 +
 include/rdma/ib_sa.h  |   1 +
 include/rdma/ib_verbs.h   |  11 ++-
 13 files changed, 177 insertions(+), 53 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index e62b63c..513a1ef 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -64,6 +64,7 @@ enum gid_attr_find_mask {
GID_ATTR_FIND_MASK_GID  = 1UL  0,
GID_ATTR_FIND_MASK_NETDEV   = 1UL  1,
GID_ATTR_FIND_MASK_DEFAULT  = 1UL  2,
+   GID_ATTR_FIND_MASK_GID_TYPE = 1UL  3,
 };
 
 enum gid_table_entry_props {
@@ -112,6 +113,19 @@ struct ib_gid_table {
struct ib_gid_table_entry *data_vec;
 };
 
+static const char * const gid_type_str[] = {
+	[IB_GID_TYPE_IB]	= "IB/RoCE v1",
+};
+
+const char *ib_cache_gid_type_str(enum ib_gid_type gid_type)
+{
+	if (gid_type < ARRAY_SIZE(gid_type_str) && gid_type_str[gid_type])
+		return gid_type_str[gid_type];
+
+	return "Invalid GID type";
+}
+EXPORT_SYMBOL(ib_cache_gid_type_str);
+
 static int write_gid(struct ib_device *ib_dev, u8 port,
 struct ib_gid_table *table, int ix,
 const union ib_gid *gid,
@@ -216,6 +230,10 @@ static int find_gid(struct ib_gid_table *table, const 
union ib_gid *gid,
 		if (table->data_vec[i].props & GID_TABLE_ENTRY_INVALID)
goto next;
 
+		if (mask & GID_ATTR_FIND_MASK_GID_TYPE &&
+		    attr->gid_type != val->gid_type)
+			goto next;
+
 		if (mask & GID_ATTR_FIND_MASK_GID &&
 		    memcmp(gid, &table->data_vec[i].gid, sizeof(*gid)))
goto next;
@@ -277,6 +295,7 @@ int ib_cache_gid_add(struct ib_device *ib_dev, u8 port,
 	mutex_lock(&table->lock);
 
ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID |
+ GID_ATTR_FIND_MASK_GID_TYPE |
  GID_ATTR_FIND_MASK_NETDEV);
if (ix = 0)
goto out_unlock;
@@ -308,6 +327,7 @@ int ib_cache_gid_del(struct ib_device *ib_dev, u8 port,
 
ix = find_gid(table, gid, attr, false,
  GID_ATTR_FIND_MASK_GID  |
+ GID_ATTR_FIND_MASK_GID_TYPE |
  GID_ATTR_FIND_MASK_NETDEV   |
  GID_ATTR_FIND_MASK_DEFAULT);
if (ix  0)
@@ -396,11 +416,13 @@ static int _ib_cache_gid_table_find(struct ib_device 
*ib_dev,
 
 static int ib_cache_gid_find(struct ib_device *ib_dev,
 const union ib_gid *gid,
+enum ib_gid_type gid_type,
 struct net_device *ndev, u8 *port,
 u16 *index)
 {
-   unsigned long mask = GID_ATTR_FIND_MASK_GID;
-   struct ib_gid_attr gid_attr_val = {.ndev = ndev};
+   unsigned long mask = GID_ATTR_FIND_MASK_GID |
+GID_ATTR_FIND_MASK_GID_TYPE;
+   struct ib_gid_attr gid_attr_val = {.ndev = ndev, .gid_type = gid_type};
 
if (ndev)
mask |= GID_ATTR_FIND_MASK_NETDEV;
@@ -411,14 +433,16 @@ static int ib_cache_gid_find(struct ib_device *ib_dev,
 
 int ib_find_cached_gid_by_port(struct ib_device *ib_dev,
   const union ib_gid *gid,
+  enum ib_gid_type gid_type,
   u8 port, struct net_device *ndev,
   u16 *index)
 {
int local_index;
 	struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
struct ib_gid_table *table;
-   unsigned long mask = GID_ATTR_FIND_MASK_GID;
-   struct ib_gid_attr val = {.ndev = ndev};
+   unsigned long mask = GID_ATTR_FIND_MASK_GID |
+GID_ATTR_FIND_MASK_GID_TYPE;
+   struct ib_gid_attr val = {.ndev = ndev, .gid_type = gid_type};
 
if (port  

[PATCH for-next 3/9] IB/core: Add gid attributes to sysfs

2015-08-13 Thread Matan Barak
This patch set adds attributes of net device and gid type to each GID
in the GID table. Users that use verbs directly need to specify
the GID index. Since the same GID could have different types or
associated net devices, users should have the ability to query the
associated GID attributes. Adding these attributes to sysfs.

Signed-off-by: Matan Barak mat...@mellanox.com
---
 drivers/infiniband/core/sysfs.c | 184 +++-
 1 file changed, 182 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index b1f37d4..4d5d87a 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -37,12 +37,22 @@
 #include linux/slab.h
 #include linux/stat.h
 #include linux/string.h
+#include linux/netdevice.h
 
 #include rdma/ib_mad.h
 
+struct ib_port;
+
+struct gid_attr_group {
+   struct ib_port  *port;
+   struct kobject  kobj;
+   struct attribute_group  ndev;
+   struct attribute_group  type;
+};
 struct ib_port {
struct kobject kobj;
struct ib_device  *ibdev;
+   struct gid_attr_group *gid_attr_group;
struct attribute_group gid_group;
struct attribute_group pkey_group;
u8 port_num;
@@ -84,6 +94,24 @@ static const struct sysfs_ops port_sysfs_ops = {
.show = port_attr_show
 };
 
+static ssize_t gid_attr_show(struct kobject *kobj,
+struct attribute *attr, char *buf)
+{
+   struct port_attribute *port_attr =
+   container_of(attr, struct port_attribute, attr);
+	struct ib_port *p = container_of(kobj, struct gid_attr_group,
+					 kobj)->port;
+
+	if (!port_attr->show)
+		return -EIO;
+
+	return port_attr->show(p, port_attr, buf);
+}
+
+static const struct sysfs_ops gid_attr_sysfs_ops = {
+   .show = gid_attr_show
+};
+
 static ssize_t state_show(struct ib_port *p, struct port_attribute *unused,
  char *buf)
 {
@@ -281,6 +309,46 @@ static struct attribute *port_default_attrs[] = {
NULL
 };
 
+static size_t print_ndev(struct ib_gid_attr *gid_attr, char *buf)
+{
+   if (!gid_attr-ndev)
+   return -EINVAL;
+
+   return sprintf(buf, %s\n, gid_attr-ndev-name);
+}
+
+static size_t print_gid_type(struct ib_gid_attr *gid_attr, char *buf)
+{
+   return sprintf(buf, %s\n, ib_cache_gid_type_str(gid_attr-gid_type));
+}
+
+static ssize_t _show_port_gid_attr(struct ib_port *p,
+  struct port_attribute *attr,
+  char *buf,
+  size_t (*print)(struct ib_gid_attr *gid_attr,
+  char *buf))
+{
+   struct port_table_attribute *tab_attr =
+   container_of(attr, struct port_table_attribute, attr);
+   union ib_gid gid;
+   struct ib_gid_attr gid_attr = {};
+   ssize_t ret;
+   va_list args;
+
+   ret = ib_query_gid(p-ibdev, p-port_num, tab_attr-index, gid,
+  gid_attr);
+   if (ret)
+   goto err;
+
+   ret = print(gid_attr, buf);
+
+err:
+   if (gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   va_end(args);
+   return ret;
+}
+
 static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr,
 char *buf)
 {
@@ -296,6 +364,19 @@ static ssize_t show_port_gid(struct ib_port *p, struct 
port_attribute *attr,
return sprintf(buf, %pI6\n, gid.raw);
 }
 
+static ssize_t show_port_gid_attr_ndev(struct ib_port *p,
+  struct port_attribute *attr, char *buf)
+{
+   return _show_port_gid_attr(p, attr, buf, print_ndev);
+}
+
+static ssize_t show_port_gid_attr_gid_type(struct ib_port *p,
+  struct port_attribute *attr,
+  char *buf)
+{
+   return _show_port_gid_attr(p, attr, buf, print_gid_type);
+}
+
 static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr,
  char *buf)
 {
@@ -451,12 +532,41 @@ static void ib_port_release(struct kobject *kobj)
kfree(p);
 }
 
+static void ib_port_gid_attr_release(struct kobject *kobj)
+{
+   struct gid_attr_group *g = container_of(kobj, struct gid_attr_group,
+   kobj);
+   struct attribute *a;
+   int i;
+
+   if (g-ndev.attrs) {
+   for (i = 0; (a = g-ndev.attrs[i]); ++i)
+   kfree(a);
+
+   kfree(g-ndev.attrs);
+   }
+
+   if (g-type.attrs) {
+   for (i = 0; (a = g-type.attrs[i]); ++i)
+   kfree(a);
+
+   kfree(g-type.attrs);
+   }
+
+   kfree(g);
+}
+
 static struct kobj_type port_type = {
.release  

[PATCH for-next 4/9] IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type

2015-08-13 Thread Matan Barak
Add the RoCE v2 GID type and port capabilities. Vendors
which support this type will get their GID tables
populated with RoCE v2 GIDs automatically.

Signed-off-by: Matan Barak mat...@mellanox.com
---
 drivers/infiniband/core/cache.c |  1 +
 drivers/infiniband/core/roce_gid_mgmt.c |  3 ++-
 include/rdma/ib_verbs.h | 23 +--
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 513a1ef..ddd0406 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -115,6 +115,7 @@ struct ib_gid_table {
 
 static const char * const gid_type_str[] = {
[IB_GID_TYPE_IB]= IB/RoCE v1,
+   [IB_GID_TYPE_ROCE_UDP_ENCAP]= RoCE v2,
 };
 
 const char *ib_cache_gid_type_str(enum ib_gid_type gid_type)
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
b/drivers/infiniband/core/roce_gid_mgmt.c
index 7dec6f2..46b52b9 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -71,7 +71,8 @@ static const struct {
bool (*is_supported)(const struct ib_device *device, u8 port_num);
enum ib_gid_type gid_type;
 } PORT_CAP_TO_GID_TYPE[] = {
-   {rdma_protocol_roce,   IB_GID_TYPE_ROCE},
+   {rdma_protocol_roce_eth_encap, IB_GID_TYPE_ROCE},
+   {rdma_protocol_roce_udp_encap, IB_GID_TYPE_ROCE_UDP_ENCAP},
 };
 
 #define CAP_TO_GID_TABLE_SIZE  ARRAY_SIZE(PORT_CAP_TO_GID_TYPE)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index a85926d..dd06be8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -70,6 +70,7 @@ enum ib_gid_type {
/* If link layer is Ethernet, this is RoCE V1 */
IB_GID_TYPE_IB= 0,
IB_GID_TYPE_ROCE  = 0,
+   IB_GID_TYPE_ROCE_UDP_ENCAP = 1,
IB_GID_TYPE_SIZE
 };
 
@@ -398,6 +399,7 @@ union rdma_protocol_stats {
 #define RDMA_CORE_CAP_PROT_IB   0x0010
 #define RDMA_CORE_CAP_PROT_ROCE 0x0020
 #define RDMA_CORE_CAP_PROT_IWARP0x0040
+#define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x0080
 
 #define RDMA_CORE_PORT_IBA_IB  (RDMA_CORE_CAP_PROT_IB  \
| RDMA_CORE_CAP_IB_MAD \
@@ -410,6 +412,12 @@ union rdma_protocol_stats {
| RDMA_CORE_CAP_IB_CM   \
| RDMA_CORE_CAP_AF_IB   \
| RDMA_CORE_CAP_ETH_AH)
+#define RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP  \
+   (RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP \
+   | RDMA_CORE_CAP_IB_MAD  \
+   | RDMA_CORE_CAP_IB_CM   \
+   | RDMA_CORE_CAP_AF_IB   \
+   | RDMA_CORE_CAP_ETH_AH)
 #define RDMA_CORE_PORT_IWARP   (RDMA_CORE_CAP_PROT_IWARP \
| RDMA_CORE_CAP_IW_CM)
 #define RDMA_CORE_PORT_INTEL_OPA   (RDMA_CORE_PORT_IBA_IB  \
@@ -1919,6 +1927,17 @@ static inline bool rdma_protocol_ib(const struct 
ib_device *device, u8 port_num)
 
 static inline bool rdma_protocol_roce(const struct ib_device *device, u8 
port_num)
 {
+   return device-port_immutable[port_num].core_cap_flags 
+   (RDMA_CORE_CAP_PROT_ROCE | RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP);
+}
+
+static inline bool rdma_protocol_roce_udp_encap(const struct ib_device 
*device, u8 port_num)
+{
+   return device-port_immutable[port_num].core_cap_flags  
RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP;
+}
+
+static inline bool rdma_protocol_roce_eth_encap(const struct ib_device 
*device, u8 port_num)
+{
return device-port_immutable[port_num].core_cap_flags  
RDMA_CORE_CAP_PROT_ROCE;
 }
 
@@ -1929,8 +1948,8 @@ static inline bool rdma_protocol_iwarp(const struct 
ib_device *device, u8 port_n
 
 static inline bool rdma_ib_or_roce(const struct ib_device *device, u8 port_num)
 {
-   return device-port_immutable[port_num].core_cap_flags 
-   (RDMA_CORE_CAP_PROT_IB | RDMA_CORE_CAP_PROT_ROCE);
+   return rdma_protocol_ib(device, port_num) ||
+   rdma_protocol_roce(device, port_num);
 }
 
 /**
-- 
2.1.0



[PATCH for-next 9/9] IB/cma: Join and leave multicast groups with IGMP

2015-08-13 Thread Matan Barak
From: Moni Shoua mo...@mellanox.com

Since RoCE v2 is a protocol over an IP header, it is required to send
IGMP join and leave requests to the network when joining and leaving
multicast groups.

Signed-off-by: Moni Shoua mo...@mellanox.com
---
 drivers/infiniband/core/cma.c   | 96 +
 drivers/infiniband/core/multicast.c | 20 +++-
 include/rdma/ib_sa.h|  3 ++
 3 files changed, 107 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index e4f4d23..35976d5 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -38,6 +38,7 @@
 #include linux/in6.h
 #include linux/mutex.h
 #include linux/random.h
+#include linux/igmp.h
 #include linux/idr.h
 #include linux/inetdevice.h
 #include linux/slab.h
@@ -250,6 +251,7 @@ struct cma_multicast {
void*context;
struct sockaddr_storage addr;
struct kref mcref;
+   booligmp_joined;
 };
 
 struct cma_work {
@@ -337,6 +339,26 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 
ip_ver)
hdr-ip_version = (ip_ver  4) | (hdr-ip_version  0xF);
 }
 
+static int cma_igmp_send(struct net_device *ndev, union ib_gid *mgid, bool 
join)
+{
+   struct in_device *in_dev = NULL;
+
+   if (ndev) {
+   rtnl_lock();
+   in_dev = __in_dev_get_rtnl(ndev);
+   if (in_dev) {
+   if (join)
+   ip_mc_inc_group(in_dev,
+   *(__be32 *)(mgid-raw + 12));
+   else
+   ip_mc_dec_group(in_dev,
+   *(__be32 *)(mgid-raw + 12));
+   }
+   rtnl_unlock();
+   }
+   return (in_dev) ? 0 : -ENODEV;
+}
+
 static void cma_attach_to_dev(struct rdma_id_private *id_priv,
  struct cma_device *cma_dev)
 {
@@ -1137,8 +1159,24 @@ static void cma_leave_mc_groups(struct rdma_id_private 
*id_priv)
  id_priv-id.port_num)) {
ib_sa_free_multicast(mc-multicast.ib);
kfree(mc);
-   } else
+   } else {
+   if (mc-igmp_joined) {
+   struct rdma_dev_addr *dev_addr =
+   id_priv-id.route.addr.dev_addr;
+   struct net_device *ndev = NULL;
+
+   if (dev_addr-bound_dev_if)
+   ndev = dev_get_by_index(init_net,
+   
dev_addr-bound_dev_if);
+   if (ndev) {
+   cma_igmp_send(ndev,
+ 
mc-multicast.ib-rec.mgid,
+ false);
+   dev_put(ndev);
+   }
+   }
kref_put(mc-mcref, release_mc);
+   }
}
 }
 
@@ -3379,7 +3417,7 @@ static int cma_iboe_join_multicast(struct rdma_id_private 
*id_priv,
 {
struct iboe_mcast_work *work;
struct rdma_dev_addr *dev_addr = id_priv-id.route.addr.dev_addr;
-   int err;
+   int err = 0;
struct sockaddr *addr = (struct sockaddr *)mc-addr;
struct net_device *ndev = NULL;
 
@@ -3411,13 +3449,35 @@ static int cma_iboe_join_multicast(struct 
rdma_id_private *id_priv,
mc-multicast.ib-rec.rate = iboe_get_rate(ndev);
mc-multicast.ib-rec.hop_limit = 1;
mc-multicast.ib-rec.mtu = iboe_get_mtu(ndev-mtu);
+   mc-multicast.ib-rec.ifindex = dev_addr-bound_dev_if;
+   mc-multicast.ib-rec.net = init_net;
+   rdma_ip2gid((struct sockaddr *)id_priv-id.route.addr.src_addr,
+   mc-multicast.ib-rec.port_gid);
+
+   mc-multicast.ib-rec.gid_type =
+   id_priv-cma_dev-default_gid_type[id_priv-id.port_num -
+   rdma_start_port(id_priv-cma_dev-device)];
+   if (addr-sa_family == AF_INET) {
+   if (mc-multicast.ib-rec.gid_type == 
IB_GID_TYPE_ROCE_UDP_ENCAP)
+   err = cma_igmp_send(ndev, mc-multicast.ib-rec.mgid,
+   true);
+   if (!err) {
+   mc-igmp_joined = true;
+   mc-multicast.ib-rec.hop_limit = IPV6_DEFAULT_HOPLIMIT;
+   }
+   } else {
+   if (mc-multicast.ib-rec.gid_type == 
IB_GID_TYPE_ROCE_UDP_ENCAP)
+   err = -ENOTSUPP;
+   else
+   mc-multicast.ib-rec.gid_type = IB_GID_TYPE_IB;
+   }
dev_put(ndev);
-   if (!mc-multicast.ib-rec.mtu) {
-   err = 

[PATCH for-next 8/9] IB/core: Initialize UD header structure with IP and UDP headers

2015-08-13 Thread Matan Barak
From: Moni Shoua mo...@mellanox.com

ib_ud_header_init() is used to format InfiniBand headers
in a buffer up to (but not including) the BTH. For RoCE UDP
encapsulation, this function must also be able to build IP and UDP
headers.

Signed-off-by: Moni Shoua mo...@mellanox.com
Signed-off-by: Matan Barak mat...@mellanox.com
---
 drivers/infiniband/core/ud_header.c| 155 ++---
 drivers/infiniband/hw/mlx4/qp.c|   7 +-
 drivers/infiniband/hw/mthca/mthca_qp.c |   2 +-
 include/rdma/ib_pack.h |  45 --
 4 files changed, 188 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/ud_header.c 
b/drivers/infiniband/core/ud_header.c
index 72feee6..96697e7 100644
--- a/drivers/infiniband/core/ud_header.c
+++ b/drivers/infiniband/core/ud_header.c
@@ -35,6 +35,7 @@
 #include linux/string.h
 #include linux/export.h
 #include linux/if_ether.h
+#include linux/ip.h
 
 #include rdma/ib_pack.h
 
@@ -116,6 +117,72 @@ static const struct ib_field vlan_table[]  = {
  .size_bits= 16 }
 };
 
+static const struct ib_field ip4_table[]  = {
+   { STRUCT_FIELD(ip4, ver),
+ .offset_words = 0,
+ .offset_bits  = 0,
+ .size_bits= 4 },
+   { STRUCT_FIELD(ip4, hdr_len),
+ .offset_words = 0,
+ .offset_bits  = 4,
+ .size_bits= 4 },
+   { STRUCT_FIELD(ip4, tos),
+ .offset_words = 0,
+ .offset_bits  = 8,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, tot_len),
+ .offset_words = 0,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, id),
+ .offset_words = 1,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, frag_off),
+ .offset_words = 1,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, ttl),
+ .offset_words = 2,
+ .offset_bits  = 0,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, protocol),
+ .offset_words = 2,
+ .offset_bits  = 8,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, check),
+ .offset_words = 2,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, saddr),
+ .offset_words = 3,
+ .offset_bits  = 0,
+ .size_bits= 32 },
+   { STRUCT_FIELD(ip4, daddr),
+ .offset_words = 4,
+ .offset_bits  = 0,
+ .size_bits= 32 }
+};
+
+static const struct ib_field udp_table[]  = {
+   { STRUCT_FIELD(udp, sport),
+ .offset_words = 0,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, dport),
+ .offset_words = 0,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, length),
+ .offset_words = 1,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, csum),
+ .offset_words = 1,
+ .offset_bits  = 16,
+ .size_bits= 16 }
+};
+
 static const struct ib_field grh_table[]  = {
{ STRUCT_FIELD(grh, ip_version),
  .offset_words = 0,
@@ -213,26 +280,57 @@ static const struct ib_field deth_table[] = {
  .size_bits= 24 }
 };
 
+__be16 ib_ud_ip4_csum(struct ib_ud_header *header)
+{
+   struct iphdr iph;
+
+   iph.ihl = 5;
+   iph.version = 4;
+   iph.tos = header-ip4.tos;
+   iph.tot_len = header-ip4.tot_len;
+   iph.id  = header-ip4.id;
+   iph.frag_off= header-ip4.frag_off;
+   iph.ttl = header-ip4.ttl;
+   iph.protocol= header-ip4.protocol;
+   iph.check   = 0;
+   iph.saddr   = header-ip4.saddr;
+   iph.daddr   = header-ip4.daddr;
+
+   return ip_fast_csum((u8 *)iph, iph.ihl);
+}
+EXPORT_SYMBOL(ib_ud_ip4_csum);
+
 /**
  * ib_ud_header_init - Initialize UD header structure
  * @payload_bytes:Length of packet payload
  * @lrh_present: specify if LRH is present
  * @eth_present: specify if Eth header is present
  * @vlan_present: packet is tagged vlan
- * @grh_present:GRH flag (if non-zero, GRH will be included)
+ * @grh_present: GRH flag (if non-zero, GRH will be included)
+ * @ip_version: if non-zero, IP header, V4 or V6, will be included
+ * @udp_present :if non-zero, UDP header will be included
  * @immediate_present: specify if immediate data is present
  * @header:Structure to initialize
  */
-void ib_ud_header_init(int payload_bytes,
-  int  lrh_present,
-  int  eth_present,
-  int  vlan_present,
-  int  grh_present,
-  int  immediate_present,
-  struct ib_ud_header *header)
+int ib_ud_header_init(int payload_bytes,
+ intlrh_present,
+  

[PATCH for-next 0/9] Add RoCE v2 support

2015-08-13 Thread Matan Barak
Hi Doug,

This series adds the support for RoCE v2. In order to support RoCE v2,
we add gid_type attribute to every GID. When the RoCE GID management
populates the GID table, it duplicates each GID with all supported types.
This gives the user the ability to communicate over each supported
type.

Patches 0001, 0002 and 0003 add support for multiple GID types to the
cache and related APIs. The third patch exposes the GID attribute
information in sysfs.

Patch 0004 adds the RoCE v2 GID type and the capabilities required
from the vendor in order to implement RoCE v2. These capabilities
are grouped together as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP.

RoCE v2 can work over IPv4 and IPv6 networks. When receiving an ib_wc, this
information should come from the vendor's driver. In case the vendor
doesn't supply this information, we parse the packet headers and resolve
the network type. Patch 0005 adds this information and the required utilities.

Patches 0006 and 0007 add configfs support (and the required
infrastructure) for CMA. The administrator should be able to set the
default RoCE type. This is done through a new per-port
default_roce_mode configfs file.

Patch 0008 formats a QP1 packet in order to support RoCE v2 CM
packets. This is required for vendors which implement their
QP1 as a Raw QP.

Patch 0009 adds support for IPv4 multicast as an IPv4 network
requires IGMP to be sent in order to join multicast groups.

Vendor code isn't part of this patch set. Soft-RoCE will be
sent soon and depends on these patches. Other vendors, like
mlx4, ocrdma and mlx5, will follow.

This patch set is applied on top of "Add RoCE GID cache usage in verbs/cma",
which was sent to the mailing list.

Thanks,
Matan

Matan Barak (6):
  IB/core: Add gid_type to gid attribute
  IB/cm: Use the source GID index type
  IB/core: Add gid attributes to sysfs
  IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type
  IB/rdma_cm: Add wrapper for cma reference count
  IB/cma: Add configfs for rdma_cm

Moni Shoua (2):
  IB/core: Initialize UD header structure with IP and UDP headers
  IB/cma: Join and leave multicast groups with IGMP

Somnath Kotur (1):
  IB/core: Add rdma_network_type to wc

 drivers/infiniband/Kconfig|   9 +
 drivers/infiniband/core/Makefile  |   2 +
 drivers/infiniband/core/addr.c|  14 ++
 drivers/infiniband/core/cache.c   | 152 +
 drivers/infiniband/core/cm.c  |  25 ++-
 drivers/infiniband/core/cma.c | 216 --
 drivers/infiniband/core/cma_configfs.c| 353 ++
 drivers/infiniband/core/core_priv.h   |  32 +++
 drivers/infiniband/core/device.c  |   9 +-
 drivers/infiniband/core/multicast.c   |  20 +-
 drivers/infiniband/core/roce_gid_mgmt.c   |  61 +-
 drivers/infiniband/core/sa_query.c|   5 +-
 drivers/infiniband/core/sysfs.c   | 184 +++-
 drivers/infiniband/core/ud_header.c   | 155 -
 drivers/infiniband/core/uverbs_marshall.c |   1 +
 drivers/infiniband/core/verbs.c   | 124 ++-
 drivers/infiniband/hw/mlx4/qp.c   |   7 +-
 drivers/infiniband/hw/mthca/mthca_qp.c|   2 +-
 include/rdma/ib_addr.h|   1 +
 include/rdma/ib_cache.h   |   4 +
 include/rdma/ib_pack.h|  45 +++-
 include/rdma/ib_sa.h  |   4 +
 include/rdma/ib_verbs.h   |  78 ++-
 23 files changed, 1399 insertions(+), 104 deletions(-)
 create mode 100644 drivers/infiniband/core/cma_configfs.c

-- 
2.1.0



Re: [RFC] split struct ib_send_wr

2015-08-13 Thread Christoph Hellwig
On Thu, Aug 13, 2015 at 09:07:14AM -0400, Doug Ledford wrote:
  Doug:  was your mail a request to fix up the two de-staged drivers?
  I'm happy to do that if you're fine with the patch in general.  amso1100
  should be trivial anyway, while ipath is a mess, just like the new intel
  driver with the third copy of the soft ib stack.
 
 Correct.

http://git.infradead.org/users/hch/rdma.git/commitdiff/efb2b0f21645b9caabcce955481ab6966e52ad90

contains the updates for ipath and amso1100, as well as the reviewed-by
and tested-by tags.

Note that for now I've skipped the new intel hfi1 driver as updating
two of the soft ib codebases already was tiresome enough.


[PATCH for-next 7/9] IB/cma: Add configfs for rdma_cm

2015-08-13 Thread Matan Barak
Users would like to control the behaviour of rdma_cm.
For example, old applications that don't set the
required RoCE GID type could still be executed on RoCE v2
networks. In order to support this configuration,
we implement a configfs for rdma_cm.

In order to use the configfs, one needs to mount it and
mkdir the IB device name inside the rdma_cm directory.

The patch adds support for a single configuration file,
default_roce_mode. The mode can be either IB/RoCE v1 or
RoCE v2.

Signed-off-by: Matan Barak mat...@mellanox.com
---
 drivers/infiniband/Kconfig |   9 +
 drivers/infiniband/core/Makefile   |   2 +
 drivers/infiniband/core/cache.c|  24 +++
 drivers/infiniband/core/cma.c  |  95 -
 drivers/infiniband/core/cma_configfs.c | 353 +
 drivers/infiniband/core/core_priv.h|  24 +++
 6 files changed, 503 insertions(+), 4 deletions(-)
 create mode 100644 drivers/infiniband/core/cma_configfs.c

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index da4c697..9ee82a2 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -54,6 +54,15 @@ config INFINIBAND_ADDR_TRANS
depends on INFINIBAND
default y
 
+config INFINIBAND_ADDR_TRANS_CONFIGFS
+   bool
+   depends on INFINIBAND_ADDR_TRANS  CONFIGFS_FS
+   default y
+   ---help---
+ ConfigFS support for RDMA communication manager (CM).
+ This allows the user to config the default GID type that the CM
+ uses for each device, when initiating new connections.
+
 source drivers/infiniband/hw/mthca/Kconfig
 source drivers/infiniband/hw/qib/Kconfig
 source drivers/infiniband/hw/ehca/Kconfig
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index d43a899..7922fa7 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -24,6 +24,8 @@ iw_cm-y :=iwcm.o iwpm_util.o iwpm_msg.o
 
 rdma_cm-y :=   cma.o
 
+rdma_cm-$(CONFIG_INFINIBAND_ADDR_TRANS_CONFIGFS) += cma_configfs.o
+
 rdma_ucm-y :=  ucma.o
 
 ib_addr-y :=   addr.o
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index ddd0406..66090ce 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -127,6 +127,30 @@ const char *ib_cache_gid_type_str(enum ib_gid_type 
gid_type)
 }
 EXPORT_SYMBOL(ib_cache_gid_type_str);
 
+int ib_cache_gid_parse_type_str(const char *buf)
+{
+   unsigned int i;
+   size_t len;
+   int err = -EINVAL;
+
+   len = strlen(buf);
+   if (len == 0)
+   return -EINVAL;
+
+   if (buf[len - 1] == '\n')
+   len--;
+
+   for (i = 0; i  ARRAY_SIZE(gid_type_str); ++i)
+   if (gid_type_str[i]  !strncmp(buf, gid_type_str[i], len) 
+   len == strlen(gid_type_str[i])) {
+   err = i;
+   break;
+   }
+
+   return err;
+}
+EXPORT_SYMBOL(ib_cache_gid_parse_type_str);
+
 static int write_gid(struct ib_device *ib_dev, u8 port,
 struct ib_gid_table *table, int ix,
 const union ib_gid *gid,
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 22003dd..e4f4d23 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -121,6 +121,7 @@ struct cma_device {
struct completion   comp;
atomic_trefcount;
struct list_headid_list;
+   enum ib_gid_type*default_gid_type;
 };
 
 struct rdma_bind_list {
@@ -138,6 +139,62 @@ void cma_ref_dev(struct cma_device *cma_dev)
atomic_inc(cma_dev-refcount);
 }
 
+struct cma_device *cma_enum_devices_by_ibdev(cma_device_filter filter,
+void   *cookie)
+{
+   struct cma_device *cma_dev;
+   struct cma_device *found_cma_dev = NULL;
+
+   mutex_lock(lock);
+
+   list_for_each_entry(cma_dev, dev_list, list)
+   if (filter(cma_dev-device, cookie)) {
+   found_cma_dev = cma_dev;
+   break;
+   }
+
+   if (found_cma_dev)
+   cma_ref_dev(found_cma_dev);
+   mutex_unlock(lock);
+   return found_cma_dev;
+}
+
+int cma_get_default_gid_type(struct cma_device *cma_dev,
+unsigned int port)
+{
+   if (port  rdma_start_port(cma_dev-device) ||
+   port  rdma_end_port(cma_dev-device))
+   return -EINVAL;
+
+   return cma_dev-default_gid_type[port - 
rdma_start_port(cma_dev-device)];
+}
+
+int cma_set_default_gid_type(struct cma_device *cma_dev,
+unsigned int port,
+enum ib_gid_type default_gid_type)
+{
+   unsigned long supported_gids;
+
+   if (port  rdma_start_port(cma_dev-device) ||
+   port  

[PATCH for-next 6/9] IB/rdma_cm: Add wrapper for cma reference count

2015-08-13 Thread Matan Barak
Currently, cma users can't increase or decrease the cma device
reference count. This is necessary when setting cma attributes (like the
default GID type) in order to avoid use-after-free errors.
Add the cma_ref_dev and cma_deref_dev APIs.

Signed-off-by: Matan Barak mat...@mellanox.com
---
 drivers/infiniband/core/cma.c   | 11 +--
 drivers/infiniband/core/core_priv.h |  4 
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 99e9e3e..22003dd 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -56,6 +56,8 @@
 #include rdma/ib_sa.h
 #include rdma/iw_cm.h
 
+#include core_priv.h
+
 MODULE_AUTHOR(Sean Hefty);
 MODULE_DESCRIPTION(Generic RDMA CM Agent);
 MODULE_LICENSE(Dual BSD/GPL);
@@ -131,6 +133,11 @@ enum {
CMA_OPTION_AFONLY,
 };
 
+void cma_ref_dev(struct cma_device *cma_dev)
+{
+   atomic_inc(cma_dev-refcount);
+}
+
 /*
  * Device removal can occur at anytime, so we need extra handling to
  * serialize notifying the user of device removal with other callbacks.
@@ -276,7 +283,7 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 
ip_ver)
 static void cma_attach_to_dev(struct rdma_id_private *id_priv,
  struct cma_device *cma_dev)
 {
-   atomic_inc(cma_dev-refcount);
+   cma_ref_dev(cma_dev);
id_priv-cma_dev = cma_dev;
id_priv-id.device = cma_dev-device;
id_priv-id.route.addr.dev_addr.transport =
@@ -284,7 +291,7 @@ static void cma_attach_to_dev(struct rdma_id_private 
*id_priv,
list_add_tail(id_priv-list, cma_dev-id_list);
 }
 
-static inline void cma_deref_dev(struct cma_device *cma_dev)
+void cma_deref_dev(struct cma_device *cma_dev)
 {
if (atomic_dec_and_test(cma_dev-refcount))
complete(cma_dev-comp);
diff --git a/drivers/infiniband/core/core_priv.h 
b/drivers/infiniband/core/core_priv.h
index f2c5cc9..7fbfea9 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -38,6 +38,10 @@
 
 #include rdma/ib_verbs.h
 
+struct cma_device;
+void cma_ref_dev(struct cma_device *cma_dev);
+void cma_deref_dev(struct cma_device *cma_dev);
+
 int  ib_device_register_sysfs(struct ib_device *device,
  int (*port_callback)(struct ib_device *,
   u8, struct kobject *));
-- 
2.1.0



[PATCH for-next 2/9] IB/cm: Use the source GID index type

2015-08-13 Thread Matan Barak
Previously, the cm and cma modules supported only the IB and RoCE v1 GID
types. In order to support multiple GID types, the gid_type is passed to
cm_init_av_by_path and stored in the path record.

The rdma_cm client uses a default GID type that is saved in
rdma_id_private.

Signed-off-by: Matan Barak mat...@mellanox.com
---
 drivers/infiniband/core/cm.c  | 25 -
 drivers/infiniband/core/cma.c |  2 ++
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 0c15488..ba81025 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -362,7 +362,7 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, 
struct cm_av *av)
read_lock_irqsave(cm.device_lock, flags);
list_for_each_entry(cm_dev, cm.device_list, list) {
if (!ib_find_cached_gid(cm_dev-ib_device, path-sgid,
-   IB_GID_TYPE_IB, ndev, p, NULL)) {
+   path-gid_type, ndev, p, NULL)) {
port = cm_dev-port[p-1];
break;
}
@@ -1536,6 +1536,8 @@ static int cm_req_handler(struct cm_work *work)
struct ib_cm_id *cm_id;
struct cm_id_private *cm_id_priv, *listen_cm_id_priv;
struct cm_req_msg *req_msg;
+   union ib_gid gid;
+   struct ib_gid_attr gid_attr;
int ret;
 
req_msg = (struct cm_req_msg *)work-mad_recv_wc-recv_buf.mad;
@@ -1575,11 +1577,24 @@ static int cm_req_handler(struct cm_work *work)
cm_format_paths_from_req(req_msg, work-path[0], work-path[1]);
 
memcpy(work-path[0].dmac, cm_id_priv-av.ah_attr.dmac, ETH_ALEN);
-   ret = cm_init_av_by_path(work-path[0], cm_id_priv-av);
+   ret = ib_get_cached_gid(work-port-cm_dev-ib_device,
+   work-port-port_num,
+   cm_id_priv-av.ah_attr.grh.sgid_index,
+   gid, gid_attr);
+   if (!ret) {
+   if (gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   work-path[0].gid_type = gid_attr.gid_type;
+   ret = cm_init_av_by_path(work-path[0], cm_id_priv-av);
+   }
if (ret) {
-   ib_get_cached_gid(work-port-cm_dev-ib_device,
- work-port-port_num, 0, work-path[0].sgid,
- NULL);
+   int err = ib_get_cached_gid(work-port-cm_dev-ib_device,
+   work-port-port_num, 0,
+   work-path[0].sgid,
+   gid_attr);
+   if (!err  gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   work-path[0].gid_type = gid_attr.gid_type;
ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
   work-path[0].sgid, sizeof work-path[0].sgid,
   NULL, 0);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index dfb92bf..f78b8dd 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -174,6 +174,7 @@ struct rdma_id_private {
u8  tos;
u8  reuseaddr;
u8  afonly;
+   enum ib_gid_typegid_type;
 };
 
 struct cma_multicast {
@@ -1952,6 +1953,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
ndev = dev_get_by_index(init_net, addr-dev_addr.bound_dev_if);
route-path_rec-net = init_net;
route-path_rec-ifindex = addr-dev_addr.bound_dev_if;
+   route-path_rec-gid_type = id_priv-gid_type;
}
if (!ndev) {
ret = -ENODEV;
-- 
2.1.0



[PATCH for-next 5/9] IB/core: Add rdma_network_type to wc

2015-08-13 Thread Matan Barak
From: Somnath Kotur somnath.ko...@avagotech.com

Providers should tell the IB core the wc's network type.
This is used in order to search for the proper GID in the
GID table. When using HCAs that can't provide this info,
the IB core examines the packet headers and extracts
the GID type by itself.

We choose the sgid_index and type from all the matching entries in
RDMA-CM based on a hint from the IP stack, and we set the hop_limit for
the IP packet based on the same hint.

Signed-off-by: Matan Barak mat...@mellanox.com
Signed-off-by: Somnath Kotur somnath.ko...@avagotech.com
---
 drivers/infiniband/core/addr.c  |  14 +
 drivers/infiniband/core/cma.c   |  11 +++-
 drivers/infiniband/core/verbs.c | 123 ++--
 include/rdma/ib_addr.h  |   1 +
 include/rdma/ib_verbs.h |  44 ++
 5 files changed, 187 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index d3c42b3..3e1f93c 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -257,6 +257,12 @@ static int addr4_resolve(struct sockaddr_in *src_in,
goto put;
}
 
+   /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
+* routable) and we could set the network type accordingly.
+*/
+   if (rt-rt_uses_gateway)
+   addr-network = RDMA_NETWORK_IPV4;
+
ret = dst_fetch_ha(rt-dst, addr, fl4.daddr);
 put:
ip_rt_put(rt);
@@ -271,6 +277,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
 {
struct flowi6 fl6;
struct dst_entry *dst;
+   struct rt6_info *rt;
int ret;
 
memset(fl6, 0, sizeof fl6);
@@ -282,6 +289,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
if ((ret = dst-error))
goto put;
 
+   rt = (struct rt6_info *)dst;
if (ipv6_addr_any(fl6.saddr)) {
ret = ipv6_dev_get_saddr(init_net, ip6_dst_idev(dst)-dev,
 fl6.daddr, 0, fl6.saddr);
@@ -305,6 +313,12 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
goto put;
}
 
+   /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
+* routable) and we could set the network type accordingly.
+*/
+   if (rt-rt6i_flags  RTF_GATEWAY)
+   addr-network = RDMA_NETWORK_IPV6;
+
ret = dst_fetch_ha(dst, addr, fl6.daddr);
 put:
dst_release(dst);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index f78b8dd..99e9e3e 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1929,6 +1929,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
 {
struct rdma_route *route = id_priv-id.route;
struct rdma_addr *addr = route-addr;
+   enum ib_gid_type network_gid_type;
struct cma_work *work;
int ret;
struct net_device *ndev = NULL;
@@ -1967,7 +1968,15 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
rdma_ip2gid((struct sockaddr *)id_priv-id.route.addr.dst_addr,
route-path_rec-dgid);
 
-   route-path_rec-hop_limit = 1;
+   /* Use the hint from IP Stack to select GID Type */
+   network_gid_type = ib_network_to_gid_type(addr-dev_addr.network);
+   if (addr-dev_addr.network != RDMA_NETWORK_IB) {
+   route-path_rec-gid_type = network_gid_type;
+   /* TODO: get the hoplimit from the inet/inet6 device */
+   route-path_rec-hop_limit = IPV6_DEFAULT_HOPLIMIT;
+   } else {
+   route-path_rec-hop_limit = 1;
+   }
route-path_rec-reversible = 1;
route-path_rec-pkey = cpu_to_be16(0x);
route-path_rec-mtu_selector = IB_SA_EQ;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 62a3d01..1d1cab3 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -260,8 +260,61 @@ struct ib_ah *ib_create_ah(struct ib_pd *pd, struct 
ib_ah_attr *ah_attr)
 }
 EXPORT_SYMBOL(ib_create_ah);
 
+static int ib_get_header_version(const union rdma_network_hdr *hdr)
+{
+   const struct iphdr *ip4h = (struct iphdr *)hdr-roce4grh;
+   struct iphdr ip4h_checked;
+   const struct ipv6hdr *ip6h = (struct ipv6hdr *)hdr-ibgrh;
+
+   /* If it's IPv6, the version must be 6, otherwise, the first
+* 20 bytes (before the IPv4 header) are garbled.
+*/
+   if (ip6h-version != 6)
+   return (ip4h-version == 4) ? 4 : 0;
+   /* version may be 6 or 4 because the first 20 bytes could be garbled */
+
+   /* RoCE v2 requires no options, thus header length
+  must be 5 words
+   */
+   if (ip4h-ihl != 5)
+   return 6;
+
+   /* Verify checksum.
+  We can't write on scattered buffers so we need to copy to
+  temp 
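
[Editor's sketch] The version-detection trick in ib_get_header_version() above — an IPv6 GRH always carries version nibble 6, while a RoCE v2 IPv4 packet keeps its real IPv4 header in the last 20 bytes of the 40-byte GRH area — can be modelled in plain userspace C. The helper name is ours, and the final checksum tie-breaker of the real code is omitted:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch of the heuristic in ib_get_header_version():
 * decide whether a 40-byte GRH area holds an IPv6 header or a
 * trailing 20-byte IPv4 header (RoCE v2 over IPv4).  The IPv4
 * checksum verification of the real code is omitted; the helper
 * name is hypothetical, not a kernel symbol.
 */
static int grh_ip_version(const uint8_t grh[40])
{
	int v6_version = grh[0] >> 4;    /* IPv6 version nibble        */
	int v4_version = grh[20] >> 4;   /* inner IPv4 version nibble  */
	int v4_ihl     = grh[20] & 0x0f; /* inner IPv4 header length   */

	/* Not 6: the leading 20 bytes are garbage, trust the IPv4 part. */
	if (v6_version != 6)
		return v4_version == 4 ? 4 : 0;

	/* RoCE v2 requires no IPv4 options, so IHL must be 5 words. */
	if (v4_ihl != 5)
		return 6;

	/* The real code verifies the IPv4 checksum here to break the tie. */
	return 4;
}
```

The ambiguous case is a first byte of 0x45..0x4f in the inner header: only then does the kernel fall back to checksum verification.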

[PATCH for-next V8 3/6] IB/uverbs: Explicitly pass ib_dev to uverbs commands

2015-08-13 Thread Yishai Hadas
Done in preparation for deploying RCU for the device removal
flow. Allows isolating the RCU handling to the uverbs_main layer and
keeping the uverbs_cmd code as is.

Signed-off-by: Yishai Hadas yish...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
 drivers/infiniband/core/uverbs.h  |3 +
 drivers/infiniband/core/uverbs_cmd.c  |  103 ++---
 drivers/infiniband/core/uverbs_main.c |   21 +--
 3 files changed, 88 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index 92ec765..ea52db1 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -178,6 +178,7 @@ extern struct idr ib_uverbs_rule_idr;
 void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj);
 
 struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
+   struct ib_device *ib_dev,
int is_async);
 void ib_uverbs_free_async_event_file(struct ib_uverbs_file *uverbs_file);
 struct ib_uverbs_event_file *ib_uverbs_lookup_comp_file(int fd);
@@ -214,6 +215,7 @@ struct ib_uverbs_flow_spec {
 
 #define IB_UVERBS_DECLARE_CMD(name)\
ssize_t ib_uverbs_##name(struct ib_uverbs_file *file,   \
+struct ib_device *ib_dev,  \
 const char __user *buf, int in_len,\
 int out_len)
 
@@ -255,6 +257,7 @@ IB_UVERBS_DECLARE_CMD(close_xrcd);
 
 #define IB_UVERBS_DECLARE_EX_CMD(name) \
int ib_uverbs_ex_##name(struct ib_uverbs_file *file,\
+   struct ib_device *ib_dev,   \
struct ib_udata *ucore, \
struct ib_udata *uhw)
 
diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index 5720a92..29443c0 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -282,13 +282,13 @@ static void put_xrcd_read(struct ib_uobject *uobj)
 }
 
 ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
+ struct ib_device *ib_dev,
  const char __user *buf,
  int in_len, int out_len)
 {
struct ib_uverbs_get_context  cmd;
struct ib_uverbs_get_context_resp resp;
struct ib_udata   udata;
-   struct ib_device *ibdev = file-device-ib_dev;
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
struct ib_device_attr dev_attr;
 #endif
@@ -313,13 +313,13 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
   (unsigned long) cmd.response + sizeof resp,
   in_len - sizeof cmd, out_len - sizeof resp);
 
-   ucontext = ibdev-alloc_ucontext(ibdev, udata);
+   ucontext = ib_dev-alloc_ucontext(ib_dev, udata);
if (IS_ERR(ucontext)) {
ret = PTR_ERR(ucontext);
goto err;
}
 
-   ucontext-device = ibdev;
+   ucontext-device = ib_dev;
INIT_LIST_HEAD(ucontext-pd_list);
INIT_LIST_HEAD(ucontext-mr_list);
INIT_LIST_HEAD(ucontext-mw_list);
@@ -340,7 +340,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
ucontext-odp_mrs_count = 0;
INIT_LIST_HEAD(ucontext-no_private_counters);
 
-   ret = ib_query_device(ibdev, dev_attr);
+   ret = ib_query_device(ib_dev, dev_attr);
if (ret)
goto err_free;
if (!(dev_attr.device_cap_flags  IB_DEVICE_ON_DEMAND_PAGING))
@@ -355,7 +355,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
goto err_free;
resp.async_fd = ret;
 
-   filp = ib_uverbs_alloc_event_file(file, 1);
+   filp = ib_uverbs_alloc_event_file(file, ib_dev, 1);
if (IS_ERR(filp)) {
ret = PTR_ERR(filp);
goto err_fd;
@@ -384,7 +384,7 @@ err_fd:
 
 err_free:
put_pid(ucontext-tgid);
-   ibdev-dealloc_ucontext(ucontext);
+   ib_dev-dealloc_ucontext(ucontext);
 
 err:
mutex_unlock(file-mutex);
@@ -392,11 +392,12 @@ err:
 }
 
 static void copy_query_dev_fields(struct ib_uverbs_file *file,
+ struct ib_device *ib_dev,
  struct ib_uverbs_query_device_resp *resp,
  struct ib_device_attr *attr)
 {
resp-fw_ver= attr-fw_ver;
-   resp-node_guid = file-device-ib_dev-node_guid;
+   resp-node_guid = ib_dev-node_guid;
resp-sys_image_guid= attr-sys_image_guid;
resp-max_mr_size   = attr-max_mr_size;
resp-page_size_cap = 
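
[Editor's sketch] The pattern this patch prepares for — resolving a possibly-disappearing device pointer once at the dispatch layer and handing the pinned value down, instead of letting every handler chase file-device-ib_dev itself — can be sketched generically. All names here are illustrative, not kernel APIs:

```c
#include <assert.h>
#include <stddef.h>

struct device  { int id; };
struct context { struct device *dev; /* may be cleared on hot-removal */ };

/* Handlers receive the device explicitly instead of re-reading ctx->dev. */
typedef int (*cmd_handler)(struct context *ctx, struct device *dev, int arg);

static int cmd_query(struct context *ctx, struct device *dev, int arg)
{
	(void)ctx;
	return dev->id + arg;
}

/*
 * The dispatcher is the single place that checks liveness; in the
 * kernel this is where the srcu_dereference() of a later patch goes.
 */
static int dispatch(struct context *ctx, cmd_handler h, int arg)
{
	struct device *dev = ctx->dev;   /* read once */

	if (!dev)
		return -1;               /* device was hot-removed */
	return h(ctx, dev, arg);
}

static int demo_live(void)
{
	struct device d = { .id = 40 };
	struct context c = { &d };

	return dispatch(&c, cmd_query, 2);
}

static int demo_removed(void)
{
	struct context c = { NULL };

	return dispatch(&c, cmd_query, 2);
}
```

Handlers never observe a half-removed device because they only see the snapshot the dispatcher validated.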

[PATCH for-next V8 6/6] IB/ucma: HW Device hot-removal support

2015-08-13 Thread Yishai Hadas
Currently, the IB/cma remove_one flow blocks until all user descriptors managed
by IB/ucma are released. This prevents hot-removal of IB devices. This patch
allows IB/cma to remove devices regardless of user space activity. Upon getting
the RDMA_CM_EVENT_DEVICE_REMOVAL event we close all the underlying HW resources
for the given ucontext. The ucontext itself remains alive until its creator
explicitly destroys it.

Applications running at that time will be left with a zombie device; further
operations may fail.

Signed-off-by: Yishai Hadas yish...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Reviewed-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/ucma.c |  140 ---
 1 files changed, 129 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 29b2121..c41aef4 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -74,6 +74,7 @@ struct ucma_file {
struct list_headctx_list;
struct list_headevent_list;
wait_queue_head_t   poll_wait;
+   struct workqueue_struct *close_wq;
 };
 
 struct ucma_context {
@@ -89,6 +90,13 @@ struct ucma_context {
 
struct list_headlist;
struct list_headmc_list;
+   /* mark that device is in process of destroying the internal HW
+* resources, protected by the global mut
+*/
+   int closing;
+   /* sync between removal event and id destroy, protected by file mut */
+   int destroying;
+   struct work_struct  close_work;
 };
 
 struct ucma_multicast {
@@ -107,6 +115,7 @@ struct ucma_event {
struct list_headlist;
struct rdma_cm_id   *cm_id;
struct rdma_ucm_event_resp resp;
+   struct work_struct  close_work;
 };
 
 static DEFINE_MUTEX(mut);
@@ -132,8 +141,12 @@ static struct ucma_context *ucma_get_ctx(struct ucma_file 
*file, int id)
 
mutex_lock(mut);
ctx = _ucma_find_context(id, file);
-   if (!IS_ERR(ctx))
-   atomic_inc(ctx-ref);
+   if (!IS_ERR(ctx)) {
+   if (ctx-closing)
+   ctx = ERR_PTR(-EIO);
+   else
+   atomic_inc(ctx-ref);
+   }
mutex_unlock(mut);
return ctx;
 }
@@ -144,6 +157,28 @@ static void ucma_put_ctx(struct ucma_context *ctx)
complete(ctx-comp);
 }
 
+static void ucma_close_event_id(struct work_struct *work)
+{
+   struct ucma_event *uevent_close =  container_of(work, struct 
ucma_event, close_work);
+
+   rdma_destroy_id(uevent_close-cm_id);
+   kfree(uevent_close);
+}
+
+static void ucma_close_id(struct work_struct *work)
+{
+   struct ucma_context *ctx =  container_of(work, struct ucma_context, 
close_work);
+
+   /* once all inflight tasks are finished, we close all underlying
+* resources. The context is still alive until its creator explicitly
+* destroys it.
+*/
+   ucma_put_ctx(ctx);
+   wait_for_completion(ctx-comp);
+   /* No new events will be generated after destroying the id. */
+   rdma_destroy_id(ctx-cm_id);
+}
+
 static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file)
 {
struct ucma_context *ctx;
@@ -152,6 +187,7 @@ static struct ucma_context *ucma_alloc_ctx(struct ucma_file 
*file)
if (!ctx)
return NULL;
 
+   INIT_WORK(ctx-close_work, ucma_close_id);
atomic_set(ctx-ref, 1);
init_completion(ctx-comp);
INIT_LIST_HEAD(ctx-mc_list);
@@ -242,6 +278,44 @@ static void ucma_set_event_context(struct ucma_context 
*ctx,
}
 }
 
+/* Called with file-mut locked for the relevant context. */
+static void ucma_removal_event_handler(struct rdma_cm_id *cm_id)
+{
+   struct ucma_context *ctx = cm_id-context;
+   struct ucma_event *con_req_eve;
+   int event_found = 0;
+
+   if (ctx-destroying)
+   return;
+
+   /* Only if the context owns the cm_id can it be queued to be closed.
+* Otherwise the cm_id is an inflight one that is part of the
+* context's event list, pending to be detached and reattached to its
+* new context as part of ucma_get_event, and is handled separately
+* below.
+*/
+   if (ctx-cm_id == cm_id) {
+   mutex_lock(mut);
+   ctx-closing = 1;
+   mutex_unlock(mut);
+   queue_work(ctx-file-close_wq, ctx-close_work);
+   return;
+   }
+
+   list_for_each_entry(con_req_eve, ctx-file-event_list, list) {
+   if (con_req_eve-cm_id == cm_id 
+   con_req_eve-resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) {
+   list_del(con_req_eve-list);
+   INIT_WORK(con_req_eve-close_work, 
ucma_close_event_id);
+
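
[Editor's sketch] The teardown ordering in ucma_close_id() above — drop the creator's reference, wait for all in-flight users to drain, only then destroy — is the classic completion-on-last-put scheme. A minimal single-threaded userspace model (all names invented here, no kernel APIs):

```c
#include <assert.h>
#include <stdbool.h>

struct ctx {
	int  ref;        /* starts at 1 for the creator              */
	bool completed;  /* stands in for the kernel completion      */
	bool destroyed;
};

static void ctx_get(struct ctx *c)
{
	c->ref++;
}

static void ctx_put(struct ctx *c)
{
	if (--c->ref == 0)
		c->completed = true;   /* complete(&ctx->comp) */
}

/* Mirrors ucma_close_id(): put the creator ref, "wait", then destroy. */
static void ctx_close(struct ctx *c)
{
	ctx_put(c);
	/* wait_for_completion(&ctx->comp): in this single-threaded model
	 * the last put must already have happened by now. */
	assert(c->completed);
	c->destroyed = true;       /* rdma_destroy_id(ctx->cm_id) */
}

static bool demo(void)
{
	struct ctx c = { .ref = 1 };

	ctx_get(&c);   /* an in-flight user takes a reference */
	ctx_put(&c);   /* ...and drops it */
	ctx_close(&c); /* creator tears down: no new events after this */
	return c.destroyed && c.completed;
}
```

The point of the workqueue in the patch is that this close sequence must not run in the event-handler context that still holds a reference, or it would deadlock on its own completion.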

[PATCH for-next V8 1/6] IB/uverbs: Fix reference counting usage of event files

2015-08-13 Thread Yishai Hadas
Fix the reference counting usage to be handled in the event file
creation/destruction function, instead of being done by the caller.
This is done for both async/non-async event files.

Based on Jason Gunthorpe's report at
https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg24680.html:
The existing code for this is broken, in ib_uverbs_get_context all
the error paths between ib_uverbs_alloc_event_file and the
kref_get(file-ref) are wrong - this will result in fput() which will
call ib_uverbs_event_close, which will try to do kref_put and
ib_unregister_event_handler - which are no longer paired.

Signed-off-by: Yishai Hadas yish...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
 drivers/infiniband/core/uverbs.h  |1 +
 drivers/infiniband/core/uverbs_cmd.c  |   11 +---
 drivers/infiniband/core/uverbs_main.c |   44 
 3 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index ba365b6..60e6e3d 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -178,6 +178,7 @@ void idr_remove_uobj(struct idr *idp, struct ib_uobject 
*uobj);
 
 struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
int is_async);
+void ib_uverbs_free_async_event_file(struct ib_uverbs_file *uverbs_file);
 struct ib_uverbs_event_file *ib_uverbs_lookup_comp_file(int fd);
 
 void ib_uverbs_release_ucq(struct ib_uverbs_file *file,
diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index bbb02ff..5720a92 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -367,16 +367,6 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
goto err_file;
}
 
-   file-async_file = filp-private_data;
-
-   INIT_IB_EVENT_HANDLER(file-event_handler, file-device-ib_dev,
- ib_uverbs_event_handler);
-   ret = ib_register_event_handler(file-event_handler);
-   if (ret)
-   goto err_file;
-
-   kref_get(file-async_file-ref);
-   kref_get(file-ref);
file-ucontext = ucontext;
 
fd_install(resp.async_fd, filp);
@@ -386,6 +376,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
return in_len;
 
 err_file:
+   ib_uverbs_free_async_event_file(file);
fput(filp);
 
 err_fd:
diff --git a/drivers/infiniband/core/uverbs_main.c 
b/drivers/infiniband/core/uverbs_main.c
index f6eef2d..c238eba 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -406,10 +406,9 @@ static int ib_uverbs_event_close(struct inode *inode, 
struct file *filp)
}
spin_unlock_irq(file-lock);
 
-   if (file-is_async) {
+   if (file-is_async)
ib_unregister_event_handler(file-uverbs_file-event_handler);
-   kref_put(file-uverbs_file-ref, ib_uverbs_release_file);
-   }
+   kref_put(file-uverbs_file-ref, ib_uverbs_release_file);
kref_put(file-ref, ib_uverbs_release_event_file);
 
return 0;
@@ -541,13 +540,20 @@ void ib_uverbs_event_handler(struct ib_event_handler 
*handler,
NULL, NULL);
 }
 
+void ib_uverbs_free_async_event_file(struct ib_uverbs_file *file)
+{
+   kref_put(file-async_file-ref, ib_uverbs_release_event_file);
+   file-async_file = NULL;
+}
+
 struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
int is_async)
 {
struct ib_uverbs_event_file *ev_file;
struct file *filp;
+   int ret;
 
-   ev_file = kmalloc(sizeof *ev_file, GFP_KERNEL);
+   ev_file = kzalloc(sizeof(*ev_file), GFP_KERNEL);
if (!ev_file)
return ERR_PTR(-ENOMEM);
 
@@ -556,15 +562,41 @@ struct file *ib_uverbs_alloc_event_file(struct 
ib_uverbs_file *uverbs_file,
INIT_LIST_HEAD(ev_file-event_list);
init_waitqueue_head(ev_file-poll_wait);
ev_file-uverbs_file = uverbs_file;
+   kref_get(ev_file-uverbs_file-ref);
ev_file-async_queue = NULL;
-   ev_file-is_async= is_async;
ev_file-is_closed   = 0;
 
filp = anon_inode_getfile([infinibandevent], uverbs_event_fops,
  ev_file, O_RDONLY);
if (IS_ERR(filp))
-   kfree(ev_file);
+   goto err_put_refs;
+
+   if (is_async) {
+   WARN_ON(uverbs_file-async_file);
+   uverbs_file-async_file = ev_file;
+   kref_get(uverbs_file-async_file-ref);
+   INIT_IB_EVENT_HANDLER(uverbs_file-event_handler,
+ uverbs_file-device-ib_dev,
+ ib_uverbs_event_handler);
+   ret = 
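
[Editor's sketch] The fix moves reference acquisition into the constructor, so every caller error path between allocation and installation stays balanced. The shape of that idiom, in userspace C with invented names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct parent { int ref; };

static void parent_get(struct parent *p) { p->ref++; }
static void parent_put(struct parent *p) { p->ref--; }

struct event_file { struct parent *owner; };

/*
 * The constructor takes the reference it needs and releases it on its
 * own failure paths, so callers can never leave a get/put unpaired —
 * the bug described above.  'fail' stands in for anon_inode_getfile()
 * failing.
 */
static struct event_file *event_file_create(struct parent *p, bool fail)
{
	struct event_file *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	parent_get(p);            /* paired inside event_file_destroy() */
	if (fail) {
		parent_put(p);    /* constructor undoes its own get */
		free(f);
		return NULL;
	}
	f->owner = p;
	return f;
}

static void event_file_destroy(struct event_file *f)
{
	parent_put(f->owner);
	free(f);
}

static int demo(void)
{
	struct parent p = { .ref = 1 };
	struct event_file *f = event_file_create(&p, false);

	if (!f || p.ref != 2)
		return -1;
	event_file_destroy(f);
	if (event_file_create(&p, true) != NULL)
		return -1;
	return p.ref;  /* balanced: back to 1 */
}
```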

[PATCH for-next V8 2/6] IB/uverbs: Fix race between ib_uverbs_open and remove_one

2015-08-13 Thread Yishai Hadas
Fixes: 2a72f212263701b927559f6850446421d5906c41 (IB/uverbs: Remove dev_table)

Before this commit there was a device look-up table that was protected
by a spin_lock used by ib_uverbs_open and by ib_uverbs_remove_one. When
it was dropped and container_of was used instead, it enabled the race
with remove_one as dev might be freed just after:
dev = container_of(inode-i_cdev, struct ib_uverbs_device, cdev) but
before the kref_get.

In addition, this buggy patch added some dead code as
container_of(x,y,z) can never be NULL and so dev can never be NULL.
As a result, the comment above ib_uverbs_open saying the open method
may return -ENXIO is wrong, as that can never happen.

The solution follows Jason Gunthorpe's suggestion from the URL below:
https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html

cdev will hold a kref on the parent (the containing structure,
ib_uverbs_device) and only when that kref is released it is
guaranteed that open will never be called again.

In addition, fix the active count scheme to use an atomic
instead of a kref, to prevent a WARN_ON as pointed out in the
above comment from Jason.

Signed-off-by: Yishai Hadas yish...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
 drivers/infiniband/core/uverbs.h  |3 +-
 drivers/infiniband/core/uverbs_main.c |   43 +++--
 2 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index 60e6e3d..92ec765 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -85,7 +85,7 @@
  */
 
 struct ib_uverbs_device {
-   struct kref ref;
+   atomic_trefcount;
int num_comp_vectors;
struct completion   comp;
struct device  *dev;
@@ -94,6 +94,7 @@ struct ib_uverbs_device {
struct cdev cdev;
struct rb_root  xrcd_tree;
struct mutexxrcd_tree_mutex;
+   struct kobject  kobj;
 };
 
 struct ib_uverbs_event_file {
diff --git a/drivers/infiniband/core/uverbs_main.c 
b/drivers/infiniband/core/uverbs_main.c
index c238eba..9f39978 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -130,14 +130,18 @@ static int (*uverbs_ex_cmd_table[])(struct ib_uverbs_file 
*file,
 static void ib_uverbs_add_one(struct ib_device *device);
 static void ib_uverbs_remove_one(struct ib_device *device);
 
-static void ib_uverbs_release_dev(struct kref *ref)
+static void ib_uverbs_release_dev(struct kobject *kobj)
 {
struct ib_uverbs_device *dev =
-   container_of(ref, struct ib_uverbs_device, ref);
+   container_of(kobj, struct ib_uverbs_device, kobj);
 
-   complete(dev-comp);
+   kfree(dev);
 }
 
+static struct kobj_type ib_uverbs_dev_ktype = {
+   .release = ib_uverbs_release_dev,
+};
+
 static void ib_uverbs_release_event_file(struct kref *ref)
 {
struct ib_uverbs_event_file *file =
@@ -303,13 +307,19 @@ static int ib_uverbs_cleanup_ucontext(struct 
ib_uverbs_file *file,
return context-device-dealloc_ucontext(context);
 }
 
+static void ib_uverbs_comp_dev(struct ib_uverbs_device *dev)
+{
+   complete(dev-comp);
+}
+
 static void ib_uverbs_release_file(struct kref *ref)
 {
struct ib_uverbs_file *file =
container_of(ref, struct ib_uverbs_file, ref);
 
module_put(file-device-ib_dev-owner);
-   kref_put(file-device-ref, ib_uverbs_release_dev);
+   if (atomic_dec_and_test(file-device-refcount))
+   ib_uverbs_comp_dev(file-device);
 
kfree(file);
 }
@@ -775,9 +785,7 @@ static int ib_uverbs_open(struct inode *inode, struct file 
*filp)
int ret;
 
dev = container_of(inode-i_cdev, struct ib_uverbs_device, cdev);
-   if (dev)
-   kref_get(dev-ref);
-   else
+   if (!atomic_inc_not_zero(dev-refcount))
return -ENXIO;
 
if (!try_module_get(dev-ib_dev-owner)) {
@@ -798,6 +806,7 @@ static int ib_uverbs_open(struct inode *inode, struct file 
*filp)
mutex_init(file-mutex);
 
filp-private_data = file;
+   kobject_get(dev-kobj);
 
return nonseekable_open(inode, filp);
 
@@ -805,13 +814,16 @@ err_module:
module_put(dev-ib_dev-owner);
 
 err:
-   kref_put(dev-ref, ib_uverbs_release_dev);
+   if (atomic_dec_and_test(dev-refcount))
+   ib_uverbs_comp_dev(dev);
+
return ret;
 }
 
 static int ib_uverbs_close(struct inode *inode, struct file *filp)
 {
struct ib_uverbs_file *file = filp-private_data;
+   struct ib_uverbs_device *dev = file-device;
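
[Editor's sketch] The core of this fix is replacing an unconditional kref_get() with the "take a reference only if the object is still live" idiom. In userspace C11 the same atomic_inc_not_zero() primitive looks like this (the function name mirrors the kernel's, the implementation here is ours):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *v unless it is zero.  Returns true if the increment
 * happened.  A zero refcount means the object is already on its way
 * to being freed, so a lookup (like ib_uverbs_open's container_of on
 * the cdev) must fail with -ENXIO instead of resurrecting it.
 */
static bool atomic_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
		/* 'old' was reloaded by the failed CAS; retry */
	}
	return false;
}

static int demo_live(void)
{
	atomic_int r = 2;

	return atomic_inc_not_zero(&r) ? atomic_load(&r) : -1;
}

static int demo_dead(void)
{
	atomic_int r = 0;

	return atomic_inc_not_zero(&r) ? -1 : atomic_load(&r);
}
```

This is exactly why the patch splits lifetimes: the atomic refcount gates lookups, while the separate kobject kref keeps the memory itself alive until the cdev guarantees open() can no longer be called.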
 

[PATCH for-next V8 4/6] IB/uverbs: Enable device removal when there are active user space applications

2015-08-13 Thread Yishai Hadas
Enables uverbs_remove_one to succeed even while there are running IB
applications working with the given ib device.  This functionality
enables a HW device to be unbound/reset even while user space
applications are using it.

It exposes a new IB kernel API named 'disassociate_ucontext' which lets
a driver detach its HW resources from a given user context without
crashing/terminating the application. If a driver implements the above
API and registers with ib_uverbs, there will be no dependency between
its device and its uverbs_device. A call to remove_one of ib_uverbs
returns after disassociating the open HW resources, without waiting for
clients to disconnect. If a driver does not implement this API there is
no change to the current behaviour: uverbs_remove_one returns only when
the last client has disconnected and the reference count on the uverbs
device has dropped to 0.

If the lower driver's device is removed, any running application will
continue working over a zombie HCA; further calls will end with an
immediate error.

Signed-off-by: Yishai Hadas yish...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
 drivers/infiniband/core/uverbs.h  |9 +-
 drivers/infiniband/core/uverbs_main.c |  360 +++--
 include/rdma/ib_verbs.h   |1 +
 3 files changed, 302 insertions(+), 68 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index ea52db1..3863d33 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -89,12 +89,16 @@ struct ib_uverbs_device {
int num_comp_vectors;
struct completion   comp;
struct device  *dev;
-   struct ib_device   *ib_dev;
+   struct ib_device__rcu  *ib_dev;
int devnum;
struct cdev cdev;
struct rb_root  xrcd_tree;
struct mutexxrcd_tree_mutex;
struct kobject  kobj;
+   struct srcu_struct  disassociate_srcu;
+   struct mutexlists_mutex; /* protect lists */
+   struct list_headuverbs_file_list;
+   struct list_headuverbs_events_file_list;
 };
 
 struct ib_uverbs_event_file {
@@ -106,6 +110,7 @@ struct ib_uverbs_event_file {
wait_queue_head_t   poll_wait;
struct fasync_struct   *async_queue;
struct list_headevent_list;
+   struct list_headlist;
 };
 
 struct ib_uverbs_file {
@@ -115,6 +120,8 @@ struct ib_uverbs_file {
struct ib_ucontext *ucontext;
struct ib_event_handler event_handler;
struct ib_uverbs_event_file*async_file;
+   struct list_headlist;
+   int is_closed;
 };
 
 struct ib_uverbs_event {
diff --git a/drivers/infiniband/core/uverbs_main.c 
b/drivers/infiniband/core/uverbs_main.c
index dc968df..59b28a6 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -137,6 +137,7 @@ static void ib_uverbs_release_dev(struct kobject *kobj)
struct ib_uverbs_device *dev =
container_of(kobj, struct ib_uverbs_device, kobj);
 
+   cleanup_srcu_struct(dev-disassociate_srcu);
kfree(dev);
 }
 
@@ -207,9 +208,6 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file 
*file,
 {
struct ib_uobject *uobj, *tmp;
 
-   if (!context)
-   return 0;
-
context-closing = 1;
 
list_for_each_entry_safe(uobj, tmp, context-ah_list, list) {
@@ -318,8 +316,16 @@ static void ib_uverbs_release_file(struct kref *ref)
 {
struct ib_uverbs_file *file =
container_of(ref, struct ib_uverbs_file, ref);
+   struct ib_device *ib_dev;
+   int srcu_key;
+
+   srcu_key = srcu_read_lock(file-device-disassociate_srcu);
+   ib_dev = srcu_dereference(file-device-ib_dev,
+ file-device-disassociate_srcu);
+   if (ib_dev  !ib_dev-disassociate_ucontext)
+   module_put(ib_dev-owner);
+   srcu_read_unlock(file-device-disassociate_srcu, srcu_key);
 
-   module_put(file-device-ib_dev-owner);
if (atomic_dec_and_test(file-device-refcount))
ib_uverbs_comp_dev(file-device);
 
@@ -343,9 +349,19 @@ static ssize_t ib_uverbs_event_read(struct file *filp, 
char __user *buf,
return -EAGAIN;
 
if (wait_event_interruptible(file-poll_wait,
-
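
[Editor's sketch] The disassociate flow relies on SRCU: removal unpublishes ib_dev and then waits for every reader that might still hold the old pointer. A toy single-threaded model of that grace period, with an explicit reader count standing in for real (S)RCU — all names invented:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct uverbs_dev {
	_Atomic(void *) ib_dev;   /* NULL once disassociated          */
	atomic_int      readers;  /* stands in for the srcu read-side */
};

static void *reader_enter(struct uverbs_dev *d)
{
	atomic_fetch_add(&d->readers, 1);   /* srcu_read_lock()   */
	return atomic_load(&d->ib_dev);     /* srcu_dereference() */
}

static void reader_exit(struct uverbs_dev *d)
{
	atomic_fetch_sub(&d->readers, 1);   /* srcu_read_unlock() */
}

/* Removal: unpublish the pointer, then wait for readers to drain. */
static void disassociate(struct uverbs_dev *d)
{
	atomic_store(&d->ib_dev, NULL);
	while (atomic_load(&d->readers) != 0)
		;                           /* synchronize_srcu() */
}

static int demo(void)
{
	struct uverbs_dev d;
	int hw;
	int ok;

	atomic_init(&d.ib_dev, &hw);
	atomic_init(&d.readers, 0);

	ok = (reader_enter(&d) == &hw);     /* live: reader sees the device */
	reader_exit(&d);

	disassociate(&d);                   /* no readers left: returns     */

	ok = ok && (reader_enter(&d) == NULL);  /* zombie: lookups fail     */
	reader_exit(&d);
	return ok;
}
```

Real SRCU avoids the busy-wait and the shared counter contention, but the invariant is the same: after disassociate() returns, no reader can still be using the old device pointer.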

[PATCH for-next V8 5/6] IB/mlx4_ib: Disassociate support

2015-08-13 Thread Yishai Hadas
Implements the IB core disassociate_ucontext API. The driver detaches the HW
resources for a given user context to prevent a dependency between application
termination and device disconnection. This is done by managing the VMAs that
were mapped to the HW BARs, such as the doorbell and blueflame regions. When a
detach is needed, they are remapped to an arbitrary kernel page returned by
the zap API.

Signed-off-by: Yishai Hadas yish...@mellanox.com
Signed-off-by: Jack Morgenstein ja...@mellanox.com
---
 drivers/infiniband/hw/mlx4/main.c|  139 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   13 +++
 2 files changed, 150 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 8be6db8..3097a27 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -692,7 +692,7 @@ static struct ib_ucontext *mlx4_ib_alloc_ucontext(struct 
ib_device *ibdev,
resp.cqe_size = dev-dev-caps.cqe_size;
}
 
-   context = kmalloc(sizeof *context, GFP_KERNEL);
+   context = kzalloc(sizeof(*context), GFP_KERNEL);
if (!context)
return ERR_PTR(-ENOMEM);
 
@@ -729,21 +729,143 @@ static int mlx4_ib_dealloc_ucontext(struct ib_ucontext 
*ibcontext)
return 0;
 }
 
+static void  mlx4_ib_vma_open(struct vm_area_struct *area)
+{
+   /* vma_open is called when a new VMA is created on top of our VMA.
+* This is done through either mremap flow or split_vma (usually due
+* to mlock, madvise, munmap, etc.). We do not support a clone of the
+* vma, as this VMA is strongly hardware related. Therefore we set the
+* vm_ops of the newly created/cloned VMA to NULL, to prevent it from
+* calling us again and trying to do incorrect actions. We assume that
+* the original vma size is exactly a single page, on which there will
+* be no splitting operations.
+*/
+   area-vm_ops = NULL;
+}
+
+static void  mlx4_ib_vma_close(struct vm_area_struct *area)
+{
+   struct mlx4_ib_vma_private_data *mlx4_ib_vma_priv_data;
+
+   /* It's guaranteed that all VMAs opened on a FD are closed before the
+* file itself is closed, therefore no sync is needed with the regular
+* closing flow (e.g. mlx4_ib_dealloc_ucontext). However, a sync is
+* needed with accesses to the vma as part of
+* mlx4_ib_disassociate_ucontext. The close operation is usually
+* called under mm-mmap_sem except when the process is exiting.  The
+* exiting case is handled explicitly as part
+* of mlx4_ib_disassociate_ucontext.
+*/
+   mlx4_ib_vma_priv_data = (struct mlx4_ib_vma_private_data *)
+   area-vm_private_data;
+
+   /* set the vma context pointer to null in the mlx4_ib driver's private
+* data to protect against a race condition in
+* mlx4_ib_disassociate_ucontext().
+*/
+   mlx4_ib_vma_priv_data-vma = NULL;
+}
+
+static const struct vm_operations_struct mlx4_ib_vm_ops = {
+   .open = mlx4_ib_vma_open,
+   .close = mlx4_ib_vma_close
+};
+
+static void mlx4_ib_disassociate_ucontext(struct ib_ucontext *ibcontext)
+{
+   int i;
+   int ret = 0;
+   struct vm_area_struct *vma;
+   struct mlx4_ib_ucontext *context = to_mucontext(ibcontext);
+   struct task_struct *owning_process  = NULL;
+   struct mm_struct   *owning_mm   = NULL;
+
+   owning_process = get_pid_task(ibcontext-tgid, PIDTYPE_PID);
+   if (!owning_process)
+   return;
+
+   owning_mm = get_task_mm(owning_process);
+   if (!owning_mm) {
+   pr_info(no mm, disassociate ucontext is pending task 
termination\n);
+   while (1) {
+   /* make sure that task is dead before returning, it may
+* prevent a rare case of module down in parallel to a
+* call to mlx4_ib_vma_close.
+*/
+   put_task_struct(owning_process);
+   msleep(1);
+   owning_process = get_pid_task(ibcontext-tgid,
+ PIDTYPE_PID);
+   if (!owning_process ||
+   owning_process-state == TASK_DEAD) {
+   pr_info(disassociate ucontext done, task was 
terminated\n);
+   /* in case task was dead need to release the 
task struct */
+   if (owning_process)
+   put_task_struct(owning_process);
+   return;
+   }
+   }
+   }
+
+   /* need to protect from a race on closing the vma as part of
+* mlx4_ib_vma_close().
+*/
+   down_read(owning_mm-mmap_sem);
+   for (i = 0; i  HW_BAR_COUNT; i++) {
+   vma = 
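
[Editor's sketch] Stripped of the mm details, the mlx4 scheme is simple bookkeeping: one private-data slot per HW BAR, vma_close() NULLs its slot, and disassociate only zaps slots that are still live. A userspace model with illustrative names:

```c
#include <assert.h>
#include <stddef.h>

#define HW_BAR_COUNT 3   /* per-BAR slots, as in the mlx4 ucontext */

struct vma_priv {
	void *vma;           /* NULL once the vma was closed */
	int   zapped;
};

struct ucontext {
	struct vma_priv bars[HW_BAR_COUNT];
};

/* mlx4_ib_vma_close(): forget the vma so disassociate won't touch it. */
static void vma_close(struct vma_priv *p)
{
	p->vma = NULL;
}

/* Zap (remap to a dummy page; here: just mark) every still-open BAR. */
static int disassociate(struct ucontext *ctx)
{
	int zapped = 0;

	for (int i = 0; i < HW_BAR_COUNT; i++) {
		if (ctx->bars[i].vma) {
			ctx->bars[i].zapped = 1;
			zapped++;
		}
	}
	return zapped;
}

static int demo(void)
{
	int fake_vma;
	struct ucontext c = { 0 };

	c.bars[0].vma = &fake_vma;
	c.bars[1].vma = &fake_vma;
	c.bars[2].vma = &fake_vma;
	vma_close(&c.bars[1]);          /* app unmapped one BAR itself */
	return disassociate(&c);        /* only the live slots get zapped */
}
```

In the kernel the race between the two sides is closed by taking mmap_sem around the loop, which is exactly what the comment in mlx4_ib_vma_close() is about.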

Re: [RFC] split struct ib_send_wr

2015-08-13 Thread Christoph Hellwig
On Wed, Aug 12, 2015 at 08:24:49PM +0300, Sagi Grimberg wrote:
 Just a nit that I've noticed, in mlx4 set_fmr_seg params are not
 aligned to the parenthesis (maybe in other locations too but I haven't
 noticed such...)

This is just using a normal two tab indent for continued function
parameters..