Re: [RFC] split struct ib_send_wr
On 08/13/2015 01:54 AM, Christoph Hellwig wrote: On Wed, Aug 12, 2015 at 07:24:44PM -0700, Chuck Lever wrote: That makes sense, but you already Acked the change that breaks Lustre, and it's going in through the NFS tree. Are you changing that to a NAK? No. Lustre fits in my "languishing in the staging tree" category. It seems like Doug was mostly concerned about to-be-removed drivers. I definitely refuse to fix Lustre for anything I touch because it's such a giant mess which uses just about every major subsystem in an incorrect way. Doug: was your mail a request to fix up the two de-staged drivers? I'm happy to do that if you're fine with the patch in general. amso1100 should be trivial anyway, while ipath is a mess, just like the new intel driver with the third copy of the soft ib stack. Correct. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names
Nic is silent... Sagi, do you have an ETA on when you can have the recode ready for detailed review and test? If we can't make linux-4.3, can we be early in staging it for linux-4.4? Hi Steve, I have something, but it's not remotely close to being submission ready. This ended up being a rewrite of the registration path, which is pretty convoluted at the moment. My aim is mostly simplifying it in a way that iWARP support would be (almost) straight-forward... I can send you my WIP to test. Sagi. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] IB/hfi1: Remove inline from trace functions
From: Dennis Dalessandro dennis.dalessan...@intel.com inline in trace functions causes the following build error when CONFIG_OPTIMIZE_INLINING is not defined in the kernel config: error: function can never be inlined because it uses variable argument lists Reported by 0-day build: https://lists.01.org/pipermail/kbuild-all/2015-August/011215.html This patch converts to a non-inline version of the hfi1 trace functions. Reviewed-by: Jubin John jubin.j...@intel.com Reviewed-by: Mike Marciniszyn mike.marcinis...@intel.com Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com --- drivers/staging/hfi1/trace.c | 15 +++- drivers/staging/hfi1/trace.h | 51 -- 2 files changed, 32 insertions(+), 34 deletions(-) diff --git a/drivers/staging/hfi1/trace.c b/drivers/staging/hfi1/trace.c index afbb212..ea95591 100644 --- a/drivers/staging/hfi1/trace.c +++ b/drivers/staging/hfi1/trace.c @@ -48,7 +48,6 @@ * */ #define CREATE_TRACE_POINTS -#define HFI1_TRACE_DO_NOT_CREATE_INLINES #include "trace.h" u8 ibhdr_exhdr_len(struct hfi1_ib_header *hdr) @@ -208,4 +207,16 @@ const char *print_u64_array( return ret; } -#undef HFI1_TRACE_DO_NOT_CREATE_INLINES +__hfi1_trace_fn(PKT); +__hfi1_trace_fn(PROC); +__hfi1_trace_fn(SDMA); +__hfi1_trace_fn(LINKVERB); +__hfi1_trace_fn(DEBUG); +__hfi1_trace_fn(SNOOP); +__hfi1_trace_fn(CNTR); +__hfi1_trace_fn(PIO); +__hfi1_trace_fn(DC8051); +__hfi1_trace_fn(FIRMWARE); +__hfi1_trace_fn(RCVCTRL); +__hfi1_trace_fn(TID); + diff --git a/drivers/staging/hfi1/trace.h b/drivers/staging/hfi1/trace.h index 5c34606..05c7ce8 100644 --- a/drivers/staging/hfi1/trace.h +++ b/drivers/staging/hfi1/trace.h @@ -1339,22 +1339,17 @@ DECLARE_EVENT_CLASS(hfi1_trace_template, /* * It may be nice to macroize the __hfi1_trace but the va_* stuff requires an - * actual function to work and can not be in a macro. Also the fmt can not be a - * constant char * because we need to be able to manipulate the \n if it is - * present. + * actual function to work and can not be in a macro. 
*/ -#define __hfi1_trace_event(lvl) \ +#define __hfi1_trace_def(lvl) \ +void __hfi1_trace_##lvl(const char *funct, char *fmt, ...);\ + \ DEFINE_EVENT(hfi1_trace_template, hfi1_ ##lvl, \ TP_PROTO(const char *function, struct va_format *vaf), \ TP_ARGS(function, vaf)) -#ifdef HFI1_TRACE_DO_NOT_CREATE_INLINES -#define __hfi1_trace_fn(fn) __hfi1_trace_event(fn) -#else -#define __hfi1_trace_fn(fn) \ -__hfi1_trace_event(fn); \ -__printf(2, 3) \ -static inline void __hfi1_trace_##fn(const char *func, char *fmt, ...) \ +#define __hfi1_trace_fn(lvl) \ +void __hfi1_trace_##lvl(const char *func, char *fmt, ...) \ { \ struct va_format vaf = {\ .fmt = fmt, \ @@ -1363,36 +1358,28 @@ static inline void __hfi1_trace_##fn(const char *func, char *fmt, ...) \ \ va_start(args, fmt);\ vaf.va = args; \ - trace_hfi1_ ##fn(func, vaf); \ + trace_hfi1_ ##lvl(func, vaf); \ va_end(args); \ return; \ } -#endif /* * To create a new trace level simply define it as below. This will create all * the hooks for calling hfi1_cdbg(LVL, fmt, ...); as well as take care of all * the debugfs stuff. */ -__hfi1_trace_fn(RVPKT); -__hfi1_trace_fn(INIT); -__hfi1_trace_fn(VERB); -__hfi1_trace_fn(PKT); -__hfi1_trace_fn(PROC); -__hfi1_trace_fn(MM); -__hfi1_trace_fn(ERRPKT); -__hfi1_trace_fn(SDMA); -__hfi1_trace_fn(VPKT); -__hfi1_trace_fn(LINKVERB); -__hfi1_trace_fn(VERBOSE); -__hfi1_trace_fn(DEBUG); -__hfi1_trace_fn(SNOOP); -__hfi1_trace_fn(CNTR); -__hfi1_trace_fn(PIO); -__hfi1_trace_fn(DC8051); -__hfi1_trace_fn(FIRMWARE); -__hfi1_trace_fn(RCVCTRL); -__hfi1_trace_fn(TID); +__hfi1_trace_def(PKT); +__hfi1_trace_def(PROC); +__hfi1_trace_def(SDMA); +__hfi1_trace_def(LINKVERB); +__hfi1_trace_def(DEBUG); +__hfi1_trace_def(SNOOP); +__hfi1_trace_def(CNTR); +__hfi1_trace_def(PIO); +__hfi1_trace_def(DC8051); +__hfi1_trace_def(FIRMWARE); +__hfi1_trace_def(RCVCTRL); +__hfi1_trace_def(TID); #define hfi1_cdbg(which, fmt, ...) 
\ __hfi1_trace_##which(__func__, fmt, ##__VA_ARGS__)
Re: [RFC] split struct ib_send_wr
On Thu, Aug 13, 2015 at 09:04:39AM -0700, Christoph Hellwig wrote: On Thu, Aug 13, 2015 at 09:07:14AM -0400, Doug Ledford wrote: Doug: was your mail a request to fix up the two de-staged drivers? I'm happy to do that if you're fine with the patch in general. amso1100 should be trivial anyway, while ipath is a mess, just like the new intel driver with the third copy of the soft ib stack. Correct. http://git.infradead.org/users/hch/rdma.git/commitdiff/efb2b0f21645b9caabcce955481ab6966e52ad90 contains the updates for ipath and amso1100, as well as the reviewed-by and tested-by tags. The uverbs change needs to drop/move the original kmalloc: next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + user_wr->num_sge * sizeof (struct ib_sge), GFP_KERNEL); It looks like it is leaking that allocation right now. Every path replaces next with the result of alloc_mr.. Noticed a couple of trailing whitespaces too.. Jason
Re: [PATCH] IB/hfi1: Remove inline from trace functions
On Thu, Aug 13, 2015 at 10:06:14AM -0400, Mike Marciniszyn wrote: From: Dennis Dalessandro dennis.dalessan...@intel.com inline in trace functions causes the following build error when CONFIG_OPTIMIZE_INLINING is not defined in the kernel config: error: function can never be inlined because it uses variable argument lists There are all manner of tracing things in the kernel. Does this driver really need a custom designed one? Jason
Re: [RFC] split struct ib_send_wr
On Thu, Aug 13, 2015 at 10:53:54AM -0700, Christoph Hellwig wrote: http://git.infradead.org/users/hch/rdma.git/commitdiff/5d7e6fa563dae32d4b6f63e29e3795717a545f11 For the core bits: Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com Jason
[ANNOUNCE] dapl-2.1.6-1 release
New release for uDAPL (2.1.6) is available at http://downloads.openfabrics.org/dapl/

Vlad, please pull into OFED 3.18-1

md5sum: dce3ef7c943807d35bcb26dae72b1d88 dapl-2.1.6.tar.gz

For the v2.1 package, install RPM packages as follows:

dapl-2.1.6-1
dapl-utils-2.1.6-1
dapl-devel-2.1.6-1
dapl-debuginfo-2.1.6-1

Release notes: http://downloads.openfabrics.org/dapl/documentation/uDAPL_release_notes.txt

Summary: Release 2.1.6

ucm: add cluster size environments to adjust CM timers
mpxyd: proxy_in data transfers can improperly start before RTU received
mcm: forward open/query for MFO devices in query only mode
mpxyd: byte swap incorrect on WRC wr_len
dtest: remove ERR message from flush QP function
dapltest: Quit command with -n port number will core dump
config: update dat.conf for MFO qib devices, 2 adapters/ports
mpxyd: add MFO support on proxy side
mcm: add MFO proxy commands, device, and CM support
mcm: add MFO support to openib_common code base
mcm: add full offload (MFO) mode to provider to support qib on MIC
dtest: pre-allocated buffer too small for RMR, DTO ops timeout
mpxyd: fix buffer initialization when no-inline support is active
mpxyd: reduce log level on qp_flush to CM level
mcm: intra-node proxy missing LID setup on rejects
mcm: add intra-node support via ibscif device and mcm provider
mcm: provide MIC address info with proxy device open
mcm: add device info to non-debug log
common: add DAPL_DTO_TYPE_EXTENSION_IMM for rdma_write_imm DTO type checking
mpxyd: fix up some of the PI logging
dtest: modify rdma_write_with_msg to support uni-direction streaming
mcm,mpxyd: fix dreq processing to defer QP flush when proxy WRs still pending
mpxyd: update byte_len and comp_cnt for PO to remote HST communications
mcm: bug fixes for non-inline devices
mcm: return CM_rej with CM_req_in errors
mpxyd,mcm: RDMA write with immed data not signaled on request side
mcm: add WC opcode and wc_flags in debug log message
mpxyd: set options bug fix for mcm_ib_inline

Release commit: 
http://git.openfabrics.org/?p=~ardavis/dapl.git;a=commit;h=91febc42f0070b2b9eaa81c0c113c6ff7ab8ea60 Regards, Arlin
Re: [RFC] split struct ib_send_wr
On Aug 13, 2015, at 9:04 AM, Christoph Hellwig h...@infradead.org wrote: On Thu, Aug 13, 2015 at 09:07:14AM -0400, Doug Ledford wrote: Doug: was your mail a request to fix up the two de-staged drivers? I'm happy to do that if you're fine with the patch in general. amso1100 should be trivial anyway, while ipath is a mess, just like the new intel driver with the third copy of the soft ib stack. Correct. http://git.infradead.org/users/hch/rdma.git/commitdiff/efb2b0f21645b9caabcce955481ab6966e52ad90 contains the updates for ipath and amso1100, as well as the reviewed-by and tested-by tags. Note that for now I've skipped the new intel hfi1 driver as updating two of the soft ib codebases already was tiresome enough. This looks like a straightforward mechanical change. For the hunks under net/sunrpc/xprtrdma/ : Reviewed-by: Chuck Lever chuck.le...@oracle.com -- Chuck Lever
Re: [RFC] split struct ib_send_wr
On Thu, Aug 13, 2015 at 11:22:34AM -0600, Jason Gunthorpe wrote: The uverbs change needs to drop/move the original kmalloc: next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + user_wr->num_sge * sizeof (struct ib_sge), GFP_KERNEL); It looks like it is leaking that allocation right now. Every path replaces next with the result of alloc_mr.. Thanks. It should be gone, and indeed was in my first version. Not sure how it sneaked in during a rebase. Noticed a couple of trailing whitespaces too.. checkpatch found two of them, which I've fixed now. New version at: http://git.infradead.org/users/hch/rdma.git/commitdiff/5d7e6fa563dae32d4b6f63e29e3795717a545f11
[RFC PATCH 5/8 v2] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM v3.
This adds new core infiniband structures and helpers to implement ODP (on demand paging) on top of HMM. We need to retain the tree of ib_umem as some hardware associates a unique identifier with each umem (or mr) and only allows the hardware page table to be updated using this unique id. Changed since v1: - Adapt to new hmm_mirror lifetime rules. - Fix scan of existing mirror in ib_umem_odp_get(). Changed since v2: - Remove FIXME for empty umem as it is an invalid case. - Fix HMM version of ib_umem_odp_release() Signed-off-by: Jérôme Glisse jgli...@redhat.com Signed-off-by: John Hubbard jhubb...@nvidia.com Signed-off-by: Haggai Eran hagg...@mellanox.com --- drivers/infiniband/core/umem_odp.c| 145 ++ drivers/infiniband/core/uverbs_cmd.c | 1 + drivers/infiniband/core/uverbs_main.c | 6 ++ include/rdma/ib_umem_odp.h| 27 +++ include/rdma/ib_verbs.h | 12 +++ 5 files changed, 191 insertions(+) diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index d3b65d4..bcbc2c2 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -42,7 +42,152 @@ #include <rdma/ib_umem_odp.h> #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) + +static void ib_mirror_destroy(struct kref *kref) +{ + struct ib_mirror *ib_mirror; + struct ib_device *ib_device; + + ib_mirror = container_of(kref, struct ib_mirror, kref); + + ib_device = ib_mirror->ib_device; + mutex_lock(&ib_device->hmm_mutex); + list_del_init(&ib_mirror->list); + mutex_unlock(&ib_device->hmm_mutex); + + /* hmm_mirror_unregister() will free the structure. 
*/ + hmm_mirror_unregister(ib_mirror-base); +} + +void ib_mirror_unref(struct ib_mirror *ib_mirror) +{ + if (ib_mirror == NULL) + return; + + kref_put(ib_mirror-kref, ib_mirror_destroy); +} +EXPORT_SYMBOL(ib_mirror_unref); + +static inline struct ib_mirror *ib_mirror_ref(struct ib_mirror *ib_mirror) +{ + if (!ib_mirror || !kref_get_unless_zero(ib_mirror-kref)) + return NULL; + return ib_mirror; +} + +int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem) +{ + struct mm_struct *mm = get_task_mm(current); + struct ib_device *ib_device = context-device; + struct ib_mirror *ib_mirror; + struct pid *our_pid; + int ret; + + if (!mm || !ib_device-hmm_ready) + return -EINVAL; + + /* This can not happen ! */ + if (unlikely(ib_umem_start(umem) == ib_umem_end(umem))) + return -EINVAL; + + /* Prevent creating ODP MRs in child processes */ + rcu_read_lock(); + our_pid = get_task_pid(current-group_leader, PIDTYPE_PID); + rcu_read_unlock(); + put_pid(our_pid); + if (context-tgid != our_pid) { + mmput(mm); + return -EINVAL; + } + + umem-hugetlb = 0; + umem-odp_data = kmalloc(sizeof(*umem-odp_data), GFP_KERNEL); + if (umem-odp_data == NULL) { + mmput(mm); + return -ENOMEM; + } + umem-odp_data-private = NULL; + umem-odp_data-umem = umem; + + mutex_lock(ib_device-hmm_mutex); + /* Is there an existing mirror for this process mm ? */ + ib_mirror = ib_mirror_ref(context-ib_mirror); + if (!ib_mirror) { + struct ib_mirror *tmp; + + list_for_each_entry(tmp, ib_device-ib_mirrors, list) { + if (tmp-base.hmm-mm != mm) + continue; + ib_mirror = ib_mirror_ref(tmp); + break; + } + } + + if (!ib_mirror) { + /* We need to create a new mirror. 
*/ + ib_mirror = kmalloc(sizeof(*ib_mirror), GFP_KERNEL); + if (!ib_mirror) { + mutex_unlock(ib_device-hmm_mutex); + mmput(mm); + return -ENOMEM; + } + kref_init(ib_mirror-kref); + init_rwsem(ib_mirror-hmm_mr_rwsem); + ib_mirror-umem_tree = RB_ROOT; + ib_mirror-ib_device = ib_device; + + ib_mirror-base.device = ib_device-hmm_dev; + ret = hmm_mirror_register(ib_mirror-base); + if (ret) { + mutex_unlock(ib_device-hmm_mutex); + kfree(ib_mirror); + mmput(mm); + return ret; + } + + list_add(ib_mirror-list, ib_device-ib_mirrors); + context-ib_mirror = ib_mirror_ref(ib_mirror); + } + mutex_unlock(ib_device-hmm_mutex); + umem-odp_data.ib_mirror = ib_mirror; + + down_write(ib_mirror-umem_rwsem); + rbt_ib_umem_insert(umem-odp_data-interval_tree, mirror-umem_tree); + up_write(ib_mirror-umem_rwsem); + +
[RFC PATCH 4/8 v2] IB/odp/hmm: prepare for HMM code path.
This is a preparatory patch for the HMM implementation of ODP (on demand paging). It shuffles code around that will be shared between the current ODP implementation and the HMM code path. It also converts many #ifdef CONFIG_* uses to #if IS_ENABLED(). Signed-off-by: Jérôme Glisse jgli...@redhat.com --- drivers/infiniband/core/umem_odp.c | 3 + drivers/infiniband/core/uverbs_cmd.c | 24 -- drivers/infiniband/hw/mlx5/main.c| 13 ++- drivers/infiniband/hw/mlx5/mem.c | 11 ++- drivers/infiniband/hw/mlx5/mlx5_ib.h | 14 ++-- drivers/infiniband/hw/mlx5/mr.c | 19 +++-- drivers/infiniband/hw/mlx5/odp.c | 118 ++- drivers/infiniband/hw/mlx5/qp.c | 4 +- drivers/net/ethernet/mellanox/mlx5/core/eq.c | 2 +- drivers/net/ethernet/mellanox/mlx5/core/qp.c | 8 +- include/rdma/ib_umem_odp.h | 51 +++- include/rdma/ib_verbs.h | 7 +- 12 files changed, 159 insertions(+), 115 deletions(-) diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index 0541761..d3b65d4 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -41,6 +41,8 @@ #include <rdma/ib_umem.h> #include <rdma/ib_umem_odp.h> +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ static void ib_umem_notifier_start_account(struct ib_umem *item) { mutex_lock(&item->odp_data->umem_mutex); @@ -667,3 +669,4 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 virt, mutex_unlock(&umem->odp_data->umem_mutex); } EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages); +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index bbb02ff..53163aa 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -289,9 +289,12 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, struct ib_uverbs_get_context_resp resp; struct ib_udata udata; struct ib_device *ibdev = file->device->ib_dev; -#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING +#if 
IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ struct ib_device_attr dev_attr; -#endif +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ struct ib_ucontext *ucontext; struct file *filp; int ret; @@ -334,7 +337,9 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, rcu_read_unlock(); ucontext-closing = 0; -#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ ucontext-umem_tree = RB_ROOT; init_rwsem(ucontext-umem_rwsem); ucontext-odp_mrs_count = 0; @@ -345,8 +350,8 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, goto err_free; if (!(dev_attr.device_cap_flags IB_DEVICE_ON_DEMAND_PAGING)) ucontext-invalidate_range = NULL; - -#endif +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ resp.num_comp_vectors = file-device-num_comp_vectors; @@ -3438,7 +3443,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, if (ucore-outlen resp.response_length + sizeof(resp.odp_caps)) goto end; -#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ resp.odp_caps.general_caps = attr.odp_caps.general_caps; resp.odp_caps.per_transport_caps.rc_odp_caps = attr.odp_caps.per_transport_caps.rc_odp_caps; @@ -3447,9 +3454,10 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, resp.odp_caps.per_transport_caps.ud_odp_caps = attr.odp_caps.per_transport_caps.ud_odp_caps; resp.odp_caps.reserved = 0; -#else +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ memset(resp.odp_caps, 0, sizeof(resp.odp_caps)); -#endif 
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ resp.response_length += sizeof(resp.odp_caps); if (ucore-outlen resp.response_length + sizeof(resp.timestamp_mask)) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index 085c24b..da31c70 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -293,11 +293,14 @@ static int mlx5_ib_query_device(struct ib_device
[RFC PATCH 6/8 v2] IB/mlx5/hmm: add mlx5 HMM device initialization and callback v3.
This adds the core HMM callbacks for the mlx5 device driver and initializes the HMM device for the mlx5 infiniband device driver. Changed since v1: - Adapt to new hmm_mirror lifetime rules. - HMM_ISDIRTY no longer exists. Changed since v2: - Adapt to HMM page table changes. Signed-off-by: Jérôme Glisse jgli...@redhat.com Signed-off-by: John Hubbard jhubb...@nvidia.com --- drivers/infiniband/core/umem_odp.c | 10 +- drivers/infiniband/hw/mlx5/main.c| 5 + drivers/infiniband/hw/mlx5/mem.c | 38 drivers/infiniband/hw/mlx5/mlx5_ib.h | 17 drivers/infiniband/hw/mlx5/mr.c | 7 ++ drivers/infiniband/hw/mlx5/odp.c | 174 +++ include/rdma/ib_umem_odp.h | 17 7 files changed, 264 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index bcbc2c2..b7dd8228 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -132,7 +132,7 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem) return -ENOMEM; } kref_init(&ib_mirror->kref); - init_rwsem(&ib_mirror->hmm_mr_rwsem); + init_rwsem(&ib_mirror->umem_rwsem); ib_mirror->umem_tree = RB_ROOT; ib_mirror->ib_device = ib_device; @@ -149,10 +149,11 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem) context->ib_mirror = ib_mirror_ref(ib_mirror); } mutex_unlock(&ib_device->hmm_mutex); - umem->odp_data.ib_mirror = ib_mirror; + umem->odp_data->ib_mirror = ib_mirror; down_write(&ib_mirror->umem_rwsem); - rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree); + rbt_ib_umem_insert(&umem->odp_data->interval_tree, + &ib_mirror->umem_tree); up_write(&ib_mirror->umem_rwsem); mmput(mm); @@ -178,7 +179,8 @@ void ib_umem_odp_release(struct ib_umem *umem) * range covered by one and only one umem while holding the umem rwsem. 
*/ down_write(ib_mirror-umem_rwsem); - rbt_ib_umem_remove(umem-odp_data-interval_tree, mirror-umem_tree); + rbt_ib_umem_remove(umem-odp_data-interval_tree, + ib_mirror-umem_tree); up_write(ib_mirror-umem_rwsem); ib_mirror_unref(ib_mirror); diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index da31c70..32ed2f1 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -1530,6 +1530,9 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev) if (err) goto err_rsrc; + /* If HMM initialization fails we just do not enable odp. */ + mlx5_dev_init_odp_hmm(dev-ib_dev, mdev-pdev-dev); + err = ib_register_device(dev-ib_dev, NULL); if (err) goto err_odp; @@ -1554,6 +1557,7 @@ err_umrc: err_dev: ib_unregister_device(dev-ib_dev); + mlx5_dev_fini_odp_hmm(dev-ib_dev); err_odp: mlx5_ib_odp_remove_one(dev); @@ -1573,6 +1577,7 @@ static void mlx5_ib_remove(struct mlx5_core_dev *mdev, void *context) ib_unregister_device(dev-ib_dev); destroy_umrc_res(dev); + mlx5_dev_fini_odp_hmm(dev-ib_dev); mlx5_ib_odp_remove_one(dev); destroy_dev_resources(dev-devr); ib_dealloc_device(dev-ib_dev); diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c index 19354b6..0d74eac 100644 --- a/drivers/infiniband/hw/mlx5/mem.c +++ b/drivers/infiniband/hw/mlx5/mem.c @@ -154,6 +154,8 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, __be64 *pas, int access_flags, void *data) { unsigned long umem_page_shift = ilog2(umem-page_size); + unsigned long start = ib_umem_start(umem) + (offset PAGE_SHIFT); + unsigned long end = start + (num_pages PAGE_SHIFT); int shift = page_shift - umem_page_shift; int mask = (1 shift) - 1; int i, k; @@ -164,6 +166,42 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, int entry; #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) + if (umem-odp_data) { + struct ib_mirror *ib_mirror = 
umem-odp_data-ib_mirror; + struct hmm_mirror *mirror = ib_mirror-base; + struct hmm_pt_iter *iter = data, local_iter; + unsigned long addr; + + if (iter == NULL) { + iter = local_iter; + hmm_pt_iter_init(iter, mirror-pt); + } + + for (i=0, addr=start; i num_pages; ++i, addr+=PAGE_SIZE) { + unsigned long next = end; + dma_addr_t *ptep, pte; + + /* Get and
Re: [GIT PULL] pull request: linux-firmware: Add Intel OPA hfi1 firmware
On Tue, Aug 11, 2015 at 05:35:53PM -0600, Jason Gunthorpe wrote: On Tue, Aug 11, 2015 at 10:47:03PM +, Vogel, Steve wrote: The license terms allow anyone to distribute (but not sell) the firmware but only for use on Intel products. Redistribution alone may be enough to be included in linux-firmware. However, most of the additional terms (and there are lots of them) this imposes beyond the usual likely make it impossible to include in a distro, so pragmatically, there is no reason to push for inclusion in linux-firmware. This is going to be a hard road for you guys. Falling in line with every other Intel firmware blob's (i915, ibt, iwlwifi, SST2) license would be much easier on you and the distros. Frankly, I think the onus is on you to get statements from the licensing teams at Fedora, Debian, RH and SuSE on if they can include work under this license or not. I suspect Fedora and Debian will both say they can't, just based on their public policies and the additional restrictions in this license.. But hey, I'm not a licensing lawyer.. I just noticed that the email from Steve that Jason replied to did not make it to the lists. Here is the text from Steve for reference. quote Here is an interpretation of the grant language: 2.11 Grant. Subject to Your compliance with the terms of this Agreement, and the limitations set forth in Section 2.2, Intel hereby grants You, during the term of this Agreement, a non-transferable, non-exclusive, non-sublicenseable (except as expressly set forth below), limited right and license: (A) under Intel's copyrights, to: (1) reproduce and execute the Software only for internal use with Intel Products, including designing products for Intel Products; this license does not include the right to sublicense, and may be exercised only within Your facilities by Your employees; [This allows anyone obtaining the software to make copies and use the software, but not to re-license it.] 
(2) distribute the unmodified Software only in Object Code, only for use with Intel Products; this license includes the right to sublicense, but only the rights to execute the Software and only under Intel's End User Software License Agreement attached as Attachment B, without the right to further sublicense; [This allows anyone to re-distribute the software for use on Intel products and requires them to re-distribute with the license in Attachment B] /quote Ira
RE: [PATCH v2] IB/hfi1: Remove inline from trace functions
Subject: [PATCH v2] IB/hfi1: Remove inline from trace functions v2 adjusts some of the comment text to clarify adding new traces.
[RFC PATCH 0/8 v2] Implement ODP using HMM v2
Posting just for comment; still waiting on HMM to be accepted before this patchset can be considered for inclusion. This patchset implements the on demand paging feature using HMM. It depends on the HMM patchset v10 (previous post (1)). The long term plan is to replace ODP with HMM, allowing the same code infrastructure to be shared across different classes of devices. HMM (Heterogeneous Memory Management) is a helper layer for devices that want to mirror a process address space into their own mmu. The main target is GPUs, but other hardware, like network devices, can also use HMM. Tree with the patchset: git://people.freedesktop.org/~glisse/linux hmm-v10 branch (1) Previous patchset postings: v1 http://lwn.net/Articles/597289/ v2 https://lkml.org/lkml/2014/6/12/559 v3 https://lkml.org/lkml/2014/6/13/633 v4 https://lkml.org/lkml/2014/8/29/423 v5 https://lkml.org/lkml/2014/11/3/759 v6 http://lwn.net/Articles/619737/ v7 http://lwn.net/Articles/627316/ v8 https://lwn.net/Articles/645515/ v9 https://lwn.net/Articles/651553/ Cheers, Jérôme To: linux-rdma@vger.kernel.org, To: linux-ker...@vger.kernel.org, Cc: Kevin E Martin k...@redhat.com, Cc: Christophe Harle cha...@nvidia.com, Cc: Duncan Poole dpo...@nvidia.com, Cc: Sherry Cheung sche...@nvidia.com, Cc: Subhash Gutti sgu...@nvidia.com, Cc: John Hubbard jhubb...@nvidia.com, Cc: Mark Hairgrove mhairgr...@nvidia.com, Cc: Lucien Dunning ldunn...@nvidia.com, Cc: Cameron Buschardt cabuscha...@nvidia.com, Cc: Arvind Gopalakrishnan arvi...@nvidia.com, Cc: Haggai Eran hagg...@mellanox.com, Cc: Or Gerlitz ogerl...@mellanox.com, Cc: Sagi Grimberg sa...@mellanox.com Cc: Shachar Raindel rain...@mellanox.com, Cc: Liran Liss lir...@mellanox.com, Cc: Roland Dreier rol...@purestorage.com, Cc: Sander, Ben ben.san...@amd.com, Cc: Stoner, Greg greg.sto...@amd.com, Cc: Bridgman, John john.bridg...@amd.com, Cc: Mantor, Michael michael.man...@amd.com, Cc: Blinzer, Paul paul.blin...@amd.com, Cc: Morichetti, Laurent laurent.moriche...@amd.com, Cc: Deucher, 
Alexander alexander.deuc...@amd.com, Cc: Leonid Shamis leonid.sha...@amd.com,
[RFC PATCH 7/8 v2] IB/mlx5/hmm: add page fault support for ODP on HMM v2.
This patch adds HMM-specific support for hardware page faulting of user memory regions. Changed since v1: - Adapt to HMM page table changes. - Turn some sanity tests into BUG_ON(). Signed-off-by: Jérôme Glisse jgli...@redhat.com --- drivers/infiniband/hw/mlx5/odp.c | 144 ++- 1 file changed, 143 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c index 5ef31da..658bfca 100644 --- a/drivers/infiniband/hw/mlx5/odp.c +++ b/drivers/infiniband/hw/mlx5/odp.c @@ -54,6 +54,52 @@ static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev, #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) +struct mlx5_hmm_pfault { + struct mlx5_ib_mr *mlx5_ib_mr; + u64 start_idx; + dma_addr_t access_mask; + unsigned npages; + struct hmm_event event; +}; + +static int mlx5_hmm_pfault(struct mlx5_ib_dev *mlx5_ib_dev, + struct hmm_mirror *mirror, + const struct hmm_event *event) +{ + struct mlx5_hmm_pfault *pfault; + struct hmm_pt_iter iter; + unsigned long addr, cnt; + int ret; + + pfault = container_of(event, struct mlx5_hmm_pfault, event); + hmm_pt_iter_init(&iter, &mirror->pt); + + for (addr = event->start, cnt = 0; addr < event->end; +addr += PAGE_SIZE, ++cnt) { + unsigned long next = event->end; + dma_addr_t *ptep; + + /* Get and lock pointer to mirror page table. */ + ptep = hmm_pt_iter_lookup(&iter, addr, &next); + BUG_ON(!ptep); + for (; ptep && addr < next; addr += PAGE_SIZE, ptep++) { + /* This could be BUG_ON() as it can not happen. 
*/ + BUG_ON(!hmm_pte_test_valid_dma(ptep)); + BUG_ON((pfault-access_mask ODP_WRITE_ALLOWED_BIT) + !hmm_pte_test_write(ptep)); + if (hmm_pte_test_write(ptep)) + hmm_pte_set_bit(ptep, ODP_WRITE_ALLOWED_SHIFT); + hmm_pte_set_bit(ptep, ODP_READ_ALLOWED_SHIFT); + pfault-npages++; + } + } + ret = mlx5_ib_update_mtt(pfault-mlx5_ib_mr, +pfault-start_idx, +cnt, 0, iter); + hmm_pt_iter_fini(iter); + return ret; +} + int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start, u64 end, void *cookie) { @@ -179,12 +225,19 @@ static int mlx5_hmm_update(struct hmm_mirror *mirror, struct hmm_event *event) { struct device *device = mirror-device-dev; + struct mlx5_ib_dev *mlx5_ib_dev; + struct ib_device *ib_device; int ret = 0; + ib_device = container_of(mirror-device, struct ib_device, hmm_dev); + mlx5_ib_dev = to_mdev(ib_device); + switch (event-etype) { case HMM_DEVICE_RFAULT: case HMM_DEVICE_WFAULT: - /* FIXME implement. */ + ret = mlx5_hmm_pfault(mlx5_ib_dev, mirror, event); + if (ret) + return ret; break; case HMM_NONE: default: @@ -227,6 +280,95 @@ void mlx5_dev_fini_odp_hmm(struct ib_device *ib_device) hmm_device_unregister(ib_device-hmm_dev); } +/* + * Handle a single data segment in a page-fault WQE. + * + * Returns number of pages retrieved on success. The caller will continue to + * the next data segment. + * Can return the following error codes: + * -EAGAIN to designate a temporary error. The caller will abort handling the + * page fault and resolve it. + * -EFAULT when there's an error mapping the requested pages. The caller will + * abort the page fault handling and possibly move the QP to an error state. + * On other errors the QP should also be closed with an error. 
+ */ +static int pagefault_single_data_segment(struct mlx5_ib_qp *qp, +struct mlx5_ib_pfault *pfault, +u32 key, u64 io_virt, size_t bcnt, +u32 *bytes_mapped) +{ + struct mlx5_ib_dev *mlx5_ib_dev = to_mdev(qp-ibqp.pd-device); + struct ib_mirror *ib_mirror; + struct mlx5_hmm_pfault hmm_pfault; + int srcu_key; + int ret = 0; + + srcu_key = srcu_read_lock(mlx5_ib_dev-mr_srcu); + hmm_pfault.mlx5_ib_mr = mlx5_ib_odp_find_mr_lkey(mlx5_ib_dev, key); + /* +* If we didn't find the MR, it means the MR was closed while we were +* handling the ODP event. In this case we return -EFAULT so that the +* QP will be closed. +*/ + if (!hmm_pfault.mlx5_ib_mr || !hmm_pfault.mlx5_ib_mr-ibmr.pd) { + pr_err(Failed to find relevant mr for
[RFC PATCH 8/8 v2] IB/mlx5/hmm: enable ODP using HMM v2.
All pieces are in place for ODP (on demand paging) to work using HMM. Add a kernel option and the final code to enable it. Changed since v1: - Added the kernel option in this last patch of the series. Signed-off-by: Jérôme Glisse jgli...@redhat.com --- drivers/infiniband/Kconfig | 10 ++ drivers/infiniband/core/uverbs_cmd.c | 3 --- drivers/infiniband/hw/mlx5/main.c| 4 3 files changed, 14 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index b899531..764f524 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -49,6 +49,16 @@ config INFINIBAND_ON_DEMAND_PAGING memory regions without pinning their pages, fetching the pages on demand instead. +config INFINIBAND_ON_DEMAND_PAGING_HMM + bool "InfiniBand on-demand paging support using HMM." + depends on HMM + depends on INFINIBAND_ON_DEMAND_PAGING + default n + ---help--- + Use HMM (heterogeneous memory management) kernel API for + on demand paging. No userspace difference, this is just + an alternative implementation of the feature. 
+ config INFINIBAND_ADDR_TRANS bool depends on INFINIBAND diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 1db6a17..c3e14a8 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3445,8 +3445,6 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, goto end; #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) -#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) -#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ resp.odp_caps.general_caps = attr.odp_caps.general_caps; resp.odp_caps.per_transport_caps.rc_odp_caps = attr.odp_caps.per_transport_caps.rc_odp_caps; @@ -3455,7 +3453,6 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, resp.odp_caps.per_transport_caps.ud_odp_caps = attr.odp_caps.per_transport_caps.ud_odp_caps; resp.odp_caps.reserved = 0; -#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ memset(resp.odp_caps, 0, sizeof(resp.odp_caps)); #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index 32ed2f1..c340c3a 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -295,6 +295,10 @@ static int mlx5_ib_query_device(struct ib_device *ibdev, #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) #if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) + if (MLX5_CAP_GEN(mdev, pg) ibdev-hmm_ready) { + props-device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING; + props-odp_caps = dev-odp_caps; + } #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ if (MLX5_CAP_GEN(mdev, pg)) props-device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] IB/hfi1: Remove inline from trace functions
From: Dennis Dalessandro dennis.dalessan...@intel.com inline in trace functions causes the following build error when CONFIG_OPTIMIZE_INLINING is not defined in the kernel config: error: function can never be inlined because it uses variable argument lists There are all manner of tracing things in the kernel. Does this driver really need a custom designed one? All of our trace infrastructure is built out of events/tracepoints, so we are not inventing anything new here. The fast path traces are built out of tracepoints. The *_cdbg() ones are intended for slow path code and, compared to the native trace points, make it easier to add new trace capabilities. Mike
[PATCH] RDMA/cma: fix IPv6 address resolution
Resolving a link-local IPv6 address with an unspecified source address was broken by commit 5462eddd7a, which prevented the IPv6 stack from learning the scope id of the link-local IPv6 address, causing random failures as the IP stack chose a random link to resolve the address on. Commit 5462eddd7a made us bail out of cma_check_linklocal early if the address passed in was not an IPv6 link-local address. On the address resolution path, the address passed in is the source address; if the source address is the unspecified address, which is not link-local, we will bail out early. This is mostly correct, but if the destination address is a link-local address, then we will be following a link-local route, and we'll need to tell the IPv6 stack what the scope id of the destination address is. This used to be done by the last line of cma_check_linklocal, which is skipped when bailing out early: dev_addr->bound_dev_if = sin6->sin6_scope_id; (In cma_bind_addr, the sin6_scope_id of the source address is set to the sin6_scope_id of the destination address, so this is correct.) This line is required in turn for the following line, L279 of addr6_resolve, to actually inform the IPv6 stack of the scope id: fl6.flowi6_oif = addr->bound_dev_if; Since we can only know we are in this failure case when we have access to both the source IPv6 address and destination IPv6 address, we have to deal with this further up the stack. So detect this failure case in cma_bind_addr, and set bound_dev_if to the destination address scope id to correct it. 
Signed-off-by: Spencer Baugh sba...@catern.com --- drivers/infiniband/core/cma.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 6a6b60a..3b71154 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -2188,8 +2188,11 @@ static int cma_bind_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, src_addr = (struct sockaddr *) &id->route.addr.src_addr; src_addr->sa_family = dst_addr->sa_family; if (dst_addr->sa_family == AF_INET6) { - ((struct sockaddr_in6 *) src_addr)->sin6_scope_id = - ((struct sockaddr_in6 *) dst_addr)->sin6_scope_id; + struct sockaddr_in6 *src_addr6 = (struct sockaddr_in6 *) src_addr; + struct sockaddr_in6 *dst_addr6 = (struct sockaddr_in6 *) dst_addr; + src_addr6->sin6_scope_id = dst_addr6->sin6_scope_id; + if (ipv6_addr_type(&dst_addr6->sin6_addr) & IPV6_ADDR_LINKLOCAL) + id->route.addr.dev_addr.bound_dev_if = dst_addr6->sin6_scope_id; } else if (dst_addr->sa_family == AF_IB) { ((struct sockaddr_ib *) src_addr)->sib_pkey = ((struct sockaddr_ib *) dst_addr)->sib_pkey; -- 2.5.0.rc3
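The scope id the patch propagates is easy to observe from userspace: a link-local address is ambiguous without one, and the resolver encodes the zone suffix into sin6_scope_id. A minimal standalone sketch (assuming a Linux host with a loopback interface named "lo"; the address fe80::1 is purely illustrative):

```c
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <net/if.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	struct addrinfo hints, *res;
	struct sockaddr_in6 *sin6;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET6;
	hints.ai_socktype = SOCK_DGRAM;
	hints.ai_flags = AI_NUMERICHOST;

	/* The "%lo" zone suffix becomes sin6_scope_id: without it a
	 * fe80::/10 address does not identify a link, which is exactly
	 * the ambiguity the patch above fixes in the kernel. */
	if (getaddrinfo("fe80::1%lo", NULL, &hints, &res))
		return 1;

	sin6 = (struct sockaddr_in6 *) res->ai_addr;
	printf("%d\n", sin6->sin6_scope_id == if_nametoindex("lo"));
	freeaddrinfo(res);
	return 0;
}
```

The printed value is 1 when the resolved scope id matches the interface index, which is the same value the kernel copies into bound_dev_if and flowi6_oif.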
Re: [RFC PATCH 4/8 v2] IB/odp/hmm: prepare for HMM code path.
On Thu, Aug 13, 2015 at 02:13:35PM -0600, Jason Gunthorpe wrote: On Thu, Aug 13, 2015 at 03:20:49PM -0400, Jérôme Glisse wrote: +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ Yuk, what is wrong with #if !IS_ENABLED(...) ? Just that later patches add code between the #if and #else, and originally this was a bigger patch that added the #if, the code, and the #else at the same time. Hence why this patch looks like this. -#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ Double yuk #if !(IS_ENABLED(..) && IS_ENABLED(..)) ? Same reason as above. And the #ifdefs suck, as many as possible should be normal if statements, and one should think carefully about whether we really need to remove fields from structures.. My patch only adds #if; I am not responsible for the previous code that used #ifdef. I was told to convert to #if, and that's what I am doing. Regarding fields, yes, this is intentional: ODP is an infrastructure that is private to InfiniBand and thus needs more fields inside the ib structs, while HMM is intended to be a common infrastructure, not only for IB devices but for other kinds of devices too. Cheers, Jérôme
[RFC PATCH 3/8 v2] IB/odp: export rbt_ib_umem_for_each_in_range()
The mlx5 driver will need this function for its driver specific bit of ODP (on demand paging) on HMM (Heterogeneous Memory Management). Signed-off-by: Jérôme Glisse jgli...@redhat.com --- drivers/infiniband/core/umem_rbtree.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/infiniband/core/umem_rbtree.c b/drivers/infiniband/core/umem_rbtree.c index 727d788..f030ec0 100644 --- a/drivers/infiniband/core/umem_rbtree.c +++ b/drivers/infiniband/core/umem_rbtree.c @@ -92,3 +92,4 @@ int rbt_ib_umem_for_each_in_range(struct rb_root *root, return ret_val; } +EXPORT_SYMBOL(rbt_ib_umem_for_each_in_range); -- 1.9.3
[RFC PATCH 1/8 v2] IB/mlx5: add a new parameter to __mlx_ib_populated_pas for ODP with HMM.
When using HMM for ODP it will be useful to pass the current mirror page table iterator to the __mlx5_ib_populate_pas() function. Add a void * parameter for this. Signed-off-by: Jérôme Glisse jgli...@redhat.com --- drivers/infiniband/hw/mlx5/mem.c | 8 +--- drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +- drivers/infiniband/hw/mlx5/mr.c | 2 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c index 40df2cc..df56b7d 100644 --- a/drivers/infiniband/hw/mlx5/mem.c +++ b/drivers/infiniband/hw/mlx5/mem.c @@ -145,11 +145,13 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma) * num_pages - total number of pages to fill * pas - bus addresses array to fill * access_flags - access flags to set on all present pages. - use enum mlx5_ib_mtt_access_flags for this. + *use enum mlx5_ib_mtt_access_flags for this. + * data - intended for odp with hmm, it should point to current mirror page + *table iterator. */ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, int page_shift, size_t offset, size_t num_pages, - __be64 *pas, int access_flags) + __be64 *pas, int access_flags, void *data) { unsigned long umem_page_shift = ilog2(umem->page_size); int shift = page_shift - umem_page_shift; @@ -201,7 +203,7 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, { return __mlx5_ib_populate_pas(dev, umem, page_shift, 0, ib_umem_num_pages(umem), pas, - access_flags); + access_flags, NULL); } int mlx5_ib_get_buf_offset(u64 addr, int page_shift, u32 *offset) { diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index 7cae098..d4dbd8e 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -622,7 +622,7 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift, int *ncont, int *order); void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, int page_shift, size_t 
offset, size_t num_pages, - __be64 *pas, int access_flags); + __be64 *pas, int access_flags, void *data); void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, int page_shift, __be64 *pas, int access_flags); void mlx5_ib_copy_pas(u64 *old, u64 *new, int step, int num); diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c index bc9a0de..ef63e5f 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -912,7 +912,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages, if (!zap) { __mlx5_ib_populate_pas(dev, umem, PAGE_SHIFT, start_page_index, npages, pas, - MLX5_IB_MTT_PRESENT); + MLX5_IB_MTT_PRESENT, NULL); /* Clear padding after the pages brought from the * umem. */ memset(pas + npages, 0, size - npages * sizeof(u64)); -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 2/8 v2] IB/mlx5: add a new parameter to mlx5_ib_update_mtt() for ODP with HMM.
When using HMM for ODP it will be useful to pass the current mirror page table iterator to the mlx5_ib_update_mtt() function. Add a void * parameter for this. Signed-off-by: Jérôme Glisse jgli...@redhat.com --- drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +- drivers/infiniband/hw/mlx5/mr.c | 4 ++-- drivers/infiniband/hw/mlx5/odp.c | 8 +--- 3 files changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index d4dbd8e..79d1e7c 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -571,7 +571,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt_addr, int access_flags, struct ib_udata *udata); int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, - int npages, int zap); + int npages, int zap, void *data); int mlx5_ib_dereg_mr(struct ib_mr *ibmr); int mlx5_ib_destroy_mr(struct ib_mr *ibmr); struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd, diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c index ef63e5f..3ad371d 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -845,7 +845,7 @@ free_mr: #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages, - int zap) + int zap, void *data) { struct mlx5_ib_dev *dev = mr->dev; struct device *ddev = dev->ib_dev.dma_device; @@ -912,7 +912,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages, if (!zap) { __mlx5_ib_populate_pas(dev, umem, PAGE_SHIFT, start_page_index, npages, pas, - MLX5_IB_MTT_PRESENT, NULL); + MLX5_IB_MTT_PRESENT, data); /* Clear padding after the pages brought from the * umem. */ memset(pas + npages, 0, size - npages * sizeof(u64)); diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c index aa8391e..df86d05 100644 --- a/drivers/infiniband/hw/mlx5/odp.c +++ b/drivers/infiniband/hw/mlx5/odp.c @@ -91,14 +91,15 @@ void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start, if (in_block && umr_offset == 0) { mlx5_ib_update_mtt(mr, blk_start_idx, - idx - blk_start_idx, 1); + idx - blk_start_idx, 1, + NULL); in_block = 0; } } } if (in_block) mlx5_ib_update_mtt(mr, blk_start_idx, idx - blk_start_idx + 1, - 1); + 1, NULL); /* * We are now sure that the device will not access the @@ -249,7 +250,8 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp *qp, * this MR, since ib_umem_odp_map_dma_pages already * checks this. */ - ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0); + ret = mlx5_ib_update_mtt(mr, start_idx, +npages, 0, NULL); } else { ret = -EAGAIN; } -- 1.9.3
[PATCH v2] IB/hfi1: Remove inline from trace functions
From: Dennis Dalessandro dennis.dalessan...@intel.com Inline in trace functions causes the following build error when CONFIG_OPTIMIZE_INLINING is not defined in the kernel config: error: function can never be inlined because it uses variable argument lists Reported by 0-day build: https://lists.01.org/pipermail/kbuild-all/2015-August/011215.html This patch converts to a non-inline version of the hfi1 trace functions Reviewed-by: Jubin John jubin.j...@intel.com Reviewed-by: Mike Marciniszyn mike.marcinis...@intel.com Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com --- drivers/staging/hfi1/trace.c | 15 ++- drivers/staging/hfi1/trace.h | 56 +- 2 files changed, 35 insertions(+), 36 deletions(-) diff --git a/drivers/staging/hfi1/trace.c b/drivers/staging/hfi1/trace.c index afbb212..ea95591 100644 --- a/drivers/staging/hfi1/trace.c +++ b/drivers/staging/hfi1/trace.c @@ -48,7 +48,6 @@ * */ #define CREATE_TRACE_POINTS -#define HFI1_TRACE_DO_NOT_CREATE_INLINES #include trace.h u8 ibhdr_exhdr_len(struct hfi1_ib_header *hdr) @@ -208,4 +207,16 @@ const char *print_u64_array( return ret; } -#undef HFI1_TRACE_DO_NOT_CREATE_INLINES +__hfi1_trace_fn(PKT); +__hfi1_trace_fn(PROC); +__hfi1_trace_fn(SDMA); +__hfi1_trace_fn(LINKVERB); +__hfi1_trace_fn(DEBUG); +__hfi1_trace_fn(SNOOP); +__hfi1_trace_fn(CNTR); +__hfi1_trace_fn(PIO); +__hfi1_trace_fn(DC8051); +__hfi1_trace_fn(FIRMWARE); +__hfi1_trace_fn(RCVCTRL); +__hfi1_trace_fn(TID); + diff --git a/drivers/staging/hfi1/trace.h b/drivers/staging/hfi1/trace.h index 5c34606..d7851c0 100644 --- a/drivers/staging/hfi1/trace.h +++ b/drivers/staging/hfi1/trace.h @@ -1339,22 +1339,17 @@ DECLARE_EVENT_CLASS(hfi1_trace_template, /* * It may be nice to macroize the __hfi1_trace but the va_* stuff requires an - * actual function to work and can not be in a macro. Also the fmt can not be a - * constant char * because we need to be able to manipulate the \n if it is - * present. + * actual function to work and can not be in a macro. 
*/ -#define __hfi1_trace_event(lvl) \ +#define __hfi1_trace_def(lvl) \ +void __hfi1_trace_##lvl(const char *funct, char *fmt, ...);\ + \ DEFINE_EVENT(hfi1_trace_template, hfi1_ ##lvl, \ TP_PROTO(const char *function, struct va_format *vaf), \ TP_ARGS(function, vaf)) -#ifdef HFI1_TRACE_DO_NOT_CREATE_INLINES -#define __hfi1_trace_fn(fn) __hfi1_trace_event(fn) -#else -#define __hfi1_trace_fn(fn) \ -__hfi1_trace_event(fn); \ -__printf(2, 3) \ -static inline void __hfi1_trace_##fn(const char *func, char *fmt, ...) \ +#define __hfi1_trace_fn(lvl) \ +void __hfi1_trace_##lvl(const char *func, char *fmt, ...) \ { \ struct va_format vaf = {\ .fmt = fmt, \ @@ -1363,36 +1358,29 @@ static inline void __hfi1_trace_##fn(const char *func, char *fmt, ...) \ \ va_start(args, fmt);\ vaf.va = args; \ - trace_hfi1_ ##fn(func, vaf); \ + trace_hfi1_ ##lvl(func, vaf); \ va_end(args); \ return; \ } -#endif /* - * To create a new trace level simply define it as below. This will create all - * the hooks for calling hfi1_cdbg(LVL, fmt, ...); as well as take care of all + * To create a new trace level simply define it below and as a __hfi1_trace_fn + * in trace.c. This will create all the hooks for calling + * hfi1_cdbg(LVL, fmt, ...); as well as take care of all * the debugfs stuff. 
*/ -__hfi1_trace_fn(RVPKT); -__hfi1_trace_fn(INIT); -__hfi1_trace_fn(VERB); -__hfi1_trace_fn(PKT); -__hfi1_trace_fn(PROC); -__hfi1_trace_fn(MM); -__hfi1_trace_fn(ERRPKT); -__hfi1_trace_fn(SDMA); -__hfi1_trace_fn(VPKT); -__hfi1_trace_fn(LINKVERB); -__hfi1_trace_fn(VERBOSE); -__hfi1_trace_fn(DEBUG); -__hfi1_trace_fn(SNOOP); -__hfi1_trace_fn(CNTR); -__hfi1_trace_fn(PIO); -__hfi1_trace_fn(DC8051); -__hfi1_trace_fn(FIRMWARE); -__hfi1_trace_fn(RCVCTRL); -__hfi1_trace_fn(TID); +__hfi1_trace_def(PKT); +__hfi1_trace_def(PROC); +__hfi1_trace_def(SDMA); +__hfi1_trace_def(LINKVERB); +__hfi1_trace_def(DEBUG); +__hfi1_trace_def(SNOOP); +__hfi1_trace_def(CNTR); +__hfi1_trace_def(PIO); +__hfi1_trace_def(DC8051); +__hfi1_trace_def(FIRMWARE); +__hfi1_trace_def(RCVCTRL); +__hfi1_trace_def(TID); #define hfi1_cdbg(which, fmt, ...) \ __hfi1_trace_##which(__func__, fmt, ##__VA_ARGS__) -- To unsubscribe from this list:
Re: [RFC PATCH 4/8 v2] IB/odp/hmm: prepare for HMM code path.
On Thu, Aug 13, 2015 at 03:20:49PM -0400, Jérôme Glisse wrote: +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ Yuk, what is wrong with #if !IS_ENABLED(...) ? -#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM) +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */ Double yuk #if !(IS_ENABLED(..) && IS_ENABLED(..)) ? And the #ifdefs suck, as many as possible should be normal if statements, and one should think carefully if we really need to remove fields from structures.. Jason
Re: [PATCH v2 12/12] rds/ib: Remove ib_get_dma_mr calls
On 7/30/2015 4:22 PM, Jason Gunthorpe wrote: The pd now has a local_dma_lkey member which completely replaces ib_get_dma_mr, use it instead. Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com --- net/rds/ib.c | 8 net/rds/ib.h | 2 -- net/rds/ib_cm.c | 4 +--- net/rds/ib_recv.c | 6 +++--- net/rds/ib_send.c | 8 5 files changed, 8 insertions(+), 20 deletions(-) I wanted to try this series earlier but couldn't because of broken RDS RDMA. Now I have that fixed with a bunch of patches soon to be posted, and I tried the series. It works as expected. The rds change also looks straightforward since ib_get_dma_mr() was being used for local write. So feel free to add the below tag if you need one. Tested-Acked-by: Santosh Shilimkar santosh.shilim...@oracle.com
[PATCH for-next 1/9] IB/core: Add gid_type to gid attribute
In order to support multiple GID types, we need to store the gid_type with each GID. This is also aligned with the RoCE v2 annex RoCEv2 PORT GID table entries shall have a GID type attribute that denotes the L3 Address type. The currently supported GID is IB_GID_TYPE_IB which is also RoCE v1 GID type. This implies that gid_type should be added to roce_gid_table meta-data. Signed-off-by: Matan Barak mat...@mellanox.com --- drivers/infiniband/core/cache.c | 127 +- drivers/infiniband/core/cm.c | 2 +- drivers/infiniband/core/cma.c | 3 +- drivers/infiniband/core/core_priv.h | 4 + drivers/infiniband/core/device.c | 9 ++- drivers/infiniband/core/multicast.c | 2 +- drivers/infiniband/core/roce_gid_mgmt.c | 60 -- drivers/infiniband/core/sa_query.c| 5 +- drivers/infiniband/core/uverbs_marshall.c | 1 + drivers/infiniband/core/verbs.c | 1 + include/rdma/ib_cache.h | 4 + include/rdma/ib_sa.h | 1 + include/rdma/ib_verbs.h | 11 ++- 13 files changed, 177 insertions(+), 53 deletions(-) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index e62b63c..513a1ef 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -64,6 +64,7 @@ enum gid_attr_find_mask { GID_ATTR_FIND_MASK_GID = 1UL 0, GID_ATTR_FIND_MASK_NETDEV = 1UL 1, GID_ATTR_FIND_MASK_DEFAULT = 1UL 2, + GID_ATTR_FIND_MASK_GID_TYPE = 1UL 3, }; enum gid_table_entry_props { @@ -112,6 +113,19 @@ struct ib_gid_table { struct ib_gid_table_entry *data_vec; }; +static const char * const gid_type_str[] = { + [IB_GID_TYPE_IB]= IB/RoCE v1, +}; + +const char *ib_cache_gid_type_str(enum ib_gid_type gid_type) +{ + if (gid_type ARRAY_SIZE(gid_type_str) gid_type_str[gid_type]) + return gid_type_str[gid_type]; + + return Invalid GID type; +} +EXPORT_SYMBOL(ib_cache_gid_type_str); + static int write_gid(struct ib_device *ib_dev, u8 port, struct ib_gid_table *table, int ix, const union ib_gid *gid, @@ -216,6 +230,10 @@ static int find_gid(struct ib_gid_table *table, const union ib_gid *gid, 
if (table-data_vec[i].props GID_TABLE_ENTRY_INVALID) goto next; + if (mask GID_ATTR_FIND_MASK_GID_TYPE + attr-gid_type != val-gid_type) + goto next; + if (mask GID_ATTR_FIND_MASK_GID memcmp(gid, table-data_vec[i].gid, sizeof(*gid))) goto next; @@ -277,6 +295,7 @@ int ib_cache_gid_add(struct ib_device *ib_dev, u8 port, mutex_lock(table-lock); ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID | + GID_ATTR_FIND_MASK_GID_TYPE | GID_ATTR_FIND_MASK_NETDEV); if (ix = 0) goto out_unlock; @@ -308,6 +327,7 @@ int ib_cache_gid_del(struct ib_device *ib_dev, u8 port, ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID | + GID_ATTR_FIND_MASK_GID_TYPE | GID_ATTR_FIND_MASK_NETDEV | GID_ATTR_FIND_MASK_DEFAULT); if (ix 0) @@ -396,11 +416,13 @@ static int _ib_cache_gid_table_find(struct ib_device *ib_dev, static int ib_cache_gid_find(struct ib_device *ib_dev, const union ib_gid *gid, +enum ib_gid_type gid_type, struct net_device *ndev, u8 *port, u16 *index) { - unsigned long mask = GID_ATTR_FIND_MASK_GID; - struct ib_gid_attr gid_attr_val = {.ndev = ndev}; + unsigned long mask = GID_ATTR_FIND_MASK_GID | +GID_ATTR_FIND_MASK_GID_TYPE; + struct ib_gid_attr gid_attr_val = {.ndev = ndev, .gid_type = gid_type}; if (ndev) mask |= GID_ATTR_FIND_MASK_NETDEV; @@ -411,14 +433,16 @@ static int ib_cache_gid_find(struct ib_device *ib_dev, int ib_find_cached_gid_by_port(struct ib_device *ib_dev, const union ib_gid *gid, + enum ib_gid_type gid_type, u8 port, struct net_device *ndev, u16 *index) { int local_index; struct ib_gid_table **ports_table = ib_dev-cache.gid_cache; struct ib_gid_table *table; - unsigned long mask = GID_ATTR_FIND_MASK_GID; - struct ib_gid_attr val = {.ndev = ndev}; + unsigned long mask = GID_ATTR_FIND_MASK_GID | +GID_ATTR_FIND_MASK_GID_TYPE; + struct ib_gid_attr val = {.ndev = ndev, .gid_type = gid_type}; if (port
[PATCH for-next 3/9] IB/core: Add gid attributes to sysfs
This patch set adds attributes of net device and gid type to each GID in the GID table. Users that use verbs directly need to specify the GID index. Since the same GID could have different types or associated net devices, users should have the ability to query the associated GID attributes. Adding these attributes to sysfs. Signed-off-by: Matan Barak mat...@mellanox.com --- drivers/infiniband/core/sysfs.c | 184 +++- 1 file changed, 182 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index b1f37d4..4d5d87a 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -37,12 +37,22 @@ #include linux/slab.h #include linux/stat.h #include linux/string.h +#include linux/netdevice.h #include rdma/ib_mad.h +struct ib_port; + +struct gid_attr_group { + struct ib_port *port; + struct kobject kobj; + struct attribute_group ndev; + struct attribute_group type; +}; struct ib_port { struct kobject kobj; struct ib_device *ibdev; + struct gid_attr_group *gid_attr_group; struct attribute_group gid_group; struct attribute_group pkey_group; u8 port_num; @@ -84,6 +94,24 @@ static const struct sysfs_ops port_sysfs_ops = { .show = port_attr_show }; +static ssize_t gid_attr_show(struct kobject *kobj, +struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct gid_attr_group, +kobj)-port; + + if (!port_attr-show) + return -EIO; + + return port_attr-show(p, port_attr, buf); +} + +static const struct sysfs_ops gid_attr_sysfs_ops = { + .show = gid_attr_show +}; + static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, char *buf) { @@ -281,6 +309,46 @@ static struct attribute *port_default_attrs[] = { NULL }; +static size_t print_ndev(struct ib_gid_attr *gid_attr, char *buf) +{ + if (!gid_attr-ndev) + return -EINVAL; + + return sprintf(buf, %s\n, gid_attr-ndev-name); +} + 
+static size_t print_gid_type(struct ib_gid_attr *gid_attr, char *buf) +{ + return sprintf(buf, %s\n, ib_cache_gid_type_str(gid_attr-gid_type)); +} + +static ssize_t _show_port_gid_attr(struct ib_port *p, + struct port_attribute *attr, + char *buf, + size_t (*print)(struct ib_gid_attr *gid_attr, + char *buf)) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + struct ib_gid_attr gid_attr = {}; + ssize_t ret; + va_list args; + + ret = ib_query_gid(p-ibdev, p-port_num, tab_attr-index, gid, + gid_attr); + if (ret) + goto err; + + ret = print(gid_attr, buf); + +err: + if (gid_attr.ndev) + dev_put(gid_attr.ndev); + va_end(args); + return ret; +} + static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, char *buf) { @@ -296,6 +364,19 @@ static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, return sprintf(buf, %pI6\n, gid.raw); } +static ssize_t show_port_gid_attr_ndev(struct ib_port *p, + struct port_attribute *attr, char *buf) +{ + return _show_port_gid_attr(p, attr, buf, print_ndev); +} + +static ssize_t show_port_gid_attr_gid_type(struct ib_port *p, + struct port_attribute *attr, + char *buf) +{ + return _show_port_gid_attr(p, attr, buf, print_gid_type); +} + static ssize_t show_port_pkey(struct ib_port *p, struct port_attribute *attr, char *buf) { @@ -451,12 +532,41 @@ static void ib_port_release(struct kobject *kobj) kfree(p); } +static void ib_port_gid_attr_release(struct kobject *kobj) +{ + struct gid_attr_group *g = container_of(kobj, struct gid_attr_group, + kobj); + struct attribute *a; + int i; + + if (g-ndev.attrs) { + for (i = 0; (a = g-ndev.attrs[i]); ++i) + kfree(a); + + kfree(g-ndev.attrs); + } + + if (g-type.attrs) { + for (i = 0; (a = g-type.attrs[i]); ++i) + kfree(a); + + kfree(g-type.attrs); + } + + kfree(g); +} + static struct kobj_type port_type = { .release
[PATCH for-next 4/9] IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type
Adding RoCE v2 GID type and port type. Vendors which support this type will get their GID table populated with RoCE v2 GIDs automatically. Signed-off-by: Matan Barak mat...@mellanox.com --- drivers/infiniband/core/cache.c | 1 + drivers/infiniband/core/roce_gid_mgmt.c | 3 ++- include/rdma/ib_verbs.h | 23 +-- 3 files changed, 24 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index 513a1ef..ddd0406 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -115,6 +115,7 @@ struct ib_gid_table { static const char * const gid_type_str[] = { [IB_GID_TYPE_IB]= IB/RoCE v1, + [IB_GID_TYPE_ROCE_UDP_ENCAP]= RoCE v2, }; const char *ib_cache_gid_type_str(enum ib_gid_type gid_type) diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c index 7dec6f2..46b52b9 100644 --- a/drivers/infiniband/core/roce_gid_mgmt.c +++ b/drivers/infiniband/core/roce_gid_mgmt.c @@ -71,7 +71,8 @@ static const struct { bool (*is_supported)(const struct ib_device *device, u8 port_num); enum ib_gid_type gid_type; } PORT_CAP_TO_GID_TYPE[] = { - {rdma_protocol_roce, IB_GID_TYPE_ROCE}, + {rdma_protocol_roce_eth_encap, IB_GID_TYPE_ROCE}, + {rdma_protocol_roce_udp_encap, IB_GID_TYPE_ROCE_UDP_ENCAP}, }; #define CAP_TO_GID_TABLE_SIZE ARRAY_SIZE(PORT_CAP_TO_GID_TYPE) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index a85926d..dd06be8 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -70,6 +70,7 @@ enum ib_gid_type { /* If link layer is Ethernet, this is RoCE V1 */ IB_GID_TYPE_IB= 0, IB_GID_TYPE_ROCE = 0, + IB_GID_TYPE_ROCE_UDP_ENCAP = 1, IB_GID_TYPE_SIZE }; @@ -398,6 +399,7 @@ union rdma_protocol_stats { #define RDMA_CORE_CAP_PROT_IB 0x0010 #define RDMA_CORE_CAP_PROT_ROCE 0x0020 #define RDMA_CORE_CAP_PROT_IWARP0x0040 +#define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x0080 #define RDMA_CORE_PORT_IBA_IB (RDMA_CORE_CAP_PROT_IB \ | RDMA_CORE_CAP_IB_MAD \ @@ 
-410,6 +412,12 @@ union rdma_protocol_stats { | RDMA_CORE_CAP_IB_CM \ | RDMA_CORE_CAP_AF_IB \ | RDMA_CORE_CAP_ETH_AH) +#define RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP \ + (RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP \ + | RDMA_CORE_CAP_IB_MAD \ + | RDMA_CORE_CAP_IB_CM \ + | RDMA_CORE_CAP_AF_IB \ + | RDMA_CORE_CAP_ETH_AH) #define RDMA_CORE_PORT_IWARP (RDMA_CORE_CAP_PROT_IWARP \ | RDMA_CORE_CAP_IW_CM) #define RDMA_CORE_PORT_INTEL_OPA (RDMA_CORE_PORT_IBA_IB \ @@ -1919,6 +1927,17 @@ static inline bool rdma_protocol_ib(const struct ib_device *device, u8 port_num) static inline bool rdma_protocol_roce(const struct ib_device *device, u8 port_num) { + return device-port_immutable[port_num].core_cap_flags + (RDMA_CORE_CAP_PROT_ROCE | RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP); +} + +static inline bool rdma_protocol_roce_udp_encap(const struct ib_device *device, u8 port_num) +{ + return device-port_immutable[port_num].core_cap_flags RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP; +} + +static inline bool rdma_protocol_roce_eth_encap(const struct ib_device *device, u8 port_num) +{ return device-port_immutable[port_num].core_cap_flags RDMA_CORE_CAP_PROT_ROCE; } @@ -1929,8 +1948,8 @@ static inline bool rdma_protocol_iwarp(const struct ib_device *device, u8 port_n static inline bool rdma_ib_or_roce(const struct ib_device *device, u8 port_num) { - return device-port_immutable[port_num].core_cap_flags - (RDMA_CORE_CAP_PROT_IB | RDMA_CORE_CAP_PROT_ROCE); + return rdma_protocol_ib(device, port_num) || + rdma_protocol_roce(device, port_num); } /** -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
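The helper split above — rdma_protocol_roce() now matching either encapsulation, while the new rdma_protocol_roce_eth_encap()/rdma_protocol_roce_udp_encap() variants each test a single bit — can be sketched in userspace as follows. The bit values below are placeholders for illustration, not the kernel's actual RDMA_CORE_CAP_* constants:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative protocol capability bits; placeholder values, not the
 * real RDMA_CORE_CAP_* constants from include/rdma/ib_verbs.h. */
enum {
	CAP_PROT_IB             = 1u << 0,
	CAP_PROT_ROCE           = 1u << 1,   /* RoCE v1 (Ethernet encap) */
	CAP_PROT_ROCE_UDP_ENCAP = 1u << 2,   /* RoCE v2 (UDP encap) */
};

/* After the patch, the generic check matches either encapsulation... */
static bool protocol_roce(uint32_t core_cap_flags)
{
	return core_cap_flags & (CAP_PROT_ROCE | CAP_PROT_ROCE_UDP_ENCAP);
}

/* ...while the specific helpers test exactly one bit each. */
static bool protocol_roce_eth_encap(uint32_t core_cap_flags)
{
	return core_cap_flags & CAP_PROT_ROCE;
}

static bool protocol_roce_udp_encap(uint32_t core_cap_flags)
{
	return core_cap_flags & CAP_PROT_ROCE_UDP_ENCAP;
}
```

This mirrors why rdma_ib_or_roce() can be rewritten in terms of rdma_protocol_ib() || rdma_protocol_roce(): a RoCE v2-only port still counts as RoCE.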
[PATCH for-next 9/9] IB/cma: Join and leave multicast groups with IGMP
From: Moni Shoua mo...@mellanox.com Since RoCEv2 is a protocol over IP header it is required to send IGMP join and leave requests to the network when joining and leaving multicast groups. Signed-off-by: Moni Shoua mo...@mellanox.com --- drivers/infiniband/core/cma.c | 96 + drivers/infiniband/core/multicast.c | 20 +++- include/rdma/ib_sa.h| 3 ++ 3 files changed, 107 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index e4f4d23..35976d5 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -38,6 +38,7 @@ #include linux/in6.h #include linux/mutex.h #include linux/random.h +#include linux/igmp.h #include linux/idr.h #include linux/inetdevice.h #include linux/slab.h @@ -250,6 +251,7 @@ struct cma_multicast { void*context; struct sockaddr_storage addr; struct kref mcref; + booligmp_joined; }; struct cma_work { @@ -337,6 +339,26 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 ip_ver) hdr-ip_version = (ip_ver 4) | (hdr-ip_version 0xF); } +static int cma_igmp_send(struct net_device *ndev, union ib_gid *mgid, bool join) +{ + struct in_device *in_dev = NULL; + + if (ndev) { + rtnl_lock(); + in_dev = __in_dev_get_rtnl(ndev); + if (in_dev) { + if (join) + ip_mc_inc_group(in_dev, + *(__be32 *)(mgid-raw + 12)); + else + ip_mc_dec_group(in_dev, + *(__be32 *)(mgid-raw + 12)); + } + rtnl_unlock(); + } + return (in_dev) ? 
0 : -ENODEV; +} + static void cma_attach_to_dev(struct rdma_id_private *id_priv, struct cma_device *cma_dev) { @@ -1137,8 +1159,24 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv) id_priv-id.port_num)) { ib_sa_free_multicast(mc-multicast.ib); kfree(mc); - } else + } else { + if (mc-igmp_joined) { + struct rdma_dev_addr *dev_addr = + id_priv-id.route.addr.dev_addr; + struct net_device *ndev = NULL; + + if (dev_addr-bound_dev_if) + ndev = dev_get_by_index(init_net, + dev_addr-bound_dev_if); + if (ndev) { + cma_igmp_send(ndev, + mc-multicast.ib-rec.mgid, + false); + dev_put(ndev); + } + } kref_put(mc-mcref, release_mc); + } } } @@ -3379,7 +3417,7 @@ static int cma_iboe_join_multicast(struct rdma_id_private *id_priv, { struct iboe_mcast_work *work; struct rdma_dev_addr *dev_addr = id_priv-id.route.addr.dev_addr; - int err; + int err = 0; struct sockaddr *addr = (struct sockaddr *)mc-addr; struct net_device *ndev = NULL; @@ -3411,13 +3449,35 @@ static int cma_iboe_join_multicast(struct rdma_id_private *id_priv, mc-multicast.ib-rec.rate = iboe_get_rate(ndev); mc-multicast.ib-rec.hop_limit = 1; mc-multicast.ib-rec.mtu = iboe_get_mtu(ndev-mtu); + mc-multicast.ib-rec.ifindex = dev_addr-bound_dev_if; + mc-multicast.ib-rec.net = init_net; + rdma_ip2gid((struct sockaddr *)id_priv-id.route.addr.src_addr, + mc-multicast.ib-rec.port_gid); + + mc-multicast.ib-rec.gid_type = + id_priv-cma_dev-default_gid_type[id_priv-id.port_num - + rdma_start_port(id_priv-cma_dev-device)]; + if (addr-sa_family == AF_INET) { + if (mc-multicast.ib-rec.gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP) + err = cma_igmp_send(ndev, mc-multicast.ib-rec.mgid, + true); + if (!err) { + mc-igmp_joined = true; + mc-multicast.ib-rec.hop_limit = IPV6_DEFAULT_HOPLIMIT; + } + } else { + if (mc-multicast.ib-rec.gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP) + err = -ENOTSUPP; + else + mc-multicast.ib-rec.gid_type = IB_GID_TYPE_IB; + } dev_put(ndev); - if (!mc-multicast.ib-rec.mtu) { - err =
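The cma_igmp_send() helper above passes *(__be32 *)(mgid->raw + 12) to ip_mc_inc_group()/ip_mc_dec_group(); this works because, for IPv4 multicast, the mapped MGID carries the group address in its last four bytes. A minimal userspace sketch of that extraction (the struct name here is illustrative, standing in for union ib_gid):

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for union ib_gid: a 16-byte GID as raw bytes. */
struct mgid { uint8_t raw[16]; };

/* The IPv4 group address sits in the last 4 bytes of the mapped MGID,
 * which is exactly what *(__be32 *)(mgid->raw + 12) dereferences in
 * the patch. memcpy avoids the unaligned-access pitfall of the cast. */
static uint32_t mgid_to_ip4_group(const struct mgid *mgid)
{
	uint32_t group_be;

	memcpy(&group_be, mgid->raw + 12, sizeof(group_be));
	return group_be;   /* network byte order, ready for IGMP APIs */
}
```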
[PATCH for-next 8/9] IB/core: Initialize UD header structure with IP and UDP headers
From: Moni Shoua mo...@mellanox.com ib_ud_header_init() is used to format InfiniBand headers in a buffer up to (but not with) BTH. For RoCE UDP ENCAP it is required that this function would be able to build also IP and UDP headers. Signed-off-by: Moni Shoua mo...@mellanox.com Signed-off-by: Matan Barak mat...@mellanox.com --- drivers/infiniband/core/ud_header.c| 155 ++--- drivers/infiniband/hw/mlx4/qp.c| 7 +- drivers/infiniband/hw/mthca/mthca_qp.c | 2 +- include/rdma/ib_pack.h | 45 -- 4 files changed, 188 insertions(+), 21 deletions(-) diff --git a/drivers/infiniband/core/ud_header.c b/drivers/infiniband/core/ud_header.c index 72feee6..96697e7 100644 --- a/drivers/infiniband/core/ud_header.c +++ b/drivers/infiniband/core/ud_header.c @@ -35,6 +35,7 @@ #include linux/string.h #include linux/export.h #include linux/if_ether.h +#include linux/ip.h #include rdma/ib_pack.h @@ -116,6 +117,72 @@ static const struct ib_field vlan_table[] = { .size_bits= 16 } }; +static const struct ib_field ip4_table[] = { + { STRUCT_FIELD(ip4, ver), + .offset_words = 0, + .offset_bits = 0, + .size_bits= 4 }, + { STRUCT_FIELD(ip4, hdr_len), + .offset_words = 0, + .offset_bits = 4, + .size_bits= 4 }, + { STRUCT_FIELD(ip4, tos), + .offset_words = 0, + .offset_bits = 8, + .size_bits= 8 }, + { STRUCT_FIELD(ip4, tot_len), + .offset_words = 0, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, id), + .offset_words = 1, + .offset_bits = 0, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, frag_off), + .offset_words = 1, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, ttl), + .offset_words = 2, + .offset_bits = 0, + .size_bits= 8 }, + { STRUCT_FIELD(ip4, protocol), + .offset_words = 2, + .offset_bits = 8, + .size_bits= 8 }, + { STRUCT_FIELD(ip4, check), + .offset_words = 2, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, saddr), + .offset_words = 3, + .offset_bits = 0, + .size_bits= 32 }, + { STRUCT_FIELD(ip4, daddr), + .offset_words = 4, + .offset_bits = 0, + 
.size_bits= 32 } +}; + +static const struct ib_field udp_table[] = { + { STRUCT_FIELD(udp, sport), + .offset_words = 0, + .offset_bits = 0, + .size_bits= 16 }, + { STRUCT_FIELD(udp, dport), + .offset_words = 0, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(udp, length), + .offset_words = 1, + .offset_bits = 0, + .size_bits= 16 }, + { STRUCT_FIELD(udp, csum), + .offset_words = 1, + .offset_bits = 16, + .size_bits= 16 } +}; + static const struct ib_field grh_table[] = { { STRUCT_FIELD(grh, ip_version), .offset_words = 0, @@ -213,26 +280,57 @@ static const struct ib_field deth_table[] = { .size_bits= 24 } }; +__be16 ib_ud_ip4_csum(struct ib_ud_header *header) +{ + struct iphdr iph; + + iph.ihl = 5; + iph.version = 4; + iph.tos = header-ip4.tos; + iph.tot_len = header-ip4.tot_len; + iph.id = header-ip4.id; + iph.frag_off= header-ip4.frag_off; + iph.ttl = header-ip4.ttl; + iph.protocol= header-ip4.protocol; + iph.check = 0; + iph.saddr = header-ip4.saddr; + iph.daddr = header-ip4.daddr; + + return ip_fast_csum((u8 *)iph, iph.ihl); +} +EXPORT_SYMBOL(ib_ud_ip4_csum); + /** * ib_ud_header_init - Initialize UD header structure * @payload_bytes:Length of packet payload * @lrh_present: specify if LRH is present * @eth_present: specify if Eth header is present * @vlan_present: packet is tagged vlan - * @grh_present:GRH flag (if non-zero, GRH will be included) + * @grh_present: GRH flag (if non-zero, GRH will be included) + * @ip_version: if non-zero, IP header, V4 or V6, will be included + * @udp_present :if non-zero, UDP header will be included * @immediate_present: specify if immediate data is present * @header:Structure to initialize */ -void ib_ud_header_init(int payload_bytes, - int lrh_present, - int eth_present, - int vlan_present, - int grh_present, - int immediate_present, - struct ib_ud_header *header) +int ib_ud_header_init(int payload_bytes, + intlrh_present, +
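The new ib_ud_ip4_csum() above fills a temporary iphdr and hands it to the kernel's arch-optimized ip_fast_csum(). A portable sketch of the same RFC 1071 ones'-complement checksum, assuming a plain 20-byte header buffer rather than the kernel's struct:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 ones'-complement checksum over an IPv4 header.
 * ip_fast_csum() computes the same value with arch-specific code;
 * this portable loop is only an illustrative equivalent. */
static uint16_t ip4_checksum(const uint8_t *hdr, size_t ihl /* words */)
{
	uint32_t sum = 0;

	/* Sum the header as big-endian 16-bit words... */
	for (size_t i = 0; i < ihl * 4; i += 2)
		sum += (uint32_t)hdr[i] << 8 | hdr[i + 1];

	/* ...fold the carries back in, then take the complement. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

A useful property for testing: once the computed checksum is written into the check field, re-summing the whole header yields 0.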
[PATCH for-next 0/9] Add RoCE v2 support
Hi Doug, This series adds support for RoCE v2. In order to support RoCE v2, we add a gid_type attribute to every GID. When the RoCE GID management populates the GID table, it duplicates each GID with all supported types. This gives the user the ability to communicate over each supported type. Patches 0001, 0002 and 0003 add support for multiple GID types to the cache and related APIs. The third patch exposes the GID attributes information in sysfs. Patch 0004 adds the RoCE v2 GID type and the capabilities required from the vendor in order to implement RoCE v2. These capabilities are grouped together as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP. RoCE v2 can work over IPv4 and IPv6 networks. When receiving an ib_wc, this information should come from the vendor's driver. In case the vendor doesn't supply this information, we parse the packet headers and resolve its network type. Patch 0005 adds this information and the required utilities. Patches 0006 and 0007 add configfs support (and the required infrastructure) for CMA. The administrator should be able to set the default RoCE type. This is done through a new per-port default_roce_mode configfs file. Patch 0008 formats a QP1 packet in order to support RoCE v2 CM packets. This is required for vendors which implement their QP1 as a Raw QP. Patch 0009 adds support for IPv4 multicast, as an IPv4 network requires IGMP to be sent in order to join multicast groups. Vendor code isn't part of this patch-set. Soft-RoCE will be sent soon and depends on these patches. Other vendors, like mlx4, ocrdma and mlx5, will follow. This patch-set is applied on top of "Add RoCE GID cache usage in verbs/cma", which was sent to the mailing list. 
Thanks, Matan Matan Barak (6): IB/core: Add gid_type to gid attribute IB/cm: Use the source GID index type IB/core: Add gid attributes to sysfs IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type IB/rdma_cm: Add wrapper for cma reference count IB/cma: Add configfs for rdma_cm Moni Shoua (2): IB/core: Initialize UD header structure with IP and UDP headers IB/cma: Join and leave multicast groups with IGMP Somnath Kotur (1): IB/core: Add rdma_network_type to wc drivers/infiniband/Kconfig| 9 + drivers/infiniband/core/Makefile | 2 + drivers/infiniband/core/addr.c| 14 ++ drivers/infiniband/core/cache.c | 152 + drivers/infiniband/core/cm.c | 25 ++- drivers/infiniband/core/cma.c | 216 -- drivers/infiniband/core/cma_configfs.c| 353 ++ drivers/infiniband/core/core_priv.h | 32 +++ drivers/infiniband/core/device.c | 9 +- drivers/infiniband/core/multicast.c | 20 +- drivers/infiniband/core/roce_gid_mgmt.c | 61 +- drivers/infiniband/core/sa_query.c| 5 +- drivers/infiniband/core/sysfs.c | 184 +++- drivers/infiniband/core/ud_header.c | 155 - drivers/infiniband/core/uverbs_marshall.c | 1 + drivers/infiniband/core/verbs.c | 124 ++- drivers/infiniband/hw/mlx4/qp.c | 7 +- drivers/infiniband/hw/mthca/mthca_qp.c| 2 +- include/rdma/ib_addr.h| 1 + include/rdma/ib_cache.h | 4 + include/rdma/ib_pack.h| 45 +++- include/rdma/ib_sa.h | 4 + include/rdma/ib_verbs.h | 78 ++- 23 files changed, 1399 insertions(+), 104 deletions(-) create mode 100644 drivers/infiniband/core/cma_configfs.c -- 2.1.0

Re: [RFC] split struct ib_send_wr
On Thu, Aug 13, 2015 at 09:07:14AM -0400, Doug Ledford wrote: Doug: was your mail a request to fix up the two de-staged drivers? I'm happy to do that if you're fine with the patch in general. amso1100 should be trivial anyway, while ipath is a mess, just like the new intel driver with the third copy of the soft ib stack. Correct. http://git.infradead.org/users/hch/rdma.git/commitdiff/efb2b0f21645b9caabcce955481ab6966e52ad90 contains the updates for ipath and amso1100, as well as the reviewed-by and tested-by tags. Note that for now I've skipped the new intel hfi1 driver as updating two of the soft ib codebases already was tiresome enough.
[PATCH for-next 7/9] IB/cma: Add configfs for rdma_cm
Users would like to control the behaviour of rdma_cm. For example, old applications which don't set the required RoCE gid type could be executed on RoCE V2 network types. In order to support this configuration, we implement a configfs for rdma_cm. To use the configfs, one needs to mount it and create a directory named after the IB device inside the rdma_cm directory. The patch adds support for a single configuration file, default_roce_mode. The mode can either be IB/RoCE v1 or RoCE v2. Signed-off-by: Matan Barak mat...@mellanox.com --- drivers/infiniband/Kconfig | 9 + drivers/infiniband/core/Makefile | 2 + drivers/infiniband/core/cache.c| 24 +++ drivers/infiniband/core/cma.c | 95 - drivers/infiniband/core/cma_configfs.c | 353 + drivers/infiniband/core/core_priv.h| 24 +++ 6 files changed, 503 insertions(+), 4 deletions(-) create mode 100644 drivers/infiniband/core/cma_configfs.c diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index da4c697..9ee82a2 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -54,6 +54,15 @@ config INFINIBAND_ADDR_TRANS depends on INFINIBAND default y +config INFINIBAND_ADDR_TRANS_CONFIGFS + bool + depends on INFINIBAND_ADDR_TRANS CONFIGFS_FS + default y + ---help--- + ConfigFS support for RDMA communication manager (CM). + This allows the user to configure the default GID type that the CM + uses for each device, when initiating new connections. 
+ source drivers/infiniband/hw/mthca/Kconfig source drivers/infiniband/hw/qib/Kconfig source drivers/infiniband/hw/ehca/Kconfig diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index d43a899..7922fa7 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -24,6 +24,8 @@ iw_cm-y :=iwcm.o iwpm_util.o iwpm_msg.o rdma_cm-y := cma.o +rdma_cm-$(CONFIG_INFINIBAND_ADDR_TRANS_CONFIGFS) += cma_configfs.o + rdma_ucm-y := ucma.o ib_addr-y := addr.o diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index ddd0406..66090ce 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -127,6 +127,30 @@ const char *ib_cache_gid_type_str(enum ib_gid_type gid_type) } EXPORT_SYMBOL(ib_cache_gid_type_str); +int ib_cache_gid_parse_type_str(const char *buf) +{ + unsigned int i; + size_t len; + int err = -EINVAL; + + len = strlen(buf); + if (len == 0) + return -EINVAL; + + if (buf[len - 1] == '\n') + len--; + + for (i = 0; i ARRAY_SIZE(gid_type_str); ++i) + if (gid_type_str[i] !strncmp(buf, gid_type_str[i], len) + len == strlen(gid_type_str[i])) { + err = i; + break; + } + + return err; +} +EXPORT_SYMBOL(ib_cache_gid_parse_type_str); + static int write_gid(struct ib_device *ib_dev, u8 port, struct ib_gid_table *table, int ix, const union ib_gid *gid, diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 22003dd..e4f4d23 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -121,6 +121,7 @@ struct cma_device { struct completion comp; atomic_trefcount; struct list_headid_list; + enum ib_gid_type*default_gid_type; }; struct rdma_bind_list { @@ -138,6 +139,62 @@ void cma_ref_dev(struct cma_device *cma_dev) atomic_inc(cma_dev-refcount); } +struct cma_device *cma_enum_devices_by_ibdev(cma_device_filter filter, +void *cookie) +{ + struct cma_device *cma_dev; + struct cma_device *found_cma_dev = NULL; + + mutex_lock(lock); + + 
list_for_each_entry(cma_dev, dev_list, list) + if (filter(cma_dev-device, cookie)) { + found_cma_dev = cma_dev; + break; + } + + if (found_cma_dev) + cma_ref_dev(found_cma_dev); + mutex_unlock(lock); + return found_cma_dev; +} + +int cma_get_default_gid_type(struct cma_device *cma_dev, +unsigned int port) +{ + if (port rdma_start_port(cma_dev-device) || + port rdma_end_port(cma_dev-device)) + return -EINVAL; + + return cma_dev-default_gid_type[port - rdma_start_port(cma_dev-device)]; +} + +int cma_set_default_gid_type(struct cma_device *cma_dev, +unsigned int port, +enum ib_gid_type default_gid_type) +{ + unsigned long supported_gids; + + if (port rdma_start_port(cma_dev-device) || + port
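The new ib_cache_gid_parse_type_str() above tolerates the trailing newline that configfs writes usually carry, then matches the remainder against gid_type_str[]. A self-contained userspace sketch of the same matching logic (returning -1 where the kernel returns -EINVAL):

```c
#include <assert.h>
#include <string.h>

/* Names as printed by the patch's gid_type_str[] table, mirrored here
 * so the sketch is self-contained. */
static const char *const gid_type_str[] = {
	[0] = "IB/RoCE v1",
	[1] = "RoCE v2",
};

/* Accept an optional trailing '\n' (configfs buffers usually end with
 * one) and return the matching table index, or -1 on no match. */
static int parse_gid_type(const char *buf)
{
	size_t len = strlen(buf);

	if (len && buf[len - 1] == '\n')
		len--;

	for (size_t i = 0;
	     i < sizeof(gid_type_str) / sizeof(gid_type_str[0]); ++i)
		if (gid_type_str[i] && len == strlen(gid_type_str[i]) &&
		    !strncmp(buf, gid_type_str[i], len))
			return (int)i;
	return -1;   /* the kernel version returns -EINVAL here */
}
```

Comparing both the prefix and the length, as the patch does, prevents "RoCE v2x" or a bare "RoCE" from matching.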
[PATCH for-next 6/9] IB/rdma_cm: Add wrapper for cma reference count
Currently, cma users can't increase or decrease the cma reference count. This is necessary when setting cma attributes (like the default GID type) in order to avoid use-after-free errors. Adding cma_ref_dev and cma_deref_dev APIs. Signed-off-by: Matan Barak mat...@mellanox.com --- drivers/infiniband/core/cma.c | 11 +-- drivers/infiniband/core/core_priv.h | 4 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 99e9e3e..22003dd 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -56,6 +56,8 @@ #include rdma/ib_sa.h #include rdma/iw_cm.h +#include core_priv.h + MODULE_AUTHOR(Sean Hefty); MODULE_DESCRIPTION(Generic RDMA CM Agent); MODULE_LICENSE(Dual BSD/GPL); @@ -131,6 +133,11 @@ enum { CMA_OPTION_AFONLY, }; +void cma_ref_dev(struct cma_device *cma_dev) +{ + atomic_inc(cma_dev-refcount); +} + /* * Device removal can occur at anytime, so we need extra handling to * serialize notifying the user of device removal with other callbacks. 
@@ -276,7 +283,7 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 ip_ver) static void cma_attach_to_dev(struct rdma_id_private *id_priv, struct cma_device *cma_dev) { - atomic_inc(cma_dev-refcount); + cma_ref_dev(cma_dev); id_priv-cma_dev = cma_dev; id_priv-id.device = cma_dev-device; id_priv-id.route.addr.dev_addr.transport = @@ -284,7 +291,7 @@ static void cma_attach_to_dev(struct rdma_id_private *id_priv, list_add_tail(id_priv-list, cma_dev-id_list); } -static inline void cma_deref_dev(struct cma_device *cma_dev) +void cma_deref_dev(struct cma_device *cma_dev) { if (atomic_dec_and_test(cma_dev-refcount)) complete(cma_dev-comp); diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h index f2c5cc9..7fbfea9 100644 --- a/drivers/infiniband/core/core_priv.h +++ b/drivers/infiniband/core/core_priv.h @@ -38,6 +38,10 @@ #include rdma/ib_verbs.h +struct cma_device; +void cma_ref_dev(struct cma_device *cma_dev); +void cma_deref_dev(struct cma_device *cma_dev); + int ib_device_register_sysfs(struct ib_device *device, int (*port_callback)(struct ib_device *, u8, struct kobject *)); -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
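The cma_ref_dev()/cma_deref_dev() pair combines an atomic refcount with a completion, so that device removal can block until the last reference is dropped. A userspace analogue of that pattern, using C11 atomics and pthreads (illustrative only, not the kernel API):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Userspace stand-in for cma_device's refcount + completion pair. */
struct dev_ref {
	atomic_int      refcount;
	pthread_mutex_t lock;
	pthread_cond_t  done;      /* signalled like complete(&comp) */
	int             completed;
};

static void dev_ref_init(struct dev_ref *d)
{
	atomic_init(&d->refcount, 1);   /* creator holds one reference */
	pthread_mutex_init(&d->lock, NULL);
	pthread_cond_init(&d->done, NULL);
	d->completed = 0;
}

static void dev_ref_get(struct dev_ref *d)   /* like cma_ref_dev() */
{
	atomic_fetch_add(&d->refcount, 1);
}

static void dev_ref_put(struct dev_ref *d)   /* like cma_deref_dev() */
{
	if (atomic_fetch_sub(&d->refcount, 1) == 1) {   /* hit zero */
		pthread_mutex_lock(&d->lock);
		d->completed = 1;
		pthread_cond_signal(&d->done);
		pthread_mutex_unlock(&d->lock);
	}
}

static void dev_ref_wait(struct dev_ref *d)  /* wait_for_completion() */
{
	pthread_mutex_lock(&d->lock);
	while (!d->completed)
		pthread_cond_wait(&d->done, &d->lock);
	pthread_mutex_unlock(&d->lock);
}
```

Exporting get/put as real functions (rather than the old inline atomic_inc) is what lets the configfs code in the next patch pin a cma_device while an attribute write is in flight.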
[PATCH for-next 2/9] IB/cm: Use the source GID index type
Previously, the cm and cma modules supported only the IB and RoCE v1 GID types. In order to support multiple GID types, the gid_type is passed to cm_init_av_by_path and stored in the path record. The rdma cm client would use a default GID type that will be saved in rdma_id_private. Signed-off-by: Matan Barak mat...@mellanox.com --- drivers/infiniband/core/cm.c | 25 - drivers/infiniband/core/cma.c | 2 ++ 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 0c15488..ba81025 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -362,7 +362,7 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av) read_lock_irqsave(cm.device_lock, flags); list_for_each_entry(cm_dev, cm.device_list, list) { if (!ib_find_cached_gid(cm_dev-ib_device, path-sgid, - IB_GID_TYPE_IB, ndev, p, NULL)) { + path-gid_type, ndev, p, NULL)) { port = cm_dev-port[p-1]; break; } } @@ -1536,6 +1536,8 @@ static int cm_req_handler(struct cm_work *work) struct ib_cm_id *cm_id; struct cm_id_private *cm_id_priv, *listen_cm_id_priv; struct cm_req_msg *req_msg; + union ib_gid gid; + struct ib_gid_attr gid_attr; int ret; req_msg = (struct cm_req_msg *)work-mad_recv_wc-recv_buf.mad; @@ -1575,11 +1577,24 @@ static int cm_req_handler(struct cm_work *work) cm_format_paths_from_req(req_msg, work-path[0], work-path[1]); memcpy(work-path[0].dmac, cm_id_priv-av.ah_attr.dmac, ETH_ALEN); - ret = cm_init_av_by_path(work-path[0], cm_id_priv-av); + ret = ib_get_cached_gid(work-port-cm_dev-ib_device, + work-port-port_num, + cm_id_priv-av.ah_attr.grh.sgid_index, + gid, gid_attr); + if (!ret) { + if (gid_attr.ndev) + dev_put(gid_attr.ndev); + work-path[0].gid_type = gid_attr.gid_type; + ret = cm_init_av_by_path(work-path[0], cm_id_priv-av); + } if (ret) { - ib_get_cached_gid(work-port-cm_dev-ib_device, - work-port-port_num, 0, work-path[0].sgid, - NULL); + int err = ib_get_cached_gid(work-port-cm_dev-ib_device, + 
work-port-port_num, 0, + work-path[0].sgid, + gid_attr); + if (!err gid_attr.ndev) + dev_put(gid_attr.ndev); + work-path[0].gid_type = gid_attr.gid_type; ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID, work-path[0].sgid, sizeof work-path[0].sgid, NULL, 0); diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index dfb92bf..f78b8dd 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -174,6 +174,7 @@ struct rdma_id_private { u8 tos; u8 reuseaddr; u8 afonly; + enum ib_gid_typegid_type; }; struct cma_multicast { @@ -1952,6 +1953,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv) ndev = dev_get_by_index(init_net, addr-dev_addr.bound_dev_if); route-path_rec-net = init_net; route-path_rec-ifindex = addr-dev_addr.bound_dev_if; + route-path_rec-gid_type = id_priv-gid_type; } if (!ndev) { ret = -ENODEV; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next 5/9] IB/core: Add rdma_network_type to wc
From: Somnath Kotur somnath.ko...@avagotech.com Providers should tell IB core the wc's network type. This is used in order to search for the proper GID in the GID table. When using HCAs that can't provide this info, IB core tries to deep examine the packet and extract the GID type by itself. We choose sgid_index and type from all the matching entries in RDMA-CM based on hint from the IP stack and we set hop_limit for the IP packet based on above hint from IP stack. Signed-off-by: Matan Barak mat...@mellanox.com Signed-off-by: Somnath Kotur somnath.ko...@avagotech.com --- drivers/infiniband/core/addr.c | 14 + drivers/infiniband/core/cma.c | 11 +++- drivers/infiniband/core/verbs.c | 123 ++-- include/rdma/ib_addr.h | 1 + include/rdma/ib_verbs.h | 44 ++ 5 files changed, 187 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index d3c42b3..3e1f93c 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -257,6 +257,12 @@ static int addr4_resolve(struct sockaddr_in *src_in, goto put; } + /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't +* routable) and we could set the network type accordingly. +*/ + if (rt-rt_uses_gateway) + addr-network = RDMA_NETWORK_IPV4; + ret = dst_fetch_ha(rt-dst, addr, fl4.daddr); put: ip_rt_put(rt); @@ -271,6 +277,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, { struct flowi6 fl6; struct dst_entry *dst; + struct rt6_info *rt; int ret; memset(fl6, 0, sizeof fl6); @@ -282,6 +289,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, if ((ret = dst-error)) goto put; + rt = (struct rt6_info *)dst; if (ipv6_addr_any(fl6.saddr)) { ret = ipv6_dev_get_saddr(init_net, ip6_dst_idev(dst)-dev, fl6.daddr, 0, fl6.saddr); @@ -305,6 +313,12 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, goto put; } + /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't +* routable) and we could set the network type accordingly. 
+*/ + if (rt-rt6i_flags RTF_GATEWAY) + addr-network = RDMA_NETWORK_IPV6; + ret = dst_fetch_ha(dst, addr, fl6.daddr); put: dst_release(dst); diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index f78b8dd..99e9e3e 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -1929,6 +1929,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv) { struct rdma_route *route = id_priv-id.route; struct rdma_addr *addr = route-addr; + enum ib_gid_type network_gid_type; struct cma_work *work; int ret; struct net_device *ndev = NULL; @@ -1967,7 +1968,15 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv) rdma_ip2gid((struct sockaddr *)id_priv-id.route.addr.dst_addr, route-path_rec-dgid); - route-path_rec-hop_limit = 1; + /* Use the hint from IP Stack to select GID Type */ + network_gid_type = ib_network_to_gid_type(addr-dev_addr.network); + if (addr-dev_addr.network != RDMA_NETWORK_IB) { + route-path_rec-gid_type = network_gid_type; + /* TODO: get the hoplimit from the inet/inet6 device */ + route-path_rec-hop_limit = IPV6_DEFAULT_HOPLIMIT; + } else { + route-path_rec-hop_limit = 1; + } route-path_rec-reversible = 1; route-path_rec-pkey = cpu_to_be16(0x); route-path_rec-mtu_selector = IB_SA_EQ; diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 62a3d01..1d1cab3 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -260,8 +260,61 @@ struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) } EXPORT_SYMBOL(ib_create_ah); +static int ib_get_header_version(const union rdma_network_hdr *hdr) +{ + const struct iphdr *ip4h = (struct iphdr *)hdr-roce4grh; + struct iphdr ip4h_checked; + const struct ipv6hdr *ip6h = (struct ipv6hdr *)hdr-ibgrh; + + /* If it's IPv6, the version must be 6, otherwise, the first +* 20 bytes (before the IPv4 header) are garbled. +*/ + if (ip6h-version != 6) + return (ip4h-version == 4) ? 
4 : 0; + /* version may be 6 or 4 because the first 20 bytes could be garbled */ + + /* RoCE v2 requires no options, thus header length + must be 5 words + */ + if (ip4h-ihl != 5) + return 6; + + /* Verify checksum. + We can't write on scattered buffers so we need to copy to + temp
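The ib_get_header_version() helper above decides between a GRH (v6) and a RoCE v2 IPv4 header by looking at the version nibbles, the IHL, and finally the IPv4 checksum. A simplified userspace sketch of that decision tree, assuming the 40-byte GRH slot layout the patch describes (the iphdr occupying the last 20 bytes, the first 20 undefined):

```c
#include <assert.h>
#include <stdint.h>

enum { GRH_LEN = 40, IP4_OFF = 20 };

/* A correct IPv4 header (checksum field included) folds to 0xffff
 * under the ones'-complement sum. */
static int ip4_csum_valid(const uint8_t *h)
{
	uint32_t sum = 0;

	for (int i = 0; i < 20; i += 2)
		sum += (uint32_t)h[i] << 8 | h[i + 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum == 0xffff;
}

/* Sketch of the patch's ib_get_header_version(): 6 for GRH/IPv6,
 * 4 for RoCE v2 IPv4, 0 if neither interpretation fits. */
static int header_version(const uint8_t grh[GRH_LEN])
{
	const uint8_t *ip4 = grh + IP4_OFF;

	if ((grh[0] >> 4) != 6)                 /* version nibble not 6, */
		return ((ip4[0] >> 4) == 4) ? 4 : 0;   /* so try IPv4 */
	if ((ip4[0] & 0x0f) != 5)               /* RoCE v2 mandates IHL 5 */
		return 6;
	return ip4_csum_valid(ip4) ? 4 : 6;     /* valid v4 checksum wins */
}
```

The ambiguity the checksum resolves is real: an IPv6 flow label can make the first nibble of garbled pre-iphdr bytes look like anything, so only a self-consistent IPv4 header is trusted as v4.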
[PATCH for-next V8 3/6] IB/uverbs: Explicitly pass ib_dev to uverbs commands
Done in preparation for deploying RCU for the device removal flow. Allows isolating the RCU handling to the uverb_main layer and keeping the uverbs_cmd code as is. Signed-off-by: Yishai Hadas yish...@mellanox.com Signed-off-by: Shachar Raindel rain...@mellanox.com Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com --- drivers/infiniband/core/uverbs.h |3 + drivers/infiniband/core/uverbs_cmd.c | 103 ++--- drivers/infiniband/core/uverbs_main.c | 21 +-- 3 files changed, 88 insertions(+), 39 deletions(-) diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 92ec765..ea52db1 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -178,6 +178,7 @@ extern struct idr ib_uverbs_rule_idr; void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj); struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file, + struct ib_device *ib_dev, int is_async); void ib_uverbs_free_async_event_file(struct ib_uverbs_file *uverbs_file); struct ib_uverbs_event_file *ib_uverbs_lookup_comp_file(int fd); @@ -214,6 +215,7 @@ struct ib_uverbs_flow_spec { #define IB_UVERBS_DECLARE_CMD(name)\ ssize_t ib_uverbs_##name(struct ib_uverbs_file *file, \ +struct ib_device *ib_dev, \ const char __user *buf, int in_len,\ int out_len) @@ -255,6 +257,7 @@ IB_UVERBS_DECLARE_CMD(close_xrcd); #define IB_UVERBS_DECLARE_EX_CMD(name) \ int ib_uverbs_ex_##name(struct ib_uverbs_file *file,\ + struct ib_device *ib_dev, \ struct ib_udata *ucore, \ struct ib_udata *uhw) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 5720a92..29443c0 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -282,13 +282,13 @@ static void put_xrcd_read(struct ib_uobject *uobj) } ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, + struct ib_device *ib_dev, const char __user *buf, int in_len, int out_len) { struct ib_uverbs_get_context cmd; struct 
ib_uverbs_get_context_resp resp; struct ib_udata udata; - struct ib_device *ibdev = file-device-ib_dev; #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING struct ib_device_attr dev_attr; #endif @@ -313,13 +313,13 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); - ucontext = ibdev-alloc_ucontext(ibdev, udata); + ucontext = ib_dev-alloc_ucontext(ib_dev, udata); if (IS_ERR(ucontext)) { ret = PTR_ERR(ucontext); goto err; } - ucontext-device = ibdev; + ucontext-device = ib_dev; INIT_LIST_HEAD(ucontext-pd_list); INIT_LIST_HEAD(ucontext-mr_list); INIT_LIST_HEAD(ucontext-mw_list); @@ -340,7 +340,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, ucontext-odp_mrs_count = 0; INIT_LIST_HEAD(ucontext-no_private_counters); - ret = ib_query_device(ibdev, dev_attr); + ret = ib_query_device(ib_dev, dev_attr); if (ret) goto err_free; if (!(dev_attr.device_cap_flags IB_DEVICE_ON_DEMAND_PAGING)) @@ -355,7 +355,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, goto err_free; resp.async_fd = ret; - filp = ib_uverbs_alloc_event_file(file, 1); + filp = ib_uverbs_alloc_event_file(file, ib_dev, 1); if (IS_ERR(filp)) { ret = PTR_ERR(filp); goto err_fd; @@ -384,7 +384,7 @@ err_fd: err_free: put_pid(ucontext-tgid); - ibdev-dealloc_ucontext(ucontext); + ib_dev-dealloc_ucontext(ucontext); err: mutex_unlock(file-mutex); @@ -392,11 +392,12 @@ err: } static void copy_query_dev_fields(struct ib_uverbs_file *file, + struct ib_device *ib_dev, struct ib_uverbs_query_device_resp *resp, struct ib_device_attr *attr) { resp-fw_ver= attr-fw_ver; - resp-node_guid = file-device-ib_dev-node_guid; + resp-node_guid = ib_dev-node_guid; resp-sys_image_guid= attr-sys_image_guid; resp-max_mr_size = attr-max_mr_size; resp-page_size_cap =
[PATCH for-next V8 6/6] IB/ucma: HW Device hot-removal support
Currently, IB/cma remove_one flow blocks until all user descriptor managed by IB/ucma are released. This prevents hot-removal of IB devices. This patch allows IB/cma to remove devices regardless of user space activity. Upon getting the RDMA_CM_EVENT_DEVICE_REMOVAL event we close all the underlying HW resources for the given ucontext. The ucontext itself is still alive till its explicit destroying by its creator. Running applications at that time will have some zombie device, further operations may fail. Signed-off-by: Yishai Hadas yish...@mellanox.com Signed-off-by: Shachar Raindel rain...@mellanox.com Reviewed-by: Haggai Eran hagg...@mellanox.com --- drivers/infiniband/core/ucma.c | 140 --- 1 files changed, 129 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index 29b2121..c41aef4 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -74,6 +74,7 @@ struct ucma_file { struct list_headctx_list; struct list_headevent_list; wait_queue_head_t poll_wait; + struct workqueue_struct *close_wq; }; struct ucma_context { @@ -89,6 +90,13 @@ struct ucma_context { struct list_headlist; struct list_headmc_list; + /* mark that device is in process of destroying the internal HW +* resources, protected by the global mut +*/ + int closing; + /* sync between removal event and id destroy, protected by file mut */ + int destroying; + struct work_struct close_work; }; struct ucma_multicast { @@ -107,6 +115,7 @@ struct ucma_event { struct list_headlist; struct rdma_cm_id *cm_id; struct rdma_ucm_event_resp resp; + struct work_struct close_work; }; static DEFINE_MUTEX(mut); @@ -132,8 +141,12 @@ static struct ucma_context *ucma_get_ctx(struct ucma_file *file, int id) mutex_lock(mut); ctx = _ucma_find_context(id, file); - if (!IS_ERR(ctx)) - atomic_inc(ctx-ref); + if (!IS_ERR(ctx)) { + if (ctx-closing) + ctx = ERR_PTR(-EIO); + else + atomic_inc(ctx-ref); + } mutex_unlock(mut); return ctx; } @@ -144,6 
+157,28 @@ static void ucma_put_ctx(struct ucma_context *ctx) complete(&ctx->comp); } +static void ucma_close_event_id(struct work_struct *work) +{ + struct ucma_event *uevent_close = container_of(work, struct ucma_event, close_work); + + rdma_destroy_id(uevent_close->cm_id); + kfree(uevent_close); +} + +static void ucma_close_id(struct work_struct *work) +{ + struct ucma_context *ctx = container_of(work, struct ucma_context, close_work); + + /* once all inflight tasks are finished, we close all underlying + * resources. The context is still alive till its explicit destroying + * by its creator. + */ + ucma_put_ctx(ctx); + wait_for_completion(&ctx->comp); + /* No new events will be generated after destroying the id. */ + rdma_destroy_id(ctx->cm_id); +} + static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file) { struct ucma_context *ctx; @@ -152,6 +187,7 @@ static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file) if (!ctx) return NULL; + INIT_WORK(&ctx->close_work, ucma_close_id); atomic_set(&ctx->ref, 1); init_completion(&ctx->comp); INIT_LIST_HEAD(&ctx->mc_list); @@ -242,6 +278,44 @@ static void ucma_set_event_context(struct ucma_context *ctx, } } +/* Called with file->mut locked for the relevant context. */ +static void ucma_removal_event_handler(struct rdma_cm_id *cm_id) +{ + struct ucma_context *ctx = cm_id->context; + struct ucma_event *con_req_eve; + int event_found = 0; + + if (ctx->destroying) + return; + + /* only if context is pointing to cm_id that it owns it and can be + * queued to be closed, otherwise that cm_id is an inflight one that + * is part of that context event list pending to be detached and + * reattached to its new context as part of ucma_get_event, + * handled separately below. 
+ */ + if (ctx->cm_id == cm_id) { + mutex_lock(&mut); + ctx->closing = 1; + mutex_unlock(&mut); + queue_work(ctx->file->close_wq, &ctx->close_work); + return; + } + + list_for_each_entry(con_req_eve, &ctx->file->event_list, list) { + if (con_req_eve->cm_id == cm_id && + con_req_eve->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) { + list_del(&con_req_eve->list); + INIT_WORK(&con_req_eve->close_work, ucma_close_event_id); +
[PATCH for-next V8 1/6] IB/uverbs: Fix reference counting usage of event files
Fix the reference counting usage to be handled in the event file creation/destruction functions, instead of being done by the caller. This is done for both async and non-async event files. Based on Jason Gunthorpe's report at https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg24680.html: The existing code for this is broken, in ib_uverbs_get_context all the error paths between ib_uverbs_alloc_event_file and the kref_get(file->ref) are wrong - this will result in fput() which will call ib_uverbs_event_close, which will try to do kref_put and ib_unregister_event_handler - which are no longer paired. Signed-off-by: Yishai Hadas yish...@mellanox.com Signed-off-by: Shachar Raindel rain...@mellanox.com Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com --- drivers/infiniband/core/uverbs.h | 1 + drivers/infiniband/core/uverbs_cmd.c | 11 +--- drivers/infiniband/core/uverbs_main.c | 44 3 files changed, 40 insertions(+), 16 deletions(-) diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index ba365b6..60e6e3d 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -178,6 +178,7 @@ void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj); struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file, int is_async); +void ib_uverbs_free_async_event_file(struct ib_uverbs_file *uverbs_file); struct ib_uverbs_event_file *ib_uverbs_lookup_comp_file(int fd); void ib_uverbs_release_ucq(struct ib_uverbs_file *file, diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index bbb02ff..5720a92 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -367,16 +367,6 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, goto err_file; } - file->async_file = filp->private_data; - - INIT_IB_EVENT_HANDLER(&file->event_handler, file->device->ib_dev, - ib_uverbs_event_handler); - ret = 
ib_register_event_handler(&file->event_handler); - if (ret) - goto err_file; - - kref_get(&file->async_file->ref); - kref_get(&file->ref); file->ucontext = ucontext; fd_install(resp.async_fd, filp); @@ -386,6 +376,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, return in_len; err_file: + ib_uverbs_free_async_event_file(file); fput(filp); err_fd: diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index f6eef2d..c238eba 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -406,10 +406,9 @@ static int ib_uverbs_event_close(struct inode *inode, struct file *filp) } spin_unlock_irq(&file->lock); - if (file->is_async) { + if (file->is_async) ib_unregister_event_handler(&file->uverbs_file->event_handler); - kref_put(&file->uverbs_file->ref, ib_uverbs_release_file); - } + kref_put(&file->uverbs_file->ref, ib_uverbs_release_file); kref_put(&file->ref, ib_uverbs_release_event_file); return 0; @@ -541,13 +540,20 @@ void ib_uverbs_event_handler(struct ib_event_handler *handler, NULL, NULL); } +void ib_uverbs_free_async_event_file(struct ib_uverbs_file *file) +{ + kref_put(&file->async_file->ref, ib_uverbs_release_event_file); + file->async_file = NULL; +} + struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file, int is_async) { struct ib_uverbs_event_file *ev_file; struct file *filp; + int ret; - ev_file = kmalloc(sizeof *ev_file, GFP_KERNEL); + ev_file = kzalloc(sizeof(*ev_file), GFP_KERNEL); if (!ev_file) return ERR_PTR(-ENOMEM); @@ -556,15 +562,41 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file, INIT_LIST_HEAD(&ev_file->event_list); init_waitqueue_head(&ev_file->poll_wait); ev_file->uverbs_file = uverbs_file; + kref_get(&ev_file->uverbs_file->ref); ev_file->async_queue = NULL; - ev_file->is_async = is_async; ev_file->is_closed = 0; filp = anon_inode_getfile("[infinibandevent]", &uverbs_event_fops, ev_file, O_RDONLY); if (IS_ERR(filp)) - kfree(ev_file); + goto 
err_put_refs; + + if (is_async) { + WARN_ON(uverbs_file->async_file); + uverbs_file->async_file = ev_file; + kref_get(&uverbs_file->async_file->ref); + INIT_IB_EVENT_HANDLER(&uverbs_file->event_handler, + uverbs_file->device->ib_dev, + ib_uverbs_event_handler); + ret =
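The point of moving the kref_get calls into the alloc/free helpers is that every error path can then unwind the same single way, and the release callback fires exactly once. A minimal userspace model of that pairing discipline (a generic get/put counter, not the kernel's struct kref):

```c
/* Userspace sketch of balanced get/put reference pairing. */
#include <assert.h>
#include <stdio.h>

struct obj {
	int ref;
	int released;	/* counts release-callback invocations */
};

static void obj_get(struct obj *o)
{
	o->ref++;
}

static void obj_put(struct obj *o)
{
	if (--o->ref == 0)
		o->released++;	/* "release" must fire exactly once */
}

int main(void)
{
	struct obj o = { 1, 0 };	/* creator's initial reference */

	obj_get(&o);	/* e.g. the event file pinning its uverbs_file */
	obj_put(&o);	/* error path: the single paired undo helper */
	obj_put(&o);	/* creator drops its own reference */
	assert(o.ref == 0 && o.released == 1);
	printf("release fired %d time(s)\n", o.released);
	return 0;
}
```

The bug Jason reported is exactly what happens when a get lives in one function and its matching put in another: an error path between them either leaks a reference or drops one it never took.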
[PATCH for-next V8 2/6] IB/uverbs: Fix race between ib_uverbs_open and remove_one
Fixes: 2a72f212263701b927559f6850446421d5906c41 (IB/uverbs: Remove dev_table) Before this commit there was a device look-up table, protected by a spin_lock, used by ib_uverbs_open and by ib_uverbs_remove_one. When it was dropped and container_of was used instead, the race with remove_one was enabled, as dev might be freed just after: dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev) but before the kref_get. In addition, this buggy patch added some dead code, as container_of(x,y,z) can never be NULL and so dev can never be NULL. As a result the comment above ib_uverbs_open saying the open method will either immediately run -ENXIO is wrong, as it can never happen. The solution follows Jason Gunthorpe's suggestion from the URL below: https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html The cdev will hold a kref on the parent (the containing structure, ib_uverbs_device) and only when that kref is released is it guaranteed that open will never be called again. In addition, fix the active count scheme to use an atomic instead of a kref, to prevent the WARN_ON pointed out by the above comment from Jason. 
Signed-off-by: Yishai Hadas yish...@mellanox.com Signed-off-by: Shachar Raindel rain...@mellanox.com Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com --- drivers/infiniband/core/uverbs.h | 3 +- drivers/infiniband/core/uverbs_main.c | 43 +++-- 2 files changed, 32 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 60e6e3d..92ec765 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -85,7 +85,7 @@ */ struct ib_uverbs_device { - struct kref ref; + atomic_t refcount; int num_comp_vectors; struct completion comp; struct device *dev; @@ -94,6 +94,7 @@ struct ib_uverbs_device { struct cdev cdev; struct rb_root xrcd_tree; struct mutex xrcd_tree_mutex; + struct kobject kobj; }; struct ib_uverbs_event_file { diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index c238eba..9f39978 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -130,14 +130,18 @@ static int (*uverbs_ex_cmd_table[])(struct ib_uverbs_file *file, static void ib_uverbs_add_one(struct ib_device *device); static void ib_uverbs_remove_one(struct ib_device *device); -static void ib_uverbs_release_dev(struct kref *ref) +static void ib_uverbs_release_dev(struct kobject *kobj) { struct ib_uverbs_device *dev = - container_of(ref, struct ib_uverbs_device, ref); + container_of(kobj, struct ib_uverbs_device, kobj); - complete(&dev->comp); + kfree(dev); } +static struct kobj_type ib_uverbs_dev_ktype = { + .release = ib_uverbs_release_dev, +}; + static void ib_uverbs_release_event_file(struct kref *ref) { struct ib_uverbs_event_file *file = @@ -303,13 +307,19 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, return context->device->dealloc_ucontext(context); } +static void ib_uverbs_comp_dev(struct ib_uverbs_device *dev) +{ + complete(&dev->comp); +} + static void ib_uverbs_release_file(struct kref *ref) { struct 
ib_uverbs_file *file = container_of(ref, struct ib_uverbs_file, ref); module_put(file->device->ib_dev->owner); - kref_put(&file->device->ref, ib_uverbs_release_dev); + if (atomic_dec_and_test(&file->device->refcount)) + ib_uverbs_comp_dev(file->device); kfree(file); } @@ -775,9 +785,7 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp) int ret; dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev); - if (dev) - kref_get(&dev->ref); - else + if (!atomic_inc_not_zero(&dev->refcount)) return -ENXIO; if (!try_module_get(dev->ib_dev->owner)) { @@ -798,6 +806,7 @@ static int ib_uverbs_open(struct inode *inode, struct file *filp) mutex_init(&file->mutex); filp->private_data = file; + kobject_get(&dev->kobj); return nonseekable_open(inode, filp); @@ -805,13 +814,16 @@ err_module: module_put(dev->ib_dev->owner); err: - kref_put(&dev->ref, ib_uverbs_release_dev); + if (atomic_dec_and_test(&dev->refcount)) + ib_uverbs_comp_dev(dev); + return ret; } static int ib_uverbs_close(struct inode *inode, struct file *filp) { struct ib_uverbs_file *file = filp->private_data; + struct ib_uverbs_device *dev = file->device;
[PATCH for-next V8 4/6] IB/uverbs: Enable device removal when there are active user space applications
Enables uverbs_remove_one to succeed despite the fact that there are running IB applications working with the given ib device. This functionality enables a HW device to be unbound/reset despite the fact that there are running user space applications using it. It exposes a new IB kernel API named 'disassociate_ucontext' which lets a driver detach its HW resources from a given user context without crashing/terminating the application. In case a driver implements the above API and registers with ib_uverbs, there will be no dependency between its device and its uverbs_device. Upon calling remove_one of ib_uverbs, the call should return after disassociating the open HW resources without waiting for clients to disconnect. In case a driver didn't implement this API there is no change to the current behaviour and uverbs_remove_one will return only when the last client has disconnected and the reference count on the uverbs device has become 0. In case the lower driver device was removed, any application will continue working over a zombie HCA; further calls will end with an immediate error. 
Signed-off-by: Yishai Hadas yish...@mellanox.com Signed-off-by: Shachar Raindel rain...@mellanox.com Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com --- drivers/infiniband/core/uverbs.h | 9 +- drivers/infiniband/core/uverbs_main.c | 360 +++-- include/rdma/ib_verbs.h | 1 + 3 files changed, 302 insertions(+), 68 deletions(-) diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index ea52db1..3863d33 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -89,12 +89,16 @@ struct ib_uverbs_device { int num_comp_vectors; struct completion comp; struct device *dev; - struct ib_device *ib_dev; + struct ib_device __rcu *ib_dev; int devnum; struct cdev cdev; struct rb_root xrcd_tree; struct mutex xrcd_tree_mutex; struct kobject kobj; + struct srcu_struct disassociate_srcu; + struct mutex lists_mutex; /* protect lists */ + struct list_head uverbs_file_list; + struct list_head uverbs_events_file_list; }; struct ib_uverbs_event_file { @@ -106,6 +110,7 @@ struct ib_uverbs_event_file { wait_queue_head_t poll_wait; struct fasync_struct *async_queue; struct list_head event_list; + struct list_head list; }; struct ib_uverbs_file { @@ -115,6 +120,8 @@ struct ib_uverbs_file { struct ib_ucontext *ucontext; struct ib_event_handler event_handler; struct ib_uverbs_event_file *async_file; + struct list_head list; + int is_closed; }; struct ib_uverbs_event { diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index dc968df..59b28a6 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -137,6 +137,7 @@ static void ib_uverbs_release_dev(struct kobject *kobj) struct ib_uverbs_device *dev = container_of(kobj, struct ib_uverbs_device, kobj); + cleanup_srcu_struct(&dev->disassociate_srcu); kfree(dev); } @@ -207,9 +208,6 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, { struct ib_uobject *uobj, *tmp; - if (!context) - return 0; - 
context->closing = 1; list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) { @@ -318,8 +316,16 @@ static void ib_uverbs_release_file(struct kref *ref) { struct ib_uverbs_file *file = container_of(ref, struct ib_uverbs_file, ref); + struct ib_device *ib_dev; + int srcu_key; + + srcu_key = srcu_read_lock(&file->device->disassociate_srcu); + ib_dev = srcu_dereference(file->device->ib_dev, + &file->device->disassociate_srcu); + if (ib_dev && !ib_dev->disassociate_ucontext) + module_put(ib_dev->owner); + srcu_read_unlock(&file->device->disassociate_srcu, srcu_key); - module_put(file->device->ib_dev->owner); if (atomic_dec_and_test(&file->device->refcount)) ib_uverbs_comp_dev(file->device); @@ -343,9 +349,19 @@ static ssize_t ib_uverbs_event_read(struct file *filp, char __user *buf, return -EAGAIN; if (wait_event_interruptible(file->poll_wait, -
[PATCH for-next V8 5/6] IB/mlx4_ib: Disassociate support
Implements the IB core disassociate_ucontext API. The driver detaches the HW resources for a given user context to prevent a dependency between application termination and device disconnecting. This is done by managing the VMAs that were mapped to the HW bars, such as the doorbell and blueflame. When a detach is needed, they are remapped to an arbitrary kernel page returned by the zap API. Signed-off-by: Yishai Hadas yish...@mellanox.com Signed-off-by: Jack Morgenstein ja...@mellanox.com --- drivers/infiniband/hw/mlx4/main.c | 139 +- drivers/infiniband/hw/mlx4/mlx4_ib.h | 13 +++ 2 files changed, 150 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 8be6db8..3097a27 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -692,7 +692,7 @@ static struct ib_ucontext *mlx4_ib_alloc_ucontext(struct ib_device *ibdev, resp.cqe_size = dev->dev->caps.cqe_size; } - context = kmalloc(sizeof *context, GFP_KERNEL); + context = kzalloc(sizeof(*context), GFP_KERNEL); if (!context) return ERR_PTR(-ENOMEM); @@ -729,21 +729,143 @@ static int mlx4_ib_dealloc_ucontext(struct ib_ucontext *ibcontext) return 0; } +static void mlx4_ib_vma_open(struct vm_area_struct *area) +{ + /* vma_open is called when a new VMA is created on top of our VMA. + * This is done through either the mremap flow or split_vma (usually due + * to mlock, madvise, munmap, etc.). We do not support a clone of the + * vma, as this VMA is strongly hardware related. Therefore we set the + * vm_ops of the newly created/cloned VMA to NULL, to prevent it from + * calling us again and trying to do incorrect actions. We assume that + * the original vma size is exactly a single page, so there will be no + * splitting operations on it. 
+ */ + area->vm_ops = NULL; +} + +static void mlx4_ib_vma_close(struct vm_area_struct *area) +{ + struct mlx4_ib_vma_private_data *mlx4_ib_vma_priv_data; + + /* It's guaranteed that all VMAs opened on a FD are closed before the + * file itself is closed, therefore no sync is needed with the regular + * closing flow. (e.g. mlx4_ib_dealloc_ucontext) However we need a sync + * with accessing the vma as part of mlx4_ib_disassociate_ucontext. + * The close operation is usually called under mm->mmap_sem except when + * the process is exiting. The exiting case is handled explicitly as part + * of mlx4_ib_disassociate_ucontext. + */ + mlx4_ib_vma_priv_data = (struct mlx4_ib_vma_private_data *) + area->vm_private_data; + + /* set the vma context pointer to null in the mlx4_ib driver's private + * data to protect against a race condition in mlx4_ib_disassociate_ucontext(). + */ + mlx4_ib_vma_priv_data->vma = NULL; +} + +static const struct vm_operations_struct mlx4_ib_vm_ops = { + .open = mlx4_ib_vma_open, + .close = mlx4_ib_vma_close +}; + +static void mlx4_ib_disassociate_ucontext(struct ib_ucontext *ibcontext) +{ + int i; + int ret = 0; + struct vm_area_struct *vma; + struct mlx4_ib_ucontext *context = to_mucontext(ibcontext); + struct task_struct *owning_process = NULL; + struct mm_struct *owning_mm = NULL; + + owning_process = get_pid_task(ibcontext->tgid, PIDTYPE_PID); + if (!owning_process) + return; + + owning_mm = get_task_mm(owning_process); + if (!owning_mm) { + pr_info("no mm, disassociate ucontext is pending task termination\n"); + while (1) { + /* make sure that the task is dead before returning, it may + * prevent a rare case of module down in parallel to a + * call to mlx4_ib_vma_close. 
+ */ + put_task_struct(owning_process); + msleep(1); + owning_process = get_pid_task(ibcontext->tgid, + PIDTYPE_PID); + if (!owning_process || + owning_process->state == TASK_DEAD) { + pr_info("disassociate ucontext done, task was terminated\n"); + /* in case task was dead need to release the task struct */ + if (owning_process) + put_task_struct(owning_process); + return; + } + } + } + + /* need to protect from a race on closing the vma as part of + * mlx4_ib_vma_close(). + */ + down_read(&owning_mm->mmap_sem); + for (i = 0; i < HW_BAR_COUNT; i++) { + vma =
Re: [RFC] split struct ib_send_wr
On Wed, Aug 12, 2015 at 08:24:49PM +0300, Sagi Grimberg wrote: Just a nit that I've noticed, in mlx4 set_fmr_seg params are not aligned to the parenthesis (maybe in other locations too but I haven't noticed such...) This is just using a normal two tab indent for continued function parameters..