[PATCH v1] IB/cma: Use inner P_Key to determine netdev
When discussing the patches to demux ids in rdma_cm instead of ib_cm, it was decided that it is best to use the P_Key value in the packet headers. However, the mlx5 and ipath drivers are currently unable to send correct P_Key values in GMP headers. They always send using a single P_Key that is set during the GSI QP initialization. Change the rdma_cm code to look at the P_Key value that is part of the packet payload as a workaround. Once the drivers are fixed this patch can be reverted. Fixes: 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to RDMA CM") Signed-off-by: Haggai Eran--- Changes from v0: - improve commit message drivers/infiniband/core/cma.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 59a2dafc8c57..e8324543e085 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -1067,14 +1067,14 @@ static int cma_save_req_info(const struct ib_cm_event *ib_event, sizeof(req->local_gid)); req->has_gid= true; req->service_id = req_param->primary_path->service_id; - req->pkey = req_param->bth_pkey; + req->pkey = be16_to_cpu(req_param->primary_path->pkey); break; case IB_CM_SIDR_REQ_RECEIVED: req->device = sidr_param->listen_id->device; req->port = sidr_param->port; req->has_gid= false; req->service_id = sidr_param->service_id; - req->pkey = sidr_param->bth_pkey; + req->pkey = sidr_param->pkey; break; default: return -EINVAL; -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1] IB/cma: Use inner P_Key to determine netdev
On 10/20/2015 02:53 AM, Haggai Eran wrote: > When discussing the patches to demux ids in rdma_cm instead of ib_cm, it > was decided that it is best to use the P_Key value in the packet headers. > However, the mlx5 and ipath drivers are currently unable to send correct > P_Key values in GMP headers. They always send using a single P_Key that is > set during the GSI QP initialization. > > Change the rdma_cm code to look at the P_Key value that is part of the > packet payload as a workaround. Once the drivers are fixed this patch can > be reverted. > > Fixes: 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to > RDMA CM") > Signed-off-by: Haggai Eran> --- > Changes from v0: > - improve commit message > > drivers/infiniband/core/cma.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index 59a2dafc8c57..e8324543e085 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -1067,14 +1067,14 @@ static int cma_save_req_info(const struct ib_cm_event > *ib_event, > sizeof(req->local_gid)); > req->has_gid= true; > req->service_id = req_param->primary_path->service_id; > - req->pkey = req_param->bth_pkey; > + req->pkey = be16_to_cpu(req_param->primary_path->pkey); > break; > case IB_CM_SIDR_REQ_RECEIVED: > req->device = sidr_param->listen_id->device; > req->port = sidr_param->port; > req->has_gid= false; > req->service_id = sidr_param->service_id; > - req->pkey = sidr_param->bth_pkey; > + req->pkey = sidr_param->pkey; > break; > default: > return -EINVAL; > And, to be clear, you are looking for this to be part of 4.3-rc updates, yes? -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: OpenPGP digital signature
[PATCH for-next 0/5] RDMA/ocrdma: bug fixes
This patch series includs fix in CQ door bell logic, check to prevent driver crash and code cleanup. Please apply. Devesh Sharma (1): RDMA/ocrdma: Prevent CQ-Doorbell floods Naga Irrinki (1): RDMA/ocrdma: Check resource ids received in Async CQE Selvin Xavier (3): RDMA/ocrdma: Cleanup unused device list and rcu variables RDMA/ocrdma: Avoid a possible crash in ocrdma_rem_port_stats RDMA/ocrdma: Bump up ocrdma version number to 11.0.0.0 drivers/infiniband/hw/ocrdma/ocrdma.h | 3 +-- drivers/infiniband/hw/ocrdma/ocrdma_hw.c| 30 + drivers/infiniband/hw/ocrdma/ocrdma_main.c | 16 ++- drivers/infiniband/hw/ocrdma/ocrdma_stats.c | 2 +- drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 11 +++ 5 files changed, 33 insertions(+), 29 deletions(-) -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next 1/5] RDMA/ocrdma: Cleanup unused device list and rcu variables
ocrdma_dev_list is not used by the driver. So removing the references of this variable. dev->rcu was introduced for the ipv6 notifier for GID management. This is no longer required as the GID management is outside the HW driver. Signed-off-by: Padmanabh RatnakarSigned-off-by: Selvin Xavier --- drivers/infiniband/hw/ocrdma/ocrdma.h | 1 - drivers/infiniband/hw/ocrdma/ocrdma_main.c | 16 ++-- 2 files changed, 2 insertions(+), 15 deletions(-) diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h b/drivers/infiniband/hw/ocrdma/ocrdma.h index b4091ab..8cd57c7 100644 --- a/drivers/infiniband/hw/ocrdma/ocrdma.h +++ b/drivers/infiniband/hw/ocrdma/ocrdma.h @@ -278,7 +278,6 @@ struct ocrdma_dev { u32 hba_port_num; struct list_head entry; - struct rcu_head rcu; int id; u64 *stag_arr; u8 sl; /* service level */ diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c index 87aa55d..cb1af0f 100644 --- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c +++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c @@ -63,8 +63,6 @@ MODULE_DESCRIPTION(OCRDMA_ROCE_DRV_DESC " " OCRDMA_ROCE_DRV_VERSION); MODULE_AUTHOR("Emulex Corporation"); MODULE_LICENSE("Dual BSD/GPL"); -static LIST_HEAD(ocrdma_dev_list); -static DEFINE_SPINLOCK(ocrdma_devlist_lock); static DEFINE_IDR(ocrdma_dev_id); void ocrdma_get_guid(struct ocrdma_dev *dev, u8 *guid) @@ -325,9 +323,6 @@ static struct ocrdma_dev *ocrdma_add(struct be_dev_info *dev_info) for (i = 0; i < ARRAY_SIZE(ocrdma_attributes); i++) if (device_create_file(>ibdev.dev, ocrdma_attributes[i])) goto sysfs_err; - spin_lock(_devlist_lock); - list_add_tail_rcu(>entry, _dev_list); - spin_unlock(_devlist_lock); /* Init stats */ ocrdma_add_port_stats(dev); /* Interrupt Moderation */ @@ -356,9 +351,8 @@ idr_err: return NULL; } -static void ocrdma_remove_free(struct rcu_head *rcu) +static void ocrdma_remove_free(struct ocrdma_dev *dev) { - struct ocrdma_dev *dev = container_of(rcu, struct ocrdma_dev, rcu); idr_remove(_dev_id, dev->id); kfree(dev->mbx_cmd); @@ -375,15 +369,9 @@ static void ocrdma_remove(struct ocrdma_dev *dev) ib_unregister_device(>ibdev); ocrdma_rem_port_stats(dev); - - spin_lock(_devlist_lock); - list_del_rcu(>entry); - spin_unlock(_devlist_lock); - ocrdma_free_resources(dev); ocrdma_cleanup_hw(dev); - - call_rcu(>rcu, ocrdma_remove_free); + ocrdma_remove_free(dev); } static int ocrdma_open(struct ocrdma_dev *dev) -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next 5/5] RDMA/ocrdma: Bump up ocrdma version number to 11.0.0.0
Updating the version number to 11.0.0.0 Signed-off-by: Selvin Xavier--- drivers/infiniband/hw/ocrdma/ocrdma.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h b/drivers/infiniband/hw/ocrdma/ocrdma.h index 8cd57c7..8f047f9 100644 --- a/drivers/infiniband/hw/ocrdma/ocrdma.h +++ b/drivers/infiniband/hw/ocrdma/ocrdma.h @@ -55,7 +55,7 @@ #include #include "ocrdma_sli.h" -#define OCRDMA_ROCE_DRV_VERSION "10.6.0.0" +#define OCRDMA_ROCE_DRV_VERSION "11.0.0.0" #define OCRDMA_ROCE_DRV_DESC "Emulex OneConnect RoCE Driver" #define OCRDMA_NODE_DESC "Emulex OneConnect RoCE HCA" -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next 4/5] RDMA/ocrdma: Prevent CQ-Doorbell floods
From: Devesh SharmaChanging CQ-Doorbell(DB) logic to prevent DB floods, it is supposed to be pressed only if any hw CQE is polled. If cq-arm was requested previously then don't bother about number of hw CQEs polled and arm the CQ. Signed-off-by: Devesh Sharma Signed-off-by: Selvin Xavier --- drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 11 +++ 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c index 1f3affb..1c4e83d 100644 --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c @@ -2933,16 +2933,11 @@ expand_cqe: } stop_cqe: cq->getp = cur_getp; - if (cq->deferred_arm) { - ocrdma_ring_cq_db(dev, cq->id, true, cq->deferred_sol, - polled_hw_cqes); + if (cq->deferred_arm || polled_hw_cqes) { + ocrdma_ring_cq_db(dev, cq->id, cq->deferred_arm, + cq->deferred_sol, polled_hw_cqes); cq->deferred_arm = false; cq->deferred_sol = false; - } else { - /* We need to pop the CQE. No need to arm */ - ocrdma_ring_cq_db(dev, cq->id, false, cq->deferred_sol, - polled_hw_cqes); - cq->deferred_sol = false; } return i; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next 3/5] RDMA/ocrdma: Check resource ids received in Async CQE
From: Naga IrrinkiSome versions of the FW sends wrong QP or CQ IDs in the Async CQE. Adding a check to see whether qp or cq structures associated with the CQE is valid. Signed-off-by: Devesh Sharma Signed-off-by: Selvin Xavier --- drivers/infiniband/hw/ocrdma/ocrdma_hw.c | 30 ++ 1 file changed, 26 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c index aab391a..9d99142 100644 --- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c +++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c @@ -678,11 +678,33 @@ static void ocrdma_dispatch_ibevent(struct ocrdma_dev *dev, int dev_event = 0; int type = (cqe->valid_ae_event & OCRDMA_AE_MCQE_EVENT_TYPE_MASK) >> OCRDMA_AE_MCQE_EVENT_TYPE_SHIFT; + u16 qpid = cqe->qpvalid_qpid & OCRDMA_AE_MCQE_QPID_MASK; + u16 cqid = cqe->cqvalid_cqid & OCRDMA_AE_MCQE_CQID_MASK; - if (cqe->qpvalid_qpid & OCRDMA_AE_MCQE_QPVALID) - qp = dev->qp_tbl[cqe->qpvalid_qpid & OCRDMA_AE_MCQE_QPID_MASK]; - if (cqe->cqvalid_cqid & OCRDMA_AE_MCQE_CQVALID) - cq = dev->cq_tbl[cqe->cqvalid_cqid & OCRDMA_AE_MCQE_CQID_MASK]; + /* +* Some FW version returns wrong qp or cq ids in CQEs. +* Checking whether the IDs are valid +*/ + + if (cqe->qpvalid_qpid & OCRDMA_AE_MCQE_QPVALID) { + if (qpid < dev->attr.max_qp) + qp = dev->qp_tbl[qpid]; + if (qp == NULL) { + pr_err("ocrdma%d:Async event - qpid %u is not valid\n", + dev->id, qpid); + return; + } + } + + if (cqe->cqvalid_cqid & OCRDMA_AE_MCQE_CQVALID) { + if (cqid < dev->attr.max_cq) + cq = dev->cq_tbl[cqid]; + if (cq == NULL) { + pr_err("ocrdma%d:Async event - cqid %u is not valid\n", + dev->id, cqid); + return; + } + } memset(_evt, 0, sizeof(ib_evt)); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next 2/5] RDMA/ocrdma: Avoid a possible crash in ocrdma_rem_port_stats
debugfs_remove should be called before freeing the driver stats resources to avoid any crash during ocrdma_remove. Signed-off-by: Devesh SharmaSigned-off-by: Selvin Xavier --- drivers/infiniband/hw/ocrdma/ocrdma_stats.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_stats.c b/drivers/infiniband/hw/ocrdma/ocrdma_stats.c index 69334e2..86c303a 100644 --- a/drivers/infiniband/hw/ocrdma/ocrdma_stats.c +++ b/drivers/infiniband/hw/ocrdma/ocrdma_stats.c @@ -855,9 +855,9 @@ void ocrdma_rem_port_stats(struct ocrdma_dev *dev) { if (!dev->dir) return; + debugfs_remove(dev->dir); mutex_destroy(>stats_lock); ocrdma_release_stats_mem(dev); - debugfs_remove(dev->dir); } void ocrdma_init_debugfs(void) -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1] IB/cma: Use inner P_Key to determine netdev
On 20/10/2015 10:20, Doug Ledford wrote: > On 10/20/2015 02:53 AM, Haggai Eran wrote: >> When discussing the patches to demux ids in rdma_cm instead of ib_cm, it >> was decided that it is best to use the P_Key value in the packet headers. >> However, the mlx5 and ipath drivers are currently unable to send correct >> P_Key values in GMP headers. They always send using a single P_Key that is >> set during the GSI QP initialization. >> >> Change the rdma_cm code to look at the P_Key value that is part of the >> packet payload as a workaround. Once the drivers are fixed this patch can >> be reverted. >> >> Fixes: 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to >> RDMA CM") >> Signed-off-by: Haggai Eran>> --- >> Changes from v0: >> - improve commit message >> >> drivers/infiniband/core/cma.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c >> index 59a2dafc8c57..e8324543e085 100644 >> --- a/drivers/infiniband/core/cma.c >> +++ b/drivers/infiniband/core/cma.c >> @@ -1067,14 +1067,14 @@ static int cma_save_req_info(const struct >> ib_cm_event *ib_event, >> sizeof(req->local_gid)); >> req->has_gid= true; >> req->service_id = req_param->primary_path->service_id; >> -req->pkey = req_param->bth_pkey; >> +req->pkey = be16_to_cpu(req_param->primary_path->pkey); >> break; >> case IB_CM_SIDR_REQ_RECEIVED: >> req->device = sidr_param->listen_id->device; >> req->port = sidr_param->port; >> req->has_gid= false; >> req->service_id = sidr_param->service_id; >> -req->pkey = sidr_param->bth_pkey; >> +req->pkey = sidr_param->pkey; >> break; >> default: >> return -EINVAL; >> > > And, to be clear, you are looking for this to be part of 4.3-rc updates, > yes? Yes, the issue was introduced in 4.3 in my cma demux patch. Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 00/26] New fast registration API
Doug, are you planning on taking this for 4.4? I think this set has converged towards inclusion. Reminder, this series goes on top of Christoph's wr_cleanup patches and iser bounce buffering cleanup. Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: merge struct ib_device_attr into struct ib_device V2
On Tue, Oct 20, 2015 at 4:08 PM, Sagi Grimbergwrote: >>> I think this is very useful to have and I'd love having it in 4.4, >>> does anyone have any other comments on this patch? >> Were we ever presented with performance gains achieved using the patch? > Can you explain what you mean by performance gains? My understanding is > that this patch is not performance critical. It just replaces each and > every ULP device attributes caching. oops, sorry, I was referring to the patch that deals with re-structuring of struct ib_wc, so... (1) did we even got performance gains achieved using **that** patch? (2) re this one, as I wrote in the past, I am in favor of simple caching of struct ib_device_attr on struct ib_device (best with pointer) and not adding 333 fields to struct ib_device, I don't see the benefit from this patch. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: merge struct ib_device_attr into struct ib_device V2
So this patch seems to work well for me. I've ran some kernel applications (ipoib, iser, srp, nfs) and a user-space application (ibsrpdm) over mlx4/mlx5 and didn't spot any issues. I think this is very useful to have and I'd love having it in 4.4, does anyone have any other comments on this patch? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/23] Update driver to 0.9-294
> > Perhaps I did not chose my words carefully enough. > > The largest issue on the TODO list is the refactoring of the code to be > shared between the hfi1 and qib driver. While the hardware between hfi1 and > qib is similar and thus the initial code looked similar, our performance > tuning on hfi1 has revealed that some changes will be required to the hfi1 > code to fully utilize the hardware. We would like to get these updates in to > make sure that the refactoring work is done properly for the 2 hardware types. Please keep in mind that there are 3 HW types (our SoftRoCE driver) that needs to be part of the framework. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: merge struct ib_device_attr into struct ib_device V2
On Tue, Oct 20, 2015 at 3:00 PM, Sagi Grimbergwrote: > So this patch seems to work well for me. I've ran some kernel > applications (ipoib, iser, srp, nfs) and a user-space application > (ibsrpdm) over mlx4/mlx5 and didn't spot any issues. > > I think this is very useful to have and I'd love having it in 4.4, > does anyone have any other comments on this patch? Were we ever presented with performance gains achieved using the patch? Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/cma: Use inner P_Key to determine netdev
On 19/10/2015 21:19, Jason Gunthorpe wrote: > On Mon, Oct 19, 2015 at 09:09:25PM +0300, Haggai Eran wrote: >> When discussing the patches to demux ids in rdma_cm instead of ib_cm, it >> was decided that it is best to use the P_Key value in the packet headers >> [1]. However, some drivers are currently unable to send correct P_Key in >> GMP headers. > > You should explicitly describe the broken drivers in the commit text. These are mlx5 and ipath. I'll add them to the commit message. > I thought mlx5 was fixed for receive already? I'm confused why we need > this. mlx5 had two issues related to GSI pkeys. The issue that was fixed was that it treated the pkey value returned by the hardware in receive completions as a pkey_index. The remaining issue is that it doesn't respect the ib_send_wr.ud.pkey_index field when sending. With the current state of things, cma will try to look for an ipoib net dev matching the BTH pkey of the request, but if the sender is mlx5 or ipath, the BTH pkey would be the default pkey. If the request was intended for a different pkey, cma won't find a matching netdev and will throw away the request. Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH rdma-cm] IB/core: Fix use after free of ifa
On 10/19/2015 6:27 PM, Doug Ledford wrote: On 10/19/2015 10:20 AM, Matan Barak wrote: On 10/19/2015 3:23 PM, Doug Ledford wrote: This is a major cause of the slowness. Unless you have a specific need of them, per-entry rwlocks are *NOT* a good idea. I was going to bring this up separately, so I'll just mention it here. Per entry locks help reduce contention when you have lots of multiple accessors. Using rwlocks help reduce contention when you have a read-mostly entry that is only occasionally changed. But every lock and every unlock (just like every atomic access) still requires a locked memory cycle. That means every lock acquire and every lock release requires a synchronization event between all CPUs. Using per-entry locks means that every entry triggers two of those synchronization events. On modern Intel CPUs, they cost about 32 cycles per event. If you are going to do something, and it can't be done without a lock, then grab a single lock and do it as fast as you can. Only in rare cases would you want per-entry locks. I agree that every rwlock costs us locked access. However, lets look at the common scenario here. No, let's not. Especially if your "common scenario" causes you to ignore pathological bad behavior. Optimizing for the common case is not an excuse to enable pathological behavior. I think that in production stable systems, the IPs our rdma-cm stack uses (and thus GIDs) should be pretty stable. Probably true. Thus, most accesses should be read calls. Probably true. That's why IMHO read-write access makes sense here. Probably true. Regarding single-lock vs per entry lock, it really depends on common an entry could be updated while another entry is being used by an application. No, it depends on more than that. In addition to the frequency of updates versus lookups, there is also the issue of lookup cost and update cost. Using per entry locks makes both the lookup cost and the update cost of anything other than the first gid in the table grow exponentially. A good way to demonstrate this would be to create 100 vlans on a RoCE device, then run the cmtime test between two hosts using the 100th vlan IP at each end. Using per-entry locks, the performance of this test relative to using the first IP on the device will be pathologically bad. That's correct, we"waste" #entries * . But we have to note this is mainly during the connecting part. In a (future) dynamic system, you might want to create containers dynamically, which will add a net device and change the hardware GID table while other application (maybe in another container) uses other GIDs, but this might be rare scenario. This is a non-issue. Creating a container and an application inside that container to use an RDMA interface is already a heavy weight operation (minimum fork/exec/program startup). Table lookups can be needed thousands of times per second under certain conditions and must be fast. Per-entry locks are only fast for the first item in the list. After that, they progressively get slower and slower than usage with a single lock. We first used RCU and seqcount in order to protect each entry. When we changed that to rwlock, we added some per-entry work. Ok, refactoring this code for a single lock shouldn't be problematic. Regarding performance, I think the results here are vastly impacted by the rate we'll add/remove IPs or upper net-devices in the background. Not really. You will never add/remove IPs fast enough to justify per-entry locks, especially when you consider that ib_cache_gid_add calls find_gid twice, once going over the entire table if the gid isn't found, before ever calling add_gid. The ocrdma table isn't bad because it's only 16 entries. But the mlx4 table is going to be 128 I suspect, so you can do the test I mentioned above and use 100 vlans. In that scenario, the 101st gid will require that you take and release 128 locks to scan for the entry in the list, it will come back empty, then you will take and release 101 locks until you find an empty entry, and then you will take a write lock as you write the gid. That's 230 total locks, for 460 total locked memory operations at 32 cycles each, so a total of 14,720 cycles spent doing nothing but locking the memory bus. And each of those operations slows down all of the CPUs in the system. Where I've seen this sort of behavior totally wreck a production system is when the GID you need to look up is on port 2. Even if it is the firstmost GID on port 2, if you have to search all of port 1's table first, then by the time you get to port 2's table, you've already lost. So, here's some suggestions: 1) Use a single lock on the table per port. Make sense as an enhancement. 2) Change find_gid so that it will take an additional element, int *empty_idx, and as it is scanning the gid table, when it runs across an empty entry, if empty_idx != NULL, *empty_idx = idx;. This prevents needing to scan
Re: [PATCH TRIVIAL] ib_pack.h: Fix commentary IBA reference for CNP in IB opcode enum
On 09/30/2015 03:04 PM, Hal Rosenstock wrote: > > IBA spec is now 1.3 not 3.1 and vol 1 should be mentioned > as there is also vol 2. > > Signed-off-by: Hal Rosenstock> --- > diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h > index 709a533..e99d8f9 100644 > --- a/include/rdma/ib_pack.h > +++ b/include/rdma/ib_pack.h > @@ -76,7 +76,7 @@ enum { > IB_OPCODE_UC= 0x20, > IB_OPCODE_RD= 0x40, > IB_OPCODE_UD= 0x60, > - /* per IBTA 3.1 Table 38, A10.3.2 */ > + /* per IBTA 1.3 vol 1 Table 38, A10.3.2 */ Thanks Hal, I've added this to the 4.4 list. -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: [PATCH] IB/iser: fix a comment typo
On 10/04/2015 05:00 AM, Geliang Tang wrote: > Just fix a typo in the code comment. > > Signed-off-by: Geliang TangThis has been picked up for the 4.4 merge window. -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: [PATCH] IB/ucma: check workqueue allocation before usage
On 10/01/2015 10:54 AM, Sasha Levin wrote: > On 09/17/2015 06:59 PM, Jason Gunthorpe wrote: >> On Thu, Sep 17, 2015 at 04:04:19PM -0400, Sasha Levin wrote: >>> Allocating a workqueue might fail, which wasn't checked so far and would >>> lead to NULL ptr derefs when an attempt to use it was made. >> >> Indeed. >> >> Yishai, please check this and check the other patches you've sent to >> see if they have a similar error. > > Ping? It doesn't seem that the patch got anywhere. Sorry, I took the "it applies to linux-next" to mean it wasn't against my tree yet and was against upcoming patches. That wasn't the case. It's been applied. -- Doug LedfordGPG KeyID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: 4.4's rdma plate
On Wed, Oct 14, 2015 at 3:56 PM, Doug Ledfordwrote: > On 10/14/2015 12:51 AM, Or Gerlitz wrote: >> On Sun, Oct 11, 2015 at 11:58 PM, Or Gerlitz wrote: >>> Hi Doug, >>> As we're soon on 4.3-rc5, could you please update how things look for 4.4? >>> >>> The mutlicast loopback prevention patches from Eran were intially >>> posted ~two months ago and went through testing by Christoph and Co. >> [...] >> Hi Doug, >> So we're after rc5 and not a single word from you re the plans for the >> next kernel, and your branch for that purpose contains exactly two >> patches... I do >> see few chunks >> of patches marked as under review in your patchworks, so there might >> be sign for getting some hope, is that real? > Yes Or, I'm working on it. A new 4.3-rc pull request should come out > today and an update on the for next area probably as well. Doug, So we're after rc6 and still pretty much nothing in your for-4.4 branch, are we getting there (e.g picking something from the backlog such as the self multicast loopback prevention and the rdma-cm name space patches) or for 4.4 we'd be coming naked? Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.4's rdma plate
On Tue, Oct 20, 2015 at 7:31 PM, Doug Ledfordwrote: > On 10/20/2015 12:25 PM, Or Gerlitz wrote: >> On Wed, Oct 14, 2015 at 3:56 PM, Doug Ledford wrote: >>> On 10/14/2015 12:51 AM, Or Gerlitz wrote: On Sun, Oct 11, 2015 at 11:58 PM, Or Gerlitz wrote: > Hi Doug, > As we're soon on 4.3-rc5, could you please update how things look for 4.4? > > The mutlicast loopback prevention patches from Eran were intially > posted ~two months ago and went through testing by Christoph and Co. [...] >> >> Hi Doug, >> So we're after rc5 and not a single word from you re the plans for the next kernel, and your branch for that purpose contains exactly two patches... I do see few chunks of patches marked as under review in your patchworks, so there might be sign for getting some hope, is that real? >> >>> Yes Or, I'm working on it. A new 4.3-rc pull request should come out >>> today and an update on the for next area probably as well. >> Doug, >> So we're after rc6 and still pretty much nothing in your for-4.4 >> branch, are we getting there >> (e.g picking something from the backlog such as the self multicast >> loopback prevention and the >> rdma-cm name space patches) or for 4.4 we'd be coming naked? > Another pull request for the final 4.3 fixups will go in today. I > prioritized those over the for-next work. > Next I'll be pulling simple for-next items, then I'll be reviewing the Sounds good as a starter > then I'll be reviewing the > RoCE GID cadche usage in verbs/cma v2 patchset, then the Add RoCE v2 > support v1 patchset, then the New fast registration API v5 set. Doug, the New fast registration API v5 set has E2E community consensus, I guess there's not much left for you to dig there, so these are good news. As for the RoCE patches, you mentioned reviewing them, this is going to take time, which makes sense. However, the self-multicast loopback prevention patches are (1) sitting in the mailing list for months (2) small in volume and very large in benefit (3) passed testing by Christoph and Co, any reason not to pick them prior to review of the RoCE code. We need to encourage developers to take part in the upstream process, For this series I am clueless towards the developers where they ask me what's missing? why this isn't moving for months? Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC
On 10/20/2015 11:57 AM, Hefty, Sean wrote: >> Today, cm assumes paths are reversible primary_path->reversible = 1. > > I can't quickly find a link, but I believe all CM MADs are reversible, per > the spec. Spec citation for this is: C12-5.1.3: All responses generated by the CM protocol shall follow the rules for response generation that are enumerated in 13.5.4 Response Generation and Reversible Paths. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/23] Update driver to 0.9-294
On Tue, Oct 20, 2015 at 03:45:47PM +0300, Moni Shoua wrote: > > > > Perhaps I did not chose my words carefully enough. > > > > The largest issue on the TODO list is the refactoring of the code to be > > shared between the hfi1 and qib driver. While the hardware between hfi1 > > and qib is similar and thus the initial code looked similar, our > > performance tuning on hfi1 has revealed that some changes will be required > > to the hfi1 code to fully utilize the hardware. We would like to get these > > updates in to make sure that the refactoring work is done properly for the > > 2 hardware types. > > Please keep in mind that there are 3 HW types (our SoftRoCE driver) > that needs to be part of the framework. Understood, however, unlike SoftRoCE, qib and hfi currently share a lot of code to drive the hardware. The underlying reason for the TODO item "Remove software processing of IB protocol..." is because we have a large amount of duplicated code between these drivers. _Some_ of which, at a high level, is sharable with SoftRoCE. These patches (and more to follow), further differentiate the 2 drivers along hardware lines. Accepting these patches will help us make sure that we don't create some common code between qib and hfi which, because of our testing we now know, needs to be separated out. This is a separate issue from the higher level code sharing which will be done with SoftRoCE. Ira -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/cma: Potential NULL dereference in cma_id_from_event
On 09/21/2015 09:02 AM, Haggai Eran wrote: > If the lookup of a listening ID failed for an AF_IB request, the code > would try to call dev_put() on a NULL net_dev. > > Fixes: be688195bd08 ("IB/cma: Fix net_dev reference leak with failed > requests") > Reported-by: Dan Carpenter> Signed-off-by: Haggai Eran > --- > drivers/infiniband/core/cma.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c > index b1ab13f3e182..b92a3c2c060b 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -1302,7 +1302,7 @@ static struct rdma_id_private *cma_id_from_event(struct > ib_cm_id *cm_id, > bind_list = cma_ps_find(rdma_ps_from_service_id(req.service_id), > cma_port_from_service_id(req.service_id)); > id_priv = cma_find_listener(bind_list, cm_id, ib_event, , *net_dev); > - if (IS_ERR(id_priv)) { > + if (IS_ERR(id_priv) && *net_dev) { > dev_put(*net_dev); > *net_dev = NULL; > } > This one got lost back in the noise of all the for-next patches. Sorry to have missed it. I've picked it up now for -rc. -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: merge struct ib_device_attr into struct ib_device V2
But this makes struct ib_device much much bigger, and this structure is used in **fast** path, e.g the ULP traverses through a pointer from struct ib_device to post_send/recv, poll_cq and friends. Umm, all the caps are appended to the end of the ib_device structure so I don't see how it will affect the fast path - am I missing something? The networking community will let you work for 10y before they add a field to struct net_device exactly b/c of the reason I brought. Why here we can come out of the blue and add tens if not hundreds of fields to our device structure? Keeping struct ib_device_attr embedded at the end of struct ib_device is exactly the same isn't it? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: merge struct ib_device_attr into struct ib_device V2
On Tue, Oct 20, 2015 at 6:00 PM, Sagi Grimbergwrote: >> (2) re this one, as I wrote in the past, I am in favor of simple >> caching of struct ib_device_attr on struct ib_device (best with pointer) and >> not adding >> 333 fields to struct ib_device, I don't see the benefit from this patch. > Christoph commented on that: > > "I'm strongly against this. As the reviews show the move is highly > confusing. The attributes don't change and there is no need to 'cache' > or 'query' them. Just merge them into the device, follow years of > experience with how every other Linux subsystem does it and be done > with it." > > If the attributes are automatically a part of the ib_device then it > does seem a bit redundant keeping the structure around... But this makes struct ib_device much much bigger, and this structure is used in **fast** path, e.g the ULP traverses through a pointer from struct ib_device to post_send/recv, poll_cq and friends. The networking community will let you work for 10y before they add a field to struct net_device exactly b/c of the reason I brought. Why here we can come out of the blue and add tens if not hundreds of fields to our device structure? Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: merge struct ib_device_attr into struct ib_device V2
On Tue, Oct 20, 2015 at 6:13 PM, Sagi Grimbergwrote: >> The networking community will let you work for 10y before they add a >> field to struct net_device exactly b/c >> of the reason I brought. Why here we can come out of the blue and add >> tens if not hundreds of fields to our device structure? > Keeping struct ib_device_attr embedded at the end of struct ib_device is > exactly the same isn't it? I said " (best with pointer)" -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC
> Today, cm assumes paths are reversible primary_path->reversible = 1. I can't quickly find a link, but I believe all CM MADs are reversible, per the spec.
Re: 4.4's rdma plate
On Tue, Oct 20, 2015 at 07:42:25PM +0300, Or Gerlitz wrote: > As for the RoCE patches, you mentioned reviewing them, this is going to > take time, which makes sense. However, the self-multicast loopback prevention > patches are My issue with the latest round of UAPI changes is I don't see any Reviewed-By tag from someone at Mellanox whom I think understands our UAPI. UAPI changes are *NOT* something first-time list contributors should be doing unassisted. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH rdma-cm] IB/core: Fix use after free of ifa
On 10/20/2015 10:50 AM, Matan Barak wrote: > > > On 10/19/2015 6:27 PM, Doug Ledford wrote: >> On 10/19/2015 10:20 AM, Matan Barak wrote: >>> On 10/19/2015 3:23 PM, Doug Ledford wrote: >> This is a major cause of the slowness. Unless you have a specific need of them, per-entry rwlocks are *NOT* a good idea. I was going to bring this up separately, so I'll just mention it here. Per entry locks help reduce contention when you have lots of multiple accessors. Using rwlocks help reduce contention when you have a read-mostly entry that is only occasionally changed. But every lock and every unlock (just like every atomic access) still requires a locked memory cycle. That means every lock acquire and every lock release requires a synchronization event between all CPUs. Using per-entry locks means that every entry triggers two of those synchronization events. On modern Intel CPUs, they cost about 32 cycles per event. If you are going to do something, and it can't be done without a lock, then grab a single lock and do it as fast as you can. Only in rare cases would you want per-entry locks. >>> >>> I agree that every rwlock costs us locked access. However, lets look at >>> the common scenario here. >> >> No, let's not. Especially if your "common scenario" causes you to >> ignore pathological bad behavior. Optimizing for the common case is not >> an excuse to enable pathological behavior. >> >>> I think that in production stable systems, the >>> IPs our rdma-cm stack uses (and thus GIDs) should be pretty stable. >> >> Probably true. >> >>> Thus, most accesses should be read calls. >> >> Probably true. >> >>> That's why IMHO read-write >>> access makes sense here. >> >> Probably true. >> >>> Regarding single-lock vs per entry lock, it really depends on common an >>> entry could be updated while another entry is being used by an >>> application. >> >> No, it depends on more than that. In addition to the frequency of >> updates versus lookups, there is also the issue of lookup cost and >> update cost. Using per entry locks makes both the lookup cost and the >> update cost of anything other than the first gid in the table grow >> exponentially. A good way to demonstrate this would be to create 100 >> vlans on a RoCE device, then run the cmtime test between two hosts using >> the 100th vlan IP at each end. Using per-entry locks, the performance >> of this test relative to using the first IP on the device will be >> pathologically bad. >> > > That's correct, we"waste" #entries * . But we > have to note this is mainly during the connecting part. Yes, and I spent a *month* of my life tracking down why a customer's real world application was failing in production (but not in their staging environment) and it all boiled down to the fact that when their app had to start up more than about 700-900 connections at app startup and their setup included dual port controllers with both ports in use, and some of those connections had to look up GIDs on the second port, the abysmal per-gid lookup rate meant that the kernel had a hard ceiling of how many GIDs it could look up per second. Once that was hit, the app went into a permanent cycle of dropping failed connections and re-opening new ones and it would simply never catch up. So it concerns me greatly to hear you say "But we have to note this is mainly during the connecting part", implying that some wasted cycles on connection startup are OK. They aren't. Not ever. Production systems fall over dead precisely because of things like this. >> 3) Consider optimizing the code by requiring that any time a GID is >> added, it must take the first empty spot, and any time one is deleted, >> any valid entries after it must be moved up to fill the empty spot, then >> optimize find_gid to quit once it finds an empty slot because it knows >> the rest of the table is empty. Given that mlx4 keeps a huge 128 table >> array, this can be very helpful. > > I don't think we want to change existing GID entries. Entities sometimes > want to refer to GIDs via GID indices. Changing the GIDs order under > their feet could be problematic. I had a feeling this one might be an issue. There is, however, a solution that would resolve the problem. The issue here is that you have an array of GIDs that you want to be able to both scan efficiently and access directly. The array is great for direct access, but the scan efficiently part is a total looser because you scan so many empty entries most of the time. You could change it so that a gid port table is actually a combination of 1) an array of gid_port_table_element pointers and 2) a linked list of gid_port_table_element structs. Then you only allocate a struct when needed, link it into the list, assign it an index, set the pointer in the array at point index to point to your struct, and that way you can still direct access the elements, but can also
Re: 4.4's rdma plate
On 10/20/2015 12:25 PM, Or Gerlitz wrote: > On Wed, Oct 14, 2015 at 3:56 PM, Doug Ledfordwrote: >> On 10/14/2015 12:51 AM, Or Gerlitz wrote: >>> On Sun, Oct 11, 2015 at 11:58 PM, Or Gerlitz wrote: Hi Doug, As we're soon on 4.3-rc5, could you please update how things look for 4.4? The mutlicast loopback prevention patches from Eran were intially posted ~two months ago and went through testing by Christoph and Co. >>> [...] > > >>> Hi Doug, > >>> So we're after rc5 and not a single word from you re the plans for the >>> next kernel, and your branch for that purpose contains exactly two >>> patches... I do >>> see few chunks >>> of patches marked as under review in your patchworks, so there might >>> be sign for getting some hope, is that real? > >> Yes Or, I'm working on it. A new 4.3-rc pull request should come out >> today and an update on the for next area probably as well. > > Doug, > > So we're after rc6 and still pretty much nothing in your for-4.4 > branch, are we getting there > (e.g picking something from the backlog such as the self multicast > loopback prevention and the > rdma-cm name space patches) or for 4.4 we'd be coming naked? > > Or. > Another pull request for the final 4.3 fixups will go in today. I prioritized those over the for-next work. Next I'll be pulling simple for-next items, then I'll be reviewing the RoCE GID cadche usage in verbs/cma v2 patchset, then the Add RoCE v2 support v1 patchset, then the New fast registration API v5 set. -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: [PATCH] IB/cma: Use inner P_Key to determine netdev
On Tue, Oct 20, 2015 at 09:45:27AM +0300, Haggai Eran wrote: > On 19/10/2015 21:19, Jason Gunthorpe wrote: > > On Mon, Oct 19, 2015 at 09:09:25PM +0300, Haggai Eran wrote: > >> When discussing the patches to demux ids in rdma_cm instead of ib_cm, it > >> was decided that it is best to use the P_Key value in the packet headers > >> [1]. However, some drivers are currently unable to send correct P_Key in > >> GMP headers. > > > > You should explicitly describe the broken drivers in the commit text. > These are mlx5 and ipath. I'll add them to the commit message. And ipath is actually ipath, the obsolete driver being removed, not qib? (ie we don't care about it?) > The remaining issue is that it doesn't respect the > ib_send_wr.ud.pkey_index field when sending. But this is a very serious bug, to the point the mis-labeled packets may not even be delivered in some cases - you really care about the sub case where mis-labeled packets happen to be deliverable but don't parse right? Well, don't forget to undo this patch when the mlx5 driver is fixed.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH rdma-RC] IB/cm: Fix sleeping while atomic when creating AH from WC
On Tue, Oct 20, 2015 at 03:57:54PM +, Hefty, Sean wrote: > > Today, cm assumes paths are reversible primary_path->reversible = 1. > > I can't quickly find a link, but I believe all CM MADs are > reversible, per the spec. But the Linux CM code doesn't always create the reverse CM MAD from the GMP headers, it sometimes will do it by looking into the data path in the MAD, which means it still could need to sleep to do the MAC resolution. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.4's rdma plate
On Tue, Oct 20, 2015 at 7:52 PM, Jason Gunthorpe wrote: > On Tue, Oct 20, 2015 at 07:42:25PM +0300, Or Gerlitz wrote: >> As for the RoCE patches, you mentioned reviewing them, this is going to >> take time, which makes sense. However, the self-multicast loopback prevention >> patches are > > My issue with the latest round of UAPI changes is I don't see any > Reviewed-By tag from someone at Mellanox whom I think understands our > UAPI. > > UAPI changes are *NOT* something first-time list contributors should > be doing unassisted. Jason, your point makes sense, when we did the internal review, I have asked that someone with the relevant knowledge will review this part. As I wrote below, we should encourage people to take part in upstream development, and making such comment only after the patch sits on the list whole two months (V1 was very minor change, V2 was re-spin) isn't going in the encouragement direction. There's one patch in the series that adds UAPI -- it's 1/7 "IB/core: Extend ib_uverbs_create_qp" https://patchwork.kernel.org/patch/7405241/ I will try to get Reviewed-by: from Haggai or Matan tomorrow, can you also take a look today, so we don't miss 4.4 just for that? Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH rdma-RC] IB/cm: Fix rb-tree duplicate free and use-after-free
On 10/15/2015 11:15 AM, Matan Barak wrote: > > > On 10/12/2015 7:37 PM, Hefty, Sean wrote: >>> ib_send_cm_sidr_rep could sometimes erase the node from the sidr >>> (depending on errors in the process). Since ib_send_cm_sidr_rep is >>> called both from cm_sidr_req_handler and cm_destroy_id, cm_id_priv >> >> This should clarify that it is the app calling from the callback, and >> not a direct call from the cm_sidr_req_handler. >> > > Consider the following error flows: > > Double free: > cm_sidr_req_handler:3156->cm_reject_sidr_req:663->ib_send_cm_sidr_rep:3233->erase_rb > > cm_sidr_req_handler:3173->ib_destroy_cm_id->cm_destroy_id:846->ib_send_cm_sidr_rep:3233->erase_rb > > > RB contains free node: > cm_sidr_req_handler:3156->cm_reject_sidr_req:663->ib_send_cm_sidr_rep->returns > error(for example cm_alloc_msg,3219) > > cm_sidr_req_handler:3173->ib_destroy_cm_id->cm_destroy_id:846->cm_reject_sidr_req->cm_reject_sidr_req:663->returns > > error(for example cm_alloc_msg,3219)->RB wasn't erased but memory is > freed :910 kfree(cm_id_priv) > > >>> could be either erased from the rb_tree twice or not erased at all. >> >> In an error case, I can see why it would be left in the rbtree, but I >> don't see how it can be removed twice. >> >> >>> Fixing that by making sure it's erased only once before freeing >>> cm_id_priv. >>> >>> Fixes: a977049dacde ('[PATCH] IB: Add the kernel CM implementation') >>> Signed-off-by: Doron Tsur>>> Signed-off-by: Matan Barak >>> --- >>> >>> Hi Doug, >>> This patch fixes a bug in the CM. In some flow, rb-tree could be >>> freed twice or used after it was freed. This bug was picked by >>> our regression tests and this fix was verified. >>> >>> Thanks, >>> Doron and Matan >>> >>> drivers/infiniband/core/cm.c | 10 +- >>> 1 file changed, 9 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c >>> index f5cf1c4..56ff0f3 100644 >>> --- a/drivers/infiniband/core/cm.c >>> +++ b/drivers/infiniband/core/cm.c >>> @@ -844,6 +844,11 @@ retest: >>> case IB_CM_SIDR_REQ_RCVD: >>> spin_unlock_irq(_id_priv->lock); >>> cm_reject_sidr_req(cm_id_priv, IB_SIDR_REJECT); >>> +spin_lock_irq(); >>> +if (!RB_EMPTY_NODE(_id_priv->sidr_id_node)) >>> +rb_erase(_id_priv->sidr_id_node, >>> + _sidr_table); >>> +spin_unlock_irq(); > > This change seeks to remove the about to be freed node from the rb tree, > while verifying it has not been freed already > >> >> We should be able to use a return value from cm_reject_sidr_req() -- >> passed through from ib_send_cm_sidr_rep() to determine if the id was >> removed from the tree. >> > > But this won't protect from double free in ib_send_cm_sidr_rep, unless > we pass this parameter to the cm destroy function, but this alternative > is cumbersome. > >>> break; >>> case IB_CM_REQ_SENT: >>> case IB_CM_MRA_REQ_RCVD: >>> @@ -3210,7 +3215,10 @@ int ib_send_cm_sidr_rep(struct ib_cm_id *cm_id, >>> spin_unlock_irqrestore(_id_priv->lock, flags); >>> >>> spin_lock_irqsave(, flags); >>> -rb_erase(_id_priv->sidr_id_node, _sidr_table); >>> +if (!RB_EMPTY_NODE(_id_priv->sidr_id_node)) { >>> +rb_erase(_id_priv->sidr_id_node, _sidr_table); >>> +RB_CLEAR_NODE(_id_priv->sidr_id_node); >>> +} > > This change protects against double free > >>> spin_unlock_irqrestore(, flags); >> >> Something is very wrong in this function if the id is not in the tree >> at this point. >> > > We agree, but there's an error flow that triggers this behavior. Sean, I need to close on this patch. What is your position after Matan's explanation? -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: [PATCH v1] IB/cma: Use inner P_Key to determine netdev
On 10/20/2015 04:35 AM, Haggai Eran wrote: > On 20/10/2015 10:20, Doug Ledford wrote: >> On 10/20/2015 02:53 AM, Haggai Eran wrote: >>> When discussing the patches to demux ids in rdma_cm instead of ib_cm, it >>> was decided that it is best to use the P_Key value in the packet headers. >>> However, the mlx5 and ipath drivers are currently unable to send correct >>> P_Key values in GMP headers. They always send using a single P_Key that is >>> set during the GSI QP initialization. >>> >>> Change the rdma_cm code to look at the P_Key value that is part of the >>> packet payload as a workaround. Once the drivers are fixed this patch can >>> be reverted. >>> >>> Fixes: 4c21b5bcef73 ("IB/cma: Add net_dev and private data checks to >>> RDMA CM") >>> Signed-off-by: Haggai Eran>>> --- >>> Changes from v0: >>> - improve commit message >>> >>> drivers/infiniband/core/cma.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c >>> index 59a2dafc8c57..e8324543e085 100644 >>> --- a/drivers/infiniband/core/cma.c >>> +++ b/drivers/infiniband/core/cma.c >>> @@ -1067,14 +1067,14 @@ static int cma_save_req_info(const struct >>> ib_cm_event *ib_event, >>>sizeof(req->local_gid)); >>> req->has_gid= true; >>> req->service_id = req_param->primary_path->service_id; >>> - req->pkey = req_param->bth_pkey; >>> + req->pkey = be16_to_cpu(req_param->primary_path->pkey); >>> break; >>> case IB_CM_SIDR_REQ_RECEIVED: >>> req->device = sidr_param->listen_id->device; >>> req->port = sidr_param->port; >>> req->has_gid= false; >>> req->service_id = sidr_param->service_id; >>> - req->pkey = sidr_param->bth_pkey; >>> + req->pkey = sidr_param->pkey; >>> break; >>> default: >>> return -EINVAL; >>> >> >> And, to be clear, you are looking for this to be part of 4.3-rc updates, >> yes? > > Yes, the issue was introduced in 4.3 in my cma demux patch. Applied, thanks. -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: [PATCH rdma-cm] IB/core: Fix use after free of ifa
On 10/15/2015 08:01 AM, Matan Barak wrote: > When using ifup/ifdown while executing enum_netdev_ipv4_ips, > ifa could become invalid and cause use after free error. > Fixing it by protecting with RCU lock. > > Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management') > Signed-off-by: Matan BarakThis is in my tree for -rc. Thanks. > --- > > Hi Doug, > > This patch fixes a bug in RoCE GID table implementation. Under stress > conditions > where ifup/ifdown are used, the ifa pointer could become invalid. Using a > RCU lock in order to avoid freeing the ifa node (as done in other inet > functions > (for example, inet_addr_onlink). > > Our QA team verified that this patch fixes this issue. > > Thanks, > Matan > > drivers/infiniband/core/roce_gid_mgmt.c | 35 > + > 1 file changed, 27 insertions(+), 8 deletions(-) > > diff --git a/drivers/infiniband/core/roce_gid_mgmt.c > b/drivers/infiniband/core/roce_gid_mgmt.c > index 6b24cba..178f984 100644 > --- a/drivers/infiniband/core/roce_gid_mgmt.c > +++ b/drivers/infiniband/core/roce_gid_mgmt.c > @@ -250,25 +250,44 @@ static void enum_netdev_ipv4_ips(struct ib_device > *ib_dev, >u8 port, struct net_device *ndev) > { > struct in_device *in_dev; > + struct sin_list { > + struct list_headlist; > + struct sockaddr_in ip; > + }; > + struct sin_list *sin_iter; > + struct sin_list *sin_temp; > > + LIST_HEAD(sin_list); > if (ndev->reg_state >= NETREG_UNREGISTERING) > return; > > - in_dev = in_dev_get(ndev); > - if (!in_dev) > + rcu_read_lock(); > + in_dev = __in_dev_get_rcu(ndev); > + if (!in_dev) { > + rcu_read_unlock(); > return; > + } > > for_ifa(in_dev) { > - struct sockaddr_in ip; > + struct sin_list *entry = kzalloc(sizeof(*entry), GFP_ATOMIC); > > - ip.sin_family = AF_INET; > - ip.sin_addr.s_addr = ifa->ifa_address; > - update_gid_ip(GID_ADD, ib_dev, port, ndev, > - (struct sockaddr *)); > + if (!entry) { > + pr_warn("roce_gid_mgmt: couldn't allocate entry for > IPv4 update\n"); > + continue; > + } > + entry->ip.sin_family = AF_INET; > + entry->ip.sin_addr.s_addr = ifa->ifa_address; > + list_add_tail(>list, _list); > } > endfor_ifa(in_dev); > + rcu_read_unlock(); > > - in_dev_put(in_dev); > + list_for_each_entry_safe(sin_iter, sin_temp, _list, list) { > + update_gid_ip(GID_ADD, ib_dev, port, ndev, > + (struct sockaddr *)_iter->ip); > + list_del(_iter->list); > + kfree(sin_iter); > + } > } > > static void enum_netdev_ipv6_ips(struct ib_device *ib_dev, > -- Doug Ledford GPG KeyID: 0E572FDD signature.asc Description: OpenPGP digital signature
Re: merge struct ib_device_attr into struct ib_device V2
On 10/20/2015 5:08 PM, Or Gerlitz wrote: On Tue, Oct 20, 2015 at 4:08 PM, Sagi Grimbergwrote: I think this is very useful to have and I'd love having it in 4.4, does anyone have any other comments on this patch? Were we ever presented with performance gains achieved using the patch? Can you explain what you mean by performance gains? My understanding is that this patch is not performance critical. It just replaces each and every ULP device attributes caching. oops, sorry, I was referring to the patch that deals with re-structuring of struct ib_wc, so... (1) did we even got performance gains achieved using **that** patch? I don't know if we'd see noticeable performance gains from **that** patch either.. The benefit is mostly stack relief as usually ULPs keep the work request structure on the stack. (2) re this one, as I wrote in the past, I am in favor of simple caching of struct ib_device_attr on struct ib_device (best with pointer) and not adding 333 fields to struct ib_device, I don't see the benefit from this patch. Christoph commented on that: "I'm strongly against this. As the reviews show the move is highly confusing. The attributes don't change and there is no need to 'cache' or 'query' them. Just merge them into the device, follow years of experience with how every other Linux subsystem does it and be done with it." If the attributes are automatically a part of the ib_device then it does seem a bit redundant keeping the structure around... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.4's rdma plate
On Wed, Oct 21, 2015 at 12:00:54AM +0300, Or Gerlitz wrote: > As I wrote below, we should encourage people to take part in upstream > development, and making such comment only after the patch sits on the > list whole two months (V1 was very minor change, V2 was re-spin) isn't > going in the encouragement direction. I would say there are at least 6 people who could meaningfully put a reviewed-by tag on that UAPI patch, why do you think none of them have done so in two months? What is needed to get more engagement in upstream? > I will try to get Reviewed-by: from Haggai or Matan tomorrow, can you > also take a look today, so we don't miss 4.4 just for that? Unlikely, I am very busy with SC|15. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: merge struct ib_device_attr into struct ib_device V2
I think this is very useful to have and I'd love having it in 4.4, does anyone have any other comments on this patch? Were we ever presented with performance gains achieved using the patch? Hi Or, Can you explain what you mean by performance gains? My understanding is that this patch is not performance critical. It just replaces each and every ULP device attributes caching. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: merge struct ib_device_attr into struct ib_device V2
On Tue, Oct 20, 2015 at 04:08:54PM +0300, Sagi Grimberg wrote: > Can you explain what you mean by performance gains? My understanding is > that this patch is not performance critical. It just replaces each and > every ULP device attributes caching. Exactly. The only 'performance' we care about is that of ULD writers :) -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html