Re: [ofa-general] Re: [PATCH] IB/ipoib: IPOIB CM rx use higher order fragments
On Tue, 2007-10-23 at 11:35 -0700, Roland Dreier wrote:

In order to reduce the overhead of iterating the fragments of an SKB in the receive flow, we use fragments of higher order and thus reduce the number of iterations. This patch seems to improve receive throughput of small UDP messages.

I don't think we want to do this -- it may be good for benchmarks but it will hurt reliability, since systems often have highly fragmented memory so higher-order atomic allocations will fail. - R.

Other drivers do similar allocations. For example, e1000 makes such large allocations when working with jumbo frames. Also, I did not notice any allocation failures even though my system was fairly active, but I can monitor for such failures.

e1000_main.c line 3549:
else if (max_frame <= E1000_RXBUFFER_16384)
        adapter->rx_buffer_len = E1000_RXBUFFER_16384;

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
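The trade-off argued in this exchange can be put in numbers. The following is a minimal sketch, assuming 4 KB pages; `sketch_get_order` imitates what the kernel's `get_order()` computes, and both function names and the `SKETCH_PAGE_SIZE` constant are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for illustration; kernel code would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Smallest order such that (SKETCH_PAGE_SIZE << order) >= size.
 * An order-N allocation needs 2^N physically contiguous pages.
 */
static unsigned int sketch_get_order(size_t size)
{
    unsigned int order = 0;
    size_t span = SKETCH_PAGE_SIZE;

    assert(size > 0);
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Fragments needed to cover len bytes when each fragment is 2^order pages. */
static unsigned int sketch_frags_needed(size_t len, unsigned int order)
{
    size_t frag = (size_t)SKETCH_PAGE_SIZE << order;

    return (unsigned int)((len + frag - 1) / frag);
}
```

Under these assumptions a 16384-byte buffer is an order-2 allocation (four contiguous pages), and an illustrative 64 KB receive buffer takes 16 order-0 fragments but only 4 order-2 fragments. Fewer fragments means fewer iterations in the receive path, at the cost of needing contiguous pages, which is the reliability concern raised above.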
Re: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU
Sean Hefty wrote:

That is what I've been trying to push. Both MVAPICH2 and OMPI have been open to adjusting their transports to adhere to this requirement.

I wouldn't mind implementing something to enforce this in the IWCM or the iWARP drivers IF there was a clean way to do it. So far there hasn't been a clean way proposed.

Why can't either uDAPL or iW CM always do a send from the active to passive side that gets stripped off? From the active side, the first send is always posted before any user sends, and if necessary, a user send can be queued by software to avoid a QP/CQ overrun. The completion can simply be eaten by software. On the passive side, you have a similar process for receiving the data. (Yes this adds wire protocol, which requires both sides to support it.) - Sean

I said clean way to do it. ;-)

Yes, this is the only under-the-covers solution I know of that will work with existing HW. However, I don't think it can be done totally within the rdmacm or iwcm. I think it involves the providers' poll functions, which would have to deal with the send completion/error and the passive side recv completion/error.

Steve.
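To illustrate what "eating" the completion in a poll path might involve, here is a rough sketch using mock types. Everything in it is hypothetical: `struct sketch_wc` is a simplified stand-in rather than the real verbs work-completion structure, and the reserved `SKETCH_HIDDEN_WR_ID` value is an assumption made only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified completion record (not the verbs structure). */
struct sketch_wc {
    uint64_t wr_id;
    int      status;   /* 0 means success */
};

/* Hypothetical reserved wr_id that marks the hidden first send. */
#define SKETCH_HIDDEN_WR_ID UINT64_MAX

/*
 * Copy completions from raw[] to out[], dropping the one that belongs to
 * the hidden first send so the application never sees it. The status of
 * the hidden send is reported through hidden_status so a lower layer can
 * still react to an error. Returns the number of completions delivered.
 */
static int sketch_filter_completions(const struct sketch_wc *raw, int n_raw,
                                     struct sketch_wc *out, int *hidden_status)
{
    int n_out = 0;

    assert(n_raw >= 0);
    for (int i = 0; i < n_raw; i++) {
        if (raw[i].wr_id == SKETCH_HIDDEN_WR_ID) {
            if (hidden_status)
                *hidden_status = raw[i].status;
            continue;   /* swallowed, not shown to the application */
        }
        out[n_out++] = raw[i];
    }
    return n_out;
}
```

The sketch also shows why the approach is hard to keep clean: the filtering has to sit in the path that every completion passes through, which is the point made above about the providers' poll functions.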
Re: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU
Felix Marti wrote:

-----Original Message-----
From: Tom Tucker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 23, 2007 9:32 PM
To: Felix Marti
Cc: Kanevsky, Arkady; Glenn Grundstrom; Sean Hefty; Steve Wise; Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General
Subject: Re: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU

Felix Marti wrote:

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:general- [EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady
Sent: Tuesday, October 23, 2007 6:26 PM
To: Glenn Grundstrom; Sean Hefty; Steve Wise
Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General
Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU

This is still a protocol and should be defined by IETF not OFA. But if we get agreement from all iWARP vendors this will be a good step.

[felix] This will not work with a Chelsio RNIC which follows the IETF specification by a) not issuing a 0B RDMA Write to the wire and b) silently consuming an incoming 0B write. Therefore 0B RDMA Writes cannot be 'abused' for such a synchronization mechanism. I believe that the mentioned apps adhering to the iWarp requirement do a 'send' from the active side and only have the passive side issue RDMA ops once the incoming send has been received. I would guess that following a similar model is the best way to go and supported by all iWarp vendors implementing the IETF spec.

IMO, the iWARP vendors _must_ get together and work on MPA '2'. Standardizing FPDU 'abuse' might be a good place to start, but it needs to be fixed to support peer-to-peer going forward. In the meantime, imperfectly hiding the issue in the firmware, uDAPL, the iWARP CM or anywhere else except the application seems to me to be the only customer-friendly solution.

[felix] While I'm not against trying to hide the connection migration details somewhere below the ULP, I'm not convinced that the issue is as severe as you make it out to be and I would not press to have the issue resolved in a manner that requires a new MPA version. In fact, the different rdma transports (and maybe even different versions of the same transport (in the case of IB)) provide different features and I would assume that ULPs will eventually code to these features and must thus be aware of the underlying transport protocol. In that bigger picture, the connection migration issue at hand seems fairly trivial to solve even if it requires a ULP change...

I didn't make an argument about severity. Qualifying the severity is in the customer's purview. I'm simply pointing out the following: a) the perspective that the restriction is trivial is how we got here, b) making the app change is putting a decision in the customer's hands that IMO an iWARP vendor would rather they didn't have to make: "Do I or don't I support iWARP?", and c) you have the power to hide this behavior for most cases.

Finally, I believe RFC means Request for Comment. Well here's one last comment --

Add an FPDU message at the end of MPA exchange and fix the problem in the protocol. If we cannot get agreement on it on the reflector, let's do it at the SC'07 OFA dev. conference.

Arkady Kanevsky email: [EMAIL PROTECTED]
Network Appliance Inc. phone: 781-768-5395
1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
Waltham, MA 02451 central phone: 781-768-5300

-----Original Message-----
From: Glenn Grundstrom [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 23, 2007 9:02 PM
To: Sean Hefty; Steve Wise
Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General
Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU

That is what I've been trying to push. Both MVAPICH2 and OMPI have been open to adjusting their transports to adhere to this requirement.
I wouldn't mind implementing something to enforce this in the IWCM or the iWARP drivers IF there was a clean way to do it. So far there hasn't been a clean way proposed. Why can't either uDAPL or iW CM always do a send from the active to passive side that gets stripped off? From the active side, the first send is always posted before any user sends, and if necessary, a user send can be queued by software to avoid a QP/CQ overrun. The completion can simply be eaten by software. On the passive side, you have a similar process for receiving the data. This is similar to an option in the NetEffect driver. A zero byte RDMA write is sent from the active side and accounted for on the passive side. This can
RE: [ofa-general] [RFP] support for iWARP requirement- activeconnectside MUST send first FPDU
The bottom line is that we need a single solution which works for all vendors. This issue causes interoperability problems, so customers will stay on the sidelines until these types of issues are resolved. Hiding behind protocol holes is not going to help adoption.

Will sending a 0-size send message from the initiator side work? Can the IWCM on the responder side squeeze in a 0-size buffer to receive this message and swallow it? I hope that no check would then need to be done on all completions. Will it work for both interrupt and polling mode?

I still believe that it will be simpler to add it to the MPA protocol.

Thanks,
Arkady Kanevsky email: [EMAIL PROTECTED]
Network Appliance Inc. phone: 781-768-5395
1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
Waltham, MA 02451 central phone: 781-768-5300

-----Original Message-----
From: Tom Tucker [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 24, 2007 10:40 AM
To: Felix Marti
Cc: Kanevsky, Arkady; Roland Dreier; Glenn Grundstrom; OpenFabrics General; [EMAIL PROTECTED]
Subject: Re: [ofa-general] [RFP] support for iWARP requirement- activeconnectside MUST send first FPDU

Felix Marti wrote:

-----Original Message-----
From: Tom Tucker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 23, 2007 9:32 PM
To: Felix Marti
Cc: Kanevsky, Arkady; Glenn Grundstrom; Sean Hefty; Steve Wise; Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General
Subject: Re: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU

Felix Marti wrote:

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:general- [EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady
Sent: Tuesday, October 23, 2007 6:26 PM
To: Glenn Grundstrom; Sean Hefty; Steve Wise
Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General
Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU

This is still a protocol and should be defined by IETF not OFA. But if we get agreement from all iWARP vendors this will be a good step.
[ofa-general] [PATCH 4 of 5] mlx4: limit qp resources accepted for create_qp per query_device values and headroom requirements
mlx4: limit allowable qp create resources to avoid create_qp failures due to added headroom wqes.

In addition, guarantee that qp capabilities following qp creation always lie within limits given by ib_query_device. (for userspace, we perform this limiting in libmlx4, so as not to break the ABI).

Signed-off-by: Jack Morgenstein [EMAIL PROTECTED]

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index d8287d9..d40ec2f 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -109,7 +109,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
         props->max_mr_size = ~0ull;
         props->page_size_cap = dev->dev->caps.page_size_cap;
         props->max_qp = dev->dev->caps.num_qps - dev->dev->caps.reserved_qps;
-        props->max_qp_wr = dev->dev->caps.max_wqes;
+        props->max_qp_wr = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE;
         props->max_sge = min(dev->dev->caps.max_sq_sg,
                              dev->dev->caps.max_rq_sg);
         props->max_cq = dev->dev->caps.num_cqs - dev->dev->caps.reserved_cqs;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 2869765..56305e2 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -47,6 +47,13 @@ enum {
         MLX4_IB_DB_PER_PAGE = PAGE_SIZE / 4
 };
 
+enum {
+        MLX4_IB_SQ_MIN_WQE_SHIFT = 6
+};
+
+#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
+#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))
+
 struct mlx4_ib_db_pgdir;
 struct mlx4_ib_user_db_page;
 
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 6b33224..d6c1600 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -212,8 +212,9 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
                        int is_user, int has_srq, struct mlx4_ib_qp *qp)
 {
         /* Sanity check RQ size before proceeding */
-        if (cap->max_recv_wr > dev->dev->caps.max_wqes ||
-            cap->max_recv_sge > dev->dev->caps.max_rq_sg)
+        if (cap->max_recv_wr > dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE ||
+            cap->max_recv_sge >
+                min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg))
                 return -EINVAL;
 
         if (has_srq) {
@@ -232,8 +233,19 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
                 qp->rq.wqe_shift = ilog2(qp->rq.max_gs * sizeof (struct mlx4_wqe_data_seg));
         }
 
-        cap->max_recv_wr = qp->rq.max_post = qp->rq.wqe_cnt;
-        cap->max_recv_sge = qp->rq.max_gs;
+        /* leave userspace return values as they were, so as not to break ABI */
+        if (is_user) {
+                cap->max_recv_wr = qp->rq.max_post = qp->rq.wqe_cnt;
+                cap->max_recv_sge = qp->rq.max_gs;
+        } else {
+                cap->max_recv_wr = qp->rq.max_post =
+                        min(dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE, qp->rq.wqe_cnt);
+                cap->max_recv_sge = min(qp->rq.max_gs,
+                                        min(dev->dev->caps.max_sq_sg,
+                                        dev->dev->caps.max_rq_sg));
+        }
+        /* We don't support inline sends for kernel QPs (yet) */
+
 
         return 0;
 }
@@ -242,8 +254,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
                               enum ib_qp_type type, struct mlx4_ib_qp *qp)
 {
         /* Sanity check SQ size before proceeding */
-        if (cap->max_send_wr > dev->dev->caps.max_wqes ||
-            cap->max_send_sge > dev->dev->caps.max_sq_sg ||
+        if (cap->max_send_wr > (dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE) ||
+            cap->max_send_sge >
+                min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg) ||
             cap->max_inline_data + send_wqe_overhead(type) +
             sizeof (struct mlx4_wqe_inline_seg) > dev->dev->caps.max_sq_desc_sz)
                 return -EINVAL;
@@ -261,6 +274,7 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
                                                         cap->max_inline_data +
                                                         sizeof (struct mlx4_wqe_inline_seg)) +
                                                     send_wqe_overhead(type)));
+        qp->sq.wqe_shift = max(MLX4_IB_SQ_MIN_WQE_SHIFT, qp->sq.wqe_shift);
         qp->sq.max_gs = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) /
                 sizeof (struct mlx4_wqe_data_seg);
 
@@ -268,7 +282,7 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
          * We need to leave 2 KB + 1 WQE of headroom in the SQ to
          * allow HW to prefetch.
          */
-        qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1;
+        qp->sq_spare_wqes =
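The headroom arithmetic in this patch can be checked in isolation. The sketch below assumes the headroom macro is a right shift, `(2048 >> (shift)) + 1`, matching the "2 KB + 1 WQE" comment in qp.c; the macro names are copied from the patch, while `sketch_max_qp_wr`, `sketch_send_wr_too_large` and any device limit passed to them are hypothetical.

```c
#include <assert.h>

/* Names and values as they appear in the patch to mlx4_ib.h. */
enum { MLX4_IB_SQ_MIN_WQE_SHIFT = 6 };
#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))

/* Hypothetical helper: the max_qp_wr value query_device would report. */
static int sketch_max_qp_wr(int hw_max_wqes)
{
    assert(hw_max_wqes > MLX4_IB_SQ_MAX_SPARE);
    return hw_max_wqes - MLX4_IB_SQ_MAX_SPARE;
}

/*
 * Hypothetical helper mirroring the patched sanity check: nonzero when a
 * requested send queue depth would be rejected with -EINVAL.
 */
static int sketch_send_wr_too_large(int requested, int hw_max_wqes)
{
    return requested > hw_max_wqes - MLX4_IB_SQ_MAX_SPARE;
}
```

With a minimum WQE shift of 6 (64-byte WQEs) the worst-case spare count is (2048 >> 6) + 1 = 33. For an illustrative device limit of 16384 WQEs, the advertised max_qp_wr becomes 16351, and a request for the full 16384 is rejected up front instead of failing later when the headroom is added.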
[ofa-general] [PATCH 5 of 5] mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ
mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ.

The extra CQE can cause a huge waste of memory if requesting a power-of-2 number of CQEs. Leave create_cq for userspace CQs as before, to avoid breaking ABI. (Handle this in separate libmlx4 patch)

Signed-off-by: Jack Morgenstein [EMAIL PROTECTED]

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 8bf44da..8a1ccc4 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -108,7 +108,13 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
         if (!cq)
                 return ERR_PTR(-ENOMEM);
 
-        entries = roundup_pow_of_two(entries + 1);
+        /* eliminate using extra CQE (for kernel space).
+         * For userspace, do in libmlx4, so that don't break ABI.
+         */
+        if (context)
+                entries = roundup_pow_of_two(entries + 1);
+        else
+                entries = roundup_pow_of_two(entries);
         cq->ibcq.cqe = entries - 1;
         buf_size = entries * sizeof (struct mlx4_cqe);
         spin_lock_init(&cq->lock);
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 89b3f0b..d34b61b 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -141,12 +141,7 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
         dev->caps.max_sq_desc_sz = dev_cap->max_sq_desc_sz;
         dev->caps.max_rq_desc_sz = dev_cap->max_rq_desc_sz;
         dev->caps.num_qp_per_mgm = MLX4_QP_PER_MGM;
-        /*
-         * Subtract 1 from the limit because we need to allocate a
-         * spare CQE so the HCA HW can tell the difference between an
-         * empty CQ and a full CQ.
-         */
-        dev->caps.max_cqes = dev_cap->max_cq_sz - 1;
+        dev->caps.max_cqes = dev_cap->max_cq_sz;
         dev->caps.reserved_cqs = dev_cap->reserved_cqs;
         dev->caps.reserved_eqs = dev_cap->reserved_eqs;
         dev->caps.reserved_mtts = DIV_ROUND_UP(dev_cap->reserved_mtts,
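The memory waste the commit message describes follows directly from the rounding. This is a minimal userspace sketch: `sketch_roundup_pow_of_two` is a hypothetical stand-in for the kernel's `roundup_pow_of_two()`, and the two wrapper functions only restate the two branches of the patched cq.c code.

```c
#include <assert.h>
#include <limits.h>

/* Userspace stand-in for the kernel's roundup_pow_of_two(). */
static unsigned long sketch_roundup_pow_of_two(unsigned long n)
{
    unsigned long p = 1;

    /* Reject 0 and values whose next power of two would overflow. */
    assert(n > 0 && n <= (ULONG_MAX >> 1) + 1);
    while (p < n)
        p <<= 1;
    return p;
}

/* Buffer entries before the patch, and still for userspace CQs. */
static unsigned long sketch_entries_old(unsigned long requested)
{
    return sketch_roundup_pow_of_two(requested + 1);
}

/* Buffer entries for kernel CQs after the patch. */
static unsigned long sketch_entries_new(unsigned long requested)
{
    return sketch_roundup_pow_of_two(requested);
}
```

A request for 1024 CQEs illustrates the difference: the old path allocates 2048 entries, double what was asked for, while the new kernel path allocates 1024. For a request that is not a power of two, such as 1000, both paths allocate 1024 and nothing changes.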
Re: [ofa-general] makefile problem in using librdmacm and libverbs
INC_VERBS = ${TOP_DIR}/libibverbs/include/
INC_RDMACM = ${TOP_DIR}/librdmacm/include/

Do these files match what's in /usr/local/include/infiniband and /usr/local/include/rdma? (Or the equivalent install directory.) You could try picking up the installed include files, rather than going directly into the source directory.

- Sean
Re: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU
Sean Hefty wrote:

I said clean way to do it. ;-)

I'm referring to an rdma cm connection protocol for iWarp. We have one for IB. I mentioned uDAPL as a possibility because it abstracts the transport, QP, CQ, etc. anyway, and one could argue that the uDAPL iWarp provider should take necessary steps to support the uDAPL API.

There is one OpenFabrics uDAPL provider for all OFA devices. Sure, we could add some logic in the DAPL abstraction layer to check for iWARP devices and possibly hide the restriction. Say we do that, what about the applications that sit directly on top of OFA verbs and rdma_cm? Say we add some iWARP abstraction at this layer, what about the WinOF stack?

I don't know that there's a need to change the iWarp architecture.

If you think customers are willing to work around this restriction then by all means leave the architecture alone and simply document the rdma APIs. I would think that this puts iWARP vendors at a disadvantage. I am guessing that energy and time spent changing the iWARP protocol specification is a better use of everyone's time than hacking every iWARP stack out there to hide the restriction.

-arlin