Re: [ofa-general] Re: [PATCH] IB/ipoib: IPOIB CM rx use higher order fragments

2007-10-24 Thread Eli Cohen
On Tue, 2007-10-23 at 11:35 -0700, Roland Dreier wrote:
  In order to reduce the overhead of iterating the fragments of an
   SKB in the receive flow, we use fragments of higher order and thus
   reduce the number of iterations. This patch seems to improve receive
   throughput of small UDP messages.
 
 I don't think we want to do this -- it may be good for benchmarks but
 it will hurt reliability, since systems often have highly fragmented
 memory so higher-order atomic allocations will fail.
 
  - R.

Other drivers do similar allocations. For example, e1000 makes such large
allocations when working with jumbo frames. Also, I did not notice any
allocation failures even though my system was fairly active, but I can
monitor for such failures.

e1000_main.c line 3549:

	else if (max_frame <= E1000_RXBUFFER_16384)
		adapter->rx_buffer_len = E1000_RXBUFFER_16384;
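
To make the allocation-order concern concrete, here is a minimal user-space sketch of how a buffer size maps to a page-allocation order. It is not the kernel's get_order() implementation; the helper name order_for_size and the 4 KiB page size used in the note below are assumptions for illustration, and per-buffer overhead such as skb headers is ignored.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest order such that (page_size << order) >= size, i.e. the
 * buffer needs 2^order physically contiguous pages.
 * Hypothetical helper for illustration only.
 */
static unsigned int order_for_size(size_t size, size_t page_size)
{
	unsigned int order = 0;

	assert(page_size != 0);
	while ((page_size << order) < size)
		order++;
	return order;
}
```

With 4 KiB pages, order_for_size(16384, 4096) is 2, so a 16 KiB receive buffer needs four contiguous pages. The higher the order, the more likely an atomic allocation is to fail on a system with fragmented memory, which is the reliability concern raised above.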

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU

2007-10-24 Thread Steve Wise



Sean Hefty wrote:
That is what I've been trying to push.  Both MVAPICH2 and OMPI have 
been open to adjusting their transports to adhere to this requirement.


I wouldn't mind implementing something to enforce this in the IWCM or 
the iWARP drivers IF there was a clean way to do it.  So far there 
hasn't been a clean way proposed.


Why can't either uDAPL or iW CM always do a send from the active to 
passive side that gets stripped off?  From the active side, the first 
send is always posted before any user sends, and if necessary, a user 
send can be queued by software to avoid a QP/CQ overrun.  The completion 
can simply be eaten by software.  On the passive side, you have a 
similar process for receiving the data.


(Yes this adds wire protocol, which requires both sides to support it.)

- Sean


I said clean way to do it. ;-)

Yes, this is the only under-the-covers solution I know of that will 
work with existing HW.   However, I don't think it can be done totally 
within the rdmacm or iwcm.  I think it also involves the provider's poll 
function, which has to deal with the send completion/error on the active 
side and the recv completion/error on the passive side.


Steve.
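
A rough sketch of the "eaten by software" idea discussed above, in plain C with no verbs dependency. It assumes the hidden first send (or the matching receive on the passive side) is posted with a reserved work-request ID, and that the poll path filters its completion out before the consumer sees it. Every name here (struct sketch_wc, SKETCH_HIDDEN_WR_ID, sketch_filter_completions) is hypothetical and is not part of libibverbs, the rdma_cm, or the iwcm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion record; real verbs code would use struct ibv_wc. */
struct sketch_wc {
	unsigned long wr_id;
	int status;		/* 0 means success */
};

/* Hypothetical work-request ID reserved for the hidden first send/recv. */
#define SKETCH_HIDDEN_WR_ID (~0UL)

struct sketch_conn {
	bool hidden_done;	/* hidden completion already consumed */
	int hidden_status;	/* its status, kept for error handling */
};

/*
 * Filter a batch of polled completions in place: drop the hidden
 * completion once, keep everything else in order. Returns the number
 * of completions left for the consumer.
 */
static size_t sketch_filter_completions(struct sketch_conn *conn,
					struct sketch_wc *wc, size_t n)
{
	size_t in, out = 0;

	for (in = 0; in < n; in++) {
		if (!conn->hidden_done &&
		    wc[in].wr_id == SKETCH_HIDDEN_WR_ID) {
			conn->hidden_done = true;
			conn->hidden_status = wc[in].status;
			continue;	/* eaten by software */
		}
		wc[out++] = wc[in];
	}
	return out;
}
```

The sketch also shows why this is hard to keep entirely inside the CM: the filter has to sit in the completion-polling path, and a failed hidden send still has to be reported to someone.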



Re: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU

2007-10-24 Thread Tom Tucker

Felix Marti wrote:
  

-Original Message-
From: Tom Tucker [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 23, 2007 9:32 PM
To: Felix Marti
Cc: Kanevsky, Arkady; Glenn Grundstrom; Sean Hefty; Steve Wise; Roland
Dreier; [EMAIL PROTECTED]; OpenFabrics General
Subject: Re: [ofa-general] [RFP] support for iWARP requirement -
activeconnectside MUST send first FPDU

Felix Marti wrote:


-Original Message-
From: [EMAIL PROTECTED] [mailto:general-
[EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady
Sent: Tuesday, October 23, 2007 6:26 PM
To: Glenn Grundstrom; Sean Hefty; Steve Wise
Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics
General
Subject: RE: [ofa-general] [RFP] support for iWARP requirement -
activeconnectside MUST send first FPDU

This is still a protocol and should be defined by IETF not OFA.
But if we get agreement from all iWARP vendors this will be a good
step.



[felix] This will not work with a Chelsio RNIC which follows the IETF
specification by a) not issuing a 0B RDMA Write to the wire and b)
silently consuming an incoming 0B write. Therefore 0B RDMA Writes cannot
be 'abused' for such a synchronization mechanism. I believe that the
mentioned apps adhering to the iWarp requirement do a 'send' from the
active side and only have the passive side issue RDMA ops once the
incoming send has been received. I would guess that following a similar
model is the best way to go and supported by all iWarp vendors
implementing the IETF spec.


  

IMO, the iWARP vendors _must_ get together and work on MPA '2'.
Standardizing FPDU 'abuse' might be a good place to start, but it needs
to be fixed to support peer-to-peer going forward.

In the mean-time, imperfectly hiding the issue in the Firmware, uDAPL,
the iWARP CM or anywhere else except the application seems to me to be
the only customer friendly solution.



[felix] While I'm not against trying to hide the connection migration
details somewhere below the ULP, I'm not convinced that the issue is as
severe as you make it out to be and I would not press to have the issue
resolved in a manner that requires a new MPA version. In fact, the
different rdma transports (and maybe even different versions of the same
transport (in the case of IB)) provide different features and I would
assume that ULPs will eventually code to these features and must thus be
aware of the underlying transport protocol. In that bigger picture, the
connection migration issue at hand seems fairly trivial to solve even if
it requires a ULP change...
  
I didn't make an argument about severity. Qualifying the severity is in 
the customer's purview. I'm simply pointing out the following: a) the 
perspective that the restriction is trivial is how we got here, b) 
making the app change is putting a decision in the customer's hands that 
IMO an iWARP vendor would rather they didn't have to make: "Do I or don't 
I support iWARP?", and c) you have the power to hide this behavior for 
most cases.


Finally, I believe RFC means Request for Comment. Well here's one last 
comment -- Add an FPDU message at the end of MPA exchange and fix the 
problem in the protocol.


 
  

If we cannot get agreement on it on the reflector, let's do
it at the SC'07 OFA dev. conference.

Arkady Kanevsky                email: [EMAIL PROTECTED]
Network Appliance Inc.         phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.   Fax: 781-895-1195
Waltham, MA 02451              central phone: 781-768-5300





-Original Message-
From: Glenn Grundstrom [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 23, 2007 9:02 PM
To: Sean Hefty; Steve Wise
Cc: Roland Dreier; [EMAIL PROTECTED];
OpenFabrics General
Subject: RE: [ofa-general] [RFP] support for iWARP
requirement - activeconnect side MUST send first FPDU


  

That is what I've been trying to push.  Both MVAPICH2 and OMPI have been
open to adjusting their transports to adhere to this requirement.

I wouldn't mind implementing something to enforce this in the IWCM or
the iWARP drivers IF there was a clean way to do it.  So far there
hasn't been a clean way proposed.

Why can't either uDAPL or iW CM always do a send from the active to
passive side that gets stripped off?  From the active side, the first
send is always posted before any user sends, and if necessary, a user
send can be queued by software to avoid a QP/CQ overrun.  The
completion can simply be eaten by software.  On the passive side, you
have a similar process for receiving the data.

This is similar to an option in the NetEffect driver.  A zero
byte RDMA write is sent from the active side and accounted
for on the passive side.  This can 

RE: [ofa-general] [RFP] support for iWARP requirement- activeconnectside MUST send first FPDU

2007-10-24 Thread Kanevsky, Arkady
The bottom line is that we need a single solution that works for all vendors.
This issue causes interoperability problems,
so customers will stay on the sidelines until these types of issues are
resolved.
Hiding behind protocol holes is not going to help adoption.

Will sending a 0-size send message from the initiator side work?
Can the IWCM on the responder side squeeze in a 0-size buffer to receive
this message and swallow it? I hope that no check needs to be done
on all completions. Will it work for both interrupt and polling mode?

I still believe that it will be simpler to add it to the MPA protocol.
Thanks,

Arkady Kanevsky                email: [EMAIL PROTECTED]
Network Appliance Inc.         phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.   Fax: 781-895-1195
Waltham, MA 02451              central phone: 781-768-5300
 

 -Original Message-
 From: Tom Tucker [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, October 24, 2007 10:40 AM
 To: Felix Marti
 Cc: Kanevsky, Arkady; Roland Dreier; Glenn Grundstrom;
 OpenFabrics General; [EMAIL PROTECTED]
 Subject: Re: [ofa-general] [RFP] support for iWARP
 requirement- activeconnectside MUST send first FPDU

 Felix Marti wrote:

  -Original Message-
  From: Tom Tucker [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, October 23, 2007 9:32 PM
  To: Felix Marti
  Cc: Kanevsky, Arkady; Glenn Grundstrom; Sean Hefty; Steve Wise;
  Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General
  Subject: Re: [ofa-general] [RFP] support for iWARP requirement -
  activeconnectside MUST send first FPDU

  Felix Marti wrote:

  -Original Message-
  From: [EMAIL PROTECTED] [mailto:general-
  [EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady
  Sent: Tuesday, October 23, 2007 6:26 PM
  To: Glenn Grundstrom; Sean Hefty; Steve Wise
  Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General
  Subject: RE: [ofa-general] [RFP] support for iWARP requirement -
  activeconnectside MUST send first FPDU

  This is still a protocol and should be defined by IETF not OFA.
  But if we get agreement from all iWARP vendors this will be a good
  step.

  [felix] This will not work with a Chelsio RNIC which follows the IETF
  specification by a) not issuing a 0B RDMA Write to the wire and b)
  silently consuming an incoming 0B write. Therefore 0B RDMA Writes
  cannot be 'abused' for such a synchronization mechanism. I believe
  that the mentioned apps adhering to the iWarp requirement do a 'send'
  from the active side and only have the passive side issue RDMA ops
  once the incoming send has been received. I would guess that following
  a similar model is the best way to go and supported by all iWarp
  vendors implementing the IETF spec.

  IMO, the iWARP vendors _must_ get together and work on MPA '2'.
  Standardizing FPDU 'abuse' might be a good place to start, but it
  needs to be fixed to support peer-to-peer going forward.

  In the mean-time, imperfectly hiding the issue in the Firmware,
  uDAPL, the iWARP CM or anywhere else except the application seems to
  me to be the only customer friendly solution.

  [felix] While I'm not against trying to hide the connection migration
  details somewhere below the ULP, I'm not convinced that the issue is
  as severe as you make it out to be and I would not press to have the
  issue resolved in a manner that requires a new MPA version. In fact,
  the different rdma transports (and maybe even different versions of
  the same transport (in the case of IB)) provide different features and
  I would assume that ULPs will eventually code to these features and
  must thus be aware of the underlying transport protocol. In that
  bigger picture, the connection migration issue at hand seems fairly
  trivial to solve even if it requires a ULP change...

 I didn't make an argument about severity. Qualifying the
 severity is in the customer's purview. I'm simply pointing
 out the following: a) the perspective that the restriction is
 trivial is how we got here, b) making the app change is
 putting a decision in the customer's hands that IMO an iWARP
 vendor would rather they didn't have to make: "Do I or don't I
 support iWARP?", and c) you have the power to hide this
 behavior for most cases.

 Finally, I believe RFC means Request for Comment. Well
 here's one last comment -- Add an FPDU message at the end of
 MPA exchange and fix the problem in the protocol.

  If we cannot get agreement on it on the reflector, let's do it at the
  SC'07 OFA dev. conference.

  Arkady Kanevsky                email: [EMAIL PROTECTED]
  Network Appliance Inc.         phone: 781-768-5395
  1601 Trapelo Rd. - Suite 16.   Fax: 781-895-1195
  Waltham, MA 02451              central phone: 781-768-5300

  -Original Message-
  From: Glenn 

[ofa-general] [PATCH 4 of 5] mlx4: limit qp resources accepted for create_qp per query_device values and headroom requirements

2007-10-24 Thread Jack Morgenstein
mlx4: limit allowable qp create resources to avoid create_qp failures
  due to added headroom wqes.

In addition, guarantee that qp capabilities following qp creation
always lie within limits given by ib_query_device.
(for userspace, we perform this limiting in libmlx4, so as not to
 break the ABI).

Signed-off-by: Jack Morgenstein [EMAIL PROTECTED]

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index d8287d9..d40ec2f 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -109,7 +109,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 	props->max_mr_size     = ~0ull;
 	props->page_size_cap   = dev->dev->caps.page_size_cap;
 	props->max_qp          = dev->dev->caps.num_qps - dev->dev->caps.reserved_qps;
-	props->max_qp_wr       = dev->dev->caps.max_wqes;
+	props->max_qp_wr       = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE;
 	props->max_sge         = min(dev->dev->caps.max_sq_sg,
 	                             dev->dev->caps.max_rq_sg);
 	props->max_cq          = dev->dev->caps.num_cqs - dev->dev->caps.reserved_cqs;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 2869765..56305e2 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -47,6 +47,13 @@ enum {
 	MLX4_IB_DB_PER_PAGE = PAGE_SIZE / 4
 };
 
+enum {
+	MLX4_IB_SQ_MIN_WQE_SHIFT = 6
+};
+
+#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
+#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))
+
 struct mlx4_ib_db_pgdir;
 struct mlx4_ib_user_db_page;
 
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 6b33224..d6c1600 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -212,8 +212,9 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
 		       int is_user, int has_srq, struct mlx4_ib_qp *qp)
 {
 	/* Sanity check RQ size before proceeding */
-	if (cap->max_recv_wr  > dev->dev->caps.max_wqes  ||
-	    cap->max_recv_sge > dev->dev->caps.max_rq_sg)
+	if (cap->max_recv_wr > dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE ||
+	    cap->max_recv_sge >
+		min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg))
 		return -EINVAL;
 
 	if (has_srq) {
@@ -232,8 +233,19 @@ static int set_rq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
 		qp->rq.wqe_shift = ilog2(qp->rq.max_gs * sizeof (struct mlx4_wqe_data_seg));
 	}
 
-	cap->max_recv_wr  = qp->rq.max_post = qp->rq.wqe_cnt;
-	cap->max_recv_sge = qp->rq.max_gs;
+	/* leave userspace return values as they were, so as not to break ABI */
+	if (is_user) {
+		cap->max_recv_wr  = qp->rq.max_post = qp->rq.wqe_cnt;
+		cap->max_recv_sge = qp->rq.max_gs;
+	} else {
+		cap->max_recv_wr  = qp->rq.max_post =
+			min(dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE, qp->rq.wqe_cnt);
+		cap->max_recv_sge = min(qp->rq.max_gs,
+					min(dev->dev->caps.max_sq_sg,
+					    dev->dev->caps.max_rq_sg));
+	}
+	/* We don't support inline sends for kernel QPs (yet) */
+
 
 	return 0;
 }
@@ -242,8 +254,9 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
 			      enum ib_qp_type type, struct mlx4_ib_qp *qp)
 {
 	/* Sanity check SQ size before proceeding */
-	if (cap->max_send_wr  > dev->dev->caps.max_wqes  ||
-	    cap->max_send_sge > dev->dev->caps.max_sq_sg ||
+	if (cap->max_send_wr > (dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE) ||
+	    cap->max_send_sge >
+		min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg) ||
 	    cap->max_inline_data + send_wqe_overhead(type) +
 	    sizeof (struct mlx4_wqe_inline_seg) > dev->dev->caps.max_sq_desc_sz)
 		return -EINVAL;
@@ -261,6 +274,7 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
 							cap->max_inline_data +
 							sizeof (struct mlx4_wqe_inline_seg)) +
 						    send_wqe_overhead(type)));
+	qp->sq.wqe_shift = max(MLX4_IB_SQ_MIN_WQE_SHIFT, qp->sq.wqe_shift);
 	qp->sq.max_gs    = ((1 << qp->sq.wqe_shift) - send_wqe_overhead(type)) /
 		sizeof (struct mlx4_wqe_data_seg);
 
@@ -268,7 +282,7 @@ static int set_kernel_sq_size(struct mlx4_ib_dev *dev, struct ib_qp_cap *cap,
 	 * We need to leave 2 KB + 1 WQE of headroom in the SQ to
 	 * allow HW to prefetch.
 	 */
-	qp->sq_spare_wqes = (2048 >> qp->sq.wqe_shift) + 1;
+	qp->sq_spare_wqes = 
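
As a worked example of the headroom arithmetic in this patch, the sketch below reuses the two macros from the mlx4_ib.h hunk and computes the limit that query_device would report. The helper sketch_max_qp_wr and the max_wqes value of 16384 mentioned in the note are assumptions for illustration; actual device limits come from firmware.

```c
#include <assert.h>

enum {
	MLX4_IB_SQ_MIN_WQE_SHIFT = 6	/* smallest SQ WQE is 64 bytes */
};

/* 2 KB of prefetch headroom expressed in WQEs, plus one spare WQE. */
#define MLX4_IB_SQ_HEADROOM(shift) ((2048 >> (shift)) + 1)
#define MLX4_IB_SQ_MAX_SPARE (MLX4_IB_SQ_HEADROOM(MLX4_IB_SQ_MIN_WQE_SHIFT))

/*
 * Hypothetical helper: the largest work-request count the patched
 * sanity checks would accept for a given device max_wqes.
 */
static int sketch_max_qp_wr(int max_wqes)
{
	assert(max_wqes > MLX4_IB_SQ_MAX_SPARE);
	return max_wqes - MLX4_IB_SQ_MAX_SPARE;
}
```

The smallest WQE size gives the largest headroom: (2048 >> 6) + 1 is 33 WQEs, while a 128-byte WQE (shift 7) needs only 17. Subtracting the worst case means a device with max_wqes of 16384 would advertise 16351, so any request within the advertised limit still fits once the spare WQEs are added.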

[ofa-general] [PATCH 5 of 5] mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ

2007-10-24 Thread Jack Morgenstein
mlx4: Do not allocate an extra (unneeded) CQE when creating a CQ.

The extra CQE can cause a huge waste of memory if requesting
a power-of-2 number of CQEs.

Leave create_cq for userspace CQs as before, to avoid breaking ABI.
(Handle this in separate libmlx4 patch)

Signed-off-by: Jack Morgenstein [EMAIL PROTECTED]

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 8bf44da..8a1ccc4 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -108,7 +108,13 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
 	if (!cq)
 		return ERR_PTR(-ENOMEM);
 
-	entries      = roundup_pow_of_two(entries + 1);
+	/* eliminate using extra CQE (for kernel space).
+	 * For userspace, do in libmlx4, so that don't break ABI.
+	 */
+	if (context)
+		entries      = roundup_pow_of_two(entries + 1);
+	else
+		entries      = roundup_pow_of_two(entries);
 	cq->ibcq.cqe = entries - 1;
 	buf_size     = entries * sizeof (struct mlx4_cqe);
 	spin_lock_init(&cq->lock);
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 89b3f0b..d34b61b 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -141,12 +141,7 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 	dev->caps.max_sq_desc_sz = dev_cap->max_sq_desc_sz;
 	dev->caps.max_rq_desc_sz = dev_cap->max_rq_desc_sz;
 	dev->caps.num_qp_per_mgm = MLX4_QP_PER_MGM;
-	/*
-	 * Subtract 1 from the limit because we need to allocate a
-	 * spare CQE so the HCA HW can tell the difference between an
-	 * empty CQ and a full CQ.
-	 */
-	dev->caps.max_cqes       = dev_cap->max_cq_sz - 1;
+	dev->caps.max_cqes       = dev_cap->max_cq_sz;
 	dev->caps.reserved_cqs   = dev_cap->reserved_cqs;
 	dev->caps.reserved_eqs   = dev_cap->reserved_eqs;
 	dev->caps.reserved_mtts  = DIV_ROUND_UP(dev_cap->reserved_mtts,
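
To illustrate the waste the commit message describes, here is a small user-space sketch of the rounding before and after the change. sketch_roundup_pow_of_two is a stand-in for the kernel's roundup_pow_of_two(), and sketch_cq_entries is a hypothetical helper, not driver code.

```c
#include <assert.h>

/* Stand-in for the kernel's roundup_pow_of_two(); assumes n >= 1. */
static unsigned long sketch_roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Number of CQ buffer entries allocated for a request.
 * extra_cqe != 0 models the old behaviour (entries + 1),
 * extra_cqe == 0 models the patched kernel-space path.
 */
static unsigned long sketch_cq_entries(unsigned long requested, int extra_cqe)
{
	return sketch_roundup_pow_of_two(requested + (extra_cqe ? 1 : 0));
}
```

For a request of 1024 CQEs the old code rounds 1025 up to 2048 entries, doubling the buffer, while the patched path allocates exactly 1024. A request that is not a power of two, such as 1000, ends up at 1024 either way.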


Re: [ofa-general] makefile problem in using librdmacm and libverbs

2007-10-24 Thread Sean Hefty
 INC_VERBS = ${TOP_DIR}/libibverbs/include/
 INC_RDMACM = ${TOP_DIR}/librdmacm/include/

Do these files match what's in /usr/local/include/infiniband and
/usr/local/include/rdma?  (Or the equivalent install directory.)  You
could try picking up the installed include files, rather than going
directly into the source directory.

- Sean


Re: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU

2007-10-24 Thread Arlin Davis

Sean Hefty wrote:

I said clean way to do it. ;-)


I'm referring to an rdma cm connection protocol for iWarp.  We have one 
for IB.  I mentioned uDAPL as a possibility because it abstracts the 
transport, QP, CQ, etc. anyway, and one could argue that the uDAPL iWarp 
provider should take necessary steps to support the uDAPL API.


There is one OpenFabrics uDAPL provider for all OFA devices. Sure, we 
could add some logic in the DAPL abstraction layer to check for iWARP 
devices and possibly hide the restriction. Say we do that, what about 
the applications that sit directly on top of OFA verbs and rdma_cm? Say 
we add some iWARP abstraction at this layer, what about the WinOF stack?




I don't know that there's a need to change the iWarp architecture.


If you think customers are willing to work around this restriction then 
by all means leave the architecture alone and simply document the rdma 
APIs. I would think that this puts iWARP vendors at a disadvantage.


I am guessing that energy and time spent changing the iWARP protocol 
specification is a better use of everyone's time than hacking every 
iWARP stack out there to hide the restriction.


-arlin