Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

2007-09-27 Thread Tom Tucker
On Wed, 2007-09-26 at 14:06 -0500, Jim Mott wrote:
> This is a two part bug report.  One is a conceptual problem that may just
> be a problem of understanding on my part.  The other is what I believe to
> be a bug in the mlx4 driver.

mthca has the same issue.

 

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

2007-09-26 Thread Jim Mott
  This is a two-part bug report.  One is a conceptual problem that may just
be a problem of understanding on my part.  The other is what I believe to
be a bug in the mlx4 driver.

1) ib_create_qp() fails with max_sge 
  If you use ib_query_device() to return the device specific 
attribute max_sge, it seems reasonable to expect you can create
a QP with max_send_sge=max_sge.  The problem is that this often
fails.

  The reason is that depending on the QP type (RC, UD, etc.) and
how the QP will be used (send, RDMA, atomic, etc.), there can be
extra segments required in the WQE that eat up SGE entries.  So
while some send WQE might have max_sge available SGEs, many will
not.

  Normally the difference between max_sge and the actual maximum
value allowed (and checked) for max_send_sge is 1 or 2.

  This issue may need API extensions to definitively resolve.  In
the short term, it would be very nice if max_sge reported by 
ib_query_device() could always return a value that ib_create_qp()
could use.  Think of it as the minimum max_send_sge value that
will work for all QP types.


2) mlx4 setting of max send SGEs
  The recent patch to support shrinking WQEs introduces a 
behavior that creates a big difference between the mlx4 
supported send SGEs (checked against 61, should be 59 or 60,
and reported in ib_query_device as 32 to equal receive side
max_rq_sg value).  

  The patch that follows will allow an MLX4 to support the
number of send SGEs returned by ib_query_device, and in fact
quite a few more.  It probably breaks shrinking WQEs and thus
should not be applied directly.

  Note that if ib_query_device() returned max_sge adjusted
for the raddr and atomic segments, this fix would not be
needed.  MLX4 would still support more SGEs in hardware than
can be used through the API, but that is a different problem.  

--- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:27:47.0 -0500
+++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c  2007-09-26 13:36:40.0 -0500
@@ -370,7 +370,7 @@ static int set_kernel_sq_size(struct mlx
 	qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s));
 
 	for (;;) {
-		if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz)
+		if (s > dev->dev->caps.max_sq_desc_sz)
 			return -EINVAL;
 
 		qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift);



Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

2007-09-26 Thread Roland Dreier
> 1) ib_create_qp() fails with max_sge
>   If you use ib_query_device() to return the device specific
> attribute max_sge, it seems reasonable to expect you can create
> a QP with max_send_sge=max_sge.  The problem is that this often
> fails.
>
>   The reason is that depending on the QP type (RC, UD, etc.) and
> how the QP will be used (send, RDMA, atomic, etc.), there can be
> extra segments required in the WQE that eat up SGE entries.  So
> while some send WQE might have max_sge available SGEs, many will
> not.
>
>   This issue may need API extensions to definitively resolve.  In
> the short term, it would be very nice if max_sge reported by
> ib_query_device() could always return a value that ib_create_qp()
> could use.  Think of it as the minimum max_send_sge value that
> will work for all QP types.

The intention is that any attempt to create a QP with the maximum
number of S/G entries as reported by query device should succeed.
However, as you note there may be issues that make this fail, but I
would consider them as bugs to be fixed.

You mention API extensions to handle this -- do you have any concrete
ideas?  In the past we've talked a little about this, but I don't
think anyone has suggested any changes that would help matters while
still keeping the API no more complex than it already is.

>   The recent patch to support shrinking WQEs introduces a
> behavior that creates a big difference between the mlx4
> supported send SGEs (checked against 61, should be 59 or 60,
> and reported in ib_query_device as 32 to equal receive side
> max_rq_sg value).

I'm not sure I understand this.  What's the new behavior?

Are you trying to take advantage of the fact that using non-power-of-2
size send WQEs would let you have a send queue with more than 32 S/G
entries?  I think doing that actually would require a change in the
API to allow different values for max_sge_rq and max_sge_sq to be
reported from ib_query_device().  Which in turn would break the
userspace ABI, etc, etc. and leaves me wondering if it's really worth it.

(BTW I hate the shrinking WQE terminology for this, although
obviously you weren't the one to introduce it)

 - R.


RE: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

2007-09-26 Thread Jim Mott
This problem comes about because ib_query_device() has only one
field (max_sge) to return all types of SGE maximums.  This value
must work for receive WQEs, send WQEs, and all the permutations
of QP type and hardware.

A minimal API change that could help would be to add two new fields
to ib_device_attr structure returned by ib_query_device:
  - delta_sge_sg
  - delta_sge_rd

The behavior would be that in all cases using max_sge for send or
receive SGE count in create_qp would always succeed.  That means
the current value the drivers return there would have to be reduced
to fix this bug.  All existing codes would continue to run.

If an application wanted to make better use of hardware that supports
asymmetric SGE counts, it could add the appropriate delta_sge_xx
value to max_sge and get a more useful value.

It looks like there is some movement in this direction already
with the fields:
  - max_sge_rd (nes, amso1100, ehca, cxgb3 only)
  - max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only)

If we do add any new fields to deal with this problem, we should
probably make sure all the drivers support them.  I guess that
portable applications check max_sge_rd and max_srq_sge for zero
and use max_sge if they are?

To fully solve the problem and let applications make 
optimal use of hardware, we probably need a new function 
that takes the create_qp parameters along with a list of
OPCODEs to be used (or excluded?) on this QP and returns 
the actual send and receive SGE maximums.



The issue with the shrinking WQE (sorry) is best 
shown by example.  The MLX4 supports a send WQE that
is 1008 bytes long unless you are doing RDMA_READ 
when you can only use 512 byte send WQEs.  A
receive WQE can be 512 bytes maximum.  

Ignore the non-power-of-2 size stuff and just
assume that all WQEs are fixed size power-of-2
with maximums of 1024 or 512.  This is 63 or 32
segments.  One segment for ctrl means that we 
get max_sge_rq of 31 and a matrix for max_sge_sq:

RDMA_READ : 30 (raddr)
RDMA_WRITE: 61 (raddr)
SEND-RC   : 62 
SEND-UD   : 59 (AV, AV, dest)

The problem with:
  if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz)

is that since max_sq_desc is 1008 instead of 1024 we are forced
to use wqe_shift of 9 instead of 10.  That means that even
though the hardware supports an RC send with 62 SGEs, the most
we can actually ask for is 31.



All this brings us back to the original bug.

ib_query_device() returned max_sge=32, so we use it in max_send_sge 
when we create a QP.  

In mlx4/qp.c, we verify max_send_sge <= max_sq_sg (62; 1008-16)
in a sanity check at entry to set_kernel_sq_size().  This passes.

Then we calculate the size of the WQE based on the QP type:
  cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg) +
  send_wqe_overhead(type);
The send_wqe_overhead(RC) function returns 3 segments:
  - ctrl + atomic + raddr
So we get a WQE size of 560 bytes (32 SGEs + 3 overhead
segments) and this fails the power-of-2 test because 1024 
is greater than 1008.

Sorry for all the words.


Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

2007-09-26 Thread Roland Dreier
> A minimal API change that could help would be to add two new fields
> to ib_device_attr structure returned by ib_query_device:
>   - delta_sge_sg
>   - delta_sge_rd

Hmm, a cute idea but I'm still left wondering if it's worth the ABI
breakage etc just to give a few more S/G entries in some situations.

> The behavior would be that in all cases using max_sge for send or
> receive SGE count in create_qp would always succeed.  That means
> the current value the drivers return there would have to be reduced
> to fix this bug.  All existing codes would continue to run.

Actually are there any drivers other than patched mlx4 where max_sge
doesn't always work?  I agree we do want to get this right, but I
thought we had fixed all such bugs.  (And we should make sure that any
shrinking WQE patch for mlx4 doesn't introduce new bugs)

(BTW I see a different bug in unpatched mlx4, namely that it might
report a too-big number of S/G entries allowed for the SQ)

> It looks like there is some movement in this direction already
> with the fields:
>   - max_sge_rd (nes, amso1100, ehca, cxgb3 only)

This field is obsolete, since we don't handle RD and almost certainly
never will.  I'm not sure why anyone is setting a value.

>   - max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only)

Any devices that handle SRQ should set this.  I think cxgb3 does not
support SRQ.

 - R.


RE: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

2007-09-26 Thread Jim Mott
The same bug exists with mthca.  I saw it originally in the kernel doing RDS 
work, but I just put together a short user space test.

ibv_query_device(MT25204) returns max_sge=30
  - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge fails
  - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge-1 works

I only have the two types of adapters to test with.


Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

2007-09-26 Thread Roland Dreier
> The same bug exists with mthca.  I saw it originally in the kernel doing RDS
> work, but I just put together a short user space test.
>
> ibv_query_device(MT25204) returns max_sge=30
>   - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge fails
>   - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge-1 works

Which transport type?

 - R.


RE: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

2007-09-26 Thread Tom Tucker

FWIW, I have code in my apps that retries QP creation with reduced
values when the allocation with max fails. 

There was also an earlier e-mail thread on this exact same issue, but
the solution bandied about was to use special values in the qp_attr
structure ala QP_MAX_SEND_SGE (-1?). The provider would recognize this
value and allocate the max for that attribute that would succeed given
the current resource situation. The qp_attr structure would then be
updated by the provider with the values given. This approach extends,
but doesn't break the API, allows existing apps to work as usual, and
avoids the retry logic that I've added to my apps.

Just a thought,
Tom



RE: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

2007-09-26 Thread Jim Mott
IBV_QPT_RC
