Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
On Wed, 2007-09-26 at 14:06 -0500, Jim Mott wrote:
> This is a two part bug report. One is a conceptual problem that may just be a problem of understanding on my part. The other is what I believe to be a bug in the mlx4 driver.

mthca has the same issue.

> 1) ib_create_qp() fails with max_sge
>
> If you use ib_query_device() to return the device specific attribute max_sge, it seems reasonable to expect you can create a QP with max_send_sge=max_sge. The problem is that this often fails.
>
> The reason is that depending on the QP type (RC, UD, etc.) and how the QP will be used (send, RDMA, atomic, etc.), there can be extra segments required in the WQE that eat up SGE entries. So while some send WQE might have max_sge available SGEs, many will not. Normally the difference between max_sge and the actual maximum value allowed (and checked) for max_send_sge is 1 or 2.
>
> This issue may need API extensions to definitively resolve. In the short term, it would be very nice if max_sge reported by ib_query_device() could always return a value that ib_create_qp() could use. Think of it as the minimum max_send_sge value that will work for all QP types.
>
> 2) mlx4 setting of max send SGEs
>
> The recent patch to support shrinking WQEs introduces a behavior that creates a big difference between the mlx4 supported send SGEs (checked against 61, should be 59 or 60, and reported in ib_query_device as 32 to equal receive side max_rq_sg value).
>
> The patch that follows will allow an MLX4 to support the number of send SGEs returned by ib_query_device, and in fact quite a few more. It probably breaks shrinking WQEs and thus should not be applied directly.
>
> Note that if ib_query_device() returned max_sge adjusted for the raddr and atomic segments, this fix would not be needed. MLX4 would still support more SGEs in hardware than can be used through the API, but that is a different problem.
>
> --- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:27:47.0 -0500
> +++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:36:40.0 -0500
> @@ -370,7 +370,7 @@ static int set_kernel_sq_size(struct mlx
>  	qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s));
>  
>  	for (;;) {
> -		if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz)
> +		if (s > dev->dev->caps.max_sq_desc_sz)
>  			return -EINVAL;
>  
>  		qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift);

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
This is a two part bug report. One is a conceptual problem that may just be a problem of understanding on my part. The other is what I believe to be a bug in the mlx4 driver.

1) ib_create_qp() fails with max_sge

If you use ib_query_device() to return the device specific attribute max_sge, it seems reasonable to expect you can create a QP with max_send_sge=max_sge. The problem is that this often fails.

The reason is that depending on the QP type (RC, UD, etc.) and how the QP will be used (send, RDMA, atomic, etc.), there can be extra segments required in the WQE that eat up SGE entries. So while some send WQE might have max_sge available SGEs, many will not. Normally the difference between max_sge and the actual maximum value allowed (and checked) for max_send_sge is 1 or 2.

This issue may need API extensions to definitively resolve. In the short term, it would be very nice if max_sge reported by ib_query_device() could always return a value that ib_create_qp() could use. Think of it as the minimum max_send_sge value that will work for all QP types.

2) mlx4 setting of max send SGEs

The recent patch to support shrinking WQEs introduces a behavior that creates a big difference between the mlx4 supported send SGEs (checked against 61, should be 59 or 60, and reported in ib_query_device as 32 to equal receive side max_rq_sg value).

The patch that follows will allow an MLX4 to support the number of send SGEs returned by ib_query_device, and in fact quite a few more. It probably breaks shrinking WQEs and thus should not be applied directly.

Note that if ib_query_device() returned max_sge adjusted for the raddr and atomic segments, this fix would not be needed. MLX4 would still support more SGEs in hardware than can be used through the API, but that is a different problem.

--- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:27:47.0 -0500
+++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c 2007-09-26 13:36:40.0 -0500
@@ -370,7 +370,7 @@ static int set_kernel_sq_size(struct mlx
 	qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s));
 
 	for (;;) {
-		if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz)
+		if (s > dev->dev->caps.max_sq_desc_sz)
 			return -EINVAL;
 
 		qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift);
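The "minimum max_send_sge value that will work for all QP types" suggested in point 1 can be sketched as a minimum over the ways a send queue might be used. This is only an illustration: struct sq_usage, usable_sge() and safe_max_sge() are hypothetical names, not part of the verbs API or of any driver, and the segment counts a caller passes in are whatever that hardware really requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one way a send queue can be used. */
struct sq_usage {
    const char *name;        /* label only, e.g. "send" or "rdma" */
    unsigned wqe_segs;       /* total segments that fit in one send WQE */
    unsigned overhead_segs;  /* segments taken by ctrl, raddr, atomic, AV, ... */
};

/* Segments left over for data SGEs in one usage (0 if overhead fills the WQE). */
static unsigned usable_sge(const struct sq_usage *u)
{
    return u->wqe_segs > u->overhead_segs ? u->wqe_segs - u->overhead_segs : 0;
}

/*
 * Smallest usable SGE count across all usages. Reporting this value as
 * max_sge would mean a QP created with max_send_sge = max_sge never fails
 * because of per-type overhead.
 */
static unsigned safe_max_sge(const struct sq_usage *usages, size_t n)
{
    assert(usages != NULL && n > 0);

    unsigned best = usable_sge(&usages[0]);
    for (size_t i = 1; i < n; i++) {
        unsigned s = usable_sge(&usages[i]);
        if (s < best)
            best = s;
    }
    return best;
}
```

With a 32-segment WQE and overheads of 1, 2 and 3 segments (made-up numbers), the conservative value is 29, while the plain send case alone could use 31.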
Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
> 1) ib_create_qp() fails with max_sge
>
> If you use ib_query_device() to return the device specific attribute max_sge, it seems reasonable to expect you can create a QP with max_send_sge=max_sge. The problem is that this often fails.
>
> The reason is that depending on the QP type (RC, UD, etc.) and how the QP will be used (send, RDMA, atomic, etc.), there can be extra segments required in the WQE that eat up SGE entries. So while some send WQE might have max_sge available SGEs, many will not.
>
> This issue may need API extensions to definitively resolve. In the short term, it would be very nice if max_sge reported by ib_query_device() could always return a value that ib_create_qp() could use. Think of it as the minimum max_send_sge value that will work for all QP types.

The intention is that any attempt to create a QP with the maximum number of S/G entries as reported by query device should succeed. As you note, there may be issues that make this fail, but I would consider them bugs to be fixed.

You mention API extensions to handle this -- do you have any concrete ideas? In the past we've talked a little about this, but I don't think anyone has suggested any changes that would help matters while still keeping the API no more complex than it already is.

> The recent patch to support shrinking WQEs introduces a behavior that creates a big difference between the mlx4 supported send SGEs (checked against 61, should be 59 or 60, and reported in ib_query_device as 32 to equal receive side max_rq_sg value).

I'm not sure I understand this. What's the new behavior? Are you trying to take advantage of the fact that using non-power-of-2 size send WQEs would let you have a send queue with more than 32 S/G entries?

I think doing that actually would require a change in the API to allow different values for max_sge_rq and max_sge_sq to be reported from ib_query_device(). Which in turn would break the userspace ABI, etc, etc. and leaves me wondering if it's really worth it.

(BTW I hate the shrinking WQE terminology for this, although obviously you weren't the one to introduce it)

 - R.
RE: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
This problem comes about because ib_query_device() has only one field (max_sge) to return all types of SGE maximums. This value must work for receive WQEs, send WQEs, and all the permutations of QP type and hardware.

A minimal API change that could help would be to add two new fields to the ib_device_attr structure returned by ib_query_device():
 - delta_sge_sg
 - delta_sge_rd

The behavior would be that in all cases using max_sge for the send or receive SGE count in create_qp would always succeed. That means the current value the drivers return there would have to be reduced to fix this bug. All existing code would continue to run. If an application wanted to make better use of hardware that supports asymmetric SGE counts, it could add the appropriate delta_sge_xx value to max_sge and get a more useful value.

It looks like there is some movement in this direction already with the fields:
 - max_sge_rd (nes, amso1100, ehca, cxgb3 only)
 - max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only)

If we do add any new fields to deal with this problem, we should probably make sure all the drivers support them. I guess that portable applications check max_sge_rd and max_srq_sge for zero and use max_sge if they are?

To fully solve the problem and let applications make optimal use of hardware, we probably need a new function that takes the create_qp parameters along with a list of OPCODEs to be used (or excluded?) on this QP and returns the actual send and receive SGE maximums.

The issue with the shrinking WQE (sorry) is best shown by example. The MLX4 supports a send WQE that is 1008 bytes long unless you are doing RDMA_READ, when you can only use 512 byte send WQEs. A receive WQE can be 512 bytes maximum. Ignore the non-power-of-2 size stuff and just assume that all WQEs are fixed size power-of-2 with maximums of 1024 or 512. This is 63 or 32 segments. One segment for ctrl means that we get max_sge_rq of 31 and a matrix for max_sge_sq:

  RDMA_READ : 30 (raddr)
  RDMA_WRITE: 61 (raddr)
  SEND-RC   : 62
  SEND-UD   : 59 (AV, AV, dest)

The problem with:

  if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz)

is that since max_sq_desc_sz is 1008 instead of 1024 we are forced to use wqe_shift of 9 instead of 10. That means that even though the hardware supports an RC send with 62 SGEs, the most we can actually ask for is 31.

All this brings us back to the original bug. ib_query_device() returned max_sge=32, so we use it in max_send_sge when we create a QP. In mlx4/qp.c, we verify max_send_sge <= max_sq_sg (62; that is, (1008-16)/16) in a sanity check at entry to set_kernel_sq_size(). This passes. Then we calculate the size of the WQE based on the QP type:

  cap->max_send_sge * sizeof (struct mlx4_wqe_data_seg) + send_wqe_overhead(type);

The send_wqe_overhead(RC) function returns 3 segments:
 - ctrl + atomic + raddr

So we get a WQE size of 560 bytes (32 SGEs + 3 overhead segments, 16 bytes each). That rounds up to a power-of-2 size of 1024, and this fails the power-of-2 test because 1024 is greater than 1008.

Sorry for all the words.
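The arithmetic in this message can be replayed in a few lines of C. This is a sketch under stated assumptions, not the driver code: the 16-byte segment size is inferred from 560 bytes being 35 segments, the 1008-byte limit and the 3 overhead segments for RC are the figures quoted in the message, and every function name here is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed constants, taken from the numbers quoted in the thread. */
enum {
    SEG_SIZE         = 16,    /* bytes per WQE segment (inferred: 560 / 35) */
    MAX_SQ_DESC_SZ   = 1008,  /* largest send WQE, per the message */
    RC_OVERHEAD_SEGS = 3      /* ctrl + atomic + raddr for an RC QP */
};

/* Smallest power of two >= n; stands in for the kernel's roundup_pow_of_two(). */
static unsigned round_up_pow2(unsigned n)
{
    unsigned p = 1;

    assert(n > 0);
    while (p < n)
        p <<= 1;
    return p;
}

/* Bytes needed for a send WQE with the given SGE count and overhead. */
static unsigned send_wqe_bytes(unsigned max_send_sge, unsigned overhead_segs)
{
    return (max_send_sge + overhead_segs) * SEG_SIZE;
}

/* Existing check: the power-of-two stride must fit within the limit. */
static bool fits_pow2_check(unsigned wqe_bytes)
{
    return round_up_pow2(wqe_bytes) <= MAX_SQ_DESC_SZ;
}

/* Check from the partial patch: only the raw WQE size must fit. */
static bool fits_raw_check(unsigned wqe_bytes)
{
    return wqe_bytes <= MAX_SQ_DESC_SZ;
}

/* Largest SGE count that still passes the power-of-two check. */
static unsigned max_sge_pow2(unsigned overhead_segs)
{
    unsigned sge = 0;

    while (fits_pow2_check(send_wqe_bytes(sge + 1, overhead_segs)))
        sge++;
    return sge;
}
```

Under these assumptions, 32 SGEs on an RC QP need 560 bytes, which rounds up to 1024 and is rejected by the power-of-two check but accepted by the raw-size check. With only the ctrl segment as overhead, the power-of-two check tops out at 31 SGEs, matching the "most we can actually ask for" figure above; with the full RC overhead of 3 segments it tops out at 29.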
Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
> A minimal API change that could help would be to add two new fields to ib_device_attr structure returned by ib_query_device:
>  - delta_sge_sg
>  - delta_sge_rd

Hmm, a cute idea but I'm still left wondering if it's worth the ABI breakage etc just to give a few more S/G entries in some situations.

> The behavior would be that in all cases using max_sge for send or receive SGE count in create_qp would always succeed. That means the current value the drivers return there would have to be reduced to fix this bug. All existing codes would continue to run.

Actually are there any drivers other than patched mlx4 where max_sge doesn't always work? I agree we do want to get this right, but I thought we had fixed all such bugs.

(And we should make sure that any shrinking WQE patch for mlx4 doesn't introduce new bugs)

(BTW I see a different bug in unpatched mlx4, namely that it might report a too-big number of S/G entries allowed for the SQ)

> It looks like there is some movement in this direction already with the fields:
>  - max_sge_rd (nes, amso1100, ehca, cxgb3 only)

This field is obsolete, since we don't handle RD and almost certainly never will. I'm not sure why anyone is setting a value.

>  - max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only)

Any devices that handle SRQ should set this. I think cxgb3 does not support SRQ.

 - R.
RE: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
The same bug exists with mthca. I saw it originally in the kernel doing RDS work, but I just put together a short user space test.

ibv_query_device(MT25204) returns max_sge=30
 - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge fails
 - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge-1 works

I only have the two types of adapters to test with.
Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
> The same bug exists with mthca. I saw it originally in the kernel doing RDS work, but I just put together a short user space test.
>
> ibv_query_device(MT25204) returns max_sge=30
>  - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge fails
>  - ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge-1 works

Which transport type?

 - R.
RE: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
FWIW, I have code in my apps that retries QP creation with reduced values when the allocation with max fails.

There was also an earlier e-mail thread on this exact same issue, but the solution bandied about was to use special values in the qp_attr structure, a la QP_MAX_SEND_SGE (-1?). The provider would recognize this value and allocate the max for that attribute that would succeed given the current resource situation. The qp_attr structure would then be updated by the provider with the values given.

This approach extends, but doesn't break, the API, allows existing apps to work as usual, and avoids the retry logic that I've added to my apps.

Just a thought,
Tom
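Both ideas in this message, the retry loop and the special "give me the max" value, can be sketched without touching real hardware. Everything below is hypothetical: try_create_fn stands in for a wrapper around ibv_create_qp() (or ib_create_qp() in the kernel), fake_device_create() simulates a device whose real limit is lower than the reported max_sge, and QP_MAX_SEND_SGE with the value -1 is only the guess from the message, not an existing constant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel meaning "allocate as many send SGEs as possible". */
#define QP_MAX_SEND_SGE (-1)

/*
 * Hypothetical callback: try to create a QP with the requested number of
 * send SGEs. Returns 0 on success, nonzero on failure.
 */
typedef int (*try_create_fn)(int max_send_sge, void *ctx);

/*
 * Application-side workaround: start at the reported maximum and step down
 * until creation succeeds. Returns the SGE count that worked, or -1.
 */
static int create_qp_with_retry(try_create_fn try_create, void *ctx,
                                int reported_max_sge, int min_sge)
{
    assert(try_create != NULL);

    for (int sge = reported_max_sge; sge >= min_sge && sge > 0; sge--) {
        if (try_create(sge, ctx) == 0)
            return sge;
    }
    return -1;
}

/*
 * Provider-side alternative: replace the sentinel with the largest value
 * the provider can satisfy and write it back, as the message describes.
 * Returns 0 on success, -1 if an explicit request is too large.
 */
static int resolve_send_sge(int *max_send_sge, int provider_max)
{
    assert(max_send_sge != NULL);

    if (*max_send_sge == QP_MAX_SEND_SGE) {
        *max_send_sge = provider_max;
        return 0;
    }
    return *max_send_sge <= provider_max ? 0 : -1;
}

/* Simulated device: accepts any request at or below the limit in ctx. */
static int fake_device_create(int max_send_sge, void *ctx)
{
    const int *real_limit = ctx;

    return max_send_sge <= *real_limit ? 0 : -1;
}
```

With a simulated limit of 29 and a reported max_sge of 30 (the mthca figures reported earlier in the thread), the retry loop settles on 29 after one failed attempt, while the sentinel variant gets 29 written back without any retry.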
RE: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
IBV_QPT_RC