Re: [ofa-general] [PATCH] mlx4_core: increase max number of qp's and of srq's to 128K
On Wednesday 21 November 2007 10:09, Or Gerlitz wrote:

Why do you want to increase the maxima for SRQs as well? A 1:1 ratio of QPs to SRQs means a broken application design, doesn't it?

Not really, for the new XRC QP type. In this case, we will have one XRC connection per multi-process application per host, with a larger number of XRC_SRQs (one per process per host).

However, the XRC SRQs act more like RD QPs, so we really don't need to increase the default max SRQs. I'll post V2 of the patch now.

- Jack

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH V2] mlx4_core: increase max number of qp's to 128K
mlx4_core: increase max QPs to 128K.

With the advent of large clusters which utilize multicore hosts, 64K QPs are not enough. Increase the default maximum number of QPs to 128K.

Signed-off-by: Jack Morgenstein [EMAIL PROTECTED]

Index: ofa_1_3_dev_kernel/drivers/net/mlx4/main.c
===
--- ofa_1_3_dev_kernel.orig/drivers/net/mlx4/main.c 2007-11-21 17:51:56.0 +0200
+++ ofa_1_3_dev_kernel/drivers/net/mlx4/main.c 2007-11-22 10:26:04.0 +0200
@@ -76,7 +76,7 @@ static const char mlx4_version[] __devin
 	DRV_VERSION " (" DRV_RELDATE ")\n";

 static struct mlx4_profile default_profile = {
-	.num_qp		= 1 << 16,
+	.num_qp		= 1 << 17,
 	.num_srq	= 1 << 16,
 	.rdmarc_per_qp	= 1 << 4,
 	.num_cq		= 1 << 16,
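For reference, the profile fields in the patch are powers of two written as shifts, so moving num_qp from a shift of 16 to a shift of 17 doubles the limit from 64K to 128K. A minimal sketch of that arithmetic follows; `profile_entries` is a hypothetical helper used only for illustration and is not part of mlx4_core.

```c
#include <assert.h>

/*
 * Hypothetical helper (not in the driver): number of entries that a
 * profile field written as "1 << log2_entries" expands to.
 */
static unsigned int profile_entries(unsigned int log2_entries)
{
	/* A shift of 32 or more would be undefined for a 32-bit unsigned int. */
	assert(log2_entries < 32);
	return 1u << log2_entries;
}
```

With this helper, `profile_entries(16)` is 65536 (64K, the old num_qp default) and `profile_entries(17)` is 131072 (128K, the new one).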
[ofa-general] ofa_1_3_kernel 20071128-0200 daily build status
This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod --with-core-mod --with-addr_trans-mod --with-rds-mod --with-cxgb3-mod --with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.13
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on x86_64 with linux-2.6.19
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.13
Passed on ia64 with linux-2.6.17
Passed on ppc64 with linux-2.6.16
Passed on powerpc with linux-2.6.14
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.12
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.12
Passed on ppc64 with linux-2.6.15
Passed on ppc64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.15
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.13
Passed on x86_64 with linux-2.6.22
Passed on ia64 with linux-2.6.13
Passed on powerpc with linux-2.6.15
Passed on ia64 with linux-2.6.12
Passed on x86_64 with linux-2.6.18-53.el5
Passed on ia64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.15
Passed on ia64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ppc64 with linux-2.6.18-8.el5

Failed:
[ofa-general] Re: [PATCH] librdmacm/man: fix-up man pages
On 11/27/07, Sean Hefty [EMAIL PROTECTED] wrote:

These have been committed to master branch.

OK, got it. Some users have approached me and said that the man pages do not make clear what the legal values are for some fields of the connection param structure. Reviewing this a little, I think we should add the maximum values for retry_count and rnr_retry_count under the InfiniBand-specific section of the rdma_connect and rdma_accept pages.

Also, what about pushing all these documentation changes as a release to OFED 1.3?

Or.
[ofa-general] [PATCH] libmlx4: max_recv_wr must be non-zero for non-SRQ QPs
max_recv_wr must also be non-zero for QPs which are not associated with an SRQ.

Signed-off-by: Jack Morgenstein [EMAIL PROTECTED]

---

Roland,

Without this patch, if the userspace caller specifies max_recv_wr = 0 for a non-SRQ QP, the creation will be rejected in kernel space, in file infiniband/hw/mlx4/qp.c, function set_rq_size:

	} else {
		/* HW requires >= 1 RQ entry with >= 1 gather entry */
==> NOTE:
		if (is_user && (!cap->max_recv_wr || !cap->max_recv_sge))
			return -EINVAL;

You've set max_recv_sge size to 1, but not max_recv_wr.

Jack

diff --git a/src/verbs.c b/src/verbs.c
index 4e7beff..ec4c6a5 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -367,8 +367,12 @@ struct ibv_qp *mlx4_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *attr)
 	if (attr->srq)
 		attr->cap.max_recv_wr = qp->rq.wqe_cnt = 0;
-	else if (attr->cap.max_recv_sge < 1)
-		attr->cap.max_recv_sge = 1;
+	else {
+		if (attr->cap.max_recv_sge < 1)
+			attr->cap.max_recv_sge = 1;
+		if (attr->cap.max_recv_wr < 1)
+			attr->cap.max_recv_wr = 1;
+	}

 	if (mlx4_alloc_qp_buf(pd, &attr->cap, attr->qp_type, qp))
 		goto err;
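The effect of the patch above can be sketched in isolation. The following is only an illustration of the clamping rule, assuming a stand-in `struct recv_caps` and a hypothetical `fixup_recv_caps()` helper; it does not use the real libibverbs types and is not the libmlx4 code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the two receive-side fields of the QP capabilities. */
struct recv_caps {
	unsigned int max_recv_wr;
	unsigned int max_recv_sge;
};

/*
 * Hypothetical helper: with an SRQ the QP has no receive queue of its
 * own, so max_recv_wr is forced to 0. Without an SRQ, both values are
 * raised to at least 1, matching the kernel-side check quoted above.
 */
static void fixup_recv_caps(struct recv_caps *cap, int has_srq)
{
	assert(cap != NULL);

	if (has_srq) {
		cap->max_recv_wr = 0;
		return;
	}
	if (cap->max_recv_sge < 1)
		cap->max_recv_sge = 1;
	if (cap->max_recv_wr < 1)
		cap->max_recv_wr = 1;
}
```

A caller that passes zero for both fields on a non-SRQ QP ends up with 1 and 1, which the kernel check accepts; values that are already non-zero are left untouched.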
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
Agree with the initiator/client sending a signalled 0B RDMA Read. This will handle the client side. I am still not 100% clear on the passive/server side. Two issues bother me:

1. Is a bogus S-tag allowed for incoming RDMA ops? I do not recall that RDDP requires the length to be checked before the S-tag.
2. How does the verbs layer on the server side know that an RDMA Read op arrived and completed? Is it some back door to vendor FW? Will this be kicked for all incoming RDMA Read ops?

Arkady Kanevsky                  email: [EMAIL PROTECTED]
Network Appliance Inc.           phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.     Fax: 781-895-1195
Waltham, MA 02451                central phone: 781-768-5300

-----Original Message-----
From: Steve Wise [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 27, 2007 7:48 PM
To: Caitlin Bestler
Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal

Caitlin Bestler wrote:

On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote:

For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada...

Steve.

I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero length RDMA Read nor a Write should be sent. An extraneous RDMA Read is particularly onerous for a short lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it?

I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm.

Further, the RDMA Read solution is adequate whenever the RDMA Write solution would have been (although at an unnecessary extra cost), but as near as I can determine it is not a complete solution. If the passive side needs an untagged message completion then *something* needs to send it. How can the CM layer (or, I suppose, the ULP itself) know that this untagged NOP message must be sent without meta-data?

I believe at Reno we had the current rnic vendors all saying a SEND or 0B read will work. So: If someone has current iwarp HW that will _not_ handle this problem by doing the 0B read hack, please speak up now.

As I see it, if we want to do the minimum that is required, but be certain that it is adequate, we need a per-connection setup meta-data exchange.

Are you going to prototype this?

Steve.
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
Any posting to the SQ prior to connection establishment will complete immediately with flushed status.

Arkady Kanevsky                  email: [EMAIL PROTECTED]
Network Appliance Inc.           phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.     Fax: 781-895-1195
Waltham, MA 02451                central phone: 781-768-5300

-----Original Message-----
From: Glenn Grundstrom [mailto:[EMAIL PROTECTED]
Sent: Sunday, November 25, 2007 9:00 PM
To: Steve Wise; Kanevsky, Arkady
Cc: Leonid Grossman; [EMAIL PROTECTED]
Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

Kanevsky, Arkady wrote:

Very good points. Thanks Steve. If we can do unsignalled 0-size RDMA Read with bogus S-tag this may work better. Yes, it will require IRD not to be 0 set at Responder.

Ditto ORD of at least 1 on Responder.

There is no need to have extra CQ entry on either side for it. It is only needed for error path. So this will only be needed if Sender posted the full queue of sends. But it can not post anything because CM will not let it know that connection is established.

Well, actually, I think the ULP _can_ post before establishing the connection. But I guess we can define the semantics such that applications using the rdma-cm interface must adhere to whatever we need to make this hack work. Q: are there apps using the rdma-cm out there today that pre-post SQ WRs before getting an ESTABLISHED event?

Steve.

ULPs are allowed to post prior to establishing the connection, but I can't name any that operate this way. Prohibiting applications that use the rdma_cm directly from pre-posting is okay, but what about ULPs over other ULPs (i.e. MPI over uDAPL)? How can/will this be handled?

Glenn.

Happy Thanksgiving,

Arkady Kanevsky                  email: [EMAIL PROTECTED]
Network Appliance Inc.           phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.     Fax: 781-895-1195
Waltham, MA 02451                central phone: 781-768-5300

-----Original Message-----
From: Steve Wise [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 21, 2007 1:07 PM
To: Kanevsky, Arkady
Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED]
Subject: [ofa-general] Re: iWARP peer-to-peer CM proposal

Comments in-line below...

Kanevsky, Arkady wrote:

Group, below is a proposal on how to resolve the peer-to-peer iWARP CM issue discovered at the interop event.

The main issue is that the MPA spec (relevant portion of IETF RFC 5044 is below) requires that the connection initiator send the first message over the established connection. Multiple MPI implementations and several other apps use a peer-to-peer model. So rather than forcing all of them to do it on their own, which will not help with interop between different implementations, the goal is to extend lower layers to provide it.

Our first idea was to leave the MPA protocol untouched and try to solve this problem in iw_cm. But there are too many complications to it. First, in order to adhere to RFC 5044 the initiator must send the first FPDU and the responder must process it. But since the connection is already established, processing the FPDU involves the ULP on whose behalf the connection is created. So either the initiator sends a message which generates a completion on the responder CQ, thus visible to the ULP, or not. In the latter case, the only op which can do it is an RDMA one, which means that the responder somehow provided the initiator an S-tag which it can use. So, this is an extension to MPA, probably using private data. And the responder, upon receiving it, destroys this S-tag. In any case this is an extension of MPA.

This stag exchange isn't needed if this RDMA op is a 0B READ. The responder waits for that 0B read and only indicates the rdma connection is established to its ULP when it replies to the 0B read. In this scenario, the responder/server side doesn't consume any CQ resources. But it would require an IRD of at least 1 to be configured on the QP. The initiator still requires an SQ entry, and possibly a CQ entry, for initiating the 0B read and handling completion. But it's perhaps a little less painful than doing a SEND/RECV exchange. The read wr could be unsignaled so that it won't generate a CQE. But it still consumes an SQ WR slot so the SQ would have to be sized to allow this extra WR. And I
[ofa-general] [PATCH] IB/ehca: Fix static rate if path faster than link
The formula would yield -1 for this, which is wrong in a bad way (max throttling). Clamp to 0, which is the correct value.

Signed-off-by: Joachim Fenkes [EMAIL PROTECTED]
---
This fixes another regression introduced in rc3. Please review and apply for 2.6.24-rc4. Thanks!

 drivers/infiniband/hw/ehca/ehca_av.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c
index 453eb99..f7782c8 100644
--- a/drivers/infiniband/hw/ehca/ehca_av.c
+++ b/drivers/infiniband/hw/ehca/ehca_av.c
@@ -76,8 +76,12 @@ int ehca_calc_ipd(struct ehca_shca *shca, int port,

 	link = ib_width_enum_to_int(pa.active_width) * pa.active_speed;

-	/* IPD = round((link / path) - 1) */
-	*ipd = ((link + (path >> 1)) / path) - 1;
+	if (path >= link)
+		/* no need to throttle if path faster than link */
+		*ipd = 0;
+	else
+		/* IPD = round((link / path) - 1) */
+		*ipd = ((link + (path >> 1)) / path) - 1;

 	return 0;
 }
--
1.5.2
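To see why the clamp matters, the arithmetic from the patch can be tried on plain integers. The sketch below assumes `link` and `path` are positive rates in the same unit; `calc_ipd` and `calc_ipd_unclamped` are hypothetical stand-alone helpers that mirror the new and old expressions, not the ehca functions themselves.

```c
#include <assert.h>

/* Old expression: IPD = round((link / path) - 1), with no clamp. */
static int calc_ipd_unclamped(int link, int path)
{
	assert(link > 0 && path > 0);
	return ((link + (path >> 1)) / path) - 1;
}

/* New behaviour: no throttling when the path is at least as fast as the link. */
static int calc_ipd(int link, int path)
{
	assert(link > 0 && path > 0);
	if (path >= link)
		return 0;
	return ((link + (path >> 1)) / path) - 1;
}
```

For example, with link = 4 and path = 16 the old expression gives (4 + 8) / 16 - 1 = -1, while the clamped version gives 0. With link = 16 and path = 4 both give (16 + 2) / 4 - 1 = 3.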
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:general-[EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady
Sent: Wednesday, November 28, 2007 5:30 AM
To: Steve Wise; Caitlin Bestler
Cc: Leonid Grossman; [EMAIL PROTECTED]
Subject: RE: [ofa-general] Re: iWARP peer-to-peer CM proposal

Agree with initiator/client sending signalled 0B RDMA Read. This will handle client side. Still not 100% clear on passive/server side. Two issues which bothers me.

1. Is bogus S-tag allowed for incoming RDMA ops? I do not recall that RDDP requires that length is checked before S-tag.
2. How is verb layer on server side knows that RDMA Read op came and was done? Is it some back door to vendor FW? Will this be kicked for all incoming RDMA Read ops?

As you point out, the server Verbs layer is not aware of an incoming 0B RDMA Read (or Write, for that matter). Hence some kind of magic must happen in the adapter, where we vendors will have a choice:

a) just 'unjam' the SQ in the adapter (which means that the CM layer works as today and the server can post SQ ops before the 'unjam' is received, but they won't make it to the wire), or
b) send a back-door command to the CM, which can then move the state machine to established only after the 'unjam' is received.

Whatever is done, it cannot happen for all zero-length RDMA Reads (or Writes, for that matter). Hence the adapter must be informed that the next zero-length operation is the 'unjam' message (which also means that the server side could, in theory, omit sending the RDMA Read Response, because the RDMA Read Request was really an 'unjam'... not that I would be pushing for such an 'optimization' to avoid an extra wire message).

[...]
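The responder-side behaviour discussed in this thread (hold back the ESTABLISHED indication until the first zero-length RDMA Read arrives, and treat only that one read as the 'unjam') can be pictured as a two-state machine. The following is a toy model only, under the assumption that the adapter or CM sees each incoming read length; all names (`conn_state`, `responder`, `responder_on_rdma_read`) are hypothetical and do not correspond to any existing iw_cm or vendor interface.

```c
#include <assert.h>
#include <stddef.h>

enum conn_state {
	CONN_WAIT_UNJAM,  /* MPA exchange done, no FPDU from the initiator yet */
	CONN_ESTABLISHED  /* ESTABLISHED has been reported to the ULP */
};

struct responder {
	enum conn_state state;
};

static void responder_init(struct responder *r)
{
	assert(r != NULL);
	r->state = CONN_WAIT_UNJAM;
}

/*
 * Called for each incoming RDMA Read Request of the given length.
 * Returns 1 if this request was consumed as the 'unjam' message, in
 * which case ESTABLISHED should now be reported to the ULP; returns 0
 * for every other read, including later zero-length reads.
 */
static int responder_on_rdma_read(struct responder *r, unsigned int length)
{
	assert(r != NULL);

	if (r->state == CONN_WAIT_UNJAM && length == 0) {
		r->state = CONN_ESTABLISHED;
		return 1;
	}
	return 0;
}
```

In this model the first zero-length read after `responder_init()` moves the connection to `CONN_ESTABLISHED`, and any zero-length read after that is handled as an ordinary operation, which is the "cannot happen for all zero-length RDMA Reads" point made above.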
[ofa-general] DDR vs SDR performance
Hello,

I have a problem with the DDR performance.

Configuration:
2 servers (IBM x3755, equipped with 4 dual-core Opterons and 16GB RAM)
3 HCAs installed (2 Cisco DDR (Cheetah) and 1 Cisco dual SDR (LionMini), all PCI-e x8); all DDR HCAs are at the newest Cisco firmware, v1.2.917 build 3.2.0.149, with label 'HCA.Cheetah-DDR.20'

The DDR boards are connected with a cable, and s3n1 is running an SM. The SDR boards are connected over a Cisco SFS-7000D, but the DDR performance is roughly the same over this SFS-7000D.

Both servers are running SLES10-SP1 with OFED 1.2.5.

s3n1:~ # ibstatus
Infiniband device 'mthca0' port 1 status:   <-- DDR board #1, not connected
        default gid:     fe80::::0005:ad00:000b:cb39
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            10 Gb/sec (4X)

Infiniband device 'mthca1' port 1 status:   <--- DDR board #2, connected with cable
        default gid:     fe80::::0005:ad00:000b:cb31
        base lid:        0x16
        sm lid:          0x16
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

Infiniband device 'mthca2' port 1 status:   <--- SDR board, only port 1 connected to the SFS-7000D
        default gid:     fe80::::0005:ad00:0008:a8d9
        base lid:        0x3
        sm lid:          0x2
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

Infiniband device 'mthca2' port 2 status:
        default gid:     fe80::::0005:ad00:0008:a8da
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            10 Gb/sec (4X)

RDMA test:

-- SDR:
s3n2:~ # ib_rdma_bw -d mthca2 gpfs3n1
7190: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 |
7190: Local address: LID 0x05, QPN 0x0408, PSN 0xf10f03 RKey 0x003b00 VAddr 0x002ba7b9943000
7190: Remote address: LID 0x03, QPN 0x040a, PSN 0xa9cf5c, RKey 0x003e00 VAddr 0x002adb2f3bb000
7190: Bandwidth peak (#0 to #989): 937.129 MB/sec
7190: Bandwidth average: 937.095 MB/sec
7190: Service Demand peak (#0 to #989): 2709 cycles/KB
7190: Service Demand Avg : 2709 cycles/KB

-- DDR:
s3n2:~ # ib_rdma_bw -d mthca1 gpfs3n1
7191: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 |
7191: Local address: LID 0x10, QPN 0x0405, PSN 0x5e19e RKey 0x002600 VAddr 0x002b76eab2
7191: Remote address: LID 0x16, QPN 0x0405, PSN 0xdd976e, RKey 0x80002900 VAddr 0x002ba8ed10e000
7191: Bandwidth peak (#0 to #990): 1139.32 MB/sec
7191: Bandwidth average: 1139.31 MB/sec
7191: Service Demand peak (#0 to #990): 2228 cycles/KB
7191: Service Demand Avg : 2228 cycles/KB

So there is only a 200 MB/s increase between SDR and DDR.

With comparable hardware (x3655, dual dual-core Opteron, 8GB RAM), I get slightly better RDMA performance (1395 MB/s, so close to the PCI-e x8 limit), but even worse IPoIB and SDP performance with kernels 2.6.22 and 2.6.23.9 and OFED 1.3b.

IPoIB test (iperf), IPoIB in connected mode, MTU 65520:
#ib2 is SDR, ib1 is DDR

#SDR:
s3n2:~ # iperf -c cic-s3n1
Client connecting to cic-s3n1, TCP port 5001
TCP window size: 1.00 MByte (default)
[ 3] local 192.168.1.2 port 50598 connected with 192.168.1.1 port 5001
[ 3] 0.0-10.0 sec 6.28 GBytes 5.40 Gbits/sec

#DDR:
s3n2:~ # iperf -c cic-s3n1
Client connecting to cic-s3n1, TCP port 5001
TCP window size: 1.00 MByte (default)
[ 3] local 192.168.1.2 port 32935 connected with 192.168.1.1 port 5001
[ 3] 0.0-10.0 sec 6.91 GBytes 5.93 Gbits/sec

Now the increase is only 0.5 Gbit/s.

And finally a test with SDP:

#DDR:
s3n2:~ # LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=ok iperf -c cic-s3n1
Client connecting to cic-s3n1, TCP port 5001
TCP window size: 3.91 MByte (default)
[ 4] local 192.168.1.2 port 58186 connected with 192.168.1.1 port 5001
[ 4] 0.0-10.0 sec 7.72 GBytes 6.63 Gbits/sec

#SDR:
s3n2:~ # LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=ok iperf -c cic-s3n1
Client connecting to cic-s3n1, TCP port 5001
TCP window size: 3.91 MByte (default)
[ 4] local 192.168.1.2 port 58187 connected with 192.168.1.1 port 5001
[ 4] 0.0-10.0 sec 7.70 GBytes 6.61 Gbits/sec

With SDP there is no longer any difference between the two boards.

Even when using multiple connections(using 3 servers(s3s2,s3s3,s3s4), x3655, 2.6.22, connecting all to one(s3s1) over
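As a rough sanity check on the RDMA figures reported above, the theoretical data rate of each link can be worked out from its signalling rate. The sketch below assumes 8b/10b encoding (used by InfiniBand SDR/DDR and by first-generation PCI Express) and 1 MB = 10^6 bytes; whether ib_rdma_bw reports decimal or binary megabytes is not stated in the post, and the assumption that the PCI-e x8 slots are first-generation (2.5 GT/s per lane) is mine. Both helper names are hypothetical.

```c
#include <assert.h>

/*
 * Usable data rate in MB/s for a link with 8b/10b encoding, given its
 * aggregate signalling rate in Gbit/s: only 8 of every 10 bits on the
 * wire carry data.
 */
static double data_rate_mb_s(double signalling_gbit_s)
{
	assert(signalling_gbit_s > 0.0);
	/* Gbit/s -> Mbit/s, keep 8 of every 10 bits, then bits -> bytes */
	return signalling_gbit_s * 1000.0 * 8.0 / 10.0 / 8.0;
}

/* Fraction of a theoretical ceiling that a measurement reaches. */
static double efficiency(double measured_mb_s, double ceiling_mb_s)
{
	assert(ceiling_mb_s > 0.0);
	return measured_mb_s / ceiling_mb_s;
}
```

Under these assumptions a 4X SDR link (10 Gbit/s signalling) tops out at 1000 MB/s and a 4X DDR link (20 Gbit/s) at 2000 MB/s, the same raw ceiling as a first-generation PCI-e x8 slot before packet overhead. The reported averages of 937.095 MB/s (SDR) and 1139.31 MB/s (DDR) would then be about 94% and 57% of the respective link ceilings.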
Re: [ofa-general] [PATCH ofed-1.3] IB/IPoIB: Restore support for interface statistics
On Wednesday 28 November 2007 09:20, Moni Shoua wrote:

While moving to kernel 2.6.24 in OFED, the function for getting interface statistics got lost. This is a backport patch to re-enable net device statistics for kernels that do not have the struct net_device_stats in struct net_device. This patch fixes bug 790.

Thanks Moni, applied.

I actually applied the patch so that it created the various backport files, then committed all the backport files together in a single commit, with your authorship and signed-off-by. (I probably should have added myself as well, below your signed-off-by, since I changed the commit format, but I forgot to do this; sorry about that.)

- Jack
RE: [ofa-general] DDR vs SDR performance
Is the chipset in your servers HT2000?

Gilad.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Stijn De Smet
Sent: Wednesday, November 28, 2007 6:43 AM
To: general@lists.openfabrics.org
Subject: [ofa-general] DDR vs SDR performance

Hello, I have a problem with the DDR performance:

[...]
Re: [ofa-general] DDR vs SDR performance
One ServerWorks HT2100 A PCI Express Bridge, one HT2100 B PCI Express Bridge, and one ServerWorks HT1000 South Bridge.

Regards,
Stijn

Gilad Shainer wrote:

Is the chipset in your servers HT2000?

Gilad.

[...]
[ofa-general] Re: i got kernel oops in ib_umad when executing ULPs tests
Hi Dotan, On 11:24 Tue 27 Nov , Dotan Barak wrote: Hi. When executing SDP tests (stress_connect) i got a kernel oops in my machine in ib_umad: Is it reproducible somehow? Here are the machine props: * Host Name : sw112/3 Host Architecture : x86_64 Linux Distribution: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10 Kernel Version: 2.6.16.21-0.8-smp GCC Version : gcc (GCC) 4.1.0 (SUSE Linux) Memory size : 4049452 kB Number of CPUs: 4 cpu MHz : 3192.308 MST Version : 4.4.3 Driver Version: ofa_1_3_dev-20071126-0855 HCA ID(s) : mlx4_0 HCA model(s) : 25418 Board(s) : MT_04A0110002 * Here is the dump of the /var/log/messages: Nov 27 09:26:32 sw112 OpenSM[24713]: Exiting SM Nov 27 09:26:32 sw112 kernel: general protection fault: [1] SMP Nov 27 09:26:32 sw112 kernel: last sysfs file: /class/net/ib0/address Nov 27 09:26:32 sw112 kernel: CPU 2 Nov 27 09:26:32 sw112 kernel: Modules linked in: mst_pciconf mst_pci rdma_ucm rds ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_c m ib_sa ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core memtrack autofs4 ipv6 nfs lockd nfs_acl sunrpc af_packet button battery ac apparmor aamatch_pcre loop dm_mod ide_cd uhci_hcd ehci_hcd cdrom shpchp pci_hotplug hw_random i8xx_tco us bcore e1000 ext3 jbd edd fan thermal processor sg mptspi mptscsih mptbase scsi_transport_spi piix sd_mod scsi_mod ide_disk i de_core Nov 27 09:26:32 sw112 kernel: Pid: 24713, comm: opensm Tainted: PFU 2.6.16.21-0.8-smp #1 Nov 27 09:26:32 sw112 kernel: RIP: 0010:[8837d39f] 8837d39f{:ib_umad:dequeue_send+26} Nov 27 09:26:32 sw112 kernel: RSP: 0018:8100c0d9fde8 EFLAGS: 00010046 Nov 27 09:26:32 sw112 kernel: RAX: 8100c1a95658 RBX: 3f40a6f32b5a2004 RCX: 3f40a6f32b5a2014 Nov 27 09:26:32 sw112 kernel: RDX: 8100c0d9fe58 RSI: 3f40a6f32b5a2004 RDI: 81007401ac3c Nov 27 09:26:32 sw112 kernel: RBP: 3f40a6f32b5a2004 R08: 0206 R09: 07d7 Nov 27 09:26:32 sw112 kernel: R10: R11: 0246 R12: 81007401ac00 Nov 27 09:26:32 sw112 kernel: R13: 81007401a210 R14: 0005 R15: Nov 27 09:26:32 
sw112 kernel: FS: 2b13822edef0() GS:81012bd6b340() knlGS: Nov 27 09:26:32 sw112 kernel: CS: 0010 DS: ES: CR0: 8005003b Nov 27 09:26:32 sw112 kernel: CR2: 005d99c0 CR3: 37079000 CR4: 06e0 Nov 27 09:26:32 sw112 kernel: Process opensm (pid: 24713, threadinfo 8100c0d9e000, task 8100cd8047d0) Nov 27 09:26:32 sw112 kernel: Stack: 81012d706b10 8100c0d9fe68 81007401ac00 8837d4b1 Nov 27 09:26:32 sw112 kernel:0296 8100c0d9fe40 81007401a210 81007401a200 Nov 27 09:26:32 sw112 kernel:0005 8827261e Nov 27 09:26:32 sw112 kernel: Call Trace: 8837d4b1{:ib_umad:send_handler+38} Nov 27 09:26:32 sw112 kernel: 8827261e{:ib_mad:ib_unregister_mad_agent+359} Nov 27 09:26:32 sw112 kernel: 8837d26b{:ib_umad:ib_umad_unreg_agent+121} Nov 27 09:26:32 sw112 kernel: 8837db37{:ib_umad:ib_umad_ioctl+74} 8018b6b9{do_ioctl+33} Nov 27 09:26:32 sw112 kernel:8018b94b{vfs_ioctl+584} 801e7e6b{__up_write+33} Nov 27 09:26:32 sw112 kernel:8018b9c6{sys_ioctl+98} 8010a7be{system_call+126} Nov 27 09:26:32 sw112 kernel: Nov 27 09:26:32 sw112 kernel: Code: 48 8b 53 10 48 8b 41 08 48 89 42 08 48 89 10 48 c7 41 08 00 Nov 27 09:26:32 sw112 kernel: RIP 8837d39f{:ib_umad:dequeue_send+26} RSP 8100c0d9fde8 Here is the dump of /var/log/opensm.log: Nov 27 09:26:44 546327 [D6AC7EF0] 0x03 - OpenSM 3.1.7 Nov 27 09:26:44 546407 [D6AC7EF0] 0x80 - OpenSM 3.1.7 Nov 27 09:26:44 547422 [D6AC7EF0] 0x02 - osm_vendor_bind: Binding to port 0x4025 ^^ Is this a valid GUID? Nov 27 09:26:44 673957 [D6AC7EF0] 0x01 - osm_vendor_bind: ERR 5426: Unable to register class 129 version 1 Nov 27 09:26:44 674032 [D6AC7EF0] 0x01 - osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed Nov 27 09:26:44 674057 [D6AC7EF0] 0x01 - osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR) Nov 27 09:26:44 674089 [D6AC7EF0] 0x01 - osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind Nov 27 09:26:44 675165 [D6AC7EF0] 0x80 - Exiting SM can you check this issue? Could you send OpenSM log file too? 
Sasha
[ofa-general] [PATCH] return ENOSYS instead of -ENOSYS
Return ENOSYS instead of -ENOSYS. We are not in the kernel.

diff --git a/src/verbs.c b/src/verbs.c
index 4e7beff..7fa1dbc 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -227,7 +227,7 @@ err:
 int mlx4_resize_cq(struct ibv_cq *ibcq, int cqe)
 {
 	/* XXX resize CQ not implemented */
-	return -ENOSYS;
+	return ENOSYS;
 }

 int mlx4_destroy_cq(struct ibv_cq *cq)
--
Gleb.
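The convention behind this change can be sketched roughly as follows: kernel code reports failure as a negative errno, whereas a userspace callback of the kind patched here hands back the positive errno value. The names demo_resize_cq and demo_describe are hypothetical and exist only for illustration; this is a minimal sketch, not code from the library.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for an unimplemented userspace verb: it reports
 * failure as a positive errno value, not as -ENOSYS. */
int demo_resize_cq(int cqe)
{
	(void)cqe;	/* unused: the operation is not implemented */
	return ENOSYS;
}

/* A caller can hand the positive return value straight to strerror(). */
const char *demo_describe(int ret)
{
	assert(ret >= 0);	/* a negative value would be the kernel convention */
	return ret ? strerror(ret) : "success";
}
```

A caller that tests for ret != 0 works with either sign, but one that prints strerror(ret) only gets a sensible message with the positive form.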
RE: [ofa-general] DDR vs SDR performance
Here are some notes. You can contact me directly for more info.

1. You are not comparing the same HW. The single-port IB HCAs provide different performance than the dual-port devices. If you want to see the difference between SDR and DDR, you need to use the same IB configuration as well.

2. That said, with the single-port DDR you should get around 1400MB/s with the RDMA tests, but:
- The benchmark you are using has not been supported for a long time now. You should use the IB send, IB write etc. tests
- On Opteron, the HTxx00 chipset configuration is very important (not just for IB performance)
- Performance differs depending on the location of the memory. If you run the tests you will see numbers in the high 1300s and low 1100s (with your current chipset config)

Gilad.

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Stijn De Smet Sent: Wednesday, November 28, 2007 7:02 AM To: Gilad Shainer Cc: general@lists.openfabrics.org Subject: Re: [ofa-general] DDR vs SDR performance One ServerWorks HT2100 A PCI Express Bridge, one HT2100 B PCI Express Bridge, and one ServerWorks HT1000 South Bridge Regards, Stijn Gilad Shainer wrote: Is the chipset in your servers HT2000? Gilad. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Stijn De Smet Sent: Wednesday, November 28, 2007 6:43 AM To: general@lists.openfabrics.org Subject: [ofa-general] DDR vs SDR performance Hello, I have a problem with the DDR performance: Configuration: 2 servers (IBM x3755, equipped with 4 dualcore opteron and 16GB RAM) 3 HCA's installed (2 Cisco DDR(Cheetah) and 1 Cisco dual SDR(LionMini), all PCI-e x8), all DDR HCA's at newest Cisco Firmware v1.2.917 build 3.2.0.149, with label 'HCA.Cheetah-DDR.20' The DDR's are connected with a cable, and s3n1 is running a SM.
The SDR boards are connected over a Cisco SFS-7000D, but the DDR performance is +- the same over this SFS-7000D Both servers are running SLES10-SP1 with Ofed 1.2.5. s3n1:~ # ibstatus Infiniband device 'mthca0' port 1 status:-- DDR board #1, not connected default gid: fe80::::0005:ad00:000b:cb39 base lid:0x0 sm lid: 0x0 state: 1: DOWN phys state: 2: Polling rate:10 Gb/sec (4X) Infiniband device 'mthca1' port 1 status: --- DDR board #2, connected with cable default gid: fe80::::0005:ad00:000b:cb31 base lid:0x16 sm lid: 0x16 state: 4: ACTIVE phys state: 5: LinkUp rate:20 Gb/sec (4X DDR) Infiniband device 'mthca2' port 1 status: --- SDR board, only port 1 connected to the SFS-7000D default gid: fe80::::0005:ad00:0008:a8d9 base lid:0x3 sm lid: 0x2 state: 4: ACTIVE phys state: 5: LinkUp rate:10 Gb/sec (4X) Infiniband device 'mthca2' port 2 status: default gid: fe80::::0005:ad00:0008:a8da base lid:0x0 sm lid: 0x0 state: 1: DOWN phys state: 2: Polling rate:10 Gb/sec (4X) RDMA test of : -- SDR: s3n2:~ # ib_rdma_bw -d mthca2 gpfs3n1 7190: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 | 7190: Local address: LID 0x05, QPN 0x0408, PSN 0xf10f03 RKey 0x003b00 VAddr 0x002ba7b9943000 7190: Remote address: LID 0x03, QPN 0x040a, PSN 0xa9cf5c, RKey 0x003e00 VAddr 0x002adb2f3bb000 7190: Bandwidth peak (#0 to #989): 937.129 MB/sec 7190: Bandwidth average: 937.095 MB/sec 7190: Service Demand peak (#0 to #989): 2709 cycles/KB 7190: Service Demand Avg : 2709 cycles/KB -- DDR s3n2:~ # ib_rdma_bw -d mthca1 gpfs3n1 7191: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 | 7191: Local address: LID 0x10, QPN 0x0405, PSN 0x5e19e RKey 0x002600 VAddr 0x002b76eab2 7191: Remote address: LID 0x16, QPN 0x0405, PSN 0xdd976e, RKey 0x80002900 VAddr 0x002ba8ed10e000 7191: Bandwidth peak (#0 to #990): 1139.32 MB/sec 7191: Bandwidth average: 1139.31 MB/sec 7191: Service Demand peak (#0 to #990): 2228 cycles/KB 7191: Service Demand Avg 
: 2228 cycles/KB So only 200MB/s increase between SDR and DDR With comparable hardware(x3655, dual dualcore opteron, 8GB RAM), I get a little bit better RDMA performance(1395MB/s so close to the PCI-e x8 limit), but even worse IPoIB and SDP performance with kernels 2.6.22 and 2.6.23.9 and Ofed 1.3b IPoIB test(iperf), IPoIB in connected mode, MTU 65520: #ib2 is SDR, ib1 is DDR #SDR: s3n2:~ # iperf -c cic-s3n1 Client connecting to cic-s3n1, TCP port 5001 TCP window size: 1.00 MByte
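For reference, the theoretical numbers behind the SDR/DDR comparison in this thread can be worked out with a few lines of arithmetic. The sketch below assumes 8b/10b line encoding on both the InfiniBand link and the PCI Express slot, assumes the slot runs at 2500 Mbit/s per lane (the thread only says "PCI-e x8"), and takes MB to mean 10^6 bytes, which may not match the unit the benchmark prints. The function names are made up for illustration.

```c
#include <assert.h>

/* With 8b/10b encoding, 10 signalling bits carry one data byte, so the
 * payload rate in MB/s (10^6 bytes) is lanes * Mbit/s-per-lane / 10. */
unsigned link_data_rate_mbytes(unsigned lanes, unsigned lane_mbit_per_sec)
{
	assert(lanes > 0 && lane_mbit_per_sec > 0);
	return lanes * lane_mbit_per_sec / 10;
}

/* Integer (truncated) percentage of the theoretical rate that a
 * measurement reaches. */
unsigned link_utilisation_percent(unsigned measured_mbytes, unsigned peak_mbytes)
{
	assert(peak_mbytes > 0);
	return measured_mbytes * 100 / peak_mbytes;
}
```

Under those assumptions a 4X SDR link (2500 Mbit/s per lane) carries at most 1000 MB/s, a 4X DDR link (5000 Mbit/s per lane) 2000 MB/s, and an x8 slot 2000 MB/s before packet overhead. The reported SDR average of about 937 MB/s is then a little over 93% of the link rate, while the DDR average of about 1139 MB/s is a little under 57%.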
Re: [ofa-general] [ANNOUNCE] ibsim-0.4 tarballs release
On Wed, 2007-11-28 at 12:50 +0530, Keshetti Mahesh wrote: ibutils maintainer is Oren Kladnitsky [EMAIL PROTECTED] Not sure if he monitors this list. Sorry, I actually wanted to know who are the developers of ibadm group of utilities. ibadm or ibdm ? Your original question was about ibdm. ibdm is under the ibutils tree. I don't think Mellanox has open sourced ibadm but I might be wrong. Maybe it's just not part of OpenIB/OpenFabrics code. LASH resolves credit loops by using different VLs, I don't think ibdmchk takes this into account, but don't know for sure. Yes, I have verified in ibdmchk that it considers only one VL while checking for credit loops. I also think ibdmchk needs some support to handle LASH. I don't think it is currently supported by it (although that is not documented AFAIK). Is anyone currently working on this part (adding support to ibdmchk to handle LASH) in the OFED community? I seriously doubt it. -- Hal -Mahesh
[ofa-general] [PATCH] Bug fix IPOIB CM dereferencing invalid pointer - resend
Bug fix IPOIB CM dereferencing invalid pointer

When ipoib_neigh_free gets called it needs to set to NULL its ->cm->neigh member so that a completion with error reaching ipoib_cm_handle_tx_wc will not access an invalid pointer.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
This is what I really meant to send

 drivers/infiniband/ulp/ipoib/ipoib_main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index a03a65e..0c66723 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -869,6 +869,10 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
 	}
 	if (ipoib_cm_get(neigh))
 		ipoib_cm_destroy_tx(ipoib_cm_get(neigh));
+
+	if (neigh->cm)
+		neigh->cm->neigh = NULL;
+
 	kfree(neigh);
 }
--
1.5.3.6
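The idea described in this patch, clearing a back-pointer before the object it points to is freed, can be modelled in plain userspace C. Everything below (struct demo_neigh, struct demo_tx and the two functions) is a hypothetical, single-threaded stand-in rather than driver code; the real driver also has to deal with locking and completion ordering, which this sketch ignores.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_tx;

/* Hypothetical stand-ins for a neighbour entry and its connected-mode
 * transmit context; each holds a pointer to the other. */
struct demo_neigh {
	struct demo_tx *cm;
};

struct demo_tx {
	struct demo_neigh *neigh;
};

/* Free the neighbour, first severing the back-pointer so the transmit
 * context never refers to freed memory. */
void demo_neigh_free(struct demo_neigh *neigh)
{
	assert(neigh != NULL);
	if (neigh->cm)
		neigh->cm->neigh = NULL;
	free(neigh);
}

/* A late completion handler must tolerate a neighbour that is gone.
 * Returns 1 if the neighbour is still attached, 0 otherwise. */
int demo_handle_tx_completion(const struct demo_tx *tx)
{
	assert(tx != NULL);
	return tx->neigh != NULL;
}
```

The point of the pattern is that the handler checks the pointer instead of trusting it, so a completion that arrives after the free sees NULL rather than stale memory.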
[ofa-general] [PATCH] ipoib: Bug fix IPOIB CM dereferencing invalid pointer
Bug fix IPOIB CM dereferencing invalid pointer

When ipoib_neigh_free gets called it needs to set to NULL its ->cm member so that a completion with error reaching ipoib_cm_handle_tx_wc will not access an invalid pointer.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index a03a65e..95c7714 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -869,6 +869,8 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
 	}
 	if (ipoib_cm_get(neigh))
 		ipoib_cm_destroy_tx(ipoib_cm_get(neigh));
+
+	neigh->cm = NULL;
 	kfree(neigh);
 }
--
1.5.3.6
Re: [ofa-general] MTHCA driver from OFED 1.3a package
On Tuesday 27 November 2007 19:17, Lukas Hejtmanek wrote: On Tue, Nov 27, 2007 at 06:51:48PM +0200, Tziporet Koren wrote: just found, that OFED 1.3a with 2.6.23 kernel runs at 2/3 speed compared to 2.6.23 kernel with built in driver. Any reason for this? Which benchmark? ib_rdma_bw ib_send_bw ibv_uc_pingpong Which HCA? Mellanox InfiniBand HCA, HCA.Cheetah-DDR.20. Is it the same with ofed beta release? Did you mean 1.3b? I have not tried it. Which userspace libraries did you use with the built-in driver of the 2.6.23 kernel? - Jack
Re: [ofa-general] [PATCH] Bug fix IPOIB CM dereferencing invalid pointer - resend
Actually I see that tx->neigh is already set to NULL in ipoib_cm_destroy_tx so this fixes nothing. Although when I did this my system stopped crashing. I guess I have to dig farther. By the way this happens when I run netperf UDP and the connection is closed during the test runs.

On Wed, 2007-11-28 at 18:05 +0200, Eli Cohen wrote:

Bug fix IPOIB CM dereferencing invalid pointer

When ipoib_neigh_free gets called it needs to set to NULL its ->cm->neigh member so that a completion with error reaching ipoib_cm_handle_tx_wc will not access an invalid pointer.

Signed-off-by: Eli Cohen [EMAIL PROTECTED]
---
This is what I really meant to send

 drivers/infiniband/ulp/ipoib/ipoib_main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index a03a65e..0c66723 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -869,6 +869,10 @@ void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
 	}
 	if (ipoib_cm_get(neigh))
 		ipoib_cm_destroy_tx(ipoib_cm_get(neigh));
+
+	if (neigh->cm)
+		neigh->cm->neigh = NULL;
+
 	kfree(neigh);
 }
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
-Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 4:48 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib- [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero length RDMA Read or Write should be sent. An extraneous RDMA Read is particularly onerous for a short lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm. That is an extra round-trip per connection setup, which is a significant penalty for a short lived connection. It is trivial for HPC/peer-to-peer applications, but would be a killer for something like HTTP over RDMA. Doing something like this for *every* connection makes it effectively a change to the MPA protocol. OFA is not the forum for such discussions, the IETF is. OFA drafting an understanding of how peer-to-peer applications use the existing protocol, on the other hand, is quite reasonable. But it has to be something done by peer-to-peer middleware or by the verbs layer in response to a flag from the peer-to-peer middleware. Otherwise it is not augmenting a protocol, it is changing it. 
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Caitlin Bestler wrote: -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 4:48 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib- [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero length RDMA Read or Write should be sent. An extraneous RDMA Read is particularly onerous for a short lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm. That is an extra round-trip per connection setup, which is a significant penalty for a short lived connection. It is trivial for HPC/peer-to-peer applications, but would be a killer for something like HTTP over RDMA. Doing something like this for *every* connection makes it effectively a change to the MPA protocol. OFA is not the forum for such discussions, the IETF is. OFA drafting an understanding of how peer-to-peer applications use the existing protocol, on the other hand, is quite reasonable. But it has to be something done by peer-to-peer middleware or by the verbs layer in response to a flag from the peer-to-peer middleware. Otherwise it is not augmenting a protocol, it is changing it. posting a 0B read after the mpa setup isn't changing the MPA protocol. 
It's adding a protocol on top of the MPA setup in order to meet the requirements of the MPA protocol. Whether you add a private-data request for this or _assume_ the 0B read will happen doesn't change this.
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
On Wed, 2007-11-28 at 11:43 -0500, Caitlin Bestler wrote: -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 4:48 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib- [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero length RDMA Read or Write should be sent. An extraneous RDMA Read is particularly onerous for a short lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm. That is an extra round-trip per connection setup, which is a significant penalty for a short lived connection. It is trivial for HPC/peer-to-peer applications, but would be a killer for something like HTTP over RDMA. I find it hard to get excited about optimizing short lived connections for RDMA. I simply don't think it's an interesting use case. And btw, HTTP long ago got rid of short lived connections because it's painful even on TCP. Doing something like this for *every* connection makes it effectively a change to the MPA protocol. Uh. No, it doesn't. Normalizing the behavior of applications during connection setup doesn't change the underlying protocol. It adds another one on top. OFA is not the forum for such discussions, the IETF is. 
My living room, the dinner table, the local bar and this mailing list are perfectly acceptable forums for discussing a protocol. The IETF is the forum for standardizing one. Right now, I don't think we're ready to standardize, because we're still exploring the options; the first of which is NOT changing MPA. This group has the unique benefit of actually USING and IMPLEMENTING the protocol and therefore has some beneficial insights that may and should be shared. All that said revving the MPA protocol is way down the road. OFA drafting an understanding of how peer-to-peer applications use the existing protocol, on the other hand, is quite reasonable. That's step 1 and the 0B READ is one way to do it. But it has to be something done by peer-to-peer middleware or by the verbs layer in response to a flag from the peer-to-peer middleware. Otherwise it is not augmenting a protocol, it is changing it. The flag may be useful, however, I don't see the connection between the flag and complying with the MPA protocol.
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Kanevsky, Arkady wrote: ULP can post recvs before connection is established but not to send queue prior to connection establishment.

I hate quoting specs (and the RDMAC verbs spec isn't really any standard), but page 25 of draft-hilland-iwarp-verbs-v1.0 indicates it's OK to post SQ WRs when in idle:

"The QP MUST be in the Idle state following QP creation or when moved to this state with Modify QP. In this state, Send or Receive WRs MAY be posted but they MUST NOT be processed and CQEs MUST NOT be generated."
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
Kanevsky, Arkady wrote: Agree with initiator/client sending signalled 0B RDMA Read. This will handle client side. Still not 100% clear on passive/server side. Two issues which bother me. 1. Is bogus S-tag allowed for incoming RDMA ops?

The stag/to must not be validated if the incoming read is 0B length. http://www.ietf.org/rfc/rfc5040.txt:

* If the Data Source receives an RDMA Read Request Header with the RDMA Read Message Size set to zero, the Data Source RDMAP:
  * MUST NOT validate the Data Source STag and Data Source Tagged Offset contained in the RDMA Read Request Header, and
  * MUST respond with a zero-length RDMA Read Response Message.
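The rule quoted from RFC 5040 can be sketched as a small decision function. The header layout, the enum and the validator callback below are all hypothetical simplifications made up for this illustration; they are not taken from any iWARP implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of an RDMA Read Request header. */
struct read_req_hdr {
	uint32_t src_stag;	/* Data Source STag */
	uint64_t src_to;	/* Data Source Tagged Offset */
	uint32_t msg_size;	/* RDMA Read Message Size */
};

enum read_action {
	READ_RESPOND_ZERO_LENGTH,	/* send a zero-length Read Response */
	READ_RESPOND_DATA,		/* STag/TO valid, send the data */
	READ_TERMINATE,			/* STag/TO invalid, report an error */
};

/* Hypothetical validator type: true if the STag/TO/length triple is valid. */
typedef bool (*stag_check_fn)(uint32_t stag, uint64_t to, uint32_t len);

/* Demo validator that rejects everything, i.e. every STag is "bogus". */
bool stag_never_valid(uint32_t stag, uint64_t to, uint32_t len)
{
	(void)stag; (void)to; (void)len;
	return false;
}

/* Zero-length reads skip STag/TO validation entirely; all other reads
 * are validated before any data is returned. */
enum read_action handle_read_request(const struct read_req_hdr *hdr,
				     stag_check_fn stag_valid)
{
	assert(hdr != NULL && stag_valid != NULL);
	if (hdr->msg_size == 0)
		return READ_RESPOND_ZERO_LENGTH;
	if (!stag_valid(hdr->src_stag, hdr->src_to, hdr->msg_size))
		return READ_TERMINATE;
	return READ_RESPOND_DATA;
}
```

With the reject-everything validator, a 0B read with a bogus STag still yields READ_RESPOND_ZERO_LENGTH, while the same STag with a non-zero size is refused.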
[ofa-general] RE: [PATCH] librdmacm/man: fix-up man pages
Some users have approached me and said that it's unclear from the man pages for some values of the connection param structure what are their legal values. Reviewing this a little, I think we should add the maximum values for the retry_count and rnr_retry_count under the infiniband specific section of the rdma_connect and rdma_accept pages.

I can do this.

Also, what about pushing all these documentation changes as a release to OFED 1.3?

I'm holding off on a release until I'm fairly sure that all of the documentation changes are in. I don't foresee a problem getting documentation only changes into OFED 1.3 though.

- Sean
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
Another small discrepancy between IB and iWARP. Since RDMA_CM is used for ULPs which are transport independent, they will follow the stricter rule. That is IB. For IB any posting to SQ prior to QP being in RTS state shall be flushed. This semantic is actually very useful for ULPs which use unsignalled completions. Because, once you see the completion for the request you posted after connection failure, you are sure that all previously posted requests on the same SQ are completed and that you have seen them all. So while, you are correct on the spec since we are working in IW_CM we can assume IB semantic on posting. Thanks, Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300

-Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 28, 2007 1:52 PM To: Kanevsky, Arkady Cc: Glenn Grundstrom; Leonid Grossman; [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal Kanevsky, Arkady wrote: ULP can post recvs before connection is established but not to send queue prior to connection establishment. I hate quoting specs (and the RDMAC verbs spec isn't really any standard), but, page 25 of draft-hilland-iwarp-verbs-v1.0 indicates its ok to post SQ WRs when in idle: The QP MUST be in the Idle state following QP creation or when moved to this state with Modify QP. In this state, Send or Receive WRs MAY be posted but they MUST NOT be processed and CQEs MUST NOT be generated.
[ofa-general] Re: [PATCH] librdmacm/man: fix-up man pages
On 11/28/07, Sean Hefty [EMAIL PROTECTED] wrote: Reviewing this a little, I think we should add the maximum values for the retry_count and rnr_retry_count under the infiniband specific section of the rdma_connect and rdma_accept pages. I can do this. thanks. I'm holding off on a release until I'm fairly sure that all of the documentation changes are in. I don't foresee a problem getting documentation only changes into OFED 1.3 though. indeed, cool. Or.
Re: [ofa-general] Re: iWARP peer-to-peer CM proposal
On 11/29/07, Kanevsky, Arkady [EMAIL PROTECTED] wrote: So while, you are correct on the spec since we are working in IW_CM we can assume IB semantic on posting. please spend a minute on http://www.zip.com.au/~akpm/linux/patches/stuff/top-posting.txt Or.
[ofa-general] OFA server patching
Hi all. In the interest of keeping our server up to date, I applied the latest Ubuntu patches. Several upgrades were made, including git. If you have any problems, let me know. Thanks. -jeff
[ofa-general] [PATCH] opensm: allow multiple scopes in a partition
Hi Sasha,

This patch allows multiple scopes to be configured for a partition. This allows ipoib interfaces with different scopes to coexist in a partition. The partition configuration file can now have multiple scope=N flags and they all take effect (instead of just the last one).

Signed-off-by: Rolf Manderscheid [EMAIL PROTECTED]
--
diff --git a/opensm/man/opensm.8 b/opensm/man/opensm.8
index efd6ff0..c51f386 100644
--- a/opensm/man/opensm.8
+++ b/opensm/man/opensm.8
@@ -366,7 +366,8 @@ Currently recognized flags are:
  sl=<val>    - specifies SL for this IPoIB MC group
                (default is 0)
  scope=<val> - specifies scope for this IPoIB MC group
-               (default is 2 (link local))
+               (default is 2 (link local)).  Multiple scope settings
+               are permitted for a partition.

 Note that values for rate, mtu, and scope should be specified as defined
 in the IBTA specification (for example, mtu=4 for 2048).
diff --git a/opensm/opensm/osm_prtn_config.c b/opensm/opensm/osm_prtn_config.c
index 1253031..646bf2a 100644
--- a/opensm/opensm/osm_prtn_config.c
+++ b/opensm/opensm/osm_prtn_config.c
@@ -68,7 +68,7 @@ struct part_conf {
 	osm_log_t *p_log;
 	osm_subn_t *p_subn;
 	osm_prtn_t *p_prtn;
-	unsigned is_ipoib, mtu, rate, sl, scope;
+	unsigned is_ipoib, mtu, rate, sl, scope_mask;
 	boolean_t full;
 };

@@ -89,6 +89,7 @@ static int partition_create(unsigned lineno, struct part_conf *conf,
 			    char *name, char *id, char *flag, char *flag_val)
 {
 	uint16_t pkey;
+	unsigned int scope;

 	if (!id && name && isdigit(*name)) {
 		id = name;
@@ -119,12 +120,26 @@ static int partition_create(unsigned lineno, struct part_conf *conf,
 	}
 	conf->p_prtn->sl = (uint8_t) conf->sl;

-	if (conf->is_ipoib)
+	if (! conf->is_ipoib)
+		return 0;
+
+	if (! conf->scope_mask) {
 		osm_prtn_add_mcgroup(conf->p_log, conf->p_subn, conf->p_prtn,
 				     (uint8_t) conf->rate,
 				     (uint8_t) conf->mtu,
-				     (uint8_t) conf->scope);
+				     0);
+		return 0;
+	}
+
+	for (scope = 0; scope < 16; scope++) {
+		if (((1<<scope) & conf->scope_mask) == 0)
+			continue;
+		osm_prtn_add_mcgroup(conf->p_log, conf->p_subn, conf->p_prtn,
+				     (uint8_t) conf->rate,
+				     (uint8_t) conf->mtu,
+				     (uint8_t) scope);
+	}

 	return 0;
 }
@@ -147,11 +162,13 @@ static int partition_add_flag(unsigned lineno, struct part_conf *conf,
 				"flag \'rate\' requires valid value"
 				" - skipped\n", lineno);
 	} else if (!strncmp(flag, "scope", len)) {
-		if (!val || (conf->scope = strtoul(val, NULL, 0)) == 0)
+		unsigned int scope;
+		if (!val || (scope = strtoul(val, NULL, 0)) == 0 || scope > 0xF)
 			osm_log(conf->p_log, OSM_LOG_VERBOSE,
 				"PARSE WARN: line %d: "
 				"flag \'scope\' requires valid value"
 				" - skipped\n", lineno);
+		conf->scope_mask |= (1<<scope);
 	} else if (!strncmp(flag, "sl", len)) {
 		unsigned sl;
 		char *end;
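The bitmask approach used by this patch (one bit per scope value, so repeated scope=N flags accumulate instead of overwriting each other) can be tried out in isolation. The helpers below are hypothetical and are not the opensm functions; note that, as a deliberate variation, this sketch leaves the mask untouched when a value is rejected and also refuses trailing garbage after the number.

```c
#include <assert.h>
#include <stdlib.h>

/* Record one scope value (1..15) in a bitmask. Returns 0 on success and
 * -1 if the value is missing or out of range, leaving the mask untouched. */
int scope_mask_add(unsigned *mask, const char *val)
{
	unsigned long scope;
	char *end;

	assert(mask != NULL);
	if (!val || *val == '\0')
		return -1;
	scope = strtoul(val, &end, 0);
	if (*end != '\0' || scope == 0 || scope > 0xF)
		return -1;
	*mask |= 1u << scope;
	return 0;
}

/* Count how many scopes are set, i.e. how many multicast groups a
 * partition would get in this simplified model. */
unsigned scope_mask_count(unsigned mask)
{
	unsigned scope, n = 0;

	for (scope = 0; scope < 16; scope++)
		if (mask & (1u << scope))
			n++;
	return n;
}
```

Feeding it "2", "5" and "2" again leaves two bits set, which matches the intent that each distinct scope gets exactly one group.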
RE: [ofa-general] OFA server patching
OFA bugzilla seems down, I get: Software error: Can't connect to the database. Error: Access denied for user 'ofabug_user'@'localhost' (using password: YES) Is your database installed and up and running? Do you have the correct username and password selected in localconfig? For help, please send mail to the webmaster ([EMAIL PROTECTED]), giving this error message and the time and date of the error. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jeff Becker Sent: Wednesday, November 28, 2007 3:33 PM To: general@lists.openfabrics.org Subject: [ofa-general] OFA server patching Hi all. In the interest of keeping our server up to date, I applied the latest Ubuntu patches. Several upgrades were made, including git. If you have any problems, let me know. Thanks. -jeff
Re: [ofa-general] OFA server patching
On 15:32 Wed 28 Nov , Jeff Becker wrote:
> Hi all. In the interest of keeping our server up to date, I applied the latest Ubuntu patches. Several upgrades were made, including git.

git on the server was manually compiled and installed (from ~sashak/files/git-1.5.2). As far as I can see, the same git installation is still there.

Sasha
[ofa-general] [PATCH] [RFC] rdma/ucm: add support for rdma_migrate_id()
This is based on user feedback from Doug Ledford at RedHat:

Events that occur on an rdma_cm_id are reported to userspace through an event channel. Connection request events are reported on the event channel associated with the listen. When the connection is accepted, a new rdma_cm_id is created and automatically uses the listen event channel. This is suboptimal where the user only wants listen events on that channel.

Additionally, it may be desirable to have events related to connection establishment use a different event channel than those related to already established connections.

Allow the user to migrate an rdma_cm_id between event channels. All pending events associated with the rdma_cm_id are moved to the new event channel.

Signed-off-by: Sean Hefty [EMAIL PROTECTED]
---
I will follow this post with a patch to the librdmacm to make use of this. I wanted to get feedback on the approach, in particular about the locking and use of fget().

 drivers/infiniband/core/ucma.c |   92
 include/rdma/rdma_user_cm.h    |   13 +-
 2 files changed, 104 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 90d675a..15937eb 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -31,6 +31,7 @@
  */
 
 #include <linux/completion.h>
+#include <linux/file.h>
 #include <linux/mutex.h>
 #include <linux/poll.h>
 #include <linux/idr.h>
@@ -991,6 +992,96 @@ out:
 	return ret;
 }
 
+static void ucma_lock_files(struct ucma_file *file1, struct ucma_file *file2)
+{
+	/* Acquire mutex's based on pointer comparison to prevent deadlock. */
+	if (file1 < file2) {
+		mutex_lock(&file1->mut);
+		mutex_lock(&file2->mut);
+	} else {
+		mutex_lock(&file2->mut);
+		mutex_lock(&file1->mut);
+	}
+}
+
+static void ucma_unlock_files(struct ucma_file *file1, struct ucma_file *file2)
+{
+	if (file1 < file2) {
+		mutex_unlock(&file2->mut);
+		mutex_unlock(&file1->mut);
+	} else {
+		mutex_unlock(&file1->mut);
+		mutex_unlock(&file2->mut);
+	}
+}
+
+static void ucma_move_events(struct ucma_context *ctx, struct ucma_file *file)
+{
+	struct ucma_event *uevent, *tmp;
+
+	list_for_each_entry_safe(uevent, tmp, &ctx->file->event_list, list)
+		if (uevent->ctx == ctx)
+			list_move_tail(&uevent->list, &file->event_list);
+}
+
+static ssize_t ucma_migrate_id(struct ucma_file *new_file,
+			       const char __user *inbuf,
+			       int in_len, int out_len)
+{
+	struct rdma_ucm_migrate_id cmd;
+	struct rdma_ucm_migrate_resp resp;
+	struct ucma_context *ctx;
+	struct file *filp;
+	struct ucma_file *cur_file;
+	int ret = 0;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	/* Get current fd to protect against it being closed */
+	filp = fget(cmd.fd);
+	if (!filp)
+		return -ENOENT;
+
+	/* Validate current fd and prevent destruction of id. */
+	ctx = ucma_get_ctx(filp->private_data, cmd.id);
+	if (IS_ERR(ctx)) {
+		ret = PTR_ERR(ctx);
+		goto file_put;
+	}
+
+	cur_file = ctx->file;
+	if (cur_file == new_file) {
+		resp.events_reported = ctx->events_reported;
+		goto response;
+	}
+
+	/*
+	 * Migrate events between fd's, maintaining order, and avoiding new
+	 * events being added before existing events.
+	 */
+	ucma_lock_files(cur_file, new_file);
+	mutex_lock(&mut);
+
+	list_move_tail(&ctx->list, &new_file->ctx_list);
+	ucma_move_events(ctx, new_file);
+	ctx->file = new_file;
+	resp.events_reported = ctx->events_reported;
+
+	mutex_unlock(&mut);
+	ucma_unlock_files(cur_file, new_file);
+
+response:
+	if (copy_to_user((void __user *)(unsigned long)cmd.response,
+			 &resp, sizeof(resp)))
+		ret = -EFAULT;
+
+	ucma_put_ctx(ctx);
+file_put:
+	fput(filp);
+	return ret;
+}
+
 static ssize_t (*ucma_cmd_table[])(struct ucma_file *file,
 				   const char __user *inbuf,
 				   int in_len, int out_len) = {
@@ -1012,6 +1103,7 @@ static ssize_t (*ucma_cmd_table[])(struct ucma_file *file,
 	[RDMA_USER_CM_CMD_NOTIFY]	= ucma_notify,
 	[RDMA_USER_CM_CMD_JOIN_MCAST]	= ucma_join_multicast,
 	[RDMA_USER_CM_CMD_LEAVE_MCAST]	= ucma_leave_multicast,
+	[RDMA_USER_CM_CMD_MIGRATE_ID]	= ucma_migrate_id
 };
 
 static ssize_t ucma_write(struct file *filp, const char __user *buf,
diff --git a/include/rdma/rdma_user_cm.h b/include/rdma/rdma_user_cm.h
index 9749c1b..c557054 100644
--- a/include/rdma/rdma_user_cm.h
+++ b/include/rdma/rdma_user_cm.h
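The two ideas the patch above asks for feedback on, taking both per-file mutexes in address order and moving an id's pending events while preserving their order, can be tried out in userspace. The following is only a sketch under stated assumptions: `struct channel`, `struct event`, `channel_push`, `lock_pair`, and `migrate_events` are illustrative names, a fixed-size array stands in for the kernel's linked list, and pthread mutexes stand in for kernel mutexes. It is not the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENTS 16

/* Illustrative stand-ins; not the kernel's ucma_event / ucma_file. */
struct event { int id; int seq; };

struct channel {
	pthread_mutex_t mut;
	struct event events[MAX_EVENTS];	/* pending events, oldest first */
	size_t count;
};

static void channel_init(struct channel *ch)
{
	pthread_mutex_init(&ch->mut, NULL);
	ch->count = 0;
}

/* Queue one event; returns 0 on success, -1 if the channel is full. */
static int channel_push(struct channel *ch, int id, int seq)
{
	int ret = -1;

	pthread_mutex_lock(&ch->mut);
	if (ch->count < MAX_EVENTS) {
		ch->events[ch->count].id = id;
		ch->events[ch->count].seq = seq;
		ch->count++;
		ret = 0;
	}
	pthread_mutex_unlock(&ch->mut);
	return ret;
}

/* Take both locks in address order, so two migrations running in
 * opposite directions cannot each hold one lock and wait on the other. */
static void lock_pair(struct channel *a, struct channel *b)
{
	assert(a != b);
	if ((uintptr_t)a < (uintptr_t)b) {
		pthread_mutex_lock(&a->mut);
		pthread_mutex_lock(&b->mut);
	} else {
		pthread_mutex_lock(&b->mut);
		pthread_mutex_lock(&a->mut);
	}
}

static void unlock_pair(struct channel *a, struct channel *b)
{
	pthread_mutex_unlock(&a->mut);
	pthread_mutex_unlock(&b->mut);
}

/* Move every pending event owned by id from src to the tail of dst,
 * keeping relative order. Returns the number moved, or -1 if dst lacks
 * room (in which case nothing is moved). */
static int migrate_events(struct channel *src, struct channel *dst, int id)
{
	size_t i, kept = 0, moving = 0;

	if (src == dst)
		return 0;

	lock_pair(src, dst);
	for (i = 0; i < src->count; i++)
		if (src->events[i].id == id)
			moving++;
	if (dst->count + moving > MAX_EVENTS) {
		unlock_pair(src, dst);
		return -1;
	}
	for (i = 0; i < src->count; i++) {
		if (src->events[i].id == id)
			dst->events[dst->count++] = src->events[i];
		else
			src->events[kept++] = src->events[i];
	}
	src->count = kept;
	unlock_pair(src, dst);
	return (int)moving;
}
```

Because both locks are held for the whole move, no new event can be queued on either channel ahead of the migrated ones, which is the ordering property the patch comment describes.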
[ofa-general] [PATCH] [RFC] librdmacm: add rdma_migrate_id
This is based on user feedback from Doug Ledford at RedHat:

Events that occur on an rdma_cm_id are reported to userspace through an event channel. Connection request events are reported on the event channel associated with the listen. When the connection is accepted, a new rdma_cm_id is created and automatically uses the listen event channel. This is suboptimal where the user only wants listen events on that channel.

Additionally, it may be desirable to have events related to connection establishment use a different event channel than those related to already established connections.

Allow the user to migrate an rdma_cm_id between event channels.

Signed-off-by: Sean Hefty [EMAIL PROTECTED]
---
I started to provide support for calling rdma_migrate_id() while the user is polling for events or making other calls on the migrating id, but while the complexity seemed doable, it just didn't seem justified based on the expected usage model. I believe that the kernel interface allows this support to be added later, if it is needed.

For now, the documentation simply states that the user can only migrate an id if they are not processing events on the current event channel and not invoking another call on that id simultaneously.
 Makefile.am                 |    1 +
 examples/cmatose.c          |   59 +++
 include/rdma/rdma_cma.h     |    7 +
 include/rdma/rdma_cma_abi.h |   13 +
 man/rdma_migrate_id.3       |   27
 man/ucmatose.1              |    4 +++
 src/cma.c                   |   35 ++
 src/librdmacm.map           |    1 +
 8 files changed, 140 insertions(+), 7 deletions(-)

diff --git a/Makefile.am b/Makefile.am
index 77782da..290cbc3 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -54,6 +54,7 @@ man_MANS = \
 	man/rdma_join_multicast.3 \
 	man/rdma_leave_multicast.3 \
 	man/rdma_listen.3 \
+	man/rdma_migrate_id.3 \
 	man/rdma_notify.3 \
 	man/rdma_reject.3 \
 	man/rdma_resolve_addr.3 \
diff --git a/examples/cmatose.c b/examples/cmatose.c
index dcb6074..2f6e5f6 100644
--- a/examples/cmatose.c
+++ b/examples/cmatose.c
@@ -82,6 +82,7 @@ static int message_size = 100;
 static int message_count = 10;
 static uint8_t set_tos = 0;
 static uint8_t tos;
+static uint8_t migrate = 0;
 static char *dst_addr;
 static char *src_addr;
 
@@ -465,6 +466,35 @@ static int disconnect_events(void)
 	return ret;
 }
 
+static int migrate_channel(struct rdma_cm_id *listen_id)
+{
+	struct rdma_event_channel *channel;
+	int i, ret;
+
+	printf("migrating to new event channel\n");
+
+	channel = rdma_create_event_channel();
+	if (!channel) {
+		printf("cmatose: failed to create event channel\n");
+		return -1;
+	}
+
+	ret = 0;
+	if (listen_id)
+		ret = rdma_migrate_id(listen_id, channel);
+
+	for (i = 0; i < connections && !ret; i++)
+		ret = rdma_migrate_id(test.nodes[i].cma_id, channel);
+
+	if (!ret) {
+		rdma_destroy_event_channel(test.channel);
+		test.channel = channel;
+	} else
+		printf("cmatose: failure migrating to channel: %d\n", ret);
+
+	return ret;
+}
+
 static int get_addr(char *dst, struct sockaddr_in *addr)
 {
 	struct addrinfo *res;
@@ -543,6 +573,13 @@ static int run_server(void)
 		printf("data transfers complete\n");
 
 	}
+
+	if (migrate) {
+		ret = migrate_channel(listen_id);
+		if (ret)
+			goto out;
+	}
+
 	printf("cmatose: disconnecting\n");
 	for (i = 0; i < connections; i++) {
 		if (!test.nodes[i].connected)
@@ -592,30 +629,36 @@ static int run_client(void)
 	ret = connect_events();
 	if (ret)
-		goto out;
+		goto disc;
 
 	if (message_count) {
 		printf("receiving data transfers\n");
 		ret = poll_cqs();
 		if (ret)
-			goto out;
+			goto disc;
 
 		printf("sending replies\n");
 		for (i = 0; i < connections; i++) {
 			ret = post_sends(&test.nodes[i]);
 			if (ret)
-				goto out;
+				goto disc;
 		}
 
 		printf("data transfers complete\n");
 	}
 
 	ret = 0;
-out:
+
+	if (migrate) {
+		ret = migrate_channel(NULL);
+		if (ret)
+			goto out;
+	}
+disc:
 	ret2 = disconnect_events();
 	if (ret2)
 		ret = ret2;
-
+out:
 	return ret;
 }
 
@@ -623,7 +666,7 @@ int main(int argc, char **argv)
 {
 	int op, ret;
 
-	while ((op = getopt(argc, argv, "s:b:c:C:S:t:")) != -1) {
+	while ((op = getopt(argc, argv, "s:b:c:C:S:t:m")) != -1) {
Re: [ofa-general] OFA server patching
Working on it... Thanks.

-jeff

Scott Weitzenkamp (sweitzen) wrote:
> OFA bugzilla seems down, I get:
>
> Software error:
> Can't connect to the database.
> Error: Access denied for user 'ofabug_user'@'localhost' (using password: YES)
> Is your database installed and up and running? Do you have the correct username and password selected in localconfig?
>
> For help, please send mail to the webmaster ([EMAIL PROTECTED]), giving this error message and the time and date of the error.
>
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jeff Becker
> Sent: Wednesday, November 28, 2007 3:33 PM
> To: general@lists.openfabrics.org
> Subject: [ofa-general] OFA server patching
>
> Hi all. In the interest of keeping our server up to date, I applied the latest Ubuntu patches. Several upgrades were made, including git. If you have any problems, let me know. Thanks.
>
> -jeff
[ofa-general] Re: [PATCH 2.6.25 2/2] RDMA/cxgb3: Support 5.0 firmware.
OK, applied 1 and 2...

> Note: this change requires 5.0 firmware.

I assume the change to the cxgb3 FW versions is pending in a net driver change for 2.6.25?
[ofa-general] Re: [PATCH] IB/ehca: Fix static rate if path faster than link
thanks, applied
Re: [ofa-general] [PATCH] return ENOSYS instead of -ENOSYS
thanks, applied
[ofa-general] Re: [PATCH] libmlx4: max_recv_wr must be non-zero for non-SRQ QPs
thanks, applied
[ofa-general] Re: [PATCH 2.6.25 2/2] RDMA/cxgb3: Support 5.0 firmware.
Yes.

Roland Dreier wrote:
> OK, applied 1 and 2...
>
> > Note: this change requires 5.0 firmware.
>
> I assume the change to the cxgb3 FW versions is pending in a net driver change for 2.6.25?
[ofa-general] nightly osm_sim report 2007-11-29:normal completion
OSM Simulation Regression Summary
[Generated mail - please do NOT reply]

OpenSM binary date = 2007-11-28
OpenSM git rev = Mon_Nov_26_08:12:10_2007 [b989216e1ae91e0049ec3d4980cb8e2bdad8ed49]
ibutils git rev = Tue_Sep_4_17:57:34_2007 [4bf283f6a0d7c0264c3a1d2de92745e457585fdb]

Total=480 Pass=480 Fail=0

Pass:
36 Stability IS1-16.topo
36 Pkey IS1-16.topo
36 OsmTest IS1-16.topo
36 OsmStress IS1-16.topo
36 Multicast IS1-16.topo
36 LidMgr IS1-16.topo
12 Stability IS3-loop.topo
12 Stability IS3-128.topo
12 Pkey IS3-128.topo
12 OsmTest IS3-loop.topo
12 OsmTest IS3-128.topo
12 OsmStress IS3-128.topo
12 Multicast IS3-loop.topo
12 Multicast IS3-128.topo
12 LidMgr IS3-128.topo
12 FatTree merge-roots-4-ary-2-tree.topo
12 FatTree merge-root-4-ary-3-tree.topo
12 FatTree gnu-stallion-64.topo
12 FatTree blend-4-ary-2-tree.topo
12 FatTree RhinoDDR.topo
12 FatTree FullGnu.topo
12 FatTree 4-ary-2-tree.topo
12 FatTree 2-ary-4-tree.topo
12 FatTree 12-node-spaced.topo
12 FTreeFail 4-ary-2-tree-missing-sw-link.topo
12 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
12 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
12 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo

Failures: