Re: [ofa-general] rdma_resolve_route() returning -EINVAL
Roland Dreier wrote: The issue is more of spec compliance than a likely real-life scenario... and as for why no one else is worrying about it, I think it's because the only other user of rdma_connect() in the tree is iSER, and I guess no one worried too much there. SRP uses the IB CM directly, and waits for timewait exit before calling a connection closed. Roland, I guess there's some tradeoff here between the time connection recovery would take when the ULP does wait for the timewait event vs. the risk of the target considering the REQ as stale and rejecting it. Does SRP just wait for the event, or is the new connection established in parallel, which also means it would always use a different QP number? Or. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] SRP/mlx4 interrupts throttling performance
Cameron Harr wrote: I was able to get the latest scst code working with Vu's standalone ib_srpt and the kernel IB modules, and dropped my ib_srpt thread count to 2. However, I still get about the same IOP performance on the target although interrupts on the busy cpu have gone up to around 140K. Interesting, but now I'm at a bit of a loss as to where the bottleneck could be. I figured it was interrupts, but if the CPU is handling more right now, perhaps the problem is elsewhere? How many context switches per second do you have during your test on the target? Some time ago on the scst-devel mailing list there was a thread observing that the SRP target driver produces 10 context switches per command. See http://sourceforge.net/mailarchive/message.php?msg_id=e2e108260802070110q1fa084a1j54945d06c16c94f2%40mail.gmail.com If that is so in your case as well, it would explain your issue very well. 10 CS/cmd is definitely overkill; it should be 1 or, at most, 2 CS/cmd. BTW, I suppose you don't use the debug SCST build, do you? Vlad Cameron Cameron Harr wrote: Cameron Harr wrote: Additionally, I found that I can load the newer scst code if I use the kernel-supplied modules and the standalone srpt-1.0.0 package that I think you provide, Vu. I was about to try it along with dropping a module param for ib_srpt (I was using a thread count of 32 that had given me better performance on an earlier test). I'll report back on this. Not much luck using the newer scst code and default kernel modules (running CentOS 5.2). If I try using the default kernel modules on the initiator, I can't get them to see anything (the ofed SM pkg doesn't see any devices to run on).
When using the regular OFED on the initiator, my target dies when I try to attach to the target on the initiator: - ib_srpt: Host login i_port_id=0x0:0x2c90300026053 t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996 Oct 3 13:44:23 test05 kernel: i[4127]: scst: scst_mgmt_thread:5187:***CRITICAL ERROR*** session 8107f3222b88 is in scst_sess_shut_list, but in unknown shut phase 0 BUG at /usr/src/scst.tot/src/scst_targ.c:5188 --- [cut here ] - [please bite here ] - Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188 invalid opcode: [1] SMP last sysfs file: /devices/pci:00/:00:00.0/class CPU 2 Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U) fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 hfsplus dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 i5000_edac i2c_core edac_mc pcspkr shpchp mlx4_core e1000e ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 4127, comm: scsi_tgt_mgmt Tainted: P 2.6.18-92.1.13.el5 #1 RIP: 0010:[88488a56] [88488a56] :scst:scst_mgmt_thread+0x3ff/0x577 -
[ofa-general] ofa_1_4_kernel 20081006-0200 daily build status
This email was generated automatically, please do not reply
git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel
Common build parameters:
Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19
Failed:
Build failed on ppc64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs':
/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24'
make: *** [kernel] Error 2
--
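The warning in the log above is the usual symptom of passing the wrong type to the irq-flags helpers: on ppc64 kernels of that era, local_irq_save(flags) expands to a call like local_irq_save_ptr(&flags), which expects an unsigned long *. A minimal userspace sketch of the pattern (stand-in macros and names, an assumption about the macro shape, not the actual ehca fix):

```c
#include <assert.h>

/* Userspace stand-ins for the ppc64 kernel helpers: local_irq_save(flags)
 * expands to local_irq_save_ptr(&flags), which takes unsigned long *. */
static unsigned long pretend_irq_state = 1;   /* 1 = interrupts enabled */

static void local_irq_save_ptr(unsigned long *flags)
{
    *flags = pretend_irq_state;
    pretend_irq_state = 0;                    /* "disable" interrupts */
}
#define local_irq_save(flags) local_irq_save_ptr(&(flags))

static void local_irq_restore(unsigned long flags)
{
    pretend_irq_state = flags;
}

/* Correct usage: flags must be a plain unsigned long so that &flags
 * matches the pointer type the helper expects. Declaring it with any
 * other type is what produces the "incompatible pointer type" warning. */
unsigned long sketch_critical_section(void)
{
    unsigned long flags;
    local_irq_save(flags);
    /* ... critical section ... */
    local_irq_restore(flags);
    return pretend_irq_state;                 /* enabled again */
}
```

On such kernels the warning is worth fixing rather than ignoring, since a mismatched flags type can silently corrupt the saved interrupt state.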
Re: [ofa-general] [PATCH] SDP: fix initial recv buffer size
Amir Vadai wrote: Set initial recv buffer according to incoming hha header. Fixed bugzilla 1086: SDP Linux and SDP Windows don't work together. Have you asked Vlad to pull this? Tziporet
Re: [ofa-general] ibdm network topology format
Sasha, On Thu, Oct 2, 2008 at 6:22 PM, Hal Rosenstock [EMAIL PROTECTED] wrote: Sasha, On Thu, Oct 2, 2008 at 1:00 PM, Sasha Khapyorsky [EMAIL PROTECTED] wrote: Hi Hal, On 10:18 Thu 02 Oct , Hal Rosenstock wrote: 2. ibis doesn't register class 0x81 - SM direct routed, only SM lid routed (0x1). A comment in ibutils/ibis/src/ibsm.c line 118 states: /* no need to bind the Directed Route class as it will automatically be handled by the osm_vendor_bind if asked for LID route */ As far as I can see in osm_vendor_bind() it is not (but it is the opposite: when class 0x81 is registered, class 0x1 will be registered too). Yes, that is what osm_vendor_ibumad.c:osm_vendor_bind does. So either ibdiagnet needs to register 0x81 rather than 0x1, or osm_vendor_ibumad.c:osm_vendor_bind needs to be symmetric in terms of registering the other SM class when only one is requested. This is a minor change in the underlying semantics. [Popping up a level in terms of this, (other than applications taking advantage of this feature,) I'm not sure why the vendor layer should assume that just because one SM class is requested, the other should be too]. I just looked and the latter appears to be consistent with the other vendor layers. I think either solution will work. Your solution below also looks like it would work, but I don't think that should be done in a sim layer. I don't like this solution either, but the fact that ibis works with the real stack without registering the 0x81 class is unclear to me. Me too. See below. Somehow it works without ibsim - so I suspect user_mad handles it. (Hal, could you clarify?) The kernel (user_mad/mad) does not change the requested registrations, but I'm not sure I understand the question you are asking to be clarified. Is that what you're asking? ibis works somehow with the real stack. It registers the 0x1 class only and uses directed route SMPs. Do you have any idea about why (osm_vendor_ibumad and/or libibumad don't help)?
libibumad umad_register does not do anything that would affect this either. I can only conclude there must be something in ibutils that fixes this, if it does work with the real stack. It shouldn't be too hard to track down where that registration for class 0x81 comes from. Are you sure this is the only registration and not the DR class too? That's the first thing to confirm, or maybe you've already confirmed this and it wasn't clear to me in what you wrote. If so, I have a theory about what could be occurring. It may be an effect of the kernel MAD layer, in that a MAD agent can send any class, and when using request/response it matches on the transaction ID, which contains the MAD agent. Unsolicited messages on that other class wouldn't get through, though. I just ran a simple test of this and that appears to be the case. -- Hal
Re: [ofa-general] SRP/mlx4 interrupts throttling performance
Cameron Harr wrote: Cameron Harr wrote: Additionally, I found that I can load the newer scst code if I use the kernel-supplied modules and the standalone srpt-1.0.0 package that I think you provide, Vu. I was about to try it along with dropping a module param for ib_srpt (I was using a thread count of 32 that had given me better performance on an earlier test). I'll report back on this. Not much luck using the newer scst code and default kernel modules (running CentOS 5.2). If I try using the default kernel modules on the initiator, I can't get them to see anything (the ofed SM pkg doesn't see any devices to run on). When using the regular OFED on the initiator, my target dies when I try to attach to the target on the initiator: - ib_srpt: Host login i_port_id=0x0:0x2c90300026053 t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996 Oct 3 13:44:23 test05 kernel: i[4127]: scst: scst_mgmt_thread:5187:***CRITICAL ERROR*** session 8107f3222b88 is in scst_sess_shut_list, but in unknown shut phase 0 BUG at /usr/src/scst.tot/src/scst_targ.c:5188 --- [cut here ] - [please bite here ] - Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188 This can happen if the target driver frees some IO or TM command twice, e.g. by calling scst_tgt_cmd_done() two times for the same command.
invalid opcode: [1] SMP last sysfs file: /devices/pci:00/:00:00.0/class CPU 2 Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U) fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 hfsplus dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 i5000_edac i2c_core edac_mc pcspkr shpchp mlx4_core e1000e ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 4127, comm: scsi_tgt_mgmt Tainted: P 2.6.18-92.1.13.el5 #1 RIP: 0010:[88488a56] [88488a56] :scst:scst_mgmt_thread+0x3ff/0x577 -
Re: [ofa-general] Status of NFS over RDMA and SRP?
Steven Truelove wrote: Hi, I am considering using our existing InfiniBand interconnect to provide high-speed storage access to our compute cluster. It looks like the two ways to do this are NFS over RDMA and SRP. The SRP initiator is part of the Linux kernel and also part of OFED. The SRP target is part of OFED (starting from OFED 1.3) and was also submitted to the kernel as part of the generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). SRP is at the GA stage. Tziporet
Re: [ofa-general] rdma_resolve_route() returning -EINVAL
At 05:08 AM 10/6/2008, Or Gerlitz wrote: Roland Dreier wrote: The issue is more of spec compliance than a likely real-life scenario... and as for why no one else is worrying about it, I think it's because the only other user of rdma_connect() in the tree is iSER, and I guess no one worried too much there. SRP uses the IB CM directly, and waits for timewait exit before calling a connection closed. Roland, I guess there's some tradeoff here between the time connection recovery would take when the ULP does wait for the timewait event vs. the risk of the target considering the REQ as stale and rejecting it. Does SRP just wait for the event, or is the new connection established in parallel, which also means it would always use a different QP number? That's what the NFS/RDMA client does - I always create a new cm_id and qp. So the TIMEWAIT upcall isn't very interesting, unless I change that. My problem was that I started getting the upcall when I didn't before. Tom.
[ofa-general] RE: IPoIB CM connectivity issue.
In the second case (when OFED is the CREQ initiator) only one RC QP was used to establish a connection, and apparently bidirectional traffic was able to go through that one QP. Yes, at least in the case where you have an SRQ-capable adapter, it doesn't really matter which QP has incoming traffic. However, it was much simpler in the IPoIB implementation to simply open a QP to send traffic rather than searching through all passive connections for a connection to the same peer. Is this behavior causing problems for you? I just didn't expect TWO QPs to be used per connection. It probably won't simplify my implementation :) I assume you mean sending ARP replies. Yes, you are correct. I never noticed before but RFC 4755 does say: Additionally, all address resolution responses (ARP or Neighbor Discovery) MUST always be encapsulated in a UD mode packet. Yes, you are right. Please discard my note regarding the ARP reply. Not sure what you mean -- if Linux is sending ARP replies on a connected QP, then that is not allowed according to the RFC. I meant that Linux sending ARP replies over UD is behaviour in compliance with the document. However, looking at this quote again I see that the RFC's requirement rather unfortunately includes neighbour discovery too. It's not *too* bad to look at the ethertype in the IPoIB pseudo-header to check for an ARP packet, but sending all neighbour discovery messages via UD seems very ugly -- even just sending all ICMP6 messages via UD wouldn't be very nice to implement, and it would e.g. break ping6 with large messages, so we would have to look deep into packets to see which were ND messages. I wonder what the rationale behind that part of the RFC was? - R. Yes, it would be good to know the reason behind this. So far, handling ND messages ONLY over UD seems odd considering its function of detecting unreachable neighbours. If one has an alive RC QP to a neighbour, most likely it is reachable. Alex.
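To make the ethertype check discussed above concrete: the IPoIB link-layer header is just a 2-byte protocol field plus 2 reserved bytes, so the "must this go over UD?" test for the ARP case is cheap. A userspace sketch (struct and function names are illustrative, not the kernel's; the IPv6 neighbour discovery case is deliberately left out, since classifying it requires the deep packet inspection objected to above):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

#define IPOIB_ETH_P_ARP 0x0806   /* ARP ethertype */

/* The 4-byte IPoIB link-layer pseudo-header: a 2-byte ethertype
 * followed by 2 reserved bytes. */
struct ipoib_pseudo_hdr {
    uint16_t proto;      /* ethertype, network byte order */
    uint16_t reserved;
};

/* Return 1 if RFC 4755 forces this packet onto the UD QP (ARP only;
 * ND messages would need the payload parsed, which this sketch skips). */
int ipoib_must_use_ud(const struct ipoib_pseudo_hdr *hdr)
{
    return ntohs(hdr->proto) == IPOIB_ETH_P_ARP;
}
```

Everything that fails this test could stay on the connected (RC) QP; the open question in the thread is what to do with the ND traffic the RFC also covers.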
Re: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache
Hi Yevgeny, On Sun, Oct 5, 2008 at 9:26 PM, Yevgeny Kliteynik [EMAIL PROTECTED] wrote: Hi Sasha, The following series of 6 patches implements a unicast routing cache in OpenSM. This implementation (v2; the previous version was sent before OFED 1.3) was rewritten from scratch: - no caching of existing connectivity - no caching of existing lid matrices - each switch has an LFT buffer that contains the result of the last routing engine execution (instead of one buffer in ucast_mgr) - links/ports/nodes changes are spotted during the discovery - only the links/ports/nodes that went down are cached - when a switch goes down, its lid matrices and LFT are cached. Cached routing can be used in one of the following cases: - there is no topology change - one or more CAs disappeared - one or more leaf switches disappeared. In these cases the cached routing is written to the switches as is (unless a switch no longer exists). If there is any other topology change, the existing cache is invalidated and the routing engine(s) run as usual. Glad to see this! A few comments/questions: It seems that there is an LFT cache per switch. This seems to be a big memory penalty to me (in large subnets). So I have two questions related to this: Can this only be done this way when cached routing is being used? Also, when cached routing is being used, is this only needed for leaf switches? I'm wondering, when there is a cached node match, whether the available peer ports/neighbors are validated (or something equivalent) to know the cache is valid? It might also include whether a switch is still a leaf switch (which may be redundant, as that should show up as a peer port/neighbor change). It looks like the structure is there for this, but I didn't review the code in detail. Are you sure all the memory allocation failures are handled properly within the routing cache code? What I mean is, when NULL is returned, does this always result in the cache not being used and the routing recalculated?
Also, in that case, should some log message be emitted rather than hiding this? Nit: doc/current-routing.txt should also be updated for this feature. -- Hal The patches are:
- patch 1/6: move lft_buf from ucast_mgr to osm_switch
- patch 2/6: add the -A or --ucast_cache option to opensm
- patch 3/6: add the osm_ucast_cache.{c,h} files (this is the cache implementation itself)
- patch 4/6: add the new cache files to the makefile
- patch 5/6: integrate the unicast cache into the discovery and ucast manager
- patch 6/6: man entry for cached routing
-- Yevgeny
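For what it's worth, the reuse rule in the patch summary boils down to a simple predicate over the set of topology changes seen during discovery. A sketch with made-up flag names (these are not the patch's actual data structures):

```c
/* Classification of topology changes observed by the discovery pass. */
enum topo_change {
    TOPO_CA_REMOVED          = 1 << 0,  /* one or more CAs disappeared */
    TOPO_LEAF_SWITCH_REMOVED = 1 << 1,  /* one or more leaf switches disappeared */
    TOPO_OTHER_CHANGE        = 1 << 2,  /* anything else */
};

/* Cached routing may be reused only when nothing changed, or when the
 * only changes are disappeared CAs and/or leaf switches; any other
 * change invalidates the cache and the routing engine runs as usual. */
int ucast_cache_usable(unsigned int changes)
{
    return (changes & ~(TOPO_CA_REMOVED | TOPO_LEAF_SWITCH_REMOVED)) == 0;
}
```

Hal's questions above are essentially about how safely the discovery pass can be trusted to map the real change set onto those flags.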
[ofa-general] Allowing end-users to query for fabric information
Roland, I've been thinking about this some more and I have to say I'm still a bit confused. Are you saying that any root user on any node of the fabric can change the routing tables? Isn't the ability to access and alter subnet information controlled via the management key? -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Mike Heinz Sent: Monday, September 22, 2008 3:19 PM To: Roland Dreier Cc: general@lists.openfabrics.org Subject: RE: [ofa-general] Allowing end-users to query for fabric information Thanks for the explanation. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -Original Message- From: Roland Dreier [mailto:[EMAIL PROTECTED] Sent: Monday, September 22, 2008 3:18 PM To: Mike Heinz Cc: general@lists.openfabrics.org Subject: Re: [ofa-general] Allowing end-users to query for fabric information What was the reason for making this design choice? While I could certainly provide boot scripts to change the permissions to /dev/infiniband/umad*, I'd rather understand why the decision was made to restrict access. because /dev/infiniband/umadX allows full unfiltered access to send/receive any MADs. Including changing routing tables, bringing ports down, etc. Not stuff that unprivileged users should be able to do. It would make sense to have a higher-level interface that only allows safe queries without side effects, but that's quite a bit more work than just changing permissions on device nodes. - R.
Re: [ofa-general] Allowing end-users to query for fabric information
Mike, On Mon, Oct 6, 2008 at 11:09 AM, Mike Heinz [EMAIL PROTECTED] wrote: Roland, I've been thinking about this some more and I have to say I'm still a bit confused. Are you saying that any root user on any node of the fabric can change the routing tables? Isn't the ability to access and alter subnet information controlled via the management key? There are two levels to this. First you must be able to send the MAD and once that can happen the receiving SMA performs the usual MKey checks which depend on the protection level assuming it is an SM class MAD like the one to change the routing tables. -- Hal -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Mike Heinz Sent: Monday, September 22, 2008 3:19 PM To: Roland Dreier Cc: general@lists.openfabrics.org Subject: RE: [ofa-general] Allowing end-users to query for fabric information Thanks for the explanation. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -Original Message- From: Roland Dreier [mailto:[EMAIL PROTECTED] Sent: Monday, September 22, 2008 3:18 PM To: Mike Heinz Cc: general@lists.openfabrics.org Subject: Re: [ofa-general] Allowing end-users to query for fabric information What was the reason for making this design choice? While I could certainly provide boot scripts to change the permissions to /dev/infiniband/umad*, I'd rather understand why the decision was made to restrict access. because /dev/infiniband/umadX allows full unfiltered access to send/receive any MADs. Including changing routing tables, bringing ports down, etc. Not stuff that unprivileged users should be able to do. It would make sense to have a higher-level interface that only allows safe queries without side effects, but that's quite a bit more work than just changing permissions on device nodes. - R. 
RE: [ofa-general] Allowing end-users to query for fabric information
Well, I guess that's my point - I'd like to be able to create tools for non-root users that would collect interesting information about the fabric. As far as I know, this should be a safe operation, because the SA should be protected by the m-key - but it seems that the policy in OFED is that this is not a safe operation and access must be tightly controlled. While it's a trivial task to patch OFED to give non-root users access to the /dev/infiniband/umad* devices, I certainly don't want to provide tools to my users that create security holes in the fabric. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -Original Message- From: Hal Rosenstock [mailto:[EMAIL PROTECTED] Sent: Monday, October 06, 2008 11:16 AM To: Mike Heinz Cc: Roland Dreier; general@lists.openfabrics.org Subject: Re: [ofa-general] Allowing end-users to query for fabric information Mike, On Mon, Oct 6, 2008 at 11:09 AM, Mike Heinz [EMAIL PROTECTED] wrote: Roland, I've been thinking about this some more and I have to say I'm still a bit confused. Are you saying that any root user on any node of the fabric can change the routing tables? Isn't the ability to access and alter subnet information controlled via the management key? There are two levels to this. First you must be able to send the MAD and once that can happen the receiving SMA performs the usual MKey checks which depend on the protection level assuming it is an SM class MAD like the one to change the routing tables. -- Hal -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Mike Heinz Sent: Monday, September 22, 2008 3:19 PM To: Roland Dreier Cc: general@lists.openfabrics.org Subject: RE: [ofa-general] Allowing end-users to query for fabric information Thanks for the explanation. 
-- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -Original Message- From: Roland Dreier [mailto:[EMAIL PROTECTED] Sent: Monday, September 22, 2008 3:18 PM To: Mike Heinz Cc: general@lists.openfabrics.org Subject: Re: [ofa-general] Allowing end-users to query for fabric information What was the reason for making this design choice? While I could certainly provide boot scripts to change the permissions to /dev/infiniband/umad*, I'd rather understand why the decision was made to restrict access. because /dev/infiniband/umadX allows full unfiltered access to send/receive any MADs. Including changing routing tables, bringing ports down, etc. Not stuff that unprivileged users should be able to do. It would make sense to have a higher-level interface that only allows safe queries without side effects, but that's quite a bit more work than just changing permissions on device nodes. - R.
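As an aside, the "boot scripts to change the permissions" approach mentioned above is typically expressed as a udev rule. A hypothetical fragment (the match keys, group name, and mode here are assumptions, not OFED policy - and per Roland's point this still grants full unfiltered MAD access, so it does not address the safe-query problem):

```
# /etc/udev/rules.d/90-umad-access.rules  (hypothetical example)
KERNEL=="umad*", SUBSYSTEM=="infiniband_mad", GROUP="rdma", MODE="0660"
```

Members of the chosen group would then get the same raw access root has today, which is exactly the trade-off this thread is debating.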
[ofa-general] OFED meeting agenda for today (Oct 6)
Agenda for the OFED meeting today on OFED 1.4 status toward RC3:
1. Interop event status - Rupert
2. RC3 features:
   1. NFS-RDMA to work on RHEL 5.1 - done
   2. OSM: Cached routing - patches sent - should be committed in a day or two
   3. Cleanup of compilation warnings - Mellanox started - any progress by other companies?
3. OFED testing status - all
4. Critical bugs review:
   1128 blo Other [EMAIL PROTECTED] release IPoIB-CM QP resources in flushing CQE context
   1113 cri RHEL  [EMAIL PROTECTED] rpm -e scsi-target-utils-0.1-2008715 fails
   1198 cri SLES  [EMAIL PROTECTED] hang during ipoib create_child/ifdown
   1164 maj SLES  [EMAIL PROTECTED] iperf over IPoIB fails for 100 tcp connections
   1247 maj RHEL  [EMAIL PROTECTED] ipoib_ud_test caused kernel oops on ofed_1_4 (sw083/084)
   1221 maj SLES  [EMAIL PROTECTED] SLES10 sp2: remote logins via ssh fail due to rpcbind and...
   1248 maj SLES  [EMAIL PROTECTED] Bonding - after reboot the host stucks while raising the ...
   1099 maj All   [EMAIL PROTECTED] IPoIB IPv6 does not work on RH4
   1153 maj Other [EMAIL PROTECTED] OpenSM - Multicast group will not open when IB host is the...
5. OFA BOF at SC08 - Woody
6. Open discussion
Tziporet
Re: [ofa-general] SRP/mlx4 interrupts throttling performance
Vlad, Thanks for the suggestion. As I look via vmstat, my CSw/s rate is fairly constant around 280K when scst_threads=1 (per Vu's suggestion) and pops up to ~330-340K CSw/s when scst_threads is set to 8. I'm currently doing 512B writes, and this gives me about a 4:1 ratio of context switches to IOPs with 1 SCST thread (70K IOPs) and around 4.5:1 when there are 8 SCST threads (75K IOPs). You say those numbers could be overkill - do you know of a way to drop the number? I'm very interested in trying Vu's other suggestions (multiple initiators and multiple QPs), but my other initiator has been too busy all weekend to run on. Debug, tracing, and all that was turned off in the SCST Makefiles. -Cameron Vladislav Bolkhovitin wrote: Cameron Harr wrote: I was able to get the latest scst code working with Vu's standalone ib_srpt and the kernel IB modules, and dropped my ib_srpt thread count to 2. However, I still get about the same IOP performance on the target although interrupts on the busy cpu have gone up to around 140K. Interesting, but now I'm at a bit of a loss as to where the bottleneck could be. I figured it was interrupts, but if the CPU is handling more right now, perhaps the problem is elsewhere? How many context switches per second do you have during your test on the target? Some time ago on the scst-devel mailing list there was a thread observing that the SRP target driver produces 10 context switches per command. See http://sourceforge.net/mailarchive/message.php?msg_id=e2e108260802070110q1fa084a1j54945d06c16c94f2%40mail.gmail.com If that is so in your case as well, it would explain your issue very well. 10 CS/cmd is definitely overkill; it should be 1 or, at most, 2 CS/cmd. BTW, I suppose you don't use the debug SCST build, do you? Vlad Cameron Cameron Harr wrote: Cameron Harr wrote: Additionally, I found that I can load the newer scst code if I use the kernel-supplied modules and the standalone srpt-1.0.0 package that I think you provide, Vu.
I was about to try it along with dropping a module param for ib_srpt (I was using a thread count of 32 that had given me better performance on an earlier test). I'll report back on this.

Not much luck using the newer scst code and default kernel modules (running CentOS 5.2). If I try using the default kernel modules on the initiator, I can't get them to see anything (the ofed SM pkg doesn't see any devices to run on). When using the regular OFED on the initiator, my target dies when I try to attach to the target on the initiator:

ib_srpt: Host login i_port_id=0x0:0x2c90300026053 t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996
Oct 3 13:44:23 test05 kernel: i[4127]: scst: scst_mgmt_thread:5187:***CRITICAL ERROR*** session 8107f3222b88 is in scst_sess_shut_list, but in unknown shut phase 0
BUG at /usr/src/scst.tot/src/scst_targ.c:5188
--- [cut here ] - [please bite here ] -
Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188
invalid opcode: [1] SMP
last sysfs file: /devices/pci:00/:00:00.0/class
CPU 2
Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U) fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 hfsplus dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 i5000_edac i2c_core edac_mc pcspkr shpchp mlx4_core e1000e ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 4127, comm: scsi_tgt_mgmt Tainted: P 2.6.18-92.1.13.el5 #1
RIP: 0010:[88488a56] [88488a56] :scst:scst_mgmt_thread+0x3ff/0x577

___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] OFED Roll
Am I correct that the Cisco OFED Roll installs InfiniBand but not IP over InfiniBand (IPoIB)? Does it just use RDMA as a transport? The OFED download from OpenFabrics installs IPoIB, and I prefer not to use it, since its latencies are double those of RDMA and its throughput is about one half of RDMA's.

So, another question: suppose I have already installed OFED 1.3.1 with IPoIB. How do I configure my system (including the ib0.conf and other conf files) to use RDMA rather than IPoIB?

Thanks for your help.

Jim
Re: [ofa-general] [PATCH/RFC] IB/mthca: Use pci_request_regions()
On Mon, Sep 29, 2008 at 09:41:37PM -0700, Roland Dreier wrote:

Back in prehistoric (pre-git!) days, the kernel's MSI-X support did request_mem_region() on a device's MSI-X tables, which meant that a driver that enabled MSI-X couldn't use pci_request_regions() (since that would clash with the PCI layer's MSI-X request). However, that was removed (by me!) years ago, so mthca can just use pci_request_regions() and pci_release_regions() instead of its own much more complicated code that avoids requesting the MSI-X tables.

Looks like a nice diet for the code.

Acked-by: Eli Cohen [EMAIL PROTECTED]
Re: [ofa-general] Allowing end-users to query for fabric information
On Mon, Oct 6, 2008 at 11:27 AM, Mike Heinz [EMAIL PROTECTED] wrote:

Well, I guess that's my point - I'd like to be able to create tools for non-root users that would collect interesting information about the fabric. As far as I know, this should be a safe operation, because the SA should be protected by the m-key - but it seems that the policy in OFED is that this is not a safe operation and access must be tightly controlled.

Do you mean SM or SA? Subverting the SM is not a good idea. The SM is the central point for setting up SM attributes. Policy needs to be instilled through the SM. There are some SA attributes which are somewhat dangerous too, as they are essentially writable as well from an end node. Furthermore, most fabrics do not utilize MKey protection, so the second level is not there yet, and only the most primitive form of this is available within some SMs.

While it's a trivial task to patch OFED to give non-root users access to the /dev/infiniband/umad* devices, I certainly don't want to provide tools to my users that create security holes in the fabric.

IMO this would do that, although I would phrase it slightly differently.

-- Hal

-- Michael Heinz, Principal Engineer, Qlogic Corporation, King of Prussia, Pennsylvania

-----Original Message-----
From: Hal Rosenstock [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 06, 2008 11:16 AM
To: Mike Heinz
Cc: Roland Dreier; general@lists.openfabrics.org
Subject: Re: [ofa-general] Allowing end-users to query for fabric information

Mike,

On Mon, Oct 6, 2008 at 11:09 AM, Mike Heinz [EMAIL PROTECTED] wrote:

Roland, I've been thinking about this some more and I have to say I'm still a bit confused. Are you saying that any root user on any node of the fabric can change the routing tables? Isn't the ability to access and alter subnet information controlled via the management key?

There are two levels to this. First you must be able to send the MAD, and once that can happen, the receiving SMA performs the usual MKey checks, which depend on the protection level, assuming it is an SM class MAD like the one to change the routing tables.

-- Hal

-- Michael Heinz, Principal Engineer, Qlogic Corporation, King of Prussia, Pennsylvania

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Mike Heinz
Sent: Monday, September 22, 2008 3:19 PM
To: Roland Dreier
Cc: general@lists.openfabrics.org
Subject: RE: [ofa-general] Allowing end-users to query for fabric information

Thanks for the explanation.

-- Michael Heinz, Principal Engineer, Qlogic Corporation, King of Prussia, Pennsylvania

-----Original Message-----
From: Roland Dreier [mailto:[EMAIL PROTECTED]]
Sent: Monday, September 22, 2008 3:18 PM
To: Mike Heinz
Cc: general@lists.openfabrics.org
Subject: Re: [ofa-general] Allowing end-users to query for fabric information

What was the reason for making this design choice? While I could certainly provide boot scripts to change the permissions to /dev/infiniband/umad*, I'd rather understand why the decision was made to restrict access.

Because /dev/infiniband/umadX allows full unfiltered access to send/receive any MADs, including changing routing tables, bringing ports down, etc. Not stuff that unprivileged users should be able to do. It would make sense to have a higher-level interface that only allows safe queries without side effects, but that's quite a bit more work than just changing permissions on device nodes.

- R.
Re: [ofa-general] SRP/mlx4 interrupts throttling performance
Cameron Harr wrote: Vlad, Thanks for the suggestion. As I look via vmstat, my CSw/s rate is fairly constant around 280K when scst_threads=1 (per Vu's suggestion) and pops up to ~330-340K CSw/s when scst_threads is set to 8. I'm currently doing 512B writes, and this gives me about a 4:1 ratio of context switches to IOPs with 1 SCST thread (70K IOPs) and around 4.5:1 when there are 8 SCST threads (75K IOPs).

This is still too high. Considering that each CS is about 1 microsecond, you can estimate how many IOPS it costs you.

You say those numbers could be overkill - do you know of a way to drop the number?

Sorry, I don't. I can only say that the too-many-CSs problem is in the SRPT driver. With the qla2x00t driver and BLOCKIO backstorage you will have 1 CS/sec or less on average.

I'm very interested in trying Vu's other suggestions (multiple initiators and multiple QPs), but my other initiator has been too busy all weekend to run on. Debug, tracing, and all that was turned off in the SCST Makefiles.

-Cameron

Vladislav Bolkhovitin wrote: Cameron Harr wrote: I was able to get the latest scst code working with Vu's standalone ib_srpt and the kernel IB modules, and dropped my ib_srpt thread count to 2. However, I still get about the same IOP performance on the target although interrupts on the busy cpu have gone up to around 140K. Interesting, but now I'm at a bit of a loss as to where the bottleneck could be. I figured it was interrupts, but if the CPU is handling more right now, perhaps the problem is elsewhere?

How many context switches per second do you have during your test on the target? Once in the scst-devel mailing list there was a thread about an observation that the SRP target driver produces 10 context switches per command. See http://sourceforge.net/mailarchive/message.php?msg_id=e2e108260802070110q1fa084a1j54945d06c16c94f2%40mail.gmail.com If it is so in your case as well, it would very well explain your issue. 10 CS/cmd is a definite overkill; it should be 1 or, at max, 2 CS/cmd. BTW, I suppose you don't use the debug SCST build, do you?

Vlad
[ofa-general] [PATCH 00/03] RDMA Transport Support for 9P
Roland:

This patchset implements an RDMA transport provider for the v9fs (Plan 9 filesystem). Could you take a look at it and let us know what you think?

Thanks,
Tom

Here is the original posting...

Eric:

This patch series implements an RDMA transport provider for 9P and is relative to your for-next branch. The RDMA support is built on the OpenFabrics API and uses SEND and RECV to exchange data. This patch series has been tested with dbench and iozone.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]

[PATCH 01/03] 9prdma: RDMA Transport Support for 9P
 net/9p/trans_rdma.c | 996 +++
 1 files changed, 996 insertions(+), 0 deletions(-)

[PATCH 02/03] 9prdma: Makefile change for the RDMA transport
 net/9p/Makefile | 4
 1 files changed, 4 insertions(+), 0 deletions(-)

[PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport
 net/9p/Kconfig | 6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)
[ofa-general] [PATCH 02/03] 9prdma: Makefile change for the RDMA transport
This adds a make rule for the 9pnet_rdma module that implements the RDMA transport.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]
---
 net/9p/Makefile | 4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/9p/Makefile b/net/9p/Makefile
index 5192194..bc909ab 100644
--- a/net/9p/Makefile
+++ b/net/9p/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_NET_9P) := 9pnet.o
 obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o
+obj-$(CONFIG_NET_9P_RDMA) += 9pnet_rdma.o
 
 9pnet-objs := \
 	mod.o \
@@ -12,3 +13,6 @@ obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o
 
 9pnet_virtio-objs := \
 	trans_virtio.o \
+
+9pnet_rdma-objs := \
+	trans_rdma.o \
[ofa-general] [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport
This patch adds a config option for the 9P RDMA transport.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]
---
 net/9p/Kconfig | 6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/9p/Kconfig b/net/9p/Kconfig
index ff34c5a..c42c0c4 100644
--- a/net/9p/Kconfig
+++ b/net/9p/Kconfig
@@ -20,6 +20,12 @@ config NET_9P_VIRTIO
 	  This builds support for a transports between
 	  guest partitions and a host partition.
 
+config NET_9P_RDMA
+	depends on NET_9P && INFINIBAND && EXPERIMENTAL
+	tristate "9P RDMA Transport (Experimental)"
+	help
+	  This builds support for a RDMA transport.
+
 config NET_9P_DEBUG
 	bool "Debug information"
 	depends on NET_9P
[ofa-general] [PATCH 01/03] 9prdma: RDMA Transport Support for 9P
This file implements the RDMA transport provider for 9P. It allows mounts to be performed over iWARP- and IB-capable network interfaces and uses the OpenFabrics API to perform I/O.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]
---
 net/9p/trans_rdma.c | 1025 +++
 1 files changed, 1025 insertions(+), 0 deletions(-)

diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
new file mode 100644
index 000..f919768
--- /dev/null
+++ b/net/9p/trans_rdma.c
@@ -0,0 +1,1025 @@
+/*
+ * linux/fs/9p/trans_rdma.c
+ *
+ * RDMA transport layer based on the trans_fd.c implementation.
+ *
+ * Copyright (C) 2008 by Tom Tucker [EMAIL PROTECTED]
+ * Copyright (C) 2006 by Russ Cox [EMAIL PROTECTED]
+ * Copyright (C) 2004-2005 by Latchesar Ionkov [EMAIL PROTECTED]
+ * Copyright (C) 2004-2008 by Eric Van Hensbergen [EMAIL PROTECTED]
+ * Copyright (C) 1997-2002 by Ron Minnich [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to:
+ *  Free Software Foundation
+ *  51 Franklin Street, Fifth Floor
+ *  Boston, MA 02111-1301 USA
+ *
+ */
+
+#include <linux/in.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/ipv6.h>
+#include <linux/kthread.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/un.h>
+#include <linux/uaccess.h>
+#include <linux/inet.h>
+#include <linux/idr.h>
+#include <linux/file.h>
+#include <linux/parser.h>
+#include <net/9p/9p.h>
+#include <net/9p/transport.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include <rdma/ib_verbs.h>
+
+#define P9_PORT			5640
+#define P9_RDMA_SQ_DEPTH	32
+#define P9_RDMA_RQ_DEPTH	32
+#define P9_RDMA_SEND_SGE	4
+#define P9_RDMA_RECV_SGE	4
+#define P9_RDMA_IRD		0
+#define P9_RDMA_ORD		0
+#define P9_RDMA_TIMEOUT		30000		/* 30 seconds */
+#define P9_RDMA_MAXSIZE		(4*4096)	/* Min SGE is 4, so we can
+						 * safely advertise a maxsize
+						 * of 64k */
+
+#define P9_RDMA_MAX_SGE (P9_RDMA_MAXSIZE >> PAGE_SHIFT)
+/**
+ * struct p9_trans_rdma - RDMA transport instance
+ *
+ * @state: tracks the transport state machine for connection setup and tear down
+ * @cm_id: The RDMA CM ID
+ * @pd: Protection Domain pointer
+ * @qp: Queue Pair pointer
+ * @cq: Completion Queue pointer
+ * @lkey: The local access only memory region key
+ * @next_tag: The next tag for tracking rpc
+ * @timeout: Number of uSecs to wait for connection management events
+ * @sq_depth: The depth of the Send Queue
+ * @sq_count: Number of WR on the Send Queue
+ * @rq_depth: The depth of the Receive Queue. NB: I _think_ that 9P is
+ * purely req/rpl (i.e. no unaffiliated replies), but I'm not sure, so
+ * I'm allowing this to be tweaked separately.
+ * @addr: The remote peer's address
+ * @req_lock: Protects the active request list
+ * @req_list: List of sent RPCs awaiting replies
+ * @send_wait: Wait list when the SQ fills up
+ * @cm_done: Completion event for connection management tracking
+ */
+struct p9_trans_rdma {
+	enum {
+		P9_RDMA_INIT,
+		P9_RDMA_ADDR_RESOLVED,
+		P9_RDMA_ROUTE_RESOLVED,
+		P9_RDMA_CONNECTED,
+		P9_RDMA_FLUSHING,
+		P9_RDMA_CLOSING,
+		P9_RDMA_CLOSED,
+	} state;
+	struct rdma_cm_id *cm_id;
+	struct ib_pd *pd;
+	struct ib_qp *qp;
+	struct ib_cq *cq;
+	struct ib_mr *dma_mr;
+	u32 lkey;
+	atomic_t next_tag;
+	long timeout;
+	int sq_depth;
+	atomic_t sq_count;
+	int rq_depth;
+	struct sockaddr_in addr;
+
+	spinlock_t req_lock;
+	struct list_head req_list;
+
+	wait_queue_head_t send_wait;
+	struct completion cm_done;
+	struct p9_idpool *tagpool;
+};
+
+/**
+ * p9_rdma_context - Keeps track of in-process WR
+ *
+ * @wc_op: Mellanox's broken HW doesn't provide the original WR op
+ * when the CQE completes in error. This forces apps to keep track of
+ * the op themselves. Yes, it's a Pet Peeve of mine ;-)
+ * @busa: Bus address to unmap when the WR completes
+ * @req: Keeps track of requests (send)
+ * @rcall: Keeps track of replies (receive)
+ */
+struct p9_rdma_req;
+struct p9_rdma_context {
+	enum ib_wc_opcode wc_op;
+	dma_addr_t busa;
+	union {
Re: [ofa-general] [PATCH 00/03] RDMA Transport Support for 9P
This patchset implements an RDMA transport provider for the v9fs (Plan 9 filesystem). Could you take a look at it and let us know what you think?

I sent comments on the initial posting I saw on lkml ... did they not make it to you?

[PATCH 01/03] 9prdma: RDMA Transport Support for 9P
[PATCH 02/03] 9prdma: Makefile change for the RDMA transport
[PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport

One meta-comment I didn't send last time: the patches are small enough that I would just send it all in one patch, since it makes sense to apply it that way anyway.

- R.
[ofa-general] Re: Continue of defer skb_orphan() until irqs enabled
Sorry for the delay in getting back to you on this, but we've done our testing and haven't found any problems. As I mentioned, we're using OFED 1.3.1, so the patch had to be tweaked a bit. The patch we used follows.

--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2008-09-09 15:53:24.856316458 -0700
+++ e/drivers/infiniband/ulp/ipoib/ipoib.h	2008-09-29 10:19:00.833519991 -0700
@@ -345,10 +345,9 @@ struct ipoib_ethtool_st {
 };
 
 /*
- * Device private locking: tx_lock protects members used in TX fast
- * path (and we use LLTX so upper layers don't do extra locking).
- * lock protects everything else. lock nests inside of tx_lock (ie
- * tx_lock must be acquired first if needed).
+ * Device private locking: network stack tx_lock protects members used
+ * in TX fast path, lock protects everything else. lock nests inside
+ * of tx_lock (ie tx_lock must be acquired first if needed).
  */
 struct ipoib_dev_priv {
 	spinlock_t lock;
@@ -397,7 +396,6 @@ struct ipoib_dev_priv {
 	struct ipoib_vmap rx_vmap_ring;
 	struct ipoib_sg_rx_buf *rx_ring;
 
-	spinlock_t tx_lock;
 	struct ipoib_vmap tx_vmap_ring;
 	struct ipoib_tx_buf *tx_ring;
 	unsigned tx_head;
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2008-09-09 15:53:24.856316458 -0700
+++ e/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2008-09-26 13:38:00.066208156 -0700
@@ -776,7 +776,8 @@ void ipoib_cm_handle_tx_wc(struct net_de
 	dev_kfree_skb_any(tx_req->skb);
 
-	spin_lock_irqsave(&priv->tx_lock, flags);
+	netif_tx_lock(dev);
+
 	++tx->tx_tail;
 	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
 	    netif_queue_stopped(dev) &&
@@ -791,7 +792,7 @@
 		   "(status=%d, wrid=%d vend_err %x)\n",
 		   wc->status, wr_id, wc->vendor_err);
 
-	spin_lock(&priv->lock);
+	spin_lock_irqsave(&priv->lock, flags);
 	neigh = tx->neigh;
 
 	if (neigh) {
@@ -811,10 +812,10 @@
 		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
 
-	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->lock, flags);
 	}
 
-	spin_unlock_irqrestore(&priv->tx_lock, flags);
+	netif_tx_unlock(dev);
 }
 
 int ipoib_cm_dev_open(struct net_device *dev)
@@ -1134,7 +1135,6 @@ static void ipoib_cm_tx_destroy(struct i
 {
 	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
 	struct ipoib_cm_tx_buf *tx_req;
-	unsigned long flags;
 	unsigned long begin;
 
 	ipoib_dbg(priv, "Destroy active connection 0x%x head 0x%x tail 0x%x\n",
@@ -1165,12 +1165,12 @@ timeout:
 				 DMA_TO_DEVICE);
 		dev_kfree_skb_any(tx_req->skb);
 		++p->tx_tail;
-		spin_lock_irqsave(&priv->tx_lock, flags);
+		netif_tx_lock_bh(p->dev);
 		if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
 		    netif_queue_stopped(p->dev) &&
 		    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
 			netif_wake_queue(p->dev);
-		spin_unlock_irqrestore(&priv->tx_lock, flags);
+		netif_tx_unlock_bh(p->dev);
 	}
 
 	if (p->qp)
@@ -1187,6 +1187,7 @@ static int ipoib_cm_tx_handler(struct ib
 	struct ipoib_dev_priv *priv = netdev_priv(tx->dev);
 	struct net_device *dev = priv->dev;
 	struct ipoib_neigh *neigh;
+	unsigned long flags;
 	int ret;
 
 	switch (event->event) {
@@ -1205,8 +1206,8 @@
 	case IB_CM_REJ_RECEIVED:
 	case IB_CM_TIMEWAIT_EXIT:
 		ipoib_dbg(priv, "CM error %d.\n", event->event);
-		spin_lock_irq(&priv->tx_lock);
-		spin_lock(&priv->lock);
+		netif_tx_lock_bh(dev);
+		spin_lock_irqsave(&priv->lock, flags);
 		neigh = tx->neigh;
 
 		if (neigh) {
@@ -1224,8 +1225,8 @@
 			queue_work(ipoib_workqueue, &priv->cm.reap_task);
 		}
 
-		spin_unlock(&priv->lock);
-		spin_unlock_irq(&priv->tx_lock);
+		spin_unlock_irqrestore(&priv->lock, flags);
+		netif_tx_unlock_bh(dev);
 		break;
 	default:
 		break;
@@ -1279,19 +1280,24 @@ static void ipoib_cm_tx_start(struct wor
 	struct ib_sa_path_rec pathrec;
 	u32 qpn;
 
-	spin_lock_irqsave(&priv->tx_lock, flags);
-	spin_lock(&priv->lock);
+	netif_tx_lock_bh(dev);
+	spin_lock_irqsave(&priv->lock, flags);
+
 	while (!list_empty(&priv->cm.start_list)) {
 		p = list_entry(priv->cm.start_list.next, typeof(*p), list);
 		list_del_init(&p->list);
 		neigh = p->neigh;
		qpn =
Re: [ofa-general] [PATCH 00/03] RDMA Transport Support for 9P
Roland Dreier wrote:

This patchset implements an RDMA transport provider for the v9fs (Plan 9 filesystem). Could you take a look at it and let us know what you think?

I sent comments on the initial posting I saw on lkml ... did they not make it to you?

No, I just missed it. Sorry. I just responded to your comments.

[PATCH 01/03] 9prdma: RDMA Transport Support for 9P
[PATCH 02/03] 9prdma: Makefile change for the RDMA transport
[PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport

one meta-comment I didn't send last time: the patches are small enough that I would just send it all in one patch, since it makes sense to apply it that way anyway.

Ok, makes my life easy.

- R.
Re: [ofa-general] SRP/mlx4 interrupts throttling performance
Vladislav Bolkhovitin wrote: Cameron Harr wrote: Vlad, Thanks for the suggestion. As I look via vmstat, my CSw/s rate is fairly constant around 280K when scst_threads=1 (per Vu's suggestion) and pops up to ~330-340K CSw/s when scst_threads is set to 8. I'm currently doing 512B writes, and this gives me about a 4:1 ratio of context switches to IOPs with 1 SCST thread (70K IOPs) and around 4.5:1 when there are 8 SCST threads (75K IOPs).

This is still too high. Considering that each CS is about 1 microsecond, you can estimate how many IOPS it costs you.

Dropping scst_threads down to 2, from 8, with 2 initiators, seems to make a fairly significant difference, propelling me to a little over 100K IOPs and putting the CS rate around 2:1, sometimes lower. 2 threads gave the best performance compared to 1, 4 and 8.
Re: [ofa-general] SRP/mlx4 interrupts throttling performance
Vu Pham wrote:
Cameron Harr wrote:
Vu Pham wrote:

Alternatively, is there anything in the SCST layer I should tweak? I'm still running rev 245 of that code (kinda old, but works with OFED 1.3.1 w/o hacks).

With blockio I get the best performance + stability with scst_threads=1

I got best performance with threads=2 or 3, and I've noticed that the srpt_thread is often at 99%, though if I increase/decrease the thread=? parameter for ib_srpt, it doesn't seem to make a difference. A second initiator doesn't seem to help much either; with a single initiator writing to two targets, I can now usually get between 95K and 105K IOPs.

My target server (with DAS) contains 8 2.8 GHz CPU cores and can sustain over 200K IOPs locally, but only around 73K IOPs over SRP.

Is this number from one initiator or multiple?

One initiator. At first I thought it might be a limitation of the SRP, and added a second initiator, but the aggregate performance of the two was about equal to that of a single initiator.

Try again with scst_threads=1. I expect that you can get ~140K with two initiators

Unfortunately, I'm nowhere close that high, though I am significantly higher than before. 2 initiators does seem to reduce the context switching rate however, which is good.

Looking at /proc/interrupts, I see that the mlx_core (comp) device is pushing about 135K Int/s on 1 of 2 CPUs. All CPUs are enabled for that PCI-E slot, but it only ever uses 2 of the CPUs, and only 1 at a time. None of the other CPUs has an interrupt rate more than about 40-50K/s.

The number of interrupts can be cut down if there are more completions to be processed by sw, i.e. please test with multiple QPs between one initiator vs. your target and multiple initiators vs. your target.

Interrupts are still pretty high (around 160K/s now), but that seems to not be my bottleneck. Context switching seems to be about 2-2.5 for every IOP and sometimes less - not perfect, but not horrible either.

ib_srpt processes completions in the event callback handler. With more QPs there are more completions pending per interrupt instead of one completion event per interrupt. You can have multiple QPs between initiator vs. target by using different initiator_id_ext, i.e.

echo id_ext=xxx,ioc_guid=yyy,initiator_ext=1 > /sys/class/infiniband_srp/.../add_target
echo id_ext=xxx,ioc_guid=yyy,initiator_ext=2 > /sys/class/infiniband_srp/.../add_target
echo id_ext=xxx,ioc_guid=yyy,initiator_ext=3 > /sys/class/infiniband_srp/.../add_target

This doesn't seem to net much of an improvement, though I understand the reasoning behind it. My hunch is there's another bottleneck now to look for.

Cameron
Re: [ofa-general] SRP/mlx4 interrupts throttling performance
Cameron Harr wrote:

This is still too high. Considering that each CS is about 1 microsecond, you can estimate how many IOPS it costs you.

Dropping scst_threads down to 2, from 8, with 2 initiators, seems to make a fairly significant difference, propelling me to a little over 100K IOPs and putting the CS rate around 2:1, sometimes lower. 2 threads gave the best performance compared to 1, 4 and 8.

Just as a status update, I've gotten my best performance with scst_threads=3 on 2 initiators, and using a separate QP for each drive an initiator is writing to. I'm getting pretty consistent 112-115K IOPs using two initiators, each writing with 2 processes to the same 2 physical targets, using 512B blocks. Adding the second initiator only bumps me up by about 20K IOPs, but as all the CPUs are pegged around 99%, I'll take that as a bottleneck. Also, as a note from Vlad's advice, the CS rate is now around 70K/s on 115K IOPs, so it's not too bad. Interrupts (where this thread started) are around 200K/s - a lot higher than I thought they'd go, but I'm not complaining. :)

Thanks for the help.

Cameron
Re: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache
Hi Hal,

Hal Rosenstock wrote:

Hi Yevgeny,

On Sun, Oct 5, 2008 at 9:26 PM, Yevgeny Kliteynik [EMAIL PROTECTED] wrote:

Hi Sasha,

The following series of 6 patches implements a unicast routing cache in OpenSM. This implementation (v2; the previous version was sent before OFED 1.3) was rewritten from scratch:
- no caching of existing connectivity
- no caching of existing lid matrices
- each switch has an LFT buffer that contains the result of the last routing engine execution (instead of one buffer in ucast_mgr)
- links/ports/nodes changes are spotted during the discovery
- only the links/ports/nodes that went down are cached
- when a switch goes down, its lid matrices and LFT are cached

In one of the following cases we can use cached routing:
- there is no topology change
- one or more CAs disappeared
- one or more leaf switches disappeared

In these cases cached routing is written to the switches as is (unless the switch doesn't exist). If there is any other topology change, the existing cache is invalidated and the routing engine(s) run as usual.

Glad to see this! A few comments/questions:

It seems that there is an LFT cache per switch. This seems to be a big memory penalty to me (in large subnets). So I have two questions related to this: Can this only be done this way when cached routing is being used?

Actually, I was thinking about something else: currently the switch LFT is implemented as osm_fwd_tbl_t. I can remove the unnecessary complexity of osm_fwd_tbl_t by replacing it with a simple uint8_t array (same as the LFT buffer). Then by simple comparison I will check whether the recently calculated LFT matches the switch's LFT, and if there is a match, then lft_buf can be freed. In this case only the switches whose LFT differs from the recently calculated LFT will have both tables, which would be rare and temporary - on the next heavy sweep the LFTs would match, and lft_buf would be freed. Effectively, it won't have a memory penalty. It can be done in a separate patch.

Also, when cached routing is being used, is this only needed for leaf switches?

No, it is needed for all the switches, because the cache can also handle non-leaf switch fast reset.

I'm wondering, when there is a cached node match, whether the available peer ports/neighbors are validated (or something equivalent) to know the caching is valid? It might also include whether a switch is still a leaf switch (which may be redundant, as that should show up as a peer port/neighbor change). It looks like the structure is there for this, but I didn't review the code in detail.

If I understood your question correctly, then yes, such validation is done by the osm_ucast_cache_validate() function. Can you describe in more detail the case that you are asking about?

Are you sure all the memory allocation failures are handled properly within the routing cache code? What I mean is that NULL is returned, and does this always result in caching not being used and routing recalculated? Also, in that case, should some log message be emitted rather than hiding this?

I will check it.

Nit: doc/current-routing.txt should also be updated for this feature.

OK, separate patch.

-- Yevgeny

-- Hal

The patches are:
- patch 1/6: move lft_buf from ucast_mgr to osm_switch
- patch 2/6: Add -A or --ucast_cache option to opensm
- patch 3/6: adding osm_ucast_cache.{c,h} files (this is the cache implementation itself)
- patch 4/6: adding new cache files to makefile
- patch 5/6: integrating unicast cache into the discovery and ucast manager
- patch 6/6: man entry for cached routing

-- Yevgeny