Re: [ofa-general] rdma_resolve_route() returning -EINVAL

2008-10-06 Thread Or Gerlitz

Roland Dreier wrote:

The issue is more of spec compliance than a likely real-life scenario... and as 
for why no one else is worrying about it, I think it's because the only other 
user of rdma_connect() in the tree is iSER, and I guess no one worried too much 
there.  SRP uses the IB CM directly, and waits for timewait exit before calling 
a connection closed.

Roland,

I guess there's a tradeoff here between the time connection recovery 
would take when the ULP does wait for the timewait event and the risk of 
the target considering the REQ stale and rejecting it. Does SRP just 
wait for the event, or is the new connection established in parallel, 
which also means it would always use a different QP number?


Or.

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [ofa-general] SRP/mlx4 interrupts throttling performance

2008-10-06 Thread Vladislav Bolkhovitin

Cameron Harr wrote:
I was able to get the latest scst code working with Vu's standalone 
ib_srpt and the kernel IB modules, and dropped my ib_srpt thread count 
to 2. However, I still get about the same IOP performance on the target 
although interrupts on the busy cpu have gone up to around 140K. 
Interesting, but now I'm at a bit of a loss as to where the bottleneck 
could be. I figured it was Interrupts, but if the CPU is handling more 
right now, perhaps the problem is elsewhere?


How many context switches per second do you have during your test on the 
target?


There was once a thread on the scst-devel mailing list about the 
observation that the SRP target driver produces 10 context switches per 
command. See 
http://sourceforge.net/mailarchive/message.php?msg_id=e2e108260802070110q1fa084a1j54945d06c16c94f2%40mail.gmail.com


If that is so in your case as well, it would very well explain your issue. 
10 CS/cmd is definite overkill; it should be 1 or, at most, 2 CS/cmd.


BTW, I suppose you don't use the debug SCST build, do you?

Vlad


Cameron

Cameron Harr wrote:

Cameron Harr wrote:
Additionally, I found that I can load the newer scst code if I use 
the kernel-supplied modules and the standalone srpt-1.0.0 package 
that I think you provide Vu. I was about to try it along with 
dropping a module param for ib_srpt (I was using a thread count of 32 
that had given me better performance on an earlier test). I'll report 
back on this.
Not much luck using the newer scst code and default kernel modules 
(Running CentOS 5.2). If I try using the default kernel modules on the 
initiator, I can't get them to see anything (the ofed SM pkg doesn't 
see any devices to run on). When using the regular OFED on the 
initiator, my target dies when I try to attach to the target on the 
initiator:

-
 ib_srpt: Host login i_port_id=0x0:0x2c90300026053 
t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996
Oct  3 13:44:23 test05 kernel: i[4127]: scst: 
scst_mgmt_thread:5187:***CRITICAL ERROR*** session 8107f3222b88 is 
in scst_sess_shut_list, but in unknown shut phase 0

BUG at /usr/src/scst.tot/src/scst_targ.c:5188
--- [cut here ] - [please bite here ] -
Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188
invalid opcode:  [1] SMP
last sysfs file: /devices/pci:00/:00:00.0/class
CPU 2
Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U) 
fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo 
crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 hfsplus 
dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button 
battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 
i5000_edac i2c_core edac_mc pcspkr shpchp mlx4_core e1000e ata_piix 
libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 4127, comm: scsi_tgt_mgmt Tainted: P  2.6.18-92.1.13.el5 #1
RIP: 0010:[88488a56]  [88488a56] 
:scst:scst_mgmt_thread+0x3ff/0x577

-




[ofa-general] ofa_1_4_kernel 20081006-0200 daily build status

2008-10-06 Thread Vladimir Sokolovsky (Mellanox)
This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on ppc64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs':
/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24'
make: *** [kernel] Error 2
--


Re: [ofa-general] [PATCH] SDP: fix initial recv buffer size

2008-10-06 Thread Tziporet Koren

Amir Vadai wrote:

Set initial recv buffer according to incoming hha header.

Fixed bugzilla 1086: SDP Linux and SDP Windows don't work together
  


Have you asked Vlad to pull this?

Tziporet


Re: [ofa-general] ***SPAM*** ibdm network topology format

2008-10-06 Thread Hal Rosenstock
Sasha,

On Thu, Oct 2, 2008 at 6:22 PM, Hal Rosenstock [EMAIL PROTECTED] wrote:
 Sasha,

 On Thu, Oct 2, 2008 at 1:00 PM, Sasha Khapyorsky [EMAIL PROTECTED] wrote:
 Hi Hal,

 On 10:18 Thu 02 Oct , Hal Rosenstock wrote:
 
  2. ibis doesn't register class 0x81 - SM direct routed, only SM lid
  routed (0x1). In comment in ibutils/ibis/src/ibsm.c line 118 is stated:
 
   /* no need to bind the Directed Route class as it will automatically
  be handled by the osm_vendor_bind if asked for LID route */
 
  As far as I can see in osm_vendor_bind() it is not (but it is in
  opposite order - when class 0x81 is registered class 0x1 will be
  registered too).

 Yes that is what osm_vendor_ibumad.c:osm_vendor_bind does.

 So either ibdiagnet needs to register 0x81 r.t.1 or
 osm_vendor_ibumad.c:osm_vendor_bind needs to be symmetric in terms
 of registering the other SM class when only one is requested. This is
 a minor change in the underlying semantics. [Popping up a level in
 terms of this, (other than applications taking advantage of this
 feature,) I'm not sure why the vendor layer should assume that just
 because one SM class is requested, the other should be too].  I just
 looked and the latter appears to be consistent with the other vendor
 layers. I think either solution will work. Your solution below also
 looks like it would work, but I don't think that should be done in a
 sim layer.

 I don't like this solution either, but it is unclear to me why ibis
 works with the real stack without registering class 0x81.

 Me too. See below.

  Somehow it works without ibsim - so I suspect user_mad handles it.
 
  (Hal, could you clarify?)

 The kernel (user_mad/mad) does not change the requested registrations
 but I'm not sure I understand the question you are asking to be
 clarified. Is that what you're asking ?

 ibis works somehow with real stack. It registers 0x1 class only and
 uses direct routing SMPs. Do you have any idea about why
 (osm_vendor_ibumad and/or libibumad don't help)?

 libibumad umad_register does not do anything that would affect this
 either. I can only conclude there must be something in ibutils that
 fixes this if it does work with the real stack. It shouldn't be too
 hard to track down where that registration for class 0x81 comes from.

Are you sure this is the only registration, and that the DR class is
not registered too? That's the first thing to confirm, or maybe you've
already confirmed this and it wasn't clear to me from what you wrote.
If so, I have a theory about what could be occurring. It may be an
effect of the kernel MAD layer: a MAD agent can send any class, and
with request/response the reply is matched on the transaction ID, which
contains the MAD agent. Unsolicited messages on that other class
wouldn't get through, though. I just ran a simple test of this and that
appears to be the case.
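
The theory above can be illustrated with a toy model. This is only a
sketch of the dispatch rule as described in this message, not kernel
code: the class ToyMadLayer, its method names, and the placement of the
agent ID in the upper 32 bits of the transaction ID are assumptions
made for illustration.

```python
from dataclasses import dataclass, field

SM_LID_ROUTED = 0x01     # SM class, LID routed
SM_DIRECT_ROUTED = 0x81  # SM class, direct routed


@dataclass
class Agent:
    agent_id: int
    classes: set
    inbox: list = field(default_factory=list)


class ToyMadLayer:
    """Hypothetical model: responses are matched by TID, not by class."""

    def __init__(self):
        self.agents = {}
        self.next_id = 1

    def register(self, classes):
        agent = Agent(self.next_id, set(classes))
        self.agents[agent.agent_id] = agent
        self.next_id += 1
        return agent

    def make_tid(self, agent, user_tid):
        # Assumed layout: agent ID in the high 32 bits of the TID.
        return (agent.agent_id << 32) | (user_tid & 0xFFFFFFFF)

    def receive(self, mgmt_class, tid, is_response):
        if is_response:
            # Solicited: routed to the agent encoded in the TID,
            # regardless of which classes that agent registered.
            agent = self.agents.get(tid >> 32)
        else:
            # Unsolicited: only an agent registered for the class gets it.
            agent = next((a for a in self.agents.values()
                          if mgmt_class in a.classes), None)
        if agent is None:
            return False
        agent.inbox.append((mgmt_class, tid))
        return True
```

In this model an agent registered only for class 0x1 still receives the
response to a class 0x81 request it sent, while an unsolicited class
0x81 MAD is dropped, which matches the behaviour reported above.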

-- Hal


 -- Hal

 Sasha




Re: [ofa-general] SRP/mlx4 interrupts throttling performance

2008-10-06 Thread Vladislav Bolkhovitin


Cameron Harr wrote:

Cameron Harr wrote:
Additionally, I found that I can load the newer scst code if I use the 
kernel-supplied modules and the standalone srpt-1.0.0 package that I 
think you provide Vu. I was about to try it along with dropping a 
module param for ib_srpt (I was using a thread count of 32 that had 
given me better performance on an earlier test). I'll report back on 
this.


Not much luck using the newer scst code and default kernel modules 
(Running CentOS 5.2). If I try using the default kernel modules on the 
initiator, I can't get them to see anything (the ofed SM pkg doesn't see 
any devices to run on). When using the regular OFED on the initiator, my 
target dies when I try to attach to the target on the initiator:

-
  ib_srpt: Host login i_port_id=0x0:0x2c90300026053 
t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996
Oct  3 13:44:23 test05 kernel: i[4127]: scst: 
scst_mgmt_thread:5187:***CRITICAL ERROR*** session 8107f3222b88 is 
in scst_sess_shut_list, but in unknown shut phase 0

BUG at /usr/src/scst.tot/src/scst_targ.c:5188
--- [cut here ] - [please bite here ] -
Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188


This can happen if the target driver frees some IO or TM command twice, 
by, eg, calling scst_tgt_cmd_done() two times for the same command.



invalid opcode:  [1] SMP
last sysfs file: /devices/pci:00/:00:00.0/class
CPU 2
Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U) 
fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo 
crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 hfsplus 
dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery 
asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 i5000_edac 
i2c_core edac_mc pcspkr shpchp mlx4_core e1000e ata_piix libata sd_mod 
scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 4127, comm: scsi_tgt_mgmt Tainted: P  2.6.18-92.1.13.el5 #1
RIP: 0010:[88488a56]  [88488a56] 
:scst:scst_mgmt_thread+0x3ff/0x577

-


Re: [ofa-general] Status of NFS over RDMA and SRP?

2008-10-06 Thread Tziporet Koren

Steven Truelove wrote:

Hi,

   I am considering using our existing Infiniband interconnect to 
provide high-speed storage access to our compute cluster. It looks 
like the two ways to do this are NFS over RDMA and SRP.



SRP initiator is part of the Linux kernel and also part of OFED.
SRP target is part of OFED (starting from OFED 1.3) and also submitted 
to the kernel as part of Generic SCSI target mid-level driver - SCST 
(http://scst.sourceforge.net)


SRP is in GA stage

Tziporet




Re: [ofa-general] rdma_resolve_route() returning -EINVAL

2008-10-06 Thread Talpey, Thomas
At 05:08 AM 10/6/2008, Or Gerlitz wrote:
Roland Dreier wrote:
 The issue is more of spec compliance than a likely real-life 
scenario... and as for why no one else is worrying about it, I think 
it's because the only other user of rdma_connect() in the tree is 
iSER, and I guess no one worried too much there.  SRP uses the IB CM 
directly, and waits for timewait exit before calling a connection closed.
Roland,

I guess there's a tradeoff here between the time connection recovery 
would take when the ULP does wait for the timewait event and the risk of 
the target considering the REQ stale and rejecting it. Does SRP just 
wait for the event, or is the new connection established in parallel, 
which also means it would always use a different QP number?

That's what the NFS/RDMA client does - I always create a new cm_id and qp.
So the TIMEWAIT upcall isn't very interesting, unless I change that. My
problem was that I started getting the upcall, when I didn't before.

Tom.



[ofa-general] RE: IPoIB CM connectivity issue.

2008-10-06 Thread Alex Estrin
 
   In second case (when OFED is CREQ initiator) only one RC QP was used
   to establish a connection and apparently bidirectional traffic was
   capable to go through that one QP.
 
 Yes, at least in the case where you have an SRQ-capable adapter, it
 doesn't really matter which QP has incoming traffic.  However, it was
 much simpler in the IPoIB implementation to simply open a QP to send
 traffic rather than searching through all passive connections for a
 connection to the same peer.
 
 Is this behavior causing problems for you?

I just didn't expect the use of TWO QPs per connection.
It probably won't simplify my implementation :)

 I assume you mean sending ARP replies.  Yes, you are correct.  I never
 noticed before but RFC 4755 does say:
 
   Additionally, all address resolution responses (ARP or Neighbor
   Discovery) MUST always be encapsulated in a UD mode packet.
 
   Yes, you are right. Please discard my note regarding ARP reply.
 
 Not sure what you mean -- if Linux is sending ARP replies on a
 connected QP, then that is not allowed according to the RFC.

I meant that Linux sending ARP replies over UD is in compliance with
the document.

 However, looking at this quote again I see that the RFC's requirement
 rather unfortunately includes neighbour discovery too.  It's not *too*
 bad to look at the ethertype in the IPoIB pseudo-header to check for an
 ARP packet, but sending all neighbour discovery messages seems very
 ugly -- even just sending all ICMP6 messages via UD wouldn't be very
 nice to implement, and it would eg break ping6 with large messages, so
 we would have to look deep deep into packets to see which were ND
 messages.
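
To make the cost of that check concrete, here is a rough sketch of the
classification the quoted paragraph describes. It assumes the 4-byte
IPoIB encapsulation header (2-byte ethertype plus 2 reserved bytes)
and treats ICMPv6 types 133 to 137 as neighbour discovery. The
function name must_use_ud is made up for illustration, and the sketch
deliberately ignores IPv6 extension headers, which is exactly where
the "deep" inspection would come in.

```python
import struct

ETH_P_ARP = 0x0806
ETH_P_IPV6 = 0x86DD
IPPROTO_ICMPV6 = 58
# Router/Neighbor Solicitation and Advertisement, plus Redirect.
ND_TYPES = range(133, 138)
IPOIB_HDR_LEN = 4
IPV6_HDR_LEN = 40


def must_use_ud(frame: bytes) -> bool:
    """Return True if the packet is ARP or an ND message (hypothetical helper).

    `frame` is assumed to start with the IPoIB encapsulation header.
    """
    if len(frame) < IPOIB_HDR_LEN:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 0)
    if ethertype == ETH_P_ARP:
        return True  # the cheap case: only the ethertype is needed
    if ethertype != ETH_P_IPV6:
        return False
    ipv6 = frame[IPOIB_HDR_LEN:]
    if len(ipv6) < IPV6_HDR_LEN + 1:
        return False
    # Simplification: only handles ICMPv6 directly after the fixed
    # IPv6 header; extension headers are not walked.
    if ipv6[6] != IPPROTO_ICMPV6:
        return False
    return ipv6[IPV6_HDR_LEN] in ND_TYPES
```

The ARP branch needs two bytes of the packet, while the ND branch has
to reach at least 45 bytes in, and further still once extension
headers are taken into account.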
 
 I wonder what the rationale behind that part of the RFC was?
 
  - R.
Yes, it would be good to know the reason behind this. So far, handling
ND messages over UD ONLY seems odd, considering that its function is to
detect unreachable neighbours. If one has a live RC QP to a neighbour,
it is most likely reachable.

Alex.


***SPAM*** Re: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache

2008-10-06 Thread Hal Rosenstock
Hi Yevgeny,

On Sun, Oct 5, 2008 at 9:26 PM, Yevgeny Kliteynik
[EMAIL PROTECTED] wrote:
 Hi Sasha,

 The following series of 6 patches implements unicast routing cache
 in OpenSM.

 This implementation (v2, previous version was sent before OFED 1.3)
 was rewritten from scratch:
  - no caching of existing connectivity
  - no caching of existing lid matrices
  - each switch has an LFT buffer that contains the result of
   the last routing engine execution (instead of one buffer
   in ucast_mgr)
  - links/ports/nodes changes are spotted during the discovery
  - only the links/ports/nodes that  went down are cached
  - when switch goes down, caching its lid matrices and LFT

 In one of the following cases we can use cached routing
  - there is no topology change
  - one or more CAs disappeared
  - one or more leaf switches disappeared
 In these cases cached routing is written to the switches as is
 (unless the switch doesn't exist).
 If there is any other topology change, existing cache is invalidated
 and the routing engine(s) run as usual.
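
 As a reading aid, the decision rule listed above could be sketched as
 follows. This is not the code from the patches: the Change enum and
 both function names are invented for illustration, and the topology
 change is assumed to have already been classified during discovery.

```python
from enum import Enum, auto


class Change(Enum):
    CA_REMOVED = auto()
    LEAF_SWITCH_REMOVED = auto()
    OTHER = auto()  # any other topology change


# Changes after which the cached routing is still considered usable.
CACHE_SAFE = {Change.CA_REMOVED, Change.LEAF_SWITCH_REMOVED}


def can_use_cached_routing(changes):
    """True if there was no change, or only cache-safe changes."""
    return all(change in CACHE_SAFE for change in changes)


def switches_to_program(cached_lfts, present_switches):
    """Cached LFTs are written as is, skipping switches that are gone."""
    return {sw: lft for sw, lft in cached_lfts.items()
            if sw in present_switches}
```

 An empty change list counts as "no topology change", so the cache is
 used; a single OTHER entry invalidates it and the routing engine runs
 as usual.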

Glad to see this!

A few comments/questions:

It seems that there is a LFT cache per switch. This seems to be a big
memory penalty to me (in large subnets). So I have two questions
related to this:
Can this only be done this way when cached routing is being used ?
Also, when cached routing is being used, is this only needed for leaf switches ?

I'm wondering when there is a cached node match whether the available
peer ports/neighbors are validated (or something equivalent) to know
caching is valid ? It might also include whether a switch is still a
leaf switch (which may be redundant as that should show up as a peer
port/neighbor change). It looks like the structure is there for this
but I didn't review the code in detail.

Are you sure all the memory allocation failures are handled properly
within the routing cache code? What I mean is: when NULL is returned,
does this always result in the cache not being used and the routing
being recalculated? Also, in that case, should a log message be emitted
rather than hiding this?

Nit: doc/current-routing.txt should also be updated for this feature.

-- Hal

 The patches are:
  - patch 1/6: move lft_buf from ucast_mgr to osm_switch
  - patch 2/6: Add -A or --ucast_cache option to opensm
  - patch 3/6: adding osm_ucast_cache.{c,h} files (this is
   the cache implementation itself)
  - patch 4/6: adding new cache files to makefile
  - patch 5/6: integrating unicast cache into the discovery
   and ucast manager
  - patch 6/6: man entry for cached routing

 -- Yevgeny


[ofa-general] Allowing end-users to query for fabric information

2008-10-06 Thread Mike Heinz
Roland,

I've been thinking about this some more and I have to say I'm still a
bit confused. Are you saying that any root user on any node of the
fabric can change the routing tables? Isn't the ability to access and
alter subnet information controlled via the management key?


--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Mike Heinz
Sent: Monday, September 22, 2008 3:19 PM
To: Roland Dreier
Cc: general@lists.openfabrics.org
Subject: RE: [ofa-general] Allowing end-users to query for fabric
information

Thanks for the explanation. 


--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-Original Message-
From: Roland Dreier [mailto:[EMAIL PROTECTED]
Sent: Monday, September 22, 2008 3:18 PM
To: Mike Heinz
Cc: general@lists.openfabrics.org
Subject: Re: [ofa-general] Allowing end-users to query for fabric
information

  What was the reason for making this design choice? While I could  
certainly provide boot scripts to change the permissions to  
/dev/infiniband/umad*, I'd rather understand why the decision was made
 to restrict access.

because /dev/infiniband/umadX allows full unfiltered access to
send/receive any MADs.  Including changing routing tables, bringing
ports down, etc.  Not stuff that unprivileged users should be able to
do.

It would make sense to have a higher-level interface that only allows
safe queries without side effects, but that's quite a bit more work than
just changing permissions on device nodes.

 - R.


***SPAM*** Re: [ofa-general] Allowing end-users to query for fabric information

2008-10-06 Thread Hal Rosenstock
Mike,

On Mon, Oct 6, 2008 at 11:09 AM, Mike Heinz [EMAIL PROTECTED] wrote:
 Roland,

 I've been thinking about this some more and I have to say I'm still a
 bit confused. Are you saying that any root user on any node of the
 fabric can change the routing tables? Isn't the ability to access and
 alter subnet information controlled via the management key?

There are two levels to this. First you must be able to send the MAD
and once that can happen the receiving SMA performs the usual MKey
checks which depend on the protection level assuming it is an SM class
MAD like the one to change the routing tables.
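
The two levels can be sketched as two independent gates. This is a
simplified model, not OFED code: the function names are made up, and
the M_Key rule is my condensed reading of the protection levels (a zero
port M_Key disables the check, a Set always needs a matching key, a Get
needs one only at protection level 2 or higher). Check the InfiniBand
specification before relying on it.

```python
def sma_accepts(method, mad_mkey, port_mkey, protect_bits):
    """Second gate: M_Key check done by the receiving SMA (simplified)."""
    if port_mkey == 0:
        return True  # no M_Key configured, nothing to check
    if mad_mkey == port_mkey:
        return True
    # Key mismatch: a Get is still allowed at the lower protection levels.
    return method == "Get" and protect_bits < 2


def smp_takes_effect(can_open_umad, method, mad_mkey, port_mkey,
                     protect_bits):
    """First gate (local device access) AND second gate (remote M_Key)."""
    return can_open_umad and sma_accepts(method, mad_mkey, port_mkey,
                                         protect_bits)
```

For example, a user who can open the umad device but does not know a
non-zero M_Key cannot change routing tables in this model, yet can
still issue Gets when the protection level is 0 or 1.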

-- Hal



 --
 Michael Heinz
 Principal Engineer, Qlogic Corporation
 King of Prussia, Pennsylvania

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Mike Heinz
 Sent: Monday, September 22, 2008 3:19 PM
 To: Roland Dreier
 Cc: general@lists.openfabrics.org
 Subject: RE: [ofa-general] Allowing end-users to query for fabric
 information

 Thanks for the explanation.


 --
 Michael Heinz
 Principal Engineer, Qlogic Corporation
 King of Prussia, Pennsylvania

 -Original Message-
 From: Roland Dreier [mailto:[EMAIL PROTECTED]
 Sent: Monday, September 22, 2008 3:18 PM
 To: Mike Heinz
 Cc: general@lists.openfabrics.org
 Subject: Re: [ofa-general] Allowing end-users to query for fabric
 information

   What was the reason for making this design choice? While I could  
 certainly provide boot scripts to change the permissions to  
 /dev/infiniband/umad*, I'd rather understand why the decision was made
 to restrict access.

 because /dev/infiniband/umadX allows full unfiltered access to
 send/receive any MADs.  Including changing routing tables, bringing
 ports down, etc.  Not stuff that unprivileged users should be able to
 do.

 It would make sense to have a higher-level interface that only allows
 safe queries without side effects, but that's quite a bit more work than
 just changing permissions on device nodes.

  - R.


RE: [ofa-general] Allowing end-users to query for fabric information

2008-10-06 Thread Mike Heinz
Well,

I guess that's my point - I'd like to be able to create tools for
non-root users that would collect interesting information about the
fabric. As far as I know, this should be a safe operation, because the
SA should be protected by the m-key - but it seems that the policy in
OFED is that this is not a safe operation and access must be tightly
controlled.

While it's a trivial task to patch OFED to give non-root users access to
the /dev/infiniband/umad* devices, I certainly don't want to provide
tools to my users that create security holes in the fabric.

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-Original Message-
From: Hal Rosenstock [mailto:[EMAIL PROTECTED] 
Sent: Monday, October 06, 2008 11:16 AM
To: Mike Heinz
Cc: Roland Dreier; general@lists.openfabrics.org
Subject: Re: [ofa-general] Allowing end-users to query for fabric
information

Mike,

On Mon, Oct 6, 2008 at 11:09 AM, Mike Heinz [EMAIL PROTECTED]
wrote:
 Roland,

 I've been thinking about this some more and I have to say I'm still a 
 bit confused. Are you saying that any root user on any node of the 
 fabric can change the routing tables? Isn't the ability to access and 
 alter subnet information controlled via the management key?

There are two levels to this. First you must be able to send the MAD and
once that can happen the receiving SMA performs the usual MKey checks
which depend on the protection level assuming it is an SM class MAD like
the one to change the routing tables.

-- Hal



 --
 Michael Heinz
 Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Mike Heinz
 Sent: Monday, September 22, 2008 3:19 PM
 To: Roland Dreier
 Cc: general@lists.openfabrics.org
 Subject: RE: [ofa-general] Allowing end-users to query for fabric 
 information

 Thanks for the explanation.


 --
 Michael Heinz
 Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania

 -Original Message-
 From: Roland Dreier [mailto:[EMAIL PROTECTED]
 Sent: Monday, September 22, 2008 3:18 PM
 To: Mike Heinz
 Cc: general@lists.openfabrics.org
 Subject: Re: [ofa-general] Allowing end-users to query for fabric 
 information

   What was the reason for making this design choice? While I could  

 certainly provide boot scripts to change the permissions to   
 /dev/infiniband/umad*, I'd rather understand why the decision was made
 to restrict access.

 because /dev/infiniband/umadX allows full unfiltered access to 
 send/receive any MADs.  Including changing routing tables, bringing 
 ports down, etc.  Not stuff that unprivileged users should be able to 
 do.

 It would make sense to have a higher-level interface that only allows 
 safe queries without side effects, but that's quite a bit more work 
 than just changing permissions on device nodes.

  - R.


[ofa-general] OFED meeting agenda for today (Oct 6)

2008-10-06 Thread Tziporet Koren


Agenda for OFED meeting today on OFED 1.4 status toward RC3:

1. Interop event status - Rupert
2. RC3 features:
   1. NFS-RDMA to work on RHEL 5.1 - done
   2. OSM: Cached routing - patches sent - should be committed in a
      day or two
   3. Cleanup compilation warning - Mellanox started - any progress
      by other companies?
3. OFED testing status - all
4. Critical bugs review:
   1128  blo  Othe  [EMAIL PROTECTED]  release IPoIB-CM QP resources in flushing CQE context
   1113  cri  RHEL  [EMAIL PROTECTED]  rpm -e scsi-target-utils-0.1-2008715 fails
   1198  cri  SLES  [EMAIL PROTECTED]  hang during ipoib create_child/ifdown
   1164  maj  SLES  [EMAIL PROTECTED]  iperf over IPoIB fails for 100 tcp connections
   1247  maj  RHEL  [EMAIL PROTECTED]  ipoib_ud_test caused kernel oops on ofed_1_4 (sw083/084)
   1221  maj  SLES  [EMAIL PROTECTED]  SLES10 sp2: remote logins via ssh fail due to rpcbind and...
   1248  maj  SLES  [EMAIL PROTECTED]  Bonding - after reboot the host stucks while raising the ...
   1099  maj  All   [EMAIL PROTECTED]  IPoIB IPv6 does not work on RH4
   1153  maj  Othe  [EMAIL PROTECTED]  OpenSM- Multicast group will not open when IB host is the...
5. OFA BOF at SC08 - Woody
6. Open discussion

Tziporet



Re: [ofa-general] SRP/mlx4 interrupts throttling performance

2008-10-06 Thread Cameron Harr

Vlad,
Thanks for the suggestion. As I look via vmstat, my CSw/s rate is fairly 
constant around 280K when scst_threads=1 (per Vu's suggestion) and pops 
up to ~330-340K CSw/s when scst_threads is set to 8. I'm currently doing 
512B writes, and this gives me about a 4:1 ratio of context switches to 
IOPs with 1 SCST thread (70K IOPs) and around 4.5:1 when there are 8 
SCST threads (75K IOPs). You say those numbers could be overkill - do 
you know of a way to drop the number? I'm very interested in trying Vu's 
other suggestions (multiple initiators and multiple QPs), but my other 
initiator has been too busy all weekend to run on.
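
For anyone wanting to reproduce the ratio, here is a small sketch of
the arithmetic. It assumes the system-wide context switch counter is
taken from the "ctxt" line of /proc/stat (the same counter vmstat
reports as a rate) and that the IOPS figure comes from the benchmark.
The function names are made up, and the counter includes switches from
everything running on the box, not only SCST and ib_srpt.

```python
def parse_ctxt(proc_stat_text):
    """Extract the cumulative context switch count from /proc/stat text."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no ctxt line found")


def cs_per_io(ctxt_before, ctxt_after, seconds, iops):
    """Context switches per I/O over a sampling interval."""
    cs_rate = (ctxt_after - ctxt_before) / seconds
    return cs_rate / iops


# Figures from this message: ~280K CSw/s at 70K IOPs with one SCST thread.
ratio = cs_per_io(0, 2_800_000, 10, 70_000)
```

With those figures the ratio comes out at 4 context switches per I/O,
the 4:1 quoted above.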


Debug, tracing, and all that was turned off in the SCST Makefiles.

-Cameron

Vladislav Bolkhovitin wrote:

Cameron Harr wrote:
I was able to get the latest scst code working with Vu's standalone 
ib_srpt and the kernel IB modules, and dropped my ib_srpt thread 
count to 2. However, I still get about the same IOP performance on 
the target although interrupts on the busy cpu have gone up to 
around 140K. Interesting, but now I'm at a bit of a loss as to where 
the bottleneck could be. I figured it was Interrupts, but if the CPU 
is handling more right now, perhaps the problem is elsewhere?


How many context switches per second do you have during your test on 
the target?


Once in scst-devel mailing list there was a thread about observation 
that SRP target driver produces 10 context switches per command. See 
http://sourceforge.net/mailarchive/message.php?msg_id=e2e108260802070110q1fa084a1j54945d06c16c94f2%40mail.gmail.com 



If it is so in your case as well, it would very well explain your 
issue. 10 CS/cmd is a definite overkill, it should be 1 or, at max, 2 
CS/cmd.


BTW, I suppose you don't use the debug SCST build, do you?

Vlad


Cameron

Cameron Harr wrote:

Cameron Harr wrote:
Additionally, I found that I can load the newer scst code if I use 
the kernel-supplied modules and the standalone srpt-1.0.0 package 
that I think you provide Vu. I was about to try it along with 
dropping a module param for ib_srpt (I was using a thread count of 
32 that had given me better performance on an earlier test). I'll 
report back on this.
Not much luck using the newer scst code and default kernel modules 
(Running CentOS 5.2). If I try using the default kernel modules on 
the initiator, I can't get them to see anything (the ofed SM pkg 
doesn't see any devices to run on). When using the regular OFED on 
the initiator, my target dies when I try to attach to the target on 
the initiator:

-
 ib_srpt: Host login i_port_id=0x0:0x2c90300026053 
t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996
Oct  3 13:44:23 test05 kernel: i[4127]: scst: 
scst_mgmt_thread:5187:***CRITICAL ERROR*** session 8107f3222b88 
is in scst_sess_shut_list, but in unknown shut phase 0

BUG at /usr/src/scst.tot/src/scst_targ.c:5188
--- [cut here ] - [please bite here ] -
Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188
invalid opcode:  [1] SMP
last sysfs file: /devices/pci:00/:00:00.0/class
CPU 2
Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U) 
fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo 
crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 
hfsplus dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec 
button battery asus_acpi acpi_memhotplug ac parport_pc lp parport 
i2c_i801 i5000_edac i2c_core edac_mc pcspkr shpchp mlx4_core e1000e 
ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 4127, comm: scsi_tgt_mgmt Tainted: P  2.6.18-92.1.13.el5 #1
RIP: 0010:[88488a56]  [88488a56] 
:scst:scst_mgmt_thread+0x3ff/0x577

-




[ofa-general] OFED Roll

2008-10-06 Thread publications
Am I correct that the Cisco OFED Roll installs Infiniband but not Infiniband
over IP?  Does it just use RDMA as a transport?

The OFED download from Openfabrics installs IB over IP and I prefer not to
use it since the latencies are double that of RDMA and the throughput is
about one half of RDMA.

So, another question.  Suppose I have already installed OFED 1.3.1 with IB
over IP.  How do I configure my system (including the ib0.conf and other
conf files) to use RDMA rather than IB over IP?

Thanks for your help.

Jim




Re: [ofa-general] [PATCH/RFC] IB/mthca: Use pci_request_regions()

2008-10-06 Thread Eli Cohen
On Mon, Sep 29, 2008 at 09:41:37PM -0700, Roland Dreier wrote:
 Back in prehistoric (pre-git!) days, the kernel's MSI-X support did
 request_mem_region() on a device's MSI-X tables, which meant that a
 driver that enabled MSI-X couldn't use pci_request_regions() (since
 that would clash with the PCI layer's MSI-X request).
 
 However, that was removed (by me!) years ago, so mthca can just use
 pci_request_regions() and pci_release_regions() instead of its own
 much more complicated code that avoids requesting the MSI-X tables.
 

Looks like a nice diet to the code.

Acked-by: Eli Cohen [EMAIL PROTECTED]


***SPAM*** Re: [ofa-general] Allowing end-users to query for fabric information

2008-10-06 Thread Hal Rosenstock
On Mon, Oct 6, 2008 at 11:27 AM, Mike Heinz [EMAIL PROTECTED] wrote:
 Well,

 I guess that's my point - I'd like to be able to create tools for
 non-root users that would collect interesting information about the
 fabric. As far as I know, this should be a safe operation, because the
 SA should be protected by the m-key - but it seems that the policy in
 OFED is that this is not a safe operation and access must be tightly
 controlled.

Do you mean SM or SA ?

Subverting the SM is not a good idea. The SM is the central point for
setting up SM attributes. Policy needs to be instilled through the SM.

There are some SA attributes which are somewhat dangerous too as they
are essentially writable as well from an end node.

Furthermore, most fabrics do not utilize MKey protection so the second
level is not there yet and only the most primitive form of this is
available within some SMs.

 While it's a trivial task to patch OFED to give non-root users access to
 the /dev/infiniband/umad* devices, I certainly don't want to provide
 tools to my users that create security holes in the fabric.

IMO this would do that although I would phrase it slightly differently.

-- Hal

 --
 Michael Heinz
 Principal Engineer, Qlogic Corporation
 King of Prussia, Pennsylvania

 -Original Message-
 From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
 Sent: Monday, October 06, 2008 11:16 AM
 To: Mike Heinz
 Cc: Roland Dreier; general@lists.openfabrics.org
 Subject: Re: [ofa-general] Allowing end-users to query for fabric
 information

 Mike,

 On Mon, Oct 6, 2008 at 11:09 AM, Mike Heinz [EMAIL PROTECTED]
 wrote:
 Roland,

 I've been thinking about this some more and I have to say I'm still a
 bit confused. Are you saying that any root user on any node of the
 fabric can change the routing tables? Isn't the ability to access and
 alter subnet information controlled via the management key?

 There are two levels to this. First you must be able to send the MAD and
 once that can happen the receiving SMA performs the usual MKey checks
 which depend on the protection level assuming it is an SM class MAD like
 the one to change the routing tables.

 -- Hal



 --
 Michael Heinz
 Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Mike Heinz
 Sent: Monday, September 22, 2008 3:19 PM
 To: Roland Dreier
 Cc: general@lists.openfabrics.org
 Subject: RE: [ofa-general] Allowing end-users to query for fabric
 information

 Thanks for the explanation.


 --
 Michael Heinz
 Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania

 -Original Message-
 From: Roland Dreier [mailto:[EMAIL PROTECTED]
 Sent: Monday, September 22, 2008 3:18 PM
 To: Mike Heinz
 Cc: general@lists.openfabrics.org
 Subject: Re: [ofa-general] Allowing end-users to query for fabric
 information

   What was the reason for making this design choice? While I could
   certainly provide boot scripts to change the permissions to
   /dev/infiniband/umad*, I'd rather understand why the decision was made
   to restrict access.

 because /dev/infiniband/umadX allows full unfiltered access to
 send/receive any MADs.  Including changing routing tables, bringing
 ports down, etc.  Not stuff that unprivileged users should be able to
 do.

 It would make sense to have a higher-level interface that only allows
 safe queries without side effects, but that's quite a bit more work
 than just changing permissions on device nodes.
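
 A minimal sketch of what such a query-only filter might check, assuming
 the filter sees the MAD method byte before forwarding anything. The
 function and constant names are hypothetical; only the method values
 (Get 0x01, Set 0x02, SubnAdmGetTable 0x12) come from the InfiniBand
 management datagram format. Filtering on method alone is a necessary
 condition, not a sufficient one, since individual attributes may still
 need to be excluded.

```python
# Hypothetical query-only MAD filter: allow read-style methods, reject the rest.
MAD_METHOD_GET = 0x01        # Get()
MAD_METHOD_SET = 0x02        # Set(), modifies state on the receiver
SA_METHOD_GET_TABLE = 0x12   # SubnAdmGetTable()

# Assumption: only these methods are treated as side-effect free.
SAFE_METHODS = frozenset({MAD_METHOD_GET, SA_METHOD_GET_TABLE})


def is_query_only(method):
    """Return True if the MAD method is on the read-only allow list."""
    return method in SAFE_METHODS


if __name__ == "__main__":
    for m in (MAD_METHOD_GET, MAD_METHOD_SET, SA_METHOD_GET_TABLE):
        print(hex(m), "allowed" if is_query_only(m) else "rejected")
```

 An allow list is used here rather than a deny list so that any method
 the filter does not know about is rejected by default.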

  - R.




Re: [ofa-general] SRP/mlx4 interrupts throttling performance

2008-10-06 Thread Vladislav Bolkhovitin

Cameron Harr wrote:

Vlad,
Thanks for the suggestion. As I look via vmstat, my CSw/s rate is fairly 
constant around 280K when scst_threads=1 (per Vu's suggestion) and pops 
up to ~330-340K CSw/s when scst_threads is set to 8. I'm currently doing 
512B writes, and this gives me about a 4:1 ratio of context switches to 
IOPs with 1 SCST thread (70K IOPs) and around 4.5:1 when there are 8 
SCST threads (75K IOPs).


This is still too high. Considering that each CS costs about 1 microsecond, 
you can estimate how many IOPS it costs you.
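
As a rough worked example of that estimate, using the figures quoted
above (280K CS/s at 70K IOPS with scst_threads=1) and the approximate
1 microsecond per context switch: the function names are made up for
illustration, and the per-switch cost is an assumption, not a
measurement.

```python
def cs_per_io(cs_per_sec, iops):
    """Context switches spent per I/O operation."""
    return cs_per_sec / iops


def cs_cpu_seconds_per_sec(cs_per_sec, cs_cost_us=1.0):
    """CPU time (seconds per wall-clock second, summed over all cores)
    consumed by context switching, assuming a fixed cost per switch."""
    return cs_per_sec * cs_cost_us / 1_000_000


if __name__ == "__main__":
    # Figures from the vmstat observation with scst_threads=1.
    print("CS per I/O:", cs_per_io(280_000, 70_000))
    print("CPU-seconds per second spent switching:",
          cs_cpu_seconds_per_sec(280_000))
```

Under these assumptions about 0.28 CPU-seconds per second go to
switching alone, before counting the cache effects that usually
accompany each switch.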


You say those numbers could be overkill - do 
you know of a way to drop the number?


Sorry, I don't. I can only say that the problem of too many CSs is in the 
SRPT driver. With the qla2x00t driver and BLOCKIO backstorage you will 
have 1 CS/cmd or less on average.


I'm very interested in trying Vu's 
other suggestions (multiple initiators and multiple QPs), but my other 
initiator has been too busy all weekend to run on.


Debug, tracing, and all that was turned off in the SCST Makefiles.

-Cameron

Vladislav Bolkhovitin wrote:

Cameron Harr wrote:
I was able to get the latest scst code working with Vu's standalone 
ib_srpt and the kernel IB modules, and dropped my ib_srpt thread 
count to 2. However, I still get about the same IOP performance on 
the target although interrupts on the busy cpu have gone up to 
around 140K. Interesting, but now I'm at a bit of a loss as to where 
the bottleneck could be. I figured it was Interrupts, but if the CPU 
is handling more right now, perhaps the problem is elsewhere?
How many context switches per second do you have during your test on 
the target?


Once in scst-devel mailing list there was a thread about observation 
that SRP target driver produces 10 context switches per command. See 
http://sourceforge.net/mailarchive/message.php?msg_id=e2e108260802070110q1fa084a1j54945d06c16c94f2%40mail.gmail.com 



If it is so in your case as well, it would very well explain your 
issue. 10 CS/cmd is a definite overkill, it should be 1 or, at max, 2 
CS/cmd.


BTW, I suppose you don't use the debug SCST build, do you?

Vlad




[ofa-general] [PATCH 00/03] RDMA Transport Support for 9P

2008-10-06 Thread Tom Tucker

Roland:

This patchset implements an RDMA transport provider for the 
v9fs (Plan 9 filesystem). Could you take a look at it and let us
know what you think?

Thanks,
Tom

Here is the original posting...

Eric:

This patch series implements an RDMA Transport provider for 9P and 
is relative to your for-next branch.  The RDMA support is built on the 
OpenFabrics API and uses SEND and RECV to exchange data. This patch 
series has been tested with dbench and iozone.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]

[PATCH 01/03] 9prdma: RDMA Transport Support for 9P

 net/9p/trans_rdma.c |  996 +++
 1 files changed, 996 insertions(+), 0 deletions(-)

[PATCH 02/03] 9prdma: Makefile change for the RDMA transport

 net/9p/Makefile |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

[PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport

 net/9p/Kconfig |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)



[ofa-general] [PATCH 02/03] 9prdma: Makefile change for the RDMA transport

2008-10-06 Thread Tom Tucker
This adds a make rule for the 9pnet_rdma module that implements
the RDMA transport.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]

---
 net/9p/Makefile |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/9p/Makefile b/net/9p/Makefile
index 5192194..bc909ab 100644
--- a/net/9p/Makefile
+++ b/net/9p/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_NET_9P) := 9pnet.o
 obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o
+obj-$(CONFIG_NET_9P_RDMA) += 9pnet_rdma.o
 
 9pnet-objs := \
mod.o \
@@ -12,3 +13,6 @@ obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o
 
 9pnet_virtio-objs := \
trans_virtio.o \
+
+9pnet_rdma-objs := \
+   trans_rdma.o \


[ofa-general] [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport

2008-10-06 Thread Tom Tucker
This patch adds a config option for the 9P RDMA transport.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]

---
 net/9p/Kconfig |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/9p/Kconfig b/net/9p/Kconfig
index ff34c5a..c42c0c4 100644
--- a/net/9p/Kconfig
+++ b/net/9p/Kconfig
@@ -20,6 +20,12 @@ config NET_9P_VIRTIO
  This builds support for a transports between
  guest partitions and a host partition.
 
+config NET_9P_RDMA
+   depends on NET_9P && INFINIBAND && EXPERIMENTAL
+   tristate "9P RDMA Transport (Experimental)"
+   help
+ This builds support for a RDMA transport.
+
 config NET_9P_DEBUG
bool "Debug information"
depends on NET_9P


[ofa-general] [PATCH 01/03] 9prdma: RDMA Transport Support for 9P

2008-10-06 Thread Tom Tucker
This file implements the RDMA transport provider for 9P. It allows
mounts to be performed over iWARP and IB capable network interfaces
and uses the OpenFabrics API to perform I/O.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]

---
 net/9p/trans_rdma.c | 1025 +++
 1 files changed, 1025 insertions(+), 0 deletions(-)

diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
new file mode 100644
index 000..f919768
--- /dev/null
+++ b/net/9p/trans_rdma.c
@@ -0,0 +1,1025 @@
+/*
+ * linux/fs/9p/trans_rdma.c
+ *
+ * RDMA transport layer based on the trans_fd.c implementation.
+ *
+ *  Copyright (C) 2008 by Tom Tucker [EMAIL PROTECTED]
+ *  Copyright (C) 2006 by Russ Cox [EMAIL PROTECTED]
+ *  Copyright (C) 2004-2005 by Latchesar Ionkov [EMAIL PROTECTED]
+ *  Copyright (C) 2004-2008 by Eric Van Hensbergen [EMAIL PROTECTED]
+ *  Copyright (C) 1997-2002 by Ron Minnich [EMAIL PROTECTED]
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License version 2
+ *  as published by the Free Software Foundation.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to:
+ *  Free Software Foundation
+ *  51 Franklin Street, Fifth Floor
+ *  Boston, MA  02111-1301  USA
+ *
+ */
+
+#include <linux/in.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/ipv6.h>
+#include <linux/kthread.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/un.h>
+#include <linux/uaccess.h>
+#include <linux/inet.h>
+#include <linux/idr.h>
+#include <linux/file.h>
+#include <linux/parser.h>
+#include <net/9p/9p.h>
+#include <net/9p/transport.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include <rdma/ib_verbs.h>
+
+#define P9_PORT            5640
+#define P9_RDMA_SQ_DEPTH   32
+#define P9_RDMA_RQ_DEPTH   32
+#define P9_RDMA_SEND_SGE   4
+#define P9_RDMA_RECV_SGE   4
+#define P9_RDMA_IRD        0
+#define P9_RDMA_ORD        0
+#define P9_RDMA_TIMEOUT3   /* 30 seconds */
+#define P9_RDMA_MAXSIZE    (4*4096)    /* Min SGE is 4, so we can
+                                        * safely advertise a maxsize
+                                        * of 64k */
+
+#define P9_RDMA_MAX_SGE (P9_RDMA_MAXSIZE >> PAGE_SHIFT)
+/**
+ * struct p9_trans_rdma - RDMA transport instance
+ *
+ * @state: tracks the transport state machine for connection setup and tear down
+ * @cm_id: The RDMA CM ID
+ * @pd: Protection Domain pointer
+ * @qp: Queue Pair pointer
+ * @cq: Completion Queue pointer
+ * @lkey: The local access only memory region key
+ * @next_tag: The next tag for tracking rpc
+ * @timeout: Number of uSecs to wait for connection management events
+ * @sq_depth: The depth of the Send Queue
+ * @sq_count: Number of WR on the Send Queue
+ * @rq_depth: The depth of the Receive Queue. NB: I _think_ that 9P is
+ * purely req/rpl (i.e. no unaffiliated replies, but I'm not sure, so
+ * I'm allowing this to be tweaked separately.
+ * @addr: The remote peer's address
+ * @req_lock: Protects the active request list
+ * @req_list: List of sent RPC awaiting replies
+ * @send_wait: Wait list when the SQ fills up
+ * @cm_done: Completion event for connection management tracking
+ */
+struct p9_trans_rdma {
+   enum {
+   P9_RDMA_INIT,
+   P9_RDMA_ADDR_RESOLVED,
+   P9_RDMA_ROUTE_RESOLVED,
+   P9_RDMA_CONNECTED,
+   P9_RDMA_FLUSHING,
+   P9_RDMA_CLOSING,
+   P9_RDMA_CLOSED,
+   } state;
+   struct rdma_cm_id *cm_id;
+   struct ib_pd *pd;
+   struct ib_qp *qp;
+   struct ib_cq *cq;
+   struct ib_mr *dma_mr;
+   u32 lkey;
+   atomic_t next_tag;
+   long timeout;
+   int sq_depth;
+   atomic_t sq_count;
+   int rq_depth;
+   struct sockaddr_in addr;
+
+   spinlock_t req_lock;
+   struct list_head req_list;
+
+   wait_queue_head_t send_wait;
+   struct completion cm_done;
+   struct p9_idpool *tagpool;
+};
+
+/**
+ * p9_rdma_context - Keeps track of in-process WR
+ *
+ * @wc_op: Mellanox's broken HW doesn't provide the original WR op
+ * when the CQE completes in error. This forces apps to keep track of
+ * the op themselves. Yes, it's a Pet Peeve of mine ;-)
+ * @busa: Bus address to unmap when the WR completes
+ * @req: Keeps track of requests (send)
+ * @rcall: Keeps track of replies (receive)
+ */
+struct p9_rdma_req;
+struct p9_rdma_context {
+   enum ib_wc_opcode wc_op;
+   dma_addr_t busa;
+   union {

Re: [ofa-general] [PATCH 00/03] RDMA Transport Support for 9P

2008-10-06 Thread Roland Dreier
  This patchset implements an RDMA transport provider for the 
  v9fs (Plan 9 filesystem). Could you take a look at it and let us
  know what you think?

I sent comments on the initial posting I saw on lkml ... did they not
make it to you?

  [PATCH 01/03] 9prdma: RDMA Transport Support for 9P
  [PATCH 02/03] 9prdma: Makefile change for the RDMA transport
  [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport

one meta-comment I didn't send last time: the patches are small enough
that I would just send it all in one patch, since it makes sense to
apply it that way anyway.

 - R.


[ofa-general] Re: Continue of defer skb_orphan() until irqs enabled

2008-10-06 Thread akepner

Sorry for the delay in getting back to you on this, but we've 
done our testing and haven't found any problems. As I mentioned, 
we're using OFED 1.3.1, so the patch had to be tweaked a bit. 
The patch we used follows.


--- a/drivers/infiniband/ulp/ipoib/ipoib.h  2008-09-09 15:53:24.856316458 -0700
+++ e/drivers/infiniband/ulp/ipoib/ipoib.h  2008-09-29 10:19:00.833519991 -0700
@@ -345,10 +345,9 @@ struct ipoib_ethtool_st {
 };
 
 /*
- * Device private locking: tx_lock protects members used in TX fast
- * path (and we use LLTX so upper layers don't do extra locking).
- * lock protects everything else.  lock nests inside of tx_lock (ie
- * tx_lock must be acquired first if needed).
+ * Device private locking: network stack tx_lock protects members used
+ * in TX fast path, lock protects everything else.  lock nests inside
+ * of tx_lock (ie tx_lock must be acquired first if needed).
  */
 struct ipoib_dev_priv {
spinlock_t lock;
@@ -397,7 +396,6 @@ struct ipoib_dev_priv {
struct ipoib_vmap   rx_vmap_ring;
struct ipoib_sg_rx_buf *rx_ring;
 
-   spinlock_t   tx_lock;
struct ipoib_vmaptx_vmap_ring;
struct ipoib_tx_buf *tx_ring;
unsigned tx_head;
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c   2008-09-09 15:53:24.856316458 -0700
+++ e/drivers/infiniband/ulp/ipoib/ipoib_cm.c   2008-09-26 13:38:00.066208156 -0700
@@ -776,7 +776,8 @@ void ipoib_cm_handle_tx_wc(struct net_de
 
dev_kfree_skb_any(tx_req->skb);
 
-   spin_lock_irqsave(&priv->tx_lock, flags);
+   netif_tx_lock(dev);
+
++tx->tx_tail;
if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
netif_queue_stopped(dev) &&
@@ -791,7 +792,7 @@ void ipoib_cm_handle_tx_wc(struct net_de
   "(status=%d, wrid=%d vend_err %x)\n",
   wc->status, wr_id, wc->vendor_err);
 
-   spin_lock(&priv->lock);
+   spin_lock_irqsave(&priv->lock, flags);
neigh = tx->neigh;
 
if (neigh) {
@@ -811,10 +812,10 @@ void ipoib_cm_handle_tx_wc(struct net_de
 
clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
 
-   spin_unlock(&priv->lock);
+   spin_unlock_irqrestore(&priv->lock, flags);
}
 
-   spin_unlock_irqrestore(&priv->tx_lock, flags);
+   netif_tx_unlock(dev);
 }
 
 int ipoib_cm_dev_open(struct net_device *dev)
@@ -1134,7 +1135,6 @@ static void ipoib_cm_tx_destroy(struct i
 {
struct ipoib_dev_priv *priv = netdev_priv(p->dev);
struct ipoib_cm_tx_buf *tx_req;
-   unsigned long flags;
unsigned long begin;
 
ipoib_dbg(priv, "Destroy active connection 0x%x head 0x%x tail 0x%x\n",
@@ -1165,12 +1165,12 @@ timeout:
DMA_TO_DEVICE);
dev_kfree_skb_any(tx_req->skb);
++p->tx_tail;
-   spin_lock_irqsave(&priv->tx_lock, flags);
+   netif_tx_lock_bh(p->dev);
if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
netif_queue_stopped(p->dev) &&
test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
netif_wake_queue(p->dev);
-   spin_unlock_irqrestore(&priv->tx_lock, flags);
+   netif_tx_unlock_bh(p->dev);
}
 
if (p->qp)
@@ -1187,6 +1187,7 @@ static int ipoib_cm_tx_handler(struct ib
struct ipoib_dev_priv *priv = netdev_priv(tx->dev);
struct net_device *dev = priv->dev;
struct ipoib_neigh *neigh;
+   unsigned long flags;
int ret;
 
switch (event->event) {
@@ -1205,8 +1206,8 @@ static int ipoib_cm_tx_handler(struct ib
case IB_CM_REJ_RECEIVED:
case IB_CM_TIMEWAIT_EXIT:
ipoib_dbg(priv, "CM error %d.\n", event->event);
-   spin_lock_irq(&priv->tx_lock);
-   spin_lock(&priv->lock);
+   netif_tx_lock_bh(dev);
+   spin_lock_irqsave(&priv->lock, flags);
neigh = tx->neigh;
 
if (neigh) {
@@ -1224,8 +1225,8 @@ static int ipoib_cm_tx_handler(struct ib
queue_work(ipoib_workqueue, &priv->cm.reap_task);
}
 
-   spin_unlock(&priv->lock);
-   spin_unlock_irq(&priv->tx_lock);
+   spin_unlock_irqrestore(&priv->lock, flags);
+   netif_tx_unlock_bh(dev);
break;
default:
break;
@@ -1279,19 +1280,24 @@ static void ipoib_cm_tx_start(struct wor
struct ib_sa_path_rec pathrec;
u32 qpn;
 
-   spin_lock_irqsave(&priv->tx_lock, flags);
-   spin_lock(&priv->lock);
+   netif_tx_lock_bh(dev);
+   spin_lock_irqsave(&priv->lock, flags);
+
while (!list_empty(&priv->cm.start_list)) {
p = list_entry(priv->cm.start_list.next, typeof(*p), list);
list_del_init(&p->list);
neigh = p->neigh;
qpn = 

Re: [ofa-general] [PATCH 00/03] RDMA Transport Support for 9P

2008-10-06 Thread Tom Tucker

Roland Dreier wrote:
  This patchset implements an RDMA transport provider for the 
  v9fs (Plan 9 filesystem). Could you take a look at it and let us

  know what you think?

I sent comments on the initial posting I saw on lkml ... did they not
make it to you?



No, I just missed it. Sorry. I just responded to your comments.


  [PATCH 01/03] 9prdma: RDMA Transport Support for 9P
  [PATCH 02/03] 9prdma: Makefile change for the RDMA transport
  [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport

one meta-comment I didn't send last time: the patches are small enough
that I would just send it all in one patch, since it makes sense to
apply it that way anyway.



Ok, makes my life easy.


 - R.




Re: [ofa-general] SRP/mlx4 interrupts throttling performance

2008-10-06 Thread Cameron Harr

Vladislav Bolkhovitin wrote:

Cameron Harr wrote:

Vlad,
Thanks for the suggestion. As I look via vmstat, my CSw/s rate is 
fairly constant around 280K when scst_threads=1 (per Vu's suggestion) 
and pops up to ~330-340K CSw/s when scst_threads is set to 8. I'm 
currently doing 512B writes, and this gives me about a 4:1 ratio of 
context switches to IOPs with 1 SCST thread (70K IOPs) and around 
4.5:1 when there are 8 SCST threads (75K IOPs).


This is still too high. Considering that each CS is about 1 
microsecond you can estimate how many IOPS's it costs you.


Dropping scst_threads down to 2, from 8, with 2 initiators, seems to 
make a fairly significant difference, propelling me to a little over 
100K IOPs and putting the CS rate around 2:1, sometimes lower. 2 threads 
gave the best performance compared to 1, 4 and 8.



Re: [ofa-general] SRP/mlx4 interrupts throttling performance

2008-10-06 Thread Cameron Harr



Vu Pham wrote:

Cameron Harr wrote:

Vu Pham wrote:



Alternatively, is there anything in the SCST layer I should tweak? I'm
still running rev 245 of that code (kinda old, but works with OFED 1.3.1
w/o hacks).


With blockio I get the best performance + stability with scst_threads=1


I got best performance with threads=2 or 3, and I've noticed that the 
srpt_thread is often at 99%, though if I increase/decrease the 
thread=? parameter for ib_srpt, it doesn't seem to make a difference. 
A second initiator doesn't seem to help much either; with a single 
initiator writing to two targets, I can now usually get between 95K and 
105K IOPs.


My target server (with DAS) contains 8 2.8 GHz CPU cores and can 
sustain over 200K IOPs locally, but only around 73K IOPs over SRP.


Is this number from one initiator or multiple?
One initiator. At first I thought it might be a limitation of the 
SRP, and added a second initiator, but the aggregate performance of 
the two was about equal to that of a single initiator.


Try again with scst_threads=1. I expect that you can get ~140K with 
two initiators


Unfortunately, I'm nowhere close to that high, though I am significantly 
higher than before. 2 initiators do seem to reduce the context 
switching rate, however, which is good.
Looking at /proc/interrupts, I see that the mlx_core (comp) device 
is pushing about 135K Int/s on 1 of 2 CPUs. All CPUs are enabled 
for that PCI-E slot, but it only ever uses 2 of the CPUs, and only 
1 at a time. None of the other CPUs has an interrupt rate more 
than about 40-50K/s.
The number of interrupts can be cut down if there are more 
completions to be processed by sw, i.e. please test with multiple QPs 
between one initiator and your target, and with multiple initiators 
against your target.
Interrupts are still pretty high (around 160K/s now), but that seems to 
not be my bottleneck. Context switching seems to be about 2-2.5 for 
every IOP and sometimes less - not perfect, but not horrible either.


ib_srpt processes completions in the event callback handler. With more QPs 
there are more completions pending per interrupt instead of one 
completion event per interrupt.
You can have multiple QPs between initiator and target by using 
different initiator_ext values, i.e.
echo id_ext=xxx,ioc_guid=yyy,initiator_ext=1 > /sys/class/infiniband_srp/.../add_target
echo id_ext=xxx,ioc_guid=yyy,initiator_ext=2 > /sys/class/infiniband_srp/.../add_target
echo id_ext=xxx,ioc_guid=yyy,initiator_ext=3 > /sys/class/infiniband_srp/.../add_target
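
For more than a handful of QPs, the login strings can be generated
rather than typed. This is a minimal sketch, not a tested tool: the
helper names are hypothetical, the id_ext/ioc_guid placeholders are the
ones used above, and the add_target path must be supplied by the caller
because it depends on the local HCA and port. A real login string may
need further fields beyond the three shown here.

```python
def srp_login_strings(id_ext, ioc_guid, num_qps):
    """Build one login string per QP, each with a distinct initiator_ext."""
    return [
        f"id_ext={id_ext},ioc_guid={ioc_guid},initiator_ext={n}"
        for n in range(1, num_qps + 1)
    ]


def add_targets(add_target_path, logins):
    """Write each login string to the add_target file, one write per login.

    Assumption: add_target_path points at the add_target attribute of the
    local SRP initiator port; writing to it normally requires root.
    """
    for login in logins:
        with open(add_target_path, "w") as f:
            f.write(login)


if __name__ == "__main__":
    for line in srp_login_strings("xxx", "yyy", 3):
        print(line)
```

Each login is written with its own open/write/close so that the kernel
sees one complete string per write, as with the separate echo commands.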
This doesn't seem to net much of an improvement, though I understand the 
reasoning behind it. My hunch is there's another bottleneck now to look for.


Cameron


Re: [ofa-general] SRP/mlx4 interrupts throttling performance

2008-10-06 Thread Cameron Harr

Cameron Harr wrote:
This is still too high. Considering that each CS is about 1 
microsecond you can estimate how many IOPS's it costs you.


Dropping scst_threads down to 2, from 8, with 2 initiators, seems to 
make a fairly significant difference, propelling me to a little over 
100K IOPs and putting the CS rate around 2:1, sometimes lower. 2 
threads gave the best performance compared to 1, 4 and 8.


Just as a status update, I've gotten my best performance with 
scst_threads=3 on 2 initiators, and using a separate QP for each drive 
an initiator is writing to. I'm getting pretty consistent 112-115K IOPs 
using two initiators, each writing with 2 processes to the same 2 
physical targets, using 512B blocks. Adding the second initiator only 
bumps me up by about 20K IOPs, but as all the CPUs are pegged around 
99%, I'll take that as a bottleneck. Also, as a note from Vlad's advice, 
the CS rate is now around 70K/s on 115K IOPs, so it's not too bad. 
Interrupts (where this thread started), are around 200K/s - a lot higher 
than I thought they'd go, but I'm not complaining. :)


Thanks for the help.
Cameron


Re: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache

2008-10-06 Thread Yevgeny Kliteynik

Hi Hal,

Hal Rosenstock wrote:

Hi Yevgeny,

On Sun, Oct 5, 2008 at 9:26 PM, Yevgeny Kliteynik
[EMAIL PROTECTED] wrote:

Hi Sasha,

The following series of 6 patches implements unicast routing cache
in OpenSM.

This implementation (v2, previous version was sent before OFED 1.3)
was rewritten from scratch:
 - no caching of existing connectivity
 - no caching of existing lid matrices
 - each switch has an LFT buffer that contains the result of
  the last routing engine execution (instead of one buffer
  in ucast_mgr)
 - links/ports/nodes changes are spotted during the discovery
 - only the links/ports/nodes that went down are cached
 - when a switch goes down, its lid matrices and LFT are cached

In one of the following cases we can use cached routing
 - there is no topology change
 - one or more CAs disappeared
 - one or more leaf switches disappeared
In these cases cached routing is written to the switches as is
(unless the switch doesn't exist).
If there is any other topology change, existing cache is invalidated
and the routing engine(s) run as usual.
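
The decision described above can be sketched roughly as follows. This is 
only an illustration of the rule, not the code from the patches: the enum 
and the function name are hypothetical and do not exist in OpenSM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of what a sweep found; not OpenSM types. */
enum topo_change {
    TOPO_NO_CHANGE,
    TOPO_CA_DISAPPEARED,
    TOPO_LEAF_SWITCH_DISAPPEARED,
    TOPO_OTHER_CHANGE   /* anything outside the three cases above */
};

/* Returns true if every observed change is one of the cacheable cases. */
static bool cache_is_usable(const enum topo_change *changes, size_t n)
{
    assert(changes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (changes[i] == TOPO_OTHER_CHANGE)
            return false;   /* invalidate cache, run routing engine(s) */
    }
    return true;            /* cached routing can be written as is */
}
```

An empty change list counts as "no topology change", so the cache stays 
usable in that case.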


Glad to see this!

A few comments/questions:

It seems that there is an LFT cache per switch. This seems to be a big
memory penalty to me (in large subnets). So I have two questions
related to this:
Can it be done this way only when cached routing is being used?


Actually, I was thinking about something else:
Currently the switch LFT is implemented as osm_fwd_tbl_t.
I can remove the unnecessary complexity of the osm_fwd_tbl_t by replacing
it with a simple uint8_t array (same as LFT buffer). Then by simple
comparison I will check whether the recently calculated LFT
matches the switch's LFT, and if there is a match, then lft_buf
can be freed. In this case only the switches that have LFT different
from the recently calculated LFT will have both tables, which would be
rare and temporary - on the next heavy sweep the LFTs would match, and
lft_buf would be freed.
Effectively, there would be no memory penalty.
It can be done in a separate patch.
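
A minimal sketch of that comparison, assuming both tables are plain 
uint8_t arrays of the same size. The struct and function names here are 
hypothetical simplifications for illustration, not the real osm_switch_t 
or any existing OpenSM function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified switch record; not the real osm_switch_t. */
struct sw_sketch {
    uint8_t *lft;      /* table currently programmed on the switch */
    uint8_t *lft_buf;  /* result of the last routing engine run, or NULL */
    size_t   lft_size; /* number of entries in both tables */
};

/* Frees lft_buf when it matches the switch's LFT. Returns true if freed. */
static bool drop_lft_buf_if_same(struct sw_sketch *sw)
{
    assert(sw != NULL && sw->lft != NULL);
    if (sw->lft_buf == NULL)
        return false;           /* nothing to compare or free */
    if (memcmp(sw->lft, sw->lft_buf, sw->lft_size) != 0)
        return false;           /* tables differ: keep both for now */
    free(sw->lft_buf);
    sw->lft_buf = NULL;
    return true;
}
```

Only switches whose tables differ keep both buffers, which is the rare 
and temporary case described above.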


Also, when cached routing is being used, is this only needed for leaf switches?


No, it is needed for all the switches, because the cache can also
handle a fast reset of a non-leaf switch.


I'm wondering whether, when there is a cached node match, the available
peer ports/neighbors are validated (or something equivalent) to know
that the cache is valid. This might also include whether a switch is
still a leaf switch (which may be redundant, as that should show up as a
peer port/neighbor change). It looks like the structure is there for
this, but I didn't review the code in detail.


If I understood your question correctly, then yes, such validation
is done by the osm_ucast_cache_validate() function.
Can you describe in more detail the case that you are asking about?


Are you sure all the memory allocation failures are handled properly
within the routing cache code? What I mean is: when NULL is returned,
does this always result in the cache not being used and the routing
being recalculated? Also, in that case, should some log message be
emitted rather than hiding this?


I will check it.
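
For reference, the behavior being asked about could look roughly like 
the sketch below: on allocation failure the cache is marked invalid (so 
routing is recalculated) and a message is logged. All names here are 
hypothetical, the allocator is passed in only so that the failure path 
can be exercised, and stderr stands in for the real OpenSM log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache state; not the structure used in the patches. */
struct cache_sketch {
    bool valid;   /* false means: do not use cache, recalculate routing */
};

/* Helper that simulates memory exhaustion. */
static void *always_fail(size_t n)
{
    (void)n;
    return NULL;
}

/* Copies an LFT into the cache; on failure logs and invalidates cache. */
static uint8_t *cache_copy_lft(struct cache_sketch *cache,
                               const uint8_t *lft, size_t size,
                               void *(*alloc_fn)(size_t))
{
    assert(cache != NULL && lft != NULL && alloc_fn != NULL);
    uint8_t *copy = alloc_fn(size);
    if (copy == NULL) {
        fprintf(stderr, "ucast cache: allocation failed, "
                        "invalidating cache\n");
        cache->valid = false;
        return NULL;
    }
    memcpy(copy, lft, size);
    return copy;
}
```

Passing malloc as alloc_fn gives the normal path; passing always_fail 
shows the fallback, after which cache->valid is false.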


Nit: doc/current-routing.txt should also be updated for this feature.


OK, separate patch.

-- Yevgeny


-- Hal


The patches are:
 - patch 1/6: move lft_buf from ucast_mgr to osm_switch
 - patch 2/6: Add -A or --ucast_cache option to opensm
 - patch 3/6: adding osm_ucast_cache.{c,h} files (this is
  the cache implementation itself)
 - patch 4/6: adding new cache files to makefile
 - patch 5/6: integrating unicast cache into the discovery
  and ucast manager
 - patch 6/6: man entry for cached routing

-- Yevgeny





