
[ofa-general] ofa_1_3_kernel 20071220-0200 daily build status

2007-12-20 Thread Vladimir Sokolovsky (Mellanox)
This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_3/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:   --with-ipoib-mod --with-sdp-mod --with-srp-mod 
--with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-mlx4-mod 
--with-core-mod --with-addr_trans-mod  --with-rds-mod --with-cxgb3-mod 
--with-nes-mod

Passed:
Passed on i686 with 2.6.15-23-server
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.13
Passed on i686 with linux-2.6.14
Passed on i686 with linux-2.6.15
Passed on i686 with linux-2.6.12
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on x86_64 with linux-2.6.19
Passed on powerpc with linux-2.6.12
Passed on ppc64 with linux-2.6.16
Passed on ia64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ia64 with linux-2.6.23
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on ppc64 with linux-2.6.18
Passed on powerpc with linux-2.6.13
Passed on ppc64 with linux-2.6.15
Passed on ia64 with linux-2.6.13
Passed on ppc64 with linux-2.6.14
Passed on powerpc with linux-2.6.14
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on ppc64 with linux-2.6.12
Passed on powerpc with linux-2.6.15
Passed on x86_64 with linux-2.6.17
Passed on ia64 with linux-2.6.15
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.14
Passed on ia64 with linux-2.6.12
Passed on ppc64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.12
Passed on x86_64 with linux-2.6.13
Passed on ppc64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on ia64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.21.1
Passed on ppc64 with linux-2.6.13
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.15
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.14
Passed on x86_64 with linux-2.6.18-53.el5
Passed on ia64 with linux-2.6.16.21-0.8-default

Failed:
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] can you please add a new product to OpenFabrics Linux?

2007-12-20 Thread Dotan Barak

The product mstflint is missing.

The owner of this product is orenk.at.dev.mellanox.co.il


thanks
Dotan



[ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

2007-12-20 Thread Jack Morgenstein

background: see the "XRC Cleanup order issue" thread at

http://lists.openfabrics.org/pipermail/general/2007-December/043935.html

(A userspace process which created the receiving XRC QP on a given host dies
before other processes which still need to receive XRC messages on their SRQs,
which are paired with the now-destroyed receiving XRC QP.)

Solution: Add a userspace verb (as part of the XRC suite) which enables the
user process to create an XRC QP owned by the kernel -- which belongs to the
required XRC domain.

This QP will be destroyed when the XRC domain is closed (i.e., as part of an
ibv_close_xrc_domain call, but only when the domain's reference count goes to
zero).

Below, I give the new userspace API for this function.  Any feedback will be
appreciated.  This API will be implemented in the upcoming OFED 1.3 release,
so we need feedback ASAP.

Notes:
1. There is no query or destroy verb for this QP. There is also no userspace
   object for the QP. Userspace has ONLY the raw qp number to use when
   creating the (X)RC connection.

2. Since the QP is owned by kernel space, async events for this QP are also
   handled in kernel space (i.e., reported in /var/log/messages). There are no
   completion events for the QP, since it does not send, and all receive
   completions are reported in the XRC SRQ's CQ.

   If this QP enters the error state, the remote QP which sends will start
   receiving RETRY_EXCEEDED errors, so the application will be aware of the
   failure.

- Jack
==
/**
 * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as a receive-side only
 *  QP, and moves the created qp through the RESET->INIT and INIT->RTR
 *  transitions.
 *  (The RTR->RTS transition is not needed, since this QP does no sending.)
 *  The sending XRC QP uses this QP as destination, while specifying an XRC
 *  SRQ for actually receiving the transmissions and generating all
 *  completions on the receiving side.
 *
 *  This QP is created in kernel space, and persists until the XRC domain
 *  is closed (i.e., its reference count goes to zero).
 *
 * @pd: protection domain to use.  At lower layer, this provides access to
 *  the userspace obj
 * @xrc_domain: xrc domain to use for the QP.
 * @attr: modify-qp attributes needed to bring the QP to RTR.
 * @attr_mask:  bitmap indicating which attributes are provided in the attr
 *  struct.  Used for validity checking.
 * @xrc_rcv_qpn: qp_num of created QP (if success). To be passed to the
 *  remote node.  The remote node will use xrc_rcv_qpn in ibv_post_send
 *  when sending to XRC SRQ's on this host in the same xrc domain.
 *
 * RETURNS: success (0), or a (negative) error value.
 */

int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
                         struct ibv_xrc_domain *xrc_domain,
                         struct ibv_qp_attr *attr,
                         enum ibv_qp_attr_mask attr_mask,
                         uint32_t *xrc_rcv_qpn);

Notes:

1. Although the kernel creates the qp in the kernel's own PD, we still need
   the PD parameter to determine the device.

2. I chose to use struct ibv_qp_attr, which is used in modify QP, rather than
   create a new structure for this purpose.  This also guards against API
   changes in the event that during development I notice that more modify-qp
   parameters must be specified for this operation to work.

3. Table of the ibv_qp_attr parameters showing what values to set:

struct ibv_qp_attr {
    enum ibv_qp_state   qp_state;        Not needed
    enum ibv_qp_state   cur_qp_state;    Not needed
        -- Driver starts from RESET and takes qp to RTR.
    enum ibv_mtu        path_mtu;        Yes
    enum ibv_mig_state  path_mig_state;  Yes
    uint32_t            qkey;            Yes
    uint32_t            rq_psn;          Yes
    uint32_t            sq_psn;          Not needed
    uint32_t            dest_qp_num;     Yes -- this is the remote side QP
                                         for the RC conn.
    int                 qp_access_flags; Yes
    struct ibv_qp_cap   cap;             Need only XRC domain.  Other caps
                                         will use hard-coded values:
                                             max_send_wr = 1;
                                             max_recv_wr = 0;
                                             max_send_sge = 1;
                                             max_recv_sge = 0;
                                             max_inline_data = 0;
    struct ibv_ah_attr  ah_attr;         Yes
    struct ibv_ah_attr  alt_ah_attr;     Optional
    uint16_t

Re: [ofa-general] Re: some questions on stale connection handling at the IB CM

2007-12-20 Thread Or Gerlitz

Sean Hefty wrote:

So in the case of a lost DREQ etc., in cm_match_req() we will pass the
check for duplicate REQs but trip the check for stale connections,
and this can happen in an endless loop? This seems like a bug to me.



This problem isn't limited to stale connections.  If a client tries to
connect, gets a reject for whatever reason, ignores the reject, then
tries to reconnect with the same parameters, then they've put themselves
into an endless loop.


I don't follow: if they don't ignore the reject, but reuse the same QP 
for their successive connection requests, each new REQ will pass the ID 
check (duplicate REQs) but will fail on the remote QPN check, correct? 
So what can a client do to avoid falling into that? What does it mean to not 
ignore the reject? Note that even if, on getting a reject, they release 
the qp and allocate a new one, they can get the same qp number.



Yes, this seems to be able to solve the keep-alive thing in a generic
fashion for all ULPs using the IB CM, will you be able to look on this
during the next weeks or so?



This method can be used by apps today.  The only enhancement that I can
see being made is having the CM automatically send the messages at
regular intervals.  But I hesitate to add this to the CM since it
doesn't have knowledge of traffic occurring over the QP, and may
interfere with an app wanting to actually change alternate path information.


You mean one side sends a LAP message with the current path and the 
peer replies with an APR message confirming this is fine? I guess this LAP 
sending has to be carried out by both sides, correct? And it's not supported 
for RDMA-CM users...


As for your comments: assuming an app must notify the CM when it no longer 
uses a QP (and if not, we declare it an RTFM bug), then as long as the 
QP is alive from the CM's viewpoint, it's perfectly fine to send these 
LAPs; doing this once every few seconds or tens of seconds will not 
create heavy load, I think. As for the point of interfering with apps 
that want to use LAP/APR for an APM implementation over their protocols, we 
can let the CM consumer specify whether they want the CM to issue keep-alives 
for them, and with what frequency to send the messages.


Or.




Re: [ofa-general] peer to peer connections support

2007-12-20 Thread Or Gerlitz

Kanevsky, Arkady wrote:

Are you proposing that rdma_cm try to separate two cases:
one where the two sides are each trying to set up a connection to the other
side, vs. one where the two sides are trying to set up a single connection
but each side issues a connection request?


I am not proposing anything now, but rather trying to understand with Sean 
what his vision of a possible API is.



Isn't it easier to handle in MPI which has a unique rank so only one
side issues a connection request?


That applies to MPI schemes that do an all-to-all connect on job start; I am 
referring to the case of connections established on demand.


Or.



***SPAM*** RE: [ofa-general] ***SPAM*** SFS 3012 SRP problem

2007-12-20 Thread Jeroen Van Aken
We are using 2 IBM FAStT900's.

Normally the timestamps of the messages on both the SFS and the IB host
match.

Thanks

 

jeroen

 

From: Scott Weitzenkamp (sweitzen) [mailto:[EMAIL PROTECTED] 
Sent: woensdag 19 december 2007 18:32
To: Jeroen Van Aken; general@lists.openfabrics.org
Subject: RE: [ofa-general] ***SPAM*** SFS 3012 SRP problem

 

If you have a Cisco support contract, you should open a case with the Cisco
TAC.

 

What kind of FC storage are you using?

 

The chassis syslog messages show the host is unresponsive (the OUT_SERVICE
and IN_SERVICE messages).  Does the timing of these messages match the ib_srp
messages on the host?

 

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

 

 


  _  


From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Jeroen Van Aken
Sent: Wednesday, December 19, 2007 6:54 AM
To: general@lists.openfabrics.org
Subject: [ofa-general] ***SPAM*** SFS 3012 SRP problem

Hello

 

We are doing some SRP tests with the Cisco SFS 3012 Gateway. We connected 4
hosts, each with 2 infiniband cables on one dual infiniband card to the
SFS3012 gateway. The gateway is also connected to our fibre channel storage.
The ofed used is OFED-1.3-beta2 on each of the hosts. The InfiniBand cards
used are Mellanox Technologies MT25208 InfiniHost III Ex (rev a0) and
Mellanox Technologies MT23108 InfiniHost (rev a1) cards.

When generating heavy load over the switch (by reading from our FC storage
over all the luns simultaneously), we sometimes get the following errors:

On the hosts: 

 

Dec 13 13:07:54 gpfs4n1 syslog-ng[8212]: STATS: dropped 0

Dec 13 13:20:26 gpfs4n1 run_srp_daemon[8422]: failed srp_daemon:
[HCA=mthca0] [port=1] [exit status=110]. Will try to restart srp_daemon
periodically. No more warnings will be issued in the next 7200 seconds if
the same problem repeats

Dec 13 13:20:27 gpfs4n1 run_srp_daemon[8428]: starting srp_daemon:
[HCA=mthca0] [port=1]

Dec 13 14:01:20 gpfs4n1 sshd[8539]: Accepted keyboard-interactive/pam for
root from 172.16.0.18 port 3545 ssh2

Dec 13 14:07:55 gpfs4n1 syslog-ng[8212]: STATS: dropped 0

Dec 13 14:13:01 gpfs4n1 syslog-ng[8212]: Changing permissions on special
file /dev/xconsole

Dec 13 14:13:01 gpfs4n1 syslog-ng[8212]: Changing permissions on special
file /dev/tty10

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:01 gpfs4n1 kernel: SRP abort called

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed send status 12

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

Dec 13 14:13:02 gpfs4n1 kernel: ib_srp: failed receive status 5

 

On the switch ts_log

***SWITCH LOG***

Dec 13 14:04:30 topspin-cc ib_sm.x[1357]: [INFO]: Configuration caused by
multicast membership change

Dec 13 14:05:49 topspin-cc ib_sm.x[1383]: [INFO]: Session not initiated:
Cold Sync Limit exceeded for Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:07:49 topspin-cc ib_sm.x[1383]: [INFO]: Initialize a backup
session with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:07:59 topspin-cc ib_sm.x[1383]: [INFO]: Session initialization
failed with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:09:59 topspin-cc ib_sm.x[1383]: [INFO]: Initialize a backup
session with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:10:09 topspin-cc ib_sm.x[1383]: [INFO]: Session initialization
failed with Standby SM guid 00:05:ad:00:00:08:94:5d

Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM OUT_OF_SERVICE
trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:1d:ce:21

Dec 13 14:12:06 topspin-cc ib_sm.x[1357]: [INFO]: Generate SM OUT_OF_SERVICE
trap for 

Re: [ofa-general] smpquery regression in 1.3-rc1

2007-12-20 Thread Hal Rosenstock
On Thu, 2007-12-20 at 13:42 +0200, Yevgeny Kliteynik wrote:
 Hal Rosenstock wrote:
  On Wed, 2007-12-19 at 11:58 -0800, [EMAIL PROTECTED] wrote:
  We're seeing a regression in smpquery from alpha2 to rc1. 
 
  For example, with alpha2 I get:
  grommit:~ # smpquery -G nodeinfo 0x66a01a000737c
  # Node info: Lid 3
  BaseVers:1
  ClassVers:...1
  NodeType:Channel Adapter
  NumPorts:2
  SystemGuid:..0x00066a009800737c
  Guid:0x00066a009800737c
  PortGuid:0x00066a01a000737c
  PartCap:.64
  DevId:...0x6278
  Revision:0x00a0
  LocalPort:...2
  VendorId:0x00066a
  grommit:~ # 
 
 
  And with rc1, I get:
  grommit:~ # smpquery -G nodeinfo 0x66a01a000737c
  ibwarn: [5650] ib_path_query: sa call path_query failed
  smpquery: iberror: failed: can't resolve destination port 0x66a01a000737c
  grommit:~ #  
 
  But using a LID works fine:
  grommit:~ # smpquery nodeinfo 3
  # Node info: Lid 3
  BaseVers:1
  ClassVers:...1
  NodeType:Channel Adapter
  NumPorts:2
  SystemGuid:..0x00066a009800737c
  Guid:0x00066a009800737c
  PortGuid:0x00066a01a000737c
  PartCap:.64
  DevId:...0x6278
  Revision:0x00a0
  LocalPort:...2
  VendorId:0x00066a
  grommit:~ # 
 
  Strangest of all, running it under strace also works:
  grommit:~ # strace smpquery -G nodeinfo 0x66a01a000737c > /tmp/smpquery.out
  .
  grommit:~ # cat /tmp/smpquery.out
  # Node info: Lid 3
  BaseVers:1
  ClassVers:...1
  NodeType:Channel Adapter
  NumPorts:2
  SystemGuid:..0x00066a009800737c
  Guid:0x00066a009800737c
  PortGuid:0x00066a01a000737c
  PartCap:.64
  DevId:...0x6278
  Revision:0x00a0
  LocalPort:...2
  VendorId:0x00066a
  grommit:~ #
 
  Some weird race condition...
 
  Anyone else seeing the same?
  
  -G requires a SA path record lookup so this could be an issue with that
  timing out in some cases (assuming the port is active and the SM is
  operational).
 
 I'm seeing the same problem.
 Sometimes the query works, and sometimes it doesn't.
 I also see that when the query fails, OpenSM doesn't get PathRecord query at 
 all.
 
 Hal, can you elaborate on that "timing out in some cases" issue?

I just meant that the SM not responding (for an unknown reason right
now) would yield this effect.

 Adding Jack for the libibmad issue:
 
 I see that the ib_path_query() in libibmad/sa.c sometimes fails
 when calling safe_sa_call().

This could just be more detail on the same thing in terms of the
(smpquery) client which is layered on top of libibmad: the SA path query
timeout.

I would suggest running OpenSM in verbose mode (both instances are with
OpenSM) and seeing if it responds to the PathRecord query used by this
form of smpquery and continue troubleshooting from there based on the
result.

-- Hal

 -- Yevgeny
 
  -- Hal
  
 


[ofa-general] [PATCH] IB/ehca: Forward event client-reregister-required to registered clients

2007-12-20 Thread Hoang-Nam Nguyen
This patch allows ehca to forward the client-reregister-required event to
registered clients. Such an event is generated by the switch, e.g., after
its reboot.

Signed-off-by: Hoang-Nam Nguyen [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ehca/ehca_irq.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c
index 3f617b2..4c734ec 100644
--- a/drivers/infiniband/hw/ehca/ehca_irq.c
+++ b/drivers/infiniband/hw/ehca/ehca_irq.c
@@ -62,6 +62,7 @@
 #define NEQE_PORT_NUMBER       EHCA_BMASK_IBM( 8, 15)
 #define NEQE_PORT_AVAILABILITY EHCA_BMASK_IBM(16, 16)
 #define NEQE_DISRUPTIVE        EHCA_BMASK_IBM(16, 16)
+#define NEQE_SPECIFIC_EVENT    EHCA_BMASK_IBM(16, 23)
 
 #define ERROR_DATA_LENGTH      EHCA_BMASK_IBM(52, 63)
 #define ERROR_DATA_TYPE        EHCA_BMASK_IBM( 0,  7)
@@ -354,6 +355,7 @@ static void parse_ec(struct ehca_shca *shca, u64 eqe)
 {
 	u8 ec   = EHCA_BMASK_GET(NEQE_EVENT_CODE, eqe);
 	u8 port = EHCA_BMASK_GET(NEQE_PORT_NUMBER, eqe);
+	u8 spec_event;
 
 	switch (ec) {
 	case 0x30: /* port availability change */
@@ -394,6 +396,16 @@ static void parse_ec(struct ehca_shca *shca, u64 eqe)
 	case 0x33:  /* trace stopped */
 		ehca_err(&shca->ib_device, "Traced stopped.");
 		break;
+	case 0x34: /* util async event */
+		spec_event = EHCA_BMASK_GET(NEQE_SPECIFIC_EVENT, eqe);
+		if (spec_event == 0x80) /* client reregister required */
+			dispatch_port_event(shca, port,
+					    IB_EVENT_CLIENT_REREGISTER,
+					    "client reregister req.");
+		else
+			ehca_warn(&shca->ib_device, "Unknown util async "
+				  "event %x on port %x", spec_event, port);
+		break;
 	default:
 		ehca_err(&shca->ib_device, "Unknown event code: %x on %s.",
 			 ec, shca->ib_device.name);
-- 
1.5.2





Re: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

2007-12-20 Thread Pavel Shamis (Pasha)

Adding Open MPI and MVAPICH community to the thread.

Pasha (Pavel Shamis)


[ofa-general] Re: [PATCH] opensm: osm_state_mgr.c - stop idle queue processing if heavy sweep requested

2007-12-20 Thread Sasha Khapyorsky
On 09:40 Wed 19 Dec , Yevgeny Kliteynik wrote:
  Sasha Khapyorsky wrote:
  Hi Yevgeny,
  On 15:33 Mon 17 Dec , Yevgeny Kliteynik wrote:
  If a heavy sweep requested during idle queue processing, OSM continues
  to process it till the end and only then notices the heavy sweep request.
  In some cases this might leave a topology change unhandled for several
  minutes.
  Could you provide more details about such cases?
  As far as I know the idle queue is used only for multicast re-routing.
  If so, it is interesting by itself why it takes minutes and where. Is
  there an MCG join/leave storm?
 
  Exactly. The problem was discovered on a big cluster with hundreds of mcast
  groups, when there is some massive change in the subnet (like rebooting
  hundreds of nodes).

Ok, then the proposed patch looks like half a solution to me.

During an mcast join/leave storm the idle queue will be filled with requests
to rebuild mcast routing. OpenSM will process them one by one (and this will
take a lot of time) instead of processing all pended mcast groups in one
run. I think that is the first improvement needed here.

Even with such an improvement we will not be able to control the order of
heavy sweep/mcast join requests, so basically the idea of breaking idle
queue processing looks fine to me, but it is not all that should be
done here. A heavy sweep by itself recalculates mcast routing for all
existing groups, so it should invalidate all pended mcast rerouting
requests instead of continuing idle queue processing after the heavy
sweep. Makes sense?

Sasha

 
  -- Yevgeny
 
  Or single re-routing cycle takes minutes?
  Sasha
  Signed-off-by:  Yevgeny Kliteynik [EMAIL PROTECTED]
  ---
   opensm/opensm/osm_state_mgr.c |   31 ---
   1 files changed, 24 insertions(+), 7 deletions(-)
 
  diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
  index 5c39f11..6ee5ee6 100644
  --- a/opensm/opensm/osm_state_mgr.c
  +++ b/opensm/opensm/osm_state_mgr.c
  @@ -1607,13 +1607,30 @@ void osm_state_mgr_process(IN osm_state_mgr_t * const p_mgr,
   			/* CALL the done function */
   			__process_idle_time_queue_done(p_mgr);
   
  -			/*
  -			 * Set the signal to OSM_SIGNAL_IDLE_TIME_PROCESS
  -			 * so that the next element in the queue gets processed
  -			 */
  -
  -			signal = OSM_SIGNAL_IDLE_TIME_PROCESS;
  -			p_mgr->state = OSM_SM_STATE_PROCESS_REQUEST;
  +			if (p_mgr->p_subn->force_immediate_heavy_sweep) {
  +				/*
  +				 * Do not read next item from the idle queue.
  +				 * Immediate heavy sweep is requested, so it's
  +				 * more important.
  +				 * Besides, there is a chance that after the
  +				 * heavy sweep completion, idle queue processing
  +				 * that SM would have performed here will be obsolete.
  +				 */
  +				if (osm_log_is_active(p_mgr->p_log, OSM_LOG_DEBUG))
  +					osm_log(p_mgr->p_log, OSM_LOG_DEBUG,
  +						"osm_state_mgr_process: "
  +						"interrupting idle time queue processing - heavy sweep requested\n");
  +				signal = OSM_SIGNAL_NONE;
  +				p_mgr->state = OSM_SM_STATE_IDLE;
  +			}
  +			else {
  +				/*
  +				 * Set the signal to OSM_SIGNAL_IDLE_TIME_PROCESS
  +				 * so that the next element in the queue gets processed
  +				 */
  +				signal = OSM_SIGNAL_IDLE_TIME_PROCESS;
  +				p_mgr->state = OSM_SM_STATE_PROCESS_REQUEST;
  +			}
   			break;
   
   		default:
  -- 
  1.5.1.4
 
 
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [ofa-general] peer to peer connections support

2007-12-20 Thread Or Gerlitz

Sean Hefty wrote:
...

I didn't follow this.

...

Peer to peer SIDs are in a different domain than client/server SIDs, and
the peer_to_peer field is used to indicate which domain a SID is in.


Sorry if I wasn't clear; let me see if I understand you: with this
different-domain implementation, under client/server the passive side
calls cm listen and the active side calls cm connect, whereas under
peer-to-peer both sides call cm listen and later both sides, or only
one side, may call cm connect, correct?


To add to my comments on the CM API, struct ib_cm_req_param, which is
used to send the REQ, includes service_id and peer_to_peer fields.  The
latter is a boolean used by the CM to distinguish if incoming REQs can
be matched with the outgoing REQ.


OK, this makes things clearer.


Why there should be a difference between the rdma-cm to the cm? if in
the cm you have a model without API change, wouldn't it apply also to
the rdma-cm?



The rdma_cm does not know how to set the peer_to_peer field in the
ib_cm_req_param.  It sets this field to 0 today.


But it could set it to one as well... assuming my understanding above of
the suggested implementation is correct, we can change the RDMA-CM API
to let users specify on rdma_connect that they want peer to peer
support, so such apps can issue an rdma_listen call and later call
rdma_connect with this bit set, and they are done (or almost done... I
guess there is some more devil in the details here, isn't there?)


  I think that in the MPI world each rank gets a SID from the local CM and
  they exchange the SIDs out-of-band, then connections are opened. If it's
  a connection-on-demand scheme, then whenever the rank process calls
  mpi_send() to a peer for which the local MPI library does not have a
  connection, it tries to connect. So if this happens at once between
  some pair of ranks, there should be a way to form one connection out of
  these two connecting requests. My thinking/motivation is that support of
  this scheme should be in the IB stack (cm and rdma-cm) level and not in
  the specific MPI implementation level.

Are the out of band connections used by MPI formed using client/server
or peer to peer?  I believe that Intel MPI has each rank listen for
connections from the ranks below it using client/server.


yes, MPIs that do all-to-all connect on job start typically use
client/server, where all the ranks > 0 issue a listen call and then all
lower ranks connect to higher ranks, or some other symmetry-breaking
scheme. I am trying to see what needs to be supported by the IB stack to
let MPIs that do connect on demand use the RDMA-CM.



There are a couple of problems with the peer to peer model.  First,
unless the connections occur at exactly the same time, they miss
connecting (rejected with invalid SID).  


This makes the peer to peer model useless, since an app cannot make
sure that connections occur at exactly the same time! My understanding of
the spec is that the peer to peer model has the ability to handle
connections that occur at exactly the same time, but not only those.



Second, if multiple peer to
peer connections need to form between the same pair of nodes, things can
go screwy (that's the technical term) trying to match up the peer requests.


Under MPI each rank uses a different SID, so I think we are safe from 
this problem.


Or







RE: [ofa-general] peer to peer connections support

2007-12-20 Thread Kanevsky, Arkady
So in a nutshell, the proposal is to add some identifier into the CM
private data which indicates that it is the peer-to-peer model, plus
unique peer IDs for the requested connection.

Is this the model?
Thanks,

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300
 

 -Original Message-
 From: Or Gerlitz [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, December 20, 2007 10:09 AM
 To: Sean Hefty
 Cc: OpenFabrics General
 Subject: Re: [ofa-general] peer to peer connections support
 
 Sean Hefty wrote:
 ...
  I didn't follow this.
 ...
  Peer to peer SIDs are in a different domain than 
 client/server SIDs, 
  and the peer_to_peer field is used to indicate which domain 
 a SID is in.
 
 Sorry if I wasn't clear, let me see if I understand you: with 
 this different domain implementation, under both 
 client/server the passive calls cm listen and the active call 
 cm connect, where under peer/to/peer both sides call cm 
 listen and later both sides may call cm connect or only one 
 side, correct?
 
  To add to my comments on the CM API, struct 
 ib_cm_req_param, which is 
  used to send the REQ, includes service_id and peer_to_peer fields.  
  The latter is a boolean used by the CM to distinguish if 
 incoming REQs 
  can be matched with the outgoing REQ.
 
 OK, this makes things clearer.
 
  Why there should be a difference between the rdma-cm to 
 the cm? if in 
  the cm you have a model without API change, wouldn't it 
 apply also to 
  the rdma-cm?
 
  The rdma_cm does not know how to set the peer_to_peer field in the 
  ib_cm_req_param.  It sets this field to 0 today.
 
 But it could set it to one as well... assuming my 
 understanding above of the suggested implementation is 
 correct, we can change the RDMA-CM API to let users specify 
 on rdma_connect that they want peer to peer support, so such 
 apps can issue rdma_listen call and later call rdma_connect 
 with this bit set and they are done (or almost done... I 
 guess there some more devil in the details here, isn't it?)
 
I think that in the MPI world each rank gets a SID from 
 the local 
  CM and   they exchange the SIDs out-of-band, then connections are 
  opened. If its   a connection-on-demand scheme, then when ever the 
  rank process calls   mpi_send() to peer for which the local MPI 
  library does not have a   connection, it tries to connect. 
 So if this 
  happens at once between   some pair of ranks, there 
 should be a way 
  to form one connection out of   these two connecting requests. My 
  thinking/motivation is that support of   this scheme 
 should be in the 
  IB stack (cm and rdma-cm) level and not in   the specific 
 MPI implementation level.
  
  Are the out of band connections used by MPI formed using 
 client/server 
  or peer to peer?  I believe that Intel MPI has each rank listen for 
  connections from the ranks below it using client/server.
 
 yes, MPIs that do all-to-all-connect on job start, typically 
 use client/server where all the ranks > 0 issue listen call 
 and then all lower ranks connect to higher ranks or etc some 
 other symmetry breaking scheme. I am trying to see what needs 
 to be supported by the IB stack to let MPIs that do connect 
 on demand use the RDMA-CM.
 
  There are a couple of problems with the peer to peer model.  First, 
  unless the connections occur at exactly the same time, they miss 
  connecting (rejected with invalid SID).
 
 This makes the all peer to peer model useless, since an app 
 can not make sure that connection occur at exactly the same 
 time! my understanding of the spec is that peer to peer model 
 has the ability to handle also connections that occur at 
 exactly the same time but not only.
 
  Second, if multiple peer to
  peer connections need to form between the same pair of 
 nodes, things 
  can go screwy (that's the technical term) trying to match 
 up the peer requests.
 
 Under MPI each rank uses a different SID, so I think we are 
 safe from this problem.
 
 Or
 
 
 
 
 
 


Re: [ofa-general] smpquery regression in 1.3-rc1

2007-12-20 Thread Yevgeny Kliteynik

Hal Rosenstock wrote:

On Thu, 2007-12-20 at 13:42 +0200, Yevgeny Kliteynik wrote:

Hal Rosenstock wrote:

On Wed, 2007-12-19 at 11:58 -0800, [EMAIL PROTECTED] wrote:
We're seeing a regression in smpquery from alpha2 to rc1. 


For example, with alpha2 I get:
grommit:~ # smpquery -G nodeinfo 0x66a01a000737c
# Node info: Lid 3
BaseVers:1
ClassVers:...1
NodeType:Channel Adapter
NumPorts:2
SystemGuid:..0x00066a009800737c
Guid:0x00066a009800737c
PortGuid:0x00066a01a000737c
PartCap:.64
DevId:...0x6278
Revision:0x00a0
LocalPort:...2
VendorId:0x00066a
grommit:~ # 



And with rc1, I get:
grommit:~ # smpquery -G nodeinfo 0x66a01a000737c
ibwarn: [5650] ib_path_query: sa call path_query failed
smpquery: iberror: failed: can't resolve destination port 0x66a01a000737c
grommit:~ #  


But using a LID works fine:
grommit:~ # smpquery nodeinfo 3
# Node info: Lid 3
BaseVers:1
ClassVers:...1
NodeType:Channel Adapter
NumPorts:2
SystemGuid:..0x00066a009800737c
Guid:0x00066a009800737c
PortGuid:0x00066a01a000737c
PartCap:.64
DevId:...0x6278
Revision:0x00a0
LocalPort:...2
VendorId:0x00066a
grommit:~ # 


Strangest of all, running it under strace also works:
grommit:~ # strace smpquery -G nodeinfo 0x66a01a000737c > /tmp/smpquery.out 
.

grommit:~ # cat /tmp/smpquery.out
# Node info: Lid 3
BaseVers:1
ClassVers:...1
NodeType:Channel Adapter
NumPorts:2
SystemGuid:..0x00066a009800737c
Guid:0x00066a009800737c
PortGuid:0x00066a01a000737c
PartCap:.64
DevId:...0x6278
Revision:0x00a0
LocalPort:...2
VendorId:0x00066a
grommit:~ #

Some weird race condition...

Anyone else seeing the same?

-G requires a SA path record lookup so this could be an issue with that
timing out in some cases (assuming the port is active and the SM is
operational).

I'm seeing the same problem.
Sometimes the query works, and sometimes it doesn't.
I also see that when the query fails, OpenSM doesn't get PathRecord query at 
all.

Hal, can you elaborate on that timing out in some cases issue?


I just meant that the SM not responding (for an unknown reason right
now) would yield this effect.


Adding Jack for the libibmad issue:

I see that the ib_path_query() in libibmad/sa.c sometimes fails
when calling safe_sa_call().


This could just be more detail on the same thing in terms of the
(smpquery) client which is layered on top of libibmad: the SA path query
timeout.
I would suggest running OpenSM in verbose mode (both instances are with
OpenSM) and seeing if it responds to the PathRecord query used by this
form of smpquery and continue troubleshooting from there based on the
result.


This is actually what I was saying here.
I have *debugged* smpquery, and saw that the failing function is
ib_path_query() in libibmad/sa.c.
As I've mentioned, I did run it with OpenSM in verbose mode, and saw
that when smpquery fails, OpenSM log does not have any PathRecord request.
When smpquery passes, I see the PathRecord request and response in the
OpenSM log.

-- Yevgeny


-- Hal


-- Yevgeny


-- Hal


RE: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

2007-12-20 Thread Tang, Changqing

Jack:
	Thanks for adding this new function; this is what we need. There is one
issue I want to make clear:

	This new kernel owned QP will be destroyed when the XRC domain is closed
(i.e., as part of a ibv_close_xrc_domain call, but only when the domain's
reference count goes to zero)

	If I have MPI server processes on a node, many other MPI client
processes will dynamically connect/disconnect with the server. The server uses
the same XRC domain.

	Will this cause the kernel QPs to accumulate for such an application? We
want the server to run 365 days a year.


Thanks.
--CQ




 -Original Message-
 From: Pavel Shamis (Pasha) [mailto:[EMAIL PROTECTED]
 Sent: Thursday, December 20, 2007 9:15 AM
 To: Jack Morgenstein
 Cc: Tang, Changqing; Roland Dreier;
 general@lists.openfabrics.org; Open MPI Developers;
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
 independent of any one user process

 Adding Open MPI and MVAPICH community to the thread.

 Pasha (Pavel Shamis)

 Jack Morgenstein wrote:
  background:  see XRC Cleanup order issue thread at
 
 
 
 http://lists.openfabrics.org/pipermail/general/2007-December/043935.ht
  ml
 
  (userspace process which created the receiving XRC qp on a
 given host
  dies before other processes which still need to receive XRC
 messages
  on their SRQs which are paired with the now-destroyed
 receiving XRC
  QP.)
 
  Solution: Add a userspace verb (as part of the XRC suite) which
  enables the user process to create an XRC QP owned by the
 kernel -- which belongs to the required XRC domain.
 
  This QP will be destroyed when the XRC domain is closed
 (i.e., as part
  of a ibv_close_xrc_domain call, but only when the domain's
 reference count goes to zero).
 
  Below, I give the new userspace API for this function.  Any
 feedback will be appreciated.
  This API will be implemented in the upcoming OFED 1.3
 release, so we need feedback ASAP.
 
  Notes:
  1. There is no query or destroy verb for this QP. There is
 also no userspace object for the
 QP. Userspace has ONLY the raw qp number to use when
 creating the (X)RC connection.
 
  2. Since the QP is owned by kernel space, async events
 for this QP are also handled in kernel
 space (i.e., reported in /var/log/messages). There are
 no completion events for the QP, since
 it does not send, and all receives completions are
 reported in the XRC SRQ's cq.
 
 If this QP enters the error state, the remote QP which
 sends will start receiving RETRY_EXCEEDED
 errors, so the application will be aware of the failure.
 
  - Jack
 
 ==
  
  /**
   * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as
 a receive-side only QP,
   *and moves the created qp through the RESET-INIT and
 INIT-RTR transitions.
   *  (The RTR-RTS transition is not needed, since this
 QP does no sending).
   *The sending XRC QP uses this QP as destination, while
 specifying an XRC SRQ
   *for actually receiving the transmissions and
 generating all completions on the
   *receiving side.
   *
   *This QP is created in kernel space, and persists
 until the XRC domain is closed.
   *(i.e., its reference count goes to zero).
   *
   * @pd: protection domain to use.  At lower layer, this provides
  access to userspace obj
   * @xrc_domain: xrc domain to use for the QP.
   * @attr: modify-qp attributes needed to bring the QP to RTR.
   * @attr_mask:  bitmap indicating which attributes are
 provided in the attr struct.
   *used for validity checking.
   * @xrc_rcv_qpn: qp_num of created QP (if success). To be
 passed to the remote node. The
   *   remote node will use xrc_rcv_qpn in
 ibv_post_send when sending to
   * XRC SRQ's on this host in the same xrc domain.
   *
   * RETURNS: success (0), or a (negative) error value.
   */
 
  int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
 struct ibv_xrc_domain *xrc_domain,
 struct ibv_qp_attr *attr,
 enum ibv_qp_attr_mask attr_mask,
 uint32_t *xrc_rcv_qpn);
 
  Notes:
 
  1. Although the kernel creates the qp in the kernel's own
 PD, we still need the PD
 parameter to determine the device.
 
  2. I chose to use struct ibv_qp_attr, which is used in
 modify QP, rather than create
 a new structure for this purpose.  This also guards
 against API changes in the event
 that during development I notice that more modify-qp
 parameters must be specified
 for this operation to work.
 
  3. Table of the ibv_qp_attr parameters showing what values to set:
 
  struct ibv_qp_attr {
enum ibv_qp_state   qp_state;   Not needed
enum ibv_qp_state   cur_qp_state;   Not needed
-- Driver starts from RESET and takes qp to RTR.
enum ibv_mtu

[ofa-general] Re: [PATCH] opensm: osm_state_mgr.c - stop idle queue processing if heavy sweep requested

2007-12-20 Thread Yevgeny Kliteynik

Sasha Khapyorsky wrote:

On 09:40 Wed 19 Dec , Yevgeny Kliteynik wrote:

 Sasha Khapyorsky wrote:

Hi Yevgeny,
On 15:33 Mon 17 Dec , Yevgeny Kliteynik wrote:

If a heavy sweep is requested during idle queue processing, OSM continues
to process it till the end and only then notices the heavy sweep request.
In some cases this might leave a topology change unhandled for several
minutes.

Could you provide more details about such cases?
As far as I know the idle queue is used only for multicast re-routing.
If so, it is interesting by itself why it takes minutes and where. Is
there an MCG join/leave storm?
 Exactly. The problem was discovered on a big cluster with hundreds of mcast
 groups, when there is some massive change in the subnet (like rebooting
 hundreds of nodes).


Ok, then the proposed patch looks like half a solution to me.

During an mcast join/leave storm the idle queue will be filled with requests
to rebuild mcast routing. OpenSM will process them one by one (and this will
take a lot of time) instead of processing all pended mcast groups in one
run. I think that is the first improvement needed here.

Even with such an improvement we will not be able to control the order of
heavy sweep/mcast join requests, so basically the idea of breaking idle
queue processing looks fine to me, but it is not all that should be
done here. A heavy sweep by itself recalculates mcast routing for all
existing groups, so it should invalidate all pended mcast rerouting
requests instead of continuing idle queue processing after the heavy
sweep. Makes sense?


OK, makes sense.
So bottom line: when breaking the idle queue processing because of an immediate
sweep request, the state manager should just purge the whole idle queue and then
start the new heavy sweep.

I'll work on it.

-- Yevgeny


Sasha


 -- Yevgeny


Or single re-routing cycle takes minutes?
Sasha

Signed-off-by:  Yevgeny Kliteynik [EMAIL PROTECTED]
---
 opensm/opensm/osm_state_mgr.c |   31 ---
 1 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 5c39f11..6ee5ee6 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1607,13 +1607,30 @@ void osm_state_mgr_process(IN osm_state_mgr_t * const p_mgr,
 
 			/* CALL the done function */
 			__process_idle_time_queue_done(p_mgr);
 
-			/*
-			 * Set the signal to OSM_SIGNAL_IDLE_TIME_PROCESS
-			 * so that the next element in the queue gets processed
-			 */
-
-			signal = OSM_SIGNAL_IDLE_TIME_PROCESS;
-			p_mgr->state = OSM_SM_STATE_PROCESS_REQUEST;
+			if (p_mgr->p_subn->force_immediate_heavy_sweep) {
+				/*
+				 * Do not read next item from the idle queue.
+				 * Immediate heavy sweep is requested, so it's
+				 * more important.
+				 * Besides, there is a chance that after the
+				 * heavy sweep completion, idle queue processing
+				 * that SM would have performed here will be obsolete.
+				 */
+				if (osm_log_is_active(p_mgr->p_log, OSM_LOG_DEBUG))
+					osm_log(p_mgr->p_log, OSM_LOG_DEBUG,
+						"osm_state_mgr_process: "
+						"interrupting idle time queue processing - heavy sweep requested\n");
+				signal = OSM_SIGNAL_NONE;
+				p_mgr->state = OSM_SM_STATE_IDLE;
+			}
+			else {
+				/*
+				 * Set the signal to OSM_SIGNAL_IDLE_TIME_PROCESS
+				 * so that the next element in the queue gets processed
+				 */
+				signal = OSM_SIGNAL_IDLE_TIME_PROCESS;
+				p_mgr->state = OSM_SM_STATE_PROCESS_REQUEST;
+			}
 			break;
 
 		default:
--
1.5.1.4







[ofa-general] Re: [PATCH] opensm: osm_state_mgr.c - stop idle queue processing if heavy sweep requested

2007-12-20 Thread Sasha Khapyorsky
On 18:41 Thu 20 Dec , Yevgeny Kliteynik wrote:
  Sasha Khapyorsky wrote:
  On 09:40 Wed 19 Dec , Yevgeny Kliteynik wrote:
   Sasha Khapyorsky wrote:
  Hi Yevgeny,
  On 15:33 Mon 17 Dec , Yevgeny Kliteynik wrote:
  If a heavy sweep requested during idle queue processing, OSM continues
  to process it till the end and only then notices the heavy sweep 
  request.
  In some cases this might leave a topology change unhandled for several
  minutes.
  Could you provide more details about such cases?
  As far as I know the idle queue is used only for multicast re-routing.
  If so, it is interesting by itself why it takes minutes and where. Is
  where MCG join/leave storm?
   Exactly. The problem was discovered on a big cluster with hundreds of 
  mcast  groups,
   when there is some massive change in the subnet (like rebooting hundreds 
  of  nodes).
  Ok, then proposed patch looks like half solution for me.
  During mcast join/leave storm idle queue will be filled with requests to
  rebuild mcast routing. OpenSM will process it one by one (and this will
  take a lot of time) instead of process all pended mcast groups in one
  run. I think it is first improvement needed here.
  Even with such improvement we will not be able to control the order of
  heavy sweep/mcast join requests, so basically idea of breaking idle
  queue processing looks fine for me, but it is not all what should be
  done here. Heavy sweep by itself recalculates mcast routing for all
  existing groups, it should invalidate all pended mcast rerouting
  requests instead of continuing idle queue processing after heavy
  sweep. Make sense?
 
  OK, makes sense.
  So bottom line, when breaking the idle queue processing because of immediate
  sweep request, state manager should just purge the whole idle queue and then
  start the new heavy sweep.

Yes, that is one patch; another expected patch, for improving OpenSM's handling
of mcast join request/node reboot storms, is recalculating mcast routing for
more than one mcast group at a time (actually I think requested mcast groups
should be queued in a list and mcast re-routing requests merged, plus some
trivial processor function in osm_mcast_mgr.c). Maybe the whole idle queue
mechanism can be killed as useless; then this will impact the heavy sweep
related patch.

Sasha


Re: [ofa-general] smpquery regression in 1.3-rc1

2007-12-20 Thread Sasha Khapyorsky
On 08:49 Thu 20 Dec , Hal Rosenstock wrote:
  
   Anyone else seeing the same?
   -G requires a SA path record lookup so this could be an issue with that
   timing out in some cases (assuming the port is active and the SM is
   operational).
   I'm seeing the same problem.
   Sometimes the query works, and sometimes it doesn't.
   I also see that when the query fails, OpenSM doesn't get PathRecord 
   query at all.
  
   Hal, can you elaborate on that timing out in some cases issue?
   
   I just meant that the SM not responding (for an unknown reason right
   now) would yield this effect.
   
   Adding Jack for the libibmad issue:
  
   I see that the ib_path_query() in libibmad/sa.c sometimes fails
   when calling safe_sa_call().
   
   This could just be more detail on the same thing in terms of the
   (smpquery) client which is layered on top of libibmad: the SA path query
   timeout.
   I would suggest running OpenSM in verbose mode (both instances are with
   OpenSM) and seeing if it responds to the PathRecord query used by this
   form of smpquery and continue troubleshooting from there based on the
   result.
  
  This is actually what I was saying here.
  I have *debugged* smpquery, and saw that the failing function is
  ib_path_query() in libibmad/sa.c
  As I've mentioned, I did run it with OpenSM in verbose mode, and saw
  that when smpquery fails, OpenSM log does not have any PathRecord request.
  When smpquery passes, I see the PathRecord request and response in the
  OpenSM log.
 
 OK; that wasn't clear before but is now (that the failure appears to be
 a client and not SM issue) :-) FWIW, I don't know what has changed that
 would affect this so it could be a latent bug as opposed to a
 regression.

Right, there were no changes in this area in this period; likely the issue
was just triggered. I'm not sure, but I probably saw something like this in
the past, though then I thought it was a cabling issue.

Yevgeny, Arthur, could you rerun smpquery with - (for lot of debug
stuff)?

Sasha


Re: [ofa-general] smpquery regression in 1.3-rc1

2007-12-20 Thread akepner
On Thu, Dec 20, 2007 at 05:13:18PM +, Sasha Khapyorsky wrote:
 ...
 Yevgeny, Arthur, could you rerun smpquery with - (for lot of debug
 stuff)?
 

Well, just about any perturbation changes the behavior - run 
it under strace, or gdb, link the IB libraries statically, or 
look at the machine funny and it works fine. 

But using the debug flags reveals an apparent problem with the 
debug code itself:

# ./smpquery_1.3_rc1 -d -G nodeinfo 0x00066a01a000737c
ibwarn: [19328] smp_query: attr 0x15 mod 0x0 route DR path 0
ibwarn: [19328] mad_rpc: data offs 64 sz 64
mad data
    fe80   
0002 0002 0251 0a6a   0103 0302
3452 0023 4040 0008 0804 ff40  005e
 2012 1088     
Segmentation fault

and gdb shows:

(gdb) bt
#0  0x2b0b9222ed0f in _IO_default_xsputn_internal () from /lib64/libc.so.6
#1  0x2b0b92207177 in vfprintf () from /lib64/libc.so.6
#2  0x2b0b9229577d in __vsprintf_chk () from /lib64/libc.so.6
#3  0x2b0b922956c0 in __sprintf_chk () from /lib64/libc.so.6
#4  0x2b0b91c71166 in portid2str (portid=0x7fff1905bc00) at src/portid.c:91
#5  0x2b0b91c72529 in sa_rpc_call (ibmad_port=0x7fff1905b680,
rcvbuf=0x7fff1905bb30, portid=0x7fff1905bc00, sa=0x7fff1905bac0, timeout=0)
at src/sa.c:58
#6  0x2b0b91c71791 in sa_call (rcvbuf=0x7fff1905bb30,
portid=0x7fff1905bc00, sa=0x7fff1905bac0, timeout=0) at src/rpc.c:395
#7  0x2b0b91c723bf in ib_path_query (srcgid=0x7fff1905be30 \200,
destgid=0x7fff1905be30 \200, sm_id=0x7fff1905bc00, buf=0x7fff1905bb30)
at ./include/infiniband/mad.h:790
#8  0x2b0b91c7144f in ib_resolve_guid (portid=0x7fff1905bde0,
guid=0x7fff1905bd20, sm_id=0x7fff1905bc00, timeout=value optimized out)
at src/resolve.c:83
#9  0x2b0b91c71610 in ib_resolve_portid_str (portid=0x7fff1905bde0,
addr_str=0x7fff1905d341 0x00066a01a000737c, dest_type=2, sm_id=0x0)
at src/resolve.c:115
#10 0x00401cd1 in main (argc=2, argv=0x7fff1905bfd0)
at smpquery_1.3_rc1.c:522

-- 
Arthur



[ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Steve Wise

Hey Roland (and any iommu/ppc/dma experts out there):

I'm debugging a data corruption issue that happens on PPC64 systems 
running rdma on kernels where the iommu page size is 4KB yet the host 
page size is 64KB.  This feature was added to the PPC64 code recently, 
and is in kernel.org from 2.6.23.  So if the kernel is built with a 4KB 
page size, no problems.  If the kernel is prior to 2.6.23 then 64KB page 
configs work too.  It's just a problem when the iommu page size != host 
page size.


It appears that my problem boils down to a single host page of memory 
that is mapped for dma, and the dma address returned by dma_map_sg() is 
_not_ 64KB aligned.  Here is an example:


app registers va 0x2d9a3000 len 12288
ib_umem_get() creates and maps a umem and chunk that looks like (dumping 
state from a registered user memory region):



umem len 12288 off 12288 pgsz 65536 shift 16
chunk 0: nmap 1 nents 1
sglist[0] page 0xc0930b08 off 0 len 65536 dma_addr 
5bff4000 dma_len 65536



So the kernel maps 1 full page for this MR.  But note that the dma 
address is 5bff4000 which is 4KB aligned, not 64KB aligned.  I 
think this is causing grief to the RDMA HW.


My first question is: is there an assumption or requirement in Linux 
that dma_addresses should have the same alignment as the host addresses 
they are mapped to?  I.e., the rdma core is mapping the entire 64KB page, 
but the mapping doesn't begin on a 64KB page boundary.


If this mapping is considered valid, then perhaps the rdma hw is at 
fault here.  But I'm wondering if this is a PPC/iommu bug.


BTW:  Here is what the Memory Region looks like to the HW:


TPT entry:  stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2
perms RW rem_inv_dis 0 addr_type VATO
bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0
len 12288 va 2d9a3000 bind_cnt 0
PBL: 5bff4000




Any thoughts?

Steve.


___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Tom Tucker

On Thu, 2007-12-20 at 11:14 -0600, Steve Wise wrote:
 Hey Roland (and any iommu/ppc/dma experts out there):
 
 I'm debugging a data corruption issue that happens on PPC64 systems 
 running rdma on kernels where the iommu page size is 4KB yet the host 
 page size is 64KB.  This feature was added to the PPC64 code recently, 
 and is in kernel.org from 2.6.23.  So if the kernel is built with a 4KB 
 page size, there are no problems.  Kernels prior to 2.6.23 work with 
 64KB page configs too.  It's only a problem when the iommu page size != 
 host page size.
 
 It appears that my problem boils down to a single host page of memory 
 that is mapped for dma, and the dma address returned by dma_map_sg() is 
 _not_ 64KB aligned.  Here is an example:
 
 app registers va 0x2d9a3000 len 12288
 ib_umem_get() creates and maps a umem and chunk that looks like (dumping 
 state from a registered user memory region):
 
  umem len 12288 off 12288 pgsz 65536 shift 16
  chunk 0: nmap 1 nents 1
  sglist[0] page 0xc0930b08 off 0 len 65536 dma_addr 
  5bff4000 dma_len 65536
  
 
 So the kernel maps 1 full page for this MR.  But note that the dma 
 address is 5bff4000 which is 4KB aligned, not 64KB aligned.  I 
 think this is causing grief to the RDMA HW.
 
 My first question is: is there an assumption or requirement in Linux 
 that dma addresses should have the same alignment as the host addresses 
 they are mapped to?  I.e., the rdma core is mapping the entire 64KB page, 
 but the mapping doesn't begin on a 64KB page boundary.
 
 If this mapping is considered valid, then perhaps the rdma hw is at 
 fault here.  But I'm wondering if this is a PPC/iommu bug.
 
 BTW:  Here is what the Memory Region looks like to the HW:
 
  TPT entry:  stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2
  perms RW rem_inv_dis 0 addr_type VATO
  bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0
  len 12288 va 2d9a3000 bind_cnt 0
  PBL: 5bff4000
 
 
 
 Any thoughts?

The Ammasso certainly works this way. If you tell it the page size is
64KB, it ignores the low-order bits of the page address (those that
encode byte offsets 0-65535).

 
 Steve.
 
 



Re: [ofa-general] Re: [PATCH 3/3] ib/cm: add basic performance counters

2007-12-20 Thread Sean Hefty

Roland Dreier wrote:

by the way, I had to make cm_class not static, or else a build with
ib_cm and ib_ucm built into the kernel failed... I think that exported
symbols can't be static.


Thanks for fixing this.


Re: [ofa-general] peer to peer connections support

2007-12-20 Thread Sean Hefty

Sorry if I wasn't clear; let me see if I understand you: with this
different-domain implementation, under client/server the passive side
calls cm listen and the active side calls cm connect, whereas under
peer-to-peer both sides call cm listen and later both sides, or only
one side, may call cm connect, correct?


My thinking was that the peer to peer model would have both sides call 
connect only.  The peer to peer connection model only kicks in when both 
sides are in the REQ sent state.



But it could set it to one as well... assuming my understanding above of
the suggested implementation is correct, we can change the RDMA-CM API
to let users specify on rdma_connect that they want peer to peer
support, so such apps can issue an rdma_listen call and later call
rdma_connect with this bit set and they are done (or almost done... I
guess there's some more devil in the details here, isn't there?)


This was why I said that the IB CM API was fine, but the RDMA CM API 
would require changes.



This makes the whole peer to peer model useless, since an app can not make
sure that connections occur at exactly the same time!


yep - (anyone can feel free to step in and set me straight on this...)

the spec says the peer to peer model has the ability to handle
connections that occur at exactly the same time, but not only that case.


Peer to peer seems inherently racy to me.

Under MPI each rank uses a different SID, so I think we are safe from 
this problem.


Any peer to peer implementation should handle this case however.

- Sean


[ofa-general] Re: [RFC] XRC -- make receiving XRC QP independent of any one user process

2007-12-20 Thread Roland Dreier
  This API will be implemented in the upcoming OFED 1.3 release, so we need 
  feedback ASAP.

I hope we can learn some lessons about development process... clearly
changing APIs after -rc1 is not something that leads to good quality
in general.

  int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
struct ibv_xrc_domain *xrc_domain,
struct ibv_qp_attr *attr,
enum ibv_qp_attr_mask attr_mask,
uint32_t *xrc_rcv_qpn);

I can't say this interface is very appealing.

Another option would be to create an XRC verb that detaches a
userspace QP and gives it the same lifetime as an XRC domain.  But
that doesn't seem any nicer.

And I guess we can't combine creating the QP with allocating the XRC
domain, because the consumer might want to open the XRC domain before
it has connected with the remote side.

Oh well, I guess this XRC stuff just ends up being ugly.

 - R.


Re: [ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Steve Wise

Roland Dreier wrote:

  It appears that my problem boils down to a single host page of memory
  that is mapped for dma, and the dma address returned by dma_map_sg()
  is _not_ 64KB aligned.  Here is an example:

  My first question is: is there an assumption or requirement in linux
  that dma addresses should have the same alignment as the host addresses
  they are mapped to?  I.e., the rdma core is mapping the entire 64KB page,
  but the mapping doesn't begin on a 64KB page boundary.

I don't think this is explicitly documented anywhere, but it certainly
seems that we want the bus address to be page-aligned in this case.
For mthca/mlx4 at least, we tell the adapter what the host page size
is (so that it knows how to align doorbell pages etc) and I think this
sort of thing would confuse the HW.

 - R.



In arch/powerpc/kernel/iommu.c:iommu_map_sg() I see that it calls 
iommu_range_alloc() with an alignment_order of 0:



vaddr = (unsigned long)page_address(s->page) + s->offset;
npages = iommu_num_pages(vaddr, slen);
entry = iommu_range_alloc(tbl, npages, &handle,
		mask >> IOMMU_PAGE_SHIFT, 0);


But perhaps the alignment order needs to be based on the host page size?


Steve.


Re: [ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Steve Wise

Steve Wise wrote:

Roland Dreier wrote:

  It appears that my problem boils down to a single host page of memory
  that is mapped for dma, and the dma address returned by dma_map_sg()
  is _not_ 64KB aligned.  Here is an example:

  My first question is: is there an assumption or requirement in linux
  that dma addresses should have the same alignment as the host addresses
  they are mapped to?  I.e., the rdma core is mapping the entire 64KB page,
  but the mapping doesn't begin on a 64KB page boundary.

I don't think this is explicitly documented anywhere, but it certainly
seems that we want the bus address to be page-aligned in this case.
For mthca/mlx4 at least, we tell the adapter what the host page size
is (so that it knows how to align doorbell pages etc) and I think this
sort of thing would confuse the HW.

 - R.



In arch/powerpc/kernel/iommu.c:iommu_map_sg() I see that it calls 
iommu_range_alloc() with an alignment_order of 0:



vaddr = (unsigned long)page_address(s->page) + s->offset;
npages = iommu_num_pages(vaddr, slen);
entry = iommu_range_alloc(tbl, npages, &handle,
		mask >> IOMMU_PAGE_SHIFT, 0);


But perhaps the alignment order needs to be based on the host page size?



Or based on the alignment of vaddr actually...



[ofa-general] Re: iommu dma mapping alignment requirements

2007-12-20 Thread Benjamin Herrenschmidt
Adding a few more people to the discussion. You may well be right and we
would have to provide the same alignment, though that sucks a bit, as one
of the reasons we switched to 4K for the IOMMU is that the iommu space
available on pSeries is very small and we were running out of it with
64K pages and lots of networking activity.

On Thu, 2007-12-20 at 11:14 -0600, Steve Wise wrote:
 Hey Roland (and any iommu/ppc/dma experts out there):
 
 I'm debugging a data corruption issue that happens on PPC64 systems 
 running rdma on kernels where the iommu page size is 4KB yet the host 
 page size is 64KB.  This feature was added to the PPC64 code recently, 
 and is in kernel.org from 2.6.23.  So if the kernel is built with a 4KB 
 page size, there are no problems.  Kernels prior to 2.6.23 work with 
 64KB page configs too.  It's only a problem when the iommu page size != 
 host page size.
 
 It appears that my problem boils down to a single host page of memory 
 that is mapped for dma, and the dma address returned by dma_map_sg() is 
 _not_ 64KB aligned.  Here is an example:
 
 app registers va 0x2d9a3000 len 12288
 ib_umem_get() creates and maps a umem and chunk that looks like (dumping 
 state from a registered user memory region):
 
  umem len 12288 off 12288 pgsz 65536 shift 16
  chunk 0: nmap 1 nents 1
  sglist[0] page 0xc0930b08 off 0 len 65536 dma_addr 
  5bff4000 dma_len 65536
  
 
 So the kernel maps 1 full page for this MR.  But note that the dma 
 address is 5bff4000 which is 4KB aligned, not 64KB aligned.  I 
 think this is causing grief to the RDMA HW.
 
 My first question is: is there an assumption or requirement in Linux 
 that dma addresses should have the same alignment as the host addresses 
 they are mapped to?  I.e., the rdma core is mapping the entire 64KB page, 
 but the mapping doesn't begin on a 64KB page boundary.
 
 If this mapping is considered valid, then perhaps the rdma hw is at 
 fault here.  But I'm wondering if this is a PPC/iommu bug.
 
 BTW:  Here is what the Memory Region looks like to the HW:
 
  TPT entry:  stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2
  perms RW rem_inv_dis 0 addr_type VATO
  bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0
  len 12288 va 2d9a3000 bind_cnt 0
  PBL: 5bff4000
 
 
 
 Any thoughts?
 
 Steve.
 



Re: [ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Benjamin Herrenschmidt

On Thu, 2007-12-20 at 13:29 -0600, Steve Wise wrote:

 Or based on the alignment of vaddr actually...

The latter wouldn't be realistic. What I think might be necessary, though
it would definitely cause us problems with running out of iommu space
(which is the reason we did the switch down to 4K), is to provide
alignment to the real page size, and alignment to the allocation order
for dma_map_consistent.

It might be possible to -tweak- it and only provide alignment to the page
size for allocations that are larger than IOMMU_PAGE_SIZE. That would
solve the problem of small network packets eating up too much iommu
space, though.

What do you think ?

Ben.




[ofa-general] [PATCH 1/10] nes: accelerated loopback fix

2007-12-20 Thread Glenn Grundstrom (NetEffect)

Accelerated loopback code did not properly handle private
data.  Add loopback connection counter to ethtool stats.

Signed-off-by: Glenn Grundstrom [EMAIL PROTECTED]

---

diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 638bc51..79889a4 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -67,6 +67,7 @@ u32 cm_packets_received;
 u32 cm_listens_created;
 u32 cm_listens_destroyed;
 u32 cm_backlog_drops;
+atomic_t cm_loopbacks;
 atomic_t cm_nodes_created;
 atomic_t cm_nodes_destroyed;
 atomic_t cm_accel_dropped_pkts;
@@ -1638,6 +1639,7 @@ struct nes_cm_node * mini_cm_connect(struct nes_cm_core *cm_core,
 	if (loopbackremotelistener == NULL) {
 		create_event(cm_node, NES_CM_EVENT_ABORTED);
 	} else {
+		atomic_inc(&cm_loopbacks);
 		loopback_cm_info = *cm_info;
 		loopback_cm_info.loc_port = cm_info->rem_port;
 		loopback_cm_info.rem_port = cm_info->loc_port;
@@ -2445,7 +2447,13 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 	cm_event.private_data = NULL;
 	cm_event.private_data_len = 0;
 	ret = cm_id->event_handler(cm_id, &cm_event);
-	nes_debug(NES_DBG_CM, "OFA CM event_handler returned, ret=%d\n", ret);
+	if (cm_node->loopbackpartner) {
+		cm_node->loopbackpartner->mpa_frame_size =
+			nesqp->private_data_len;
+		/* copy entire MPA frame to our cm_node's frame */
+		memcpy(cm_node->loopbackpartner->mpa_frame_buf,
+			nesqp->ietf_frame->priv_data, nesqp->private_data_len);
+		create_event(cm_node->loopbackpartner, NES_CM_EVENT_CONNECTED);
+	}
 	if (ret)
 		printk("%s[%u] OFA CM event_handler returned, ret=%d\n",
 			__FUNCTION__, __LINE__, ret);
diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
index e01aab4..810a9ae 100644
--- a/drivers/infiniband/hw/nes/nes_nic.c
+++ b/drivers/infiniband/hw/nes/nes_nic.c
@@ -114,6 +114,7 @@ extern u32 cm_packets_retrans;
 extern u32 cm_listens_created;
 extern u32 cm_listens_destroyed;
 extern u32 cm_backlog_drops;
+extern atomic_t cm_loopbacks;
 extern atomic_t cm_nodes_created;
 extern atomic_t cm_nodes_destroyed;
 extern atomic_t cm_accel_dropped_pkts;
@@ -967,7 +968,7 @@ void nes_netdev_exit(struct nes_vnic *nesvnic)
 }
 
 
-#define NES_ETHTOOL_STAT_COUNT 54
+#define NES_ETHTOOL_STAT_COUNT 55
 static const char nes_ethtool_stringset[NES_ETHTOOL_STAT_COUNT][ETH_GSTRING_LEN] = {
 	"Link Change Interrupts",
 	"Linearized SKBs",
@@ -1011,6 +1012,7 @@ static const char nes_ethtool_stringset[NES_ETHTOOL_STAT_COUNT][ETH_GSTRING_LEN]
 	"CM Listens Created",
 	"CM Listens Destroyed",
 	"CM Backlog Drops",
+	"CM Loopbacks",
 	"CM Nodes Created",
 	"CM Nodes Destroyed",
 	"CM Accel Drops",
@@ -1206,11 +1208,11 @@ static void nes_netdev_get_ethtool_stats(struct net_device *netdev,
 	target_stat_values[39] = cm_listens_created;
 	target_stat_values[40] = cm_listens_destroyed;
 	target_stat_values[41] = cm_backlog_drops;
-	target_stat_values[42] = atomic_read(&cm_nodes_created);
-	target_stat_values[43] = atomic_read(&cm_nodes_destroyed);
-	target_stat_values[44] = atomic_read(&cm_accel_dropped_pkts);
-	target_stat_values[45] = atomic_read(&cm_resets_recvd);
-	target_stat_values[46] = int_mod_timer_init;
+	target_stat_values[42] = atomic_read(&cm_loopbacks);
+	target_stat_values[43] = atomic_read(&cm_nodes_created);
+	target_stat_values[44] = atomic_read(&cm_nodes_destroyed);
+	target_stat_values[45] = atomic_read(&cm_accel_dropped_pkts);
+	target_stat_values[46] = atomic_read(&cm_resets_recvd);
 	target_stat_values[47] = int_mod_cq_depth_1;
 	target_stat_values[48] = int_mod_cq_depth_4;
 	target_stat_values[49] = int_mod_cq_depth_16;
diff --git a/drivers/infiniband/hw/nes/nes_utils.c b/drivers/infiniband/hw/nes/nes_utils.c
index b6aa6d3..8d2c1ee 100644
--- a/drivers/infiniband/hw/nes/nes_utils.c
+++ b/drivers/infiniband/hw/nes/nes_utils.c
@@ -620,8 +620,6 @@ void nes_post_cqp_request(struct nes_device *nesdev,
 }
 
 
-
-
 /**
  * nes_arp_table
  */


[ofa-general] [PATCH 2/10] nes: add support for external flash update utility

2007-12-20 Thread Glenn Grundstrom (NetEffect)

Allows an external utility to read/write flash for
firmware upgrades.

Signed-off-by: Glenn Grundstrom [EMAIL PROTECTED]

---

diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c
index a5e0bb5..1088330 100644
--- a/drivers/infiniband/hw/nes/nes.c
+++ b/drivers/infiniband/hw/nes/nes.c
@@ -780,6 +780,136 @@ static struct pci_driver nes_pci_driver = {
.remove = __devexit_p(nes_remove),
 };
 
+static ssize_t nes_show_ee_cmd(struct device_driver *ddp, char *buf)
+{
+	u32 eeprom_cmd;
+	struct nes_device *nesdev;
+
+	nesdev = list_entry(nes_dev_list.next, typeof(*nesdev), list);
+	eeprom_cmd = nes_read32(nesdev->regs + NES_EEPROM_COMMAND);
+
+	return snprintf(buf, PAGE_SIZE, "0x%x\n", eeprom_cmd);
+}
+
+static ssize_t nes_store_ee_cmd(struct device_driver *ddp,
+		const char *buf, size_t count)
+{
+	char *p = (char *)buf;
+	u32 val;
+	struct nes_device *nesdev;
+
+	if (p[1] == 'x' || p[1] == 'X' || p[0] == 'x' || p[0] == 'X') {
+		val = simple_strtoul(p, &p, 16);
+		nesdev = list_entry(nes_dev_list.next, typeof(*nesdev), list);
+		nes_write32(nesdev->regs + NES_EEPROM_COMMAND, val);
+	}
+	return strnlen(buf, count);
+}
+
+static ssize_t nes_show_ee_data(struct device_driver *ddp, char *buf)
+{
+	u32 eeprom_data;
+	struct nes_device *nesdev;
+
+	nesdev = list_entry(nes_dev_list.next, typeof(*nesdev), list);
+	eeprom_data = nes_read32(nesdev->regs + NES_EEPROM_DATA);
+
+	return snprintf(buf, PAGE_SIZE, "0x%x\n", eeprom_data);
+}
+
+static ssize_t nes_store_ee_data(struct device_driver *ddp,
+		const char *buf, size_t count)
+{
+	char *p = (char *)buf;
+	u32 val;
+	struct nes_device *nesdev;
+
+	if (p[1] == 'x' || p[1] == 'X' || p[0] == 'x' || p[0] == 'X') {
+		val = simple_strtoul(p, &p, 16);
+		nesdev = list_entry(nes_dev_list.next, typeof(*nesdev), list);
+		nes_write32(nesdev->regs + NES_EEPROM_DATA, val);
+	}
+	return strnlen(buf, count);
+}
+
+static ssize_t nes_show_flash_cmd(struct device_driver *ddp, char *buf)
+{
+	u32 flash_cmd;
+	struct nes_device *nesdev;
+
+	nesdev = list_entry(nes_dev_list.next, typeof(*nesdev), list);
+	flash_cmd = nes_read32(nesdev->regs + NES_FLASH_COMMAND);
+
+	return snprintf(buf, PAGE_SIZE, "0x%x\n", flash_cmd);
+}
+
+static ssize_t nes_store_flash_cmd(struct device_driver *ddp,
+		const char *buf, size_t count)
+{
+	char *p = (char *)buf;
+	u32 val;
+	struct nes_device *nesdev;
+
+	if (p[1] == 'x' || p[1] == 'X' || p[0] == 'x' || p[0] == 'X') {
+		val = simple_strtoul(p, &p, 16);
+		nesdev = list_entry(nes_dev_list.next, typeof(*nesdev), list);
+		nes_write32(nesdev->regs + NES_FLASH_COMMAND, val);
+	}
+	return strnlen(buf, count);
+}
+
+static ssize_t nes_show_flash_data(struct device_driver *ddp, char *buf)
+{
+	u32 flash_data;
+	struct nes_device *nesdev;
+
+	nesdev = list_entry(nes_dev_list.next, typeof(*nesdev), list);
+	flash_data = nes_read32(nesdev->regs + NES_FLASH_DATA);
+
+	return snprintf(buf, PAGE_SIZE, "0x%x\n", flash_data);
+}
+
+static ssize_t nes_store_flash_data(struct device_driver *ddp,
+		const char *buf, size_t count)
+{
+	char *p = (char *)buf;
+	u32 val;
+	struct nes_device *nesdev;
+
+	if (p[1] == 'x' || p[1] == 'X' || p[0] == 'x' || p[0] == 'X') {
+		val = simple_strtoul(p, &p, 16);
+		nesdev = list_entry(nes_dev_list.next, typeof(*nesdev), list);
+		nes_write32(nesdev->regs + NES_FLASH_DATA, val);
+	}
+	return strnlen(buf, count);
+}
+
+DRIVER_ATTR(eeprom_cmd, S_IRUSR | S_IWUSR,
+   nes_show_ee_cmd, nes_store_ee_cmd);
+DRIVER_ATTR(eeprom_data, S_IRUSR | S_IWUSR,
+   nes_show_ee_data, nes_store_ee_data);
+DRIVER_ATTR(flash_cmd, S_IRUSR | S_IWUSR,
+   nes_show_flash_cmd, nes_store_flash_cmd);
+DRIVER_ATTR(flash_data, S_IRUSR | S_IWUSR,
+   nes_show_flash_data, nes_store_flash_data);
+
+int nes_create_driver_sysfs(struct pci_driver *drv)
+{
+	int error;
+	error  = driver_create_file(&drv->driver, &driver_attr_eeprom_cmd);
+	error |= driver_create_file(&drv->driver, &driver_attr_eeprom_data);
+	error |= driver_create_file(&drv->driver, &driver_attr_flash_cmd);
+	error |= driver_create_file(&drv->driver, &driver_attr_flash_data);
+	return error;
+}
+
+void nes_remove_driver_sysfs(struct pci_driver *drv)
+{
+	driver_remove_file(&drv->driver, &driver_attr_eeprom_cmd);
+	driver_remove_file(&drv->driver, &driver_attr_eeprom_data);
+	driver_remove_file(&drv->driver, &driver_attr_flash_cmd);
+	driver_remove_file(&drv->driver, &driver_attr_flash_data);
+}
 
 /**
  * nes_init_module - module initialization entry point
@@ -787,12 +917,20 @@ static struct 

[ofa-general] [PATCH 3/10] nes: nic queue start/stop and carrier fix

2007-12-20 Thread Glenn Grundstrom (NetEffect)

When the send queue became full, netif_stop_queue() was called,
but netif_start_queue() was never called to restart the queue.

Signed-off-by: Glenn Grundstrom [EMAIL PROTECTED]

---

diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
index 810a9ae..2ff4c41 100644
--- a/drivers/infiniband/hw/nes/nes_nic.c
+++ b/drivers/infiniband/hw/nes/nes_nic.c
@@ -203,6 +203,7 @@ static int nes_netdev_open(struct net_device *netdev)
return ret;
}
 
+   netif_carrier_off(netdev);
netif_stop_queue(netdev);
 
 	if ((!nesvnic->of_device_registered) && (nesvnic->rdma_enabled)) {
@@ -502,6 +503,13 @@ static int nes_netdev_start_xmit(struct sk_buff *skb, struct net_device *netdev)
 			netdev->name, skb->len, skb_headlen(skb),
 			skb_shinfo(skb)->nr_frags, skb_is_gso(skb));
 	*/
+
+	if (!netif_carrier_ok(netdev))
+		return NETDEV_TX_OK;
+
+	if (netif_queue_stopped(netdev))
+		return NETDEV_TX_BUSY;
+
 	local_irq_save(flags);
 	if (!spin_trylock(&nesnic->sq_lock)) {
 		local_irq_restore(flags);
@@ -511,12 +519,20 @@ static int nes_netdev_start_xmit(struct sk_buff *skb, struct net_device *netdev)
 
 	/* Check if SQ is full */
 	if ((((nesnic->sq_tail+(nesnic->sq_size*2))-nesnic->sq_head) & (nesnic->sq_size - 1)) == 1) {
-		netif_stop_queue(netdev);
-		spin_unlock_irqrestore(&nesnic->sq_lock, flags);
+		if (!netif_queue_stopped(netdev)) {
+			netif_stop_queue(netdev);
+			barrier();
+			if ((((((volatile u16)nesnic->sq_tail)+(nesnic->sq_size*2))-nesnic->sq_head) & (nesnic->sq_size - 1)) != 1) {
+				netif_start_queue(netdev);
+				goto sq_no_longer_full;
+			}
+		}
 		nesvnic->sq_full++;
+		spin_unlock_irqrestore(&nesnic->sq_lock, flags);
 		return NETDEV_TX_BUSY;
 	}
 
+sq_no_longer_full:
 	nr_frags = skb_shinfo(skb)->nr_frags;
 	if (skb_headlen(skb) > NES_FIRST_FRAG_SIZE) {
 		nr_frags++;
@@ -534,13 +550,23 @@ static int nes_netdev_start_xmit(struct sk_buff *skb, struct net_device *netdev)
 			(nesnic->sq_size - 1);
 
 	if (unlikely(wqes_needed > wqes_available)) {
-		netif_stop_queue(netdev);
+		if (!netif_queue_stopped(netdev)) {
+			netif_stop_queue(netdev);
+			barrier();
+			wqes_available = (((((volatile u16)nesnic->sq_tail)+nesnic->sq_size)-nesnic->sq_head) - 1) & (nesnic->sq_size - 1);
+			if (wqes_needed <= wqes_available) {
+				netif_start_queue(netdev);
+				goto tso_sq_no_longer_full;
+			}
+		}
+		nesvnic->sq_full++;
 		spin_unlock_irqrestore(&nesnic->sq_lock, flags);
 		nes_debug(NES_DBG_NIC_TX, "%s: HNIC SQ full- TSO request has too many frags!\n",
 				netdev->name);
-		nesvnic->sq_full++;
 		return NETDEV_TX_BUSY;
 	}
+tso_sq_no_longer_full:
 	/* Map all the buffers */
 	for (tso_frag_count=0; tso_frag_count < skb_shinfo(skb)->nr_frags;
 			tso_frag_count++) {


[ofa-general] [PATCH 4/10] nes: interrupt moderation fix

2007-12-20 Thread Glenn Grundstrom (NetEffect)

The hardware interrupt moderation timer gave only average performance
on slower systems.  These fixes improve performance.

Signed-off-by: Glenn Grundstrom [EMAIL PROTECTED]

---

diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index 674ce32..1048db2 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -155,26 +155,41 @@ static void nes_nic_tune_timer(struct nes_device *nesdev)
 
 	spin_lock_irqsave(&nesadapter->periodic_timer_lock, flags);
 
+	if (shared_timer->cq_count_old < shared_timer->cq_count) {
+		if (shared_timer->cq_count > shared_timer->threshold_low) {
+			shared_timer->cq_direction_downward=0;
+		}
+	}
+	if (shared_timer->cq_count_old >= shared_timer->cq_count) {
+		shared_timer->cq_direction_downward++;
+	}
+	shared_timer->cq_count_old = shared_timer->cq_count;
+	if (shared_timer->cq_direction_downward > NES_NIC_CQ_DOWNWARD_TREND) {
+		if (shared_timer->cq_count <= shared_timer->threshold_low) {
+			shared_timer->threshold_low = shared_timer->threshold_low/2;
+			shared_timer->cq_direction_downward=0;
+			shared_timer->cq_count = 0;
+			spin_unlock_irqrestore(&nesadapter->periodic_timer_lock, flags);
+			return;
+		}
+	}
+
 	if (shared_timer->cq_count > 1) {
 		nesdev->deepcq_count += shared_timer->cq_count;
 		if (shared_timer->cq_count <= shared_timer->threshold_low) {	/* increase timer gently */
 			shared_timer->timer_direction_upward++;
 			shared_timer->timer_direction_downward = 0;
-		}
-		else if (shared_timer->cq_count <= shared_timer->threshold_target) { /* balanced */
+		} else if (shared_timer->cq_count <= shared_timer->threshold_target) { /* balanced */
 			shared_timer->timer_direction_upward = 0;
 			shared_timer->timer_direction_downward = 0;
-		}
-		else if (shared_timer->cq_count <= shared_timer->threshold_high) {	/* decrease timer gently */
+		} else if (shared_timer->cq_count <= shared_timer->threshold_high) {	/* decrease timer gently */
 			shared_timer->timer_direction_downward++;
 			shared_timer->timer_direction_upward = 0;
-		}
-		else if (shared_timer->cq_count <= (shared_timer->threshold_high)*2) {
+		} else if (shared_timer->cq_count <= (shared_timer->threshold_high)*2) {
 			shared_timer->timer_in_use -= 2;
 			shared_timer->timer_direction_upward = 0;
 			shared_timer->timer_direction_downward++;
-		}
-		else {
+		} else {
 			shared_timer->timer_in_use -= 4;
 			shared_timer->timer_direction_upward = 0;
 			shared_timer->timer_direction_downward++;
@@ -2241,7 +2256,7 @@ void nes_nic_ce_handler(struct nes_device *nesdev, struct nes_hw_nic_cq *cq)
 				if (atomic_read(&nesvnic->rx_skbs_needed) > (nesvnic->nic.rq_size>>1)) {
 					nes_write32(nesdev->regs+NES_CQE_ALLOC,
 							cq->cq_number | (cqe_count << 16));
-                                        nesadapter->tune_timer.cq_count += cqe_count;
+					nesadapter->tune_timer.cq_count += cqe_count;
 					cqe_count = 0;
 					nes_replenish_nic_rq(nesvnic);
 				}
diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h
index 25cfda2..ca0b006 100644
--- a/drivers/infiniband/hw/nes/nes_hw.h
+++ b/drivers/infiniband/hw/nes/nes_hw.h
@@ -957,6 +957,7 @@ struct nes_arp_entry {
 #define DEFAULT_JUMBO_NES_QL_LOW12
 #define DEFAULT_JUMBO_NES_QL_TARGET 40
 #define DEFAULT_JUMBO_NES_QL_HIGH   128
+#define NES_NIC_CQ_DOWNWARD_TREND   8
 
 struct nes_hw_tune_timer {
 u16 cq_count;
@@ -969,6 +970,8 @@ struct nes_hw_tune_timer {
 u16 timer_in_use_max;
 u8  timer_direction_upward;
 u8  timer_direction_downward;
+u16 cq_count_old;
+u8  cq_direction_downward;
 };
 
 #define NES_TIMER_INT_LIMIT 2
@@ -1051,17 +1054,17 @@ struct nes_adapter {
 
u32 nic_rx_eth_route_err;
 
-   u32 et_rx_coalesce_usecs;
+   u32 et_rx_coalesce_usecs;
u32 et_rx_max_coalesced_frames;
u32 et_rx_coalesce_usecs_irq;
-   u32 et_rx_max_coalesced_frames_irq;
-   u32 et_pkt_rate_low;
-   u32 et_rx_coalesce_usecs_low;
-   u32 et_rx_max_coalesced_frames_low;
-   u32 et_pkt_rate_high;
-   u32 et_rx_coalesce_usecs_high;
-   u32

[ofa-general] [PATCH 5/10] nes: remove unneeded arp cache update

2007-12-20 Thread Glenn Grundstrom (NetEffect)

The hardware arp cache is updated by inet event notifiers.
Therefore, no arp cache update is needed at netdev_open.

Signed-off-by: Glenn Grundstrom [EMAIL PROTECTED]

---

diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
index 2ff4c41..496024a 100644
--- a/drivers/infiniband/hw/nes/nes_nic.c
+++ b/drivers/infiniband/hw/nes/nes_nic.c
@@ -260,16 +260,6 @@ static int nes_netdev_open(struct net_device *netdev)
 	}
 
 
-	if (netdev->ip_ptr) {
-		struct in_device *ip = netdev->ip_ptr;
-		struct in_ifaddr *in = NULL;
-		if (ip && ip->ifa_list) {
-			in = ip->ifa_list;
-			nes_manage_arp_cache(nesvnic->netdev, netdev->dev_addr,
-					ntohl(in->ifa_address), NES_ARP_ADD);
-		}
-	}
-
 	nes_write32(nesdev->regs+NES_CQE_ALLOC, NES_CQE_ALLOC_NOTIFY_NEXT |
 			nesvnic->nic_cq.cq_number);
 	nes_read32(nesdev->regs+NES_CQE_ALLOC);


[ofa-general] [PATCH 6/10] nes: use control QP callback at connection teardown

2007-12-20 Thread Glenn Grundstrom (NetEffect)

Prevents a race condition between hardware and ULPs when
tearing down connections.  Memory and data structures are
cleaned up after the hardware ce handler has run.

Signed-off-by: Glenn Grundstrom [EMAIL PROTECTED]

---

diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c
index 1088330..4376bc2 100644
--- a/drivers/infiniband/hw/nes/nes.c
+++ b/drivers/infiniband/hw/nes/nes.c
@@ -263,13 +263,43 @@ void nes_add_ref(struct ib_qp *ibqp)
 	atomic_inc(&nesqp->refcount);
 }
 
+static void nes_cqp_rem_ref_callback(struct nes_device *nesdev, struct nes_cqp_request *cqp_request)
+{
+	unsigned long flags;
+	struct nes_qp *nesqp = cqp_request->cqp_callback_pointer;
+	struct nes_adapter *nesadapter = nesdev->nesadapter;
+	u32 qp_id;
+
+	atomic_inc(&qps_destroyed);
+
+	/* Free the control structures */
+
+	qp_id = nesqp->hwqp.qp_id;
+	if (nesqp->pbl_vbase) {
+		pci_free_consistent(nesdev->pcidev, nesqp->qp_mem_size,
+				nesqp->hwqp.q2_vbase, nesqp->hwqp.q2_pbase);
+		spin_lock_irqsave(&nesadapter->pbl_lock, flags);
+		nesadapter->free_256pbl++;
+		spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
+		pci_free_consistent(nesdev->pcidev, 256, nesqp->pbl_vbase, nesqp->pbl_pbase);
+		nesqp->pbl_vbase = NULL;
+		kunmap(nesqp->page);
+
+	} else {
+		pci_free_consistent(nesdev->pcidev, nesqp->qp_mem_size,
+				nesqp->hwqp.sq_vbase, nesqp->hwqp.sq_pbase);
+	}
+	nes_free_resource(nesadapter, nesadapter->allocated_qps, nesqp->hwqp.qp_id);
+
+	kfree(nesqp->allocated_buffer);
+
+}
 
 /**
  * nes_rem_ref
  */
 void nes_rem_ref(struct ib_qp *ibqp)
 {
-   unsigned long flags;
u64 u64temp;
struct nes_qp *nesqp;
 	struct nes_vnic *nesvnic = to_nesvnic(ibqp->device);
@@ -287,27 +317,7 @@ void nes_rem_ref(struct ib_qp *ibqp)
	}

	if (atomic_dec_and_test(&nesqp->refcount)) {
-		atomic_inc(&qps_destroyed);
-
-		/* Free the control structures */
-
-		if (nesqp->pbl_vbase) {
-			pci_free_consistent(nesdev->pcidev, nesqp->qp_mem_size,
-					nesqp->hwqp.q2_vbase, nesqp->hwqp.q2_pbase);
-			spin_lock_irqsave(&nesadapter->pbl_lock, flags);
-			nesadapter->free_256pbl++;
-			spin_unlock_irqrestore(&nesadapter->pbl_lock, flags);
-			pci_free_consistent(nesdev->pcidev, 256, nesqp->pbl_vbase, nesqp->pbl_pbase);
-			nesqp->pbl_vbase = NULL;
-			kunmap(nesqp->page);
-
-		} else {
-			pci_free_consistent(nesdev->pcidev, nesqp->qp_mem_size,
-					nesqp->hwqp.sq_vbase, nesqp->hwqp.sq_pbase);
-		}
-
		nesadapter->qp_table[nesqp->hwqp.qp_id-NES_FIRST_QPN] = NULL;
-		nes_free_resource(nesadapter, nesadapter->allocated_qps, nesqp->hwqp.qp_id);
 
/* Destroy the QP */
cqp_request = nes_get_cqp_request(nesdev);
@@ -316,6 +326,9 @@ void nes_rem_ref(struct ib_qp *ibqp)
			return;
		}
		cqp_request->waiting = 0;
+		cqp_request->callback = 1;
+		cqp_request->cqp_callback = nes_cqp_rem_ref_callback;
+		cqp_request->cqp_callback_pointer = nesqp;
		cqp_wqe = &cqp_request->cqp_wqe;

		cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] =
@@ -339,8 +352,6 @@ void nes_rem_ref(struct ib_qp *ibqp)
			cpu_to_le32((u32)(u64temp >> 32));

		nes_post_cqp_request(nesdev, cqp_request, NES_CQP_REQUEST_RING_DOORBELL);
-
-		kfree(nesqp->allocated_buffer);
	}
 }
 
diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index 1048db2..06d1963 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -2427,6 +2427,16 @@ void nes_cqp_ce_handler(struct nes_device *nesdev, struct nes_hw_cq *cq)

					spin_unlock_irqrestore(&nesdev->cqp.lock, flags);
				}
			}
+		} else if (cqp_request->callback) {
+			/* Invoke the callback routine */
+			cqp_request->cqp_callback(nesdev, cqp_request);
+			if (cqp_request->dynamic) {
+				kfree(cqp_request);
+			} else {
+				spin_lock_irqsave(&nesdev->cqp.lock, flags);

[ofa-general] [PATCH 7/10] nes: process mss option

2007-12-20 Thread Glenn Grundstrom (NetEffect)

Process the TCP MSS option when it is present on an incoming
packet, or fall back to the default value when it is absent.

Signed-off-by: Glenn Grundstrom [EMAIL PROTECTED]

---

diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 79889a4..169 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -1220,11 +1220,12 @@ static int rem_ref_cm_node(struct nes_cm_core *cm_core,
 /**
  * process_options
  */
-static void process_options(struct nes_cm_node *cm_node, u8 *optionsloc, u32 optionsize)
+static int process_options(struct nes_cm_node *cm_node, u8 *optionsloc, u32 optionsize, u32 syn_packet)
 {
	u32 tmp;
	u32 offset = 0;
	union all_known_options *all_options;
+	char got_mss_option = 0;

	while (offset < optionsize) {
		all_options = (union all_known_options *)(optionsloc + offset);
@@ -1236,9 +1237,17 @@ static void process_options(struct nes_cm_node *cm_node, u8 *optionsloc, u32 opt
			offset += 1;
			continue;
		case OPTION_NUMBER_MSS:
-			tmp = htons(all_options->as_mss.mss);
-			if (tmp < cm_node->tcp_cntxt.mss)
-				cm_node->tcp_cntxt.mss = tmp;
+			nes_debug(NES_DBG_CM, "%s: MSS Length: %d Offset: %d Size: %d\n",
+					__FUNCTION__, all_options->as_mss.length, offset, optionsize);
+			got_mss_option = 1;
+			if (all_options->as_mss.length != 4) {
+				return 1;
+			} else {
+				tmp = htons(all_options->as_mss.mss);
+				if (tmp > 0 && tmp < cm_node->tcp_cntxt.mss)
+					cm_node->tcp_cntxt.mss = tmp;
+			}
			break;
		case OPTION_NUMBER_WINDOW_SCALE:
			cm_node->tcp_cntxt.snd_wscale = all_options->as_windowscale.shiftcount;
@@ -1253,6 +1262,9 @@ static void process_options(struct nes_cm_node *cm_node, u8 *optionsloc, u32 opt
		}
		offset += all_options->as_base.length;
	}
+	if ((!got_mss_option) && (syn_packet))
+		cm_node->tcp_cntxt.mss = NES_CM_DEFAULT_MSS;
+	return 0;
 }
 
 
@@ -1343,6 +1355,8 @@ int process_packet(struct nes_cm_node *cm_node, struct sk_buff *skb,
		u8 *optionsloc = (u8 *)&tcph[1];
		process_options(cm_node, optionsloc, optionsize);
	}
+	else if (tcph->syn)
+		cm_node->tcp_cntxt.mss = NES_CM_DEFAULT_MSS;

	cm_node->tcp_cntxt.snd_wnd = htons(tcph->window) <<
		cm_node->tcp_cntxt.snd_wscale;
diff --git a/drivers/infiniband/hw/nes/nes_cm.h b/drivers/infiniband/hw/nes/nes_cm.h
index cd8e003..c511242 100644
--- a/drivers/infiniband/hw/nes/nes_cm.h
+++ b/drivers/infiniband/hw/nes/nes_cm.h
@@ -152,6 +152,8 @@ struct nes_timer_entry {
 #define NES_CM_DEFAULT_FREE_PKTS  0x000A
 #define NES_CM_FREE_PKT_LO_WATERMARK  2
 
+#define NES_CM_DEFAULT_MSS   536
+
 #define NES_CM_DEF_SEQ   0x159bf75f
 #define NES_CM_DEF_LOCAL_ID  0x3b47
 
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[ofa-general] [PATCH 8/10] nes: multicast performance enhancement

2007-12-20 Thread Glenn Grundstrom (NetEffect)

Move multicast processing to its own QP and set up the
hardware to use it.

Signed-off-by: Glenn Grundstrom [EMAIL PROTECTED]

---

diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index 06d1963..515133d 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -698,7 +698,7 @@ void nes_init_csr_ne020(struct nes_device *nesdev, u8 
hw_rev, u8 port_count)
 
nes_write_indexed(nesdev, 0x01E4, 0x0007);
/* nes_write_indexed(nesdev, 0x01E8, 0x000208C4); */
-   nes_write_indexed(nesdev, 0x01E8, 0x00020844);
+   nes_write_indexed(nesdev, 0x01E8, 0x00020874);
nes_write_indexed(nesdev, 0x01D8, 0x00048002);
/* nes_write_indexed(nesdev, 0x01D8, 0x0004B002); */
nes_write_indexed(nesdev, 0x01FC, 0x00050005);
@@ -753,7 +753,7 @@ void nes_init_csr_ne020(struct nes_device *nesdev, u8 
hw_rev, u8 port_count)
nes_write_indexed(nesdev, 0x60C0, 0x028e);
nes_write_indexed(nesdev, 0x60C8, 0x0020);

//
-   nes_write_indexed(nesdev, 0x01EC, 0x5b2625a0);
+   nes_write_indexed(nesdev, 0x01EC, 0x7b2625a0);
/* nes_write_indexed(nesdev, 0x01EC, 0x5f2625a0); */
 
if (hw_rev != NE020_REV) {
@@ -1377,7 +1377,7 @@ int nes_init_nic_qp(struct nes_device *nesdev, struct net_device *netdev)
		nic_sqe = &nesvnic->nic.sq_vbase[counter];
		nic_sqe->wqe_words[NES_NIC_SQ_WQE_MISC_IDX] =
			cpu_to_le32(NES_NIC_SQ_WQE_DISABLE_CHKSUM |
-					NES_NIC_SQ_WQE_COMPLETION);
+				NES_NIC_SQ_WQE_COMPLETION);
		nic_sqe->wqe_words[NES_NIC_SQ_WQE_LENGTH_0_TAG_IDX] =
			cpu_to_le32((u32)NES_FIRST_FRAG_SIZE << 16);
		nic_sqe->wqe_words[NES_NIC_SQ_WQE_FRAG0_LOW_IDX] =
@@ -1386,6 +1386,15 @@ int nes_init_nic_qp(struct nes_device *nesdev, struct net_device *netdev)
			cpu_to_le32((u32)((u64)nesvnic->nic.frag_paddr[counter] >> 32));
	}

+	nesvnic->mcrq_nic.sq_vbase = (void *)0;
+	nesvnic->mcrq_nic.sq_pbase = 0;
+	nesvnic->mcrq_nic.sq_head = 0;
+	nesvnic->mcrq_nic.sq_tail = 0;
+	nesvnic->mcrq_nic.sq_size = 0;
+	nesvnic->get_cqp_request = nes_get_cqp_request;
+	nesvnic->post_cqp_request = nes_post_cqp_request;
+	nesvnic->mcrq_mcast_filter = 0;
+
	spin_lock_init(&nesvnic->nic.sq_lock);
	spin_lock_init(&nesvnic->nic.rq_lock);
 
@@ -1404,6 +1413,17 @@ int nes_init_nic_qp(struct nes_device *nesdev, struct net_device *netdev)
	vmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_rq_wqe));
	pmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_rq_wqe));

+	nesvnic->mcrq_nic.rq_vbase = vmem;
+	nesvnic->mcrq_nic.rq_pbase = pmem;
+	nesvnic->mcrq_nic.rq_head = 0;
+	nesvnic->mcrq_nic.rq_tail = 0;
+	nesvnic->mcrq_nic.rq_size = NES_NIC_WQ_SIZE;
+
+	/* setup the CQ */
+	vmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_rq_wqe));
+	pmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_rq_wqe));
+	nesvnic->mcrq_nic.qp_id = nesvnic->nic_index + 32;
+
	nesvnic->nic_cq.cq_vbase = vmem;
	nesvnic->nic_cq.cq_pbase = pmem;
	nesvnic->nic_cq.cq_head = 0;
@@ -1484,6 +1504,19 @@ int nes_init_nic_qp(struct nes_device *nesdev, struct net_device *netdev)
	/* Ring doorbell (2 WQEs) */
	nes_write32(nesdev->regs+NES_WQE_ALLOC, 0x0280 | nesdev->cqp.qp_id);

+	/* Send CreateQP request to CQP */
+	nic_context++;
+	nic_context->context_words[NES_NIC_CTX_MISC_IDX] =
+		cpu_to_le32((u32)NES_NIC_CTX_SIZE |
+				((u32)PCI_FUNC(nesdev->pcidev->devfn) << 12) | (1 << 18));
+
+	u64temp = (u64)nesvnic->mcrq_nic.sq_pbase;
+	nic_context->context_words[NES_NIC_CTX_SQ_LOW_IDX] = cpu_to_le32((u32)u64temp);
+	nic_context->context_words[NES_NIC_CTX_SQ_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32));
+	u64temp = (u64)nesvnic->mcrq_nic.rq_pbase;
+	nic_context->context_words[NES_NIC_CTX_RQ_LOW_IDX] = cpu_to_le32((u32)u64temp);
+	nic_context->context_words[NES_NIC_CTX_RQ_HIGH_IDX] = cpu_to_le32((u32)(u64temp >> 32));
+
	spin_unlock_irqrestore(&nesdev->cqp.lock, flags);
	nes_debug(NES_DBG_INIT, "Waiting for create NIC QP%u to complete.\n",
			nesvnic->nic.qp_id);
diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h
index 0279d4c..2efb55e 100644
--- a/drivers/infiniband/hw/nes/nes_hw.h
+++ b/drivers/infiniband/hw/nes/nes_hw.h
@@ -1161,9 +1161,11 @@ struct nes_vnic {
	dma_addr_t           nic_pbase;
	struct nes_hw_nic    nic;
	struct nes_hw_nic_cq nic_cq;
-
+	struct 

[ofa-general] lock dependency in ib_user_mad

2007-12-20 Thread Sean Hefty
I see hangs killing opensm related to a bug in user_mad.c.  The problem appears
to be:

ib_umad_close()
 downgrade_write(&file->port->mutex)
 ib_unregister_mad_agent(...)
 up_read(&file->port->mutex)

ib_unregister_mad_agent() flushes any outstanding MADs, resulting in calls to
send_handler() and recv_handler(), both of which call queue_packet():

queue_packet()
 down_read(&file->port->mutex)
 ...
 up_read(&file->port->mutex)

ib_umad_kill_port() has a similar issue as ib_umad_close().

Does anyone know the reasoning for holding the mutex around
ib_unregister_mad_agent()?

- Sean



Re: [ofa-general] peer to peer connections support

2007-12-20 Thread Or Gerlitz
On 12/20/07, Kanevsky, Arkady [EMAIL PROTECTED] wrote:

 SO in a nutshell the proposal is to add some identifier into CM private
 data which indicate that it is peer-to-peer model, and unique peers IDs for
 the requested connection.
 Is this the model?


For the time being, I try to understand whether in the peer-to-peer model both
sides issue a listen before connecting or not. Without this listen, the
peer-to-peer model does not seem usable to me; what's your understanding of the
spec?

Or.

Re: [ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Steve Wise

Benjamin Herrenschmidt wrote:

On Thu, 2007-12-20 at 13:29 -0600, Steve Wise wrote:


Or based on the alignment of vaddr actually...


The latter wouldn't be realistic. What I think might be necessary, though
it would definitely cause us problems with running out of iommu space
(which is the reason we did the switch down to 4K), is to provide
alignment to the real page size, and alignment to the allocation order
for dma_map_consistent.

It might be possible to -tweak- and only provide alignment to the page
size for allocations that are larger than IOMMU_PAGE_SIZE. That would
solve the problem with small network packets eating up too much iommu
space, though.

What do you think ?


That might work.

If you gimme a patch, I'll try it out!

Steve.


RE: [ofa-general] peer to peer connections support

2007-12-20 Thread Kanevsky, Arkady
Yes.
The question is who issues it?
It can be done by the CM and not ULP.
Looking way back at VIPL there was a peer-to-peer model with the API
similar to the
one which Shane outlines.
Thanks, 
 

Arkady Kanevsky   email: [EMAIL PROTECTED]

Network Appliance Inc.   phone: 781-768-5395

1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195

Waltham, MA 02451   central phone: 781-768-5300

 




From: Or Gerlitz [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 20, 2007 4:17 PM
To: Kanevsky, Arkady
Cc: OpenFabrics General
Subject: Re: [ofa-general] peer to peer connections support


On 12/20/07, Kanevsky, Arkady [EMAIL PROTECTED]
wrote: 


SO in a nutshell the proposal is to add some identifier
into CM private data which indicate that it is peer-to-peer model, and
unique peers IDs for the requested connection.
Is this the model?



For the time being, I try to understand if in the peer to peer
model both sides issue a listen before connecting or not. Without this
listen the peer-to-peer does not seems usable to me, what's your
understanding of the spec?

Or.



Re: [ofa-general] peer to peer connections support

2007-12-20 Thread Or Gerlitz
On 12/20/07, Kanevsky, Arkady [EMAIL PROTECTED] wrote:

  Yes.
 The question is who issues it?
 It can be done by the CM and not ULP.
 Looking way back at VIPL there was a peer-to-peer model with the API
 similar to the
 one which Shane outlines.


If the CM issues the listen, it means I can connect to you only if you
try to connect to me NOW; my understanding is that this is a useless protocol,
but I will be happy to hear why I am wrong.

The IB stack co-maintainer's name is Sean

Or.

RE: [ofa-general] peer to peer connections support

2007-12-20 Thread Kanevsky, Arkady
What is the difference between the ULP not issuing a listen yet vs.
the ULP not issuing a peer-to-peer connect which does a listen under the covers?
If a connection request comes from the other side before either of them,
it will be rejected by the CM since nobody is listening.
 
Arkady
P.S. Sean, my apologies.
 

Arkady Kanevsky   email: [EMAIL PROTECTED]

Network Appliance Inc.   phone: 781-768-5395

1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195

Waltham, MA 02451   central phone: 781-768-5300

 




From: Or Gerlitz [mailto:[EMAIL PROTECTED] 
Sent: Thursday, December 20, 2007 4:29 PM
To: Kanevsky, Arkady
Cc: OpenFabrics General
Subject: Re: [ofa-general] peer to peer connections support


On 12/20/07, Kanevsky, Arkady [EMAIL PROTECTED]
wrote: 


Yes.
The question is who issues it?
It can be done by the CM and not ULP.
Looking way back at VIPL there was a peer-to-peer model
with the API similar to the
one which Shane outlines.


If the CM issues the listen its means I can connect to you now
only if you try to connect to me NOW, my understanding is that this is
useless protocol, but I will be happy to hear why I am wrong.

The IB stack co maintainer name is Sean 

Or.





Re: [ofa-general] lock dependency in ib_user_mad

2007-12-20 Thread Roland Dreier
  I see hangs killing opensm related to a bug in user_mad.c.  The problem 
  appears
  to be:
  
  ib_umad_close()
   downgrade_write(&file->port->mutex)
   ib_unregister_mad_agent(...)
   up_read(&file->port->mutex)
  
  ib_unregister_mad_agent() flushes any outstanding MADs, resulting in calls to
  send_handler() and recv_handler(), both of which call queue_packet():
  
  queue_packet()
   down_read(&file->port->mutex)
   ...
   up_read(&file->port->mutex)

This should be fine (and comes from an earlier set of changes to fix
deadlocks): ib_umad_close() does a downgrade_write() before calling
ib_unregister_mad_agent(), so it only holds the mutex with a read
lock, which means that queue_packet() should be able to take another
read lock.

Unless there's something that prevents one thread from taking a read
lock twice?  What kernel are you seeing these problems with?

  Does anyone know the reasoning for holding the mutex around
  ib_unregister_mad_agent()?

It's to keep things serialized against a port disappearing because a
device is being removed.  But looking at things, I think we can
probably rejigger the locking to make things simpler, and avoid the
use of downgrade_write(), which the -rt people don't like.

 - R.


[ofa-general] Re: iommu dma mapping alignment requirements

2007-12-20 Thread Steve Wise

Benjamin Herrenschmidt wrote:

On Thu, 2007-12-20 at 15:02 -0600, Steve Wise wrote:

Benjamin Herrenschmidt wrote:

Adding A few more people to the discussion. You may well be right and we
would have to provide the same alignment, though that sucks a bit as one
of the reason we switched to 4K for the IOMMU is that the iommu space
available on pSeries is very small and we were running out of it with
64K pages and lots of networking activity.

But smarter NIC drivers can resolve this too, I think, perhaps by
carving up full pages of mapped buffers instead of just assuming mapping
is free...


True, but the problem still happens today: if we switch back to 64K
iommu page size (which should be possible, I need to fix that), we
-will- run out of iommu space on typical workloads and that is not
acceptable.

So we need to find a compromise.

What I might do is something along the lines of: if size >= PAGE_SIZE,
and vaddr (page_address + offset) is PAGE_SIZE aligned, then I enforce
alignment of the resulting mapping.

That should fix your case. Anything requesting smaller than PAGE_SIZE
mappings would lose that alignment but I -think- it should be safe, and
you still always get 4K alignment anyway (+/- your offset) so at least
small alignment restrictions are still enforced (such as cache line
alignment etc...).

I'll send you a test patch later today.

Ben.



Sounds good.  Thanks!

Note that these smaller sub-host-page-sized mappings might pollute the
address space, causing fully aligned host-page-size maps to become
scarce...  Maybe there's a clever way to keep those in their own segment
of the address space?





[ofa-general] [PATCH] [RFC] IPOIB/CM Enable SRQ support on HCAs with less than 16 SG entries

2007-12-20 Thread Pradeep Satyanarayana
Some HCAs like ehca2 support fewer than 16 SG entries. Currently IPoIB/CM
implicitly assumes all HCAs will support 16 SG entries of 4K pages for 64K 
MTUs. This patch removes that restriction.

This patch continues to use order 0 allocations and enables implementation of 
connected mode on such HCAs with smaller MTUs. HCAs having the capability to 
support 16 SG entries are left untouched.

This patch addresses bug# 728:
https://bugs.openfabrics.org/show_bug.cgi?id=728

While working on this patch I discovered that mthca reports an incorrect
value of max_srq_sge; I reported this issue several weeks ago as well.
I worked around it by using a hard-coded value of 16 for max_srq_sge
(mthca only). More on that in a following mail.

Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED]
---

--- a/drivers/infiniband/ulp/ipoib/ipoib.h	2007-11-03 11:37:02.0 -0700
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h	2007-12-20 13:17:43.0 -0800
@@ -466,6 +466,7 @@ void ipoib_drain_cq(struct net_device *d
 #define IPOIB_CM_SUPPORTED(ha)   (ha[0] & (IPOIB_FLAGS_RC))
 
 extern int ipoib_max_conn_qp;
+extern int max_cm_mtu;
 
 static inline int ipoib_cm_admin_enabled(struct net_device *dev)
 {
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-11-21 07:46:35.0 -0800
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-12-20 14:47:13.0 -0800
@@ -74,6 +74,9 @@ static struct ib_send_wr ipoib_cm_rx_dra
	.opcode = IB_WR_SEND,
 };
 
+static int num_of_frags;
+int max_cm_mtu;
+
 static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
			       struct ib_cm_event *event);
 
@@ -96,13 +99,13 @@ static int ipoib_cm_post_receive_srq(str
 
	priv->cm.rx_wr.wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV;
 
-	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
+	for (i = 0; i < num_of_frags; ++i)
		priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];
 
	ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr);
	if (unlikely(ret)) {
		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
-		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
+		ipoib_cm_dma_unmap_rx(priv, num_of_frags - 1,
				      priv->cm.srq_ring[id].mapping);
		dev_kfree_skb_any(priv->cm.srq_ring[id].skb);
		priv->cm.srq_ring[id].skb = NULL;
@@ -623,6 +626,7 @@ repost:
			--p->recv_count;
			ipoib_warn(priv, "ipoib_cm_post_receive_nonsrq failed "
				   "for buf %d\n", wr_id);
+			kfree(mapping); /*** Check if this needed ***/
		}
	}
 }
@@ -1399,16 +1403,17 @@ int ipoib_cm_add_mode_attr(struct net_de
	return device_create_file(&dev->dev, &dev_attr_mode);
 }
 
-static void ipoib_cm_create_srq(struct net_device *dev)
+static void ipoib_cm_create_srq(struct net_device *dev, int max_sge)
 {
	struct ipoib_dev_priv *priv = netdev_priv(dev);
	struct ib_srq_init_attr srq_init_attr = {
		.attr = {
			.max_wr  = ipoib_recvq_size,
-			.max_sge = IPOIB_CM_RX_SG
		}
	};
 
+	srq_init_attr.attr.max_sge = max_sge;
+
	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
	if (IS_ERR(priv->cm.srq)) {
		if (PTR_ERR(priv->cm.srq) != -ENOSYS)
@@ -1418,6 +1423,7 @@ static void ipoib_cm_create_srq(struct n
		return;
	}
 
+
	priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring,
				    GFP_KERNEL);
	if (!priv->cm.srq_ring) {
@@ -1431,7 +1437,9 @@ static void ipoib_cm_create_srq(struct n
 int ipoib_cm_dev_init(struct net_device *dev)
 {
	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	int i;
+	int i, ret;
+	struct ib_srq_attr srq_attr;
+	struct ib_device_attr attr;
 
	INIT_LIST_HEAD(&priv->cm.passive_ids);
	INIT_LIST_HEAD(&priv->cm.reap_list);
@@ -1448,22 +1456,46 @@ int ipoib_cm_dev_init(struct net_device 
 
	skb_queue_head_init(&priv->cm.skb_queue);
 
-	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
+	ret = ib_query_device(priv->ca, &attr);
+	if (ret) {
+		printk(KERN_WARNING "ib_query_device() failed with %d\n", ret);
+		return ret;
+	}
+
+	ipoib_dbg(priv, "max_srq_sge=%d\n", attr.max_srq_sge);
+
+	ipoib_cm_create_srq(dev, attr.max_srq_sge);
+
+	if (ipoib_cm_has_srq(dev)) {
+		ret = ib_query_srq(priv->cm.srq, &srq_attr);
+		if (ret) {
+			printk(KERN_WARNING "ib_query_srq() failed with %d\n", ret);
+			return -EINVAL;
+		}
+		/* pad similar to IPOIB_CM_MTU */
+		max_cm_mtu = srq_attr.max_sge * PAGE_SIZE - 0x10;
+		num_of_frags = srq_attr.max_sge;
+		ipoib_dbg(priv, 

Re: [ofa-general] lock dependency in ib_user_mad

2007-12-20 Thread Sean Hefty

This should be fine (and comes from an earlier set of changes to fix
deadlocks): ib_umad_close() does a downgrade_write() before calling
ib_unregister_mad_agent(), so it only holds the mutex with a read
lock, which means that queue_packet() should be able to take another
read lock.


I'll see if I can reproduce and get more info.  I thought the mutex was 
contributing to the hang, but you're right.



Unless there's something that prevents one thread from taking a read
lock twice?  What kernel are you seeing these problems with?


I'm running 2.6.24-rc3.

I'm out on vacation through the end of the year, so I'm not sure if I'll 
be able to debug this further for a couple of weeks.


- Sean


[ofa-general] Oops in mthca

2007-12-20 Thread Pradeep Satyanarayana
I discovered the following Oops while developing a patch to enable SRQ on HCAs 
with fewer than
16 SG elements.

The root of this issue appears to be that ib_query_device(priv-ca, attr)
reports an incorrect value for attr.max_srq_sge. The value that
ib_query_device returns is 28 (instead of 16 that I expected).


Dec 20 13:19:47 elm3b39 kernel: Oops: Kernel access of bad area, sig: 11 [#2]
Dec 20 13:19:47 elm3b39 kernel: SMP NR_CPUS=128 NUMA pSeries
Dec 20 13:19:47 elm3b39 kernel: Modules linked in: ib_ipoib autofs4 rdma_ucm 
rdma_cm ib_addr iw_cm ib_uverbs ib_umad ib_mthca ib_cm ib_sa ib_mad ib_core 
ipv6 binfmt_misc parport_pc lp parport sg e1000 dm_snapshot dm_zero dm_mirror 
dm_mod ipr libata firmware_class sd_mod scsi_mod ehci_hcd ohci_hcd usbcore
Dec 20 13:19:47 elm3b39 kernel: NIP: d02ffb60 LR: d02ffb08 CTR: 
c043a9b0
Dec 20 13:19:47 elm3b39 kernel: REGS: c001d05ff2e0 TRAP: 0300   Tainted: G  
D  (2.6.24-rc5)
Dec 20 13:19:47 elm3b39 kernel: MSR: 80009032 EE,ME,IR,DR  CR: 
24024424  XER: 0010
Dec 20 13:19:47 elm3b39 kernel: DAR: 60bf0008, DSISR: 4000
Dec 20 13:19:47 elm3b39 kernel: TASK = c001d2e4a000[8233] 'modprobe' 
THREAD: c001d05fc000 CPU: 4
Dec 20 13:19:47 elm3b39 kernel: GPR00: 0001 c001d05ff560 
d0320308 c001d2e54010
Dec 20 13:19:47 elm3b39 kernel: GPR04:  0001 
c001d0654000 0001
Dec 20 13:19:47 elm3b39 kernel: GPR08:  001c 
60bf 60bf
Dec 20 13:19:47 elm3b39 kernel: GPR12: d0301fc8 c057f600 
d05a2090 d05a20d0
Dec 20 13:19:47 elm3b39 kernel: GPR16:  01e3 
01e3 d032eba0
Dec 20 13:19:47 elm3b39 kernel: GPR20:  0034 
c001d05ff690 0001
Dec 20 13:19:47 elm3b39 kernel: GPR24: c000e482b000  
 
Dec 20 13:19:47 elm3b39 kernel: GPR28: c001d2972c00  
d031f190 c001d020ee78
Dec 20 13:19:47 elm3b39 kernel: NIP [d02ffb60] 
.mthca_tavor_post_srq_recv+0xe0/0x2e0 [ib_mthca]
Dec 20 13:19:47 elm3b39 kernel: LR [d02ffb08] 
.mthca_tavor_post_srq_recv+0x88/0x2e0 [ib_mthca]
Dec 20 13:19:47 elm3b39 kernel: Call Trace:
Dec 20 13:19:47 elm3b39 kernel: [c001d05ff560] [d02ffad4] 
.mthca_tavor_post_srq_recv+0x54/0x2e0 [ib_mthca] (unreliable)
Dec 20 13:19:47 elm3b39 kernel: [c001d05ff620] [d03239fc] 
.ipoib_cm_post_receive_srq+0xbc/0x150 [ib_ipoib]
Dec 20 13:19:47 elm3b39 kernel: [c001d05ff6d0] [d0325984] 
.ipoib_cm_dev_init+0x2f4/0x560 [ib_ipoib]
Dec 20 13:19:47 elm3b39 kernel: [c001d05ff870] [d0322c74] 
.ipoib_transport_dev_init+0xd4/0x330 [ib_ipoib]
Dec 20 13:19:47 elm3b39 kernel: [c001d05ff970] [d031f90c] 
.ipoib_ib_dev_init+0x3c/0xc0 [ib_ipoib]
Dec 20 13:19:47 elm3b39 kernel: [c001d05ffa00] [d031aaac] 
.ipoib_dev_init+0x9c/0x160 [ib_ipoib]
Dec 20 13:19:48 elm3b39 kernel: [c001d05ffaa0] [d031ad98] 
.ipoib_add_one+0x228/0x3b0 [ib_ipoib]
Dec 20 13:19:48 elm3b39 kernel: [c001d05ffb60] [d01bf6ec] 
.ib_register_client+0xcc/0x110 [ib_core]
Dec 20 13:19:48 elm3b39 kernel: [c001d05ffc00] [d0328484] 
.ipoib_init_module+0x174/0x2288 [ib_ipoib]
Dec 20 13:19:48 elm3b39 kernel: [c001d05ffc90] [c008eeec] 
.sys_init_module+0x20c/0x1aa0
Dec 20 13:19:48 elm3b39 kernel: [c001d05ffe30] [c00086ac] 
syscall_exit+0x0/0x40
Dec 20 13:19:48 elm3b39 kernel: Instruction dump:
Dec 20 13:19:48 elm3b39 kernel: 419c0204 2f89 38630010 38e0 409d0060 
38e0 3900 6000
Dec 20 13:19:48 elm3b39 kernel: e95f0010 38070001 7c0707b4 7d6a4214 800b0008 
9003 6000 6000


lspci -v gives me the following:

0002:d8:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) 
(prog-if 00 [Normal decode])
Flags: bus master, 66MHz, medium devsel, latency 144
Bus: primary=d8, secondary=d9, subordinate=d9, sec-latency=128
Memory behind bridge: c000-c88f
Capabilities: [70] PCI-X bridge device

0002:d9:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
Subsystem: Mellanox Technologies MT23108 InfiniHost
Flags: bus master, 66MHz, medium devsel, latency 144, IRQ 121
Memory at 400c880 (64-bit, non-prefetchable) [size=1M]
Memory at 400c800 (64-bit, prefetchable) [size=8M]
Memory at 400c000 (64-bit, prefetchable) [size=128M]
Capabilities: [40] MSI-X: Enable- Mask- TabSize=32
Capabilities: [50] Vital Product Data
Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 
Enable-
Capabilities: [70] PCI-X non-bridge device

Pradeep



[ofa-general] Java invoke the verbs through JNI

2007-12-20 Thread zhang Jackie
Hi, all

I just wrote a JNI program to use IB from a Java program. I wrote some simple
test programs and they work fine. But when I try to integrate it with another
program, a "Local protection error" is reported. It is unstable and fails
most of the time.
Can someone give me some advice? Thanks.

Re: [ofa-general] Oops in mthca

2007-12-20 Thread Roland Dreier
  I discovered the following Oops while developing a patch to enable SRQ on 
  HCAs with fewer than
  16 SG elements.

So is this oops with some version of your patch for limited SRQ
scatter entries applied?  It's hard to know exactly what is going
wrong but I suspect that if you get a device that allows more than 16
SRQ scatter entries, your patch passes that value for num_sg without
changing the declaration of rx_sge[] to have enough entries, so when
posting the receive request, the low-level driver goes off the end of
the array.

  The root of this issue appears to be that ib_query_device(priv-ca, attr)
  reports an incorrect value for attr.max_srq_sge. The value that
  ib_query_device returns is 28 (instead of 16 that I expected).

Why do you think the value 28 is incorrect?  Unfortunately I don't
have any PCI-X systems any more, but I don't see anything obvious in
the mthca code that would make the value it returns for max_srq_sge
wrong.

 - R.


Re: [ofa-general] [PATCH] [RFC] IPOIB/CM Enable SRQ support on HCAs with less than 16 SG entries

2007-12-20 Thread Roland Dreier
  +static int num_of_frags;
  +int max_cm_mtu;

I think these values need to be per-interface -- think of the case of
a system with more than one type of HCA installed, where the different
HCAs have different limits.

  @@ -623,6 +626,7 @@ repost:
  		--p->recv_count;
  		ipoib_warn(priv, "ipoib_cm_post_receive_nonsrq failed "
  			   "for buf %d\n", wr_id);
  +		kfree(mapping); /*** Check if this needed ***/

This looks really bogus -- I don't see anything in your patch that
changes mapping from being allocated on the stack.

  +	if (ipoib_cm_has_srq(dev)) {
  +		ret = ib_query_srq(priv->cm.srq, &srq_attr);
  +		if (ret) {
  +			printk(KERN_WARNING "ib_query_srq() failed with %d\n", ret);
  +			return -EINVAL;
  +		}
  +		/* pad similar to IPOIB_CM_MTU */
  +		max_cm_mtu = srq_attr.max_sge * PAGE_SIZE - 0x10;
  +		num_of_frags = srq_attr.max_sge;
  +		ipoib_dbg(priv, "max_cm_mtu = 0x%x, num_of_frags=%d\n",
  +			  max_cm_mtu, num_of_frags);
  +	} else {
  +		max_cm_mtu = IPOIB_CM_MTU;
  +		num_of_frags = IPOIB_CM_RX_SG;
  +	}

I think in the SRQ case you still want to make sure num_of_frags is no
more than IPOIB_CM_RX_SG.  And if we're going to check the SRQ scatter
capabilities, we should probably add the same thing for the non-SRQ
case to make sure we don't exceed what QP receive queues can handle.


[ofa-general] [PATCH] ibnetdiscover - ports report

2007-12-20 Thread Erez Strauss
Hello IB developers and users,

 

I would like to get feedback on the following patch to ibnetdiscover.

 

The patch introduces an additional output mode for ibnetdiscover that is
focused on ports, printing one line for each port with the needed
port information.

 

The output looks like:

SW 4 18 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4 17 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4 16 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4 15 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4 14 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4 13 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4  9 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4  8 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4  7 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4  6 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4  5 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4  4 0x0008f104003f0838 4x SDR 'ISR9288/ISR9096 Voltaire sLB-24'
SW 4  1 0x0008f104003f0838 4x SDR - SW 6  3 0x0008f104004005f5 ( 'ISR9288/ISR9096 Voltaire sLB-24' - 'ISR9288 Voltaire sFB-12' )
SW 4  2 0x0008f104003f0838 4x SDR - SW 7  3 0x0008f104004005f6 ( 'ISR9288/ISR9096 Voltaire sLB-24' - 'ISR9288 Voltaire sFB-12' )
SW 4  3 0x0008f104003f0838 4x SDR - SW 1  3 0x0008f104004005f7 ( 'ISR9288/ISR9096 Voltaire sLB-24' - 'ISR9288 Voltaire sFB-12' )
SW 4 10 0x0008f104003f0838 4x SDR - SW 8  3 0x0008f104004006f5 ( 'ISR9288/ISR9096 Voltaire sLB-24' - 'ISR9288 Voltaire sFB-12' )
SW 4 11 0x0008f104003f0838 4x SDR - SW 9  3 0x0008f104004006f6 ( 'ISR9288/ISR9096 Voltaire sLB-24' - 'ISR9288 Voltaire sFB-12' )
SW 4 12 0x0008f104003f0838 4x SDR - SW10  3 0x0008f104004006f7 ( 'ISR9288/ISR9096 Voltaire sLB-24' - 'ISR9288 Voltaire sFB-12' )
CA14  1 0x0008f10403960091 4x SDR - SW 4 20 0x0008f104003f0838 ( 'Voltaire HCA400' - 'ISR9288/ISR9096 Voltaire sLB-24' )
CA11  1 0x0002c90107a4e431 4x SDR - SW 4 19 0x0008f104003f0838 ( 'Voltaire HCA400' - 'ISR9288/ISR9096 Voltaire sLB-24' )
CA 2  1 0x0008f1000102d801 4x SDR - SW 1 15 0x0008f104004005f7 ( 'Voltaire IB-to-TCP/IP Router' - 'ISR9288 Voltaire 

 

 

Thanks,

Erez Strauss
Voltaire.

-

Date:   Thu Dec 20 19:36:14 2007 -0500

    Added the -p(orts) option to generate ports reports

Signed-off-by: Erez Strauss erezs _at_ voltaire.com
---
 infiniband-diags/src/ibnetdiscover.c |   64 --
 1 files changed, 61 insertions(+), 3 deletions(-)

 

diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
index 8b229c1..3c2e6b6 100644
--- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c
@@ -119,6 +119,17 @@ get_linkspeed_str(int linkspeed)
 	return linkspeed_str[linkspeed];
 }
 
+static inline const char *
+node_type_str2(Node *node)
+{
+	switch (node->type) {
+	case SWITCH_NODE: return "SW";
+	case CA_NODE:     return "CA";
+	case ROUTER_NODE: return "RT";
+	}
+	return "??";
+}
+
 int
 get_port(Port *port, int portnum, ib_portid_t *portid)
 {
@@ -839,11 +850,50 @@ dump_topology(int listtype, int group)
 	return i;
 }
 
+void dump_ports_report()
+{
+	int b, n = 0, p;
+	Node *node;
+	Port *port;
+
+	// If switch and LID == 0, search other switch ports for a
+	// valid LID and assign it to all ports of that switch
+	for (b = 0; b <= MAXHOPS; b++)
+		for (node = nodesdist[b]; node; node = node->dnext)
+			if (node->type == SWITCH_NODE) {
+				int swlid = 0;
+				for (p = 0, port = node->ports;
+				     p < node->numports && port && !swlid;
+				     port = port->next)
+					if (port->lid != 0)
+						swlid = port->lid;
+				for (p = 0, port = node->ports;
+				     p < node->numports && port;
+				     port = port->next)
+					port->lid = swlid;
+			}
+	for (b = 0; b <= MAXHOPS; b++)
+		for (node = nodesdist[b]; node; node = node->dnext) {
+			for (p = 0, port = node->ports;
+			     p < node->numports && port;
+			     p++, port = port->next) {
+				fprintf(stdout, "%2s %5d %2d 0x%016llx %s %s",
+					node_type_str2(port->node),
+					port->lid, port->portnum,
+					(unsigned long long)port->portguid,
+					get_linkwidth_str(port->linkwidth),
+					get_linkspeed_str(port->linkspeed));
+				if (port->remoteport)
+					fprintf(stdout, " - %2s %5d %2d 0x%016llx ( '%s' - 

Re: [ofa-general] [PATCH] [RFC] IPOIB/CM Enable SRQ support on HCAs with less than 16 SG entries

2007-12-20 Thread Pradeep Satyanarayana
Good points. I will incorporate your comments.

Roland Dreier wrote:
   +static int num_of_frags;
   +int max_cm_mtu;
 
 I think these values need to be per-interface -- think of the case of
 a system with more than one type of HCA installed, where the different
 HCAs have different limits.
 
   @@ -623,6 +626,7 @@ repost:
  		--p->recv_count;
  		ipoib_warn(priv, "ipoib_cm_post_receive_nonsrq failed "
  			   "for buf %d\n", wr_id);
   +		kfree(mapping); /*** Check if this needed ***/
 
 This looks really bogus -- I don't see anything in your patch that
 changes mapping from being allocated on the stack.

Right, as the comment indicates, it is a holdover from something else
that slipped into the patch.

Pradeep



[ofa-general] Re: iommu dma mapping alignment requirements

2007-12-20 Thread Steve Wise



Benjamin Herrenschmidt wrote:

Sounds good.  Thanks!

Note that these smaller sub-host-page-sized mappings might pollute the 
address space, causing fully aligned host-page-sized mappings to become 
scarce...  Maybe there's a clever way to keep those in their own segment 
of the address space?


We already have a large vs. small split in the iommu virtual space to
alleviate this (though it's not a hard constraint, we can still get
into the other side if the default one is full).

Try that patch and let me know:


Seems to be working!

:)




Index: linux-work/arch/powerpc/kernel/iommu.c
===================================================================
--- linux-work.orig/arch/powerpc/kernel/iommu.c	2007-12-21 10:39:39.0 +1100
+++ linux-work/arch/powerpc/kernel/iommu.c	2007-12-21 10:46:18.0 +1100
@@ -278,6 +278,7 @@ int iommu_map_sg(struct iommu_table *tbl
 	unsigned long flags;
 	struct scatterlist *s, *outs, *segstart;
 	int outcount, incount, i;
+	unsigned int align;
 	unsigned long handle;
 
 	BUG_ON(direction == DMA_NONE);
@@ -309,7 +310,11 @@ int iommu_map_sg(struct iommu_table *tbl
 		/* Allocate iommu entries for that segment */
 		vaddr = (unsigned long) sg_virt(s);
 		npages = iommu_num_pages(vaddr, slen);
-		entry = iommu_range_alloc(tbl, npages, &handle, mask >> IOMMU_PAGE_SHIFT, 0);
+		align = 0;
+		if (IOMMU_PAGE_SHIFT < PAGE_SHIFT && (vaddr & ~PAGE_MASK) == 0)
+			align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
+		entry = iommu_range_alloc(tbl, npages, &handle,
+					  mask >> IOMMU_PAGE_SHIFT, align);
 
 		DBG("  - vaddr: %lx, size: %lx\n", vaddr, slen);
 
@@ -572,7 +577,7 @@ dma_addr_t iommu_map_single(struct iommu
 {
 	dma_addr_t dma_handle = DMA_ERROR_CODE;
 	unsigned long uaddr;
-	unsigned int npages;
+	unsigned int npages, align;
 
 	BUG_ON(direction == DMA_NONE);
 
@@ -580,8 +585,13 @@ dma_addr_t iommu_map_single(struct iommu
 	npages = iommu_num_pages(uaddr, size);
 
 	if (tbl) {
+		align = 0;
+		if (IOMMU_PAGE_SHIFT < PAGE_SHIFT &&
+		    ((unsigned long)vaddr & ~PAGE_MASK) == 0)
+			align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
+
 		dma_handle = iommu_alloc(tbl, vaddr, npages, direction,
-					 mask >> IOMMU_PAGE_SHIFT, 0);
+					 mask >> IOMMU_PAGE_SHIFT, align);
 		if (dma_handle == DMA_ERROR_CODE) {
 			if (printk_ratelimit())  {
 				printk(KERN_INFO "iommu_alloc failed, 




[ofa-general] nightly osm_sim report 2007-12-21:normal completion

2007-12-20 Thread kliteyn
OSM Simulation Regression Summary
 
[Generated mail - please do NOT reply]
 
 
OpenSM binary date = 2007-12-20
OpenSM git rev = Mon_Dec_17_15:20:43_2007 
[9988f459cb81dd025bde8b2dd53b3c551616be0c]
ibutils git rev = Wed_Dec_19_12:06:28_2007 
[9961475294fbf1d3782edb8f377a77b13fa80d70]
 
 
Total=560  Pass=559  Fail=1
 
 
Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 FatTree merge-roots-4-ary-2-tree.topo
14 FatTree merge-root-4-ary-3-tree.topo
14 FatTree gnu-stallion-64.topo
14 FatTree blend-4-ary-2-tree.topo
14 FatTree RhinoDDR.topo
14 FatTree FullGnu.topo
14 FatTree 4-ary-2-tree.topo
14 FatTree 2-ary-4-tree.topo
14 FatTree 12-node-spaced.topo
14 FTreeFail 4-ary-2-tree-missing-sw-link.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-2.topo
14 FTreeFail 4-ary-2-tree-links-at-same-rank-1.topo
14 FTreeFail 4-ary-2-tree-diff-num-pgroups.topo
13 LidMgr IS3-128.topo

Failures:
1 LidMgr IS3-128.topo


[ofa-general] Re: iommu dma mapping alignment requirements

2007-12-20 Thread Benjamin Herrenschmidt
BTW, I need to know urgently what HW is broken by this.

Ben.

