date:20090223

[ofa-general] ofa_1_4_kernel 20090223-0200 daily build status

2009-02-23 Thread Vladimir Sokolovsky (Mellanox)

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver

2009-02-23 Thread Jack Morgenstein

On Monday 23 February 2009 06:40, Roland Dreier wrote:
> Oh I see... we leave the sysfs stuff around way too long, since we want
> to use it for tracking the lifetime of our class device.  the patch
> below fixes things for me here... there's still room for substantial
> cleanup but I think this gets the crashes fixed at least:

I'm not sure that it does.  This does not make sysfs access atomic wrt
module unloading.  I think an app can still lose its timeslice while
inside the sysfs access, and module unload can still occur while the app
is waiting for a new time slice (although the code pages will not be
removed as yet -- see below).

While the module code pages will still be available, what prevents module
cleanup from deleting all the module's resources?  In this case, the app
will succeed in invoking the low-level driver (its code is still loaded),
but may cause an Oops when that low-level driver code attempts to access
low-level driver data structures (which have been freed).

What about the patch I just submitted?
http://lists.openfabrics.org/pipermail/general/2009-February/057565.html

([ofa-general] [PATCH] ib_core: avoid race condition between sysfs access and 
low-level module unload)

- Jack
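
The failure mode described in this message can be illustrated with a minimal, single-threaded userspace sketch. Every name in it (`fake_dev`, `dev_get`, `dev_put`, `dev_unregister`) is hypothetical; this is neither the ib_core API nor the submitted patch. It only shows why teardown has to wait for in-flight accessors before driver data is freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a low-level driver's per-device state. */
struct fake_dev {
	int in_flight;      /* accessors currently inside a sysfs-like read */
	bool unregistering; /* teardown has been requested */
	int *counters;      /* driver data that teardown frees */
};

struct fake_dev *dev_create(void)
{
	struct fake_dev *d = calloc(1, sizeof(*d));
	if (!d)
		return NULL;
	d->counters = calloc(4, sizeof(*d->counters));
	if (!d->counters) {
		free(d);
		return NULL;
	}
	return d;
}

/* Enter an access; refused once teardown has started. */
bool dev_get(struct fake_dev *d)
{
	if (d->unregistering)
		return false;
	d->in_flight++;
	return true;
}

/* Free the driver data only when nobody can still be using it. */
static void dev_release_if_idle(struct fake_dev *d)
{
	if (d->unregistering && d->in_flight == 0) {
		free(d->counters);
		d->counters = NULL;
	}
}

/* Leave an access; the last one out completes a pending teardown. */
void dev_put(struct fake_dev *d)
{
	assert(d->in_flight > 0);
	d->in_flight--;
	dev_release_if_idle(d);
}

/* Request teardown; data is freed now only if no access is in flight. */
void dev_unregister(struct fake_dev *d)
{
	d->unregistering = true;
	dev_release_if_idle(d);
}
```

In real kernel code the counter would need proper locking or a completion, which this sketch leaves out on purpose.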

[ofa-general] Too many calls to mlx4_CLOSE_PORT()?

2009-02-23 Thread Eli Cohen

Roland,

browsing the code, I see that mlx4_CLOSE_PORT() gets called from,
seemingly, too many places. I would expect it to get called only from
__mlx4_ib_modify_qp() when QP0 gets closed, but mlx4_ib_remove() calls
it too even though it is soon to be called by __mlx4_ib_modify_qp()
due to destroying the MAD QP. It also gets called from
mlx4_remove_one() even though by the time this function gets called,
the port is already closed. Is there a reason for that?

[ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Handle EEH events for active connections.

2009-02-23 Thread Steve Wise


Roland Dreier wrote:

> > - return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
> > + return (iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
>
> minor but the parens around the function call are totally unnecessary.
> If we're touching the line anyway may as well leave them off.

  


Sure.


> > +static int iwch_post_qp_fatal(int id, void *p, void *data)
> > +{
> > + struct ib_event event;
> > + struct iwch_qp *qhp = p;
> > +
> > + event.event = IB_EVENT_DEVICE_FATAL;
> > + event.device = qhp->ibqp.device;
> > + event.element.qp = &qhp->ibqp;
> > + BUG_ON(qhp->rhp != data);
> > + BUG_ON(qhp->wq.qpid != id);
> > + if (qhp->ibqp.event_handler) {
> > + PDBG("%s posting DEVICE_FATAL for qpid %u\n",
> > + __func__, qhp->wq.qpid);
> > + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);
>
> This doesn't match the IB driver behavior (or the IB spec) -- the
> DEVICE_FATAL event is unaffiliated and delivered for the adapter as a
> whole.  QP events are supposed to be for events connected to a single
> QP, not the whole adapter failing.

  



I'll change this to QP_FATAL then.



> BTW I don't think you need the * here, do you?  Would be easier to read
> to just call it like
>
> qhp->ibqp.event_handler(&event, qhp->ibqp.qp_context)

  



Ok.
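
As a small illustration of the point about the `*`: in C, calling through a function pointer with or without an explicit dereference is the same call. The types and names below (`demo_qp`, `demo_event`, `count_events`) are made up for this sketch and are not the verbs API.

```c
#include <assert.h>

/* Hypothetical event record and handler type, for illustration only. */
struct demo_event { int code; };
typedef void (*demo_handler_t)(struct demo_event *ev, void *context);

struct demo_qp {
	demo_handler_t event_handler;
	void *context;
};

/* Handler that counts how many times it was invoked. */
void count_events(struct demo_event *ev, void *context)
{
	(void)ev;
	++*(int *)context;
}

/* Invoke the handler both ways; returns the number of calls made. */
int call_both_ways(struct demo_qp *qp, struct demo_event *ev)
{
	int calls = 0;

	if (qp->event_handler) {
		(*qp->event_handler)(ev, qp->context); /* explicit dereference */
		calls++;
		qp->event_handler(ev, qp->context);    /* same call, no '*' */
		calls++;
	}
	return calls;
}
```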



> > +int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e)
> > +{
> > + int error=0;
> > + struct cxio_rdev *rdev;
> > +
> > + rdev = (struct cxio_rdev *)tdev->ulp;
> > + if (rdev->flags) {
>
> Might be nice to wrap this rdev->flags test up in a trivial inline
> function (eg iwch_eeh_set() or something like that) in case other things
> get put into those flags later.
  



Agreed.
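
A rough userspace sketch of that suggestion, assuming a single error bit plus a second, invented flag that might be added later. The names (`demo_rdev`, `DEMO_ERROR_FATAL`, `DEMO_OTHER_STATE`, `demo_fatal_error`) are hypothetical and only illustrate testing one bit rather than "any flag set".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; only the first mirrors the idea in the thread. */
#define DEMO_ERROR_FATAL  0x1u
#define DEMO_OTHER_STATE  0x2u  /* assumed future flag, for illustration */

struct demo_rdev {
	unsigned int flags;
};

/* Test one specific bit instead of "any flag set". */
static inline bool demo_fatal_error(const struct demo_rdev *rdev)
{
	return (rdev->flags & DEMO_ERROR_FATAL) != 0;
}
```

With the predicate in place, adding an unrelated flag does not make callers fail requests by accident.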



> > + kfree_skb(skb);
> > + return -EIO;
> > + }
> > + error = l2t_send(tdev, skb, l2e);
> > + if (error)
> > + kfree_skb(skb);
> > + return error;
> > +}
>
> The kfree_skb() calls here change behavior -- eg you have the change:
>
> > - l2t_send(ep->com.tdev, skb, ep->l2t);
> > - return 0;
> > + return iwch_l2t_send(ep->com.tdev, skb, ep->l2t);
>
> and now if l2t_send() returns an error the skb is freed, where before it
> wasn't.
  


In looking at the l2t_send code, it doesn't free on failure, so I 
believe this was a memory leak in the existing error path.
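
The ownership question can be sketched in userspace as follows. The contract of the lower-level send (consume on success, leave the buffer with the caller on failure) is an assumption taken from the sentence above, and all names (`demo_buf`, `demo_lower_send`, `demo_send`) are hypothetical, not the cxgb3 functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical buffer standing in for an sk_buff. */
struct demo_buf { int len; };

static int live_bufs; /* buffers allocated and not yet freed */

struct demo_buf *demo_alloc(int len)
{
	struct demo_buf *b = malloc(sizeof(*b));
	if (b) {
		b->len = len;
		live_bufs++;
	}
	return b;
}

void demo_free(struct demo_buf *b)
{
	if (b) {
		free(b);
		live_bufs--;
	}
}

int demo_live_bufs(void)
{
	return live_bufs;
}

/*
 * Lower-layer send (assumed contract): consumes the buffer on success,
 * leaves ownership with the caller on failure.
 */
int demo_lower_send(struct demo_buf *b, int link_up)
{
	if (!link_up)
		return -EIO;
	demo_free(b); /* "transmitted" */
	return 0;
}

/* Wrapper: the buffer is always consumed, whatever the outcome. */
int demo_send(struct demo_buf *b, int link_up)
{
	int error = demo_lower_send(b, link_up);

	if (error)
		demo_free(b);
	return error;
}
```

A caller that ignores the return value of `demo_lower_send` directly would leak the buffer on failure, which is the leak suspected in the existing error path.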



> Also I'm wondering why you want these wrappers in iw_cxgb3 -- would it
> not make more sense for the cxgb3 l2t_send() to check the eeh state and
> always behave appropriately?  Or is it more complicated than that?

  


Maybe.

Divy, what do you think?


Steve.



>  - R.
  



[ofa-general] el5.3 backport of 1.4(.0)

2009-02-23 Thread Stijn De Weirdt

hi all,

i am preparing an upgrade from SL5.2 to SL5.3 (which are EL5 clones).
one thing we would also like to look at is switching from OFED 1.3.2 to
OFED 1.4. and one thing i noticed is that the necessary 5.3 backport
fixes only exist in the current 1.4.1 daily snapshots.
did anyone already try to backport the el5.3 backport fixes from 1.4.1
to 1.4.0?

many thanks,

stijn


[ofa-general] OFED (EWG) meeting agenda for today (Feb 23)

2009-02-23 Thread Tziporet Koren


Hi All,
Due to something unexpected I cannot attend the meeting today :-(
I sent a mail to Gopal asking him to replace me but got no response yet.
If he can't, maybe Woody or Betsy can.

In any case - these are the items that should be covered:

a. OFED 1.4.1 release:
   1. SLES 11 - backport progress - Jeff Becker
   2. Open MPI 1.3.1 - Jeff Squyres
   3. RDS with iWARP support - Steve Wise
   4. NFS/RDMA backports - at least to RH 5.2/3 - Steve Wise
   5. Critical bugs:
      1287  maj  RHEL  ja...@mellanox.co.il    IPoIB datagram mode initial packet loss
      1516  cri  RHEL  andy.gro...@oracle.com  Kernel panic on RHAS4.x loading RDS

Note: There is a 1.4.1 release number in Bugzilla - please change the bug
release number to 1.4.1 if you wish it to be fixed for OFED 1.4.1

b. Open discussion

Tziporet


[ofa-general] RE: OFED (EWG) meeting agenda for today (Feb 23)

2009-02-23 Thread John Russo

Betsy can't make it today.  I will be covering for her.  Worst case, I will 
cover the items that you listed.

-----Original Message-----
From: ewg-boun...@lists.openfabrics.org 
[mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Tziporet Koren
Sent: Monday, February 23, 2009 11:11 AM
To: e...@lists.openfabrics.org
Cc: general@lists.openfabrics.org
Subject: [ewg] OFED (EWG) meeting agenda for today (Feb 23)


Hi All,
Due to something unexpected I cannot attend the meeting today :-(
I sent a mail to Gopal asking him to replace me but got no response yet.
If he can't, maybe Woody or Betsy can.

In any case - these are the items that should be covered:

a. OFED 1.4.1 release:
   1. SLES 11 - backport progress - Jeff Becker
   2. Open MPI 1.3.1 - Jeff Squyres
   3. RDS with iWARP support - Steve Wise
   4. NFS/RDMA backports - at least to RH 5.2/3 - Steve Wise
   5. Critical bugs:
      1287  maj  RHEL  ja...@mellanox.co.il    IPoIB datagram mode initial packet loss
      1516  cri  RHEL  andy.gro...@oracle.com  Kernel panic on RHAS4.x loading RDS

Note: There is a 1.4.1 release number in Bugzilla - please change the bug
release number to 1.4.1 if you wish it to be fixed for OFED 1.4.1

b. Open discussion

Tziporet

___
ewg mailing list
e...@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ofa-general] Re: [ewg] RE: OFED (EWG) meeting agenda for today (Feb 23)

2009-02-23 Thread Tziporet Koren


John Russo wrote:

Betsy can't make it today.  I will be covering for her.  Worst case, I will 
cover the items that you listed.


  

Many thanks

Tziporet


[ofa-general] Re: NFSRDMA connectathon prelim. testing status,

2009-02-23 Thread Tom Tucker


Vu:

What memory registration model are you using?

Vu Pham wrote:

Hi Tom,

I have both nfsrdma client and server on 2.6.29-rc5 kernel, 
nfs-utils-1.1.4. I'm using both Infinihost III (ib_mthca) and ConnectX 
(mlx4_ib) HCAs

I have seen several problems during my testing at NFS Connectathon 2009

1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the 
client can not mount. Talking to Tom Talpey and scanning the code, I saw 
that xprtrdma module is using ib_reg_phys_mr() and mlx4_ib verbs 
provider does not have the implementation for this verb.
If I have client on mlx4_ib and server on ib_mthca, I hit the following 
crash because of bad error handling in xprtrdma (see file attached - 
mlx4_mount_problem.log)


Because of this problem, I use InfiniHost III (ib_mthca) for all of my 
tests at Connectathon


2. Testing Linux nfsrdma client against both Linux and OpenSolaris 
nfsrdma servers, I hit the process hung problem during the 
connectathon's lock test (seeing sync_page_1.log and sync_page_2.log 
attached files). I can only reproduce it when I ran connectathon more 
than 500 iterations (-N 1000)

I can NOT reproduce the problem with nfs client/server over IPoIB

3. Testing openSolaris nfsrdma client against linux nfsrdma server, I 
hit the following BUG_ON() right away(see file attached - svcrdma_send.log)


thanks,
-vu




[ofa-general] [PATCH] opensm/osm_subnet: fix crash in qos string config parameters reloading

2009-02-23 Thread Sasha Khapyorsky


This fixes a double free() crash when reloading qos string config
parameters. Assuming that qos parameters can be specified only in the
config file, we always keep them in sync with the options copy loaded
from the file.

Signed-off-by: Sasha Khapyorsky <sas...@voltaire.com>
---

On 09:40 Mon 23 Feb , Eli Dorfman (Voltaire) wrote:
> Command Line Arguments:
>  Log File: /var/log/opensm.log
> -
> OpenSM 3.3.0_c4d9bcf

[snip...]

> Using default GUID 0x2c9020022f019
>  Loading Cached Option:qos_vlarb_high = 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> *** glibc detected *** ./sbin/opensm: double free or corruption (!prev): 0x1bd932b0 ***

This happens because the qos string parameter is freed separately in
subn_init_qos_options() while its mirror pointer in the file config copy
still refers to the already freed memory. Thanks for finding this. The
patch should fix the issue.

Sasha
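
The aliasing problem described above can be reduced to a small userspace sketch. It assumes a live option block and a shallow "file copy" that share one heap string; the names (`demo_qos`, `demo_load`, `demo_reset`) are hypothetical and this is not the opensm code, only the same idea of copying the reset state into the mirror so no stale pointer survives.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical option block holding one heap-allocated string. */
struct demo_qos {
	int max_vls;
	char *sl2vl;
};

/* Duplicate a string with malloc (avoids relying on non-ISO strdup). */
static char *demo_dup(const char *s)
{
	char *p = malloc(strlen(s) + 1);
	if (p)
		strcpy(p, s);
	return p;
}

/* Load a value into the live options and mirror it in the file copy. */
int demo_load(struct demo_qos *opt, struct demo_qos *file_copy, const char *val)
{
	opt->sl2vl = demo_dup(val);
	if (!opt->sl2vl)
		return -1;
	*file_copy = *opt; /* shallow copy: both now point at the same buffer */
	return 0;
}

/*
 * Reset before a reload. Freeing only through 'opt' would leave
 * file_copy->sl2vl dangling; copying the reset state keeps them in sync.
 */
void demo_reset(struct demo_qos *opt, struct demo_qos *file_copy)
{
	opt->max_vls = 0;
	free(opt->sl2vl);
	opt->sl2vl = NULL;
	if (file_copy)
		memcpy(file_copy, opt, sizeof(*file_copy));
}
```

Without the `memcpy()` in `demo_reset()`, a later `free(file_copy->sl2vl)` would free the same buffer a second time.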

 opensm/opensm/osm_subnet.c |   29 ++---
 1 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
index 01478be..b3100a4 100644
--- a/opensm/opensm/osm_subnet.c
+++ b/opensm/opensm/osm_subnet.c
@@ -640,7 +640,7 @@ static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
 	opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
 }
 
-static void subn_init_qos_options(IN osm_qos_options_t * opt)
+static void subn_init_qos_options(osm_qos_options_t *opt, osm_qos_options_t *f)
 {
 	opt->max_vls = 0;
 	opt->high_limit = -1;
@@ -653,6 +653,8 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt)
 	if (opt->sl2vl)
 		free(opt->sl2vl);
 	opt->sl2vl = NULL;
+	if (f)
+		memcpy(f, opt, sizeof(*f));
 }
 
 /**
@@ -743,11 +745,11 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
 	p_opt->no_clients_rereg = FALSE;
 	p_opt->prefix_routes_file = strdup(OSM_DEFAULT_PREFIX_ROUTES_FILE);
 	p_opt->consolidate_ipv6_snm_req = FALSE;
-	subn_init_qos_options(&p_opt->qos_options);
-	subn_init_qos_options(&p_opt->qos_ca_options);
-	subn_init_qos_options(&p_opt->qos_sw0_options);
-	subn_init_qos_options(&p_opt->qos_swe_options);
-	subn_init_qos_options(&p_opt->qos_rtr_options);
+	subn_init_qos_options(&p_opt->qos_options, NULL);
+	subn_init_qos_options(&p_opt->qos_ca_options, NULL);
+	subn_init_qos_options(&p_opt->qos_sw0_options, NULL);
+	subn_init_qos_options(&p_opt->qos_swe_options, NULL);
+	subn_init_qos_options(&p_opt->qos_rtr_options, NULL);
 }
 
 /**
@@ -1192,11 +1194,16 @@ int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn)
 		return -1;
 	}
 
-	subn_init_qos_options(&p_opts->qos_options);
-	subn_init_qos_options(&p_opts->qos_ca_options);
-	subn_init_qos_options(&p_opts->qos_sw0_options);
-	subn_init_qos_options(&p_opts->qos_swe_options);
-	subn_init_qos_options(&p_opts->qos_rtr_options);
+	subn_init_qos_options(&p_opts->qos_options,
+			      &p_opts->file_opts->qos_options);
+	subn_init_qos_options(&p_opts->qos_ca_options,
+			      &p_opts->file_opts->qos_ca_options);
+	subn_init_qos_options(&p_opts->qos_sw0_options,
+			      &p_opts->file_opts->qos_sw0_options);
+	subn_init_qos_options(&p_opts->qos_swe_options,
+			      &p_opts->file_opts->qos_swe_options);
+	subn_init_qos_options(&p_opts->qos_rtr_options,
+			      &p_opts->file_opts->qos_rtr_options);
 
 	while (fgets(line, 1023, opts_file) != NULL) {
 		/* get the first token */
-- 
1.6.1.2.319.gbd9e


[ofa-general] [PATCH] opensm/main.c: remove enable_stack_dump() call

2009-02-23 Thread Sasha Khapyorsky


The enable_stack_dump() symbol was defined in the already removed
libibcommon. There is still a conditional (under #ifdef _DEBUG_) call to
this function in opensm/main.c, which breaks opensm linkage when
--enable-debug is configured. Remove it.

Signed-off-by: Sasha Khapyorsky <sas...@voltaire.com>
---
 opensm/opensm/main.c |3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index e22c2c4..47fd658 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -596,9 +596,6 @@ int main(int argc, char *argv[])
 		       osm_is_debug(), cl_is_debug());
 		exit(1);
 	}
-#if defined (_DEBUG_) && defined (OSM_VENDOR_INTF_OPENIB)
-	enable_stack_dump(1);
-#endif
 
 	printf("-\n");
 	printf("%s\n", OSM_VERSION);
-- 
1.6.1.2.319.gbd9e


[ofa-general] Re: NFSRDMA connectathon prelim. testing status,

2009-02-23 Thread Vu Pham


Tom,


Vu:

What memory registration model are you using?


It is 6 (when the connection/mount established)




Vu Pham wrote:

Hi Tom,

I have both nfsrdma client and server on 2.6.29-rc5 kernel, 
nfs-utils-1.1.4. I'm using both Infinihost III (ib_mthca) and 
ConnectX (mlx4_ib) HCAs

I have seen several problems during my testing at NFS Connectathon 2009

1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the 
client can not mount. Talking to Tom Talpey and scanning the code, I 
saw that xprtrdma module is using ib_reg_phys_mr() and mlx4_ib verbs 
provider does not have the implementation for this verb.
If I have client on mlx4_ib and server on ib_mthca, I hit the 
following crash because of bad error handling in xprtrdma (see file 
attached - mlx4_mount_problem.log)


Because of this problem, I use InfiniHost III (ib_mthca) for all of 
my tests at Connectathon


2. Testing Linux nfsrdma client against both Linux and OpenSolaris 
nfsrdma servers, I hit the process hung problem during the 
connectathon's lock test (seeing sync_page_1.log and sync_page_2.log 
attached files). I can only reproduce it when I ran connectathon more 
than 500 iterations (-N 1000)

I can NOT reproduce the problem with nfs client/server over IPoIB

3. Testing openSolaris nfsrdma client against linux nfsrdma server, I 
hit the following BUG_ON() right away(see file attached - 
svcrdma_send.log)


thanks,
-vu






[ofa-general] Re: NFSRDMA connectathon prelim. testing status,

2009-02-23 Thread Tom Talpey

At 01:03 PM 2/23/2009, Vu Pham wrote:
> Tom,
>
> > Vu:
> >
> > What memory registration model are you using?
>
> It is 6 (when the connection/mount established)

i.e. all physical (get_dma_mr). Long chunk lists due to discontiguous
physical pages.

We'll try with ConnectX and frmr's later today here at Connectathon.
This will reduce the chunk lists to roughly three entries (head, pages,
tail).

With the two assertions disabled, we're again passing all general and
special tests from the OpenSolaris client, btw. :-)

Tom.




 Vu Pham wrote:
 Hi Tom,

 I have both nfsrdma client and server on 2.6.29-rc5 kernel, 
 nfs-utils-1.1.4. I'm using both Infinihost III (ib_mthca) and 
 ConnectX (mlx4_ib) HCAs
 I have seen several problems during my testing at NFS Connectathon 2009

 1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the 
 client can not mount. Talking to Tom Talpey and scanning the code, I 
 saw that xprtrdma module is using ib_reg_phys_mr() and mlx4_ib verbs 
 provider does not have the implementation for this verb.
 If I have client on mlx4_ib and server on ib_mthca, I hit the 
 following crash because of bad error handling in xprtrdma (see file 
 attached - mlx4_mount_problem.log)

 Because of this problem, I use InfiniHost III (ib_mthca) for all of 
 my tests at Connectathon

 2. Testing Linux nfsrdma client against both Linux and OpenSolaris 
 nfsrdma servers, I hit the process hung problem during the 
 connectathon's lock test (seeing sync_page_1.log and sync_page_2.log 
 attached files). I can only reproduce it when I ran connectathon more 
 than 500 iterations (-N 1000)
 I can NOT reproduce the problem with nfs client/server over IPoIB

 3. Testing openSolaris nfsrdma client against linux nfsrdma server, I 
 hit the following BUG_ON() right away(see file attached - 
 svcrdma_send.log)

 thanks,
 -vu






Re: [ofa-general] build warnings on rhel4 U6

2009-02-23 Thread Brian J. Murrell

On Mon, 2009-02-23 at 09:28 +0200, Jack Morgenstein wrote:
> In the backport spinlock.h file, try the following:
>
> #ifndef assert_spin_locked
> #define assert_spin_locked(lock)  do { (void)(lock); } while(0)
> #endif

Indeed.  That would be a solution for the end-user but that doesn't help
us as a third-party software developer (i.e. being restricted to
building our software with GA releases of OFED -- so that our release
doesn't turn into a patching nightmare for our end-users).

Indeed, this probably should have been a BZ filing as my goal was
equally as much to alert somebody to the problem to ensure future
releases don't have the same problem.
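
For reference, the fallback pattern quoted above can be tried in userspace like this. The lock type and the caller (`demo_lock`, `demo_update`) are hypothetical; the macro only evaluates its argument, so it silences "unused" warnings without checking anything.

```c
#include <assert.h>

/* Hypothetical lock type standing in for spinlock_t. */
struct demo_lock { int held; };

/*
 * Fallback used only when the environment does not already provide
 * the macro; it evaluates the argument and otherwise does nothing.
 */
#ifndef assert_spin_locked
#define assert_spin_locked(lock)  do { (void)(lock); } while (0)
#endif

/* Caller that documents its locking requirement. */
int demo_update(struct demo_lock *lock, int *value)
{
	assert_spin_locked(lock);
	return ++*value;
}
```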

Cheers and many thanks for the input.

b.




Re: [ofa-general] Re: NFSRDMA connectathon prelim. testing status,

2009-02-23 Thread Vu Pham


Tom,


What memory registration model are you using?


It is 6 (when the connection/mount established)




Vu Pham wrote:



2. Testing Linux nfsrdma client against both Linux and OpenSolaris 
nfsrdma servers, I hit the process hung problem during the 
connectathon's lock test (seeing sync_page_1.log and sync_page_2.log 
attached files). I can only reproduce it when I ran connectathon 
more than 500 iterations (-N 1000)

I can NOT reproduce the problem with nfs client/server over IPoIB
With mem_reg=4, I cannot reproduce this problem (running against both
OpenSolaris and Linux servers).





3. Testing openSolaris nfsrdma client against linux nfsrdma server, 
I hit the following BUG_ON() right away(see file attached - 
svcrdma_send.log)


After disabling the two BUG_ON()s, we can run the test multiple times
without any problem so far.


-vu

[ofa-general] [PATCH v2] RDMA/cxgb3: Handle EEH events for active connections.

2009-02-23 Thread Steve Wise

- wrap calls into cxgb3 and fail them if we're in the middle
  of an EEH event.

- correctly unwind and release endpoint and other resources when
  we are in an EEH event.

- post QP_FATAL event on all active QPs when cxgb3 notifies
  iw_cxgb3 of a fatal error.

Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
---
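
The third item in the description, posting a fatal event to every active QP, follows a per-entry callback pattern. The userspace sketch below shows that pattern only; the table walk (`demo_for_each_qp`) and all types are hypothetical stand-ins, since the part of the patch that performs the real iteration is not shown in this archive.

```c
#include <assert.h>
#include <stddef.h>

enum demo_event_type { DEMO_EVENT_NONE, DEMO_EVENT_QP_FATAL };

struct demo_qp {
	int id;
	enum demo_event_type last_event;
	void (*event_handler)(struct demo_qp *qp, enum demo_event_type ev);
};

/* Example consumer handler: just records the event. */
void demo_record_event(struct demo_qp *qp, enum demo_event_type ev)
{
	qp->last_event = ev;
}

/* Per-entry callback; returns 0 to continue iteration. */
static int demo_post_qp_fatal(int id, void *p, void *data)
{
	struct demo_qp *qp = p;
	int *posted = data;

	assert(qp->id == id);
	if (qp->event_handler) {
		qp->event_handler(qp, DEMO_EVENT_QP_FATAL);
		(*posted)++;
	}
	return 0;
}

/* Minimal table walk; stops early if the callback returns nonzero. */
int demo_for_each_qp(struct demo_qp **table, size_t n,
		     int (*fn)(int id, void *p, void *data), void *data)
{
	for (size_t i = 0; i < n; i++) {
		if (table[i]) {
			int ret = fn(table[i]->id, table[i], data);
			if (ret)
				return ret;
		}
	}
	return 0;
}

/* Notify every registered QP; returns how many handlers were called. */
int demo_fatal_all(struct demo_qp **table, size_t n)
{
	int posted = 0;

	demo_for_each_qp(table, n, demo_post_qp_fatal, &posted);
	return posted;
}
```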

 drivers/infiniband/hw/cxgb3/cxio_hal.c |   10 ++--
 drivers/infiniband/hw/cxgb3/cxio_hal.h |6 ++
 drivers/infiniband/hw/cxgb3/iwch.c |   26 +
 drivers/infiniband/hw/cxgb3/iwch.h |5 ++
 drivers/infiniband/hw/cxgb3/iwch_cm.c  |   90 +++-
 drivers/infiniband/hw/cxgb3/iwch_qp.c  |4 +
 6 files changed, 107 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c
index eeae5f5..1db88dd 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c
@@ -152,7 +152,7 @@ static int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid)
 	sge_cmd = qpid << 8 | 3;
 	wqe->sge_cmd = cpu_to_be64(sge_cmd);
 	skb->priority = CPL_PRIORITY_CONTROL;
-	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+	return iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb);
 }
 
 int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
@@ -571,7 +571,7 @@ static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p)
 	     (unsigned long long) rdev_p->ctrl_qp.dma_addr,
 	     rdev_p->ctrl_qp.workq, 1 << T3_CTRL_QP_SIZE_LOG2);
 	skb->priority = CPL_PRIORITY_CONTROL;
-	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+	return iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb);
 err:
 	kfree_skb(skb);
 	return err;
@@ -701,7 +701,7 @@ static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry,
 	u32 stag_idx;
 	u32 wptr;
 
-	if (rdev_p->flags)
+	if (cxio_fatal_error(rdev_p))
 		return -EIO;
 
 	stag_state = stag_state > 0;
@@ -858,7 +858,7 @@ int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr)
 	wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size);
 	wqe->irs = cpu_to_be32(attr->irs);
 	skb->priority = 0;	/* 0=ToeQ; 1=CtrlQ */
-	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+	return iwch_cxgb3_ofld_send(rdev_p->t3cdev_p, skb);
 }
 
 void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
@@ -1024,9 +1024,9 @@ void cxio_rdev_close(struct cxio_rdev *rdev_p)
 		cxio_hal_pblpool_destroy(rdev_p);
 		cxio_hal_rqtpool_destroy(rdev_p);
 		list_del(&rdev_p->entry);
-		rdev_p->t3cdev_p->ulp = NULL;
 		cxio_hal_destroy_ctrl_qp(rdev_p);
 		cxio_hal_destroy_resource(rdev_p->rscp);
+		rdev_p->t3cdev_p->ulp = NULL;
 	}
 }
 
diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.h b/drivers/infiniband/hw/cxgb3/cxio_hal.h
index 9ed65b0..2fd5d03 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.h
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.h
@@ -112,6 +112,11 @@ struct cxio_rdev {
 #define	CXIO_ERROR_FATAL	1
 };
 
+static inline int cxio_fatal_error(struct cxio_rdev *rdev_p)
+{
+	return (rdev_p->flags & CXIO_ERROR_FATAL);
+}
+
 static inline int cxio_num_stags(struct cxio_rdev *rdev_p)
 {
 	return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5));
@@ -185,6 +190,7 @@ void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
 void cxio_flush_hw_cq(struct t3_cq *cq);
 int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe,
 		     u8 *cqe_flushed, u64 *cookie, u32 *credit);
+int iwch_cxgb3_ofld_send(struct t3cdev *tdev, struct sk_buff *skb);
 
 #define MOD "iw_cxgb3: "
 #define PDBG(fmt, args...) pr_debug(MOD fmt, ## args)
diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
index 37a4fc2..3548861 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.c
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -162,15 +162,37 @@ static void close_rnic_dev(struct t3cdev *tdev)
 	mutex_unlock(&dev_mutex);
 }
 
+static int iwch_post_qp_fatal(int id, void *p, void *data)
+{
+	struct ib_event event;
+	struct iwch_qp *qhp = p;
+
+	event.event = IB_EVENT_QP_FATAL;
+	event.device = qhp->ibqp.device;
+	event.element.qp = &qhp->ibqp;
+	BUG_ON(qhp->rhp != data);
+	BUG_ON(qhp->wq.qpid != id);
+	if (qhp->ibqp.event_handler) {
+		PDBG("%s posting QP_FATAL for qpid %u\n",
+		     __func__, qhp->wq.qpid);
+		(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);
+	}
+	return 0;
+}
+
 static void iwch_err_handler(struct t3cdev *tdev, u32 status, u32 error)
 {
 	struct cxio_rdev *rdev = tdev->ulp;
+	struct iwch_dev *rnicp = rdev_to_iwch_dev(rdev);
 
-	if (status == OFFLOAD_STATUS_DOWN)
+	if (status == OFFLOAD_STATUS_DOWN) {

[ofa-general] [PATCH] [ib-diag] saquery: add support for WinOF

2009-02-23 Thread Sean Hefty

A lot of type casting with include fix-ups.  Luckily, because
the macro CHECK_AND_SET_VAL() was added, I could add type casts
into the macro and avoid sprinkling even more throughout the code.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
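
Two of the fix-up styles this description refers to can be sketched in plain C. Arithmetic on `void *` is a compiler extension, so portable code casts to `char *` first; and a cast placed inside a token-pasting macro covers every call site at once. The helper names below (`demo_record_at`, `DEMO_SET_IF_GREATER`, `demo_fill`) are hypothetical and deliberately simpler than the macro in saquery.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable byte-offset arithmetic: cast to char * before adding. */
void *demo_record_at(void *base, size_t header, unsigned i, size_t rec_size)
{
	return (char *) base + header + i * rec_size;
}

/*
 * Hypothetical helper: the 'size' argument is pasted into a fixed-width
 * type name, so one cast in the macro covers every call site.
 */
#define DEMO_SET_IF_GREATER(val, size, floor, target, mask_var, bit) \
	do { \
		if ((int##size##_t) (val) > (int##size##_t) (floor)) { \
			(target) = (uint##size##_t) (val); \
			(mask_var) |= (bit); \
		} \
	} while (0)

/* Apply the macro to one 16-bit field; returns the resulting mask. */
unsigned demo_fill(int requested, uint16_t *field)
{
	unsigned mask = 0;

	DEMO_SET_IF_GREATER(requested, 16, 0, *field, mask, 0x1u);
	return mask;
}
```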


 infiniband-diags/src/saquery.c |   80 ++--
 1 files changed, 44 insertions(+), 36 deletions(-)

diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 9726d22..9d5f475 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -37,20 +37,25 @@
  *
  */
 
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
 #include <unistd.h>
 #include <stdio.h>
 #include <arpa/inet.h>
 #include <ctype.h>
 #include <string.h>
 #include <errno.h>
+#include <assert.h>
 
 #define _GNU_SOURCE
 #include <getopt.h>
 
 #include <infiniband/umad.h>
 #include <infiniband/mad.h>
-#include <infiniband/iba/ib_types.h>
-#include <infiniband/complib/cl_nodenamemap.h>
+#include <iba/ib_types.h>
+#include <complib/cl_nodenamemap.h>
 
 #include "ibdiag_common.h"
 
@@ -170,7 +175,7 @@ recv_mad:
 	if (ibdebug > 1)
 		xdump(stdout, "SA Response:\n", mad, len);
 
-	method = mad_get_field(mad, 0, IB_MAD_METHOD_F);
+	method = (uint8_t) mad_get_field(mad, 0, IB_MAD_METHOD_F);
 	offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
 	result.status = mad_get_field(mad, 0, IB_MAD_STATUS_F);
 	result.p_result_madw = mad;
@@ -189,12 +194,12 @@ recv_mad:
 static void *get_query_rec(void *mad, unsigned i)
 {
 	int offset = mad_get_field(mad, 0, IB_SA_ATTROFFS_F);
-	return mad + IB_SA_DATA_OFFS + i * (offset << 3);
+	return (char *) mad + IB_SA_DATA_OFFS + i * (offset << 3);
 }
 
 static unsigned valid_gid(ib_gid_t *gid)
 {
-	ib_gid_t zero_gid = { };
+	ib_gid_t zero_gid = { 0 };
 	return memcmp(&zero_gid, gid, sizeof(*gid));
 }
 
@@ -442,7 +447,7 @@ static void dump_multicast_member_record(void *data)
 	char gid_str2[INET6_ADDRSTRLEN];
 	ib_member_rec_t *p_mcmr = data;
 	uint16_t mlid = cl_ntoh16(p_mcmr->mlid);
-	int i = 0;
+	unsigned i = 0;
 	char *node_name = "unknown";
 
 	/* go through the node records searching for a port guid which matches
@@ -758,7 +763,7 @@ static void dump_one_mft_record(void *data)
 
 static void dump_results(struct query_res *r, void (*dump_func) (void *))
 {
-	int i;
+	unsigned i;
 	for (i = 0; i < r->result_cnt; i++) {
 		void *data = get_query_rec(r->p_result_madw, i);
 		dump_func(data);
@@ -768,7 +773,7 @@ static void dump_results(struct query_res *r, void (*dump_func) (void *))
 static void return_mad(void)
 {
 	if (result.p_result_madw) {
-		free(result.p_result_madw - umad_size());
+		free((char *) result.p_result_madw - umad_size());
 		result.p_result_madw = NULL;
 	}
 }
@@ -839,7 +844,8 @@ get_lid_from_name(bind_handle_t h, const char *name, uint16_t* lid)
 {
 	ib_node_record_t *node_record = NULL;
 	ib_node_info_t *p_ni = NULL;
-	int i = 0, ret;
+	unsigned i;
+	int ret;
 
 	ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
 	if (ret)
@@ -869,7 +875,7 @@ static uint16_t get_lid(bind_handle_t h, const char *name)
 	if (isalpha(name[0]))
 		assert(get_lid_from_name(h, name, &rc_lid) == IB_SUCCESS);
 	else
-		rc_lid = atoi(name);
+		rc_lid = (uint16_t) atoi(name);
 	if (rc_lid == 0)
 		fprintf(stderr, "Failed to find lid for \"%s\"\n", name);
 	return rc_lid;
@@ -917,8 +923,8 @@ static int parse_lid_and_ports(bind_handle_t h,
 
 #define cl_hton8(x) (x)
 #define CHECK_AND_SET_VAL(val, size, comp_with, target, name, mask) \
-	if (val > comp_with) { \
-		target = cl_hton##size(val); \
+	if ((uint##size##_t) val > (uint##size##_t) comp_with) { \
+		target = cl_hton##size((uint##size##_t) val); \
 		comp_mask |= IB_##name##_COMPMASK_##mask; \
 	}
 
@@ -951,7 +957,8 @@ static int get_issm_records(bind_handle_t h, ib_net32_t 
capability_mask)
 
 static int print_node_records(bind_handle_t h)
 {
-   int i = 0, ret;
+   unsigned i;
+   int ret;
 
ret = get_all_records(h, IB_SA_ATTR_NODERECORD, 0);
if (ret)
@@ -1027,7 +1034,7 @@ static int query_path_records(const struct query_cmd *q, bind_handle_t h,
CHECK_AND_SET_VAL(p->dlid, 16, 0, pr.dlid, PR, DLID);
CHECK_AND_SET_VAL(p->hop_limit, 32, -1, pr.hop_flow_raw, PR, HOPLIMIT);
CHECK_AND_SET_VAL(p->flow_label, 8, 0, flow, PR, FLOWLABEL);
-   pr.hop_flow_raw |= cl_hton32(flow << 8);
+   pr.hop_flow_raw |= (uint8_t) cl_hton32(flow << 8);
CHECK_AND_SET_VAL(p->tclass, 8, 0, pr.tclass, PR, TCLASS);
CHECK_AND_SET_VAL(p->reversible, 8, -1, reversible, PR, REVERSIBLE);
CHECK_AND_SET_VAL(p->numb_path, 8, -1, pr.num_path, PR,

[ofa-general] Bandwidth of performance with multirail IB

2009-02-23 Thread Jie Cai

I have implemented a uDAPL program to measure bandwidth over IB with
multirail connections.

The HCAs used in the cluster are Mellanox ConnectX HCAs. Each HCA has two
ports.

The program uses the two ports on each node of the cluster to build
multirail IB connections.

The peak bandwidth I can get is ~1.3 GB/s (not bi-directional), which
is almost the same as with single-rail connections.


Does anyone have similar experience?
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [ofa-general] Race condition in core/sysfs.c (kernel panic) when unloading the driver

2009-02-23 Thread Jack Morgenstein

On Monday 23 February 2009 20:31, Roland Dreier wrote:
> > I'm not sure that it does.  This does not make sysfs access atomic wrt
> > module unloading.
> > I think an app can still lose its timeslice while inside the sysfs
> > access, and module unload can still occur while the app is waiting for
> > a new time slice (although the code pages will not be removed as
> > yet -- see below).
>
> Not sure I follow... the low-level driver must handle requests until
> ib_unregister_device() returns, and with the change I proposed,
> ib_unregister_device() will not return until all sysfs files are gone
> (and no open file handles remain).
>
> > What about the patch I just submitted?
>
> I'd rather not add a superfluous mutex that adds complexity when a
> simpler solution is available.

You're right, your solution does work.  I was just concerned that the
unregister-sysfs calls would simply prevent new accessors from seeing the
files, but would return before the file reference count reached zero (thus
allowing low-level driver cleanup while current accessors were still in
progress).  I checked, and this does not happen.  As you mention in your
answer, the unregister-sysfs calls do not return while someone still has an
open file handle on these files.
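
The ordering described above can be sketched as a small single-threaded
user-space model. This is not kernel code: the attr_file struct and the
attr_* helpers are hypothetical names made up for illustration, and where
the real unregister call would sleep until the last handle is closed, the
model only reports whether it would be allowed to return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one sysfs attribute file; not a kernel type. */
struct attr_file {
    bool visible;      /* can new openers still find the file? */
    int  open_handles; /* accessors currently inside the file */
};

/* Open succeeds only while the file is still visible. */
static bool attr_open(struct attr_file *f)
{
    if (!f->visible)
        return false;
    f->open_handles++;
    return true;
}

/* An accessor leaves the file and drops its handle. */
static void attr_close(struct attr_file *f)
{
    assert(f->open_handles > 0);
    f->open_handles--;
}

/* First half of unregistering: hide the file from new openers. */
static void attr_hide(struct attr_file *f)
{
    f->visible = false;
}

/* Second half: unregister may return only once no handle remains open.
 * A real implementation would block here; the model just reports it. */
static bool attr_unregister_may_return(const struct attr_file *f)
{
    return !f->visible && f->open_handles == 0;
}

/* Replays the scenario from the thread. Returns true when driver cleanup
 * would only have been permitted after the last accessor had closed. */
static bool demo_unregister_ordering(void)
{
    struct attr_file f = { .visible = true, .open_handles = 0 };

    bool reader_in = attr_open(&f);              /* accessor enters the file */
    attr_hide(&f);                               /* module unload begins */
    bool late_open = attr_open(&f);              /* new accessor is refused */
    bool early = attr_unregister_may_return(&f); /* reader still inside */
    attr_close(&f);                              /* reader finishes */
    bool late = attr_unregister_may_return(&f);  /* now safe to clean up */

    return reader_in && !late_open && !early && late;
}
```

Calling demo_unregister_ordering() from a main() walks through both cases
raised in the thread: a new accessor arriving after unload has started, and
an accessor that was already inside the file when unload began.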

- Jack
