Re: [ewg] RDS problematic on RC2

2008-01-20 Thread Or Gerlitz

Ralph Campbell wrote:

Attached is the patch I sent to Olaf.
It basically replaces calls like dma_map_sg() with ib_dma_map_sg()
so that the InfiniPath driver can intercept the DMA mapping
calls and use kernel virtual addresses instead of physical addresses.
The InfiniPath driver uses the host CPU to copy data in most cases
instead of DMA.  I doubt this will fix or change the other issues
Olaf is working on for RDS.


OK, got it.

Or



Re: [ewg] RDS problematic on RC2

2008-01-17 Thread Olaf Kirch
On Thursday 17 January 2008 11:57, Johann George wrote:
  That's a remote invalid request error. Were you testing
  with RDMA or without?
 
 We were using the version that runs over IB.

Well, yes. But you can do that with ordinary SENDs, or you
can enable RDMA for large data blobs as well. Looking at
the qperf source, though, it doesn't enable RDMA.

 To run it, on one machine (the server), run it with no
 arguments.  On the other machine, run:
 
 qperf server_nodename rds_bw

Okay, will give it a try.

 If the TCP part is entirely non-working, it might be better
 to disable it for now rather than have it crash the machine.
 So far, I have never gotten it to function correctly and it
 crashes some machines almost immediately.

Let's put it this way - nobody has looked at the code for a while.
I kind of put it at the bottom of my todo list, around position 18
or so :-/

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
[EMAIL PROTECTED] |/ | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax


Re: [ewg] RDS problematic on RC2

2008-01-17 Thread Johann George
 Oh, and if you're using RDMA - does this happen to be with
 qlogic HCAs?  If so, I just received a patch from Ralph
 Campbell with some fixes to the way we set up our DMA
 mapping.

RDS in OFED 1.3 does not currently work on the QLogic HCAs
due to the way you are setting up DMA mapping.  We already
discovered that and the patch that Ralph sent will hopefully
fix the problem.

The machine in the cluster that we upgraded to RC2, and on which
we subsequently encountered the failure, contains a Mellanox Lion
Mini DDR HCA.

Johann


Re: [ewg] RDS problematic on RC2

2008-01-17 Thread Or Gerlitz

Johann George wrote:

Oh, and if you're using RDMA - does this happen to be with
qlogic HCAs?  If so, I just received a patch from Ralph
Campbell with some fixes to the way we set up our DMA
mapping.


RDS in OFED 1.3 does not currently work on the QLogic HCAs
due to the way you are setting up DMA mapping.  We already
discovered that and the patch that Ralph sent will hopefully
fix the problem.


Olaf, Johann, may we (rds-devel) see the patch from Ralph?

Or.



Re: [ewg] RDS problematic on RC2

2008-01-17 Thread Ralph Campbell
Attached is the patch I sent to Olaf.
It basically replaces calls like dma_map_sg() with ib_dma_map_sg()
so that the InfiniPath driver can intercept the DMA mapping
calls and use kernel virtual addresses instead of physical addresses.
The InfiniPath driver uses the host CPU to copy data in most cases
instead of DMA.  I doubt this will fix or change the other issues
Olaf is working on for RDS.
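
To make the pattern concrete for rds-devel readers, here is a rough sketch
of the substitution (illustrative only, not code from the RDS tree: the
function, the scatterlist handling and the variable names are invented;
only ib_dma_map_sg(), ib_sg_dma_address() and ib_sg_dma_len() from
<rdma/ib_verbs.h> are the real API):

/*
 * Illustrative sketch of the conversion: go through the ib_dma_*
 * wrappers on the connection's struct ib_device instead of calling
 * the generic DMA API directly, so a driver such as ib_ipath can
 * substitute its own (CPU-copy based) "mapping".
 */
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

static int example_map_for_send(struct ib_device *dev,
				struct scatterlist *sg, int nents,
				struct ib_sge *sge)
{
	int i, count;

	/* Before: count = dma_map_sg(dev->dma_device, sg, nents, DMA_TO_DEVICE); */
	count = ib_dma_map_sg(dev, sg, nents, DMA_TO_DEVICE);
	if (count == 0)
		return -ENOMEM;

	for (i = 0; i < count; i++) {
		/* Read the mapped addresses back through the wrappers, too
		 * (lkey setup omitted for brevity). */
		sge[i].addr   = ib_sg_dma_address(dev, &sg[i]);
		sge[i].length = ib_sg_dma_len(dev, &sg[i]);
	}
	return count;
}

Unmapping would likewise go through ib_dma_unmap_sg(dev, sg, nents,
DMA_TO_DEVICE) rather than dma_unmap_sg().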

On Thu, 2008-01-17 at 14:06 +0200, Or Gerlitz wrote:
 Johann George wrote:
  Oh, and if you're using RDMA - does this happen to be with
  qlogic HCAs?  If so, I just received a patch from Ralph
  Campbell with some fixes to the way we set up our DMA
  mapping.
  
  RDS in OFED 1.3 does not currently work on the QLogic HCAs
  due to the way you are setting up DMA mapping.  We already
  discovered that and the patch that Ralph sent will hopefully
  fix the problem.
 
 Olaf, Johann, may we (rds-devel) see the patch from Ralph?
 
 Or.
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 669ec7b..9c75767 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -70,7 +70,7 @@ struct rds_ib_connection {
 	struct rds_ib_work_ring	i_send_ring;
 	struct rds_message	*i_rm;
 	struct rds_header	*i_send_hdrs;
-	dma_addr_t 		i_send_hdrs_dma;
+	u64			i_send_hdrs_dma;
 	struct rds_ib_send_work *i_sends;
 
 	/* rx */
@@ -79,7 +79,7 @@ struct rds_ib_connection {
 	struct rds_ib_incoming	*i_ibinc;
 	u32			i_recv_data_rem;
 	struct rds_header	*i_recv_hdrs;
-	dma_addr_t 		i_recv_hdrs_dma;
+	u64			i_recv_hdrs_dma;
 	struct rds_ib_recv_work *i_recvs;
 	struct rds_page_frag	i_frag;
 	dma_addr_t 		i_addr;
@@ -96,7 +96,7 @@ struct rds_ib_connection {
 	struct rds_header	*i_ack;
 	struct ib_send_wr	i_ack_wr;
 	struct ib_sge		i_ack_sge;
-	dma_addr_t 		i_ack_dma;
+	u64			i_ack_dma;
 	unsigned long		i_ack_queued;
  
  	/* sending congestion bitmaps */
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 41fa294..d962239 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -111,6 +111,7 @@ static void rds_ib_qp_event_handler(struct ib_event *event, void *data)
 static int rds_ib_setup_qp(struct rds_connection *conn)
 {
 	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct ib_device *dev = ic->i_cm_id->device;
 	struct ib_qp_init_attr attr;
 	struct rds_ib_device *rds_ibdev;
 	int ret;
@@ -120,11 +121,11 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	 * for each.  If that fails for any reason, it will not register
 	 * the rds_ibdev at all.
 	 */
-	rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client);
+	rds_ibdev = ib_get_client_data(dev, &rds_ib_client);
 	if (rds_ibdev == NULL) {
 		if (printk_ratelimit())
 			printk(KERN_NOTICE "RDS/IB: No client_data for device %s\n",
-			       ic->i_cm_id->device->name);
+			       dev->name);
 		return -EOPNOTSUPP;
 	}
 
@@ -132,8 +133,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	ic->i_pd = rds_ibdev->pd;
 	ic->i_mr = rds_ibdev->mr;
 
-	ic->i_send_cq = ib_create_cq(ic->i_cm_id->device,
-				     rds_ib_send_cq_comp_handler,
+	ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler,
 				     rds_ib_cq_event_handler, conn,
 				     ic->i_send_ring.w_nr + 1, 0);
 	if (IS_ERR(ic->i_send_cq)) {
@@ -143,8 +143,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 		goto out;
 	}
 
-	ic->i_recv_cq = ib_create_cq(ic->i_cm_id->device,
-				     rds_ib_recv_cq_comp_handler,
+	ic->i_recv_cq = ib_create_cq(dev, rds_ib_recv_cq_comp_handler,
 				     rds_ib_cq_event_handler, conn,
 				     ic->i_recv_ring.w_nr, 0);
 	if (IS_ERR(ic->i_recv_cq)) {
@@ -190,32 +189,31 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 		goto out;
 	}
 
-	ic->i_send_hdrs = dma_alloc_coherent(ic->i_cm_id->device->dma_device,
+	ic->i_send_hdrs = ib_dma_alloc_coherent(dev,
 					ic->i_send_ring.w_nr *
 						sizeof(struct rds_header),
 					&ic->i_send_hdrs_dma, GFP_KERNEL);
 	if (ic->i_send_hdrs == NULL) {
 		ret = -ENOMEM;
-		rdsdebug("dma_alloc_coherent send failed\n");
+		rdsdebug("ib_dma_alloc_coherent send failed\n");
 		goto out;
 	}
 
-	ic->i_recv_hdrs = dma_alloc_coherent(ic->i_cm_id->device->dma_device,
+	ic->i_recv_hdrs = ib_dma_alloc_coherent(dev,
 					ic->i_recv_ring.w_nr *
 						sizeof(struct rds_header),
 					&ic->i_recv_hdrs_dma, GFP_KERNEL);
 	if (ic->i_recv_hdrs == NULL) {
 		ret = -ENOMEM;
-		rdsdebug("dma_alloc_coherent recv failed\n");
+		rdsdebug("ib_dma_alloc_coherent recv failed\n");
 		goto out;
 	}
 
-	ic->i_ack = dma_alloc_coherent(ic->i_cm_id->device->dma_device,
-				       sizeof(struct rds_header),
+	ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header),
 				       &ic->i_ack_dma, GFP_KERNEL);
 	if (ic->i_ack == NULL) {
 		ret = -ENOMEM;
-		rdsdebug("dma_alloc_coherent ack failed\n");
+		rdsdebug("ib_dma_alloc_coherent ack failed\n");
 		goto out;
 	}
 
@@ -496,6 +494,7 @@ out:
 void rds_ib_conn_shutdown(struct rds_connection *conn)
 {
 	
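
Aside on the header change above: the dma_addr_t to u64 switch in
struct rds_ib_connection follows from the wrapper's prototype. The
declarations (as in <rdma/ib_verbs.h> of that time frame, quoted from
memory, so treat this as a sketch rather than gospel) are roughly:

/* Generic DMA API -- the handle is a dma_addr_t (a bus address): */
void *dma_alloc_coherent(struct device *dev, size_t size,
			 dma_addr_t *dma_handle, gfp_t flag);

/* Verbs wrapper -- the handle is a u64, which a software-DMA driver
 * like ib_ipath is free to fill with something that is not a bus
 * address: */
void *ib_dma_alloc_coherent(struct ib_device *dev, size_t size,
			    u64 *dma_handle, gfp_t flag);

Because the wrapper returns its handle through a u64 *, the cached
handles in the connection structure have to be u64 to match.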

[ewg] RDS problematic on RC2

2008-01-16 Thread Johann George
We've been testing the OFED 1.3 pre-releases on a 12 node cluster here
at UNH-IOL.  RDS seemed largely functional (other than problems we
were aware of) on OFED 1.3 RC1.  When we installed RC2, RDS stopped
working.  A dmesg indicates the following message repeatedly on the
console:

RDS/IB: completion on 10.1.1.205 had status 9, disconnecting and reconnecting

Note that this is using RDS over IB.  Our minimal experience with the
non-IB version of RDS was worse.  We only tried it with RC1 and it
crashed one of the two machines almost instantly.

Johann


Re: [ewg] RDS problematic on RC2

2008-01-16 Thread Vladimir Sokolovsky

Johann George wrote:

We've been testing the OFED 1.3 pre-releases on a 12 node cluster here
at UNH-IOL.  RDS seemed largely functional (other than problems we
were aware of) on OFED 1.3 RC1.  When we installed RC2, RDS stopped
working.  A dmesg indicates the following message repeatedly on the
console:

RDS/IB: completion on 10.1.1.205 had status 9, disconnecting and reconnecting

Note that this is using RDS over IB.  Our minimal experience with the
non-IB version of RDS was worse.  We only tried it with RC1 and it
crashed one of the two machines almost instantly.

Johann


Hi Johann,
Please open a bug in Bugzilla and add some info that will help in debugging
(OS, kernel, arch, test).

Thanks,
Vladimir


Re: [ewg] RDS problematic on RC2

2008-01-16 Thread Olaf Kirch
On Thursday 17 January 2008 04:15, Johann George wrote:
 We've been testing the OFED 1.3 pre-releases on a 12 node cluster here
 at UNH-IOL.  RDS seemed largely functional (other than problems we
 were aware of) on OFED 1.3 RC1.  When we installed RC2, RDS stopped
 working.  A dmesg indicates the following message repeatedly on the

Huh, scary. It works reasonably well here, though.

 console:
 
 RDS/IB: completion on 10.1.1.205 had status 9, disconnecting and reconnecting

That's a remote invalid request error. Were you testing with
RDMA or without? What user application were you using for testing?
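
For anyone decoding the log line: the "status 9" RDS prints is the numeric
value of IB_WC_REM_INV_REQ_ERR in enum ib_wc_status from <rdma/ib_verbs.h>,
i.e. the remote QP rejected a request as invalid. A minimal illustrative
check (not the actual RDS completion handler; the function name is made up)
would be:

#include <linux/kernel.h>
#include <rdma/ib_verbs.h>

static void example_handle_wc(struct ib_wc *wc)
{
	/* IB_WC_REM_INV_REQ_ERR has the value 9 in enum ib_wc_status. */
	if (wc->status != IB_WC_SUCCESS)
		printk(KERN_NOTICE "completion failed, status %d\n",
		       wc->status);
}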

 Note that this is using RDS over IB.  Our minimal experience with the
 non-IB version of RDS was worse.  We only tried it with RC1 and it
 crashed one of the two machines almost instantly.

Yes, the TCP part of RDS isn't being looked after very much, unfortunately.

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
[EMAIL PROTECTED] |/ | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax