Re: [ewg] RDS problematic on RC2
Ralph Campbell wrote:
> Attached is the patch I sent to Olaf. It basically replaces calls like
> dma_map_sg() with ib_dma_map_sg() so that the InfiniPath driver can
> intercept the DMA mapping calls and use kernel virtual addresses instead
> of physical addresses. The InfiniPath driver uses the host CPU to copy
> data in most cases instead of DMA. I doubt this will fix or change the
> other issues Olaf is working on for RDS.

OK, got it.

Or

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
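The interception Ralph describes can be modeled in plain userspace C. This is a hedged sketch, not the kernel code: the struct names and hook signature below are illustrative stand-ins for the verbs-layer dispatch, in which an ib_dma_* call goes through a per-device hook when the driver supplies one and falls back to the generic DMA path otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace model (not kernel code) of the ib_dma_* dispatch:
 * the verbs layer forwards DMA-mapping calls to a per-device hook when
 * the driver provides one, and uses the generic path otherwise. */

struct ib_device;

struct ib_dma_mapping_ops {
    /* driver hook: returns a driver-defined 64-bit handle */
    uint64_t (*map_single)(struct ib_device *dev, void *cpu_addr, size_t size);
};

struct ib_device {
    struct ib_dma_mapping_ops *dma_ops;   /* NULL for ordinary HCAs */
};

/* Stand-in for the generic dma_map_single() path: produces a fake
 * bus address distinct from the CPU virtual address. */
static uint64_t generic_map_single(void *cpu_addr, size_t size)
{
    (void)size;
    return (uint64_t)(uintptr_t)cpu_addr ^ 0xffff000000000000ull;
}

static uint64_t ib_dma_map_single(struct ib_device *dev, void *cpu_addr, size_t size)
{
    if (dev->dma_ops)                          /* e.g. InfiniPath intercepts */
        return dev->dma_ops->map_single(dev, cpu_addr, size);
    return generic_map_single(cpu_addr, size); /* normal HCAs: real DMA addr */
}

/* InfiniPath-style hook: hand back the (kernel) virtual address itself,
 * since the driver copies with the host CPU instead of doing DMA. */
static uint64_t ipath_map_single(struct ib_device *dev, void *cpu_addr, size_t size)
{
    (void)dev; (void)size;
    return (uint64_t)(uintptr_t)cpu_addr;
}
```

With a hook installed, the mapped "address" is the virtual address itself; without one, it is whatever the generic path produces, which is why code that calls dma_map_sg() directly bypasses the driver's interception.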
Re: [ewg] RDS problematic on RC2
On Thursday 17 January 2008 11:57, Johann George wrote:
> > That's a remote invalid request error. Were you testing with RDMA or
> > without?
> We were using the version that runs over IB.

Well, yes. But you can do that with ordinary SENDs, or you can enable RDMA
for large data blobs as well. But looking at the qperf source, it doesn't
do that.

> To run it, on one machine (the server), run it with no arguments. On the
> other machine, run:
>
>     qperf server_nodename rds_bw

Okay, will give it a try.

> If the TCP part is entirely non-working, it might be better to disable it
> for now rather than have it crash the machine. So far, I have never
> gotten it to function correctly and it crashes some machines almost
> immediately.

Let's put it that way - nobody looked at the code for a while. I kind of
put it at the bottom of my todo list, around position 18 or so :-/

Olaf
--
Olaf Kirch  |  --- o ---  Nous sommes du soleil we love when we play
[EMAIL PROTECTED] |/ | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax
Re: [ewg] RDS problematic on RC2
> Oh, and if you're using RDMA - does this happen to be with qlogic HCAs?
> If so, I just received a patch from Ralph Campbell with some fixes to the
> way we set up our DMA mapping.

RDS in OFED 1.3 does not currently work on the QLogic HCAs due to the way
you are setting up DMA mapping. We already discovered that, and the patch
that Ralph sent will hopefully fix the problem.

The machine in the cluster which we upgraded to RC2, and which subsequently
encountered the failure, contains a Mellanox Lion Mini DDR HCA.

Johann
Re: [ewg] RDS problematic on RC2
Johann George wrote:
> > Oh, and if you're using RDMA - does this happen to be with qlogic HCAs?
> > If so, I just received a patch from Ralph Campbell with some fixes to
> > the way we set up our DMA mapping.
> RDS in OFED 1.3 does not currently work on the QLogic HCAs due to the way
> you are setting up DMA mapping. We already discovered that and the patch
> that Ralph sent will hopefully fix the problem.

Olaf, Johann, may we (rds-devel) see the patch from Ralph?

Or.
Re: [ewg] RDS problematic on RC2
Attached is the patch I sent to Olaf. It basically replaces calls like
dma_map_sg() with ib_dma_map_sg() so that the InfiniPath driver can
intercept the DMA mapping calls and use kernel virtual addresses instead
of physical addresses. The InfiniPath driver uses the host CPU to copy
data in most cases instead of DMA. I doubt this will fix or change the
other issues Olaf is working on for RDS.

On Thu, 2008-01-17 at 14:06 +0200, Or Gerlitz wrote:
> Johann George wrote:
> > > Oh, and if you're using RDMA - does this happen to be with qlogic
> > > HCAs? If so, I just received a patch from Ralph Campbell with some
> > > fixes to the way we set up our DMA mapping.
> > RDS in OFED 1.3 does not currently work on the QLogic HCAs due to the
> > way you are setting up DMA mapping. We already discovered that and the
> > patch that Ralph sent will hopefully fix the problem.
>
> Olaf, Johann, may we (rds-devel) see the patch from Ralph?
>
> Or.

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 669ec7b..9c75767 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -70,7 +70,7 @@ struct rds_ib_connection {
 	struct rds_ib_work_ring	i_send_ring;
 	struct rds_message	*i_rm;
 	struct rds_header	*i_send_hdrs;
-	dma_addr_t		i_send_hdrs_dma;
+	u64			i_send_hdrs_dma;
 	struct rds_ib_send_work *i_sends;

 	/* rx */
@@ -79,7 +79,7 @@ struct rds_ib_connection {
 	struct rds_ib_incoming	*i_ibinc;
 	u32			i_recv_data_rem;
 	struct rds_header	*i_recv_hdrs;
-	dma_addr_t		i_recv_hdrs_dma;
+	u64			i_recv_hdrs_dma;
 	struct rds_ib_recv_work *i_recvs;
 	struct rds_page_frag	i_frag;
 	dma_addr_t		i_addr;
@@ -96,7 +96,7 @@ struct rds_ib_connection {
 	struct rds_header	*i_ack;
 	struct ib_send_wr	i_ack_wr;
 	struct ib_sge		i_ack_sge;
-	dma_addr_t		i_ack_dma;
+	u64			i_ack_dma;
 	unsigned long		i_ack_queued;

 	/* sending congestion bitmaps */
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 41fa294..d962239 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -111,6 +111,7 @@ static void rds_ib_qp_event_handler(struct ib_event *event, void *data)
 static int rds_ib_setup_qp(struct rds_connection *conn)
 {
 	struct rds_ib_connection *ic = conn->c_transport_data;
+	struct ib_device *dev = ic->i_cm_id->device;
 	struct ib_qp_init_attr attr;
 	struct rds_ib_device *rds_ibdev;
 	int ret;
@@ -120,11 +121,11 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	 * for each. If that fails for any reason, it will not register
 	 * the rds_ibdev at all.
 	 */
-	rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client);
+	rds_ibdev = ib_get_client_data(dev, &rds_ib_client);
 	if (rds_ibdev == NULL) {
 		if (printk_ratelimit())
 			printk(KERN_NOTICE "RDS/IB: No client_data for device %s\n",
-					ic->i_cm_id->device->name);
+					dev->name);
 		return -EOPNOTSUPP;
 	}
@@ -132,8 +133,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	ic->i_pd = rds_ibdev->pd;
 	ic->i_mr = rds_ibdev->mr;

-	ic->i_send_cq = ib_create_cq(ic->i_cm_id->device,
-				rds_ib_send_cq_comp_handler,
+	ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler,
 				rds_ib_cq_event_handler,
 				conn, ic->i_send_ring.w_nr + 1, 0);
 	if (IS_ERR(ic->i_send_cq)) {
@@ -143,8 +143,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 		goto out;
 	}

-	ic->i_recv_cq = ib_create_cq(ic->i_cm_id->device,
-				rds_ib_recv_cq_comp_handler,
+	ic->i_recv_cq = ib_create_cq(dev, rds_ib_recv_cq_comp_handler,
 				rds_ib_cq_event_handler,
 				conn, ic->i_recv_ring.w_nr, 0);
 	if (IS_ERR(ic->i_recv_cq)) {
@@ -190,32 +189,31 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 		goto out;
 	}

-	ic->i_send_hdrs = dma_alloc_coherent(ic->i_cm_id->device->dma_device,
+	ic->i_send_hdrs = ib_dma_alloc_coherent(dev,
 					ic->i_send_ring.w_nr *
 						sizeof(struct rds_header),
 					&ic->i_send_hdrs_dma, GFP_KERNEL);
 	if (ic->i_send_hdrs == NULL) {
 		ret = -ENOMEM;
-		rdsdebug("dma_alloc_coherent send failed\n");
+		rdsdebug("ib_dma_alloc_coherent send failed\n");
 		goto out;
 	}

-	ic->i_recv_hdrs = dma_alloc_coherent(ic->i_cm_id->device->dma_device,
+	ic->i_recv_hdrs = ib_dma_alloc_coherent(dev,
 					ic->i_recv_ring.w_nr *
 						sizeof(struct rds_header),
 					&ic->i_recv_hdrs_dma, GFP_KERNEL);
 	if (ic->i_recv_hdrs == NULL) {
 		ret = -ENOMEM;
-		rdsdebug("dma_alloc_coherent recv failed\n");
+		rdsdebug("ib_dma_alloc_coherent recv failed\n");
 		goto out;
 	}

-	ic->i_ack = dma_alloc_coherent(ic->i_cm_id->device->dma_device,
-				sizeof(struct rds_header),
+	ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header),
 				&ic->i_ack_dma, GFP_KERNEL);
 	if (ic->i_ack == NULL) {
 		ret = -ENOMEM;
-		rdsdebug("dma_alloc_coherent ack failed\n");
+		rdsdebug("ib_dma_alloc_coherent ack failed\n");
 		goto out;
 	}

@@ -496,6 +494,7 @@ out:
 void rds_ib_conn_shutdown(struct rds_connection *conn)
 {
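A side note on why the patch also widens the i_*_dma fields from dma_addr_t to u64: on some 32-bit configurations dma_addr_t is only 32 bits wide, while the handle an intercepting driver hands back (here, a kernel virtual address) needs the full 64 bits. A minimal illustrative sketch follows; the 32-bit typedef and the sample address are models, not kernel definitions.

```c
#include <stdint.h>

/* Model only: dma_addr_t on a 32-bit build without 64-bit DMA support. */
typedef uint32_t dma_addr_t_32;

/* An InfiniPath-style handle is just the kernel virtual address; on x86-64
 * such addresses use the high bits, so a 32-bit field would truncate it. */
static uint64_t kva_handle(void)
{
    return 0xffff880012345678ull;   /* hypothetical x86-64 kernel VA */
}

/* Does a handle survive storage in a field of the given bit width? */
static int handle_fits_in(uint64_t handle, unsigned bits)
{
    return bits >= 64 || (handle >> bits) == 0;
}
```

A real bus address on such a platform would fit in 32 bits, which is why the narrower type was adequate before the interception scheme.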
[ewg] RDS problematic on RC2
We've been testing the OFED 1.3 pre-releases on a 12 node cluster here at
UNH-IOL. RDS seemed largely functional (other than problems we were aware
of) on OFED 1.3 RC1. When we installed RC2, RDS stopped working. dmesg
shows the following message repeatedly on the console:

    RDS/IB: completion on 10.1.1.205 had status 9, disconnecting and reconnecting

Note that this is using RDS over IB. Our minimal experience with the non-IB
version of RDS was worse. We only tried it with RC1 and it crashed one of
the two machines almost instantly.

Johann
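For reference, the numeric status in that log line is an IB work completion status. The lookup below mirrors the enum ib_wc_status ordering from the kernel's ib_verbs.h to the best of my reading (only the first twelve values shown), which is how status 9 decodes to the "remote invalid request" error Olaf names later in the thread.

```c
#include <stddef.h>

/* Minimal decoder for IB work completion status codes, mirroring the
 * order of enum ib_wc_status in ib_verbs.h (first twelve entries). */
static const char *ib_wc_status_str(int status)
{
    static const char *const names[] = {
        [0]  = "success",                       /* IB_WC_SUCCESS */
        [1]  = "local length error",            /* IB_WC_LOC_LEN_ERR */
        [2]  = "local QP operation error",      /* IB_WC_LOC_QP_OP_ERR */
        [3]  = "local EEC operation error",     /* IB_WC_LOC_EEC_OP_ERR */
        [4]  = "local protection error",        /* IB_WC_LOC_PROT_ERR */
        [5]  = "work request flushed error",    /* IB_WC_WR_FLUSH_ERR */
        [6]  = "memory window bind error",      /* IB_WC_MW_BIND_ERR */
        [7]  = "bad response error",            /* IB_WC_BAD_RESP_ERR */
        [8]  = "local access error",            /* IB_WC_LOC_ACCESS_ERR */
        [9]  = "remote invalid request error",  /* IB_WC_REM_INV_REQ_ERR */
        [10] = "remote access error",           /* IB_WC_REM_ACCESS_ERR */
        [11] = "remote operation error",        /* IB_WC_REM_OP_ERR */
    };
    if (status < 0 || status > 11 || names[status] == NULL)
        return "unknown";
    return names[status];
}
```

So "had status 9" means the remote peer rejected the request as invalid, pointing at something malformed in the work request rather than a transport-level failure.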
Re: [ewg] RDS problematic on RC2
Johann George wrote:
> We've been testing the OFED 1.3 pre-releases on a 12 node cluster here at
> UNH-IOL. RDS seemed largely functional (other than problems we were aware
> of) on OFED 1.3 RC1. When we installed RC2, RDS stopped working. dmesg
> shows the following message repeatedly on the console:
>
>     RDS/IB: completion on 10.1.1.205 had status 9, disconnecting and
>     reconnecting
>
> Note that this is using RDS over IB. Our minimal experience with the
> non-IB version of RDS was worse. We only tried it with RC1 and it crashed
> one of the two machines almost instantly.
>
> Johann

Hi Johann,

Please open a bug in Bugzilla and add some info that will help in debugging
(OS, kernel, arch, test).

Thanks,
Vladimir
Re: [ewg] RDS problematic on RC2
On Thursday 17 January 2008 04:15, Johann George wrote:
> We've been testing the OFED 1.3 pre-releases on a 12 node cluster here at
> UNH-IOL. RDS seemed largely functional (other than problems we were aware
> of) on OFED 1.3 RC1. When we installed RC2, RDS stopped working.

Huh, scary. It works reasonably well here, though.

> dmesg shows the following message repeatedly on the console:
>
>     RDS/IB: completion on 10.1.1.205 had status 9, disconnecting and
>     reconnecting

That's a remote invalid request error. Were you testing with RDMA or
without? What user application were you using for testing?

> Note that this is using RDS over IB. Our minimal experience with the
> non-IB version of RDS was worse. We only tried it with RC1 and it crashed
> one of the two machines almost instantly.

Yes, the TCP part of RDS isn't being looked after very much, unfortunately.

Olaf
--
Olaf Kirch  |  --- o ---  Nous sommes du soleil we love when we play
[EMAIL PROTECTED] |/ | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax