Yes, this is a problem that I have also seen and that we are investigating. It looks like it is hung waiting for a connection, perhaps a problem with running 32-bit applications using the rdma_cm. Please open a bug against this and assign it to Arlin, and he will work with Sean to debug the problem.
woody

-----Original Message-----
From: Yong Qin [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 04, 2007 12:43 PM
To: Woodruff, Robert J
Cc: [email protected]
Subject: RE: [ofa-general] uDAPL question

Thanks for the tip, woody. The bug is gone in OFED 1.2. However, we are still experiencing other issues here. Let me explain: we are trying to run both 32-bit and 64-bit applications on an Opteron cluster with RHEL 4U4. When we tested 64-bit applications on OFED 1.2 beta1, uDAPL worked fine. However, when we switched to 32-bit applications, it hung in the RDMA progress engine:

    0: [0] MPIDI_CH3_RDMA_Progress(): entering rdma progress engine, blocking=true
    1: [1] MPIDI_CH3_RDMA_Progress(): entering rdma progress engine, blocking=true

With the nightly build 20070404, both 32-bit and 64-bit applications hung on RDMA_init. All the testing was done with Intel MPI 3.0. Any thoughts?

Thanks again,
Yong

-----Original Message-----
From: Woodruff, Robert J [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 03, 2007 5:59 PM
To: Yong Qin; Boris Shpolyansky; Hefty, Sean
Cc: [email protected]
Subject: RE: [ofa-general] uDAPL question

This should now be fixed in OFED 1.2.

woody

-----Original Message-----
From: Yong Qin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 03, 2007 12:43 PM
To: Boris Shpolyansky; Woodruff, Robert J; Hefty, Sean
Cc: [email protected]
Subject: RE: [ofa-general] uDAPL question

Is there any progress on this issue? We are seeing exactly the same error on OFED 1.1 + Intel MPI 3.0 -- "unexpected DAPL event 4006" and wondering if there is a fix.

Thanks,
Yong

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Boris Shpolyansky
Sent: Monday, March 12, 2007 11:28 AM
To: Woodruff, Robert J; [email protected]; Hefty, Sean
Subject: RE: [ofa-general] uDAPL question

Hi Woody,

Thanks for your help. I guess the problem is in the CM, is it? Can you point me to relevant communication/bug reports that explain the fix for this issue?
Would Sean be the right person to ask regarding exactly which patch should be added or removed? I would prefer to stick to the OFED-1.1 code with minimal changes, if possible, to avoid compatibility issues.

Thanks,
Boris

-----Original Message-----
From: Woodruff, Robert J [mailto:[EMAIL PROTECTED]
Sent: Monday, March 12, 2007 8:24 AM
To: Boris Shpolyansky; [email protected]; Hefty, Sean
Subject: RE: [ofa-general] uDAPL question

This is a known problem and should be fixed by now. There was a bad patch that somehow got into OFED that was not in Sean's main tree. Assuming this bad patch has been removed, the problem should be fixed.

woody

________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Boris Shpolyansky
Sent: Friday, March 09, 2007 8:40 PM
To: [email protected]
Subject: [ofa-general] uDAPL question

Hi,

I'm trying to get a simple Intel MPI benchmark running over IB (uDAPL) using the OFED-1.1 stack. I'm consistently getting the following error:

    [EMAIL PROTECTED] ~]# ./runjob_I_MPI.boris 2
    Task 0 of 2 tasks started on host ibd005.ibd.mti.com
    clock_resolution = 1.00e-06 s
    Task 1 of 2 tasks started on host ibd006.ibd.mti.com
    [0:ibd005] unexpected DAPL event 4006 from 1:ibd006
    [1:ibd006] unexpected DAPL event 4006 from 0:ibd005
    rank 0 in job 14 ibd005_36193 caused collective abort of all ranks
    exit status of rank 0: return code 254

I did some digging and found out that event 4006 (actually 0x4006) means DAT_CONNECTION_EVENT_BROKEN and that it is returned by the function dat_rmr_bind. So my question is why this function consistently fails.
I'm using the standard dat.conf file:

    OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" ""

Appreciate your help,
Boris Shpolyansky

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
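
For anyone checking their own setup against the dat.conf entry quoted in the original message, the line can be split into its fields with a few lines of Python. This is only a minimal sketch: it assumes the usual eight-field layout of a DAT static registry entry, and the field names (`ia_name`, `library_path`, and so on) are descriptive labels chosen here, not identifiers taken from the uDAPL sources. The helper `parse_dat_conf_line` is hypothetical.

```python
import shlex
from typing import NamedTuple


class DatConfEntry(NamedTuple):
    # Field names are illustrative labels (an assumption), not uDAPL identifiers.
    ia_name: str
    api_version: str
    thread_safety: str
    default_flag: str
    library_path: str
    provider_version: str
    ia_params: str
    platform_params: str


def parse_dat_conf_line(line: str) -> DatConfEntry:
    """Split one dat.conf entry into its eight whitespace-separated fields.

    shlex handles the double-quoted fields, including the empty "" at the end.
    """
    fields = shlex.split(line, comments=True)
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {fields!r}")
    return DatConfEntry(*fields)


# The entry quoted in the thread.
entry = parse_dat_conf_line(
    'OpenIB-cma u1.2 nonthreadsafe default '
    '/usr/local/ofed/lib64/libdaplcma.so mv_dapl.1.2 "ib0 0" ""'
)
print(entry.ia_name, entry.library_path, repr(entry.ia_params))
```

Printing `entry.library_path` makes it easy to see which provider library a given interface adapter name resolves to on a particular node.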
