I have not touched /etc/dat.conf, so I am using whatever comes with OFED 1.0 rc5.
For whatever reason, things have improved somewhat. I am now running Intel MPI right after bringing up the hosts (previously I was trying MVAPICH, then Open MPI, then HP MPI, then Intel MPI). I've run twice and see these failures:

Run #1 (after rebooting all hosts):

rank 13 in job 1 192.168.1.1_34674 caused collective abort of all ranks
exit status of rank 13: killed by signal 11
[EMAIL PROTECTED]:/data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_123945/[EMAIL PROTECTED] intel.intel]$
### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_123945/intel.intel/1149709233/IMB_2.3/src/IMB-MPI1 Allreduce : 0

Run #2 (after rebooting all hosts):

rank 6 in job 1 192.168.1.1_33649 caused collective abort of all ranks
exit status of rank 6: killed by signal 11
[EMAIL PROTECTED]:/data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_145739/[EMAIL PROTECTED] intel.intel]$
### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_145739/intel.intel/1149717497/IMB_2.3/src/IMB-MPI1 Exchange : 0

rank 21 in job 1 192.168.1.1_34734 caused collective abort of all ranks
exit status of rank 21: killed by signal 11
[EMAIL PROTECTED]:/data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_145739/[EMAIL PROTECTED] intel.intel]$
### TEST-W: Could not run /data/home/scott/builds/TopspinOS-2.7.0/build013/protest/Lk3/060706_145739/intel.intel/1149717497/IMB_2.3/src/IMB-MPI1 Allgatherv -multi 1: 0

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

> -----Original Message-----
> From: Arlin Davis [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 07, 2006 3:25 PM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Davis, Arlin R; Lentini, James; openib-general
> Subject: Re: [openib-general] [PATCH] uDAPL openib-cma
> provider - add support for IB_CM_REQ_OPTIONS
>
> Scott Weitzenkamp (sweitzen) wrote:
>
> >Yes, the modules were loaded.
> >
> >Each of the 32 hosts had 3 IB ports up. Does Intel MPI or uDAPL use
> >multiple ports and/or multiple HCAs?
> >
> >I shut down all but one port on each host, and now Pallas is running
> >better on the 32 nodes using Intel MPI 2.0.1. HP MPI 2.2 started
> >working with Pallas too over uDAPL, so maybe this is a uDAPL issue?
> >
>
> Can you tell me what adapters are installed (ibstat), how they are
> configured (ifconfig), and what your dat.conf looks like? It sounds
> like a device mapping issue during the dat_ia_open() processing.
>
> Multiple ports and HCAs should work fine, but some care is required in
> configuring dat.conf so you consistently pick up the correct device
> across the cluster. Intel MPI will simply open a device based on the
> provider/device name (for example: setenv I_MPI_DAPL_PROVIDER=OpenIB-cma)
> defined in dat.conf and query DAPL for the address to be used for
> connections. This line in dat.conf determines which library to load and
> which IB device to open and bind to. If you have the exact same
> configuration on each node and know that ib0, ib1, ib2, etc. will always
> come up in the same order, then you can simply use the same netdev names
> across the cluster and the same copy of dat.conf on each node.
>
> Here are the dat.conf options for OpenIB-cma configurations.
>
> # For the cma version you specify <ia_params> as:
> #   network address, network hostname, or netdev name, and 0 for port
> #
> # Simple (OpenIB-cma) default with the netdev name provided first on the
> # list to enable use of the same dat.conf version on all nodes
> #
> OpenIB-cma u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
> OpenIB-cma-ip u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "192.168.0.22 0" ""
> OpenIB-cma-name u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "svr1-ib0 0" ""
> OpenIB-cma-netdev u1.2 nonthreadsafe default /usr/lib/libdaplcma.so mv_dapl.1.2 "ib0 0" ""
>
> Which type are you using: address, hostname, or netdev name?
>
> Also, Intel MPI is sometimes too smart for its own good when opening
> RDMA devices via uDAPL. If the open fails with the first RDMA device
> specified in dat.conf, it will continue on to the next line until one
> is successful. If all RDMA devices fail, it will then go on to the
> static device automatically. This sometimes does more harm than good,
> since one node could fail over to the second device in your
> configuration while the other nodes are all on the first device. If
> they are all on the same subnet it would work fine, but if they are on
> different subnets we would not be able to connect.
>
> If you send me your configuration, we can set it up here and hopefully
> duplicate your error case.
>
> -arlin
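
For reference, the device-mapping step Arlin describes comes down to a dat_ia_open() call against a provider name listed in dat.conf; setting I_MPI_DAPL_PROVIDER only tells Intel MPI which name to pass. Below is a minimal sketch of that call against the DAT 1.2 uDAPL headers (<dat/udat.h>, link with -ldat); the "OpenIB-cma" default and the async EVD queue length are illustrative choices, not values taken from Intel MPI itself.

/* Minimal uDAPL open/close sketch (DAT 1.2 API, link with -ldat).
 * The DAT registry resolves the name via /etc/dat.conf (or the file named
 * by the DAT_OVERRIDE environment variable, if set), loads the listed
 * library (e.g. libdaplcma.so), and binds to the IB device given in
 * <ia_params>, such as "ib0 0" in the examples above. */
#include <stdio.h>
#include <dat/udat.h>

int main(int argc, char **argv)
{
    /* Provider name must match the first field of a dat.conf line. */
    char defname[] = "OpenIB-cma";
    char *name = (argc > 1) ? argv[1] : defname;

    DAT_IA_HANDLE  ia  = DAT_HANDLE_NULL;
    DAT_EVD_HANDLE evd = DAT_HANDLE_NULL;   /* async error EVD, created by open */
    DAT_RETURN     ret;

    ret = dat_ia_open(name, 8 /* async EVD queue length */, &evd, &ia);
    if (ret != DAT_SUCCESS) {
        fprintf(stderr, "dat_ia_open(%s) failed: 0x%x\n", name, (unsigned)ret);
        return 1;
    }
    printf("opened uDAPL IA \"%s\"\n", name);

    dat_ia_close(ia, DAT_CLOSE_GRACEFUL_FLAG);
    return 0;
}

Running this on a node whose dat.conf entry is misconfigured should fail in roughly the same place Intel MPI's open does, which makes it a quick way to check the mapping per host.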
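The fail-over behavior described above (walking down dat.conf until some entry opens) can be pictured as a loop like the one below. The candidate list is hypothetical and simply mirrors the example entries in the quoted dat.conf; this is not Intel MPI's actual logic, only an illustration of why two nodes can silently end up on different devices.

/* Illustration only: try a list of dat.conf provider names in order and
 * keep the first one that opens.  If different nodes succeed on different
 * entries, and those devices sit on different IP subnets, the nodes cannot
 * connect to each other -- the failure mode described above.  The candidate
 * names here are hypothetical examples. */
#include <stdio.h>
#include <dat/udat.h>

static const char *candidates[] = {
    "OpenIB-cma", "OpenIB-cma-netdev", "OpenIB-cma-ip", NULL
};

int main(void)
{
    DAT_IA_HANDLE  ia  = DAT_HANDLE_NULL;
    DAT_EVD_HANDLE evd = DAT_HANDLE_NULL;
    int i;

    for (i = 0; candidates[i] != NULL; i++) {
        char name[128];
        snprintf(name, sizeof(name), "%s", candidates[i]);
        if (dat_ia_open(name, 8, &evd, &ia) == DAT_SUCCESS) {
            printf("using provider \"%s\"\n", name);
            dat_ia_close(ia, DAT_CLOSE_GRACEFUL_FLAG);
            return 0;
        }
        fprintf(stderr, "provider \"%s\" failed to open, trying next\n", name);
    }
    fprintf(stderr, "no uDAPL provider could be opened\n");
    return 1;
}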
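As for the netdev form of <ia_params> ("ib0 0"): it only behaves identically cluster-wide if ib0 resolves to the same fabric and subnet on every host. A quick, uDAPL-independent way to see what address a given IPoIB netdev carries on each node is a getifaddrs() walk like this (standard POSIX/glibc calls; the default device name "ib0" is just an example):

/* Print the IPv4 address(es) bound to a netdev name (e.g. "ib0").
 * Comparing the output across nodes shows whether the netdev-name form
 * of <ia_params> lands every node on the same subnet. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "ib0";
    struct ifaddrs *ifa_list, *ifa;

    if (getifaddrs(&ifa_list) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifa_list; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr && ifa->ifa_addr->sa_family == AF_INET &&
            strcmp(ifa->ifa_name, dev) == 0) {
            struct sockaddr_in *sin = (struct sockaddr_in *)ifa->ifa_addr;
            printf("%s -> %s\n", dev, inet_ntoa(sin->sin_addr));
        }
    }
    freeifaddrs(ifa_list);
    return 0;
}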
