Glad to know this helped. If you have any further questions about using MVAPICH2 please feel free to mail mvapich-disc...@cse.ohio-state.edu.
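[Editor's note] For anyone landing on this thread with the same "cannot create cq" failure, the fix described in the quoted messages below reduces to two small configuration fragments. The paths and values are exactly as reported in the thread; whether `UsePAM yes` is actually required depends on how your distribution builds and packages sshd, so treat this as a sketch rather than a universal recipe:

```text
# /etc/security/limits.conf: raise the locked-memory limit for all users,
# so that ibverbs can register enough memory for RDMA buffers
*    soft    memlock    unlimited
*    hard    memlock    unlimited

# /etc/ssh/sshd_config: have sshd apply PAM (and thus limits.conf)
# to remote sessions
UsePAM yes
```

After editing sshd_config, restart the daemon (e.g. `systemctl restart sshd.service`) and re-check with `ulimit -l` in a fresh ssh login.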
On Fri, Jun 15, 2012 at 02:57:50PM +0200, Dark Charlot wrote:
> Dear Jonathan Perkins,
>
> You put me on the right track! It was just a locked-memory limit
> problem, DAMN IT!
>
> My /etc/security/limits.conf was set correctly, with these lines:
>
> * hard memlock unlimited
> * soft memlock unlimited
>
> BUT when I ran "ulimit -l" as a user, I got "64" instead of
> "unlimited".
>
> To get "unlimited" in all my shells, I had to add the following line
> to /etc/ssh/sshd_config:
>
> UsePAM yes
>
> (and restart my sshd daemon: systemctl restart sshd.service)
>
> And now my MPI stack over InfiniBand is working as expected :D:D
>
> Many many thanks again!
>
> Jean-Charles
>
> ---------- Forwarded message ----------
> From: Dark Charlot <jcld...@gmail.com>
> Date: 2012/6/15
> Subject: Re: [ewg] OFED drivers or linux stock drivers ?
> To: Jonathan Perkins <perki...@cse.ohio-state.edu>
>
> Hi,
>
> After recompiling MVAPICH2 with your configure options, I got this:
>
> mpirun_rsh -np 2 amos kerkira ./osu_bw
>
> [cli_0]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(408).......:
> MPID_Init(296)..............: channel initialization failed
> MPIDI_CH3_Init(283).........:
> MPIDI_CH3I_RDMA_init(172)...:
> rdma_setup_startup_ring(431): cannot create cq
>
> [amos:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [amos:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [amos:mpispawn_0][child_handler] MPI process (rank: 0, pid: 11879) exited with status 1
>
> [cli_1]: aborting job:
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(408).......:
> MPID_Init(296)..............: channel initialization failed
> MPIDI_CH3_Init(283).........:
> MPIDI_CH3I_RDMA_init(172)...:
> rdma_setup_startup_ring(431): cannot create cq
>
> [kerkira:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5.
> MPI process died?
> [kerkira:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [kerkira:mpispawn_1][child_handler] MPI process (rank: 1, pid: 565) exited with status 1
> [kerkira:mpispawn_1][report_error] connect() failed: Connection refused (111)
> [kerkira:mpispawn_1][report_error] connect() failed: Connection refused (111)
>
> Thanks, JC
>
> 2012/6/15 Jonathan Perkins <perki...@cse.ohio-state.edu>
>
> > This could be something as simple as a locked-memory limit issue. Can
> > you rebuild MVAPICH2 by passing `--disable-fast --enable-g=dbg' to
> > configure? You should get more useful output with these options.
> >
> > I'm cc'ing mvapich-discuss as well, as this may be specific to MVAPICH2.
> >
> > On Thu, Jun 14, 2012 at 4:14 PM, Dark Charlot <jcld...@gmail.com> wrote:
> > > Dear experts,
> > >
> > > I am running the Mageia 2 Linux distribution, which comes with kernel 3.3.6.
> > >
> > > I downloaded the OFED 1.5.4.1 drivers, then compiled and installed
> > > (** with a lot of pain and spec-file modifications **) some of the RPMs:
> > >
> > > infiniband-diags-1.5.13-1.x86_64.rpm
> > > infiniband-diags-debug-1.5.13-1.x86_64.rpm
> > > libibmad-1.3.8-1.x86_64.rpm
> > > libibmad-debug-1.3.8-1.x86_64.rpm
> > > libibmad-devel-1.3.8-1.x86_64.rpm
> > > libibmad-static-1.3.8-1.x86_64.rpm
> > > libibumad-1.3.7-1.x86_64.rpm
> > > libibumad-debug-1.3.7-1.x86_64.rpm
> > > libibumad-devel-1.3.7-1.x86_64.rpm
> > > libibumad-static-1.3.7-1.x86_64.rpm
> > > libibverbs-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > > libibverbs-debug-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > > libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > > libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > > libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64.rpm
> > > libmlx4-1.0.1-1.20.g6771d22.x86_64.rpm
> > > libmlx4-debug-1.0.1-1.20.g6771d22.x86_64.rpm
> > > libmlx4-devel-1.0.1-1.20.g6771d22.x86_64.rpm
> > > mstflint-1.4-1.18.g1adcfbf.x86_64.rpm
> > > mstflint-debug-1.4-1.18.g1adcfbf.x86_64.rpm
> > > opensm-3.3.13-1.x86_64.rpm
> > > opensm-debug-3.3.13-1.x86_64.rpm
> > > opensm-devel-3.3.13-1.x86_64.rpm
> > > opensm-libs-3.3.13-1.x86_64.rpm
> > > opensm-static-3.3.13-1.x86_64.rpm
> > >
> > > But I was **not** able to compile the OFA kernel package itself.
> > >
> > > So instead I tried to use all the corresponding modules that come
> > > with my stock distribution kernel (3.3.6).
> > >
> > > After initializing (correctly, I guess) all the necessary Mellanox
> > > components (openibd, opensm, etc.) I can see my Mellanox cards with
> > > the command ibv_devinfo.
> > >
> > > I get the following output on all the computers that have a Mellanox card:
> > >
> > > 1) ibv_devinfo
> > >
> > > kerkira:% ibv_devinfo
> > >
> > > hca_id: mlx4_0
> > >     transport:        InfiniBand (0)
> > >     fw_ver:           2.7.000
> > >     node_guid:        0002:c903:0009:d1b2
> > >     sys_image_guid:   0002:c903:0009:d1b5
> > >     vendor_id:        0x02c9
> > >     vendor_part_id:   26428
> > >     hw_ver:           0xA0
> > >     board_id:         MT_0C40110009
> > >     phys_port_cnt:    1
> > >     port: 1
> > >         state:        PORT_ACTIVE (4)
> > >         max_mtu:      2048 (4)
> > >         active_mtu:   2048 (4)
> > >         sm_lid:       8
> > >         port_lid:     8
> > >         port_lmc:     0x00
> > >         link_layer:   IB
> > >
> > > 2) ibstatus
> > >
> > > kerkira:% /usr/sbin/ibstatus
> > >
> > > Infiniband device 'mlx4_0' port 1 status:
> > >     default gid:   fe80:0000:0000:0000:0002:c903:0009:d1b3
> > >     base lid:      0x8
> > >     sm lid:        0x8
> > >     state:         4: ACTIVE
> > >     phys state:    5: LinkUp
> > >     rate:          40 Gb/sec (4X QDR)
> > >     link_layer:    InfiniBand
> > >
> > > QUESTION:
> > >
> > > ==> According to these outputs, can we say that my computers are
> > > correctly using the mlx4 drivers that come with kernel 3.3.6?
> > >
> > > Probably not, because I cannot communicate between two machines
> > > using MPI.....
> > > Here is the detail:
> > > I compiled and installed MVAPICH2, but I couldn't run the "osu_bw"
> > > program between two machines; I get:
> > >
> > > kerkira% mpirun_rsh -np 2 kerkira amos ./osu_bw
> > >
> > > [cli_0]: aborting job:
> > > Fatal error in MPI_Init:
> > > Other MPI error
> > >
> > > [kerkira:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> > > [kerkira:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > > [kerkira:mpispawn_0][child_handler] MPI process (rank: 0, pid: 5396) exited with status 1
> > > [cli_1]: aborting job:
> > > Fatal error in MPI_Init:
> > > Other MPI error
> > >
> > > [amos:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
> > > [amos:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
> > > [amos:mpispawn_1][child_handler] MPI process (rank: 1, pid: 6733) exited with status 1
> > > [amos:mpispawn_1][report_error] connect() failed: Connection refused (111)
> > >
> > > Now if I run on the **same** machine, I get the expected results:
> > >
> > > kerkira% mpirun_rsh -np 2 kerkira kerkira ./osu_bw
> > > # OSU MPI Bandwidth Test v3.6
> > > # Size        Bandwidth (MB/s)
> > > 1             5.47
> > > 2             11.34
> > > 4             22.84
> > > 8             45.89
> > > 16            91.52
> > > 32            180.27
> > > 64            350.68
> > > 128           661.78
> > > 256           1274.94
> > > 512           2283.42
> > > 1024          3936.39
> > > 2048          6362.91
> > > 4096          9159.54
> > > 8192          10737.42
> > > 16384         9246.39
> > > 32768         8869.26
> > > 65536         8707.28
> > > 131072        8942.07
> > > 262144        9009.39
> > > 524288        9060.31
> > > 1048576       9080.17
> > > 2097152       5702.06
> > >
> > > (note: ssh between the machines kerkira and amos works correctly
> > > without a password)
> > >
> > > QUESTION:
> > >
> > > ==> Why do MPI programs not work between two machines?
> > > ==> Is it because I use the mlx4/umad/etc. modules from my
> > > distribution kernel and not the OFED kernel-ib package?
> > >
> > > Thanks in advance for your help.
> > >
> > > Jean-Charles Lambert.
> > >
> > > _______________________________________________
> > > ewg mailing list
> > > ewg@lists.openfabrics.org
> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> >
> > --
> > Jonathan Perkins
> > http://www.cse.ohio-state.edu/~perkinjo

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

_______________________________________________
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
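[Editor's note] A quick way to confirm that the limit fix from this thread actually took effect is to check the locked-memory limit in the same kind of shell your MPI processes will inherit. This is a minimal sketch; the hostnames from the thread (amos, kerkira) are only examples:

```shell
# Print the locked-memory limit (in KiB, or "unlimited") that child
# processes of this shell, including MPI ranks, will inherit.
# A small value such as the common default of 64 is what produced the
# "cannot create cq" failure seen earlier in this thread.
ulimit -l
```

Run the same command over ssh on each compute node (e.g. `ssh amos 'ulimit -l'`); if a non-interactive ssh session reports 64 while an interactive login reports unlimited, sshd is not applying the PAM limits from /etc/security/limits.conf.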