[Wien] problem in parallel calculations
Dear all,

I have been trying to get parallel calculations to work for almost a month. I am working on the following system:

Model: SUN Blade 6275 cluster
Processor: Intel Xeon X5570
CPU/node: 8
Memory: 24 GB/node, 3 GB/core
Network: InfiniBand 40G 8X QDR
Operating system: Red Hat Enterprise Linux 5.3
Job control: SGE 6.2u5
Compiler: Intel 11.1 (MKL included)
MPI: Open MPI 1.3.3
FFTW: 2.1.5 (compiled with Intel 11.1 and configured with --enable-mpi LDFLAGS=-L$MPIHOME/$LIBRARYPATH F77=ifort CC=icc --with-sgi-mp --with-openmp --enable-threads)

Compiler options:

O Compiler options: -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
L Linker Flags: $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
P Preprocessor flags: '-DParallel'
R R_LIB (LAPACK+BLAS): -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
RP RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)
FP FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
MP MPIRUN commando: mpirun -mca btl ^tcp -mca plm_rsh_num_concurrent 48 -mca oob_tcp_listen_mode listen_thread -mca plm_rsh_tree_spawn 1 -np _NP_ -machinefile _HOSTS_ _EXEC_

With this environment, the compilation finishes without any error messages.

To make the .machines file, I use proclist=(`cat $TMPDIR/machines`). This gives me the list of nodes, one entry per CPU. If I set the total number of CPUs to 384 in the job script, it returns 384 entries. Since each entry is a node name, every node appears 8 times.

case1: k-point parallelism + 8 MPI tasks per k-point

1.
In my case, I want to calculate 48 k-points with 8 MPI tasks per node (that is, per k-point), so the .machines file was:

lapw0:tachyon2066:8 tachyon1982:8 tachyon1207:8 tachyon1396:8 tachyon1152:8 tachyon2440:8 tachyon2120:8 tachyon1555:8 tachyon2319:8 tachyon2470:8 tachyon1612:8 tachyon2274:8 tachyon1402:8 tachyon2846:8 tachyon2091:8 tachyon1622:8 tachyon1920:8 tachyon2213:8 tachyon1832:8 tachyon2672:8 tachyon2370:8 tachyon2545:8 tachyon2359:8 tachyon1770:8 tachyon1018:8 tachyon1456:8 tachyon1429:8 tachyon3074:8 tachyon1169:8 tachyon2400:8 tachyon2688:8 tachyon1099:8 tachyon2906:8 tachyon1394:8 tachyon1830:8 tachyon1383:8 tachyon2157:8 tachyon2818:8 tachyon2644:8 tachyon2283:8 tachyon1213:8 tachyon1542:8 tachyon2726:8 tachyon2152:8 tachyon1135:8 tachyon2144:8 tachyon3015:8 tachyon2077:8
1:tachyon2066:8
1:tachyon1982:8
1:tachyon1207:8
1:tachyon1396:8
1:tachyon1152:8
1:tachyon2440:8
1:tachyon2120:8
1:tachyon1555:8
1:tachyon2319:8
1:tachyon2470:8
1:tachyon1612:8
1:tachyon2274:8
1:tachyon1402:8
1:tachyon2846:8
1:tachyon2091:8
1:tachyon1622:8
1:tachyon1920:8
1:tachyon2213:8
1:tachyon1832:8
1:tachyon2672:8
1:tachyon2370:8
1:tachyon2545:8
1:tachyon2359:8
1:tachyon1770:8
1:tachyon1018:8
1:tachyon1456:8
1:tachyon1429:8
1:tachyon3074:8
1:tachyon1169:8
1:tachyon2400:8
1:tachyon2688:8
1:tachyon1099:8
1:tachyon2906:8
1:tachyon1394:8
1:tachyon1830:8
1:tachyon1383:8
1:tachyon2157:8
1:tachyon2818:8
1:tachyon2644:8
1:tachyon2283:8
1:tachyon1213:8
1:tachyon1542:8
1:tachyon2726:8
1:tachyon2152:8
1:tachyon1135:8
1:tachyon2144:8
1:tachyon3015:8
1:tachyon2077:8
granularity:1
extrafine:1
lapw2_vector_split:1

In this case, case.dayfile shows:

on tachyon2066 with PID 13780
using WIEN2k_11.1 (Release 5/4/2011) in /home01/x584cjh/code/WIEN2k_11
start (Fri Apr 20 09:13:32 KST 2012) with lapw0 (40/99 to go)
cycle 1 (Fri Apr 20 09:13:32 KST 2012) (40/99 to go)
lapw0 -p (09:13:32) starting parallel lapw0 at Fri Apr 20 09:13:32 KST 2012
.machine0 : 384 processors
tachyon2066:14892: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14892: open_hca: device mthca0 not found
tachyon2066:14892: open_hca: device mthca0 not found
tachyon2066:14892: open_hca: device ipath0 not found
tachyon2066:14892: open_hca: device ipath0 not found
tachyon2066:14894: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14894: open_hca: device mthca0 not found
tachyon2066:14894: open_hca: device mthca0 not found
tachyon2066:14891: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14894: open_hca: device ipath0 not found
tachyon2066:14894: open_hca: device ipath0 not found
tachyon2319:23519: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2066:14891: open_hca: device mthca0 not found
tachyon2066:14891: open_hca: device mthca0 not found
tachyon1982:11799: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
tachyon2319:23519: open_hca: device mthca0 not found
tachyon2319:23519: open_hca: device mthca0 not found
tachyon1982:11799: open_hca:
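The .machines layout described in this message (one lapw0 line, one `1:host:8` line per node, then the three option lines) can be derived from a host list that repeats each node once per slot. The following is only a minimal sketch under that assumption: the function name `build_machines` and the sample host names `nodeA`/`nodeB` are hypothetical and not part of WIEN2k or SGE, and the output format simply mirrors the file shown above.

```python
def build_machines(slot_entries, cores_per_node=8):
    """Build .machines text from a per-slot host list (hypothetical helper).

    slot_entries: host names as listed by the scheduler, one entry per
    slot, so each node is assumed to appear cores_per_node times.
    """
    # Collapse repeated entries to unique hosts, keeping first-seen order.
    hosts = list(dict.fromkeys(slot_entries))

    # One lapw0 line listing every node with its MPI task count.
    lines = ["lapw0:" + " ".join(f"{h}:{cores_per_node}" for h in hosts)]
    # One k-point job per node, each run with cores_per_node MPI tasks.
    lines += [f"1:{h}:{cores_per_node}" for h in hosts]
    # Option lines copied from the file shown in the message.
    lines += ["granularity:1", "extrafine:1", "lapw2_vector_split:1"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative input: two nodes, each listed 8 times (one per slot).
    sample = ["nodeA"] * 8 + ["nodeB"] * 8
    print(build_machines(sample), end="")
```

Reading the real host file (for example the one at `$TMPDIR/machines`) into `slot_entries` is left out on purpose, since its exact format depends on the local SGE setup.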
[Wien] problem in parallel calculations
When you have very little experience, the first thing to do is: forget MPI parallelization. The problem is probably with ScaLAPACK (since lapw0_mpi seems to run in your second example), or simply try setting

setenv MPI_REMOTE 0

in $WIENROOT/parallel_options.

Let's focus on sequential runs only. A system with just 8 atoms should run in only a couple of seconds (although with all your debugging switches on it will take a bit longer).

> lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8 tachyon2673:8

lapw0 still runs in MPI mode and needs 17 seconds.

> lapw0 -p (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36 KST 2012
> .machine0 : 48 processors
> 83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w

However, you have only 8 atoms and lapw0 is parallelized only over atoms. Thus use ONLY 8 cores for this run. You will see that the time decreases, since the FFT part is probably very slow with so many cores.

> This shows LAPW1 cycle runs with 6 nodes and each node calculates with 8 k-point. The time is almost 1:30 hour! My system contains only 9 Bi atoms and the inversion symmetry is assumed so the number of symmetry operator is 4. So I expected much less time than those.

Yes, this is VERY strange. Part of it may come from -mcmodel=medium -CB -g, which should probably not be there in production runs.

Do a

grep HORB case.output1_1

It should give you some CPU/wall time information on 3 different parts of the code.
And also

grep :RKM case.scf1

will tell you what your matrix size is (for 8 atoms it should not be larger than ~1000-2000), and the CPU time per k-point should be in the range of 10-30 seconds.

> case2: k-point parallelism
>
> 2. For k-point parallelism only, I set the total number of CPUs to 384/8 = 48.
>
> The generated .machines file is
>
> lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8 tachyon2673:8
> 1:tachyon1119
> 1:tachyon2665
> 1:tachyon3150
> 1:tachyon2896
> 1:tachyon1519
> 1:tachyon2673
> granularity:1
> extrafine:1
> lapw2_vector_split:1
>
> And the generated .processes file is
>
> init:tachyon1119
> init:tachyon2665
> init:tachyon3150
> init:tachyon2896
> init:tachyon1519
> init:tachyon2673
> 1 : tachyon1119 : 8 : 1 : 1
> 2 : tachyon2665 : 8 : 1 : 2
> 3 : tachyon3150 : 8 : 1 : 3
> 4 : tachyon2896 : 8 : 1 : 4
> 5 : tachyon1519 : 8 : 1 : 5
> 6 : tachyon2673 : 8 : 1 : 6
>
> The calculation runs smoothly until it reaches the time limit. But the problem is that it takes far too long. The .dayfile is shown below.
> start (Wed Apr 18 18:02:36 KST 2012) with lapw0 (40/99 to go)
> cycle 1 (Wed Apr 18 18:02:36 KST 2012) (40/99 to go)
> lapw0 -p (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36 KST 2012
> .machine0 : 48 processors
> 83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w
> :FORCE convergence: 0 1 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO
> lapw1 -p (18:02:55) starting parallel lapw1 at Wed Apr 18 18:02:55 KST 2012
> - starting parallel LAPW1 jobs at Wed Apr 18 18:02:55 KST 2012
> running LAPW1 in parallel mode (using .machines)
> 6 number_of_parallel_jobs
> tachyon1001(8) 5167.781u 7.573s 1:26:16.75 99.9% 0+0k 0+0io 175pf+0w
> tachyon1469(8) 5222.568u 8.425s 1:27:12.71 99.9% 0+0k 0+0io 0pf+0w
> tachyon2585(8) 5148.924u 7.837s 1:25:58.19 99.9% 0+0k 0+0io 19pf+0w
> tachyon1214(8) 5170.790u 5.684s 1:26:17.83 99.9% 0+0k 0+0io 0pf+0w
> tachyon2943(8) 5105.165u 5.959s 1:25:12.38 99.9% 0+0k 0+0io 0pf+0w
> tachyon1154(8) 5065.181u 6.282s 1:24:32.88 99.9% 0+0k 0+0io 74pf+0w
> Summary of lapw1para:
> tachyon1001 k=8 user=5167.78 wallclock=86
> tachyon1469 k=8 user=5222.57 wallclock=87
> tachyon2585 k=8 user=5148.92 wallclock=85
> tachyon1214 k=8 user=5170.79 wallclock=86
> tachyon2943 k=8 user=5105.16 wallclock=85
> tachyon1154 k=8 user=5065.18 wallclock=84
> 30883.253u 49.709s 1:27:15.42 590.8% 0+0k 0+0io 276pf+0w
>
> This shows LAPW1 cycle runs with 6 nodes and each node calculates with 8 k-point. The time is almost 1:30 hour! My system contains
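To put the quoted lapw1 timings next to the 10-30 seconds per k-point mentioned in the reply, the "Summary of lapw1para" lines can be reduced to CPU time per k-point. This is a rough sketch only: it assumes `user=` is user CPU time in seconds (which matches the `5167.781u` figures in the same dayfile), and the helper name `user_seconds_per_kpoint` is made up for illustration.

```python
import re

# Summary lines copied from the "Summary of lapw1para" block in the dayfile.
SUMMARY = """\
tachyon1001 k=8 user=5167.78 wallclock=86
tachyon1469 k=8 user=5222.57 wallclock=87
tachyon2585 k=8 user=5148.92 wallclock=85
tachyon1214 k=8 user=5170.79 wallclock=86
tachyon2943 k=8 user=5105.16 wallclock=85
tachyon1154 k=8 user=5065.18 wallclock=84
"""

# host, number of k-points, user CPU time, wall-clock value
LINE_RE = re.compile(r"^(\S+)\s+k=(\d+)\s+user=([\d.]+)\s+wallclock=([\d.]+)$")


def user_seconds_per_kpoint(summary):
    """Return {host: user CPU seconds per k-point} (hypothetical helper)."""
    result = {}
    for line in summary.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            host, n_kpoints, user, _wall = match.groups()
            result[host] = float(user) / int(n_kpoints)
    return result


if __name__ == "__main__":
    per_k = user_seconds_per_kpoint(SUMMARY)
    for host, seconds in per_k.items():
        print(f"{host}: {seconds:.1f} s per k-point")
    mean = sum(per_k.values()) / len(per_k)
    print(f"mean: {mean:.1f} s per k-point")
```

With these numbers every node comes out at roughly 640 s per k-point, more than 20 times the upper end of the 10-30 s range, which is consistent with the reply calling the timing very strange.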
[Wien] problem in parallel calculations
Several suggestions.

a) Limit yourself to just 8 cores on 1 cpu, and to something very simple such as TiC, until that works.
b) Use only simple commands such as x lapw0 -p until it works.
c) You probably do not need all the additional parameters in your MPIRUN line; they should already be set at the system level.
d) Open MPI 1.3.3 is old and may not work right.
e) Talk to your sysadmin.

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
1-847-491-3996
Research is to see what everybody else has seen, and to think what nobody else has thought
Albert Szent-Györgyi

On Apr 19, 2012 7:49 PM, hyunjung kim angpangmokjang at hanmail.net wrote: