When you have very little experience, the first thing to do is: forget mpi-parallelization. (The problem is probably with scalapack, since lapw0_mpi seems to run in your second example.) Or simply try setting "setenv MPI_REMOTE 0" in $WIENROOT/parallel_options.
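For the latter, the change amounts to this one line in $WIENROOT/parallel_options (leave the other setenv entries in that file untouched):

   setenv MPI_REMOTE 0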
Let's focus on sequential runs only. A system with just 8 atoms should run in only a couple of seconds (although with all your debugging switches on it will take a bit longer).

> lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8
> tachyon2673:8

lapw0 still runs in mpi-mode and needs 17 seconds.

> >   lapw0 -p    (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36 KST 2012
> -------- .machine0 : 48 processors
> 83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w

However: you have only 8 atoms, and lapw0 is parallelized only over atoms. Thus use ONLY 8 cores for this run (a sketch of the corresponding .machines file is given at the end of my comments). You will see that the time decreases, since the fft-part is probably very slow with so many cores.

--------

> This shows the LAPW1 cycle runs with 6 nodes and each node calculates 8
> k-points.
> The time is almost 1:30 hours!
> My system contains only 9 Bi atoms and inversion symmetry is assumed, so
> the number of symmetry operators is 4. So I expected much less time than
> those.

Yes, this is VERY strange. Part of it may come from "-mcmodel=medium -CB -g", which should probably not be there in production runs.

Do a

   grep HORB case.output1_1

It should give you some cpu/wall time information on 3 different parts of the code. And also

   grep :RKM case.scf1

will tell you what your matrix size is (for 8 atoms it should not be larger than ~1000-2000, and the cpu time/k-point should be in the range of 10-30 seconds/k-point).
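To make the "only 8 cores for lapw0" point concrete, here is a minimal sketch of such a .machines file; it simply reuses the hostnames from your own file (which node you pick for lapw0 does not matter), and the k-point lines stay as they were:

   lapw0:tachyon1119:8
   1:tachyon1119
   1:tachyon2665
   1:tachyon3150
   1:tachyon2896
   1:tachyon1519
   1:tachyon2673
   granularity:1
   extrafine:1
   lapw2_vector_split:1

With such a file lapw0_mpi uses only 8 cores, while lapw1/lapw2 still run k-point parallel over the six nodes.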
> Compiler options
>  O   Compiler options: -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML
>      -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
>  L   Linker Flags: $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
>  P   Preprocessor flags: '-DParallel'
>  R   R_LIB (LAPACK+BLAS): -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread
>      -lmkl_core -openmp -lpthread -lguide
>  RP  RP_LIB (SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64
>      -lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)
>  FP  FPOPT (par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML
>      -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
>  MP  MPIRUN commando: mpirun -mca btl ^tcp -mca plm_rsh_num_concurrent 48 -mca
>      oob_tcp_listen_mode listen_thread -mca plm_rsh_tree_spawn 1 -np _NP_
>      -machinefile _HOSTS_ _EXEC_
>
> Within this environment, the compilation goes through without any error messages.
>
> case 2: k-point parallelism
> 2. Only for the case of k-point parallelism, I just put the total number of cpus as 384/8=48.
>
> The generated .machines file is
>
> lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8 tachyon2673:8
> 1:tachyon1119
> 1:tachyon2665
> 1:tachyon3150
> 1:tachyon2896
> 1:tachyon1519
> 1:tachyon2673
> granularity:1
> extrafine:1
> lapw2_vector_split:1
>
> And the generated .processes file is
>
> init:tachyon1119
> init:tachyon2665
> init:tachyon3150
> init:tachyon2896
> init:tachyon1519
> init:tachyon2673
> 1 : tachyon1119 : 8 : 1 : 1
> 2 : tachyon2665 : 8 : 1 : 2
> 3 : tachyon3150 : 8 : 1 : 3
> 4 : tachyon2896 : 8 : 1 : 4
> 5 : tachyon1519 : 8 : 1 : 5
> 6 : tachyon2673 : 8 : 1 : 6
>
> The calculation runs smoothly until it hits the time limit, but the problem is that it is very time-consuming.
> Below the .dayfile is presented.
>
> start   (Wed Apr 18 18:02:36 KST 2012) with lapw0 (40/99 to go)
>
> cycle 1 (Wed Apr 18 18:02:36 KST 2012) (40/99 to go)
>
> >   lapw0 -p    (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36 KST 2012
> -------- .machine0 : 48 processors
> 83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w
> :FORCE convergence: 0 1 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO
> >   lapw1 -p    (18:02:55) starting parallel lapw1 at Wed Apr 18 18:02:55 KST 2012
> ->  starting parallel LAPW1 jobs at Wed Apr 18 18:02:55 KST 2012
> running LAPW1 in parallel mode (using .machines)
> 6 number_of_parallel_jobs
> tachyon1001(8) 5167.781u 7.573s 1:26:16.75 99.9% 0+0k 0+0io 175pf+0w
> tachyon1469(8) 5222.568u 8.425s 1:27:12.71 99.9% 0+0k 0+0io 0pf+0w
> tachyon2585(8) 5148.924u 7.837s 1:25:58.19 99.9% 0+0k 0+0io 19pf+0w
> tachyon1214(8) 5170.790u 5.684s 1:26:17.83 99.9% 0+0k 0+0io 0pf+0w
> tachyon2943(8) 5105.165u 5.959s 1:25:12.38 99.9% 0+0k 0+0io 0pf+0w
> tachyon1154(8) 5065.181u 6.282s 1:24:32.88 99.9% 0+0k 0+0io 74pf+0w
>    Summary of lapw1para:
>    tachyon1001   k=8   user=5167.78   wallclock=86
>    tachyon1469   k=8   user=5222.57   wallclock=87
>    tachyon2585   k=8   user=5148.92   wallclock=85
>    tachyon1214   k=8   user=5170.79   wallclock=86
>    tachyon2943   k=8   user=5105.16   wallclock=85
>    tachyon1154   k=8   user=5065.18   wallclock=84
> 30883.253u 49.709s 1:27:15.42 590.8% 0+0k 0+0io 276pf+0w
>
> This shows the LAPW1 cycle runs with 6 nodes and each node calculates 8
> k-points.
> The time is almost 1:30 hours!
> My system contains only 9 Bi atoms and inversion symmetry is assumed, so
> the number of symmetry operators is 4. So I expected much less time than those.
> Can anybody help me?
>
> Sincerely,
>
> HJ Kim.

--
P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: blaha at theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/
--------------------------------------------------------------------------