[Wien] problem in parallel calculations
Dear all,

I have been trying to get parallel calculations running for almost a month. I am working on the following system:

  Model            : SUN Blade 6275 cluster
  Processor        : Intel Xeon X5570
  CPU/node         : 8
  Memory           : 24 GB/node, 3 GB/core
  Network          : Infiniband 40G 8X QDR
  Operating system : Red Hat Enterprise Linux 5.3
  Job control      : SGE 6.2u5
  Compiler         : Intel 11.1 (MKL therein)
  MPI              : openMPI 1.3.3
  FFTW             : 2.1.5

FFTW was compiled with Intel 11.1 and configured with

  --enable-mpi LDFLAGS=-L$MPIHOME/$LIBRARYPATH F77=ifort CC=icc --with-sgi-mp --with-openmp --enable-threads

Compiler options:

  O   Compiler options:        -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
  L   Linker Flags:            $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
  P   Preprocessor flags       '-DParallel'
  R   R_LIB (LAPACK+BLAS):     -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lguide
  RP  RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)
  FP  FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
  MP  MPIRUN commando:         mpirun -mca btl ^tcp -mca plm_rsh_num_concurrent 48 -mca oob_tcp_listen_mode listen_thread -mca plm_rsh_tree_spawn 1 -np _NP_ -machinefile _HOSTS_ _EXEC_

Within this environment, the compilation finishes without any error messages.

To make the .machines file, I use proclist=(`cat $TMPDIR/machines`) in the job script. This gives the list of nodes with one entry per CPU: if I request 384 CPUs in the job script, it returns 384 entries. Since each entry is a node name, every node appears 8 times.

Case 1: k-point parallelism + 8 MPI tasks per k-point

1.
In my case, since I intend to calculate 48 k-points with 8 MPI tasks per node (i.e. per k-point), the .machines file was:

  lapw0:tachyon2066:8 tachyon1982:8 tachyon1207:8 tachyon1396:8 tachyon1152:8 tachyon2440:8 tachyon2120:8 tachyon1555:8 tachyon2319:8 tachyon2470:8 tachyon1612:8 tachyon2274:8 tachyon1402:8 tachyon2846:8 tachyon2091:8 tachyon1622:8 tachyon1920:8 tachyon2213:8 tachyon1832:8 tachyon2672:8 tachyon2370:8 tachyon2545:8 tachyon2359:8 tachyon1770:8 tachyon1018:8 tachyon1456:8 tachyon1429:8 tachyon3074:8 tachyon1169:8 tachyon2400:8 tachyon2688:8 tachyon1099:8 tachyon2906:8 tachyon1394:8 tachyon1830:8 tachyon1383:8 tachyon2157:8 tachyon2818:8 tachyon2644:8 tachyon2283:8 tachyon1213:8 tachyon1542:8 tachyon2726:8 tachyon2152:8 tachyon1135:8 tachyon2144:8 tachyon3015:8 tachyon2077:8
  1:tachyon2066:8
  1:tachyon1982:8
  1:tachyon1207:8
  1:tachyon1396:8
  1:tachyon1152:8
  1:tachyon2440:8
  1:tachyon2120:8
  1:tachyon1555:8
  1:tachyon2319:8
  1:tachyon2470:8
  1:tachyon1612:8
  1:tachyon2274:8
  1:tachyon1402:8
  1:tachyon2846:8
  1:tachyon2091:8
  1:tachyon1622:8
  1:tachyon1920:8
  1:tachyon2213:8
  1:tachyon1832:8
  1:tachyon2672:8
  1:tachyon2370:8
  1:tachyon2545:8
  1:tachyon2359:8
  1:tachyon1770:8
  1:tachyon1018:8
  1:tachyon1456:8
  1:tachyon1429:8
  1:tachyon3074:8
  1:tachyon1169:8
  1:tachyon2400:8
  1:tachyon2688:8
  1:tachyon1099:8
  1:tachyon2906:8
  1:tachyon1394:8
  1:tachyon1830:8
  1:tachyon1383:8
  1:tachyon2157:8
  1:tachyon2818:8
  1:tachyon2644:8
  1:tachyon2283:8
  1:tachyon1213:8
  1:tachyon1542:8
  1:tachyon2726:8
  1:tachyon2152:8
  1:tachyon1135:8
  1:tachyon2144:8
  1:tachyon3015:8
  1:tachyon2077:8
  granularity:1
  extrafine:1
  lapw2_vector_split:1

In this case, case.dayfile shows:

  on tachyon2066 with PID 13780
  using WIEN2k_11.1 (Release 5/4/2011) in /home01/x584cjh/code/WIEN2k_11

  start (Fri Apr 20 09:13:32 KST 2012) with lapw0 (40/99 to go)
  cycle 1 (Fri Apr 20 09:13:32 KST 2012) (40/99 to go)
  lapw0 -p (09:13:32) starting parallel lapw0 at Fri Apr 20 09:13:32 KST 2012
  .machine0 : 384 processors
  tachyon2066:14892: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
  tachyon2066:14892: open_hca: device mthca0 not found
  tachyon2066:14892: open_hca: device mthca0 not found
  tachyon2066:14892: open_hca: device ipath0 not found
  tachyon2066:14892: open_hca: device ipath0 not found
  tachyon2066:14894: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
  tachyon2066:14894: open_hca: device mthca0 not found
  tachyon2066:14894: open_hca: device mthca0 not found
  tachyon2066:14891: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
  tachyon2066:14894: open_hca: device ipath0 not found
  tachyon2066:14894: open_hca: device ipath0 not found
  tachyon2319:23519: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
  tachyon2066:14891: open_hca: device mthca0 not found
  tachyon2066:14891: open_hca: device mthca0 not found
  tachyon1982:11799: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
  tachyon2319:23519: open_hca: device mthca0 not found
  tachyon2319:23519: open_hca: device mthca0 not found
  tachyon1982:11799: open_hca:
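The .machines layout used in this post (one lapw0 line listing every node, then one "1:host:8" line per node, then the three trailing settings) can be generated from the per-slot host list that SGE writes to $TMPDIR/machines. The following is only a minimal sketch in Python, not the poster's actual job script: the function names build_machines and read_slots are hypothetical, and it assumes the host file contains one host name per line, repeated once per CPU as described above.

```python
import os


def read_slots(path):
    """Read one host name per line (one line per CPU slot), skipping blanks."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def build_machines(slot_hosts, granularity=1, extrafine=1, vector_split=1):
    """Return .machines text: one MPI group per node, one k-point job per group.

    slot_hosts is the per-slot list, e.g. a node with 8 CPUs appears 8 times.
    """
    counts = {}  # dicts keep insertion order, so node order is preserved
    for host in slot_hosts:
        counts[host] = counts.get(host, 0) + 1

    lines = ["lapw0:" + " ".join(f"{h}:{n}" for h, n in counts.items())]
    lines += [f"1:{h}:{n}" for h, n in counts.items()]
    lines += [
        f"granularity:{granularity}",
        f"extrafine:{extrafine}",
        f"lapw2_vector_split:{vector_split}",
    ]
    return "\n".join(lines) + "\n"


# Small demonstration with two of the nodes named in the post, 8 slots each.
demo_slots = ["tachyon2066"] * 8 + ["tachyon1982"] * 8
demo_text = build_machines(demo_slots)
print(demo_text, end="")

# Inside a job script one might instead do (path is an assumption):
# slots = read_slots(os.path.join(os.environ["TMPDIR"], "machines"))
# open(".machines", "w").write(build_machines(slots))
```

Counting the repeated host names, rather than hard-coding ":8", keeps the sketch valid if the scheduler hands out a different number of slots on some node.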
[Wien] forrtl: severe (41): insufficient virtual memory
Dear all,

I constantly get the following error messages when I submit the parallel job. I attach them, together with the generated .machines file; please check whether it is generated properly. I intended a job parallelized over 24 k-points.

The versions are:

  Fortran : ifort 12.0 (2011.3.174), mpif90
  CC      : icc 12.0 (2011.3.174)
  openMPI : 1.4.5
  FFTW2   : 2.1.5

(I got the same error message with ifort 11.1, so I guess that the Fortran compiler version is not the origin of this problem.)

Compiler options:

  O   Compiler options:        -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -traceback -I$(MKLROOT)/include
  L   Linker Flags:            $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
  P   Preprocessor flags       '-DParallel'
  R   R_LIB (LAPACK+BLAS):     -lmkl_lapack95_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread
  RP  RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)
  FP  FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -mcmodel=medium -i-dynamic -traceback -I$(MKLROOT)/include
  MP  MPIRUN commando:         mpirun -mca btl self,openib -mca plm_rsh_num_concurrent 400 -mca oob_tcp_listen_mode listen_thread -mca plm_rsh_tree_spawn 1 -np _NP_ -machinefile _HOSTS_ _EXEC_

The error messages are:

  ~~ abbreviation ~~
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW0 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  LAPW1 END
  forrtl: severe (41): insufficient virtual memory
  Image              PC            Routine            Line     Source
  libintlc.so.5      2B0540E88F7A  Unknown            Unknown  Unknown
  libintlc.so.5      2B0540E87AF5  Unknown            Unknown  Unknown
  libifcoremt.so.5   2B0540058CF2  Unknown            Unknown  Unknown
  libifcoremt.so.5   2B053FFCAAAB  Unknown            Unknown  Unknown
  libifcoremt.so.5   2B054001AFBA  Unknown            Unknown  Unknown
  libifcoremt.so.5   2B054001AE11  Unknown            Unknown  Unknown
  lapwso             004281C0      MAIN__             131      lapwso.f
  lapwso             00402A9C      Unknown            Unknown  Unknown
  libc.so.6          003CFA61D974  Unknown            Unknown  Unknown
  lapwso             004029A9      Unknown            Unknown  Unknown
  forrtl: severe (41): insufficient virtual memory
  Image              PC            Routine            Line     Source
  libintlc.so.5      2B5D32256F7A  Unknown            Unknown  Unknown
  libintlc.so.5      2B5D32255AF5  Unknown            Unknown  Unknown
  libifcoremt.so.5   2B5D31426CF2  Unknown            Unknown  Unknown
  libifcoremt.so.5   2B5D31398AAB  Unknown            Unknown  Unknown
  libifcoremt.so.5   2B5D313E8FBA  Unknown            Unknown  Unknown
  libifcoremt.so.5   2B5D313E8E11  Unknown            Unknown  Unknown
  lapwso             00409A6A      hmsout_mp_init_hm  78       modules.f
  lapwso             004280E2      MAIN__             130      lapwso.f
  lapwso             00402A9C      Unknown            Unknown  Unknown
  libc.so.6          003CFA61D974  Unknown            Unknown  Unknown
  ~~ abbreviation ~~

I note that the compilation finished without any error messages.

Any advice will be greatly appreciated!

Hyun-Jung Kim (Ph.D student) | phone : ++82 10 7335 7889
Department of Physics | Hanyang University | e-mail: angpangmokjang at hanmail.net
17 Haengdang-Dong | 133-791 Seongdong-Ku,Seoul/Korea | www: http://physics.hanyang.ac.kr/~sst/
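The Intel Fortran runtime reports severe (41) when a dynamic memory allocation fails, here inside lapwso. One thing that may be worth ruling out is a per-process memory limit imposed by the shell or the batch system on the compute nodes. The following is only a minimal diagnostic sketch in Python, assuming a Python interpreter is available where the job runs; the helper names limit_to_text and report_limits are hypothetical, and whether a limit is actually the cause in this case is not established by the post.

```python
import resource


def limit_to_text(value):
    """Format an rlimit value (bytes) as MiB, or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**20:.0f} MiB"


def report_limits():
    """Return {limit name: (soft, hard)} for the memory-related limits."""
    report = {}
    for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_STACK"):
        soft, hard = resource.getrlimit(getattr(resource, name))
        report[name] = (limit_to_text(soft), limit_to_text(hard))
    return report


# Print the limits seen by this process; run it from inside the job script
# so that the values are those of the batch environment, not the login node.
for name, (soft, hard) in report_limits().items():
    print(f"{name:13s} soft = {soft:>12s}   hard = {hard:>12s}")
```

If all three limits are reported as unlimited, the allocation is more likely failing because the requested arrays do not fit into the 3 GB per core mentioned for this cluster when 8 processes share a node.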
[Wien] forrtl: severe (41): insufficient virtual memory (file attached!!)
Dear all,

I am sorry, I forgot to attach the file containing the error messages and the job script files. The generated .machines file is attached as well.

Name: error.zip
Type: application/zip
Size: 8025 bytes
URL: http://zeus.theochem.tuwien.ac.at/pipermail/wien/attachments/20120415/b93c185d/attachment.zip
[Wien] Emax in case.in1 within spin-orbit coupling calculations
Dear all,

I have some questions about calculations including spin-orbit coupling (SOC).

According to the manual, the second-variational procedure requires many more unoccupied states to be calculated. This is controlled by increasing the energy maximum (Emax) up to which eigenvalues are searched. For Emax, which defines the number of eigenvalues to be calculated, I have used 2.5 Ry without SOC and 5.0 Ry with SOC, so the SOC calculation has about twice as many eigenvalues as the non-SOC one.

I have also checked the convergence as a function of Emax. Without SOC, the total energy does not depend on Emax, as expected. With SOC, however, the total energy depends on Emax: a larger Emax lowers the total energy, and up to Emax = 10 Ry I cannot find convergence. The numbers are below.

System: bulk bismuth (including SOC)

  Emax (Ry)    E(total) (Ry)
   2           -86326.2333
   2.5         -86326.2420
   3           -86326.2454
   4           -86326.2527
   5           -86326.2577
   7           -86326.2634
  10           -86326.2662

My questions are:

1. Why do the unoccupied states affect the total energy when SOC is included?
2. Are there any recommendations for choosing Emax?
3. Should the total energy converge as a function of Emax?
4. With the same Emax (i.e. the same parameters), can total energies be compared between different geometries?

Thank you.

Best regards,
Hyun-Jung Kim

Hyun-Jung Kim (Ph.D student) | phone : ++82 10 7335 7889
Department of Physics | Hanyang University | e-mail: angpangmokjang at hanmail.net
17 Haengdang-Dong | 133-791 Seongdong-Ku,Seoul/Korea | www: http://physics.hanyang.ac.kr/~sst/
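To make the convergence behaviour in the numbers above easier to judge, the change in total energy between successive Emax values can be tabulated. The following is a small sketch in Python that uses only the values quoted in this post; the function name convergence_steps is hypothetical, and Decimal is used so that the four-decimal energies are subtracted exactly.

```python
from decimal import Decimal

# (Emax in Ry, total energy in Ry), copied from the post above.
data = [
    ("2", "-86326.2333"),
    ("2.5", "-86326.2420"),
    ("3", "-86326.2454"),
    ("4", "-86326.2527"),
    ("5", "-86326.2577"),
    ("7", "-86326.2634"),
    ("10", "-86326.2662"),
]


def convergence_steps(rows):
    """Return (Emax_from, Emax_to, dE in mRy, dE/dEmax in mRy/Ry) per step."""
    parsed = [(Decimal(emax), Decimal(etot)) for emax, etot in rows]
    steps = []
    for (e0, t0), (e1, t1) in zip(parsed, parsed[1:]):
        d_mry = (t1 - t0) * 1000  # energy change of this step in mRy
        steps.append((e0, e1, d_mry, d_mry / (e1 - e0)))
    return steps


steps = convergence_steps(data)
for e0, e1, d_mry, slope in steps:
    print(f"Emax {e0:>3} -> {e1:>3} Ry: dE = {d_mry:6.1f} mRy, "
          f"slope = {slope:7.2f} mRy/Ry")
```

For these numbers the slope decreases from -17.4 mRy/Ry (2 to 2.5 Ry) to about -0.9 mRy/Ry (7 to 10 Ry), but the last step still lowers the energy by 2.8 mRy, which is consistent with the statement that no convergence is reached up to 10 Ry.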