[Wien] problem in parallel calculations

2012-04-20 Thread hyunjung kim
Dear all,

For almost a month I have been trying to get parallel calculations to work.

I'm working on:
model : SUN Blade 6275 cluster
Processor : Intel Xeon X5570
CPUs/node : 8
Memory : 24GB/node, 3GB/core
Network : InfiniBand 40G 8X QDR
OS : Red Hat Enterprise Linux 5.3
Job control : SGE 6.2u5

Compiler : Intel 11.1 (with its bundled MKL)
MPI : OpenMPI 1.3.3
FFTW: 2.1.5 (FFTW was compiled with intel 11.1 and configured with 
--enable-mpi LDFLAGS=-L$MPIHOME/$LIBRARYPATH F77=ifort CC=icc --with-sgi-mp 
--with-openmp --enable-threads)


Compiler option
 O   Compiler options:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML 
-mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
 L   Linker Flags:$(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) 
-pthread
 P   Preprocessor flags   '-DParallel'
 R   R_LIB (LAPACK+BLAS): -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread 
-lmkl_core -openmp -lpthread -lguide

 RP  RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64 
-lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)
 FP  FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML 
-mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
 MP  MPIRUN command: mpirun -mca btl ^tcp -mca plm_rsh_num_concurrent 
48 -mca oob_tcp_listen_mode listen_thread -mca plm_rsh_tree_spawn 1 -np _NP_ 
-machinefile _HOSTS_ _EXEC_



Within this environment, compilation completes without any error messages.

To make the .machines file, I type proclist=(`cat $TMPDIR/machines`).
This gives me the list of nodes, one entry per allocated CPU.
If I set the total number of CPUs to 384 in the job script, it exports 384
entries.
Since it exports the name of each node once per CPU, the same node name
appears 8 times.
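The node-list handling described above can be sketched as a small shell helper. Note that make_machines is a hypothetical name (not part of WIEN2k or SGE); the lapw0/1:/granularity lines simply follow the .machines examples shown in this thread:

```shell
# Hypothetical helper: build a k-point-parallel .machines file from an
# SGE-style machine list (one hostname per allocated slot, 8 per node here).
# make_machines <machinefile>  -> writes .machines in the current directory
make_machines() {
    nodes=$(sort -u "$1")        # each node appears 8 times; keep one copy
    {
        # lapw0 line: all unique nodes, 8 mpi tasks each
        echo "lapw0:$(echo $nodes | sed 's/ /:8 /g'):8"
        # one serial k-point job per unique node
        for n in $nodes; do echo "1:$n"; done
        echo "granularity:1"
        echo "extrafine:1"
    } > .machines
}
```

The unquoted `echo $nodes` deliberately flattens the newline-separated list into one space-separated line before the sed pass.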

case1 : k-point parallelism + 8 mpi tasks per k-point
1. Since I want to calculate with 48 k-points and 8 mpi tasks per k-point
(one node per k-point), the .machines file was:

lapw0:tachyon2066:8 tachyon1982:8 tachyon1207:8 tachyon1396:8 tachyon1152:8 
tachyon2440:8 tachyon2120:8 tachyon1555:8 tachyon2319:8 tachyon2470:8 
tachyon1612:8 tachyon2274:8 tachyon1402:8 tachyon2846:8 tachyon2091:8 
tachyon1622:8 tachyon1920:8 tachyon2213:8 tachyon1832:8 tachyon2672:8 
tachyon2370:8 tachyon2545:8 tachyon2359:8 tachyon1770:8 tachyon1018:8 
tachyon1456:8 tachyon1429:8 tachyon3074:8 tachyon1169:8 tachyon2400:8 
tachyon2688:8 tachyon1099:8 tachyon2906:8 tachyon1394:8 tachyon1830:8 
tachyon1383:8 tachyon2157:8 tachyon2818:8 tachyon2644:8 tachyon2283:8 
tachyon1213:8 tachyon1542:8 tachyon2726:8 tachyon2152:8 tachyon1135:8 
tachyon2144:8 tachyon3015:8 tachyon2077:8
1:tachyon2066:8
1:tachyon1982:8
1:tachyon1207:8
1:tachyon1396:8
1:tachyon1152:8
1:tachyon2440:8
1:tachyon2120:8
1:tachyon1555:8
1:tachyon2319:8
1:tachyon2470:8
1:tachyon1612:8
1:tachyon2274:8
1:tachyon1402:8
1:tachyon2846:8
1:tachyon2091:8
1:tachyon1622:8
1:tachyon1920:8
1:tachyon2213:8
1:tachyon1832:8
1:tachyon2672:8
1:tachyon2370:8
1:tachyon2545:8
1:tachyon2359:8
1:tachyon1770:8
1:tachyon1018:8
1:tachyon1456:8
1:tachyon1429:8
1:tachyon3074:8
1:tachyon1169:8
1:tachyon2400:8
1:tachyon2688:8
1:tachyon1099:8
1:tachyon2906:8
1:tachyon1394:8
1:tachyon1830:8
1:tachyon1383:8
1:tachyon2157:8
1:tachyon2818:8
1:tachyon2644:8
1:tachyon2283:8
1:tachyon1213:8
1:tachyon1542:8
1:tachyon2726:8
1:tachyon2152:8
1:tachyon1135:8
1:tachyon2144:8
1:tachyon3015:8
1:tachyon2077:8
granularity:1
extrafine:1
lapw2_vector_split:1

In this case, 

case.dayfile shows

on tachyon2066 with PID 13780
using WIEN2k_11.1 (Release 5/4/2011) in /home01/x584cjh/code/WIEN2k_11


start   (Fri Apr 20 09:13:32 KST 2012) with lapw0 (40/99 to go)

cycle 1 (Fri Apr 20 09:13:32 KST 2012)  (40/99 to go)

   lapw0 -p(09:13:32) starting parallel lapw0 at Fri Apr 20 09:13:32 KST 
 2012
 .machine0 : 384 processors
tachyon2066:14892:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 
configured?
tachyon2066:14892:  open_hca: device mthca0 not found
tachyon2066:14892:  open_hca: device mthca0 not found
tachyon2066:14892:  open_hca: device ipath0 not found
tachyon2066:14892:  open_hca: device ipath0 not found
tachyon2066:14894:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 
configured?
tachyon2066:14894:  open_hca: device mthca0 not found
tachyon2066:14894:  open_hca: device mthca0 not found
tachyon2066:14891:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 
configured?
tachyon2066:14894:  open_hca: device ipath0 not found
tachyon2066:14894:  open_hca: device ipath0 not found
tachyon2319:23519:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 
configured?
tachyon2066:14891:  open_hca: device mthca0 not found
tachyon2066:14891:  open_hca: device mthca0 not found
tachyon1982:11799:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 
configured?
tachyon2319:23519:  open_hca: device mthca0 not found
tachyon2319:23519:  open_hca: device mthca0 not found
tachyon1982:11799:  open_hca: 

[Wien] problem in parallel calculations

2012-04-20 Thread Peter Blaha
When you have very little experience, the first thing to do is:

Forget mpi-parallelization for now (the problem is probably with scalapack,
since lapw0_mpi seems to run in your second example), or simply try setting
setenv MPI_REMOTE 0 in $WIENROOT/parallel_options.

Let's focus on sequential runs only.
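For reference, the MPI_REMOTE setting goes into $WIENROOT/parallel_options. A minimal sketch (csh syntax, since WIEN2k's parallel scripts source this file); the WIEN_MPIRUN line is shown only for context and should match whatever is already in your file:

```csh
# $WIENROOT/parallel_options (csh syntax; sourced by WIEN2k's parallel scripts)
setenv MPI_REMOTE 0   # start mpi jobs from the local script instead of via ssh
# keep your existing mpirun definition, e.g.:
# setenv WIEN_MPIRUN "mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_"
```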

A system with just 8 atoms should run in only a couple of seconds (although
with all your debugging switches on it will take a bit longer).

  lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8 
  tachyon2673:8
lapw0 still runs in mpi-mode and needs 17 seconds.

lapw0 -p (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36 KST 
  2012
   .machine0 : 48 processors
  83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w

However: you have only 8 atoms and lapw0 is parallelized only over atoms. Thus 
use ONLY 8 cores for this run.
You will see that the time decreases, since the fft-part is probably very slow 
with so many cores.


  This shows that the LAPW1 cycle runs on 6 nodes and each node calculates 8
  k-points.
  The time is almost 1:30 hours!
  My system contains only 9 Bi atoms and inversion symmetry is assumed, so the
  number of symmetry operators is 4. So I expected much less time than this.

Yes, this is VERY strange. Part of it may come from "-mcmodel=medium -CB -g",
which should probably not be there in production runs.

Do a grep HORB case.output1_1

It should give you some cpu/wall time information on 3 different parts of the
code.
Also, grep :RKM case.scf1 will tell you what your matrix size is (for 8 atoms
it should not be larger than ~1000-2000), and the cpu time per k-point should
be in the range of 10-30 seconds.
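The two greps above can be bundled into one small helper. Note that check_case is a hypothetical name, the file names follow the usual WIEN2k case.* convention, and the sample lines used to exercise it below are illustrative only, not WIEN2k's exact output format:

```shell
# Hypothetical wrapper around the two diagnostics; argument is the case name.
# Prints timing lines from the lapw1 output and matrix-size lines from scf1.
check_case() {
    grep 'HORB' "$1.output1_1" 2>/dev/null   # cpu/wall time, 3 parts of lapw1
    grep ':RKM' "$1.scf1"      2>/dev/null   # matrix size (~1000-2000 here)
}
```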



 Compiler option
 O Compiler options: -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML 
 -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
 L Linker Flags: $(FOPT) -L$(MKLROOT)/lib/$(MKL_TARGET_ARCH) -pthread
 P Preprocessor flags '-DParallel'
 R R_LIB (LAPACK+BLAS): -lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread 
 -lmkl_core -openmp -lpthread -lguide

 RP RP_LIB(SCALAPACK+PBLAS): -lmkl_scalapack_lp64 -lmkl_solver_lp64 
 -lmkl_blacs_lp64 -L$(FFTWPATH)/lib -lfftw_mpi -lfftw $(R_LIBS)
 FP FPOPT(par.comp.options): -FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML 
 -mcmodel=medium -i-dynamic -CB -g -traceback -I$(MKLROOT)/include
 MP MPIRUN command : mpirun -mca btl ^tcp -mca plm_rsh_num_concurrent 48 -mca 
 oob_tcp_listen_mode listen_thread -mca plm_rsh_tree_spawn 1 -np _NP_ 
 -machinefile _HOSTS_ _EXEC_



 Within this environment, the compilation goes without any error messages.

 case2 : k-point parallelism
 *2. Only for the case of k-point parallelism: here I simply set the total
 number of CPUs to 384/8 = 48.*

 The generated .machine file is

 lapw0:tachyon1119:8 tachyon2665:8 tachyon3150:8 tachyon2896:8 tachyon1519:8 
 tachyon2673:8
 1:tachyon1119
 1:tachyon2665
 1:tachyon3150
 1:tachyon2896
 1:tachyon1519
 1:tachyon2673
 granularity:1
 extrafine:1
 lapw2_vector_split:1

 And the generated .processes file is
 init:tachyon1119
 init:tachyon2665
 init:tachyon3150
 init:tachyon2896
 init:tachyon1519
 init:tachyon2673
 1 : tachyon1119 : 8 : 1 : 1
 2 : tachyon2665 : 8 : 1 : 2
 3 : tachyon3150 : 8 : 1 : 3
 4 : tachyon2896 : 8 : 1 : 4
 5 : tachyon1519 : 8 : 1 : 5
 6 : tachyon2673 : 8 : 1 : 6

 And the calculation runs smoothly until it hits the time limit. But the
 problem is that it is very time-consuming.
 The .dayfile is presented below.

 start (Wed Apr 18 18:02:36 KST 2012) with lapw0 (40/99 to go)

 cycle 1 (Wed Apr 18 18:02:36 KST 2012) (40/99 to go)

   lapw0 -p (18:02:36) starting parallel lapw0 at Wed Apr 18 18:02:36 KST 2012
  .machine0 : 48 processors
 83.154u 13.907s 0:17.04 569.5% 0+0k 0+0io 1899pf+0w
 :FORCE convergence: 0 1 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO 0 YCO
   lapw1 -p (18:02:55) starting parallel lapw1 at Wed Apr 18 18:02:55 KST 2012
 - starting parallel LAPW1 jobs at Wed Apr 18 18:02:55 KST 2012
 running LAPW1 in parallel mode (using .machines)
 6 number_of_parallel_jobs
 tachyon1001(8) 5167.781u 7.573s 1:26:16.75 99.9% 0+0k 0+0io 175pf+0w
 tachyon1469(8) 5222.568u 8.425s 1:27:12.71 99.9% 0+0k 0+0io 0pf+0w
 tachyon2585(8) 5148.924u 7.837s 1:25:58.19 99.9% 0+0k 0+0io 19pf+0w
 tachyon1214(8) 5170.790u 5.684s 1:26:17.83 99.9% 0+0k 0+0io 0pf+0w
 tachyon2943(8) 5105.165u 5.959s 1:25:12.38 99.9% 0+0k 0+0io 0pf+0w
 tachyon1154(8) 5065.181u 6.282s 1:24:32.88 99.9% 0+0k 0+0io 74pf+0w
 Summary of lapw1para:
 tachyon1001 k=8 user=5167.78 wallclock=86
 tachyon1469 k=8 user=5222.57 wallclock=87
 tachyon2585 k=8 user=5148.92 wallclock=85
 tachyon1214 k=8 user=5170.79 wallclock=86
 tachyon2943 k=8 user=5105.16 wallclock=85
 tachyon1154 k=8 user=5065.18 wallclock=84
 30883.253u 49.709s 1:27:15.42 590.8% 0+0k 0+0io 276pf+0w

 This shows that the LAPW1 cycle runs on 6 nodes and each node calculates 8
 k-points.
 The time is almost 1:30 hours!
 My system contains only 9 Bi atoms and inversion symmetry is assumed, so the
 number of symmetry operators is 4. I expected much less time than this.

[Wien] problem in parallel calculations

2012-04-19 Thread Laurence Marks
Several suggestions.
a) Limit yourself to just using 8 cores on 1 node, and something very simple
such as TiC, until that works.
b) Just use simple commands such as x lapw0 -p until it works.
c) You probably do not need all the additional parameters in your MPIRUN
line; they should already be set at the system level.
d) OpenMPI 1.3.3 is old and may not work right.
e) Talk to your sysadmin.
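For steps (a) and (b), a single-node test needs only a trivial .machines file. A sketch follows, assuming the test stays on the current node (hence localhost); the WIEN2k commands to try afterwards are left as comments:

```shell
# Smallest useful .machines file: one k-point job on this node only.
cat > .machines <<'EOF'
1:localhost
granularity:1
EOF

# Then, inside an initialized test session (e.g. TiC), try one step at a time:
#   x lapw0 -p    # does parallel lapw0 alone work?
#   x lapw1 -p    # then lapw1, before attempting a full run_lapw -p
```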

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu 1-847-491-3996
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Gyorgi
 On Apr 19, 2012 7:49 PM, hyunjung kim angpangmokjang at hanmail.net wrote:

 [...]