Hi, I am having difficulties running MPB with MPI. I have installed serial and
parallel builds of the following packages on our multicore cluster:
SERIAL:
hdf5-1.8.4-patch1: CC=icc CFLAGS="-O3 -fPIC" CXX=iCC F77=ifort ./configure
--enable-linux-lfs --enable-production
--prefix=/usr/local
h5utils-1.12.1: ./configure
--prefix=/usr/local
mpb-1.4.2 : CC=icc ./configure --with-blas="$MKL" --with-lapack
--enable-shared --prefix=/usr/local
--with-libctl=/usr/local/share/libctl --with-hdf5
LDFLAGS="-L/usr/local/lib" CPPFLAGS="-I/usr/local/include
-DH5_USE_16_API=1"
where
MKL="-L${MKL_LIB_PATH}
-lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_lapack -lmkl_core -liomp5 -lpthread"
MKL_LIB_PATH="/opt/intel/Compiler/11.1/046/mkl/lib/em64t"
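A minimal sketch of the assignment order, assuming a POSIX shell: MKL_LIB_PATH
has to be set before MKL, because ${MKL_LIB_PATH} is expanded at the moment MKL
is assigned. The echo line is only there to check the expansion and is not part
of the build.

```shell
# Library directory first, so that it is already defined when MKL is assigned.
MKL_LIB_PATH="/opt/intel/Compiler/11.1/046/mkl/lib/em64t"

# Link line passed to configure via --with-blas="$MKL" (same flags as above).
MKL="-L${MKL_LIB_PATH} -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_lapack -lmkl_core -liomp5 -lpthread"

# Sanity check: print the fully expanded link line.
echo "$MKL"
```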
And for the MPI version:
hdf5-1.8.4-patch1: CC=mpicc ./configure --prefix=/usr/local/hdf5-mpi
--enable-fortran --enable-parallel --enable-production --enable-linux-lfs
mpb-1.4.2:
CC=mpicc CCXX=mpiCC F77=mpif77 ./configure
--prefix=/usr/local/mpbmpi LDFLAGS="-L/usr/local/lib
-L/usr/local/hdf5-mpi/lib -L/opt/intel/Compiler/11.1/046/lib64/intel64/"
CPPFLAGS="-I/usr/local/include -I /usr/local/hdf5-mpi/include/
-I/opt/intel/Compiler/11.1/046/include/ -DH5_USE_16_API=1"
--with-libctl=/usr/local/share/libctl --with-blas="$MKL" --with-lapack
--with-mpi --with-hdf5 --enable-debug
When running mpb-mpi programs such as slabdefectolineal.ctl with more than 10
processes (in this case 15), the following error appears:
Solving for band
polarization: zeven.
Initializing fields to random numbers...
elapsed time for initialization: 2 seconds.
epsilon: 1-11.97, mean 2.05361, harm. mean 1.11542,
12.8162% > 1, 9.6045% "fill"
Outputting epsilon...
solve_kpoint (0,0,0):
zevenfreqs:, k index, k1, k2, k3, kmag/2pi, zeven band
1, zeven band 2, zeven band 3, zeven band 4, zeven band 5, zeven band 6, zeven
band 7, zeven band 8, zeven band 9, zeven band 10, zeven band 11, zeven band
12, zeven band 13, zeven band 14, zeven band 15, zeven band 16, zeven band 17,
zeven band 18, zeven band 19, zeven band 20
Solving for bands 2 to 11...
[node27:14924] *** Process received signal ***
[node27:14924] Signal: Segmentation fault (11)
[node27:14924] Signal code: (-6)
[node27:14924] Failing at address: 0x3a4c
[node23:31035] *** Process received signal ***
[node27:14924] [ 0]
/lib64/libpthread.so.0 [0x3a0720e4c0]
[node27:14924] [ 1] /lib64/libpthread.so.0(raise+0x2d)
[0x3a0720e38d]
[node27:14924] [ 2] /opt/intel/Compiler/11.1/046/lib/intel64/libiomp5.so
[0x2aae48ca78dc]
[node27:14924] *** End of error message ***
[node18:24349] *** Process received signal ***
[node18:24349] Signal: Segmentation fault (11)
[node18:24349] Signal code: Address not mapped (1)
[node18:24349] Failing at address: (nil)
[node18:24349] [ 0] /lib64/libc.so.6 [0x3a06630280]
[node18:24349] [ 1]
/usr/local/mpbmpi/bin/mpb-mpi(maxwell_zero_k_constraint+0x36) [0x431df8]
[node18:24349] [ 2]
/usr/local/mpbmpi/bin/mpb-mpi(evectconstraint_chain_func+0x6b) [0x439b73]
[node18:24349] [ 3]
/usr/local/mpbmpi/bin/mpb-mpi(eigensolver+0x947) [0x43628d]
[node18:24349] [ 4]
/usr/local/mpbmpi/bin/mpb-mpi(solve_kpoint+0xdff) [0x412335]
[node18:24349] [ 5]
/usr/local/mpbmpi/bin/mpb-mpi(solve_kpoint_aux+0x4d) [0x40f651]
[node18:24349] [ 6] /usr/lib64/libguile.so.17
[0x2b6618f4e883]
[node18:24349] [ 7] /usr/lib64/libguile.so.17
[0x2b6618f4db90]
[node18:24349] [ 8] /usr/lib64/libguile.so.17
[0x2b6618f4c714]
[node18:24349] [ 9]
/usr/lib64/libguile.so.17(scm_eval_body+0x84) [0x2b6618f4ebc4]
[node18:24349] [10]
/usr/lib64/libguile.so.17(scm_map+0x2c3) [0x2b6618f4f763]
[node18:24349] [11] /usr/lib64/libguile.so.17
[0x2b6618f4d61d]
[node18:24349] [12] /usr/lib64/libguile.so.17
[0x2b6618f4c714]
[node18:24349] [13] /usr/lib64/libguile.so.17
[0x2b6618f4c714]
[node18:24349] [14] /usr/lib64/libguile.so.17
[0x2b6618f4c39c]
[node18:24349] [15] /usr/lib64/libguile.so.17
[0x2b6618f4e408]
[node18:24349] [16] /usr/lib64/libguile.so.17
[0x2b6618f4c714]
[node18:24349] [17]
/usr/lib64/libguile.so.17(scm_primitive_load+0x8c) [0x2b6618f63e4c]
[node18:24349] [18] /usr/lib64/libguile.so.17
[0x2b6618f4d735]
[node18:24349] [19] /usr/lib64/libguile.so.17
[0x2b6618f4c714]
[node18:24349] [20] /usr/lib64/libguile.so.17(scm_apply+0x49a)
[0x2b6618f479ea]
[node18:24349] [21]
/usr/local/mpbmpi/bin/mpb-mpi(ctl_include+0x2f) [0x440edf]
[node18:24349] [22]
/usr/local/mpbmpi/bin/mpb-mpi(main_entry+0x4f9) [0x40b24f]
[node18:24349] [23] /usr/lib64/libguile.so.17
[0x2b6618f5703e]
[node18:24349] [24] /usr/lib64/libguile.so.17
[0x2b6618f6135f]
[node18:24349] [25] /usr/lib64/libguile.so.17
[0x2b6618f3927a]
[node18:24349] [26]
/usr/lib64/libguile.so.17(scm_c_catch+0x285) [0x2b6618f9ba05]
[node18:24349] [27]
/usr/lib64/libguile.so.17(scm_i_with_continuation_barrier+0xb1)
[0x2b6618f396d1]
[node18:24349] [28]
/usr/lib64/libguile.so.17(scm_c_with_continuation_barrier+0x30)
[0x2b6618f39770]
[node18:24349] [29]
/usr/lib64/libguile.so.17(scm_i_with_guile_and_parent+0x33) [0x2b6618f9adc3]
[node18:24349] *** End of error message ***
[node21:31019] *** Process received signal ***
[node28:07559] *** Process received signal ***
mpirun noticed that job rank 0 with PID 5354 on node
node15.i2basque.es exited on signal 15 (Terminated).
14 additional processes aborted (not shown)
Our cluster has 29 compute nodes, each with 2 Intel quad-core Xeon processors
(8 processors identified per node), running CentOS 5.3 with kernel 2.6.18.
Can anybody help me? If we run it with 10 processes, it finishes OK.
Thanks very much for your help.
_______________________________________________
mpb-discuss mailing list
http://ab-initio.mit.edu/cgi-bin/mailman/listinfo/mpb-discuss