Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-05-06 Thread Gavin Abo
Ok, now it is clear that there are no additional error messages.
Unfortunately, I cannot tell specifically what went wrong from those
error messages.


You might try replacing mpirun with mpirun_rsh.  As you can see at

http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-May/004402.html

they replaced mpirun with mpirun_rsh, and it seems that they found a 
problem with passwordless ssh.
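A quick way to test that on your cluster (just a generic check, using the node
names that appear in your .machines file) is to grab an interactive shell on one
compute node and try something like

ssh z1-18 hostname

from z1-17. It should print z1-18 immediately without asking for a password; if
it prompts or hangs, the cross-node mpi startup will most likely fail in the
same way.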


In your parallel_options file, you might also want to change setenv 
USE_REMOTE 0 to setenv USE_REMOTE 1, then try both 0 and 1 for 
MPI_REMOTE to check if any of these other configurations work or not 
while using mpirun.
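For reference, the relevant part of $WIENROOT/parallel_options would then look
something like the sketch below (the mpirun path is the one from your earlier
posts; the mpirun_rsh line is only my guess at the equivalent syntax for your
MVAPICH2 installation, and the quotes may simply have been stripped when the
file was pasted to the list):

setenv TASKSET no
# try USE_REMOTE 1 (ssh to the remote nodes) together with MPI_REMOTE 0 and 1
setenv USE_REMOTE 1
setenv MPI_REMOTE 0
setenv WIEN_GRANULARITY 1
# current launcher:
setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
# possible alternative with mpirun_rsh (untested guess):
# setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun_rsh -np _NP_ -hostfile _HOSTS_ _EXEC_"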


On 5/6/2015 11:23 AM, lung Fermin wrote:

Thanks for the reply. Please see below.


As I asked before, did you give us all the error information in the 
case.dayfile and from standard output?  It is not entirely clear in 
your previous posts, but it looks to me that you might have only 
provided information from the case.dayfile and the error files (cat 
*.error), but maybe not from the standard output.  Are you still using 
the PBS script in your old post at 
http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html ? 
In the script, I can see that the standard output is set to be written 
to a file called wien2k_output.



Sorry for the confusion. Yes, I still use the PBS script in the above 
link. The posts before are from the standard outputs (wien2k). When 
using 2 nodes with 32 cores for one k point, the standard output gives


Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
number of processors: 32
 LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
 LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) 
terminated with signal 9 - abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 
9. MPI process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. 
MPI process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 
z1-17 aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file 
descriptor 21. MPI process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file 
descriptor 21. MPI process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI 
process died?

cp: cannot stat `.in.tmp': No such file or directory

   stop error
-
And the .dayfile reads:

on z1-17 with PID 29439
using WIEN2k_14.2 (Release 15/10/2014)

start (Thu Apr 30 17:36:59 2015) with lapw0 (40/99 to go)

cycle 1 (Thu Apr 30 17:36:59 2015)  (40/99 to go)

   lapw0 -p (17:36:59) starting parallel lapw0 at Thu Apr 30 17:36:59  2015
 .machine0 : 32 processors
904.074u 8.710s 1:01.54 1483.2% 0+0k 239608+78400io 105pf+0w
   lapw1  -p   -c  (17:38:01) starting parallel lapw1 at Thu Apr 30 
17:38:01 2015
-  starting parallel LAPW1 jobs at Thu Apr 30 17:38:01 2015
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 
z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18(8) 469689.261u 
1680.003s 8:12:29.52 1595.1%  0+0k 204560+31265944io 366pf+0w

   Summary of lapw1para:
   z1-17 k=0 user=0  wallclock=0
469788.683u 1726.356s 8:12:31.33 1595.5%0+0k 206128+31266512io 
379pf+0w

   lapw2 -p -c   (01:50:32) running LAPW2 in parallel mode
  z1-17 0.034u 0.040s 1:35.16 0.0% 0+0k 10696+0io 80pf+0w
   Summary of lapw2para:
   z1-17 user=0.034  wallclock=95.16
**  LAPW2 crashed!
4.645u 0.458s 1:42.01 4.9%  0+0k 74792+45008io 133pf+0w
error: command /home/stretch/flung/DFT/WIEN2k/lapw2cpara -c lapw2.def 
failed


   stop error

-

When it runs fine on a single node, does it always use the same node 
(say z1-17) or does it run fine on other nodes (like z1-18)?


Not really. The nodes were assigned randomly.
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-05-06 Thread lung Fermin
Thanks for the reply. Please see below.


As I asked before, did you give us all the error information in the
case.dayfile and from standard output?  It is not entirely clear in your
previous posts, but it looks to me that you might have only provided
information from the case.dayfile and the error files (cat *.error), but
maybe not from the standard output.  Are you still using the PBS script in
your old post at
http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html ?
In the script, I can see that the standard output is set to be written to a
file called wien2k_output.


Sorry for the confusion. Yes, I still use the PBS script in the above link.
The posts before are from the standard outputs (wien2k). When using 2 nodes
with 32 cores for one k point, the standard output gives

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
number of processors: 32
 LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
 LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291)
terminated with signal 9 - abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9.
MPI process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17
aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21.
MPI process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21.
MPI process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI
process died?
cp: cannot stat `.in.tmp': No such file or directory

   stop error
-
And the .dayfile reads:

on z1-17 with PID 29439
using WIEN2k_14.2 (Release 15/10/2014)

start   (Thu Apr 30 17:36:59 2015) with lapw0 (40/99 to go)

cycle 1 (Thu Apr 30 17:36:59 2015)  (40/99 to go)

   lapw0 -p   (17:36:59) starting parallel lapw0 at Thu Apr 30 17:36:59 2015
 .machine0 : 32 processors
904.074u 8.710s 1:01.54 1483.2% 0+0k 239608+78400io 105pf+0w
   lapw1  -p   -c  (17:38:01) starting parallel lapw1 at Thu Apr 30
17:38:01 2015
-  starting parallel LAPW1 jobs at Thu Apr 30 17:38:01 2015
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-17 z1-17 z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18(8) 469689.261u
1680.003s 8:12:29.52 1595.1%  0+0k 204560+31265944io 366pf+0w
   Summary of lapw1para:
   z1-17 k=0 user=0  wallclock=0
469788.683u 1726.356s 8:12:31.33 1595.5%0+0k 206128+31266512io
379pf+0w
   lapw2 -p   -c   (01:50:32) running LAPW2 in parallel mode
  z1-17 0.034u 0.040s 1:35.16 0.0% 0+0k 10696+0io 80pf+0w
   Summary of lapw2para:
   z1-17 user=0.034  wallclock=95.16
**  LAPW2 crashed!
4.645u 0.458s 1:42.01 4.9%  0+0k 74792+45008io 133pf+0w
error: command   /home/stretch/flung/DFT/WIEN2k/lapw2cpara -c lapw2.def
failed

   stop error

-

When it runs fine on a single node, does it always use the same node (say
z1-17) or does it run fine on other nodes (like z1-18)?

Not really. The nodes were assigned randomly.
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-05-06 Thread Gavin Abo

See below for my comments.


Thanks for all the information and suggestions.

I have tried to change -lmkl_blacs_intelmpi_lp64 to -lmkl_blacs_lp64 
and recompile. However, I got the following error message in the 
screen output


 LAPW0 END
[cli_14]: [cli_15]: [cli_6]: aborting job:
Fatal error in PMPI_Comm_size:
Invalid communicator, error stack:
PMPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7f190c) failed
PMPI_Comm_size(69).: Invalid communicator
aborting job:
Fatal error in PMPI_Comm_size:
Invalid communicator, error stack:
PMPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7f190c) failed
PMPI_Comm_size(69).: Invalid communicator
...
[z0-5:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 
20. MPI process died?
[z0-5:mpispawn_0][mtpmi_processops] Error while reading PMI socket. 
MPI process died?
[z0-5:mpispawn_0][child_handler] MPI process (rank: 14, pid: 11260) 
exited with status 1
[z0-5:mpispawn_0][child_handler] MPI process (rank: 3, pid: 11249) 
exited with status 1
[z0-5:mpispawn_0][child_handler] MPI process (rank: 6, pid: 11252) 
exited with status 1

.


This is probably because you are using the wrong blacs library.  The 
-lmkl_blacs_lp64 is for MPICH, but you are using a variant of MPICH3.


Previously I compiled the program with -lmkl_blacs_intelmpi_lp64 and 
the mpi parallelization on a single node seems to be working. I notice 
that during the run, the *.error files have finite sizes, but I 
re-examined them after the job finished and there were no errors 
written inside (and the files have 0 kb now). Does this indicate that 
the mpi is not running properly at all even on a single node? But I 
have checked the output result and it is in agreement with the non-mpi 
results... (for some simple cases)


Sounds like it is working fine on a single node.  At least for now, stay 
with -lmkl_blacs_intelmpi_lp64 as it works for a single node.


As I asked before, did you give us all the error information in the 
case.dayfile and from standard output?  It is not entirely clear in your 
previous posts, but it looks to me that you might have only provided 
information from the case.dayfile and the error files (cat *.error), but 
maybe not from the standard output.  Are you still using the PBS script 
in your old post at 
http://www.mail-archive.com/wien%40zeus.theochem.tuwien.ac.at/msg11770.html 
?  In the script, I can see that the standard output is set to be 
written to a file called wien2k_output.


When it runs fine on a single node, does it always use the same node 
(say z1-17) or does it run fine on other nodes (like z1-18)?


I also tried changing the mpirun to mpiexec as suggested by Prof. 
Marks by setting:
setenv WIEN_MPIRUN /usr/local/mvapich2-icc/bin/mpiexec -np _NP_ -f 
_HOSTS_ _EXEC_
in the parallel_options file. In this case, the program does not run and 
also does not terminate (qstat on the cluster just gives 00:00:00 for the 
time with a running status)...


At least for now, stay with mpirun since it works on a single node.


___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-05-05 Thread lung Fermin
Thanks for all the information and suggestions.

I have tried to change -lmkl_blacs_intelmpi_lp64 to -lmkl_blacs_lp64 and
recompile. However, I got the following error message in the screen output

 LAPW0 END
[cli_14]: [cli_15]: [cli_6]: aborting job:
Fatal error in PMPI_Comm_size:
Invalid communicator, error stack:
PMPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7f190c) failed
PMPI_Comm_size(69).: Invalid communicator
aborting job:
Fatal error in PMPI_Comm_size:
Invalid communicator, error stack:
PMPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0x7f190c) failed
PMPI_Comm_size(69).: Invalid communicator
...
[z0-5:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 20.
MPI process died?
[z0-5:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[z0-5:mpispawn_0][child_handler] MPI process (rank: 14, pid: 11260) exited
with status 1
[z0-5:mpispawn_0][child_handler] MPI process (rank: 3, pid: 11249) exited
with status 1
[z0-5:mpispawn_0][child_handler] MPI process (rank: 6, pid: 11252) exited
with status 1
.


Previously I compiled the program with -lmkl_blacs_intelmpi_lp64 and the
mpi parallelization on a single node seems to be working. I notice that
during the run, the *.error files have finite sizes, but I re-examined them
after the job finished and there were no errors written inside (and the
files have 0 kb now). Does this indicate that the mpi is not running
properly at all even on a single node? But I have checked the output result
and it is in agreement with the non-mpi results... (for some simple cases)


I also tried changing the mpirun to mpiexec as suggested by Prof. Marks by
setting:
setenv WIEN_MPIRUN /usr/local/mvapich2-icc/bin/mpiexec -np _NP_ -f _HOSTS_
_EXEC_
in the parallel_options file. In this case, the program does not run and also
does not terminate (qstat on the cluster just gives 00:00:00 for the time with
a running status)...


Regards,
Fermin
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-05-04 Thread lung Fermin
I have checked that case.vsp/vns are up-to-date. I guess lapw0_mpi runs
properly.

I compiled the source codes with ifort and please find the following for
the linking options:

current:FOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML -traceback

current:FPOPT:-FR -mp1 -w -prec_div -pc80 -pad -ip -DINTEL_VML
-Dmkl_scalapack -traceback

current:FFTW_OPT:-DFFTW3 -I/usr/local/include

current:FFTW_LIBS:-lfftw3_mpi -lfftw3 -L/usr/local/lib

current:LDFLAGS:$(FOPT) -L/opt/intel/Compiler/11.1/046/mkl/lib/em64t
-pthread

current:DPARALLEL:'-DParallel'

current:R_LIBS:-lmkl_lapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core
-openmp -lpthread -lguide

current:RP_LIBS:-lmkl_scalapack_lp64 -lmkl_solver_lp64
-lmkl_blacs_intelmpi_lp64 $(R_LIBS)

current:MPIRUN:/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile
_HOSTS_ _EXEC_

current:MKL_TARGET_ARCH:intel64


Is it ok to use -lmkl_blacs_intelmpi_lp64?


Thanks a lot for all the suggestions.

Regards,

Fermin


-Original Message-
From: wien-boun...@zeus.theochem.tuwien.ac.at [mailto:
wien-boun...@zeus.theochem.tuwien.ac.at] On Behalf Of Peter Blaha
To: A Mailing list for WIEN2k users
Subject: Re: [Wien] Error in mpi+k point parallelization across multiple
nodes



It seems as if lapw0_mpi runs properly ?? Please check if you have NEW
(check date with ls -als)!! valid case.vsp/vns files, which can be used in
eg. a sequential lapw1 step.



This suggests that   mpi and fftw are ok.



The problems seem to start in lapw1_mpi, and this program requires in
addition to mpi also scalapack.



I guess you compile with ifort and link with the mkl ??

There is one crucial blacs library, which must be adapted to your mpi,
since they are specific to a particular mpi (intelmpi, openmpi, ...):

Which blacs library do you link ?   -lmkl_blacs_lp64   or another one ??

Check out the documentation for the mkl.





On 04.05.2015 at 05:18, lung Fermin wrote:

 I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for
 distributing the mpi job. However, the problem still persists... but the
 error message looks different this time:

 $ cat *.error
 Error in LAPW2
 **  testerror: Error in Parallel LAPW2

 and the output on screen:
 Warning: no access to tty (Bad file descriptor).
 Thus no job control in this shell.
 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
 number of processors: 32
  LAPW0 END
 [16] Failed to dealloc pd (Device or resource busy)
 [0] Failed to dealloc pd (Device or resource busy)
 [17] Failed to dealloc pd (Device or resource busy)
 [2] Failed to dealloc pd (Device or resource busy)
 [18] Failed to dealloc pd (Device or resource busy)
 [1] Failed to dealloc pd (Device or resource busy)
  LAPW1 END
 LAPW2 - FERMI; weighs written
 [z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291)
 terminated with signal 9 - abort job
 [z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9.
 MPI process died?
 [z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
 process died?
 [z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
 z1-17 aborted: Error while reading a PMI socket (4)
 [z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21.
 MPI process died?
 [z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21.
 MPI process died?
 [z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI
 process died?
 cp: cannot stat `.in.tmp': No such file or directory

 stop error

 --


___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-05-04 Thread Peter Blaha

No !!! (Can use it only if you are using intelmpi).

I'm not sure (and it may even depend on the compiler version) which
mpi-versions are supported by intel. But maybe try the simplest version
-lmkl_blacs_lp64
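Concretely, with the linking options posted earlier in this thread, that would
mean changing only the blacs entry in the RP_LIBS line (shown here just to
illustrate the substitution) and then recompiling the mpi/parallel programs
through siteconfig:

current:RP_LIBS:-lmkl_scalapack_lp64 -lmkl_solver_lp64 -lmkl_blacs_lp64 $(R_LIBS)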

On 04.05.2015 at 08:03, lung Fermin wrote:

Is it ok to use -lmkl_blacs_intelmpi_lp64?


--
-
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pbl...@theochem.tuwien.ac.at
-
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-05-04 Thread Gavin Abo
On page 131 in the User's Guide for Intel mkl 11.1 for Linux [ 
https://software.intel.com/en-us/mkl_11.1_ug_lin_pdf ], it has:


libmkl_blacs_intelmpi_lp64.so = LP64 version of BLACS routines for 
Intel MPI and MPICH2


So -lmkl_blacs_intelmpi_lp64 might also work with MPICH2.

From the compile settings, it looks like compiler version 11.1 build 46 
is being used, which uses version 10.2 Update 2 of mkl [ 
https://software.intel.com/en-us/articles/which-version-of-ipp--mkl--tbb-is-installed-with-intel-compiler-professional-edition 
].


I could not find the mkl system requirements page for 10.2, but for 
10.3, it has that 10.3 was validated with MPICH2 version 1.3.2p1 [ 
https://software.intel.com/en-us/articles/intel-mkl-103-system-requirements 
].


The MVAPICH2-2.0a that was mentioned as being used is based on 
MPICH-3.0.4 [ 
http://mvapich.cse.ohio-state.edu/static/media/mvapich/MV2_CHANGELOG-2.1.txt 
].


It looks like Intel did not validate MPICH3 until about mkl version 11.2 
with MPICH3 version 3.1 [ 
https://software.intel.com/en-us/articles/intel-mkl-112-system-requirements 
].


Fermin, it looks like you provided information from the WIEN2k dayfile 
and error files.  However, are there any error messages in the standard 
output?  For example, standard output should be what you get in a 
terminal after you execute runsp_lapw -p. However, since clusters 
typically require that calculations be submitted using a job submission 
system, the standard output is usually written instead to a user 
specified file.  On some systems, for example, the calculation isn't 
executed with runsp_lapw -p but with qsub -j oe -o output.log 
myscript.pbs [ http://arc.it.wsu.edu/UserGuide/Using_Qsub.aspx ], where 
the script file called myscript.pbs would contain a line to execute 
runsp_lapw -p  and the standard output would be written to the file 
called output.log.
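As a purely illustrative sketch (the #PBS lines mirror the qsub example above;
the shebang and the cd line are my additions and will differ on your system),
such a myscript.pbs could be as small as

#!/bin/csh -f
#PBS -j oe
#PBS -o output.log
# assumes the job is submitted from the WIEN2k case directory
cd $PBS_O_WORKDIR
runsp_lapw -p

so that everything the WIEN2k scripts print to the terminal, including the
mpispawn/PMI messages you have been seeing, ends up in output.log next to the
case.dayfile.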


On 5/4/2015 12:08 AM, Peter Blaha wrote:

No !!! (Can use it only if you are using intelmpi).

I'm not sure (and it may even depend on the compiler version) which
mpi-versions are supported by intel. But maybe try the simplest version
-lmkl_blacs_lp64

On 04.05.2015 at 08:03, lung Fermin wrote:

Is it ok to use -lmkl_blacs_intelmpi_lp64?




___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-05-04 Thread Laurence Marks
To reiterate what everyone else said, you should change your blacs, the
intelmpi version only works if you are using impi (I am 98% certain).

Normally this leads to a weird but understandable error when lapw0/lapw1
initiate the mpi routines, not sure why this did not show up in your case.

On Mon, May 4, 2015 at 12:56 PM, Gavin Abo gs...@crimson.ua.edu wrote:

 On page 131 in the User's Guide for Intel mkl 11.1 for Linux [
 https://software.intel.com/en-us/mkl_11.1_ug_lin_pdf ], it has:

 libmkl_blacs_intelmpi_lp64.so = LP64 version of BLACS routines for
 Intel MPI and MPICH2

 So -lmkl_blacs_intelmpi_lp64 might also work with MPICH2.

  From the compile settings, it looks like compiler version 11.1 build 46
 is being used, which uses version 10.2 Update 2 of mkl [

 https://software.intel.com/en-us/articles/which-version-of-ipp--mkl--tbb-is-installed-with-intel-compiler-professional-edition
 ].

 I could not find the mkl system requirements page for 10.2, but for
 10.3, it has that 10.3 was validated with MPICH2 version 1.3.2p1 [
 https://software.intel.com/en-us/articles/intel-mkl-103-system-requirements
 ].

 The MVAPICH2-2.0a that was mentioned as being used is based on
 MPICH-3.0.4 [

 http://mvapich.cse.ohio-state.edu/static/media/mvapich/MV2_CHANGELOG-2.1.txt
 ].

 It looks like Intel did not validate MPICH3 until about mkl version 11.2
 with MPICH3 version 3.1 [
 https://software.intel.com/en-us/articles/intel-mkl-112-system-requirements
 ].

 Fermin, it looks like you provided information from the WIEN2k dayfile
 and error files.  However, are there any error messages in the standard
 output?  For example, standard output should be what you get in a
 terminal after you execute runsp_lapw -p. However, since clusters
 typically require that calculations be submitted using a job submission
 system, the standard output is usually written instead to a user
 specified file.  On some systems, for example, the calculation isn't
 executed with runsp_lapw -p but with qsub -j oe -o output.log
 myscript.pbs [ http://arc.it.wsu.edu/UserGuide/Using_Qsub.aspx ], where
 the script file called myscript.pbs would contain a line to execute
 runsp_lapw -p  and the standard output would be written to the file
 called output.log.

 On 5/4/2015 12:08 AM, Peter Blaha wrote:
  No !!! (Can use it only if you are using intelmpi).
 
  I'm not sure (and it may even depend on the compiler version) which
  mpi-versions are supported by intel. But maybe try the simplest version
  -lmkl_blacs_lp64
 
  On 04.05.2015 at 08:03, lung Fermin wrote:
  Is it ok to use -lmkl_blacs_intelmpi_lp64?
 

 ___
 Wien mailing list
 Wien@zeus.theochem.tuwien.ac.at
 http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
 SEARCH the MAILING-LIST at:
 http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html




-- 
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
Corrosion in 4D: MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Gyorgi
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-05-03 Thread lung Fermin
I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for
distributing the mpi job. However, the problem still persists... but the
error message looks different this time:

$ cat *.error
Error in LAPW2
**  testerror: Error in Parallel LAPW2

and the output on screen:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
number of processors: 32
 LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
 LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291)
terminated with signal 9 - abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9.
MPI process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17
aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21.
MPI process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21.
MPI process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI
process died?
cp: cannot stat `.in.tmp': No such file or directory

   stop error




Try setting

setenv MPI_REMOTE 0

in parallel options.


On 29.04.2015 at 09:44, lung Fermin wrote:

 Thanks for your comment, Prof. Marks.

 Each node on the cluster has 32GB memory and each core (16 in total)
 on the node is limited to 2GB of memory usage. For the current system,
 I used RKMAX=6, and the smallest RMT=2.25.

 I have tested the calculation with single k point and mpi on 16 cores
 within a node. The matrix size from

 $ cat *.nmat_only

 is 29138

 Does this mean that the number of matrix elements is 29138 or (29138)^2?
 In general, how shall I estimate the memory required for a calculation?

 I have also checked the memory usage with top on the node. Each core
 has used up ~5% of the memory and this adds up to ~5*16% on the node.
 Perhaps the problem is really caused by the overflow of memory.. I am
 now queuing on the cluster to test for the case of mpi over 32 cores
 (2 nodes).

 Thanks.

 Regards,
 Fermin

 --

 As an addendum, the calculation may be too big for a single node. How
 much memory does the node have, what is the RKMAX, the smallest RMT and
 unit cell size? Maybe use in your machines file

 1:z1-2:16 z1-13:16
 lapw0: z1-2:16 z1-13:16
 granularity:1
 extrafine:1

 Check the size using
 x lapw1 -c -p -nmat_only
 cat *.nmat

 ___
 Professor Laurence Marks
 Department of Materials Science and Engineering
 Northwestern University
 www.numis.northwestern.edu
 MURI4D.numis.northwestern.edu
 Co-Editor, Acta Cryst A
 Research is to see what everybody else has seen, and to think what
 nobody else has thought
 Albert Szent-Gyorgi

 On Apr 28, 2015 10:45 PM, Laurence Marks l-ma...@northwestern.edu wrote:

 Unfortunately it is hard to know what is going on. A google search on
 "Error while reading PMI socket." indicates that the message you have
 means it did not work, and is not specific. Some suggestions:

 a) Try mpiexec (slightly different arguments). You just edit
 parallel_options.
 https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
 b) Try an older version of mvapich2 if it is on the system.
 c) Do you have to launch mpdboot for your system
 https://wiki.calculquebec.ca/w/MVAPICH2/en?
 d) Talk to a sys_admin, particularly the one who setup mvapich
 e) Do cat *.error, maybe something else went wrong or it is not
 mpi's fault but a user error.

 ___
 Professor Laurence Marks
 Department of Materials Science and Engineering
 Northwestern University
 www.numis.northwestern.edu
 MURI4D.numis.northwestern.edu
 Co-Editor, Acta Cryst A
 Research is to see what everybody else has seen, and to think what
 nobody else has thought
 Albert Szent-Gyorgi

 On Apr 28, 2015 10:17 PM, lung Fermin ferminl...@gmail.com wrote:

Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-05-03 Thread Peter Blaha

It seems as if lapw0_mpi runs properly ?? Please check if you have
NEW (check date with ls -als)!! valid case.vsp/vns files, which can be used in
eg. a sequential lapw1 step.

This suggests that   mpi and fftw are ok.

The problems seem to start in lapw1_mpi, and this program requires in addition
to mpi also scalapack.

I guess you compile with ifort and link with the mkl ??
There is one crucial blacs library, which must be adapted to your mpi, since
they are specific to a particular mpi (intelmpi, openmpi, ...):
Which blacs library do you link ?   -lmkl_blacs_lp64   or another one ??
Check out the documentation for the mkl.


On 04.05.2015 at 05:18, lung Fermin wrote:

I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for
distributing the mpi job. However, the problem still persists... but the error
message looks different this time:

$ cat *.error
Error in LAPW2
**  testerror: Error in Parallel LAPW2

and the output on screen:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
number of processors: 32
  LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
  LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) terminated 
with signal 9 - abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9. MPI 
process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI 
process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17 
aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI 
process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI 
process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process 
died?
cp: cannot stat `.in.tmp': No such file or directory

stop error




Try setting

setenv MPI_REMOTE 0

in parallel options.

On 29.04.2015 at 09:44, lung Fermin wrote:

Thanks for your comment, Prof. Marks.

Each node on the cluster has 32GB memory and each core (16 in total)
on the node is limited to 2GB of memory usage. For the current system,
I used RKMAX=6, and the smallest RMT=2.25.

I have tested the calculation with single k point and mpi on 16 cores
within a node. The matrix size from

$ cat *.nmat_only

is 29138

Does this mean that the number of matrix elements is 29138 or (29138)^2?
In general, how shall I estimate the memory required for a calculation?

I have also checked the memory usage with top on the node. Each core
has used up ~5% of the memory and this adds up to ~5*16% on the node.
Perhaps the problem is really caused by the overflow of memory.. I am
now queuing on the cluster to test for the case of mpi over 32 cores
(2 nodes).

Thanks.

Regards,
Fermin

--

As an addendum, the calculation may be too big for a single node. How
much memory does the node have, what is the RKMAX, the smallest RMT and
unit cell size? Maybe use in your machines file

1:z1-2:16 z1-13:16
lapw0: z1-2:16 z1-13:16
granularity:1
extrafine:1

Check the size using
x lapw1 -c -p -nmat_only
cat *.nmat

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what
nobody else has thought
Albert Szent-Gyorgi

On Apr 28, 2015 10:45 PM, Laurence Marks l-ma...@northwestern.edu wrote:

Unfortunately it is hard to know what is going on. A google search on
"Error while reading PMI socket." indicates that the message you have
means it did not work, and is not specific. Some suggestions:

a) Try mpiexec (slightly different arguments). You just edit
parallel_options.

Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-04-29 Thread lung Fermin
Thanks for your comment, Prof. Marks.

Each node on the cluster has 32GB memory and each core (16 in total) on the
node is limited to 2GB of memory usage. For the current system, I used
RKMAX=6,  and the smallest RMT=2.25.

I have tested the calculation with single k point and mpi on 16 cores
within a node. The matrix size from

$ cat *.nmat_only

is   29138

Does this mean that the number of matrix elements is 29138 or (29138)^2? In
general, how shall I estimate the memory required for a calculation?
I have also checked the memory usage with top on the node. Each core has
used up ~5% of the memory and this adds up to ~5*16% on the node. Perhaps
the problem is really caused by the overflow of memory.. I am now queuing
on the cluster to test for the case of mpi over 32 cores (2 nodes).
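(A rough back-of-envelope, under the assumption that NMAT=29138 is the matrix
dimension rather than the element count and that lapw1c stores the Hamiltonian
and overlap matrices as complex*16:

  29138 x 29138 x 16 bytes ~ 13.6 GB per matrix,

i.e. roughly 27 GB for H and S together before any workspace, which is indeed
close to the 32 GB available on a single node.)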

Thanks.

Regards,
Fermin



As an addendum, the calculation may be too big for a single node. How much
memory does the node have, what is the RKMAX, the smallest RMT and unit cell
size? Maybe use in your machines file

1:z1-2:16 z1-13:16
lapw0: z1-2:16 z1-13:16
granularity:1
extrafine:1

Check the size using
x lapw1 -c -p -nmat_only
cat *.nmat

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Gyorgi

On Apr 28, 2015 10:45 PM, Laurence Marks l-ma...@northwestern.edu wrote:

Unfortunately it is hard to know what is going on. A google search on
"Error while reading PMI socket." indicates that the message you have means
it did not work, and is not specific. Some suggestions:

a) Try mpiexec (slightly different arguments). You just edit
parallel_options.
https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
b) Try an older version of mvapich2 if it is on the system.
c) Do you have to launch mpdboot for your system
https://wiki.calculquebec.ca/w/MVAPICH2/en?
d) Talk to a sys_admin, particularly the one who setup mvapich
e) Do cat *.error, maybe something else went wrong or it is not mpi's
fault but a user error.

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Gyorgi

On Apr 28, 2015 10:17 PM, lung Fermin ferminl...@gmail.com wrote:

Thanks for Prof. Marks' comment.

1. In the previous email, I missed copying the line

setenv WIEN_MPIRUN /usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile
_HOSTS_ _EXEC_

It was in the parallel_options file. Sorry about that.

2. I have checked that the running program was lapw1c_mpi. Besides, when
the mpi calculation was done on a single node for some other system, the
results are consistent with the literature. So I believe that the mpi code
has been setup and compiled properly.

Would there be something wrong with my option in siteconfig..? Do I have to
set some command to bind the job? Any other possible cause of the error?

Any suggestions or comments would be appreciated. Thanks.



Regards,

Fermin



You appear to be missing the line

setenv WIEN_MPIRUN=...

This is setup when you run siteconfig, and provides the information on how
mpi is run on your system.

N.B., did you setup and compile the mpi code?

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Gyorgi

On Apr 28, 2015 4:22 AM, lung Fermin ferminl...@gmail.com wrote:

Dear Wien2k community,



I am trying to perform a calculation on a system of ~100 in-equivalent atoms
using mpi+k point parallelization on a cluster. Everything goes fine when
the program is run on a single node. However, if I perform the calculation
across different nodes, the following error occurs. How can I solve this problem?
I am a newbie to mpi programming, any help would be appreciated. Thanks.



The error message (MVAPICH2 2.0a):

---

Warning: no access to tty (Bad file descriptor).

Thus no job control in this shell.

z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13

number of processors: 32

 LAPW0 END


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-04-29 Thread Laurence Marks
As an addendum, the calculation may be too big for a single node. How much
memory does the node have, what is the RKMAX, the smallest RMT and unit cell
size? Maybe use in your machines file

1:z1-2:16 z1-13:16
lapw0: z1-2:16 z1-13:16
granularity:1
extrafine:1

Check the size using
x lapw1 -c -p -nmat_only
cat *.nmat

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Gyorgi
On Apr 28, 2015 10:45 PM, Laurence Marks l-ma...@northwestern.edu wrote:

 Unfortunately it is hard to know what is going on. A google search on
 "Error while reading PMI socket." indicates that the message you have means
 it did not work, and is not specific. Some suggestions:

 a) Try mpiexec (slightly different arguments). You just edit
 parallel_options.
 https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
 b) Try an older version of mvapich2 if it is on the system.
 c) Do you have to launch mpdboot for your system
 https://wiki.calculquebec.ca/w/MVAPICH2/en?
 d) Talk to a sys_admin, particularly the one who setup mvapich
 e) Do cat *.error, maybe something else went wrong or it is not mpi's
 fault but a user error.

 ___
 Professor Laurence Marks
 Department of Materials Science and Engineering
 Northwestern University
 www.numis.northwestern.edu
 MURI4D.numis.northwestern.edu
 Co-Editor, Acta Cryst A
 Research is to see what everybody else has seen, and to think what nobody
 else has thought
 Albert Szent-Gyorgi
 On Apr 28, 2015 10:17 PM, lung Fermin ferminl...@gmail.com wrote:

  Thanks for Prof. Marks' comment.

 1. In the previous email, I missed copying the line

 setenv WIEN_MPIRUN /usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile
 _HOSTS_ _EXEC_
 It was in the parallel_options file. Sorry about that.

 2. I have checked that the running program was lapw1c_mpi. Besides, when
 the mpi calculation was done on a single node for some other system, the
 results are consistent with the literature. So I believe that the mpi code
 has been setup and compiled properly.

 Would there be something wrong with my option in siteconfig..? Do I have
 to set some command to bind the job? Any other possible cause of the error?

 Any suggestions or comments would be appreciated. Thanks.


  Regards,

 Fermin


 

 You appear to be missing the line

 setenv WIEN_MPIRUN=...

 This is setup when you run siteconfig, and provides the information on
 how mpi is run on your system.

 N.B., did you setup and compile the mpi code?

 ___
 Professor Laurence Marks
 Department of Materials Science and Engineering
 Northwestern University
 www.numis.northwestern.edu
 MURI4D.numis.northwestern.edu
 Co-Editor, Acta Cryst A
 Research is to see what everybody else has seen, and to think what
 nobody else has thought
 Albert Szent-Gyorgi

 On Apr 28, 2015 4:22 AM, lung Fermin ferminl...@gmail.com wrote:

 Dear Wien2k community,



 I am trying to perform a calculation on a system of ~100 in-equivalent
 atoms using mpi+k point parallelization on a cluster. Everything goes fine
 when the program is run on a single node. However, if I perform the
 calculation across different nodes, the following error occurs. How can I
 solve this problem? I am a newbie to mpi programming, any help would be
 appreciated. Thanks.



 The error message (MVAPICH2 2.0a):


 ---

 Warning: no access to tty (Bad file descriptor).

 Thus no job control in this shell.

 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13

 number of processors: 32

  LAPW0 END

 [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13
 aborted: Error while reading a PMI socket (4)

 [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
 terminated with signal 9 - abort job

 [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8.
 MPI process died?

 [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
 process died?

 [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12.
 MPI process died?

 [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
 process died?

 [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
 terminated with signal 9 - abort job

 [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2
 aborted: MPI process error (1)

 [cli_15]: aborting job:

 application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15



stop 

Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-04-29 Thread Peter Blaha

Try setting
setenv MPI_REMOTE 0
in parallel options.

On 29.04.2015 at 09:44, lung Fermin wrote:

Thanks for your comment, Prof. Marks.

Each node on the cluster has 32GB memory and each core (16 in total) on
the node is limited to 2GB of memory usage. For the current system, I
used RKMAX=6,  and the smallest RMT=2.25.

I have tested the calculation with single k point and mpi on 16 cores
within a node. The matrix size from

$ cat *.nmat_only

is   29138

Does this mean that the number of matrix elements is 29138 or (29138)^2?
In general, how shall I estimate the memory required for a calculation?

I have also checked the memory usage with top on the node. Each core
has used up ~5% of the memory and this adds up to ~5*16% on the node.
Perhaps the problem is really caused by the overflow of memory.. I am
now queuing on the cluster to test for the case of mpi over 32 cores (2
nodes).

Thanks.

Regards,
Fermin



As an addendum, the calculation may be too big for a single node. How
much memory does the node have, what is the RKMAX, the smallest RMT and
unit cell size? Maybe use in your machines file

1:z1-2:16 z1-13:16
lapw0: z1-2:16 z1-13:16
granularity:1
extrafine:1

Check the size using
x lapw1 -c -p -nmat_only
cat *.nmat

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what
nobody else has thought
Albert Szent-Gyorgi

On Apr 28, 2015 10:45 PM, Laurence Marks l-ma...@northwestern.edu wrote:

Unfortunately it is hard to know what is going on. A google search on
"Error while reading PMI socket." indicates that the message you have
means it did not work, and is not specific. Some suggestions:

a) Try mpiexec (slightly different arguments). You just edit
parallel_options.
https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
b) Try an older version of mvapich2 if it is on the system.
c) Do you have to launch mpdboot for your system
https://wiki.calculquebec.ca/w/MVAPICH2/en?
d) Talk to a sys_admin, particularly the one who setup mvapich
e) Do cat *.error, maybe something else went wrong or it is not mpi's
fault but a user error.

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what
nobody else has thought
Albert Szent-Gyorgi

On Apr 28, 2015 10:17 PM, lung Fermin ferminl...@gmail.com wrote:

Thanks for Prof. Marks' comment.

1. In the previous email, I missed copying the line

setenv WIEN_MPIRUN /usr/local/mvapich2-icc/bin/mpirun -np _NP_
-hostfile _HOSTS_ _EXEC_

It was in the parallel_options file. Sorry about that.

2. I have checked that the running program was lapw1c_mpi. Besides, when
the mpi calculation was done on a single node for some other system, the
results are consistent with the literature. So I believe that the mpi
code has been setup and compiled properly.

Would there be something wrong with my option in siteconfig..? Do I have
to set some command to bind the job? Any other possible cause of the error?

Any suggestions or comments would be appreciated. Thanks.

Regards,

Fermin



You appear to be missing the line

setenv WIEN_MPIRUN=...

This is setup when you run siteconfig, and provides the information on
how mpi is run on your system.

N.B., did you setup and compile the mpi code?

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what
nobody else has thought
Albert Szent-Gyorgi

On Apr 28, 2015 4:22 AM, lung Fermin ferminl...@gmail.com
mailto:ferminl...@gmail.com wrote:

Dear Wien2k community,

I am trying to perform a calculation on a system of ~100 in-equivalent
atoms using mpi+k point parallelization on a cluster. Everything goes
fine when the program is run on a single node. However, if I perform
the calculation across different nodes, the following error occurs. How can I
solve this problem? I am a newbie to mpi programming, any help would be
appreciated. Thanks.

The error message (MVAPICH2 2.0a):


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-04-28 Thread Laurence Marks
You appear to be missing the line

setenv WIEN_MPIRUN=...

This is setup when you run siteconfig, and provides the information on how
mpi is run on your system.

N.B., did you setup and compile the mpi code?

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Gyorgi
On Apr 28, 2015 4:22 AM, lung Fermin ferminl...@gmail.com wrote:

  Dear Wien2k community,

  I am trying to perform a calculation on a system of ~100 in-equivalent
 atoms using mpi+k point parallelization on a cluster. Everything goes fine
 when the program is run on a single node. However, if I perform the
 calculation across different nodes, the following error occurs. How can I
 solve this problem? I am a newbie to mpi programming, any help would be
 appreciated. Thanks.

  The error message (MVAPICH2 2.0a):

 ---
  Warning: no access to tty (Bad file descriptor).
 Thus no job control in this shell.
 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
 number of processors: 32
  LAPW0 END
 [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13
 aborted: Error while reading a PMI socket (4)
 [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
 terminated with signal 9 - abort job
 [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8.
 MPI process died?
 [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
 process died?
 [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12.
 MPI process died?
 [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
 process died?
 [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
 terminated with signal 9 - abort job
 [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2
 aborted: MPI process error (1)
 [cli_15]: aborting job:
 application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15

 stop error

 --

  The .machines file:
  #
 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
 z1-2 z1-2
 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
 z1-13 z1-13 z1-13 z1-13
 granularity:1
 extrafine:1

 
 The parallel_options:

  setenv TASKSET no
 setenv USE_REMOTE 0
 setenv MPI_REMOTE 1
 setenv WIEN_GRANULARITY 1


 

  Thanks.

  Regards,
 Fermin

___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-04-28 Thread lung Fermin
Thanks for Prof. Marks' comment.

1. In the previous email, I missed copying the line

setenv WIEN_MPIRUN /usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile
_HOSTS_ _EXEC_
It was in the parallel_options file. Sorry about that.

2. I have checked that the running program was lapw1c_mpi. Besides, when
the mpi calculation was done on a single node for some other system, the
results are consistent with the literature. So I believe that the mpi code
has been setup and compiled properly.

Would there be something wrong with my option in siteconfig..? Do I have to
set some command to bind the job? Any other possible cause of the error?

Any suggestions or comments would be appreciated. Thanks.


Regards,

Fermin



You appear to be missing the line

setenv WIEN_MPIRUN=...

This is setup when you run siteconfig, and provides the information on how
mpi is run on your system.

N.B., did you setup and compile the mpi code?

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Gyorgi

On Apr 28, 2015 4:22 AM, lung Fermin ferminl...@gmail.com wrote:

Dear Wien2k community,



I am trying to perform a calculation on a system of ~100 in-equivalent atoms
using mpi+k point parallelization on a cluster. Everything goes fine when
the program is run on a single node. However, if I perform the calculation
across different nodes, the following error occurs. How can I solve this problem?
I am a newbie to mpi programming, any help would be appreciated. Thanks.



The error message (MVAPICH2 2.0a):

---

Warning: no access to tty (Bad file descriptor).

Thus no job control in this shell.

z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13

number of processors: 32

 LAPW0 END

[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13
aborted: Error while reading a PMI socket (4)

[z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
terminated with signal 9 - abort job

[z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8.
MPI process died?

[z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?

[z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12.
MPI process died?

[z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?

[z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
terminated with signal 9 - abort job

[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2
aborted: MPI process error (1)

[cli_15]: aborting job:

application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15



   stop error

--



The .machines file:

#

1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
z1-2 z1-2

1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
z1-13 z1-13 z1-13 z1-13

granularity:1

extrafine:1



The parallel_options:



setenv TASKSET no

setenv USE_REMOTE 0

setenv MPI_REMOTE 1

setenv WIEN_GRANULARITY 1







Thanks.



Regards,

Fermin
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] Error in mpi+k point parallelization across multiple nodes

2015-04-28 Thread Laurence Marks
Unfortunately it is hard to know what is going on. A google search on
"Error while reading PMI socket." indicates that the message you have means
it did not work, and is not specific. Some suggestions:

a) Try mpiexec (slightly different arguments). You just edit
parallel_options.
https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
b) Try an older version of mvapich2 if it is on the system.
c) Do you have to launch mpdboot for your system
https://wiki.calculquebec.ca/w/MVAPICH2/en?
d) Talk to a sys_admin, particularly the one who setup mvapich
e) Do cat *.error, maybe something else went wrong or it is not mpi's
fault but a user error.

___
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
Research is to see what everybody else has seen, and to think what nobody
else has thought
Albert Szent-Gyorgi
On Apr 28, 2015 10:17 PM, lung Fermin ferminl...@gmail.com wrote:

  Thanks for Prof. Marks' comment.

 1. In the previous email, I missed copying the line

 setenv WIEN_MPIRUN /usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile
 _HOSTS_ _EXEC_
 It was in the parallel_options file. Sorry about that.

 2. I have checked that the running program was lapw1c_mpi. Besides, when
 the mpi calculation was done on a single node for some other system, the
 results are consistent with the literature. So I believe that the mpi code
 has been setup and compiled properly.

 Would there be something wrong with my option in siteconfig..? Do I have
 to set some command to bind the job? Any other possible cause of the error?

 Any suggestions or comments would be appreciated. Thanks.


  Regards,

 Fermin


 

 You appear to be missing the line

 setenv WIEN_MPIRUN=...

 This is setup when you run siteconfig, and provides the information on how
 mpi is run on your system.

 N.B., did you setup and compile the mpi code?

 ___
 Professor Laurence Marks
 Department of Materials Science and Engineering
 Northwestern University
 www.numis.northwestern.edu
 MURI4D.numis.northwestern.edu
 Co-Editor, Acta Cryst A
 Research is to see what everybody else has seen, and to think what nobody
 else has thought
 Albert Szent-Gyorgi

 On Apr 28, 2015 4:22 AM, lung Fermin ferminl...@gmail.com wrote:

 Dear Wien2k community,



 I am trying to perform a calculation on a system of ~100 in-equivalent atoms
 using mpi+k point parallelization on a cluster. Everything goes fine when
 the program is run on a single node. However, if I perform the calculation
 across different nodes, the following error occurs. How can I solve this problem?
 I am a newbie to mpi programming, any help would be appreciated. Thanks.



 The error message (MVAPICH2 2.0a):


 ---

 Warning: no access to tty (Bad file descriptor).

 Thus no job control in this shell.

 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13

 number of processors: 32

  LAPW0 END

 [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13
 aborted: Error while reading a PMI socket (4)

 [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
 terminated with signal 9 - abort job

 [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8.
 MPI process died?

 [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
 process died?

 [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12.
 MPI process died?

 [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
 process died?

 [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
 terminated with signal 9 - abort job

 [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2
 aborted: MPI process error (1)

 [cli_15]: aborting job:

 application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15



stop error


 --



 The .machines file:

 #

 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
 z1-2 z1-2

 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
 z1-13 z1-13 z1-13 z1-13

 granularity:1

 extrafine:1


 

 The parallel_options:



 setenv TASKSET no

 setenv USE_REMOTE 0

 setenv MPI_REMOTE 1

 setenv WIEN_GRANULARITY 1




 



 Thanks.



 Regards,

 Fermin


[Wien] Error in mpi+k point parallelization across multiple nodes

2015-04-28 Thread lung Fermin
Dear Wien2k community,

I am trying to perform a calculation on a system of ~100 in-equivalent atoms
using mpi+k point parallelization on a cluster. Everything goes fine when
the program is run on a single node. However, if I perform the calculation
across different nodes, the following error occurs. How can I solve this problem?
I am a newbie to mpi programming, any help would be appreciated. Thanks.

The error message (MVAPICH2 2.0a):
---
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
number of processors: 32
 LAPW0 END
[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13
aborted: Error while reading a PMI socket (4)
[z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546)
terminated with signal 9 - abort job
[z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8.
MPI process died?
[z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12.
MPI process died?
[z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI
process died?
[z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454)
terminated with signal 9 - abort job
[z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2
aborted: MPI process error (1)
[cli_15]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15

   stop error
--

The .machines file:
#
1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
z1-2 z1-2
1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
z1-13 z1-13 z1-13 z1-13
granularity:1
extrafine:1

The parallel_options:

setenv TASKSET no
setenv USE_REMOTE 0
setenv MPI_REMOTE 1
setenv WIEN_GRANULARITY 1



Thanks.

Regards,
Fermin
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html