Re: [Wien] Error in mpi+k point parallelization across multiple nodes

Peter Blaha Sun, 03 May 2015 22:06:06 -0700

It seems as if lapw0_mpi runs properly ?? Please check if you have
NEW (check date with ls -als)!! valid case.vsp/vns files, which can be used in
eg. a sequential lapw1 step.


This suggests that   mpi and fftw are ok.

The problems seem to start in lapw1_mpi, and this program requires in addition 
to
mpi also scalapack.

I guess you compile with ifort and link with the mkl ??
There is one crucial blacs library, which must be adapted to your mpi, since 
they
are specific to a particular mpi (intelmpi, openmpi, ...):
Which blacks-library do you link ?   -lmkl_blacs_lp64   or another one ??
Check out the doku for the mkl.


Am 04.05.2015 um 05:18 schrieb lung Fermin:

I have tried to set MPI_REMOTE=0 and used 32 cores (on 2 nodes) for 
distributing the mpi job. However, the problem still persist... but the error 
message looks different
this time:

$> cat *.error
Error in LAPW2
**  testerror: Error in Parallel LAPW2

and the output on screen:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 
z1-17 z1-17 z1-17 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 
z1-18 z1-18
z1-18 z1-1
8 z1-18 z1-18
number of processors: 32
  LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
  LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) terminated 
with signal 9 -> abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9. MPI 
process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI 
process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17 
aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI 
process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI 
process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process 
died?
cp: cannot stat `.in.tmp': No such file or directory

 >   stop error


------------------------------------------------------------------------------------------------------------

Try setting

setenv MPI_REMOTE 0

in parallel options.

Am 29.04.2015 um 09:44 schrieb lung Fermin:

Thanks for your comment, Prof.  Marks.

Each node on the cluster has 32GB  memory and each core (16 in total)

on the node is limited to 2GB of  memory usage. For the current system,

I used RKMAX=6,  and the smallest RMT=2.25.

I have tested the calculation with  single k point and mpi on 16 cores

within a node. The matrix size from

$ cat *.nmat_only

is       29138

Does this mean that the number of  matrix elements is 29138 or (29138)^2?

In general, how shall I estimate  the memory required for a calculation?

I have also checked the memory  usage with "top" on the node. Each core

has used up ~5% of the memory and  this adds up to ~5*16% on the node.

Perhaps the problem is really  caused by the overflow of memory.. I am

now queuing on the cluster to test  for the case of mpi over 32 cores

(2 nodes).

Thanks.

Regards,

Fermin

 ----------------------------------------------------------------------

 ------------------------------------------

As an addendum, the calculation may  be too big for a single node. How

much memory does the node have,  what is the RKMAX, the smallest RMT &

unit cell size? Maybe use in your  machines file

1:z1-2:16 z1-13:16

lapw0: z1-2:16 z1-13:16

granularity:1

extrafine:1

Check the size using

x law1 -c -p -nmat_only

cat *.nmat

___________________________

Professor Laurence Marks

Department of Materials Science and  Engineering Northwestern

Universitywww.numis.northwestern.edu <http://www.numis.northwestern.edu>

<http://www.numis.northwestern.edu>

MURI4D.numis.northwestern.edu <http://MURI4D.numis.northwestern.edu> 
<http://MURI4D.numis.northwestern.edu>

Co-Editor, Acta Cryst A

"Research is to see what  everybody else has seen, and to think what

nobody else has thought"

Albert Szent-Gyorgi

On Apr 28, 2015 10:45 PM,  "Laurence Marks" <l-ma...@northwestern.edu 
<mailto:l-ma...@northwestern.edu>

<mailto:l-ma...@northwestern.edu>> wrote:

Unfortunately it is hard to know  what is going on. A google search on

"Error while reading PMI  socket." indicates that the message you have

means it did not work, and is not  specific. Some suggestions:

a) Try mpiexec (slightly different  arguments). You just edit

parallel_options.

https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager

b) Try an older version of mvapich2  if it is on the system.

c) Do you have to launch mpdboot  for your system

https://wiki.calculquebec.ca/w/MVAPICH2/en?

d) Talk to a sys_admin,  particularly the one who setup mvapich

e) Do "cat *.error",  maybe something else went wrong or it is not

mpi's fault but a user error.

___________________________

Professor Laurence Marks

Department of Materials Science and  Engineering Northwestern

Universitywww.numis.northwestern.edu <http://www.numis.northwestern.edu>

<http://www.numis.northwestern.edu>

MURI4D.numis.northwestern.edu <http://MURI4D.numis.northwestern.edu> 
<http://MURI4D.numis.northwestern.edu>

Co-Editor, Acta Cryst A

"Research is to see what everybody  else has seen, and to think what

nobody else has thought"

Albert Szent-Gyorgi

On Apr 28, 2015 10:17 PM,  "lung Fermin" <ferminl...@gmail.com 
<mailto:ferminl...@gmail.com>

<mailto:ferminl...@gmail.com>> wrote:

Thanks for Prof. Marks' comment.

1. In the previous email, I have  missed to copy the line

setenv WIEN_MPIRUN  "/usr/local/mvapich2-icc/bin/mpirun -np _NP_

-hostfile _HOSTS_ _EXEC_"

It was in the parallel_option.  Sorry about that.

2. I have checked that the running  program was lapw1c_mpi. Besides,

when the mpi calculation was done  on a single node for some other

system, the results are consistent  with the literature. So I believe

that the mpi code has been setup  and compiled properly.

Would there be something wrong with  my option in siteconfig..? Do I

have to set some command to bind  the job? Any other possible cause of the 
error?

Any suggestions or comments would  be appreciated. Thanks.

Regards,

Fermin

 ----------------------------------------------------------------------

------------------------------

You appear to be missing the line

setenv WIEN_MPIRUN=...

This is setup when you run  siteconfig, and provides the information on

how mpi is run on your system.

N.B., did you setup and compile the  mpi code?

___________________________

Professor Laurence Marks

Department of Materials Science and  Engineering Northwestern

Universitywww.numis.northwestern.edu <http://www.numis.northwestern.edu>

<http://www.numis.northwestern.edu>

MURI4D.numis.northwestern.edu <http://MURI4D.numis.northwestern.edu> 
<http://MURI4D.numis.northwestern.edu>

Co-Editor, Acta Cryst A

"Research is to see what  everybody else has seen, and to think what

nobody else has thought"

Albert Szent-Gyorgi

On Apr 28, 2015 4:22 AM, "lung  Fermin" <ferminl...@gmail.com 
<mailto:ferminl...@gmail.com>

<mailto:ferminl...@gmail.com>> wrote:

Dear Wien2k community,

I am trying to perform calculation  on a system of ~100 in-equivalent

atoms using mpi+k point  parallelization on a cluster. Everything goes

fine when the program was run on a  single node. However, if I perform

the calculation across different  nodes, the follow error occurs. How

to solve this problem? I am a  newbie to mpi programming, any help

would be appreciated. Thanks.

The error message (MVAPICH2 2.0a):

 ----------------------------------------------------------------------

-----------------------------

Warning: no access to tty (Bad file  descriptor).

Thus no job control in this shell.

z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2  z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2

z1-2 z1-2 z1-13 z1-13 z1-13 z1-13  z1-13 z1-13 z1-13 z1-13 z1-13 z1-13

z1

-13 z1-13 z1-13 z1-13 z1-13 z1-13

number of processors: 32

LAPW0 END

 [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node

z1-13 aborted: Error while reading  a PMI socket (4)

[z1-13:mpispawn_0][child_handler]  MPI process (rank: 11, pid: 8546)

terminated with signal 9 ->  abort job

[z1-13:mpispawn_0][readline]  Unexpected End-Of-File on file descriptor

8. MPI process died?

 [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket.

MPI process died?

[z1-2:mpispawn_0][readline]  Unexpected End-Of-File on file descriptor

12. MPI process died?

[z1-2:mpispawn_0][mtpmi_processops]  Error while reading PMI socket.

MPI process died?

[z1-2:mpispawn_0][child_handler]  MPI process (rank: 0, pid: 35454)

terminated with signal 9 ->  abort job

 [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node

z1-2

aborted: MPI process error (1)

[cli_15]: aborting job:

application called  MPI_Abort(MPI_COMM_WORLD, 0) - process 15

  stop error

 ----------------------------------------------------------------------

--------------------------------

The .machines file:

1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2  z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2

z1-2

z1-2 z1-2

1:z1-13 z1-13 z1-13 z1-13 z1-13  z1-13 z1-13 z1-13 z1-13 z1-13 z1-13

z1-13 z1-13 z1-13 z1-13 z1-13

granularity:1

extrafine:1

 ----------------------------------------------------------------------

----------------------------------

The parallel_options:

setenv TASKSET "no"

setenv USE_REMOTE 0

setenv MPI_REMOTE 1

setenv WIEN_GRANULARITY 1





_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


--
-----------------------------------------
Peter Blaha
Inst. Materials Chemistry, TU Vienna
Getreidemarkt 9, A-1060 Vienna, Austria
Tel: +43-1-5880115671
Fax: +43-1-5880115698
email: pbl...@theochem.tuwien.ac.at
-----------------------------------------
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html

Re: [Wien] Error in mpi+k point parallelization across multiple nodes

Reply via email to