I have tried setting MPI_REMOTE to 0 and used 32 cores (on 2 nodes) to distribute the MPI job. However, the problem still persists... but the error message looks different this time:
$> cat *.error
Error in LAPW2
**  testerror: Error in Parallel LAPW2

and the output on screen:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17 z1-17
z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18 z1-18
number of processors: 32
LAPW0 END
[16] Failed to dealloc pd (Device or resource busy)
[0] Failed to dealloc pd (Device or resource busy)
[17] Failed to dealloc pd (Device or resource busy)
[2] Failed to dealloc pd (Device or resource busy)
[18] Failed to dealloc pd (Device or resource busy)
[1] Failed to dealloc pd (Device or resource busy)
LAPW1 END
LAPW2 - FERMI; weighs written
[z1-17:mpispawn_0][child_handler] MPI process (rank: 0, pid: 28291) terminated with signal 9 -> abort job
[z1-17:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 9. MPI process died?
[z1-17:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[z1-17:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-17 aborted: Error while reading a PMI socket (4)
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?
[z1-18:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 21. MPI process died?
[z1-18:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
cp: cannot stat `.in.tmp': No such file or directory
>   stop error

------------------------------------------------------------------------

Try setting "setenv MPI_REMOTE 0" in parallel_options.

On 29.04.2015 at 09:44, lung Fermin wrote:
> Thanks for your comment, Prof. Marks.
>
> Each node on the cluster has 32 GB of memory, and each core (16 in
> total) on the node is limited to 2 GB of memory usage. For the current
> system, I used RKMAX=6, and the smallest RMT=2.25.
>
> I have tested the calculation with a single k-point and mpi on 16
> cores within a node. The matrix size from
>
> $ cat *.nmat_only
>
> is 29138.
>
> Does this mean that the number of matrix elements is 29138 or (29138)^2?
> In general, how shall I estimate the memory required for a calculation?
>
> I have also checked the memory usage with "top" on the node. Each core
> used ~5% of the memory, which adds up to ~80% of the node's memory in
> total. Perhaps the problem really is caused by running out of memory.
> I am now queuing on the cluster to test the case of mpi over 32 cores
> (2 nodes).
>
> Thanks.
>
> Regards,
> Fermin
>
> ------------------------------------------------------------------------
>
> As an addendum, the calculation may be too big for a single node. How
> much memory does the node have, and what are the RKMAX, the smallest
> RMT, and the unit cell size?
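A rough cross-check on the memory question above (a back-of-the-envelope sketch only: it assumes lapw1c_mpi holds two complex nmat x nmat matrices, H and S, at 16 bytes per element, distributed over the ranks, and ignores ScaLAPACK work arrays and everything else):

    # 29138 from *.nmat_only is the matrix dimension,
    # so each matrix has ~29138^2 elements
    echo "2 * 29138^2 * 16" | bc
    # -> 27168737408 bytes, i.e. ~27 GB in total,
    #    or ~1.7 GB per rank when split over 16 cores

That lands right at the 2 GB-per-core limit quoted above, so processes being killed with signal 9 would be consistent with running out of memory.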
> Maybe use in your machines file
>
> 1:z1-2:16 z1-13:16
> lapw0: z1-2:16 z1-13:16
> granularity:1
> extrafine:1
>
> Check the size using
>
> x lapw1 -c -p -nmat_only
> cat *.nmat_only
>
> ___________________________
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu
> MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought" - Albert Szent-Gyorgi
>
> On Apr 28, 2015 10:45 PM, "Laurence Marks" <l-ma...@northwestern.edu> wrote:
>
> Unfortunately it is hard to know what is going on. A Google search on
> "Error while reading PMI socket" indicates that the message you have
> means it did not work, and is not specific. Some suggestions:
>
> a) Try mpiexec (slightly different arguments); you just edit
>    parallel_options. See
>    https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
> b) Try an older version of mvapich2 if it is on the system.
> c) Do you have to launch mpdboot for your system? See
>    https://wiki.calculquebec.ca/w/MVAPICH2/en
> d) Talk to a sysadmin, particularly the one who set up mvapich.
> e) Do "cat *.error"; maybe something else went wrong, or it is not
>    mpi's fault but a user error.
>
> ___________________________
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu
> MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought" - Albert Szent-Gyorgi
>
> On Apr 28, 2015 10:17 PM, "lung Fermin" <ferminl...@gmail.com> wrote:
>
> Thanks for Prof. Marks' comment.
>
> 1. In the previous email, I missed copying the line
>
> setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
>
> It was in parallel_options. Sorry about that.
>
> 2. I have checked that the running program was lapw1c_mpi. Besides,
> when the mpi calculation was done on a single node for some other
> system, the results were consistent with the literature. So I believe
> that the mpi code has been set up and compiled properly.
>
> Could there be something wrong with my options in siteconfig? Do I
> have to set some command to bind the job? Is there any other possible
> cause of the error?
>
> Any suggestions or comments would be appreciated. Thanks.
>
> Regards,
> Fermin
>
> ------------------------------------------------------------------------
>
> You appear to be missing the line
>
> setenv WIEN_MPIRUN=...
>
> This is set up when you run siteconfig, and provides the information
> on how mpi is run on your system.
>
> N.B., did you set up and compile the mpi code?
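On suggestion (a) above: switching to the Hydra process manager only means editing the WIEN_MPIRUN line in parallel_options. A sketch, assuming an mpiexec binary ships in the same MVAPICH2 directory as used in this thread (check with "which mpiexec"; Hydra takes -f where mpirun_rsh takes -hostfile):

    # current launcher (mpirun_rsh style):
    setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpirun -np _NP_ -hostfile _HOSTS_ _EXEC_"
    # hypothetical Hydra variant:
    setenv WIEN_MPIRUN "/usr/local/mvapich2-icc/bin/mpiexec -np _NP_ -f _HOSTS_ _EXEC_"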
> ___________________________
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> www.numis.northwestern.edu
> MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought" - Albert Szent-Gyorgi
>
> On Apr 28, 2015 4:22 AM, "lung Fermin" <ferminl...@gmail.com> wrote:
>
> Dear Wien2k community,
>
> I am trying to perform a calculation on a system of ~100 inequivalent
> atoms using mpi + k-point parallelization on a cluster. Everything
> goes fine when the program is run on a single node. However, if I
> perform the calculation across different nodes, the following error
> occurs. How can I solve this problem? I am a newbie to mpi
> programming; any help would be appreciated. Thanks.
>
> The error message (MVAPICH2 2.0a):
>
> ------------------------------------------------------------------------
>
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
> z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> number of processors: 32
> LAPW0 END
> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-13 aborted: Error while reading a PMI socket (4)
> [z1-13:mpispawn_0][child_handler] MPI process (rank: 11, pid: 8546) terminated with signal 9 -> abort job
> [z1-13:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 8. MPI process died?
> [z1-13:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [z1-2:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 12. MPI process died?
> [z1-2:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [z1-2:mpispawn_0][child_handler] MPI process (rank: 0, pid: 35454) terminated with signal 9 -> abort job
> [z1-2:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node z1-2 aborted: MPI process error (1)
> [cli_15]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 0) - process 15
> >   stop error
>
> ------------------------------------------------------------------------
>
> The .machines file:
>
> #
> 1:z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2 z1-2
> 1:z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13 z1-13
> granularity:1
> extrafine:1
>
> ------------------------------------------------------------------------
>
> The parallel_options:
>
> setenv TASKSET "no"
> setenv USE_REMOTE 0
> setenv MPI_REMOTE 1
> setenv WIEN_GRANULARITY 1
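Putting the thread's two fixes together (MPI_REMOTE 0, and the node:count form of the .machines file suggested above), the edited files would look roughly like this; a sketch only, assuming 16 cores per node and the hostnames from the original post:

    # .machines -- one 32-core MPI job spanning both nodes
    1:z1-2:16 z1-13:16
    lapw0: z1-2:16 z1-13:16
    granularity:1
    extrafine:1

    # parallel_options -- launch mpirun from the local node
    # rather than via a remote shell
    setenv TASKSET "no"
    setenv USE_REMOTE 0
    setenv MPI_REMOTE 0
    setenv WIEN_GRANULARITY 1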