Re: [Wien] ** testerror: Error in Parallel LAPW
The example you showed us was a k-parallel job on only one node. To fix this, just set USE_REMOTE to zero (either permanently in $WIENROOT, or temporarily in your submitted job script). Another test would be to make a new WIEN2k installation using "ifort+slurm" in siteconfig. It may work out of the box, in particular when using mpi-parallel. It uses srun, but I'm not sure whether all slurm configurations are identical to your cluster's.

On 21.06.2023 at 22:58, Ilias Miroslav, doc. RNDr., PhD. wrote: [quote trimmed; the original message appears below in this thread]

--
Peter Blaha, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.bl...@tuwien.ac.at   WWW: http://www.imc.tuwien.ac.at   WIEN2k: http://www.wien2k.at
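For concreteness, a minimal sketch of the "temporarily in your submitted job script" variant, assuming a bash slurm script and the file named later in this thread ($WIENROOT/WIEN2k_parallel_options); note that editing that file in place affects every job using the same installation, so this is only a sketch, not the definitive recipe:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=8
    # Force the no-ssh, single-node k-parallel mode by switching USE_REMOTE off.
    # WIEN2k_parallel_options holds csh-style "setenv" lines; a private copy of
    # the installation would be cleaner than editing the shared one with sed -i.
    sed -i 's/^setenv USE_REMOTE .*/setenv USE_REMOTE 0/' "$WIENROOT/WIEN2k_parallel_options"
    # assumes a .machines file (e.g. eight "1:localhost" lines) already exists
    # in the case directory
    run_lapw -p -ec 0.0001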
Re: [Wien] ** testerror: Error in Parallel LAPW
With apologies to Lukasz and Miro, there are some inaccurate statements being made about how to use WIEN2k in parallel -- the code is more complex (and smarter). Please read section 5.5 carefully and in detail, then read it again. Google what commands such as ssh, srun, rsh, and mpirun do. If your cluster does not allow ssh to the slurm-allocated nodes, then try to get your admin to read section 5.5. There are ways to get around forbidden ssh, but you need to understand computers first.

---
Professor Laurence Marks (Laurie)
Department of Materials Science and Engineering, Northwestern University
www.numis.northwestern.edu
"Research is to see what everybody else has seen, and to think what nobody else has thought" - Albert Szent-Györgyi

On Thu, Jun 22, 2023, 00:23 pluto via Wien wrote: [quote trimmed; the full message appears below in this thread]
Re: [Wien] ** testerror: Error in Parallel LAPW
Dear Miro,

On my cluster it works by the command

    salloc -p cluster_name -N6 sleep infinity &

This particular command allocates 6 nodes. You can find which ones with the squeue command. Passwordless ssh to these nodes is then allowed in my cluster, so in .machines I include the names of these nodes and things work; see the sketch after this message.

But there is a big chance that this is blocked in your cluster; you need to ask your administrator.

I think srun is the required command within the slurm shell script. You should get some example shell scripts from your administrator or from colleagues who use the cluster.

As I mentioned in my earlier email, Prof. Blaha provides workarounds for slurm. If the simple ways are blocked, you will just need to implement these workarounds. It might not be easy, but setting up cluster calculations is not supposed to be easy.

Best,
Lukasz

On 2023-06-21 22:58, Ilias Miroslav, doc. RNDr., PhD. wrote: [quote trimmed; the original message appears below in this thread]
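As a sketch of the .machines file Lukasz describes (the node names are hypothetical; the granularity lines follow the usual user's guide examples):

    # after: salloc -p cluster_name -N6 sleep infinity &
    squeue -u $USER    # shows the six allocated nodes, e.g. node001..node006

    # .machines, one k-parallel job per allocated node (names hypothetical):
    1:node001
    1:node002
    1:node003
    1:node004
    1:node005
    1:node006
    granularity:1
    extrafine:1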
Re: [Wien] ** testerror: Error in Parallel LAPW
Dear all,

ad: https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg22588.html

"In order to use multiple nodes, you need to be able to do passwordless ssh to the allocated nodes (or any other command substituting ssh)."

According to our cluster admin, one can (maybe) use 'srun' to allocate and connect to a batch node. https://hpc.gsi.de/virgo/slurm/resource_allocation.html

Would it be possible to use "srun" within WIEN2k scripts to run parallel jobs, please? We are using common disk space on that cluster.

Best, Miro
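A quick, generic way to check whether srun can stand in for ssh here (the node name lxbk1177 is taken from the log later in this thread; the flags are standard slurm, but whether this is allowed depends on the site configuration):

    # from inside the job/allocation: run a command on one specific
    # allocated node without ssh
    srun --nodes=1 --ntasks=1 -w lxbk1177 hostname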
Re: [Wien] ** testerror: Error in Parallel LAPW
> it crashed with the message "Host key verification failed."

Seems that your cluster does not allow ssh to an allocated node (ask your sysadmin). In $WIENROOT/WIEN2k_parallel_options there are variables like USE_REMOTE. If set to zero, ssh is not used and you can run in parallel, but only on one shared-memory node. In order to use multiple nodes, you need to be able to do passwordless ssh to the allocated nodes (or any other command substituting ssh).

[remainder of the quoted message, with the .machines file and the full "Host key verification failed" log, trimmed; the original appears below in this thread]
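When ssh (or a substitute) to the allocated nodes does work, a common pattern is to generate .machines from the slurm allocation inside the job script. A minimal sketch, assuming a bash job script and the shared filesystem mentioned in this thread:

    #!/bin/bash
    # Build a k-point-parallel .machines file, one entry per allocated node.
    # scontrol expands the compact slurm nodelist (e.g. lxbk[1177-1182]).
    rm -f .machines
    for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
        echo "1:$node" >> .machines
    done
    echo "granularity:1" >> .machines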
Re: [Wien] ** testerror: Error in Parallel LAPW
The "Host key verification failed" is an error from ssh [1]. Thus, it seems like you need to fix your ssh so that WIEN2k can connect to your remote node (in your case below, it looks like the remote node is lxbk1177). It looks like there is an ssh example on slide 10 of the WIEN2k presentation at [2]. I believe it is common to use ssh-keygen to create a key pair (private and public key) on the head node and then use ssh-copy-id to put the public key on each of the remote nodes. However, ssh can be uniquely configured for a computer system. So, you might want to search online for different examples on how ssh has been configured. One example that might be helpful should be at [3]. [1] https://askubuntu.com/questions/45679/ssh-connection-problem-with-host-key-verification-failed-error [2] https://www.bc.edu/content/dam/bc1/schools/mcas/physics/pdf/wien2k/PB-installation.pdf [3] https://www.digitalocean.com/community/tutorials/ssh-essentials-working-with-ssh-servers-clients-and-keys Kind Regards, Gavin WIEN2k user On 6/20/2023 3:25 PM, Ilias Miroslav, doc. RNDr., PhD. wrote: Dear Professor Blaha, thanks, I used PATH variable extension instead of linking; it crashed with the message "Host key verification failed. " Herethe content of file /lustre/ukt/milias/scratch/Wien2k_23.2_job.main.N1.n4.jid3009460/LvO2onQg/.machines: 1:lxbk1177 1:lxbk1177 1:lxbk1177 1:lxbk1177 1:lxbk1177 1:lxbk1177 1:lxbk1177 1:lxbk1177 Job is running on lxbk1177, with 8 cpus allocated; and this is from log : running x dstart : starting parallel dstart at Tue 20 Jun 2023 05:16:21 PM CEST .machine0 : processors running dstart in single mode STOP DSTART ENDS 10.249u 0.322s 0:11.19 94.3% 0+0k 158496+101160io 437pf+0w running 'run_lapw -p -ec 0.0001 -NI' STOP LAPW0 END Host key verification failed. [1] + Done ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerr or_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > . temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr " ) Host key verification failed. [1] + Done ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr " ) Host key verification failed. [1] + Done ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr " ) Host key verification failed. [1] + Done ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr " ) Host key verification failed. 
[1] + Done ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr " ) Host key verification failed. [1] + Done ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr " ) Host key verification failed. [1] + Done ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr " ) Host key verification failed. [1] Done ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \%
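A sketch of the generic setup Gavin describes, plus the part that targets this specific error: "Host key verification failed" comes from an unknown or mismatched entry in ~/.ssh/known_hosts, not from a missing key pair. The host name is taken from this thread; whether any of this is permitted depends on the cluster:

    # key pair on the head node, public key pushed to the compute node
    ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
    ssh-copy-id -i ~/.ssh/id_ed25519.pub lxbk1177

    # the reported error is about host keys: pre-populate known_hosts
    # (ssh-keyscan trusts the network; check your site's security policy)
    ssh-keyscan lxbk1177 >> ~/.ssh/known_hosts

    # or, in ~/.ssh/config (OpenSSH >= 7.6), accept new host keys automatically:
    # Host lxbk*
    #     StrictHostKeyChecking accept-new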
Re: [Wien] ** testerror: Error in Parallel LAPW
Dear Professor Blaha,

thanks, I used PATH variable extension instead of linking; it crashed with the message "Host key verification failed."

Here the content of file /lustre/ukt/milias/scratch/Wien2k_23.2_job.main.N1.n4.jid3009460/LvO2onQg/.machines:

1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177

Job is running on lxbk1177, with 8 cpus allocated; and this is from log:

running x dstart : starting parallel dstart at Tue 20 Jun 2023 05:16:21 PM CEST
.machine0 : processors running dstart in single mode
STOP DSTART ENDS
10.249u 0.322s 0:11.19 94.3% 0+0k 158496+101160io 437pf+0w
running 'run_lapw -p -ec 0.0001 -NI'
STOP LAPW0 END
Host key verification failed.
[1] + Done ( ( $remote $machine[$p] "cd $PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | perl -e "print stderr " )
[the same "Host key verification failed." / "[1] Done ( ... )" pair repeats seven more times, once per entry in .machines]
LvO2onQg.scf1_1: No such file or directory.
grep: *scf1*: No such file or directory
STOP FERMI - Error
cp: cannot stat '.in.tmp': No such file or directory
grep: *scf1*: No such file or directory
> stop error

file ":parallel"
starting parallel lapw1 at Tue 20 Jun 2023 05:17:08 PM CEST
lxbk1177(4) lxbk1177(3) lxbk1177(3) lxbk1177(3) lxbk1177(3) lxbk1177(3) lxbk1177(3) lxbk1177(3)
Summary of lapw1para:
lxbk1177 k=25 user=0 wallclock=0
<- done at Tue 20 Jun 2023 05:17:14 PM CEST
-> starting Fermi on lxbk1177 at Tue 20 Jun 2023 05:17:15 PM CEST
** LAPW2 crashed at Tue 20 Jun 2023 05:17:16 PM CEST
** check ERROR FILES!
Re: [Wien] ** testerror: Error in Parallel LAPW
Dear Miro,

It is hard to give you a meaningful answer with so little info, but I will try my best guess, because I needed to set this up recently. I assume that you want to use k-parallel and you don't have mpi.

With a serial job you automatically run on a single node. A single node is a physical computer with a physical CPU, typically with 4 memory channels, so it can run 8 jobs in parallel. With k-parallel you need to define the nodes on which the k-points are calculated. With slurm, maybe things will work if you create 8 "localhost" lines in the .machines file, because this will still run on a single node that is assigned automatically; see the sketch after this message. But things probably won't work if you create lines such as "node001", "node002", etc. (depending on the names of the nodes in your cluster). And to take advantage of the cluster you need to use as many nodes as possible.

Now the problem is that k-parallel works assuming you can ssh to every node without a password. This is typically forbidden in the slurm environment. Prof. Blaha provides workarounds, but to me their implementation seems complicated (I am not an expert): http://www.wien2k.at/reg_user/faq/pbs.html

I am using an older cluster where it is possible to allocate nodes, and with this allocation comes automatic passwordless ssh to these nodes. Then the slurm workarounds are not needed. Maybe you can talk to your administrator about whether this is possible in your cluster, because I think it is typically blocked.

Best,
Lukasz

On 2023-06-20 10:18, Ilias Miroslav, doc. RNDr., PhD. wrote: [quote trimmed; the original message appears below in this thread]
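The single-node variant Lukasz mentions, as a minimal sketch (eight k-parallel jobs on whichever node slurm assigns; granularity lines as in the usual user's guide examples):

    # .machines for eight k-parallel jobs on the local node
    1:localhost
    1:localhost
    1:localhost
    1:localhost
    1:localhost
    1:localhost
    1:localhost
    1:localhost
    granularity:1
    extrafine:1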
Re: [Wien] ** testerror: Error in Parallel LAPW
Well, the important files (output and error log files from slurm, the dayfile and all error files, .machines) are not present. Also, in this way it is nearly impossible to see which files are recent and which are older ones. The scripts are so complicated that I cannot follow them.

Why are you linking all wien executables?? You can set a PATH (in case the environment is not transferred).

Anyway, in the files I found some lapw1.error_SAVED file. It contains:

'SELECT' - E-bottom -520.0 E-top 1.94197
'SELECT' - no energy limits found for atom 2 L= 0

I don't know if this is from the serial or the parallel run, but in any case this error has nothing to do with parallelization. Either you used a different case.in1 file, or ?

The error you reported is a follow-up error, because lapw1 did not run properly; at the least, this energy_1 file should be produced by lapw1.

On 20.06.2023 at 10:18, Ilias Miroslav, doc. RNDr., PhD. wrote: [quote trimmed; the original message appears below in this thread]

--
Peter Blaha, Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.bl...@tuwien.ac.at   WWW: http://www.imc.tuwien.ac.at   WIEN2k: http://www.wien2k.at
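A minimal sketch of the PATH approach Peter suggests, assuming a bash slurm script; the install path is hypothetical:

    # extend PATH instead of symlinking every WIEN2k executable into the
    # scratch directory (/opt/wien2k-23.2 is a hypothetical install path)
    export WIENROOT=/opt/wien2k-23.2
    export PATH="$WIENROOT:$PATH"
    export SCRATCH=./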
[Wien] ** testerror: Error in Parallel LAPW
Hello,

I am able to run a serial SCF via SLURM, https://github.com/miroi/open-collection/blob/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg/virgo_slurm_wien2kgnupar_fromdstart.01, but when trying parallel, https://github.com/miroi/open-collection/blob/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg/virgo_slurm_wien2kgnupar_fromdstart.02, I get lapw2.error:

'LAPW2' - can't open unit: 30
'LAPW2' - filename: LvO2onQg.energy_1
** testerror: Error in Parallel LAPW2

The file "LvO2onQg.energy" is correct in serial mode. It seems the LvO2onQg.energy_1 file is not produced in the parallel run? All files are at https://github.com/miroi/open-collection/tree/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg

Best, Miro