Re: [Wien] ** testerror: Error in Parallel LAPW

2023-06-21 Thread Peter Blaha

The example you showed us was a k-parallel job on only one node.

To fix this, just set USE_REMOTE to zero (either permanently in $WIENROOT, 
or temporarily in your submitted job script).
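
A minimal sketch of both options (assuming a csh/tcsh job script and a 
standard installation; whether the in-script setting overrides the file 
depends on how your installation sources it):

# permanent: edit the corresponding line in $WIENROOT/WIEN2k_parallel_options
setenv USE_REMOTE 0

# temporary: set it in the submitted job script before calling run_lapw
setenv USE_REMOTE 0
run_lapw -p -ec 0.0001 -NI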



Another test would be to make a new wien2k-installation using 
"ifort+slurm" in siteconfig. It may work out of the box, in particular 
when using mpi-parallel. It uses srun, but I'm not sure whether all 
slurm configurations are identical to your cluster's.
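
As a generic illustration only (standard slurm, not the WIEN2k mechanism 
itself; the node name is just the one from your earlier log), srun can start 
a command on an already allocated node and thus substitute for ssh:

# from inside an sbatch/salloc allocation:
srun -N1 -n1 -w lxbk1177 hostname   # runs 'hostname' on node lxbk1177 without ssh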



On 21.06.2023 at 22:58, Ilias Miroslav, doc. RNDr., PhD. wrote:

Dear all,

ad: 
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg22588.html 

" In order to use multiple nodes, you need to be able to do 
passwordless ssh to the allocated nodes (or any other command 
substituting ssh). "


According to our cluster admin, one can use (maybe) 'srun' to allocate 
and connect to a batch node. 
https://hpc.gsi.de/virgo/slurm/resource_allocation.html


Would it be possible to use "srun" within Wien2k scripts to run 
parallel jobs, please? We are using common disk space on that cluster.


Best, Miro



--
---
Peter Blaha,  Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.bl...@tuwien.ac.at
WWW: http://www.imc.tuwien.ac.at    WIEN2k: http://www.wien2k.at

-


Re: [Wien] ** testerror: Error in Parallel LAPW

2023-06-21 Thread Laurence Marks
With apologies to Lukasz and Miro, there are some inaccurate statements
being made about how to use Wien2k in parallel -- the code is more complex
(and smarter). Please read section 5.5 carefully and in detail, then read it
again. Google what commands such as ssh, srun, rsh, mpirun do.

If your cluster does not have ssh to the slurm allocated nodes, then try
and get your admin to read section 5.5. There are ways to get around
forbidden ssh, but you need to understand computers first.

---
Professor Laurence Marks (Laurie)
Department of Materials Science and Engineering
Northwestern University
www.numis.northwestern.edu
"Research is to see what everybody else has seen, and to think what nobody
else has thought" Albert Szent-Györgyi

On Thu, Jun 22, 2023, 00:23 pluto via Wien wrote:

> Dear Miro,
>
> On my cluster it works by a command
>
> salloc -p cluster_name -N6 sleep infinity &
>
> This particular command allocates 6 nodes. You can find out which ones with
> the squeue command. Passwordless ssh to these nodes is then allowed on my
> cluster. Then in .machines I include the names of these nodes and things
> work.
>
> But there is a big chance that this is blocked on your cluster; you need
> to ask your administrator.
>
> I think srun is the required command within the slurm shell script. You
> should get some example shell scripts from your administrator or
> colleagues who use the cluster.
>
> As I mentioned in my earlier email, Prof. Blaha provides workarounds for
> slurm. If simple ways are blocked, you will just need to implement these
> workarounds. It might not be easy, but setting up cluster calculations
> is not supposed to be easy.
>
> Best,
> Lukasz
>
>
>
>
> On 2023-06-21 22:58, Ilias Miroslav, doc. RNDr., PhD. wrote:
> > Dear all,
> >
> >  ad:
> >
> https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg22588.html
> > [1]
> >
> >  " In order to use multiple nodes, you need to be able to do
> > passwordless ssh to the allocated nodes (or any other command
> > substituting ssh). "
> >
> >  According to our cluster admin, one  can use (maybe) 'srun' to
> > allocate and connect to a batch node.
> > https://hpc.gsi.de/virgo/slurm/resource_allocation.html [2]
> >
> >  Would it be possible to use "srun" within Wien2k scripts to run
> > parallel jobs, please?  We are using common disk space on that
> > cluster.
> >
> >  Best, Miro
> >
> >
> > Links:
> > --
> > [1]
> >
> https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg22588.html
> > [2] https://hpc.gsi.de/virgo/slurm/resource_allocation.html
>


Re: [Wien] ** testerror: Error in Parallel LAPW

2023-06-21 Thread pluto via Wien

Dear Miro,

On my cluster it works by a command

salloc -p cluster_name -N6 sleep infinity &

This particular command allocates 6 nodes. You can find out which ones with 
the squeue command. Passwordless ssh to these nodes is then allowed on my 
cluster. Then in .machines I include the names of these nodes and things 
work.
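
For illustration, a .machines file for such an allocation might look roughly 
like this (k-point parallel, hypothetical node names; see section 5.5 of the 
userguide for the exact syntax):

1:node001
1:node001
1:node002
1:node002
1:node003
1:node003
granularity:1
extrafine:1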


But there is a big chance that this is blocked on your cluster; you need 
to ask your administrator.


I think srun is the required command within the slurm shell script. You 
should get some example shell scripts from your administrator or 
colleagues who use the cluster.


As I mentioned in my earlier email, Prof. Blaha provides workarounds for 
slurm. If simple ways are blocked, you will just need to implement these 
workarounds. It might not be easy, but setting up cluster calculations 
is not supposed to be easy.


Best,
Lukasz




On 2023-06-21 22:58, Ilias Miroslav, doc. RNDr., PhD. wrote:

Dear all,

 ad:
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg22588.html
[1]

 " In order to use multiple nodes, you need to be able to do
passwordless ssh to the allocated nodes (or any other command
substituting ssh). "

 According to our cluster admin, one  can use (maybe) 'srun' to
allocate and connect to a batch node.
https://hpc.gsi.de/virgo/slurm/resource_allocation.html [2]

 Would it be possible to use "srun" within Wien2k scripts to run
parallel jobs, please?  We are using common disk space on that
cluster.

 Best, Miro


Links:
--
[1] 
https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg22588.html

[2] https://hpc.gsi.de/virgo/slurm/resource_allocation.html



Re: [Wien] ** testerror: Error in Parallel LAPW

2023-06-21 Thread Ilias Miroslav, doc. RNDr., PhD.
Dear all,

ad: https://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg22588.html
" In order to use multiple nodes, you need to be able to do passwordless ssh to 
the allocated nodes (or any other command substituting ssh). "

According to our cluster admin, one  can use (maybe) 'srun' to allocate and 
connect to a batch node. https://hpc.gsi.de/virgo/slurm/resource_allocation.html

Would it be possible to use "srun" within Wien2k scripts to run parallel jobs, 
please? We are using common disk space on that cluster.

Best, Miro


Re: [Wien] ** testerror: Error in Parallel LAPW

2023-06-21 Thread Peter Blaha



it  crashed with the message  "Host key verification failed. "

Seems that your cluster does not allow ssh to an allocated node. (Ask 
your sys admin.)


In $WIENROOT/WIEN2k_parallel_options there are variables like 
USE_REMOTE. If set to zero, ssh is not used and you can run in 
parallel, but only on one shared-memory node.


In order to use multiple nodes, you need to be able to do passwordless 
ssh to the allocated nodes (or any other command substituting ssh).
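
The settings in that file are plain csh setenv lines, so the relevant entry 
is roughly the sketch below (the other variables and their values depend on 
your siteconfig answers and may differ on your installation):

setenv USE_REMOTE 0    # 0: no ssh, k-parallel only on the local shared-memory node
setenv MPI_REMOTE 0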



Here is the content of the file 
/lustre/ukt/milias/scratch/Wien2k_23.2_job.main.N1.n4.jid3009460/LvO2onQg/.machines:

1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177

Job is running on lxbk1177, with 8 cpus allocated;

and this is from log :

running x dstart :
starting parallel dstart at Tue 20 Jun 2023 05:16:21 PM CEST
 .machine0 : processors
running dstart in single mode
STOP DSTART ENDS
10.249u 0.322s 0:11.19 94.3%    0+0k 158496+101160io 437pf+0w

running 'run_lapw -p -ec 0.0001 -NI'
STOP  LAPW0 END
Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerr
or_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; 
if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .
temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% 
.temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]    Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

LvO2onQg.scf1_1: No such file or directory.
grep: *scf1*: No such file or directory
STOP FERMI - Error
cp: cannot stat '.in.tmp': No such file or directory
grep: *scf1*: No such file or directory

>   stop error



file ":parallel"

starting parallel lapw1 at Tue 20 Jun 2023 05:17:08 PM CEST
lxbk1177(4)  lxbk1177(3)  lxbk1177(3)  lxbk1177(3)  lxbk1177(3)  lxbk1177(3)  lxbk1177(3)  lxbk1177(3)
   Summary of lapw1para:
  lxbk1177  k=25    user=0  wallclock=0
<-  done at Tue 20 Jun 2023 05:17:14 PM CEST
-
->  starting Fermi on lxbk1177 at Tue 20 Jun 2023 05:17:15 PM CEST
**  LAPW2 crashed at Tue 20 Jun 

Re: [Wien] ** testerror: Error in Parallel LAPW

2023-06-20 Thread Gavin Abo

The "Host key verification failed" is an error from ssh [1].


Thus, it seems like you need to fix your ssh so that WIEN2k can connect 
to your remote node (in your case below, it looks like the remote node 
is lxbk1177).



It looks like there is an ssh example on slide 10 of the WIEN2k 
presentation at [2].



I believe it is common to use ssh-keygen to create a key pair (private 
and public key) on the head node and then use ssh-copy-id to put the 
public key on each of the remote nodes.  However, ssh can be uniquely 
configured for a computer system.  So, you might want to search online 
for different examples of how ssh has been configured.  One example that 
might be helpful is at [3].
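
A rough sketch of those usual steps (standard OpenSSH commands; the user and 
node names below are only placeholders taken from your log, and your cluster 
policy may of course forbid this):

ssh-keygen -t ed25519                        # create a key pair on the head/login node
ssh-copy-id milias@lxbk1177                  # install the public key on each compute node
ssh-keyscan lxbk1177 >> ~/.ssh/known_hosts   # pre-accept the host key ("Host key verification failed" points here)
ssh lxbk1177 hostname                        # should now work without any prompt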



[1] 
https://askubuntu.com/questions/45679/ssh-connection-problem-with-host-key-verification-failed-error


[2] 
https://www.bc.edu/content/dam/bc1/schools/mcas/physics/pdf/wien2k/PB-installation.pdf


[3] 
https://www.digitalocean.com/community/tutorials/ssh-essentials-working-with-ssh-servers-clients-and-keys



Kind Regards,

Gavin

WIEN2k user


On 6/20/2023 3:25 PM, Ilias Miroslav, doc. RNDr., PhD. wrote:

Dear Professor Blaha,

thanks, I used PATH variable extension instead of linking;

it  crashed with the message  "Host key verification failed. "

Here is the content of the file 
/lustre/ukt/milias/scratch/Wien2k_23.2_job.main.N1.n4.jid3009460/LvO2onQg/.machines:

1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177

Job is running on lxbk1177, with 8 cpus allocated;

and this is from log :

running x dstart :
starting parallel dstart at Tue 20 Jun 2023 05:16:21 PM CEST
 .machine0 : processors
running dstart in single mode
STOP DSTART ENDS
10.249u 0.322s 0:11.19 94.3%    0+0k 158496+101160io 437pf+0w

running 'run_lapw -p -ec 0.0001 -NI'
STOP  LAPW0 END
Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerr
or_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; 
if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .
temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% 
.temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; 
grep -v \% .temp1_$loop | perl -e "print stderr " )

Host key verification failed.
[1]    Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def 
;fixerror_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw 
.stdout1_$loop > .temp1_$loop; grep \% 

Re: [Wien] ** testerror: Error in Parallel LAPW

2023-06-20 Thread Ilias Miroslav, doc. RNDr., PhD.
Dear Professor Blaha,

thanks, I used PATH variable extension instead of linking;

it  crashed with the message  "Host key verification failed. "

Here is the content of the file 
/lustre/ukt/milias/scratch/Wien2k_23.2_job.main.N1.n4.jid3009460/LvO2onQg/.machines:
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177
1:lxbk1177

Job is running on lxbk1177, with 8 cpus allocated;

and this is from log :

running x dstart :
starting parallel dstart at Tue 20 Jun 2023 05:16:21 PM CEST
 .machine0 : processors
running dstart in single mode
STOP DSTART ENDS
10.249u 0.322s 0:11.19 94.3%0+0k 158496+101160io 437pf+0w

running 'run_lapw -p -ec 0.0001 -NI'
STOP  LAPW0 END
Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerr
or_lapw ${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdout1_$loop; if ( -f 
.stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > .
temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | 
perl -e "print stderr " )
Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > 
.temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | 
perl -e "print stderr " )
Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > 
.temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | 
perl -e "print stderr " )
Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > 
.temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | 
perl -e "print stderr " )
Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > 
.temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | 
perl -e "print stderr " )
Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > 
.temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | 
perl -e "print stderr " )
Host key verification failed.
[1]  + Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > 
.temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | 
perl -e "print stderr " )
Host key verification failed.
[1]Done  ( ( $remote $machine[$p] "cd 
$PWD;$set_OMP_NUM_THREADS;$t $taskset0 $exe ${def}_$loop.def ;fixerror_lapw 
${def}_$loop"; rm -f .lock_$lockfile[$p] ) >& .stdo
ut1_$loop; if ( -f .stdout1_$loop ) bashtime2csh.pl_lapw .stdout1_$loop > 
.temp1_$loop; grep \% .temp1_$loop >> .time1_$loop; grep -v \% .temp1_$loop | 
perl -e "print stderr " )
LvO2onQg.scf1_1: No such file or directory.
grep: *scf1*: No such file or directory
STOP FERMI - Error
cp: cannot stat '.in.tmp': No such file or directory
grep: *scf1*: No such file or directory

>   stop error



file ":parallel"

starting parallel lapw1 at Tue 20 Jun 2023 05:17:08 PM CEST
lxbk1177(4)  lxbk1177(3)  lxbk1177(3)  lxbk1177(3)  lxbk1177(3)  lxbk1177(3)  lxbk1177(3)  lxbk1177(3)
   Summary of lapw1para:
  lxbk1177  k=25  user=0  wallclock=0
<-  done at Tue 20 Jun 2023 05:17:14 PM CEST
-
->  starting Fermi on lxbk1177 at Tue 20 Jun 2023 05:17:15 PM CEST
**  LAPW2 crashed at Tue 20 Jun 2023 05:17:16 PM CEST
**  check ERROR FILES!
-





Re: [Wien] ** testerror: Error in Parallel LAPW

2023-06-20 Thread pluto via Wien

Dear Miro,

It is hard to give you a meaningful answer with so little info, but I will 
try my best guess, because I needed to set this up recently. I assume 
that you want to use k-parallel and you don't have mpi.


With a serial job you automatically run on a single node. A single node is 
a physical computer with a physical CPU; typically it has 4 memory 
channels, so it can run 8 jobs in parallel.


With k-parallel you need to define the nodes on which the k-points are 
calculated. With slurm, maybe things will work if you create 8 
"localhost" lines in the .machines file, because this will still run on a 
single node that is assigned automatically. But things probably won't 
work if you create lines such as "node001", "node002" etc. (depending on 
the names of the nodes in your cluster). And to take advantage of the 
cluster you need to use as many nodes as possible.
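
For example, a single-node .machines file might look roughly like this 
(8 k-parallel jobs on the automatically assigned node; just a sketch, see 
the userguide for the exact syntax):

1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
granularity:1
extrafine:1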


Now the problem is that k-parallel works assuming you can ssh to every 
node without a password. This is typically forbidden in the slurm 
environment. Prof. Blaha provides workarounds, but to me their 
implementation seems complicated (I am not an expert): 
http://www.wien2k.at/reg_user/faq/pbs.html


I am using an older cluster where it is possible to allocate nodes, and 
with this allocation automatically comes passwordless ssh to these 
nodes. Then the slurm workarounds are not needed. Maybe you can ask 
your administrator whether this is possible in your cluster, because I 
think typically this is blocked.


Best,
Lukasz





On 2023-06-20 10:18, Ilias Miroslav, doc. RNDr., PhD. wrote:

Hello,

 I am able to run serial SCF via SLURM


https://github.com/miroi/open-collection/blob/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg/virgo_slurm_wien2kgnupar_fromdstart.01


 but when trying parallel

https://github.com/miroi/open-collection/blob/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg/virgo_slurm_wien2kgnupar_fromdstart.02


 I get lapw2.error

 'LAPW2' - can't open unit: 30

'LAPW2' -filename: LvO2onQg.energy_1

**  testerror: Error in Parallel LAPW2

 The file "LvO2onQg.energy" is correct in serial mode.

 Seems that the LvO2onQg.energy_1 file is not produced in the parallel run?

 All files are
https://github.com/miroi/open-collection/tree/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg
[1]

 Best,

 Miro


Links:
--
[1]
https://github.com/miroi/open-collection/tree/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg



Re: [Wien] ** testerror: Error in Parallel LAPW

2023-06-20 Thread Peter Blaha
Well, the important files (output and error log files from slurm, the 
dayfile and all error files, .machines) are not present.


Also, in this way it is nearly impossible to see which files are recent 
and which are older ones.



The scripts are so complicated that I cannot follow them.  Why are 
you linking all wien-executables?  You can set a PATH (in case the 
environment is not transferred).



Anyway, in the files I found some lapw1.error_SAVED file. It contains


'SELECT' - E-bottom -520.0 E-top 1.94197

'SELECT' - no energy limits found for atom 2 L= 0


I don't know if this is from the serial or parallel run, but in any 
case, this error has nothing to do with parallelization.


Either you used a different case.in1 file or  ?


The error you reported is a follow-up error because lapw1 did not run 
properly; at the least, this energy_1 file should be produced by lapw1.




On 20.06.2023 at 10:18, Ilias Miroslav, doc. RNDr., PhD. wrote:

Hello,

I am able to run serial SCF via SLURM
https://github.com/miroi/open-collection/blob/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg/virgo_slurm_wien2kgnupar_fromdstart.01

but when trying parallel
https://github.com/miroi/open-collection/blob/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg/virgo_slurm_wien2kgnupar_fromdstart.02 



I get lapw2.error

'LAPW2' - can't open unit: 30
'LAPW2' -    filename: LvO2onQg.energy_1
**  testerror: Error in Parallel LAPW2

The file "LvO2onQg.energy" is correct in serial mode.

Seems that the LvO2onQg.energy_1 file is not produced in the parallel run?

All files are 
https://github.com/miroi/open-collection/tree/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg


Best,

Miro



--
---
Peter Blaha,  Inst. f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-158801165300
Email: peter.bl...@tuwien.ac.at
WWW: http://www.imc.tuwien.ac.at    WIEN2k: http://www.wien2k.at

-


[Wien] ** testerror: Error in Parallel LAPW

2023-06-20 Thread Ilias Miroslav, doc. RNDr., PhD.
Hello,

I am able to run serial SCF via SLURM
https://github.com/miroi/open-collection/blob/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg/virgo_slurm_wien2kgnupar_fromdstart.01

but when trying parallel
https://github.com/miroi/open-collection/blob/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg/virgo_slurm_wien2kgnupar_fromdstart.02

I get lapw2.error

'LAPW2' - can't open unit: 30
'LAPW2' -filename: LvO2onQg.energy_1
**  testerror: Error in Parallel LAPW2

The file "LvO2onQg.energy" is correct in serial mode.

Seems that the LvO2onQg.energy_1 file is not produced in the parallel run?

All files are 
https://github.com/miroi/open-collection/tree/master/theoretical_chemistry/software/wien2k/runs/LvO2_on_small_quartz/wien2k/LvO2onQg

Best,

Miro