The .machines file looks fine to me, but one of the others might see
something that I didn't notice (besides the WIEN2k command not being
there at the bottom of the file - likely missed in the copy and paste).
The main problem seems to be the "bash: lapw1: command not found", unless
something happened earlier that is not shown. Tracking down parallel
error messages is more complicated. Unlike a serial calculation that
can output the standard output and error to the display of a terminal on
a desktop, a parallel calculation on a cluster with a queue system can
put them in a standard output (-o) and standard error file (-e) or a
combined output/error file (-j) with user specified name(s) [1,2]. They
can also be written to the hidden dot files like .time* or .stdout* as
mentioned before [3,4,5].
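As a rough sketch of the above (the file names wien_job.out and
wien_job.err and the helper find_message are my own placeholders, not
something from your setup), the directives at the top of a job file and a
small helper to search the working directory, including the hidden dot
files, for the error text could look like:

```shell
#!/bin/bash
# Torque/PBS directives; the file names are hypothetical placeholders.
#PBS -o wien_job.out
#PBS -e wien_job.err
# Alternative to -e: "#PBS -j oe" merges standard error into the -o file.

# Print every regular file in directory $1 (default: current directory)
# whose contents mention pattern $2 (default: "command not found").
# The second glob picks up hidden dot files such as .stdout* and .time*.
find_message() {
    local dir=${1:-.} pattern=${2:-command not found} f
    for f in "$dir"/* "$dir"/.[!.]*; do
        [ -f "$f" ] && grep -q -- "$pattern" "$f" && printf '%s\n' "$f"
    done
    return 0
}
```

Running find_message in the case directory after the crash should list
the files worth opening first.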
The "lapw1: command not found" might be because $WIENROOT didn't get
added to the PATH on one of the nodes [
http://www.supercluster.org/pipermail/torqueusers/2010-March/010143.html
]. Did you try checking whether the path to WIEN2k is in the PATH, for
example by looking at PBS_O_PATH in the output of qstat -f jobid [
http://stackoverflow.com/questions/21248406/sleep-command-not-found-in-torque-pbs-but-works-in-shell
]?
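A minimal sketch of such a check, assuming it is placed in the job file
right after "module load wien2k" (the function names path_contains and
report_wien_path are made up for this example):

```shell
#!/bin/bash
# Return 0 if directory $1 is an entry of the colon-separated list $2.
# An empty $1 (e.g. WIENROOT unset) never counts as a match.
path_contains() {
    [ -n "$1" ] || return 1
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report whether $WIENROOT is in PATH and whether lapw1 can be resolved.
report_wien_path() {
    if path_contains "${WIENROOT:-}" "$PATH"; then
        echo "WIENROOT=${WIENROOT} is in PATH"
    else
        echo "WIENROOT=${WIENROOT:-<unset>} is NOT in PATH"
    fi
    if command -v lapw1 >/dev/null 2>&1; then
        echo "lapw1 found at $(command -v lapw1)"
    else
        echo "lapw1 not found in PATH"
    fi
}

report_wien_path
```

The comparison is a plain string match, so a trailing slash in either
$WIENROOT or the PATH entry would be reported as a mismatch. This only
checks the node where the job script itself runs; for the job as
submitted, PBS_O_PATH should show up in the Variable_List part of the
qstat -f output.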
Did you try to ssh into each node listed in the .machines file (n043
appears twice, so there are 7 distinct hosts) and check whether lapw1 is
visible on each of them? For example,
ssh n024
ls -l $WIENROOT/lapw1
ssh n225
ls -l $WIENROOT/lapw1
...
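The same check can be looped over all nodes without logging in by hand.
This is only a sketch: check_nodes, unique_nodes and the REMOTE_SH
override are names I made up, and I am assuming passwordless ssh between
the nodes. A command passed to ssh this way runs in a non-interactive
shell, which usually does not see whatever "module load wien2k" did in
the job script. As far as I understand, that is close to how the parallel
scripts start lapw1 on the other nodes, so it may reproduce the error.

```shell
#!/bin/bash
# Print each distinct host name from a node file (one host per line).
unique_nodes() {
    sort -u "$1"
}

# Ask every node, through a non-interactive remote shell, where lapw1 is.
# REMOTE_SH is a hypothetical override (default: ssh) so that the loop
# can be tried out without contacting any real node.
check_nodes() {
    local nodefile=$1 node
    for node in $(unique_nodes "$nodefile"); do
        printf '%s: ' "$node"
        ${REMOTE_SH:-ssh} "$node" 'command -v lapw1 || echo "lapw1 not found"' </dev/null
    done
}
```

Inside the job file this would be called as check_nodes "$PBS_NODEFILE".
Any node that answers "lapw1 not found" is one where the PATH is not set
up for remote commands.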
Above, I'm just guessing about the commands/configuration for your
system, but the administrator or helpdesk for your cluster should know
everything about your system and be able to help you much better with
resolving the command not found error.
[1] http://beige.ucs.indiana.edu/I590/node39.html
[2] https://wikis.nyu.edu/display/NYUHPC/Tutorial+-+Submitting+a+job+using+qsub
[3] http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg13598.html
[4] http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/msg14148.html
[5] http://zeus.theochem.tuwien.ac.at/pipermail/wien/2017-March/026109.html
On 3/13/2017 1:25 PM, shaymlal dayananda wrote:
Dear developers and users
I was trying to do a volume optimization and an scf calculation with spin
polarization in parallel mode. But both jobs crash and I get the
following error file. However, both cases run correctly when
parallel mode is removed.
'LAPW2' - can't open unit: 30
'LAPW2' -filename: case.energyup_1
** testerror: Error in Parallel LAPW2
.
Also, in STDOUT I see the following errors:
...
bash: lapw1: command not found
...
.
FERMI - Error
grep: *scf1dn*: No such file or directory
0.381u 0.507s 1:12.66 1.2%0+0k 128+1736io 1pf+0w
Test-TiC-VOl-parallel.scf1dn_1: No such file or directory.
.
I copied my machine file and the job file here. But I think this is
not correct, and I am not sure whether I need to have separate lines for
lapw2 and lapwsp. Any help in correcting this is highly
appreciated.
".machines" file
.
#
lapw0:n024 n225 n220 n218 n045 n044 n043 n043
1:n024
1:n225
1:n220
1:n218
1:n045
1:n044
1:n043
1:n043
granularity:1
extrafine:1
..
job file is copied below.
# example for 8 nodes
#PBS -l procs=8
#PBS -l pmem=2048mb
#PBS -l walltime=4:00:00
module load wien2k
# change into your working directory
cd $PBS_O_WORKDIR
#start creating .machines
cat $PBS_NODEFILE |cut -c1-6 >.machines_current
aa=`cat .machines_current | wc -l`
echo '#' > .machines
# example for an MPI parallel lapw0
echo -n 'lapw0:' >> .machines
i=1
while [ $i -lt $aa ]
do
echo -n `cat $PBS_NODEFILE |head -$i | tail -1` ' ' >>.machines
i=$((i+1))
done
echo `cat $PBS_NODEFILE |head -$i|tail -1` ' ' >>.machines
#example for k-point parallel lapw1/2
i=1
while [ $i -le $aa ]
do
echo -n '1:' >>.machines
head -$i .machines_current |tail -1 >> .machines
i=$((i+1))
done
echo 'granularity:1' >>.machines
echo 'extrafine:1' >>.machines
#define here your WIEN2k command
Thank you
Chami