Hi all,
I've recently set up an OSCAR cluster with a head node plus one compute node (4 CPUs in total).
I've been trying to submit LAM/MPI jobs, but I keep running into errors. I've also
noticed that the "bhost" file (the one pointed to by $PBS_NODEFILE) lists only the
compute node. Is that normal, given that the head node is supposed to take part in the
computation as well? (A quick way to inspect the node file from inside a job is sketched just below.)
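The sketch, assuming the lines are added near the top of the job script (they are not part of the script I attach below):

    echo "PBS_NODEFILE is $PBS_NODEFILE"
    cat "$PBS_NODEFILE"        # one line per CPU slot allocated to this job
    wc -l < "$PBS_NODEFILE"    # total number of allocated slots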
I'm attaching the script I'm using as well as the logs (the output log, the job script, and the error log follow, in that order).
Thanks in advance,
FG
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/[EMAIL PROTECTED]/lam-killfile
tkill: f_kill = "/tmp/[EMAIL PROTECTED]/lam-killfile"
tkill: nothing to kill: "/tmp/[EMAIL PROTECTED]/lam-killfile"
Job launched in molevol1.ub.edu at Fri Jul 6 18:11:31 2007
Shutting down LAM
hreq: sending HALT_PING to n0 (molevol1.ub.edu)
hreq: received HALT_ACK from n0 (molevol1.ub.edu)
hreq: sending HALT_DIE to n0 (molevol1.ub.edu)
lamhalt: sleeping to wait for lamds to die
lamhalt: local LAM daemon halted
LAM halted
Job finished at Fri Jul 6 18:11:32 2007
#!/bin/sh
##PBS -l nodes=2:ppn=4
#PBS -N TEST
#PBS -o /home/molevol/tree/test.out
#PBS -e /home/molevol/tree/test.err
##PBS -q work_queue
#PBS -m ae
#PBS -M [EMAIL PROTECTED]
#PBS -u molevol
lamboot -H -d -v $PBS_NODEFILE
DATE=`date +%c`
echo "Job launched in `hostname` at $DATE"
cd /home/molevol/tree
mpirun -np 8 /home/molevol/mrbayes-3.1.2/mb_mpi \
    /home/molevol/tree/all.all.allsp.SP------------.prot.seq.fas.promals.aln.nex
lamhalt -H -d -v $PBS_NODEFILE
DATE=`date +%c`
echo "Job finished at $DATE"
# All done!!
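A side note on the script above: only one host shows up in $PBS_NODEFILE here, so hard-coding "-np 8" may ask for more processes than slots that were actually booted. A variant that derives the count from the node file could look like the following sketch (NP is just an illustrative variable name, not something from the original script):

    NP=`wc -l < $PBS_NODEFILE`    # one line per allocated CPU slot
    mpirun -np $NP /home/molevol/mrbayes-3.1.2/mb_mpi \
        /home/molevol/tree/all.all.allsp.SP------------.prot.seq.fas.promals.aln.nex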
n-1<17278> ssi:boot:open: opening
n-1<17278> ssi:boot:open: opening boot module globus
n-1<17278> ssi:boot:open: opened boot module globus
n-1<17278> ssi:boot:open: opening boot module rsh
n-1<17278> ssi:boot:open: opened boot module rsh
n-1<17278> ssi:boot:open: opening boot module slurm
n-1<17278> ssi:boot:open: opened boot module slurm
n-1<17278> ssi:boot:open: opening boot module tm
n-1<17278> ssi:boot:open: opened boot module tm
n-1<17278> ssi:boot:select: initializing boot module slurm
n-1<17278> ssi:boot:slurm: not running under SLURM
n-1<17278> ssi:boot:select: boot module not available: slurm
n-1<17278> ssi:boot:select: initializing boot module globus
n-1<17278> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<17278> ssi:boot:select: boot module not available: globus
n-1<17278> ssi:boot:select: initializing boot module tm
n-1<17278> ssi:boot:tm: module initializing
n-1<17278> ssi:boot:tm:verbose: 1000
n-1<17278> ssi:boot:tm:priority: 50
n-1<17278> ssi:boot:select: boot module available: tm, priority: 50
n-1<17278> ssi:boot:select: initializing boot module rsh
n-1<17278> ssi:boot:rsh: module initializing
n-1<17278> ssi:boot:rsh:agent: /usr/bin/ssh
n-1<17278> ssi:boot:rsh:username: <same>
n-1<17278> ssi:boot:rsh:verbose: 1000
n-1<17278> ssi:boot:rsh:algorithm: linear
n-1<17278> ssi:boot:rsh:no_n: 0
n-1<17278> ssi:boot:rsh:no_profile: 0
n-1<17278> ssi:boot:rsh:fast: 0
n-1<17278> ssi:boot:rsh:ignore_stderr: 0
n-1<17278> ssi:boot:rsh:priority: 10
n-1<17278> ssi:boot:select: boot module available: rsh, priority: 10
n-1<17278> ssi:boot:select: finalizing boot module slurm
n-1<17278> ssi:boot:slurm: finalizing
n-1<17278> ssi:boot:select: closing boot module slurm
n-1<17278> ssi:boot:select: finalizing boot module globus
n-1<17278> ssi:boot:globus: finalizing
n-1<17278> ssi:boot:select: closing boot module globus
n-1<17278> ssi:boot:select: finalizing boot module rsh
n-1<17278> ssi:boot:rsh: finalizing
n-1<17278> ssi:boot:select: closing boot module rsh
n-1<17278> ssi:boot:select: selected boot module tm
n-1<17278> ssi:boot:tm: found the following 1 hosts:
n-1<17278> ssi:boot:tm: n0 molevol1.ub.edu (cpu=1)
n-1<17278> ssi:boot:tm: starting RTE procs
n-1<17278> ssi:boot:base:linear_windowed: starting
n-1<17278> ssi:boot:base:linear_windowed: window size: 5
n-1<17278> ssi:boot:base:server: opening server TCP socket
n-1<17278> ssi:boot:base:server: opened port 47671
n-1<17278> ssi:boot:base:linear_windowed: booting n0 (molevol1.ub.edu)
n-1<17278> ssi:boot:tm: starting wipe on (molevol1.ub.edu)
n-1<17278> ssi:boot:tm: starting on n0 (molevol1.ub.edu):
/opt/lam-7.1.2/bin/tkill -setsid -d -v
n-1<17278> ssi:boot:tm: successfully launched on n0 (molevol1.ub.edu)
n-1<17278> ssi:boot:tm: waiting for completion on n0 (molevol1.ub.edu)
n-1<17278> ssi:boot:tm: finished on n0 (molevol1.ub.edu)
n-1<17278> ssi:boot:tm: starting lamd on (molevol1.ub.edu)
n-1<17278> ssi:boot:tm: starting on n0 (molevol1.ub.edu):
/opt/lam-7.1.2/bin/lamd -H 161.116.70.157 -P 47671 -n 0 -o 0 -d
n-1<17278> ssi:boot:tm: successfully launched on n0 (molevol1.ub.edu)
n-1<17278> ssi:boot:base:linear_windowed: finished launching
n-1<17278> ssi:boot:base:server: expecting connection from finite list
n-1<17280> ssi:boot:open: opening
n-1<17280> ssi:boot:open: opening boot module globus
n-1<17280> ssi:boot:open: opened boot module globus
n-1<17280> ssi:boot:open: opening boot module rsh
n-1<17280> ssi:boot:open: opened boot module rsh
n-1<17280> ssi:boot:open: opening boot module slurm
n-1<17280> ssi:boot:open: opened boot module slurm
n-1<17280> ssi:boot:open: opening boot module tm
n-1<17280> ssi:boot:open: opened boot module tm
n-1<17280> ssi:boot:select: initializing boot module slurm
n-1<17280> ssi:boot:slurm: not running under SLURM
n-1<17280> ssi:boot:select: boot module not available: slurm
n-1<17280> ssi:boot:select: initializing boot module globus
n-1<17280> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<17280> ssi:boot:select: boot module not available: globus
n-1<17280> ssi:boot:select: initializing boot module tm
n-1<17280> ssi:boot:tm: module initializing
n-1<17280> ssi:boot:tm:verbose: 1000
n-1<17280> ssi:boot:tm:priority: 50
n-1<17280> ssi:boot:select: boot module available: tm, priority: 50
n-1<17280> ssi:boot:select: initializing boot module rsh
n-1<17280> ssi:boot:rsh: module initializing
n-1<17280> ssi:boot:rsh:agent: /usr/bin/ssh
n-1<17280> ssi:boot:rsh:username: <same>
n-1<17280> ssi:boot:rsh:verbose: 1000
n-1<17280> ssi:boot:rsh:algorithm: linear
n-1<17280> ssi:boot:rsh:no_n: 0
n-1<17280> ssi:boot:rsh:no_profile: 0
n-1<17280> ssi:boot:rsh:fast: 0
n-1<17280> ssi:boot:rsh:ignore_stderr: 0
n-1<17280> ssi:boot:rsh:priority: 10
n-1<17280> ssi:boot:select: boot module available: rsh, priority: 10
n-1<17280> ssi:boot:select: finalizing boot module slurm
n-1<17280> ssi:boot:slurm: finalizing
n-1<17280> ssi:boot:select: closing boot module slurm
n-1<17280> ssi:boot:select: finalizing boot module globus
n-1<17280> ssi:boot:globus: finalizing
n-1<17280> ssi:boot:select: closing boot module globus
n-1<17280> ssi:boot:select: finalizing boot module rsh
n-1<17280> ssi:boot:rsh: finalizing
n-1<17280> ssi:boot:select: closing boot module rsh
n-1<17280> ssi:boot:select: selected boot module tm
n-1<17280> ssi:boot:send_lamd: getting node ID from command line
n-1<17280> ssi:boot:send_lamd: getting agent haddr from command line
n-1<17280> ssi:boot:send_lamd: getting agent port from command line
n-1<17280> ssi:boot:send_lamd: getting node ID from command line
n-1<17280> ssi:boot:send_lamd: connecting to 161.116.70.157:47671, node id 0
n-1<17280> ssi:boot:send_lamd: sending dli_port 32787
n-1<17278> ssi:boot:base:server: got connection from 161.116.70.157
n-1<17278> ssi:boot:base:server: this connection is expected (n0)
n-1<17278> ssi:boot:base:server: remote lamd is at 161.116.70.157:32787
n-1<17278> ssi:boot:base:server: closing server socket
n-1<17278> ssi:boot:base:server: connecting to lamd at 161.116.70.157:40164
n-1<17278> ssi:boot:base:server: connected
n-1<17278> ssi:boot:base:server: sending number of links (1)
n-1<17278> ssi:boot:base:server: sending info: n0 (molevol1.ub.edu)
n-1<17278> ssi:boot:base:server: finished sending
n-1<17278> ssi:boot:base:server: disconnected from 161.116.70.157:40164
n-1<17278> ssi:boot:base:linear_windowed: finished
n-1<17278> ssi:boot:tm: all RTE procs started
n-1<17278> ssi:boot:tm: finalizing
n-1<17278> ssi:boot: Closing
n-1<17280> ssi:boot:tm: finalizing
n-1<17280> ssi:boot: Closing
/home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries:
liblamf77mpi.so.0: cannot open shared object file: No such file or directory
/home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries:
liblamf77mpi.so.0: cannot open shared object file: No such file or directory
/home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries:
liblamf77mpi.so.0: cannot open shared object file: No such file or directory
/home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries:
liblamf77mpi.so.0: cannot open shared object file: No such file or directory
/home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries:
liblamf77mpi.so.0: cannot open shared object file: No such file or directory
/home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries:
liblamf77mpi.so.0: cannot open shared object file: No such file or directory
/home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries:
liblamf77mpi.so.0: cannot open shared object file: No such file or directory
/home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries:
liblamf77mpi.so.0: cannot open shared object file: No such file or directory
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).
mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
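P.S. The repeated "liblamf77mpi.so.0: cannot open shared object file" lines look like the runtime loader simply cannot find that LAM library when the job runs on the node. A few checks along these lines might narrow it down; note that /opt/lam-7.1.2/lib is only my guess at the library directory (the log above only shows /opt/lam-7.1.2/bin):

    ldd /home/molevol/mrbayes-3.1.2/mb_mpi | grep -i lam   # which LAM libraries the binary wants, and whether they resolve
    ls /opt/lam-7.1.2/lib/liblamf77mpi*                    # is the library actually installed there?
    echo "$LD_LIBRARY_PATH"                                # is that directory on the loader path inside the batch environment?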