The default setup on OSCAR is for the head node not to compute, so it is normal for $PBS_NODEFILE to list only the client node. Did you select the "use head node to compute" option? If so, there may very well be a bug; it is not a widely used option.
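For what it's worth, a quick way to see exactly which hosts Torque hands to a job is to dump the nodefile from inside the job script. This is only a sketch, assuming a standard Torque/PBS environment where $PBS_NODEFILE is set:

```shell
#!/bin/sh
# Sketch: run inside a Torque/PBS job script. $PBS_NODEFILE contains one
# line per allocated CPU slot; the head node should appear here only if
# "use head node to compute" was enabled when the cluster was built.
cat "$PBS_NODEFILE"

# Summarize slots per host (count, then hostname):
sort "$PBS_NODEFILE" | uniq -c
```

If the head node is missing from that output, LAM's tm boot module will never start a lamd there, regardless of what your bhost file says.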
On 7/6/07, Filipe Garrett <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I've recently set up an OSCAR cluster with headnode + 1 node (total 4
> CPUs). I've been trying to submit jobs on LAM/MPI but some errors keep
> occurring. I've noted that on the "bhost" file (pointed by $PBS_NODEFILE)
> there's just the client node. Is it normal (since the headnode is supposed
> to also compute)?
>
> I attach the script I'm using as well as the error and output logs.
>
> thanks in adv,
> FG
>
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/[EMAIL PROTECTED]/lam-killfile
> tkill: f_kill = "/tmp/[EMAIL PROTECTED]/lam-killfile"
> tkill: nothing to kill: "/tmp/[EMAIL PROTECTED]/lam-killfile"
> Job launched in molevol1.ub.edu at Fri Jul 6 18:11:31 2007
> Shutting down LAM
> hreq: sending HALT_PING to n0 (molevol1.ub.edu)
> hreq: received HALT_ACK from n0 (molevol1.ub.edu)
> hreq: sending HALT_DIE to n0 (molevol1.ub.edu)
> lamhalt: sleeping to wait for lamds to die
> lamhalt: local LAM daemon halted
> LAM halted
> Job finished at Fri Jul 6 18:11:32 2007
>
> n-1<17278> ssi:boot:open: opening
> n-1<17278> ssi:boot:open: opening boot module globus
> n-1<17278> ssi:boot:open: opened boot module globus
> n-1<17278> ssi:boot:open: opening boot module rsh
> n-1<17278> ssi:boot:open: opened boot module rsh
> n-1<17278> ssi:boot:open: opening boot module slurm
> n-1<17278> ssi:boot:open: opened boot module slurm
> n-1<17278> ssi:boot:open: opening boot module tm
> n-1<17278> ssi:boot:open: opened boot module tm
> n-1<17278> ssi:boot:select: initializing boot module slurm
> n-1<17278> ssi:boot:slurm: not running under SLURM
> n-1<17278> ssi:boot:select: boot module not available: slurm
> n-1<17278> ssi:boot:select: initializing boot module globus
> n-1<17278> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<17278> ssi:boot:select: boot module not available: globus
> n-1<17278> ssi:boot:select: initializing boot module tm
> n-1<17278> ssi:boot:tm: module initializing
> n-1<17278> ssi:boot:tm:verbose: 1000
> n-1<17278> ssi:boot:tm:priority: 50
> n-1<17278> ssi:boot:select: boot module available: tm, priority: 50
> n-1<17278> ssi:boot:select: initializing boot module rsh
> n-1<17278> ssi:boot:rsh: module initializing
> n-1<17278> ssi:boot:rsh:agent: /usr/bin/ssh
> n-1<17278> ssi:boot:rsh:username: <same>
> n-1<17278> ssi:boot:rsh:verbose: 1000
> n-1<17278> ssi:boot:rsh:algorithm: linear
> n-1<17278> ssi:boot:rsh:no_n: 0
> n-1<17278> ssi:boot:rsh:no_profile: 0
> n-1<17278> ssi:boot:rsh:fast: 0
> n-1<17278> ssi:boot:rsh:ignore_stderr: 0
> n-1<17278> ssi:boot:rsh:priority: 10
> n-1<17278> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<17278> ssi:boot:select: finalizing boot module slurm
> n-1<17278> ssi:boot:slurm: finalizing
> n-1<17278> ssi:boot:select: closing boot module slurm
> n-1<17278> ssi:boot:select: finalizing boot module globus
> n-1<17278> ssi:boot:globus: finalizing
> n-1<17278> ssi:boot:select: closing boot module globus
> n-1<17278> ssi:boot:select: finalizing boot module rsh
> n-1<17278> ssi:boot:rsh: finalizing
> n-1<17278> ssi:boot:select: closing boot module rsh
> n-1<17278> ssi:boot:select: selected boot module tm
> n-1<17278> ssi:boot:tm: found the following 1 hosts:
> n-1<17278> ssi:boot:tm:   n0 molevol1.ub.edu (cpu=1)
> n-1<17278> ssi:boot:tm: starting RTE procs
> n-1<17278> ssi:boot:base:linear_windowed: starting
> n-1<17278> ssi:boot:base:linear_windowed: window size: 5
> n-1<17278> ssi:boot:base:server: opening server TCP socket
> n-1<17278> ssi:boot:base:server: opened port 47671
> n-1<17278> ssi:boot:base:linear_windowed: booting n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: starting wipe on (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: starting on n0 (molevol1.ub.edu): /opt/lam-7.1.2/bin/tkill -setsid -d -v
> n-1<17278> ssi:boot:tm: successfully launched on n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: waiting for completion on n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: finished on n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: starting lamd on (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: starting on n0 (molevol1.ub.edu): /opt/lam-7.1.2/bin/lamd -H 161.116.70.157 -P 47671 -n 0 -o 0 -d
> n-1<17278> ssi:boot:tm: successfully launched on n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:base:linear_windowed: finished launching
> n-1<17278> ssi:boot:base:server: expecting connection from finite list
> n-1<17280> ssi:boot:open: opening
> n-1<17280> ssi:boot:open: opening boot module globus
> n-1<17280> ssi:boot:open: opened boot module globus
> n-1<17280> ssi:boot:open: opening boot module rsh
> n-1<17280> ssi:boot:open: opened boot module rsh
> n-1<17280> ssi:boot:open: opening boot module slurm
> n-1<17280> ssi:boot:open: opened boot module slurm
> n-1<17280> ssi:boot:open: opening boot module tm
> n-1<17280> ssi:boot:open: opened boot module tm
> n-1<17280> ssi:boot:select: initializing boot module slurm
> n-1<17280> ssi:boot:slurm: not running under SLURM
> n-1<17280> ssi:boot:select: boot module not available: slurm
> n-1<17280> ssi:boot:select: initializing boot module globus
> n-1<17280> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<17280> ssi:boot:select: boot module not available: globus
> n-1<17280> ssi:boot:select: initializing boot module tm
> n-1<17280> ssi:boot:tm: module initializing
> n-1<17280> ssi:boot:tm:verbose: 1000
> n-1<17280> ssi:boot:tm:priority: 50
> n-1<17280> ssi:boot:select: boot module available: tm, priority: 50
> n-1<17280> ssi:boot:select: initializing boot module rsh
> n-1<17280> ssi:boot:rsh: module initializing
> n-1<17280> ssi:boot:rsh:agent: /usr/bin/ssh
> n-1<17280> ssi:boot:rsh:username: <same>
> n-1<17280> ssi:boot:rsh:verbose: 1000
> n-1<17280> ssi:boot:rsh:algorithm: linear
> n-1<17280> ssi:boot:rsh:no_n: 0
> n-1<17280> ssi:boot:rsh:no_profile: 0
> n-1<17280> ssi:boot:rsh:fast: 0
> n-1<17280> ssi:boot:rsh:ignore_stderr: 0
> n-1<17280> ssi:boot:rsh:priority: 10
> n-1<17280> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<17280> ssi:boot:select: finalizing boot module slurm
> n-1<17280> ssi:boot:slurm: finalizing
> n-1<17280> ssi:boot:select: closing boot module slurm
> n-1<17280> ssi:boot:select: finalizing boot module globus
> n-1<17280> ssi:boot:globus: finalizing
> n-1<17280> ssi:boot:select: closing boot module globus
> n-1<17280> ssi:boot:select: finalizing boot module rsh
> n-1<17280> ssi:boot:rsh: finalizing
> n-1<17280> ssi:boot:select: closing boot module rsh
> n-1<17280> ssi:boot:select: selected boot module tm
> n-1<17280> ssi:boot:send_lamd: getting node ID from command line
> n-1<17280> ssi:boot:send_lamd: getting agent haddr from command line
> n-1<17280> ssi:boot:send_lamd: getting agent port from command line
> n-1<17280> ssi:boot:send_lamd: getting node ID from command line
> n-1<17280> ssi:boot:send_lamd: connecting to 161.116.70.157:47671, node id 0
> n-1<17280> ssi:boot:send_lamd: sending dli_port 32787
> n-1<17278> ssi:boot:base:server: got connection from 161.116.70.157
> n-1<17278> ssi:boot:base:server: this connection is expected (n0)
> n-1<17278> ssi:boot:base:server: remote lamd is at 161.116.70.157:32787
> n-1<17278> ssi:boot:base:server: closing server socket
> n-1<17278> ssi:boot:base:server: connecting to lamd at 161.116.70.157:40164
> n-1<17278> ssi:boot:base:server: connected
> n-1<17278> ssi:boot:base:server: sending number of links (1)
> n-1<17278> ssi:boot:base:server: sending info: n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:base:server: finished sending
> n-1<17278> ssi:boot:base:server: disconnected from 161.116.70.157:40164
> n-1<17278> ssi:boot:base:linear_windowed: finished
> n-1<17278> ssi:boot:tm: all RTE procs started
> n-1<17278> ssi:boot:tm: finalizing
> n-1<17278> ssi:boot: Closing
> n-1<17280> ssi:boot:tm: finalizing
> n-1<17280> ssi:boot: Closing
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------------
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> Oscar-users mailing list
> Oscar-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oscar-users
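Separately, the actual failure in the log above is a runtime-linker error rather than an MPI one: the dynamic linker on the node running mb_mpi cannot find liblamf77mpi.so.0. A diagnostic sketch, assuming LAM/MPI is installed under /opt/lam-7.1.2 as the log suggests (the APP path is taken from the log; adjust for your system, and run this on the compute node, since the library may exist on the head node but be missing from the client image):

```shell
#!/bin/sh
# Sketch: list the shared libraries the binary fails to resolve.
# APP is the hypothetical path from the log; override as needed.
APP=${APP:-/home/molevol/mrbayes-3.1.2/mb_mpi}
ldd "$APP" 2>/dev/null | grep 'not found' || true

# If liblamf77mpi.so.0 exists under the LAM install but is not on the
# node's default linker path, exporting it in the job script before
# mpirun may be all that is needed:
export LD_LIBRARY_PATH=/opt/lam-7.1.2/lib:${LD_LIBRARY_PATH:-}
```

Because Torque jobs do not run login shells by default, an LD_LIBRARY_PATH set in ~/.bashrc on the head node may never reach the job environment; setting it explicitly in the job script (or adding the LAM lib directory to /etc/ld.so.conf on the nodes and running ldconfig) is more reliable.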