Yes, I used the "use head node to compute" option when installing.
Michael Edwards wrote:
> The default setup on OSCAR is for the head node not to compute.
>
> Did you select the "use head node to compute" option? If so there may
> very well be a bug, it is not a widely used option.
>
> On 7/6/07, Filipe Garrett <[EMAIL PROTECTED]> wrote:
>> Hi all,
>>
>> I've recently set up an OSCAR cluster with headnode + 1 node (total 4 CPUs).
>> I've been trying to submit jobs on LAM/MPI but some errors keep occurring.
>> I've noted that on the "bhost" file (pointed by $PBS_NODEFILE) there's just
>> the client node. Is it normal (since the headnode is supposed to also compute)?
>>
>> I attach the script I'm using as well as the error and output logs.
>>
>> thanks in adv,
>> FG
>>
>> tkill: setting prefix to (null)
>> tkill: setting suffix to (null)
>> tkill: got killname back: /tmp/[EMAIL PROTECTED]/lam-killfile
>> tkill: f_kill = "/tmp/[EMAIL PROTECTED]/lam-killfile"
>> tkill: nothing to kill: "/tmp/[EMAIL PROTECTED]/lam-killfile"
>> Job launched in molevol1.ub.edu at Fri Jul 6 18:11:31 2007
>> Shutting down LAM
>> hreq: sending HALT_PING to n0 (molevol1.ub.edu)
>> hreq: received HALT_ACK from n0 (molevol1.ub.edu)
>> hreq: sending HALT_DIE to n0 (molevol1.ub.edu)
>> lamhalt: sleeping to wait for lamds to die
>> lamhalt: local LAM daemon halted
>> LAM halted
>> Job finished at Fri Jul 6 18:11:32 2007
>>
>> n-1<17278> ssi:boot:open: opening
>> n-1<17278> ssi:boot:open: opening boot module globus
>> n-1<17278> ssi:boot:open: opened boot module globus
>> n-1<17278> ssi:boot:open: opening boot module rsh
>> n-1<17278> ssi:boot:open: opened boot module rsh
>> n-1<17278> ssi:boot:open: opening boot module slurm
>> n-1<17278> ssi:boot:open: opened boot module slurm
>> n-1<17278> ssi:boot:open: opening boot module tm
>> n-1<17278> ssi:boot:open: opened boot module tm
>> n-1<17278> ssi:boot:select: initializing boot module slurm
>> n-1<17278> ssi:boot:slurm: not running under SLURM
>> n-1<17278> ssi:boot:select: boot module not available: slurm
>> n-1<17278> ssi:boot:select: initializing boot module globus
>> n-1<17278> ssi:boot:globus: globus-job-run not found, globus boot will not run
>> n-1<17278> ssi:boot:select: boot module not available: globus
>> n-1<17278> ssi:boot:select: initializing boot module tm
>> n-1<17278> ssi:boot:tm: module initializing
>> n-1<17278> ssi:boot:tm:verbose: 1000
>> n-1<17278> ssi:boot:tm:priority: 50
>> n-1<17278> ssi:boot:select: boot module available: tm, priority: 50
>> n-1<17278> ssi:boot:select: initializing boot module rsh
>> n-1<17278> ssi:boot:rsh: module initializing
>> n-1<17278> ssi:boot:rsh:agent: /usr/bin/ssh
>> n-1<17278> ssi:boot:rsh:username: <same>
>> n-1<17278> ssi:boot:rsh:verbose: 1000
>> n-1<17278> ssi:boot:rsh:algorithm: linear
>> n-1<17278> ssi:boot:rsh:no_n: 0
>> n-1<17278> ssi:boot:rsh:no_profile: 0
>> n-1<17278> ssi:boot:rsh:fast: 0
>> n-1<17278> ssi:boot:rsh:ignore_stderr: 0
>> n-1<17278> ssi:boot:rsh:priority: 10
>> n-1<17278> ssi:boot:select: boot module available: rsh, priority: 10
>> n-1<17278> ssi:boot:select: finalizing boot module slurm
>> n-1<17278> ssi:boot:slurm: finalizing
>> n-1<17278> ssi:boot:select: closing boot module slurm
>> n-1<17278> ssi:boot:select: finalizing boot module globus
>> n-1<17278> ssi:boot:globus: finalizing
>> n-1<17278> ssi:boot:select: closing boot module globus
>> n-1<17278> ssi:boot:select: finalizing boot module rsh
>> n-1<17278> ssi:boot:rsh: finalizing
>> n-1<17278> ssi:boot:select: closing boot module rsh
>> n-1<17278> ssi:boot:select: selected boot module tm
>> n-1<17278> ssi:boot:tm: found the following 1 hosts:
>> n-1<17278> ssi:boot:tm: n0 molevol1.ub.edu (cpu=1)
>> n-1<17278> ssi:boot:tm: starting RTE procs
>> n-1<17278> ssi:boot:base:linear_windowed: starting
>> n-1<17278> ssi:boot:base:linear_windowed: window size: 5
>> n-1<17278> ssi:boot:base:server: opening server TCP socket
>> n-1<17278> ssi:boot:base:server: opened port 47671
>> n-1<17278> ssi:boot:base:linear_windowed: booting n0 (molevol1.ub.edu)
>> n-1<17278> ssi:boot:tm: starting wipe on (molevol1.ub.edu)
>> n-1<17278> ssi:boot:tm: starting on n0 (molevol1.ub.edu): /opt/lam-7.1.2/bin/tkill -setsid -d -v
>> n-1<17278> ssi:boot:tm: successfully launched on n0 (molevol1.ub.edu)
>> n-1<17278> ssi:boot:tm: waiting for completion on n0 (molevol1.ub.edu)
>> n-1<17278> ssi:boot:tm: finished on n0 (molevol1.ub.edu)
>> n-1<17278> ssi:boot:tm: starting lamd on (molevol1.ub.edu)
>> n-1<17278> ssi:boot:tm: starting on n0 (molevol1.ub.edu): /opt/lam-7.1.2/bin/lamd -H 161.116.70.157 -P 47671 -n 0 -o 0 -d
>> n-1<17278> ssi:boot:tm: successfully launched on n0 (molevol1.ub.edu)
>> n-1<17278> ssi:boot:base:linear_windowed: finished launching
>> n-1<17278> ssi:boot:base:server: expecting connection from finite list
>> n-1<17280> ssi:boot:open: opening
>> n-1<17280> ssi:boot:open: opening boot module globus
>> n-1<17280> ssi:boot:open: opened boot module globus
>> n-1<17280> ssi:boot:open: opening boot module rsh
>> n-1<17280> ssi:boot:open: opened boot module rsh
>> n-1<17280> ssi:boot:open: opening boot module slurm
>> n-1<17280> ssi:boot:open: opened boot module slurm
>> n-1<17280> ssi:boot:open: opening boot module tm
>> n-1<17280> ssi:boot:open: opened boot module tm
>> n-1<17280> ssi:boot:select: initializing boot module slurm
>> n-1<17280> ssi:boot:slurm: not running under SLURM
>> n-1<17280> ssi:boot:select: boot module not available: slurm
>> n-1<17280> ssi:boot:select: initializing boot module globus
>> n-1<17280> ssi:boot:globus: globus-job-run not found, globus boot will not run
>> n-1<17280> ssi:boot:select: boot module not available: globus
>> n-1<17280> ssi:boot:select: initializing boot module tm
>> n-1<17280> ssi:boot:tm: module initializing
>> n-1<17280> ssi:boot:tm:verbose: 1000
>> n-1<17280> ssi:boot:tm:priority: 50
>> n-1<17280> ssi:boot:select: boot module available: tm, priority: 50
>> n-1<17280> ssi:boot:select: initializing boot module rsh
>> n-1<17280> ssi:boot:rsh: module initializing
>> n-1<17280> ssi:boot:rsh:agent: /usr/bin/ssh
>> n-1<17280> ssi:boot:rsh:username: <same>
>> n-1<17280> ssi:boot:rsh:verbose: 1000
>> n-1<17280> ssi:boot:rsh:algorithm: linear
>> n-1<17280> ssi:boot:rsh:no_n: 0
>> n-1<17280> ssi:boot:rsh:no_profile: 0
>> n-1<17280> ssi:boot:rsh:fast: 0
>> n-1<17280> ssi:boot:rsh:ignore_stderr: 0
>> n-1<17280> ssi:boot:rsh:priority: 10
>> n-1<17280> ssi:boot:select: boot module available: rsh, priority: 10
>> n-1<17280> ssi:boot:select: finalizing boot module slurm
>> n-1<17280> ssi:boot:slurm: finalizing
>> n-1<17280> ssi:boot:select: closing boot module slurm
>> n-1<17280> ssi:boot:select: finalizing boot module globus
>> n-1<17280> ssi:boot:globus: finalizing
>> n-1<17280> ssi:boot:select: closing boot module globus
>> n-1<17280> ssi:boot:select: finalizing boot module rsh
>> n-1<17280> ssi:boot:rsh: finalizing
>> n-1<17280> ssi:boot:select: closing boot module rsh
>> n-1<17280> ssi:boot:select: selected boot module tm
>> n-1<17280> ssi:boot:send_lamd: getting node ID from command line
>> n-1<17280> ssi:boot:send_lamd: getting agent haddr from command line
>> n-1<17280> ssi:boot:send_lamd: getting agent port from command line
>> n-1<17280> ssi:boot:send_lamd: getting node ID from command line
>> n-1<17280> ssi:boot:send_lamd: connecting to 161.116.70.157:47671, node id 0
>> n-1<17280> ssi:boot:send_lamd: sending dli_port 32787
>> n-1<17278> ssi:boot:base:server: got connection from 161.116.70.157
>> n-1<17278> ssi:boot:base:server: this connection is expected (n0)
>> n-1<17278> ssi:boot:base:server: remote lamd is at 161.116.70.157:32787
>> n-1<17278> ssi:boot:base:server: closing server socket
>> n-1<17278> ssi:boot:base:server: connecting to lamd at 161.116.70.157:40164
>> n-1<17278> ssi:boot:base:server: connected
>> n-1<17278> ssi:boot:base:server: sending number of links (1)
>> n-1<17278> ssi:boot:base:server: sending info: n0 (molevol1.ub.edu)
>> n-1<17278> ssi:boot:base:server: finished sending
>> n-1<17278> ssi:boot:base:server: disconnected from 161.116.70.157:40164
>> n-1<17278> ssi:boot:base:linear_windowed: finished
>> n-1<17278> ssi:boot:tm: all RTE procs started
>> n-1<17278> ssi:boot:tm: finalizing
>> n-1<17278> ssi:boot: Closing
>> n-1<17280> ssi:boot:tm: finalizing
>> n-1<17280> ssi:boot: Closing
>> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
>> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
>> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
>> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
>> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
>> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
>> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
>> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
>> -----------------------------------------------------------------------------
>> It seems that [at least] one of the processes that was started with
>> mpirun did not invoke MPI_INIT before quitting (it is possible that
>> more than one process did not invoke MPI_INIT -- mpirun was only
>> notified of the first one, which was on node n0).
>>
>> mpirun can *only* be used with MPI programs (i.e., programs that
>> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
>> to run non-MPI programs over the lambooted nodes.
>> -----------------------------------------------------------------------------
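The "liblamf77mpi.so.0: cannot open shared object file" lines above mean the LAM Fortran support library is either not installed on the node that ran the processes or not on the runtime linker path there. A minimal check along these lines might help; the /opt/lam-7.1.2 prefix comes from the log above, and the exact lib directory is an assumption:

  # Run on the node that produced the errors:
  ldd /home/molevol/mrbayes-3.1.2/mb_mpi | grep "not found"   # list any libraries the binary cannot resolve
  ls -l /opt/lam-7.1.2/lib/liblamf77mpi.so.0                  # assumed location under the LAM install prefix

  # If the file exists but is not found at run time, one possible fix is to
  # export the library path in the job script before calling mpirun:
  export LD_LIBRARY_PATH=/opt/lam-7.1.2/lib:$LD_LIBRARY_PATH

If the library is simply missing from the compute node image, it would instead need to be installed (or copied) there, since the binary is started on every node listed in the boot schema.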
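On the bhost/$PBS_NODEFILE question above: it is easy to see both what Torque hands to the job and what LAM actually boots by printing them from inside the job script. The sketch below is not the script attached to the original mail; the resource request is a placeholder and the mb_mpi input arguments are omitted:

  #!/bin/sh
  #PBS -l nodes=2:ppn=2

  echo "Nodes allocated by Torque ($PBS_NODEFILE):"
  cat $PBS_NODEFILE          # with "use head node to compute" the head node should appear here

  lamboot -v $PBS_NODEFILE   # boot LAM on the allocated nodes
  lamnodes                   # show the nodes and CPU counts LAM actually booted

  mpirun C /home/molevol/mrbayes-3.1.2/mb_mpi   # "C" = one process per available CPU

  lamhalt

In the log above, lamnodes/lamboot only ever see n0 (molevol1.ub.edu, cpu=1), which matches the report that the head node is missing from the bhost file.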