The default setup on OSCAR is for the head node not to compute.

Did you select the "use head node to compute" option? If so, there may
well be a bug; it is not a widely used option.
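
To see exactly which hosts the scheduler handed to the job, it can help to count the entries in the file that $PBS_NODEFILE points to. The following is only a minimal sketch, assuming the usual PBS/Torque convention of one host name per line for each allocated processor slot; the function name count_slots is made up for illustration and is not part of OSCAR, PBS, or LAM.

```python
import os
from collections import Counter


def count_slots(nodefile_path):
    """Map each host name in a PBS node file to its number of lines (slots).

    Assumption: the file holds one host name per line, repeated once per
    allocated processor slot. Blank lines are ignored.
    """
    with open(nodefile_path) as fh:
        hosts = [line.strip() for line in fh if line.strip()]
    return Counter(hosts)


if __name__ == "__main__":
    # PBS_NODEFILE is only set inside a running PBS job.
    path = os.environ.get("PBS_NODEFILE")
    if path is None:
        print("PBS_NODEFILE is not set; run this from inside a PBS job script")
    else:
        for host, slots in sorted(count_slots(path).items()):
            print(f"{host}: {slots} slot(s)")
```

Run from inside the job script, this prints one line per host. If the head node never appears in the output, the scheduler is not offering it as a compute resource, which is the default behaviour described above.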

On 7/6/07, Filipe Garrett <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I've recently set up an OSCAR cluster with headnode + 1 node (total 4 CPUs).
> I've been trying to submit jobs on LAM/MPI but some errors keep occurring.
> I've noted that in the "bhost" file (pointed to by $PBS_NODEFILE) there's just
> the client node. Is that normal (since the headnode is supposed to also compute)?
>
> I attach the script I'm using as well as the error and output logs.
>
> thanks in adv,
> FG
>
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/[EMAIL PROTECTED]/lam-killfile
> tkill: f_kill = "/tmp/[EMAIL PROTECTED]/lam-killfile"
> tkill: nothing to kill: "/tmp/[EMAIL PROTECTED]/lam-killfile"
> Job launched in molevol1.ub.edu at Fri Jul  6 18:11:31 2007
> Shutting down LAM
> hreq: sending HALT_PING to n0 (molevol1.ub.edu)
> hreq: received HALT_ACK from n0 (molevol1.ub.edu)
> hreq: sending HALT_DIE to n0 (molevol1.ub.edu)
> lamhalt: sleeping to wait for lamds to die
> lamhalt: local LAM daemon halted
> LAM halted
> Job finished at Fri Jul  6 18:11:32 2007
>
> n-1<17278> ssi:boot:open: opening
> n-1<17278> ssi:boot:open: opening boot module globus
> n-1<17278> ssi:boot:open: opened boot module globus
> n-1<17278> ssi:boot:open: opening boot module rsh
> n-1<17278> ssi:boot:open: opened boot module rsh
> n-1<17278> ssi:boot:open: opening boot module slurm
> n-1<17278> ssi:boot:open: opened boot module slurm
> n-1<17278> ssi:boot:open: opening boot module tm
> n-1<17278> ssi:boot:open: opened boot module tm
> n-1<17278> ssi:boot:select: initializing boot module slurm
> n-1<17278> ssi:boot:slurm: not running under SLURM
> n-1<17278> ssi:boot:select: boot module not available: slurm
> n-1<17278> ssi:boot:select: initializing boot module globus
> n-1<17278> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<17278> ssi:boot:select: boot module not available: globus
> n-1<17278> ssi:boot:select: initializing boot module tm
> n-1<17278> ssi:boot:tm: module initializing
> n-1<17278> ssi:boot:tm:verbose: 1000
> n-1<17278> ssi:boot:tm:priority: 50
> n-1<17278> ssi:boot:select: boot module available: tm, priority: 50
> n-1<17278> ssi:boot:select: initializing boot module rsh
> n-1<17278> ssi:boot:rsh: module initializing
> n-1<17278> ssi:boot:rsh:agent: /usr/bin/ssh
> n-1<17278> ssi:boot:rsh:username: <same>
> n-1<17278> ssi:boot:rsh:verbose: 1000
> n-1<17278> ssi:boot:rsh:algorithm: linear
> n-1<17278> ssi:boot:rsh:no_n: 0
> n-1<17278> ssi:boot:rsh:no_profile: 0
> n-1<17278> ssi:boot:rsh:fast: 0
> n-1<17278> ssi:boot:rsh:ignore_stderr: 0
> n-1<17278> ssi:boot:rsh:priority: 10
> n-1<17278> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<17278> ssi:boot:select: finalizing boot module slurm
> n-1<17278> ssi:boot:slurm: finalizing
> n-1<17278> ssi:boot:select: closing boot module slurm
> n-1<17278> ssi:boot:select: finalizing boot module globus
> n-1<17278> ssi:boot:globus: finalizing
> n-1<17278> ssi:boot:select: closing boot module globus
> n-1<17278> ssi:boot:select: finalizing boot module rsh
> n-1<17278> ssi:boot:rsh: finalizing
> n-1<17278> ssi:boot:select: closing boot module rsh
> n-1<17278> ssi:boot:select: selected boot module tm
> n-1<17278> ssi:boot:tm: found the following 1 hosts:
> n-1<17278> ssi:boot:tm:   n0 molevol1.ub.edu (cpu=1)
> n-1<17278> ssi:boot:tm: starting RTE procs
> n-1<17278> ssi:boot:base:linear_windowed: starting
> n-1<17278> ssi:boot:base:linear_windowed: window size: 5
> n-1<17278> ssi:boot:base:server: opening server TCP socket
> n-1<17278> ssi:boot:base:server: opened port 47671
> n-1<17278> ssi:boot:base:linear_windowed: booting n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: starting wipe on (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: starting on n0 (molevol1.ub.edu): /opt/lam-7.1.2/bin/tkill -setsid -d -v
> n-1<17278> ssi:boot:tm: successfully launched on n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: waiting for completion on n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: finished on n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: starting lamd on (molevol1.ub.edu)
> n-1<17278> ssi:boot:tm: starting on n0 (molevol1.ub.edu): /opt/lam-7.1.2/bin/lamd -H 161.116.70.157 -P 47671 -n 0 -o 0 -d
> n-1<17278> ssi:boot:tm: successfully launched on n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:base:linear_windowed: finished launching
> n-1<17278> ssi:boot:base:server: expecting connection from finite list
> n-1<17280> ssi:boot:open: opening
> n-1<17280> ssi:boot:open: opening boot module globus
> n-1<17280> ssi:boot:open: opened boot module globus
> n-1<17280> ssi:boot:open: opening boot module rsh
> n-1<17280> ssi:boot:open: opened boot module rsh
> n-1<17280> ssi:boot:open: opening boot module slurm
> n-1<17280> ssi:boot:open: opened boot module slurm
> n-1<17280> ssi:boot:open: opening boot module tm
> n-1<17280> ssi:boot:open: opened boot module tm
> n-1<17280> ssi:boot:select: initializing boot module slurm
> n-1<17280> ssi:boot:slurm: not running under SLURM
> n-1<17280> ssi:boot:select: boot module not available: slurm
> n-1<17280> ssi:boot:select: initializing boot module globus
> n-1<17280> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<17280> ssi:boot:select: boot module not available: globus
> n-1<17280> ssi:boot:select: initializing boot module tm
> n-1<17280> ssi:boot:tm: module initializing
> n-1<17280> ssi:boot:tm:verbose: 1000
> n-1<17280> ssi:boot:tm:priority: 50
> n-1<17280> ssi:boot:select: boot module available: tm, priority: 50
> n-1<17280> ssi:boot:select: initializing boot module rsh
> n-1<17280> ssi:boot:rsh: module initializing
> n-1<17280> ssi:boot:rsh:agent: /usr/bin/ssh
> n-1<17280> ssi:boot:rsh:username: <same>
> n-1<17280> ssi:boot:rsh:verbose: 1000
> n-1<17280> ssi:boot:rsh:algorithm: linear
> n-1<17280> ssi:boot:rsh:no_n: 0
> n-1<17280> ssi:boot:rsh:no_profile: 0
> n-1<17280> ssi:boot:rsh:fast: 0
> n-1<17280> ssi:boot:rsh:ignore_stderr: 0
> n-1<17280> ssi:boot:rsh:priority: 10
> n-1<17280> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<17280> ssi:boot:select: finalizing boot module slurm
> n-1<17280> ssi:boot:slurm: finalizing
> n-1<17280> ssi:boot:select: closing boot module slurm
> n-1<17280> ssi:boot:select: finalizing boot module globus
> n-1<17280> ssi:boot:globus: finalizing
> n-1<17280> ssi:boot:select: closing boot module globus
> n-1<17280> ssi:boot:select: finalizing boot module rsh
> n-1<17280> ssi:boot:rsh: finalizing
> n-1<17280> ssi:boot:select: closing boot module rsh
> n-1<17280> ssi:boot:select: selected boot module tm
> n-1<17280> ssi:boot:send_lamd: getting node ID from command line
> n-1<17280> ssi:boot:send_lamd: getting agent haddr from command line
> n-1<17280> ssi:boot:send_lamd: getting agent port from command line
> n-1<17280> ssi:boot:send_lamd: getting node ID from command line
> n-1<17280> ssi:boot:send_lamd: connecting to 161.116.70.157:47671, node id 0
> n-1<17280> ssi:boot:send_lamd: sending dli_port 32787
> n-1<17278> ssi:boot:base:server: got connection from 161.116.70.157
> n-1<17278> ssi:boot:base:server: this connection is expected (n0)
> n-1<17278> ssi:boot:base:server: remote lamd is at 161.116.70.157:32787
> n-1<17278> ssi:boot:base:server: closing server socket
> n-1<17278> ssi:boot:base:server: connecting to lamd at 161.116.70.157:40164
> n-1<17278> ssi:boot:base:server: connected
> n-1<17278> ssi:boot:base:server: sending number of links (1)
> n-1<17278> ssi:boot:base:server: sending info: n0 (molevol1.ub.edu)
> n-1<17278> ssi:boot:base:server: finished sending
> n-1<17278> ssi:boot:base:server: disconnected from 161.116.70.157:40164
> n-1<17278> ssi:boot:base:linear_windowed: finished
> n-1<17278> ssi:boot:tm: all RTE procs started
> n-1<17278> ssi:boot:tm: finalizing
> n-1<17278> ssi:boot: Closing
> n-1<17280> ssi:boot:tm: finalizing
> n-1<17280> ssi:boot: Closing
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> /home/molevol/mrbayes-3.1.2/mb_mpi: error while loading shared libraries: liblamf77mpi.so.0: cannot open shared object file: No such file or directory
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------------
>

_______________________________________________
Oscar-users mailing list
Oscar-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oscar-users