Now I seem to have a problem with PBS. I am able to ssh into the two slave nodes from the master; for example, "ssh rosenode1" takes me to rosenode1. lamboot claims to succeed. However, when I submit a simple qsub job with this script file:

#PBS -S /bin/sh
#PBS -l nodes=1
#PBS -q normal
#PBS -N tielema
#PBS -j oe
echo "I ran on `hostname`"
#use mpirun to run my MPI binary with 2 nodes
mpirun -np 2 ls

I get (in file tielema.o##):

I ran on rosenode2.engin.umich.edu
-----------------------------------------------------------------------------
It seems that there is no lamd running on this host, which indicates
that the LAM/MPI runtime environment is not operating. The LAM/MPI
runtime environment is necessary for the "mpirun" command.

Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
---------------------------------------------------------------------------

I am pretty sure that lamboot is running. I even tried a lamclean and then
did another lamboot -v lamhosts. Even after that, the same PBS error happens.

Any ideas?

Thanks. I hope to get this "computer" problem fixed ASAP so that I can worry
about "science".

"Brian W. Barrett" wrote:

> On Thu, 28 Mar 2002, Senthil Kandasamy wrote:
>
> > I fixed it. On the master node, the default shell was csh whereas on the
> > slaves, it was bash. I changed the default shell to csh on all three
> > nodes and recon worked. Lamboot also worked. I do not understand why
> > this should affect the boot, but hey... it worked. I am not
> > complaining. However, when I lamboot, it recognizes only one CPU per
> > node though the nodes have dual processors. Is this a problem?
>
> I'm not sure why the shell change would make a difference. We (the LAM
> team) haven't really tested that configuration, so it looks like something
> to test for the next OSCAR release. We don't do anything that should
> cause problems, but obviously something is :).
>
> As for the CPU count, LAM doesn't actually detect how many CPUs you have;
> it requires you to tell it.
> The only purpose of LAM's CPU count is
> scheduling - if you tell it all your nodes are dual CPU machines, you can
> do something like:
>
>   mpirun C foo
>
> and have LAM run two processes on each node. More importantly, LAM will
> guarantee that processes on the same machine will be "neighbor" ranks in
> MPI_COMM_WORLD - ranks 0 & 1 might be on the first node, for instance.
> Since most communication occurs with "neighbors", this can actually make a
> performance difference on some applications.
>
> You might want to take a look at the LAM/MPI FAQ, available at:
>
>   http://www.lam-mpi.org/faq/
>
> for more information on the topic. There is a detailed explanation of the
> CPU count stuff under the "Booting LAM" section.
>
> > I just submitted a job through command-line mpirun and it seems to work.
> >
> > Now I have to see if I can submit jobs through PBS. If that works, the
> > cluster will be fully functional.
>
> Hopefully, you have made it through most of the sticky points. Getting
> LAM or MPICH to run under PBS isn't all that difficult, especially since
> PBS should already be running with a basic FIFO queue enabled.
>
> Brian
>
> --
> Brian Barrett
> LAM/MPI developer and all around nice guy
> Have a LAM/MPI day: http://www.lam-mpi.org/

_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
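
For the CPU count point in the quoted reply, here is a minimal sketch of how the count might be declared. It assumes LAM's boot schema accepts a `cpu=N` key after each hostname (the FAQ's "Booting LAM" section is the place to confirm this). The file name `lamhosts.example` is hypothetical, and the master node's entry is left out because its hostname is not given in this thread.

```shell
#!/bin/sh
# Hypothetical boot schema: one line per node, with the CPU count stated
# explicitly, since LAM does not detect it. Assumed syntax: "hostname cpu=N".
cat > lamhosts.example <<'EOF'
rosenode1 cpu=2
rosenode2 cpu=2
EOF

# Sanity check: add up the CPUs declared in the schema. A line without a
# cpu= key is counted as a single CPU.
total=$(awk '
    /^#/ || NF == 0 { next }
    {
        n = 1
        for (i = 2; i <= NF; i++)
            if ($i ~ /^cpu=/) { split($i, kv, "="); n = kv[2] + 0 }
        sum += n
    }
    END { print sum + 0 }
' lamhosts.example)
echo "Declared CPUs: $total"

# With a schema like this, the commands already mentioned in the thread
# would be run as:
#   lamboot -v lamhosts.example
#   mpirun C foo
```

If the schema is accepted, `mpirun C foo` should start one process per declared CPU, so two per node in this sketch.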
