Now I seem to have a problem with PBS.
I am able to ssh into the two slave nodes from the master,
e.g. ssh rosenode1 takes me to rosenode1.
lamboot reports success.
However, when I submit a simple job with qsub using this script file:

#PBS -S /bin/sh
#PBS -l nodes=1
#PBS -q normal
#PBS -N tielema
#PBS -j oe
echo "I ran on `hostname`"

#use mpirun to run my MPI binary with 2 nodes

mpirun -np 2 ls


I get (in file tielema.o##):

I ran on rosenode2.engin.umich.edu
-----------------------------------------------------------------------------

It seems that there is no lamd running on this host, which indicates
that the LAM/MPI runtime environment is not operating.  The LAM/MPI
runtime environment is necessary for the "mpirun" command.

Please run the "lamboot" command the start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
---------------------------------------------------------------------------


I am pretty sure that lamboot succeeded. I even ran lamclean and then
another lamboot -v lamhosts.  Even after that, the same error shows up in
the PBS job output.  Any ideas?
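
One thing that may be worth checking, offered as an assumption rather than a
confirmed diagnosis: the LAM daemons started by a lamboot in an interactive
shell may not be visible to mpirun inside the PBS job, so the job may need to
boot LAM itself. A minimal sketch follows. lam_job.sh is a hypothetical
filename, nodes=2 replaces the original nodes=1 so that two nodes are
requested, and passing $PBS_NODEFILE to lamboot assumes the PBS node file is
usable as a LAM boot schema on this installation.

```shell
#!/bin/sh
# Write a hypothetical job script, lam_job.sh, that boots LAM inside the job.
# The quoted 'EOF' keeps $PBS_NODEFILE and the backticks unexpanded here.
cat > lam_job.sh <<'EOF'
#PBS -S /bin/sh
#PBS -l nodes=2
#PBS -q normal
#PBS -N tielema
#PBS -j oe
echo "I ran on `hostname`"

# Boot LAM on the nodes PBS assigned to this job (assumption: the PBS
# node file works as a boot schema).
lamboot -v "$PBS_NODEFILE"

mpirun -np 2 ls

# Shut the LAM daemons down before the job exits
# (assumption: lamhalt is available in the installed LAM release).
lamhalt
EOF

# Submit it afterwards with: qsub lam_job.sh
```

If the job output still reports that no lamd is running, the lamboot -v
output captured in tielema.o## should show which node failed to boot.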

Thanks.
Hope to get this "computer" problem fixed ASAP so that I can worry about
"science"


"Brian W. Barrett" wrote:

> On Thu, 28 Mar 2002, Senthil Kandasamy wrote:
>
> > I fixed it. On the master node, the default shell was csh whereas on the
> > slaves, it was bash. I changed the default shell to csh on all three
> > nodes and recon worked. Lamboot also worked. I do not understand why
> > this should affect the boot, but hey... it worked. I am not
> > complaining. However, when I lamboot, it recognizes only one CPU per
> > node though the nodes have dual processors. Is this a problem?
>
> I'm not sure why the shell change would make a difference.  We (the LAM
> team) haven't really tested that configuration, so it looks like something
> to test for the next OSCAR release.  We don't do anything that should
> cause problems, but obviously something is :).
>
> As for the CPU count, LAM doesn't actually detect how many CPUs you have,
> it requires you to tell it.  The only purpose in LAM's CPU count is
> scheduling - if you tell it all your nodes are dual CPU machines, you can
> do something like:
>
>   mpirun C foo
>
> and have LAM run two processes on each node.  More importantly, LAM will
> guarantee that processes on the same machine will be "neighbor" ranks in
> MPI_COMM_WORLD - ranks 0 & 1 might be on the first node, for instance.
> Since most communication occurs with "neighbors", this can actually make a
> performance difference on some applications.
>
> You might want to take a look at the LAM/MPI FAQ, available at:
>
>  http://www.lam-mpi.org/faq/
>
> for more information on the topic.  There is a detailed explanation of the
> CPU count stuff under the "Booting LAM" section.
>
> > I just submitted a job through command-line mpirun and it seems to work.
> >
> > Now I have to see if I can submit jobs through pbs. If that works, the
> > cluster will be fully functional.
>
> Hopefully, you have made it through most of the sticky points.  Getting
> LAM or MPICH to run under PBS isn't all that difficult, especially since
> PBS should already be running with a basic FIFO queue enabled.
>
> Brian
>
> --
>   Brian Barrett
>   LAM/MPI developer and all around nice guy
>   Have a LAM/MPI day: http://www.lam-mpi.org/
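
The CPU count described in the quoted reply is declared in the boot schema
passed to lamboot. A minimal sketch, assuming the cpu=N host-file syntax
covered in the LAM/MPI FAQ and using rosenode1 and rosenode2 as the only
hosts (the master's hostname is not given in this thread and would need its
own line if it should run processes too):

```shell
#!/bin/sh
# Build a boot schema named lamhosts (the filename used earlier in this
# thread) that tells LAM each node has two CPUs.
for host in rosenode1 rosenode2; do
    echo "$host cpu=2"
done > lamhosts

# Show the resulting boot schema.
cat lamhosts

# Afterwards, boot and run one process per declared CPU with:
#   lamboot -v lamhosts
#   mpirun C foo
```

With this schema, mpirun C foo would be expected to start four processes,
two per node, as described in the reply above.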


_______________________________________________
Oscar-users mailing list
https://lists.sourceforge.net/lists/listinfo/oscar-users
