Garrick Staples wrote:
On Wed, Jan 24, 2007 at 08:51:11AM +0530, S Ranjan alleged:
Hi
I have the Torque pbs_server running on the head node, which is also the
submit host. There are 32 other compute nodes, listed in the
/var/spool/torque/server_priv/nodes file. There is a single queue at
present. Sometimes, MPI jobs requesting 28/30 nodes end up
running on the head node, even though the head node is not a compute node at
all. netstat -anp shows several sockets being opened for the job, and
eventually the head node hangs.
I'd appreciate any help or suggestions on this.
Which MPI? MPICH? I'd guess mpirun is using the default machinefile
that was created when MPICH was built, rather than the host list provided by
the PBS job.
Run mpirun with "-machinefile $PBS_NODEFILE" or use OSC's mpiexec
instead of mpirun: http://www.osc.edu/~pw/mpiexec/
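For example, a minimal PBS script along those lines would look roughly like
this (./a.out stands in for your actual MPI binary, and the node count is
just illustrative):

  #PBS -l nodes=28
  cd $PBS_O_WORKDIR
  # use the host list PBS generated for this job,
  # not the machinefile created at MPICH build time
  mpirun -np 28 -machinefile $PBS_NODEFILE ./a.out

With OSC's mpiexec the last line is simply "mpiexec ./a.out"; it gets the
node list from PBS itself, so no -machinefile is needed.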
We are using Intel MPI 2.0, and we call mpiexec -n 28 ......
inside the PBS script.
However, we run mpdboot (the executable in the MPI 2.0 bin directory)
before running the PBS script. The exact syntax being used is
mpdboot -n 32 -f mpd.hosts --rsh=ssh -v
The mpd.hosts file, which resides in the user's home directory, contains the
names of the 32 compute nodes (excluding the head node).
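To make the sequence concrete, it is roughly the following (the job script
name and executable are just placeholders):

  # on the head node, before submitting the job:
  mpdboot -n 32 -f mpd.hosts --rsh=ssh -v
  qsub job.pbs

  # inside job.pbs:
  mpiexec -n 28 ./a.out

(As far as I understand, mpdboot also starts an mpd on the machine it is
run from, i.e. the head node in this case.)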
Sutapa Ranjan