Garrick Staples wrote:

On Thu, Jan 25, 2007 at 05:37:56AM +0530, S Ranjan alleged:
Garrick Staples wrote:

On Wed, Jan 24, 2007 at 08:51:11AM +0530, S Ranjan alleged:


Hi

I have the torque pbs_server running on the head node, which is also the submit host. There are 32 other compute nodes, listed in the /var/spool/torque/server_priv/nodes file. There is a single queue at present. Sometimes, MPI jobs requesting 28-30 nodes end up running on the head node, even though the head node is not a compute node at all. netstat -anp shows several sockets being opened for the job, and eventually the head node hangs.
Appreciate any help/suggestion on this.
Which MPI?  MPICH?  I'd guess mpirun is using the default machinefile
that is created when mpich is built, and not the hostlist provided by
the PBS job.

Run mpirun with "-machinefile $PBS_NODEFILE" or use OSC's mpiexec
instead of mpirun: http://www.osc.edu/~pw/mpiexec/
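For example, the relevant portion of the PBS script would look roughly like this (the executable name and node count are placeholders, adjust for your setup):

#PBS -l nodes=28
cd $PBS_O_WORKDIR
mpirun -np $(wc -l < $PBS_NODEFILE) -machinefile $PBS_NODEFILE ./my_mpi_app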

We are using Intel MPI 2.0. We call mpiexec -n 28 ...... inside the PBS script. However, mpdboot (an executable in the MPI 2.0 binary directory) is run before submitting the PBS script. The exact syntax being used is

mpdboot -n 32 -f mpd.hosts --rsh=ssh -v

The mpd.hosts file, residing in the user's home directory, contains the names of the 32 compute nodes (excluding the head node).
There is your problem: you want to use the list of nodes assigned to
your job. So you'll want something like this:
np=$(wc -l < $PBS_NODEFILE)
mpdboot -n $np -f $PBS_NODEFILE --rsh=ssh -v
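One caveat: $PBS_NODEFILE has one line per allocated processor, so if your jobs ever request more than one processor per node you will want to boot one mpd per unique host. Something along these lines should work (the temporary hosts file name is just an example):

sort -u $PBS_NODEFILE > /tmp/mpd.hosts.$PBS_JOBID
np=$(wc -l < /tmp/mpd.hosts.$PBS_JOBID)
mpdboot -n $np -f /tmp/mpd.hosts.$PBS_JOBID --rsh=ssh -v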

But I still recommend using OSC's mpiexec instead.

Hi

Using OSC's mpiexec with the --comm=pmi option (and without any mpdboot), the MPI job starts and then gives the errors below. When used without --comm=pmi, the job just aborts, complaining that it cannot connect to mpd2.console (the same error that is generated if an MPI job is launched without starting mpdboot).
I am using Intel MPI 2.0.
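For reference, the line inside the PBS script is essentially (the executable name and process count are placeholders):

mpiexec --comm=pmi -n 28 ./my_mpi_app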

Thanks in advance for any help/suggestions

Sutapa

aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(385): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75): MPIC_Sendrecv(152): MPIC_Wait(321): MPIDI_CH3_Progress_wait(202): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(1022): [ch3:sock] failed to connnect to 
remote process 339.clustserver-spawn-0:2
MPIDU_Socki_handle_connect(780): connection failure 
(set=0,sock=1,errno=111:Connection refused)
aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(385): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75): MPIC_Sendrecv(152): MPIC_Wait(321): MPIDI_CH3_Progress_wait(202): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(1022): [ch3:sock] failed to connnect to 
remote process 339.clustserver-spawn-0:3
MPIDU_Socki_handle_connect(780): connection failure 
(set=0,sock=1,errno=111:Connection refused)
aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(385): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75): MPIC_Sendrecv(152): MPIC_Wait(321): MPIDI_CH3_Progress_wait(202): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(461): connection_recv_fail(1685):
MPIDU_Socki_handle_read(627): connection failure
(set=0,sock=1,errno=104:Connection reset by peer)
aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(385): MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(75): MPIC_Sendrecv(152): MPIC_Wait(321): MPIDI_CH3_Progress_wait(202): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(1022): [ch3:sock] failed to connnect to 
remote process 339.clustserver-spawn-0:1
MPIDU_Socki_handle_connect(780): connection failure 
(set=0,sock=1,errno=111:Connection refused)
newmpiexec: Warning: tasks 0-3 exited with status 13.





_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
