I've just installed Torque and Maui on an HP blade system. We have 16 nodes,
each with 2 Xeon E5620 processors. Both serial and parallel jobs within a
single node can be submitted and run successfully. However, if I set

#PBS -l nodes=X:ppn=Y (with X larger than 1)

I can see the job in R status with the qstat command, but it is not actually
running. After canceling the job, I get the following error message:
[mpiexec@node1] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:184): assert (!closed) failed
[mpiexec@node1] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:74): unable to send SIGUSR1 downstream
[mpiexec@node1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@node1] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:179): error waiting for event
[mpiexec@node1] main (./ui/mpich/mpiexec.c:397): process manager error waiting for completion
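For reference, these are the commands I used to inspect the job while it
showed R (JOBID stands for the actual job id; the allocation itself looks
normal to me):

qstat -n JOBID        # -n also lists the nodes/slots assigned to the job
checkjob JOBID        # maui's view of the job and its allocation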
I've also found that the $PBS_NODEFILE (e.g. the JOBID.node1 file in
/var/spool/torque/aux) exists only on the first node among the nodes assigned
to this job.
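A quick way to see this from inside a job script (the path is the default
from my install):

echo "nodefile: $PBS_NODEFILE"   # normally /var/spool/torque/aux/JOBID.node1
cat "$PBS_NODEFILE"              # one line per allocated slot
wc -l < "$PBS_NODEFILE"          # should equal the total slots requested

On the other nodes of the allocation, the file is simply absent.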
Furthermore, if I replace $PBS_NODEFILE with a local file listing the compute
nodes in the PBS script, it works well and the job runs on all the assigned
nodes:
#!/bin/sh
#PBS -N name
#PBS -e errorfile
#PBS -o outfile
#PBS -q test
#PBS -l nodes=2
cd $work_dir    # $work_dir holds the job's working directory
#mpiexec -f $PBS_NODEFILE ./executables....
mpiexec -f hosts ./executables....
where the hosts file contains:
node1
node1
node1
node1
node2
node2
node2
node2
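(In case anyone wants to reproduce this: the hosts file above can be built
with a simple loop; the node names and the four slots per node match my
setup.)

for n in node1 node2; do
    for i in 1 2 3 4; do echo "$n"; done
done > hosts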
Interestingly, if I add :ppn=4 after #PBS -l nodes=2, i.e.

#PBS -l nodes=2:ppn=4

the PBS script fails again, even if I use the local hosts file.
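In case the node definitions matter here, these are the commands I'd use to
double-check how many slots Torque and Maui believe each node has (node1 is
just an example):

pbsnodes node1        # torque's view: state, np, properties
checknode node1       # maui's view of the same node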
Can anyone help me?