Dear all,
I ran into a problem when I tried to submit MPI jobs to PBS using MPICH-G2. I plan
to simulate a grid environment using a 36-node cluster in my lab. For now, I am
using 3 of its machines to simulate a 2-node cluster and a portal server. The
2-node cluster consists of one master node and one slave node. The master node
also acts as a compute node. The portal server is used to submit jobs.
The configuration of my test environment is as follows:
No NFS or other shared file system is used on my cluster. First, I installed
GT4.2.1 on the portal server and the master node. The 2-node cluster is
managed by Torque 2.3.0. Then I installed MPICH-1.2.7 on the master node and
copied the install directory to the slave node. After that, I re-installed
MPICH-1.2.7 on the master node, using the command "./configure
--with-device=globus2:-flavor=gcc32dbg" to enable MPICH-G2. The portal server
also has MPICH-G2 installed.
Then I tried to submit some test MPI jobs from the portal server. The machine
file is as follows:
"master.cluster.net" 2
The RSL file is as follows:
+
( &(resourceManagerContact="master.cluster.net/jobmanager-pbs")
(count=2)
(label="subjob 0")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
(directory=/usr/local)
(executable=/usr/local/helloworld)
)
On the portal server, I used "mpirun -machinefile machines -globusrsl
hello.rsl" to submit the job. But after submission, the job never finished, and
nothing was displayed on the screen. The cursor just stayed there and never
returned. I used the "qstat" command on the master node to check the state of
the job, and found that the submitted job always stayed in state "R" and
never finished. The "tracejob" command showed the following information:
03/16/2009 11:39:17 M JOIN JOB as node 1
03/16/2009 11:39:17 S enqueuing into batch, state 1 hop 1
03/16/2009 11:39:17 S Job queued at request of [email protected],
[email protected], job name=STDIN, queue=batch
03/16/2009 11:39:17 S Job Modified at request of [email protected]
03/16/2009 11:39:17 L Job Run
03/16/2009 11:39:17 S Job Run at request of [email protected]
03/16/2009 11:39:17 A queue=batch
03/16/2009 11:39:17 A user=ciarlab group=ciarlab jobname=STDIN queue=batch
ctime=1237174757 qtime=1237174757 etime=1237174757 start=1237174757
[email protected] exec_host=master/o+slave/o
Resource_List.neednodes=2 Resource_List.nodect=2 Resource_List.nodes=2
Resource_List.walltime=01:00:00
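
The exec_host field in the record above lists the allocated nodes as host/slot
entries joined by "+". A minimal Python sketch for pulling the host names out of
it, assuming that format (parse_exec_host and unique_hosts are hypothetical
helper names, not part of Torque):

```python
def parse_exec_host(value):
    """Split an exec_host string into (host, slot) pairs.

    Assumes entries look like "host/slot" and are joined by "+".
    """
    pairs = []
    for entry in value.split("+"):
        host, _, slot = entry.partition("/")
        pairs.append((host, slot))
    return pairs


def unique_hosts(value):
    """Return the distinct host names, in the order they were allocated."""
    hosts = []
    for host, _slot in parse_exec_host(value):
        if host not in hosts:
            hosts.append(host)
    return hosts


# exec_host value copied verbatim from the tracejob output
allocation = "master/o+slave/o"
print(unique_hosts(allocation))  # prints ['master', 'slave']
```

This only confirms that Torque assigned both the master and the slave node to
the job; it says nothing about whether the processes actually started there.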
So my question is: why does the submitted job never finish and instead stay
in the "Running" state? I should point out that if I change "count=2" in the
RSL to "count=1", the job finishes on the master node and returns the
following output to the portal server:
Hello World! process 0 of 1 on master.cluster.net
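
The working and the hanging case differ only in the count value of the RSL. A
minimal Python sketch for producing both variants from one template, so that
nothing else changes between test runs (RSL_TEMPLATE and render_rsl are
hypothetical names, not part of Globus or MPICH-G2; the RSL text is the one
from my test):

```python
# RSL text from the test setup, with the contact string and count as placeholders
RSL_TEMPLATE = """+
( &(resourceManagerContact="{contact}")
(count={count})
(label="subjob 0")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
(directory=/usr/local)
(executable=/usr/local/helloworld)
)
"""


def render_rsl(count, contact="master.cluster.net/jobmanager-pbs"):
    """Return the RSL text for a single subjob with the given process count."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return RSL_TEMPLATE.format(contact=contact, count=count)


# Print the two variants that were compared: count=1 (works) and count=2 (hangs)
for n in (1, 2):
    print(render_rsl(n))
```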
I have searched online for this problem but found few results, so I hope I can
get some support from you. Any help will be much appreciated.
Thanks!
Tracy