Dear all,
I ran into a problem when submitting MPI jobs to PBS using MPICH-G2. I plan 
to simulate a grid environment using a 36-node cluster in my lab. For now, I 
use 3 of its machines to simulate a 2-node cluster and a portal server. The 
2-node cluster consists of one master node and one slave node; the master node 
also acts as a compute node. The portal server is used to submit jobs.
 
Here is the configuration of my test environment:
No NFS or other shared file system is used on my cluster. First, I installed 
GT4.2.1 on the portal server and the master node. The 2-node cluster is 
managed by Torque 2.3.0. Then I installed MPICH-1.2.7 on the master node and 
copied the install directory to the slave node. After that, I re-installed 
MPICH-1.2.7 on the master node, using the command 
"./configure --with-device=globus2:-flavor=gcc32dbg" to build MPICH-G2. The 
portal server also has MPICH-G2 installed.
 
Then I tried to submit some test MPI jobs from the portal server. The machine 
file is as follows:
"master.cluster.net" 2
The RSL file is as follows:
+
( &(resourceManagerContact="master.cluster.net/jobmanager-pbs") 
   (count=2)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0))
   (directory=/usr/local)
   (executable=/usr/local/helloworld)
)
On the portal server, I used "mpirun -machinefile machines -globusrsl 
hello.rsl" to submit the job. But after submission, the job never finished, 
and nothing was displayed on the screen. The cursor just stayed there and 
never returned. I used the "qstat" command on the master node to check the 
state of the job, and found that the submitted job stayed in state "R" and 
never finished. The "tracejob" command showed the following information:
03/16/2009 11:39:17 M JOIN JOB as node 1
03/16/2009 11:39:17 S enqueuing into batch, state 1 hop 1
03/16/2009 11:39:17 S Job queued at request of [email protected], 
[email protected], job name=STDIN, queue=batch
03/16/2009 11:39:17 S Job Modified at request of [email protected]
03/16/2009 11:39:17 L Job Run
03/16/2009 11:39:17 S Job Run at request of [email protected]
03/16/2009 11:39:17 A queue=batch
03/16/2009 11:39:17 A user=ciarlab group=ciarlab jobname=STDIN queue=batch 
ctime=1237174757 qtime=1237174757 etime=1237174757 start=1237174757 
[email protected] exec_host=master/o+slave/o 
Resource_List.neednodes=2 Resource_List.nodect=2 Resource_List.nodes=2 
Resource_List.walltime=01:00:00
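
As a sanity check that PBS really allocated both nodes, the accounting record above can be split into its key=value fields. The following is only a minimal Python sketch; the helper names parse_accounting and exec_hosts are hypothetical and not part of Torque, and the record string is copied from the tracejob output above.

```python
# Illustrative helpers (hypothetical, not part of Torque): split a tracejob
# accounting record into key=value fields and list the hosts in exec_host.

def parse_accounting(record):
    """Return a dict of the key=value tokens found in the record."""
    fields = {}
    for token in record.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that contain no '='
            fields[key] = value
    return fields


def exec_hosts(fields):
    """Return the host names from an exec_host value like 'a/x+b/y'."""
    entries = fields.get("exec_host", "").split("+")
    return [entry.split("/")[0] for entry in entries if entry]


# Accounting fields copied from the tracejob output above.
record = (
    "user=ciarlab group=ciarlab jobname=STDIN queue=batch "
    "exec_host=master/o+slave/o "
    "Resource_List.neednodes=2 Resource_List.nodect=2 Resource_List.nodes=2 "
    "Resource_List.walltime=01:00:00"
)

fields = parse_accounting(record)
hosts = exec_hosts(fields)
print("hosts allocated:", hosts)
print("matches nodect:", int(fields["Resource_List.nodect"]) == len(hosts))
```

Both master and slave appear in exec_host, so PBS does seem to have started the job on both nodes; the hang appears to happen after allocation.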
 
So my question is: why does the submitted job never finish and instead stay 
in the "R" (running) state? I should point out that if I change "count=2" in 
the RSL file to "count=1", the job finishes on the master node and returns 
the following output to the portal server:
Hello World! process 0 of 1 on master.cluster.net
 
I have searched Google for this problem but found few results, so I hope I 
can get some help from you. Any help will be much appreciated.
 
Thanks!
Tracy

