Mehdi Sheikhalishahi,

Thank you very much! But I'm not sure I fully understand what you mean. As you suggested, I have shared the MPICH-G2 installation from the head node with the slaves and interfaced my Torque with GT. To avoid the private-IP problem, I first tested with Cluster 1 only. I began by testing Torque with the MPICH2 installation I have been using, and it works well. In my environment, $PBS_HOME=/var/spool/torque, and $PBS_HOME/server_priv/nodes looks like this:
m01.c01
s01.c01
s02.c01
s03.c01
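(Side note: Torque's nodes file can also state how many processors each host contributes, which matters once more than one process per host is wanted. A hedged illustration only; the np values below are assumptions, not taken from this thread:)

```
m01.c01 np=2
s01.c01 np=2
s02.c01 np=2
s03.c01 np=2
```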
My PBS job script file "test.pbs":
#!/bin/bash
#PBS -l nodes=4
$MPICH2_HOME/bin/mpiexec -np 4 /home/nrs/cpi
The output:
[[EMAIL PROTECTED] ~]$ qsub ./test.pbs
Process 3 on s03.c01
Process 2 on s02.c01
pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.199736
Process 1 on s01.c01
Process 0 on m01.c01
Then I tested MPICH-G2 as follows:
cpi.rsl:
+
( &(resourceManagerContact="m01.c01/jobmanager-pbs")
   (count=4)
   (jobtype=mpi)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
                (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
   (directory="/home/gt/examples")
   (executable="/home/gt/examples/cpi")
)
It failed and the output is:
[[EMAIL PROTECTED] ~]$ $MPICH-G2_HOME/mpirun -globusrsl ./cpi.rsl
    Submission of subjob (label = "subjob 0") failed because authentication with the remote server failed (error code 57)
    Submission of subjob (label = "subjob 1") failed because the connection to the server failed (check host and port) (error code 62)
    Submission of subjob (label = "subjob 2") failed because the connection to the server failed (check host and port) (error code 62)
    Submission of subjob (label = "subjob 3") failed because the connection to the server failed (check host and port) (error code 62)
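(Error code 57 generally indicates that GSI authentication with the remote gatekeeper failed, while 62 is a plain connection failure. Before changing the RSL, the authentication path can be tested on its own with globusrun's authenticate-only mode. A hedged sketch; it assumes a proxy has already been created with grid-proxy-init, and the quoted success message is from memory:)

```
# Is there a valid proxy, and how long is it still valid?
grid-proxy-info

# Authenticate against the gatekeeper only; no job is submitted.
globusrun -a -r "m01.c01/jobmanager-pbs"
# "GRAM Authentication test successful" would mean credentials and the
# grid-mapfile are fine; otherwise check the grid-mapfile entries and
# host certificates on m01.c01.
```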
So I googled it, and someone said that I had to remove the line "(jobtype=mpi)" if I don't use a vendor MPI. I did so and the errors were gone, but it seems that all the processes ran on the head node and none on the slaves.
[[EMAIL PROTECTED] ~]$ $MPICH-G2_HOME/mpirun -globusrsl ./cpi.rsl
Process 3 on m01.c01
Process 2 on m01.c01
pi is approximately 3.1416009869231245, Error is 0.0000083333333314
wall clock time = 0.199736
Process 1 on m01.c01
Process 0 on m01.c01
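(For what it's worth: if I understand the GRAM defaults correctly, removing "(jobtype=mpi)" leaves the subjob with jobtype=multiple, in which case the pre-WS GRAM PBS job manager, not mpirun, is responsible for spreading the `count` processes over the nodes Torque allocated; if the job manager has no remote-launch mechanism configured, every process can land on the submit host, which would match the output above. A hedged variant of the RSL that at least makes the jobtype explicit, with the same paths and hostnames as before:)

```
+
( &(resourceManagerContact="m01.c01/jobmanager-pbs")
   (count=4)
   (jobtype=multiple)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
                (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
   (directory="/home/gt/examples")
   (executable="/home/gt/examples/cpi")
)
```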

So could you kindly tell me what's wrong with my configuration? Thanks in advance.

Best Regards,
Narisu,
Beihang University,
Beijing,
China.
Email:[EMAIL PROTECTED]

You don't need to install GT on the other hosts. First, you must install a local resource manager, e.g. Torque, on your cluster and configure it properly to use the cluster's resources. In addition, you must interface it with GT. Second, you must use your head node's address as the value of resourceManagerContact, for example resourceManagerContact=headnode.126.com/jobmanager-pbs. On the slave nodes you need to install MPICH-G2, not Globus. It is better to share the MPICH-G2 installation from the head node with the slaves.
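(One way to sanity-check the Torque-GT interface described above, before involving MPICH-G2 at all, is to push a trivial single-process job through the PBS job manager. A minimal sketch; the hostname is the one from this thread, and the globusrun line in the comments assumes a valid proxy exists:)

```shell
# Write a one-process RSL that just reports which node it lands on.
cat > /tmp/test.rsl <<'EOF'
&(count=1)
 (executable="/bin/hostname")
EOF

# On the grid client you would first create a proxy:   grid-proxy-init
# then submit through the PBS job manager with output redirected back:
#   globusrun -o -r "m01.c01/jobmanager-pbs" -f /tmp/test.rsl
# If this prints one of the slave hostnames, Torque and jobmanager-pbs are
# wired up correctly.

grep -q '/bin/hostname' /tmp/test.rsl && echo "RSL written"
```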

On 10/8/07, 那日苏 <[EMAIL PROTECTED]> wrote:
I'm trying to build a grid with 2 clusters, each with 4
nodes (1 master and 3 slaves). Here are the hostnames:

Cluster 1:
master: m01.c01 slaves: s01.c01 s02.c01 s03.c01
Cluster 2:
master: m02.c02 slaves: s01.c02 s02.c02 s03.c02

I have installed globus 4.0.5 and MPICH-G2 1.2.7 on the 2 master nodes
and successfully run the cpi example with the MPICH-G2 package on the
two master nodes like below:

Command line:
mpirun -globusrsl ./cpi.rsl
cpi.rsl:
> +
> ( &(resourceManagerContact="m01.c01")
> (label="subjob 0")
> (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
> (LD_LIBRARY_PATH /usr/local/globus-4.0.5/lib/))
> (directory="/home/gt/examples")
> (executable="/home/gt/examples/cpi")
> )
> ( &(resourceManagerContact="m02.c02")
> (label="subjob 1")
> (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
> (LD_LIBRARY_PATH /usr/local/globus-4.0.5/lib/))
> (directory="/home/gt/examples")
> (executable="/home/gt/examples/cpi")
> )
$MPICH-G2_HOME/bin/machines:
> m01.c01
> m02.c02
But I don't know how to incorporate the other 6 slave nodes in my job. I know
that MPICH-G2 can use the vendor MPI on the slave nodes, but I don't use a
vendor MPI; I was always using MPICH2 within each cluster before. I
have read elsewhere that we can use MPICH-G2 directly on the slave
nodes without a vendor MPI. But as far as I know, if you want to run an
MPICH-G2 subjob on a machine, Globus has to be installed on it.
Installing Globus on every node in my clusters sounds horrible and
impractical to me, so is there any other way to handle this?




--
Best Regards,
S.Mehdi Sheikhalishahi,
Web: http://www.cse.shirazu.ac.ir/~alishahi/
Bye.
