Yes, I installed mpich-g2 in /usr/local/mpich-g2 and I compiled my job with
/usr/local/mpich-g2/bin/mpicc.
Was your job compiled with mpich-g2?
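One quick way to check is to inspect the binary itself. This is a sketch; the path /home/gt/examples/cpi is taken from your post, and `ldd` only works on dynamically linked executables:

```shell
# Check which mpicc is first on PATH (it should be the mpich-g2 one):
which mpicc

# List the shared libraries the binary was linked against; a dynamically
# linked mpich-g2 build typically pulls in Globus libraries as well:
ldd /home/gt/examples/cpi | grep -i -e mpi -e globus
```

If the grep shows nothing MPI- or Globus-related, the job was probably built with a different compiler wrapper.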
On 10/9/07, 那日苏 <[EMAIL PROTECTED]> wrote:
Hi All,
I have a cluster with 1 head node and 3 slave nodes, and their hostnames are:
master: m01.c01
slaves: s01.c01 s02.c01 s03.c01
So I want to build a small grid. I installed mpich-g2, globus, and torque
on my cluster, and the slaves share the mpich-g2 installation on the head
node. I successfully ran the "hello world" example shipped with the mpich
package, which proves my installation of globus is OK. Then I interfaced
torque with globus and submitted the "hello world" job above with an RSL
file like this:
+
( &(resourceManagerContact="m01.c01/jobmanager-pbs")
(count=2)
(label="subjob 0")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
(LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
(directory=/home/gt)
(executable=/home/gt/hello/hello)
)
It worked very well and the output is:
[[EMAIL PROTECTED] hello]$ globusrun -w -f hello.rsl
hello, world
hello, world
Then I set $mpirun in pbs.pm to $MPICH-G2_HOME/bin/mpirun, and submitted
an mpich-g2 job: the classic "cpi" program shipped with the mpich package.
But it failed. This is the RSL file:
+
( &(resourceManagerContact="m01.c01/jobmanager-pbs")
(count=4)
(jobtype=mpi)
(label="subjob 0")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
(LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
(directory="/home/gt/examples")
(executable="/home/gt/examples/cpi")
)
The output is:
[[EMAIL PROTECTED] examples]$ ./mpirun -globusrsl cpi.rsl
Submission of subjob (label = "subjob 0") failed because authentication with the remote server failed (error code 57)
Submission of subjob (label = "subjob 1") failed because the connection to the server failed (check host and port) (error code 62)
Submission of subjob (label = "subjob 2") failed because the connection to the server failed (check host and port) (error code 62)
Submission of subjob (label = "subjob 3") failed because the connection to the server failed (check host and port) (error code 62)
So I googled it, and someone said that I have to remove the line
"(jobtype=mpi)" if I don't use vendor MPI. I did that and the errors were
gone, but it seems like all the processes ran on the head node while none
ran on the slaves:
[gt@m01.c01 examples]$ ./mpirun -globusrsl cpi.rsl
Process 3 on m01.c01
Process 2 on m01.c01
Process 1 on m01.c01
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.083140
Process 0 on m01.c01
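For what it's worth, MPICH-G2 distributes processes across resources via DUROC subjobs, so an RSL with a single subjob leaves placement entirely to that one jobmanager. A multi-subjob RSL looks roughly like the sketch below (reusing the contact, paths, and environment from my files above; whether a single jobmanager-pbs subjob spreads across PBS nodes is a separate question of the jobmanager's configuration):

+
( &(resourceManagerContact="m01.c01/jobmanager-pbs")
   (count=2)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
                (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
   (directory="/home/gt/examples")
   (executable="/home/gt/examples/cpi")
)
( &(resourceManagerContact="m01.c01/jobmanager-pbs")
   (count=2)
   (label="subjob 1")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
                (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
   (directory="/home/gt/examples")
   (executable="/home/gt/examples/cpi")
)

Each subjob gets its own label and GLOBUS_DUROC_SUBJOB_INDEX, and the counts add up to the total process count.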
Could anyone tell me what's wrong with it? Thanks in advance!
Best Regards,
Narisu,
Beihang University,
Beijing,
China.
Email:[EMAIL PROTECTED]
--
Best Regards,
S.Mehdi Sheikhalishahi,
Web: http://www.cse.shirazu.ac.ir/~alishahi/
Bye.