Was your job compiled with MPICH-G2?

On 10/9/07, 那日苏 <[EMAIL PROTECTED]> wrote:
>
>  Hi, All,
>
> I have a cluster with 1 head node and 3 slave nodes, and their hostnames
> are:
>
> master:     m01.c01
> slaves:     s01.c01     s02.c01     s03.c01
>
> So I want to build a small grid. I installed MPICH-G2, Globus, and Torque on
> my cluster, and the slaves share the MPICH-G2 installation on the head
> node. I ran the "hello world" example shipped with the MPICH package,
> which shows that my installation of Globus is OK. Then I interfaced Torque
> with Globus and submitted the "hello world" job above with an RSL file like
> this:
>
> +
> ( &(resourceManagerContact="m01.c01/jobmanager-pbs")
>    (count=2)
>    (label="subjob 0")
>    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
>        (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
>    (directory=/home/gt)
>    (executable=/home/gt/hello/hello)
> )
>
> It worked very well and the output is:
>
> [EMAIL PROTECTED] hello]$ globusrun -w -f hello.rsl
> hello, world
> hello, world
>
> Then I set $mpirun in pbs.pm to $MPICH-G2_HOME/bin/mpirun and
> submitted an MPICH-G2 job: the classic "cpi" program from the MPICH
> package. But it failed. This is the RSL file:
>
> +
> ( &(resourceManagerContact="m01.c01/jobmanager-pbs")
>    (count=4)
>    (jobtype=mpi)
>    (label="subjob 0")
>    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
>                 (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
>    (directory="/home/gt/examples")
>    (executable="/home/gt/examples/cpi")
> )
>
> The output is:
>
> [EMAIL PROTECTED] examples]$ ./mpirun -globusrsl cpi.rsl
>     Submission of subjob (label = "subjob 0") failed because
> authentication with the remote server failed (error code 57)
>     Submission of subjob (label = "subjob 1") failed because the
> connection to the server failed (check host and port) (error code 62)
>     Submission of subjob (label = "subjob 2") failed because the
> connection to the server failed (check host and port) (error code 62)
>     Submission of subjob (label = "subjob 3") failed because the
> connection to the server failed (check host and port) (error code 62)
>
> So I googled it, and someone said that I have to remove the line
> "(jobtype=mpi)" if I don't use vendor MPI. I did, and the errors were
> gone, but it seems that all the processes ran on the head node and none
> on the slaves:
>
> [EMAIL PROTECTED] examples]$ ./mpirun -globusrsl cpi.rsl
> Process 3 on m01.c01
> Process 2 on m01.c01
> Process 1 on m01.c01
> pi is approximately 3.1416009869231249, Error is 0.0000083333333318
> wall clock time = 0.083140
> Process 0 on m01.c01
>
> Could anyone tell me what's wrong with it? Thanks in advance!
>
> Best Regards,
> Narisu,
> Beihang University,
> Beijing,
> China.
> Email:[EMAIL PROTECTED]
>



-- 
Best Regards,
S.Mehdi Sheikhalishahi,
Web: http://www.cse.shirazu.ac.ir/~alishahi/
Bye.
