Hi, everyone,

I've made a little progress, but I still have some problems. I checked the file 
$GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm and found this line:
> $cluster = 0; 
I remembered that I had changed it from $cluster = 1, but I had forgotten why. 
Then I looked it up in the Globus documentation on the official site and found 
out that if you set $cluster to 0, Globus treats the resource as an SMP machine 
rather than a cluster. Since I'm obviously using a cluster, I changed it back, 
and I think this is why all the processes of my MPICH-G2 job ran on the head 
node. Then I ran the same job again:
> + ( &(resourceManagerContact="m01.c01/jobmanager-pbs") (count=4)
> (jobtype=mpi) (label="subjob 0")
> (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0) (LD_LIBRARY_PATH
> /usr/local/gt-4.0.1/lib/)) (directory="/home/gt/examples")
> (executable="/home/gt/examples/cpi") )
But this time it just hangs with no response, neither failing nor succeeding:
> [EMAIL PROTECTED] examples]$ ./mpirun -globusrsl cpi.rsl 
At first I thought it might be because the head node could not get a shell on 
the slaves, so I double-checked the passwordless ssh between the head node and 
the slaves; it works fine. I also made sure I configured the PBS job manager 
to use ssh rather than rsh. So, to sum up, I'm still stuck. Could anyone help 
me out? Thank you very much.
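For reference, these are the checks I'm running while the job hangs. The hostnames and paths are the ones from my cluster; the variable names are my reading of the stock GT4 PBS job manager, and the log location is a guess based on pre-WS GRAM defaults, so please correct me if yours differ:

```shell
# 1. Confirm the job-manager settings (cluster mode, CPUs per node, mpirun path):
grep -E '\$(cluster|cpu_per_node|mpirun|remote_shell)\b' \
    "$GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm"

# 2. Authenticate-only test against the PBS job manager
#    (the earlier error 57 pointed at an authentication problem):
globusrun -a -r m01.c01/jobmanager-pbs

# 3. While mpirun hangs, check whether the job ever reached Torque:
qstat -a          # is the job queued or running?
pbsnodes -a       # are s01.c01..s03.c01 up and marked 'free'?

# 4. Look at the most recent GRAM job manager log for the stuck job
#    (pre-WS GRAM usually writes these into $HOME):
ls -lt "$HOME"/gram_job_mgr_*.log | head -1
```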

Best Regards,
Narisu,
Beihang University,
Beijing,
China.
Email: [EMAIL PROTECTED]

> Yes, I installed mpich-g2 in /usr/local/mpich-g2 and I compiled my job
> with /usr/local/mpich-g2/bin/mpicc.
>
>> Was your job compiled with mpich-g2?
>>
>>
>> On 10/9/07, *那日苏* <[EMAIL PROTECTED]> wrote:
>>
>>     Hi, All,
>>
>>     I have a cluster with 1 head node and 3 slave nodes, and their
>>     hostnames are:
>>
>>     master: m01.c01
>>     slaves: s01.c01 s02.c01 s03.c01
>>
>>     So I want to build a small grid. I installed MPICH-G2, Globus,
>>     and Torque on my cluster, and the slaves share the MPICH-G2
>>     installation on the head node. I have run the "hello world"
>>     example that comes with the mpich package, which proves my
>>     installation of Globus is OK. Then I interfaced Torque with
>>     Globus and submitted the "hello world" job with an RSL file
>>     like this:
>>>     +
>>>     ( &(resourceManagerContact="m01.c01/jobmanager-pbs")
>>>     (count=2)
>>>     (label="subjob 0")
>>>     (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
>>>     (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
>>>     (directory=/home/gt)
>>>     (executable=/home/gt/hello/hello)
>>>     )
>>     It worked very well and the output is:
>>>     [EMAIL PROTECTED] hello]$ globusrun -w -f hello.rsl
>>>     hello, world
>>>     hello, world
>>     Then I set $mpirun in pbs.pm to
>>     $MPICH-G2_HOME/bin/mpirun and submitted an MPICH-G2 job: the
>>     classic "cpi" program that comes with the mpich package. But
>>     it failed. This is the RSL file:
>>>     +
>>>     ( &(resourceManagerContact="m01.c01/jobmanager-pbs")
>>>     (count=4)
>>>     (jobtype=mpi)
>>>     (label="subjob 0")
>>>     (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
>>>     (LD_LIBRARY_PATH /usr/local/gt-4.0.1/lib/))
>>>     (directory="/home/gt/examples")
>>>     (executable="/home/gt/examples/cpi")
>>>     )
>>     The output is:
>>>     [EMAIL PROTECTED] examples]$ ./mpirun -globusrsl cpi.rsl
>>>     Submission of subjob (label = "subjob 0") failed because
>>>     authentication with the remote server failed (error code 57)
>>>     Submission of subjob (label = "subjob 1") failed because the
>>>     connection to the server failed (check host and port) (error
>>>     code 62)
>>>     Submission of subjob (label = "subjob 2") failed because the
>>>     connection to the server failed (check host and port) (error
>>>     code 62)
>>>     Submission of subjob (label = "subjob 3") failed because the
>>>     connection to the server failed (check host and port) (error
>>>     code 62)
>>     So I googled it, and someone said that I had to remove the line
>>     "(jobtype=mpi)" if I don't use vendor MPI. I did so, and the
>>     errors were gone, but it seems all the processes ran on the
>>     head node while none ran on the slaves:
>>>     [EMAIL PROTECTED] examples]$ ./mpirun -globusrsl cpi.rsl
>>>     Process 3 on m01.c01
>>>     Process 2 on m01.c01
>>>     Process 1 on m01.c01
>>>     pi is approximately 3.1416009869231249, Error is 0.0000083333333318
>>>     wall clock time = 0.083140
>>>     Process 0 on m01.c01
>>     Could anyone tell me what's wrong with it? Thanks in advance!
>>
>>     Best Regards,
>>     Narisu,
>>     Beihang University,
>>     Beijing,
>>     China.
>>     Email: [EMAIL PROTECTED]
>>
>>
>>
>>
>> -- 
>> Best Regards,
>> S.Mehdi Sheikhalishahi,
>> Web: http://www.cse.shirazu.ac.ir/~alishahi/
>> Bye.
>>
>
