hi all,

i have a problem (as many other people have had) with running mpich-g2
applications on multiple clusters. here's my situation: i have two
multicore machines, the 4-core "wn001.grid.info.uvt.ro" and the 8-core
"ardbeg.cs.st-andrews.ac.uk". i want to run the test program (ring.c)
from the MPICH-G2 documentation.

i can run simple jobs from one machine using the gatekeeper on the other. for
example, from "wn001.grid.info.uvt.ro" i can run:

j...@wn001:~$ globus-job-run "ardbeg.cs.st-andrews.ac.uk::/O=Grid/OU=SCIEnce/CN=host/ardbeg.cs.st-andrews.ac.uk" /bin/date
Mon Mar 23 21:14:51 GMT 2009

also, from ardbeg.cs.st-andrews.ac.uk, i can run:

[...@ardbeg mpich-g2]$ globus-job-run "wn001.grid.info.uvt.ro" /bin/date
Mon Mar 23 23:16:21 EET 2009

but when i run an mpich-g2 job from ardbeg.cs.st-andrews.ac.uk, using this rsl
file:

+
( &(resourceManagerContact="ardbeg.cs.st-andrews.ac.uk::/O=Grid/OU=SCIEnce/CN=host/ardbeg.cs.st-andrews.ac.uk")
   (count=8)
   (label="subjob 0")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
                (LD_LIBRARY_PATH /usr/local/globus-4.0.1/lib/))
   (directory="/home/vj/tests/mpich-g2")
   (executable="/home/vj/tests/mpich-g2/ring")
)
( &(resourceManagerContact="wn001.grid.info.uvt.ro")
   (count=4)
   (label="subjob 8")
   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
                (LD_LIBRARY_PATH /opt/globus-4.2.1/lib/))
   (directory="/home/users/jv/tests/mpich-g2")
   (executable="/home/users/jv/tests/mpich-g2/ring")
)

the programs start correctly on all machines, but then nothing happens. that
is, 8 processes named ring are created on ardbeg (and 4 more on
wn001.grid.info.uvt.ro), but they never make any progress: in 'ps', the cpu
time of all of them stays at 0:00. for example:
jv        7763  0.0  0.1   9400  4704 ?        S    23:06   0:00
/home/users/jv/tests/mpich-g2/ring
jv        7764  0.0  0.1   9156  4696 ?        S    23:06   0:00
/home/users/jv/tests/mpich-g2/ring
jv        7765  0.0  0.1   9156  4704 ?        S    23:06   0:00
/home/users/jv/tests/mpich-g2/ring

i can't see anything too suspicious in the gram_job_mgr_XXXX.log files. in both
of them, after successful setup, something like

3/23 23:19:39 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
3/23 23:19:49 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
3/23 23:19:49 JMI: testing job manager scripts for type fork exist and
permissions are ok.
3/23 23:19:49 JMI: completed script validation: job manager type is fork.
3/23 23:19:49 JMI: in globus_gram_job_manager_poll()
3/23 23:19:49 JMI: local stdout filename = /home/users/jv/.globus/job/
wn001.grid.info.uvt.ro/7755.1237842393/stdout.
3/23 23:19:49 JMI: local stderr filename = /home/users/jv/.globus/job/
wn001.grid.info.uvt.ro/7755.1237842393/stderr.
3/23 23:19:49 JMI: poll: seeking:
https://wn001.grid.info.uvt.ro:42514/7755/1237842393/
3/23 23:19:49 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl
scripts)
3/23 23:19:49 JMI: cmd = poll
3/23 23:19:49 JMI: returning with success

is repeated over and over again. i guess this is alright, because i get the
same log when i successfully run mpich-g2 jobs on a single cluster.
also, in gatekeeper.log on wn001.grid.info.uvt.ro, i see:

TIME: Mon Mar 23 23:16:20 2009
 PID: 8073 -- Notice: 5: Authenticated globus user:
/O=Grid/OU=SCIEnce/CN=Vladimir Janjic
TIME: Mon Mar 23 23:16:20 2009
 PID: 8073 -- Notice: 0: GATEKEEPER_JM_ID
2009-03-23.23:16:20.0000008073.0000000000 for /O=Grid/OU=SCIEnce/CN=Vladimir
Janjic on 138.251.214.66
TIME: Mon Mar 23 23:16:20 2009
 PID: 8073 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=7
TIME: Mon Mar 23 23:16:20 2009
 PID: 8073 -- Notice: 5: Requested service: jobmanager
TIME: Mon Mar 23 23:16:20 2009
 PID: 8073 -- Notice: 5: Authorized as local user: jv
TIME: Mon Mar 23 23:16:20 2009
 PID: 8073 -- Notice: 5: Authorized as local uid: 1012
TIME: Mon Mar 23 23:16:20 2009
 PID: 8073 -- Notice: 5:           and local gid: 513
TIME: Mon Mar 23 23:16:20 2009
 PID: 8073 -- Notice: 0: executing
/opt/globus-4.2.1//libexec/globus-job-manager
TIME: Mon Mar 23 23:16:20 2009
 PID: 8073 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=11
TIME: Mon Mar 23 23:16:20 2009
 PID: 8073 -- Notice: 0: Child 8074 started

which seems ok.
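one thing i have not ruled out is a firewall between the two sites: as far as i understand, the subjob processes open direct TCP connections to each other during startup, so if only a limited port range is open between the clusters, globus would presumably have to be restricted to it via GLOBUS_TCP_PORT_RANGE in each subjob's environment clause — something like this (the 40000,40100 range is just a made-up example, not a real value from my setup):

```
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
             (GLOBUS_TCP_PORT_RANGE "40000,40100")
             (LD_LIBRARY_PATH /usr/local/globus-4.0.1/lib/))
```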
where should i start looking for a solution?

thanks a lot,
vladimir
