Hi all,
I have a problem (as many other people have had) with running MPICH-G2
applications across multiple clusters. Here is my situation: I have two
multicore machines, the 4-core "wn001.grid.info.uvt.ro" and the 8-core
"ardbeg.cs.st-andrews.ac.uk". I want to run the test program (ring.c)
from the MPICH-G2 documentation.
I can run simple jobs from machine A using the gatekeeper on machine B. For
example, from "wn001.grid.info.uvt.ro" I can run:
j...@wn001:~$ globus-job-run "ardbeg.cs.st-andrews.ac.uk::/O=Grid/OU=SCIEnce/CN=host/ardbeg.cs.st-andrews.ac.uk" /bin/date
Mon Mar 23 21:14:51 GMT 2009
Also, from ardbeg.cs.st-andrews.ac.uk, I can run:
[...@ardbeg mpich-g2]$ globus-job-run "wn001.grid.info.uvt.ro" /bin/date
Mon Mar 23 23:16:21 EET 2009
But when I run an MPICH-G2 job from ardbeg.cs.st-andrews.ac.uk, using this
RSL file:
+
( &(resourceManagerContact="ardbeg.cs.st-andrews.ac.uk::/O=Grid/OU=SCIEnce/CN=host/ardbeg.cs.st-andrews.ac.uk")
(count=8)
(label="subjob 0")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
(LD_LIBRARY_PATH /usr/local/globus-4.0.1/lib/))
(directory="/home/vj/tests/mpich-g2")
(executable="/home/vj/tests/mpich-g2/ring")
)
( &(resourceManagerContact="wn001.grid.info.uvt.ro")
(count=4)
(label="subjob 8")
(environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
(LD_LIBRARY_PATH /opt/globus-4.2.1/lib/))
(directory="/home/users/jv/tests/mpich-g2")
(executable="/home/users/jv/tests/mpich-g2/ring")
)
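For reference, the layout of the RSL file above can be sketched in Python.
This is only an illustration: `build_rsl` and the dictionary keys are
hypothetical names, and the rule that each label carries the number of
processes in the preceding subjobs is an assumption inferred from the values
in the file (0, then 8 after a subjob with count 8).

```python
def build_rsl(subjobs):
    """Build a multi-request RSL string from a list of subjob dicts."""
    blocks = ["+"]
    offset = 0  # processes in all preceding subjobs (assumed label rule)
    for index, job in enumerate(subjobs):
        blocks.append(
            '( &(resourceManagerContact="{contact}")\n'
            "(count={count})\n"
            '(label="subjob {offset}")\n'
            "(environment=(GLOBUS_DUROC_SUBJOB_INDEX {index})\n"
            "(LD_LIBRARY_PATH {lib}))\n"
            '(directory="{directory}")\n'
            '(executable="{directory}/ring")\n'
            ")".format(offset=offset, index=index, **job)
        )
        offset += job["count"]
    return "\n".join(blocks)


# Values copied from the RSL file in this post.
SUBJOBS = [
    {
        "contact": "ardbeg.cs.st-andrews.ac.uk::/O=Grid/OU=SCIEnce/CN=host/ardbeg.cs.st-andrews.ac.uk",
        "count": 8,
        "lib": "/usr/local/globus-4.0.1/lib/",
        "directory": "/home/vj/tests/mpich-g2",
    },
    {
        "contact": "wn001.grid.info.uvt.ro",
        "count": 4,
        "lib": "/opt/globus-4.2.1/lib/",
        "directory": "/home/users/jv/tests/mpich-g2",
    },
]

rsl = build_rsl(SUBJOBS)
print(rsl)
```

Generating the file this way keeps the subjob index and the label consistent
with the counts when machines are added or removed.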
The programs start correctly on all machines, but then nothing happens. That
is, 8 processes named ring are created on ardbeg (and 4 are created on
wn001.grid.info.uvt.ro), but they don't start executing: when I run 'ps',
the CPU time of all of them is 0:00. For example:
jv 7763 0.0 0.1 9400 4704 ? S 23:06 0:00 /home/users/jv/tests/mpich-g2/ring
jv 7764 0.0 0.1 9156 4696 ? S 23:06 0:00 /home/users/jv/tests/mpich-g2/ring
jv 7765 0.0 0.1 9156 4704 ? S 23:06 0:00 /home/users/jv/tests/mpich-g2/ring
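The same check can be scripted. The following is a minimal Python sketch,
assuming the usual `ps aux` column order (USER PID %CPU %MEM VSZ RSS TTY STAT
START TIME COMMAND); `parse_ps_line` and `stalled_pids` are hypothetical
helper names, and the sample text is the ps output quoted above with each
process joined onto a single line.

```python
PS_OUTPUT = """\
jv 7763 0.0 0.1 9400 4704 ? S 23:06 0:00 /home/users/jv/tests/mpich-g2/ring
jv 7764 0.0 0.1 9156 4696 ? S 23:06 0:00 /home/users/jv/tests/mpich-g2/ring
jv 7765 0.0 0.1 9156 4704 ? S 23:06 0:00 /home/users/jv/tests/mpich-g2/ring
"""


def parse_ps_line(line):
    """Split one `ps aux`-style line; return None for headers or short lines."""
    fields = line.split(None, 10)
    if len(fields) < 11 or not fields[1].isdigit():
        return None
    return {
        "user": fields[0],
        "pid": int(fields[1]),
        "stat": fields[7],
        "time": fields[9],
        "command": fields[10],
    }


def stalled_pids(ps_text, executable):
    """Return PIDs of sleeping `executable` processes with zero CPU time."""
    pids = []
    for line in ps_text.splitlines():
        proc = parse_ps_line(line)
        if proc is None:
            continue
        program = proc["command"].split()[0]
        if (program.endswith(executable)
                and proc["stat"].startswith("S")
                and proc["time"] == "0:00"):
            pids.append(proc["pid"])
    return pids


print(stalled_pids(PS_OUTPUT, "/ring"))
```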
I can't see anything too suspicious in the gram_job_mgr_XXXX.log files. In
both of them, after successful setup, something like
3/23 23:19:39 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
3/23 23:19:49 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
3/23 23:19:49 JMI: testing job manager scripts for type fork exist and permissions are ok.
3/23 23:19:49 JMI: completed script validation: job manager type is fork.
3/23 23:19:49 JMI: in globus_gram_job_manager_poll()
3/23 23:19:49 JMI: local stdout filename = /home/users/jv/.globus/job/wn001.grid.info.uvt.ro/7755.1237842393/stdout.
3/23 23:19:49 JMI: local stderr filename = /home/users/jv/.globus/job/wn001.grid.info.uvt.ro/7755.1237842393/stderr.
3/23 23:19:49 JMI: poll: seeking: https://wn001.grid.info.uvt.ro:42514/7755/1237842393/
3/23 23:19:49 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
3/23 23:19:49 JMI: cmd = poll
3/23 23:19:49 JMI: returning with success
is repeated over and over again. I guess this is all right, because I get
the same log when I successfully run MPICH-G2 jobs on a single cluster.
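To compare the multi-cluster log with a single-cluster one, the state
transitions can be tallied with a short script. This is a minimal Python
sketch, not part of Globus: `count_states` is a hypothetical helper, and the
pattern assumes every transition is logged as "Job Manager State Machine
(entering):" followed by the state name, as in the excerpt above.

```python
import re
from collections import Counter

# The state name may sit on the same line or on the next (wrapped) line.
STATE_RE = re.compile(r"Job Manager State Machine \(entering\):\s*(\S+)")


def count_states(log_text):
    """Count how often each job manager state is entered in the log text."""
    return Counter(STATE_RE.findall(log_text))


LOG_EXCERPT = """\
3/23 23:19:39 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
3/23 23:19:49 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
3/23 23:19:49 JMI: cmd = poll
3/23 23:19:49 JMI: returning with success
"""

for state, hits in sorted(count_states(LOG_EXCERPT).items()):
    print(state, hits)
```

Running it over both log files and comparing the tallies shows whether the
two runs pass through the same states.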
Also, from gatekeeper.log on wn001.grid.info.uvt.ro, I get:
TIME: Mon Mar 23 23:16:20 2009
PID: 8073 -- Notice: 5: Authenticated globus user: /O=Grid/OU=SCIEnce/CN=Vladimir Janjic
TIME: Mon Mar 23 23:16:20 2009
PID: 8073 -- Notice: 0: GATEKEEPER_JM_ID 2009-03-23.23:16:20.0000008073.0000000000 for /O=Grid/OU=SCIEnce/CN=Vladimir Janjic on 138.251.214.66
TIME: Mon Mar 23 23:16:20 2009
PID: 8073 -- Notice: 0: GRID_SECURITY_HTTP_BODY_FD=7
TIME: Mon Mar 23 23:16:20 2009
PID: 8073 -- Notice: 5: Requested service: jobmanager
TIME: Mon Mar 23 23:16:20 2009
PID: 8073 -- Notice: 5: Authorized as local user: jv
TIME: Mon Mar 23 23:16:20 2009
PID: 8073 -- Notice: 5: Authorized as local uid: 1012
TIME: Mon Mar 23 23:16:20 2009
PID: 8073 -- Notice: 5: and local gid: 513
TIME: Mon Mar 23 23:16:20 2009
PID: 8073 -- Notice: 0: executing /opt/globus-4.2.1//libexec/globus-job-manager
TIME: Mon Mar 23 23:16:20 2009
PID: 8073 -- Notice: 0: GRID_SECURITY_CONTEXT_FD=11
TIME: Mon Mar 23 23:16:20 2009
PID: 8073 -- Notice: 0: Child 8074 started
which seems OK.
Where should I start looking for a solution?
Thanks a lot,
Vladimir