Hi all,
I was hoping to get some help running mpich-g2 apps on a cluster via WS-GRAM
and Torque.
Synopsis:
Torque 2.2.1 installed as root on Gentoo, with USE flags 'crypt' and 'server' (I
believe 'crypt' causes it to use ssh instead of rsh)
Globus 4.0.7 (source) and mpich 1.2.7p1 installed as the 'globus' user
Non-mpi jobs submitted via WS-GRAM/globusrun-ws using a multiJob-style RSL run
as expected. As soon as I add <jobType>mpi</jobType> to the RSL, the subjob
submissions fail (see example 2 below). I should note that running 'ring.c'
manually with mpirun succeeds, so submitting a Pre-WS RSL to the gsigatekeeper
works as well.
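To make the change concrete, here is a trimmed sketch of one subjob in the style of a GT4 multiJob job description. It is illustrative rather than a copy of ring.multi.xml: GATEKEEPER_HOST, the port, and the executable path are placeholders, the namespace URIs are the ones I believe GT 4.0 uses, and the second subjob is omitted.

```xml
<!-- Sketch only: GATEKEEPER_HOST, port 8443 and the executable path are placeholders -->
<multiJob xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
          xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  <!-- Endpoint of the multi-job factory that coordinates the subjobs -->
  <factoryEndpoint>
    <wsa:Address>https://GATEKEEPER_HOST:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
    <wsa:ReferenceProperties>
      <gram:ResourceID>Multi</gram:ResourceID>
    </wsa:ReferenceProperties>
  </factoryEndpoint>
  <job>
    <!-- Each subjob is handed to the PBS (Torque) scheduler adapter -->
    <factoryEndpoint>
      <wsa:Address>https://GATEKEEPER_HOST:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
      <wsa:ReferenceProperties>
        <gram:ResourceID>PBS</gram:ResourceID>
      </wsa:ReferenceProperties>
    </factoryEndpoint>
    <executable>/path/to/ring</executable>
    <count>1</count>
    <!-- This is the only element that differs from the working hostname job -->
    <jobType>mpi</jobType>
  </job>
  <!-- second <job> element omitted -->
</multiJob>
```

With the <jobType> element removed and /bin/hostname as the executable, a description of this shape corresponds to the working case in example 1.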
Examples:
# 1)
# run /bin/hostname on each compute-node w/ PBS
# scheduler
globusrun-ws -submit -J -f hostname.multi.xml
# output; these are the 2 compute-nodes
tp-x002.ci.uchicago.edu
tp-x003.ci.uchicago.edu
# 2)
# run mpich-g2 test app 'ring.c' with <jobType>mpi</jobType>
globusrun-ws -submit -J -f ring.multi.xml
# output
Submission of subjob (label = "subjob 0") failed because the connection to
the server failed (check host and port) (error code 62)
Submission of subjob (label = "subjob 1") failed because the connection to
the server failed (check host and port) (error code 62)
The container and Torque logs don't show any obvious red flags, and
'globusrun-ws -status' checks appear to report a run without errors.
Any help debugging this would be greatly appreciated. I can post logs if
necessary.
-Adam