Hi all,

I was hoping to get some help running mpich-g2 apps on a cluster via WS-GRAM 
and Torque.

Synopsis:
Torque 2.2.1 installed as root on Gentoo, with USE flags 'crypt' and 'server' 
(I believe 'crypt' causes it to use ssh instead of rsh).

Globus 4.0.7 (source) and mpich 1.2.7p1 installed as the 'globus' user.

Non-MPI jobs submitted via WS-GRAM/globusrun-ws using a multiJob-style RSL run 
as expected.  As soon as I add <jobType>mpi</jobType> to the RSL, every subjob 
submission fails (see example 2 below).  I should note that I have successfully 
run 'ring.c' by invoking mpirun manually (hence submitting a Pre-WS RSL to 
gsigatekeeper works as well).

Examples:
# 1)
# run /bin/hostname on each compute-node w/ PBS
# scheduler
globusrun-ws -submit -J -f hostname.multi.xml

# output; these are the 2 compute-nodes
tp-x002.ci.uchicago.edu
tp-x003.ci.uchicago.edu

# 2)
# run mpich-g2 test app 'ring.c' with <jobType>mpi</jobType>
globusrun-ws -submit -J -f ring.multi.xml

# output
    Submission of subjob (label = "subjob 0") failed because the connection to the server failed (check host and port) (error code 62)
    Submission of subjob (label = "subjob 1") failed because the connection to the server failed (check host and port) (error code 62)

The container and Torque logs don't show any obvious red flags, and 
'globusrun-ws -status' checks appear to show an error-free run.

Any help debugging this would be greatly appreciated.  I can post logs if 
necessary.

-Adam
