Thanks for the feedback, Brian! I'll add a hint about this to my notes.
Martin
Brian Pratt wrote:
OK, I finally cracked the nut. It was indeed an ssh issue, and
the missing piece was that the user had to be able to ssh to himself
WITHIN THE SAME NODE (!?!). In my case the submitting user is labkey.
It's understood that lab...@clientnode needs to be able to ssh to
lab...@headnode, but it turns out he also needs to be able to ssh to
lab...@clientnode. This seems odd to me, but that's how it is. I
suppose there might be a config tweak for that somewhere.
Anyway, I just repeated the steps for establishing ssh trust between
lab...@clientnode and lab...@headnode, this time for lab...@clientnode
and lab...@clientnode, and it's all good. One might have guessed
that this trust relationship was implicit, but it isn't: you have to
add labkey's rsa public key to ~labkey/.ssh/authorized_keys, and update
~labkey/.ssh/known_hosts to include our own hostname.
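The two steps above can be sketched in shell roughly as follows. This is only a minimal sketch: it assumes an RSA key pair already exists as id_rsa / id_rsa.pub in the user's .ssh directory, and the function names authorize_own_key and trust_own_host are made up for illustration.

```shell
#!/bin/sh
# Sketch: let a user ssh to itself on the same node.
# Assumption: the key pair lives at <ssh_dir>/id_rsa and <ssh_dir>/id_rsa.pub.

# Append the user's own public key to authorized_keys, only once.
authorize_own_key() {
    ssh_dir="${1:-$HOME/.ssh}"
    pub="$ssh_dir/id_rsa.pub"
    auth="$ssh_dir/authorized_keys"
    [ -f "$pub" ] || { echo "no public key at $pub" >&2; return 1; }
    touch "$auth"
    # -x whole line, -F fixed string: skip the append if the key is present
    grep -qxF -- "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
    # sshd commonly ignores key files with loose permissions
    chmod 700 "$ssh_dir"
    chmod 600 "$auth"
}

# Record this node's own host key so ssh does not stop to ask about it.
trust_own_host() {
    ssh_dir="${1:-$HOME/.ssh}"
    ssh-keyscan -t rsa "$(hostname)" >> "$ssh_dir/known_hosts"
}

# Usage on the client node, as the submitting user:
#   authorize_own_key && trust_own_host
#   ssh -o BatchMode=yes "$(hostname)" true && echo "self-ssh OK"
```

The BatchMode check at the end fails instead of prompting, which makes it a quick way to see whether the self-trust is really in place.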
strace -f on the client node was instrumental in figuring this out, as
well as messing around in the perl scripts on the server node. ssldump
was handy, too.
Thanks to Martin and Jim for the pointers. If you're reading this in an
effort to solve a similar problem you might be interested to see my
scripts for configuring a simple globus+torque cluster
on EC2 at https://hedgehog.fhcrc.org/tor/stedi/trunk/AWS_EC2 .
Brian
On Fri, Dec 4, 2009 at 8:42 AM, Brian Pratt brian.pr...@insilicos.com wrote:
Martin,
Thanks for that tip and the link to some very useful notes. I'd
started poking around in that perl module last night and it looks
like maybe the problem is actually to do with ssh between agents
within the same globus node, so my ssh trust relationships are not
yet quite as comprehensive as they need to be. I will certainly
post the solution here when I crack the nut. I've found lots of
posts out there from folks with similar sounding problems but no
resolution; we'll try to fix that here. Of course there are as many
ways to go afoul as there are clusters, but we must leave bread
crumbs where we can...
Brian
On Thu, Dec 3, 2009 at 7:05 PM, Martin Feller fel...@mcs.anl.gov wrote:
Brian,
The PBS job manager module is
$GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm
I remember that I had this or a similar problem once too, but can't
seem to find notes about it (sad, I know).
Here's some information about the Perl code which is called by the
Java pieces of ws-gram to submit the job to the local resource manager.
http://www.mcs.anl.gov/~feller/Globus/technicalStuff/Gram/perl/
While this does not help directly, it may help in debugging.
If I find my notes or have a good idea I'll let you know.
Martin
Brian Pratt wrote:
Good plan, thanks. Now to figure out where that is..
I'm certainly learning a lot!
On Thu, Dec 3, 2009 at 2:01 PM, Jim Basney jbas...@ncsa.uiuc.edu wrote:
It's been a long time since I've debugged a problem like this, but the
way I did it in the old days was to modify the Globus PBS glue script
to dump what it's passing to qsub, so I could reproduce it manually.
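One way to see what reaches qsub without editing the Perl module is a small wrapper script installed ahead of the real qsub. Note that this is a different technique from modifying the glue script itself, and only a sketch: the path to the real binary, the log location, and the function name log_qsub_call are all assumptions, not part of Globus or Torque.

```shell
#!/bin/sh
# Hypothetical qsub wrapper: record each call, then hand off to the real qsub.
# Assumptions: the real binary was moved to $REAL_QSUB and the log path
# is writable by the submitting user.
REAL_QSUB="${REAL_QSUB:-/usr/bin/qsub.real}"
QSUB_LOG="${QSUB_LOG:-/tmp/qsub-wrapper.log}"

# Append a header line and one line per argument to the log.
log_qsub_call() {
    {
        printf '=== %s user=%s cwd=%s\n' "$(date)" "$(id -un)" "$(pwd)"
        for arg in "$@"; do
            printf 'arg: %s\n' "$arg"
        done
    } >> "$QSUB_LOG"
}

# In the installed wrapper, the script would end with:
#   log_qsub_call "$@"
#   exec "$REAL_QSUB" "$@"
```

The logged arguments, together with the recorded user and working directory, should be enough to repeat the submission by hand. This sketch does not capture a job script that arrives on standard input.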
Brian Pratt wrote:
Let me amend that - I do think that this is sniffing around the right
tree, which is why I said this is in some ways more of a logging
question. It does look very much like an ssh issue, so what I really
need is to figure out exactly what connection parameters were in use
for the failure. They seem to be different in some respect than those
used in the qsub transactions. What I could really use is a hint at
how to lay eyes on that.
Thanks,
Brian
On Thu, Dec 3, 2009 at 1:38 PM, Brian Pratt brian.pr...@insilicos.com wrote:
Hi Jim,
Thanks for the reply. Unfortunately the answer doesn't seem to be