Re: [gt-user] yet another Host key verification failed question

2009-12-10 Thread Martin Feller
Thanks for the feedback, Brian! I'll add a hint about this to my notes.

Martin

Brian Pratt wrote:
 OK, I finally cracked the nut.  It was indeed an ssh issue, and
 the missing piece was that the user had to be able to ssh to himself
 WITHIN THE SAME NODE (!?!).  In my case the submitting user is labkey
 - it's understood that lab...@[clientnode] needs to be able to ssh to
 lab...@[headnode], but it turns out he also needs to be able to ssh to
 lab...@[clientnode].  This seems odd to me, but that's how it is.  I
 suppose there might be a config tweak for that somewhere.
 Anyway, I just repeated the steps for establishing ssh trust between
 lab...@clientnode and lab...@headnode for lab...@clientnode and
 lab...@clientnode, and it's all good.  One might have guessed that
 this trust relationship was implicit, but it isn't - you have to add
 labkey's rsa public key to ~labkey/.ssh/authorized_keys and update
 ~labkey/.ssh/known_hosts to include our own hostname.
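 
 For anyone retracing those steps, a minimal sketch (run as labkey on
 the client node; "clientnode" stands in for your real hostname, and it
 assumes an rsa key pair already exists - otherwise run ssh-keygen -t rsa
 first):
 
 # authorize our own public key for ssh-to-self
 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
 chmod 600 ~/.ssh/authorized_keys
 # record our own host key so there's no interactive prompt
 ssh-keyscan -t rsa clientnode >> ~/.ssh/known_hosts
 # verify: should succeed with no password and no host-key prompt
 ssh -o BatchMode=yes clientnode true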
  
 strace -f on the client node was instrumental in figuring this out, as
 well as messing around in the perl scripts on the server node.  ssldump
 was handy, too.
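 
 If you want to reproduce the strace approach, here's a sketch; the
 pgrep lookup and the particular trace options are one reasonable
 setup, not the exact command used above:
 
 # on the client node, as root: follow forks of pbs_mom and log every
 # exec and network connect, with long-enough argument strings
 strace -f -e trace=execve,connect -s 256 -o /tmp/mom.trace -p "$(pgrep -x pbs_mom)"
 # resubmit the failing job, then look for the ssh invocation:
 grep ssh /tmp/mom.trace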
  
 Thanks to Martin and Jim for the pointers.  If you're reading this in
 an effort to solve a similar problem, you might be interested in my
 scripts for configuring a simple globus+torque cluster
 on EC2 at https://hedgehog.fhcrc.org/tor/stedi/trunk/AWS_EC2 .
  
 Brian
 
 On Fri, Dec 4, 2009 at 8:42 AM, Brian Pratt brian.pr...@insilicos.com wrote:
 
 Martin,
  
 Thanks for that tip and the link to some very useful notes.  I'd
 started poking around in that perl module last night, and it looks
 like the problem actually has to do with ssh between agents within
 the same globus node, so my ssh trust relationships are not yet
 quite as comprehensive as they need to be.  I will certainly post
 the solution here when I crack the nut.  I've found lots of posts
 out there from folks with similar-sounding problems but no
 resolution; we'll try to fix that here.  Of course there are as many
 ways to run afoul as there are clusters, but we must leave bread
 crumbs where we can...
 Brian
 On Thu, Dec 3, 2009 at 7:05 PM, Martin Feller fel...@mcs.anl.gov wrote:
 
 Brian,
 
 The PBS job manager module is
 $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm
 
 I remember that I had this or a similar problem once too, but I can't
 seem to find my notes about it (sad, I know).
 Here's some information about the Perl code which is called by
 the Java
 pieces of ws-gram to submit the job to the local resource manager.
 
 http://www.mcs.anl.gov/~feller/Globus/technicalStuff/Gram/perl/
 
 While this does not help directly, it may help in debugging.
 If I find my notes or have a good idea, I'll let you know.
 
 Martin
 
 
 
 Brian Pratt wrote:
  Good plan, thanks.  Now to figure out where that is...
 
  I'm certainly learning a lot!
 
  On Thu, Dec 3, 2009 at 2:01 PM, Jim Basney jbas...@ncsa.uiuc.edu wrote:
 
  It's been a long time since I've debugged a problem like this, but
  the way I did it in the old days was to modify the Globus PBS glue
  script to dump what it's passing to qsub, so I could reproduce it
  manually.
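 
  A low-tech variant of Jim's suggestion, sketched as a shell wrapper
  rather than a pbs.pm edit: interpose a fake qsub that logs its
  arguments and the submitted script, then hands off to the real one.
  This assumes the job manager finds qsub via $PATH; if your pbs.pm
  was configured with an absolute qsub path, point that path at the
  wrapper instead.  The /usr/bin/qsub location below is an assumption.
 
  #!/bin/bash
  # qsub wrapper: install ahead of the real qsub in $PATH
  # (e.g. /usr/local/bin/qsub); logs the call, keeps a copy of the
  # job script, then hands off
  LOG=/tmp/qsub-calls.log
  echo "$(date): qsub $*" >> "$LOG"
  # if the last argument is the job script file, save its contents too
  [ $# -gt 0 ] && [ -f "${!#}" ] && cat "${!#}" >> "$LOG"
  exec /usr/bin/qsub "$@"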
 
Re: [gt-user] yet another Host key verification failed question

2009-12-03 Thread Jim Basney
Hi Brian,

"Host key verification failed" is an ssh client-side error. The top hit
from Google for this error message is
http://www.securityfocus.com/infocus/1806 which looks like a good
reference on the topic. I suspect you need to populate and distribute
/etc/ssh_known_hosts files between your nodes.
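
Concretely, a sketch of one way to populate it (headnode and clientnode
are placeholder names, and on some systems the file lives at
/etc/ssh/ssh_known_hosts instead):

# collect host keys once, then push the file to every node
ssh-keyscan -t rsa headnode clientnode > /tmp/ssh_known_hosts
scp /tmp/ssh_known_hosts root@headnode:/etc/ssh_known_hosts
scp /tmp/ssh_known_hosts root@clientnode:/etc/ssh_known_hosts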

-Jim

Brian Pratt wrote:
 Actually more of a logging question - I don't expect anyone to solve the
 problem by remote control, but I'm having a bit of trouble figuring out
 which node (server or client) the error is coming from.
 
 Here's the scenario: a node running globus/ws-gram/pbs_server/pbs_sched and
 one running pbs_mom. Using the globus simple ca.  Job-submitting user is
 labkey on the globus node, and there's a labkey user on the client node
 too.
 
 I can watch decrypted SSL traffic on the client node with ssldump and
 the simpleca private key, and can see the job script being handed to
 the pbs_mom node.
 
 passwordless ssh/scp is configured between the two nodes.
 
 The job-submitting user's .globus directory is shared via NFS with the
 mom node.  UIDs agree on both nodes, and the globus user can write to it.
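 
 A quick check for that UID agreement, with placeholder node names:
 for h in headnode clientnode; do ssh $h "hostname; id labkey; id globus"; done
 should print identical uid/gid numbers for each user on both nodes.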
 
 Jobs submitted with qsub are fine:
 qsub -o ~labkey/globus_test/qsubtest_output.txt \
   -e ~labkey/globus_test/qsubtest_err.txt qsubtest
 cat qsubtest
 #!/bin/bash
 date
 env
 logger hello from qsubtest, I am $(whoami)
 and indeed it executes on the pbs_mom client node.
 
 Jobs submitted with fork are fine.  globusrun-ws -submit -f gramtest_fork
 cat gramtest_fork
 <job>
   <executable>/mnt/userdata/gramtest_fork.sh</executable>
   <stdout>globus_test/gramtest_fork_stdout</stdout>
   <stderr>globus_test/gramtest_fork_stderr</stderr>
 </job>
 but those run local to the globus node, of course.
 
 But a job submitted as
 globusrun-ws -submit -f gramtest_pbs -Ft PBS
 
 cat gramtest_pbs
 <job>
   <executable>/usr/bin/env</executable>
   <stdout>gramtest_pbs_stdout</stdout>
   <stderr>gramtest_pbs_stderr</stderr>
 </job>
 
 gives this:
 Host key verification failed.
 /bin/touch: cannot touch
 `/home/labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd/exit.0': No
 such file or directory
 /var/spool/torque/mom_priv/jobs/1.domu-12-31-38-00-b4-b5.compute-1.internal.SC:
 59: cannot open
 /home/labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd/exit.0: No
 such file
 [: 59: !=: unexpected operator
 
 I'm stumped - what piece of the authentication picture am I missing?
 And how do I identify the actor that emitted that failure message?
 
 Thanks,
 
 Brian Pratt


Re: [gt-user] yet another Host key verification failed question

2009-12-03 Thread Brian Pratt
Hi Jim,

Thanks for the reply.  Unfortunately the answer doesn't seem to be that
simple - I do have the ssh stuff worked out (believe me, I've googled the
heck out of this thing!); the qsub test won't work without it.  I can scp
between the two nodes in all combinations of user (globus or labkey),
logged into either node, and in either direction.
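
For anyone verifying the same thing, a loop that exercises the whole
matrix, including each user ssh'ing to the node it is already on; node
and user names here are placeholders:

for u in globus labkey; do
  for src in headnode clientnode; do
    for dst in headnode clientnode; do
      # BatchMode makes ssh fail instead of prompting, so gaps show up
      ssh $u@$src "ssh -o BatchMode=yes $u@$dst true" \
        && echo "OK   $u: $src -> $dst" || echo "FAIL $u: $src -> $dst"
    done
  done
done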

Thanks,

Brian



Re: [gt-user] yet another Host key verification failed question

2009-12-03 Thread Brian Pratt
Let me amend that - I do think that this is sniffing around the right tree,
which is why I said this is in some ways more of a logging question.  It
does look very much like an ssh issue, so what I really need is to figure
out exactly what connection parameters were in use for the failure.  They
seem to be different in some respect from those used in the qsub
transactions.  What I could really use is a hint at how to lay eyes on that.
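
One way to lay eyes on it, sketched as a guess rather than a documented
Globus knob: interpose a logging wrapper for ssh itself, so whatever the
toolchain invokes records its full argument list.  As with any wrapper
trick, this only works if ssh is found via $PATH rather than an absolute
path, and the real ssh location below is an assumption:

#!/bin/bash
# ssh wrapper: install ahead of the real ssh in $PATH
# (e.g. /usr/local/bin/ssh); records caller uid and arguments
echo "$(date) uid=$(id -u) ssh $*" >> /tmp/ssh-calls.log
exec /usr/bin/ssh "$@"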

Thanks,

Brian
