Re: [gt-user] yet another Host key verification failed question
Thanks for the feedback, Brian! I'll add a hint about this to my notes.

Martin

Brian Pratt wrote:

OK, I finally cracked the nut. It was indeed an ssh issue, and the missing piece was that the user had to be able to ssh to himself WITHIN THE SAME NODE (!?!). In my case the submitting user is labkey - it's understood that lab...@[clientnode] needs to be able to ssh to lab...@[headnode], but it turns out he also needs to be able to ssh to lab...@[clientnode]. This seems odd to me, but that's how it is. I suppose there might be a config tweak for that somewhere.

Anyway, I just repeated the steps for establishing ssh trust between lab...@clientnode and lab...@headnode for lab...@clientnode and lab...@clientnode, and it's all good. One might have guessed that this trust relationship was implicit, but it isn't - you have to add labkey's rsa public key to ~labkey/.ssh/authorized_keys, and update ~labkey/.ssh/known_hosts to include our own hostname.

strace -f on the client node was instrumental in figuring this out, as well as messing around in the perl scripts on the server node. ssldump was handy, too. Thanks to Martin and Jim for the pointers.

If you're reading this in an effort to solve a similar problem, you might be interested to see my scripts for configuring a simple globus+torque cluster on EC2 at https://hedgehog.fhcrc.org/tor/stedi/trunk/AWS_EC2 .

Brian

On Fri, Dec 4, 2009 at 8:42 AM, Brian Pratt brian.pr...@insilicos.com wrote:

Martin, Thanks for that tip and the link to some very useful notes. I'd started poking around in that perl module last night, and it looks like maybe the problem is actually to do with ssh between agents within the same globus node, so my ssh trust relationships are not yet quite as comprehensive as they need to be.
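[Editor's note] For anyone following the bread crumbs: the self-trust setup Brian describes (labkey's rsa public key in his own authorized_keys, plus the node's own hostname in known_hosts) can be sketched roughly as follows. This is a hedged sketch assuming OpenSSH; run it as the submitting user on the node in question. File locations are the OpenSSH defaults.

```shell
# Sketch: make "ssh to myself on this same node" work non-interactively.
# Assumes OpenSSH; run as the submitting user (labkey in the thread).
set -e
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
KEY="$HOME/.ssh/id_rsa"

# 1. Generate a passwordless RSA keypair if one doesn't already exist.
[ -f "$KEY" ] || ssh-keygen -q -t rsa -N "" -f "$KEY"

# 2. Authorize that key for logins to this very same account.
touch "$HOME/.ssh/authorized_keys" && chmod 600 "$HOME/.ssh/authorized_keys"
grep -q -f "$KEY.pub" "$HOME/.ssh/authorized_keys" \
  || cat "$KEY.pub" >> "$HOME/.ssh/authorized_keys"

# 3. Record our *own* host key so there is no host-key prompt.
ssh-keyscan -t rsa "$(hostname)" >> "$HOME/.ssh/known_hosts" 2>/dev/null || true

# 4. Sanity check: must succeed with no password or host-key prompt.
ssh -o BatchMode=yes "$(hostname)" true 2>/dev/null \
  && echo "self-ssh OK" || echo "self-ssh still failing"
```

BatchMode=yes in the sanity check makes ssh fail outright instead of prompting, which is exactly how the PBS job manager's non-interactive ssh invocations behave.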
I will certainly post the solution here when I crack the nut. I've found lots of posts out there from folks with similar-sounding problems but no resolution; we'll try to fix that here. Of course there are as many ways to go afoul as there are clusters, but we must leave bread crumbs where we can...

Brian

On Thu, Dec 3, 2009 at 7:05 PM, Martin Feller fel...@mcs.anl.gov wrote:

Brian,

The PBS job manager module is $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm

I remember that I had this or a similar problem once too, but can't seem to find notes about it (sad, I know). Here's some information about the Perl code which is called by the Java pieces of ws-gram to submit the job to the local resource manager:

http://www.mcs.anl.gov/~feller/Globus/technicalStuff/Gram/perl/

While this does not help directly, it may help in debugging. If I find my notes or have a good idea I'll let you know.

Martin

Brian Pratt wrote:

Good plan, thanks. Now to figure out where that is... I'm certainly learning a lot!

On Thu, Dec 3, 2009 at 2:01 PM, Jim Basney jbas...@ncsa.uiuc.edu wrote:

It's been a long time since I've debugged a problem like this, but the way I did it in the old days was to modify the Globus PBS glue script to dump what it's passing to qsub, so I could reproduce it manually.

Brian Pratt wrote:

Let me amend that - I do think that this is sniffing around the right tree, which is why I said this is in some ways more of a logging question. It does look very much like an ssh issue, so what I really need is to figure out exactly what connection parameters were in use for the failure. They seem to be different in some respect from those used in the qsub transactions. What I could really use is a hint at how to lay eyes on that.
Thanks, Brian

On Thu, Dec 3, 2009 at 1:38 PM, Brian Pratt brian.pr...@insilicos.com wrote:

Hi Jim, Thanks for the reply. Unfortunately the answer doesn't seem to be
Re: [gt-user] yet another Host key verification failed question
Hi Brian,

"Host key verification failed" is an ssh client-side error. The top hit from Google for this error message is http://www.securityfocus.com/infocus/1806 , which looks like a good reference on the topic. I suspect you need to populate and distribute /etc/ssh_known_hosts files between your nodes.

-Jim

Brian Pratt wrote:

Actually more of a logging question - I don't expect anyone to solve the problem by remote control, but I'm having a bit of trouble figuring out which node (server or client) the error is coming from.

Here's the scenario: a node running globus/ws-gram/pbs_server/pbs_sched and one running pbs_mom, using the globus simple ca. The job-submitting user is labkey on the globus node, and there's a labkey user on the client node too. I can watch decrypted SSL traffic on the client node with ssldump and the simpleca private key, and can see the job script being handed to the pbs_mom node. Passwordless ssh/scp is configured between the two nodes. The job-submitting user's .globus directory is shared via nfs with the mom node; UIDs agree on both nodes, and the globus user can write to it.

Jobs submitted with qsub are fine:

  qsub -o ~labkey/globus_test/qsubtest_output.txt -e ~labkey/globus_test/qsubtest_err.txt qsubtest

  cat qsubtest
  #!/bin/bash
  date
  env
  logger hello from qsubtest, I am $(whoami)

and indeed it executes on the pbs_mom client node.

Jobs submitted with fork are fine:

  globusrun-ws -submit -f gramtest_fork

  cat gramtest_fork
  <job>
    <executable>/mnt/userdata/gramtest_fork.sh</executable>
    <stdout>globus_test/gramtest_fork_stdout</stdout>
    <stderr>globus_test/gramtest_fork_stderr</stderr>
  </job>

but those run local to the globus node, of course. But a job submitted as

  globusrun-ws -submit -f gramtest_pbs -Ft PBS

  cat gramtest_pbs
  <job>
    <executable>/usr/bin/env</executable>
    <stdout>gramtest_pbs_stdout</stdout>
    <stderr>gramtest_pbs_stderr</stderr>
  </job>

gives this:

  globusrun-ws -submit -f gramtest_pbs -Ft PBS
  Host key verification failed.
  /bin/touch: cannot touch `/home/labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd/exit.0': No such file or directory
  /var/spool/torque/mom_priv/jobs/1.domu-12-31-38-00-b4-b5.compute-1.internal.SC: 59: cannot open /home/labkey/.globus/c5acdc30-e04c-11de-9567-d32d83561bbd/exit.0: No such file
  [: 59: !=: unexpected operator

I'm stumped - what piece of the authentication picture am I missing? And how to identify the actor that emitted that failure message?

Thanks,
Brian Pratt
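[Editor's note] Jim's /etc/ssh_known_hosts suggestion above can be sketched like this: gather every node's public host key with ssh-keyscan into one file, then push that file to each node. This is a hedged sketch assuming OpenSSH; "headnode clientnode" stand in for the actual node names, and the distribution step is only shown as a comment because it needs root.

```shell
# Sketch: build one shared known-hosts file covering every node.
# Assumes OpenSSH; node names below are illustrative placeholders.
NODES="headnode clientnode"
OUT=ssh_known_hosts.new

# Collect each node's RSA host key (add more -t types if your
# cluster uses them).
: > "$OUT"
for n in $NODES; do
  ssh-keyscan -t rsa "$n" >> "$OUT" 2>/dev/null || true
done
echo "collected $(wc -l < "$OUT") host key(s) into $OUT"

# Then distribute it to every node, e.g. (as root):
#   for n in $NODES; do scp "$OUT" root@$n:/etc/ssh/ssh_known_hosts; done
```

With a system-wide ssh_known_hosts in place, per-user ~/.ssh/known_hosts files no longer have to be kept in sync on each node.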
Re: [gt-user] yet another Host key verification failed question
Hi Jim,

Thanks for the reply. Unfortunately the answer doesn't seem to be that simple - I do have the ssh stuff worked out (believe me, I've googled the heck out of this thing!); the qsub test won't work without it. I can scp between the two nodes in all combinations of user globus or labkey, logged into either node, and in either direction.

Thanks, Brian

On Thu, Dec 3, 2009 at 1:33 PM, Jim Basney jbas...@ncsa.uiuc.edu wrote:

Hi Brian, "Host key verification failed" is an ssh client-side error. The top hit from Google for this error message is http://www.securityfocus.com/infocus/1806 , which looks like a good reference on the topic. I suspect you need to populate and distribute /etc/ssh_known_hosts files between your nodes. -Jim
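[Editor's note] The trust matrix Brian describes here - every user/node combination working in both directions - can be checked mechanically. A hedged sketch, assuming OpenSSH; the user and node names are illustrative, and note that (as the thread's eventual resolution shows) ssh from a node to itself belongs in the matrix too:

```shell
# Sketch: verify non-interactive ssh for every user/node combination.
# BatchMode=yes makes ssh fail instead of prompting, matching how the
# PBS glue scripts invoke it. Names below are illustrative.
for user in globus labkey; do
  for node in headnode clientnode; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$user@$node" true 2>/dev/null; then
      echo "OK   $user@$node"
    else
      echo "FAIL $user@$node"
    fi
  done
done
```

Any FAIL line pinpoints a missing authorized_keys or known_hosts entry before GRAM ever gets involved.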
Re: [gt-user] yet another Host key verification failed question
Let me amend that - I do think that this is sniffing around the right tree, which is why I said this is in some ways more of a logging question. It does look very much like an ssh issue, so what I really need is to figure out exactly what connection parameters were in use for the failure. They seem to be different in some respect from those used in the qsub transactions. What I could really use is a hint at how to lay eyes on that.

Thanks, Brian

On Thu, Dec 3, 2009 at 1:38 PM, Brian Pratt brian.pr...@insilicos.com wrote:

Hi Jim, Thanks for the reply. Unfortunately the answer doesn't seem to be that simple - I do have the ssh stuff worked out (believe me, I've googled the heck out of this thing!); the qsub test won't work without it. I can scp between the two nodes in all combinations of user globus or labkey, logged into either node, and in either direction. Thanks, Brian
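[Editor's note] Brian later reports that "strace -f on the client node was instrumental" in seeing exactly which ssh invocation was failing. A hedged sketch of that technique: attach strace to the running pbs_mom (the process name and lack of a wrapper are assumptions about your setup) and watch the exact ssh/scp command lines it execs when the job runs.

```shell
# Sketch: trace pbs_mom's children and surface the ssh/scp invocations,
# revealing the precise connection parameters in use at failure time.
# Requires root (or ptrace rights over pbs_mom); process name assumed.
if command -v strace >/dev/null 2>&1 && pgrep -x pbs_mom >/dev/null 2>&1; then
  strace -f -e trace=execve -s 256 -p "$(pgrep -o -x pbs_mom)" 2>&1 \
    | grep -E 'execve.*(ssh|scp)'
else
  echo "strace or pbs_mom not available on this host"
fi
```

The -f flag follows forks (the mom's job shells), -e trace=execve limits the noise to program launches, and -s 256 keeps long argument strings from being truncated.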