Apologies for continuing to post here, as this is really a problem
with Net::SSH::Gateway and Net::SSH, but the thread started here and I
figured I'd finish.

I thought the problem might be clearer with a test, so here it is. It
fails against unpatched versions downloaded today. I attempted to
integrate it into the unit tests of Net::SSH::Gateway, but was unable
to recreate the behavior with mocha.

Basically, once one connection opened through a Net::SSH::Gateway has
failed, every later connection through that gateway fails as well.
There may be other failure conditions, such as existing connections
being broken by a failed connection. Are you willing to merge a fix
for this, be it something along the lines of the previous one or
something else entirely? In either case, I suspect the fix will not
be a simple one.
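To make "something along the lines of the previous one" concrete without pulling in Net::SSH, here is a rough sketch of the idea from the patch quoted further down: the loop thread hands errors to an optional failure block instead of dying. Everything in it (the EventLoop class, submit, stop) is a hypothetical stand-in of my own, not Net::SSH::Gateway code, and it rescues StandardError rather than Exception as the patch does.

```ruby
require 'thread'

# Hypothetical stand-in for the gateway's process thread; not Net::SSH::Gateway code.
class EventLoop
  def initialize(&failure_block)
    @failure_block = failure_block
    @mutex = Mutex.new
    @jobs = Queue.new
    @active = true
    @thread = Thread.new { run }
  end

  # Queue a unit of work to run on the loop thread.
  def submit(&job)
    @jobs << job
  end

  # Ask the loop to finish after the work already queued, then wait for it.
  def stop
    @jobs << lambda { @active = false }
    @thread.join
  end

  private

  def run
    while @active
      job = @jobs.pop
      begin
        @mutex.synchronize { job.call }
      rescue StandardError => error
        # Without a failure block the error propagates and the loop thread dies,
        # which is the behavior the test script demonstrates.
        raise unless @failure_block
        @failure_block.call(error)
      end
    end
  end
end

errors = []
results = []
event_loop = EventLoop.new { |error| errors << error }
event_loop.submit { raise IOError, "closed stream" }  # simulated failed connection
event_loop.submit { results << :still_running }       # later work on the same loop
event_loop.stop
puts "errors: #{errors.map(&:message).inspect}, results: #{results.inspect}"
```

With the failure block in place the second job still runs after the first one raises; without it, the loop thread is gone and nothing queued afterwards is ever processed.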
Macintosh:~ ben$ cat gateway-test.rb
#!/usr/bin/ruby

require 'rubygems'
require 'net/ssh'
require 'net/ssh/gateway'

GATEWAY="gateway"
WORKING_HOST="working.host"
FAILING_HOST="failed.host"
USER="user"

# true: program dies from exception on Net::SSH::Gateway process thread
# false: program locks up forever joining with later connections
Thread.abort_on_exception = true

# order does not matter, but there must be both working and failing hosts
hosts = [FAILING_HOST, WORKING_HOST]
gateway = Net::SSH::Gateway.new(GATEWAY, USER)
threads = {}
hosts.each do | host |
  threads[host] = Thread.new do
    begin
      conn = Net::SSH.start('localhost', USER, :port => gateway.open(host, 22))
    rescue Exception => error
      # this exception is caught for FAILING_HOST, but there's no way to manage it
      # other threads will still block or crash
      puts "Caught error for #{host}: #{error}"
    end
  end
end

# force race condition to appear
sleep(3)

t = Thread.new do
  begin
    conn = Net::SSH.start('localhost', USER, :port => gateway.open(WORKING_HOST, 22))
  rescue Exception => error
    puts "Caught error for #{WORKING_HOST}: #{error}"
  end
end

begin
  # blocks forever with Thread.abort_on_exception == false
  puts "Attempting to join with connection attempted after failed connection..."
  t.join()
  # is sometimes reached with Thread.abort_on_exception == true, but no use
  hosts.each do | host |
    puts "joining for #{host}"
    threads[host].join
  end
# catches SystemExit thrown by Thread.abort_on_exception == true
# Gateway process thread is now dead and no longer continues, all
# connections block indefinitely when accessed.
rescue Exception => error
  puts "caught error: #{error}"
end

puts "execution continues, but the gateway no longer processes, all connections block"
Output with Thread.abort_on_exception = false:
Macintosh:~ ben$ ./gateway-test.rb
Attempting to join with connection attempted after failed connection...
Caught error for failed_host: end of file reached
<infinite wait>
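The infinite wait comes from an unbounded join on a thread that will never finish. As a stopgap in a test harness, a join with a time limit at least reports which connections are stuck. This is only a sketch: stuck_threads is a hypothetical helper name, the sleeping thread merely stands in for a connection that never returns, and the limit applies per thread rather than to the whole batch.

```ruby
# Joins each thread with a time limit and returns the names of those still running.
# Thread#join(limit) returns nil when the limit expires before the thread finishes.
def stuck_threads(threads, seconds)
  stuck = []
  threads.each do |name, thread|
    stuck << name if thread.join(seconds).nil?
  end
  stuck
end

threads = {
  "working.host" => Thread.new { :connected },
  "failed.host"  => Thread.new { sleep }  # stands in for a connection that never returns
}

stuck = stuck_threads(threads, 0.5)
puts "still blocked: #{stuck.inspect}"

# Clean up the stand-in thread so the script can exit.
threads["failed.host"].kill
```

This does not fix anything in the gateway; it only keeps the caller from hanging along with it.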
Output with Thread.abort_on_exception = true:
Macintosh:~ ben$ ./gateway-test.rb
Attempting to join with connection attempted after failed connection...
Caught error for failed_host: end of file reached
/Library/Ruby/Gems/1.8/gems/net-ssh-2.0.2/lib/net/ssh/connection/session.rb:173:in `select': closed stream (IOError)
  from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.2/lib/net/ssh/connection/session.rb:173:in `process'
  from /Library/Ruby/Gems/1.8/gems/net-ssh-gateway-1.0.0/lib/net/ssh/gateway.rb:189:in `initiate_event_loop!'
  from /Library/Ruby/Gems/1.8/gems/net-ssh-gateway-1.0.0/lib/net/ssh/gateway.rb:189:in `synchronize'
  from /Library/Ruby/Gems/1.8/gems/net-ssh-gateway-1.0.0/lib/net/ssh/gateway.rb:189:in `initiate_event_loop!'
  from /Library/Ruby/Gems/1.8/gems/net-ssh-gateway-1.0.0/lib/net/ssh/gateway.rb:187:in `initialize'
  from /Library/Ruby/Gems/1.8/gems/net-ssh-gateway-1.0.0/lib/net/ssh/gateway.rb:187:in `new'
  from /Library/Ruby/Gems/1.8/gems/net-ssh-gateway-1.0.0/lib/net/ssh/gateway.rb:187:in `initiate_event_loop!'
  from /Library/Ruby/Gems/1.8/gems/net-ssh-gateway-1.0.0/lib/net/ssh/gateway.rb:74:in `initialize'
  from ./gateway-test.rb:18:in `new'
  from ./gateway-test.rb:18
caught error: #<SystemExit:0x5268f4>
execution continues, but the gateway no longer processes, all connections block
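The `closed stream (IOError)` at the top of that trace does not need Net::SSH to reproduce. A minimal sketch, using a plain pipe in place of the forwarded socket (select_on_closed_io is just an illustrative name): IO.select raises as soon as one of the IOs it is given has already been closed, rather than reporting it in the error set.

```ruby
# Returns the exception raised by IO.select when a watched IO is already closed,
# or nil if no exception was raised.
def select_on_closed_io
  reader, writer = IO.pipe
  reader.close                           # simulate the socket being closed on error
  IO.select([reader], nil, nil, 0.1)     # same kind of call the session's process makes
  nil
rescue IOError => error
  error
ensure
  writer.close if writer && !writer.closed?
end

error = select_on_closed_io
puts "#{error.class}: #{error.message}"
```

Because the exception comes out of select itself, whatever thread is running the event loop takes the hit, regardless of which connection owned the closed socket.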
Thanks,
Ben
On May 27, 9:59 pm, "Ben Lavender" <[EMAIL PROTECTED]> wrote:
> Curiosity has killed the metaphorical cat, the cat, in this case,
> being my time. Here's a patch, but you may not like it, as it
> involves touching Capistrano, Net::SSH, and Net::SSH::Gateway, and
> thus probably qualifies as 'unnatural'. As you wrote all three, I'm
> tossing the diffs out here. If you'd be willing to merge it, I'll
> clean it up some, update the docs and whatnot.
>
> The root problem here is that Ruby's IO.select will toss errors if a
> socket throws an error (as they do when the underlying SSH channel is
> broken), instead of marking it as being in an error state and
> returning. This means that any Net::SSH::Gateway that has multiple
> threads will have its process thread die if any one of the connections
> is lost or fails to be created. The following diffs solve the problem
> by allowing one to pass Net::SSH::Gateway a block to be called upon an
> exception being thrown by its session's process() call. Net::SSH
> doesn't provide enough information at present to deal with any errors
> sent up by IO.select, so it needs a change, too (and a far better one
> than the hack here).
>
> diffs against git on 27/5/08:
>
> capistrano/configuration/connections.rb:
> 25c25
> <     def initialize(gateway, options)
> ---
> >     def initialize(gateway, options, ignore_errors)
> 29a30
> >       failure_block = ignore_errors ? lambda { } : nil
> 31c32
> <       Net::SSH::Gateway.new(host, user, connect_options)
> ---
> >       Net::SSH::Gateway.new(host, user, connect_options, failure_block)
> 83c84
> <     GatewayConnectionFactory.new(fetch(:gateway), self)
> ---
> >     GatewayConnectionFactory.new(fetch(:gateway), self, current_task.continue_on_error?)
> 100c101,110
> <     threads.each { |t| t.join }
> ---
> >     timeout = exists?(:connection_timeout) ? fetch(:connection_timeout) : 10
> >     threads.each { |t| t.join(timeout) }
> >
> >     servers.each do |server|
> >       if !(sessions[server]) then
> >         failed_servers << { :server => server, :error => Net::SSH::Exception.new("Failed to connect to #{server}") }
> >         logger.debug "Failed to connect to `#{server}' via gateway"
> >       end
> >     end
>
>
> net/ssh/gateway.rb:
> 68c68,69
> <     def initialize(host, user, options={})
> ---
> >     def initialize(host, user, options={}, block=nil)
> >       @failure_block ||= block
> 189c190,200
> <       @session_mutex.synchronize { @session.process(0.1) }
> ---
> >       @session_mutex.synchronize {
> >         begin
> >           @session.process(0.1)
> >         rescue Exception => error
> >           if @failure_block then
> >             @failure_block.call(@session, error)
> >           else
> >             raise error
> >           end
> >         end
> >       }
>
> net/ssh/service/forward.rb:
> 79c79
> <     channel[:socket].close
> ---
> >     #channel[:socket].close
>
> forward.rb would need a more generic solution to be correct, but
> simply closing the socket on error without closing a socket built over
> it will simply cause the Gateway's process() to throw exceptions
> endlessly. Forward needs to offer another interface to respond to what
> IO.select() throws. That's another solution I'd be willing to hammer
> out, if you're willing to merge this stuff.
>
> Let me know,
> Ben
>
> On Wed, May 21, 2008 at 5:57 PM, Ben Lavender <[EMAIL PROTECTED]> wrote:
>
> > I spent some time playing with this today, and it seems to be based on
> > Net::SSH's not being thread safe. [1] The short version is that when
> > the gateway host kicks back the 'host unreachable' message, all of the
> > connection threads lock up/die/go away. The exception wanders up the
> > stack and is handled normally, but all of the other threads stop.
>
> > I'm dubious about the possibility of creating a patch for this that
> > doesn't do unnatural things to the code. I'm not sure if that means I
> > can use cap or not, for what I'm trying to do, but I'll find another
> > way to make things work if I do.
>
> > Thanks,
> > Ben
>
> > [1]:http://weblog.jamisbuck.org/2008/3/18/net-ssh-and-thread-safety
>
> > On May 21, 3:08 am, Jamis Buck <[EMAIL PROTECTED]> wrote:
> >> It's an exception. If it pains someone enough to write a patch for it,
> >> I'd consider applying it, if it doesn't do unnatural things to the code.
>
> >> - Jamis
>
> >> On May 20, 2008, at 4:16 PM, David Masover wrote:
>
> >> > I'm not sure yet whether that's a pattern or an antipattern. If it's
> >> > a pattern, then maybe we could do something like:
>
> >> > HOSTS="-foo"
>
> >> > to remove host foo from whatever the normal host list would be?
>
> >> > On Tue, May 20, 2008 at 3:03 PM, Jamis Buck <[EMAIL PROTECTED]>
> >> > wrote:
> >> > Honestly, I think I'd recommend just removing the server in question
> >> > from the server list temporarily, running your stuff, and then
> >> > adding it back. I might consider a patch to capistrano to work
> >> > around this, but at the same time, capistrano is already
> >> > ridiculously complex in places.
>
> >> > - Jamis
>
> >> > On May 20, 2008, at 1:54 PM, Ben Lavender wrote:
>
> >> > Ah, oops, err, pardon me for not posting everything I had tried, but
> >> > alas, :on_error does not do the trick here. The current version is:
>
> >> > task :add_user, :on_error => :continue do
> >> >   prompt(:username)
> >> >   #prompt(:new_password)
> >> >   begin
> >> >     run "useradd #{username}"
> >> >   rescue Exception => error
> >> >     puts "Caught an error woo woo! It's " + error
> >> >   end
> >> > end
>
> >> > This still dies:
> >> > /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.1/lib/net/ssh/connection/session.rb:173:in `select': closed stream (IOError)
> >> >   from /Library/Ruby/Gems/1.8/gems/net-ssh-2.0.1/lib/net/ssh/connection/session.rb:173:in `process'
> >> >   from /Library/Ruby/Gems/1.8/gems/net-ssh-gateway-1.0.0/lib/net/ssh/gateway.rb:189:in `initiate_event_loop!'
>
> >> > In addition, catching the Exception processes the SystemExit on its
> >> > way up the stack, albeit not gracefully. It's too late to do any
> >> > good, it seems:
> >> > ./sysadmin.cap.rb:39:in `+': SystemExit#to_str should return String (TypeError)
> >> >   from ./sysadmin.cap.rb:39:in `load'
> >> >   from /Library/Ruby/Gems/1.8/gems/capistrano-2.3.0/lib/capistrano/configuration/execution.rb:80:in `instance_eval'
>
> >> > I should also mention I'm using 2.3.0 with capistrano-ext 1.2.0, both
> >> > freshly updated via gem today.
>
> >> > I'm new to this, so I'm probably missing something; any ideas?
>
> >> > Ben
>
> >> > On May 20, 9:38 pm, Jamis Buck <[EMAIL PROTECTED]> wrote:
> >> > Ben,
>
> >> > It sounds like you want the :on_error => :continue option for the
> >> > task:
>
> >> > task :add_user, :on_error => :continue do
> >> >   # ...
> >> > end
>
> >> > With that option set, connection errors and runtime errors will be
> >> > dutifully logged, but capistrano will not abort.
>
> >> > - Jamis
>
> >> > On May 20, 2008, at 5:18 AM, Ben Lavender wrote:
>
> >> > Hi all,
>
> >> > I'm looking into using Capistrano for system administration as opposed
> >> > to deployment. I'm having some trouble handling errors.
>
> >> > As an example, I'm trying to write an add_user task. Easy enough:
>
> >> > task :add_user do
> >> >   run "useradd #{username}"
> >> > end
>
> >> > The problem is in handling error conditions. For example, right now
> >> > I'm trying to add an administrator to a number of machines, but one of
> >> > them is currently offline for maintenance. When I run my task, I get:
> >> > /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/gems/1.8/gems/net-ssh-1.1.2/lib/net/ssh/service/forward/driver.rb:126:in `direct_channel': could not open direct channel for 65530:1425-6:22 (2, No route to host) (Net::SSH::Exception)
>
> >> > The other machines work fine, and if I use a subset of roles that does
> >> > not include the affected machine, it's all fine. However, I'd like to
> >> > be able to specify that this task continue if one of a subset of
> >> > machines is unavailable (since I can run it again, harmlessly,
> >> > later). Ideally, I'd like to be able to specify the action to be
> >> > taken for a given kind of exception raised for a task. For this one,
> >> > for example, I might send an email to my trouble ticket system that
> >> > useradd failed on a given machine, reminding me to do it later.
>
> >> > I dug around in cli/execute, and it seems like error handling is done
> >> > rather statically, by handle_error. Is there an accepted way to do
> >> > this before I start overwriting that method?
>
--~--~---------~--~----~------------~-------~--~----~
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/capistrano
-~----------~----~----~----~------~----~------~--~---