Matt,

I use Fabric to orchestrate my EC2 instances also. What I did, though, was create a loop that tests for "ssh connectability" before I invoke Fabric scripts. Very roughly copying and pasting the code, it looks something like this:

        # The instance state is "running" before entering this loop.
        while True:
            time.sleep(1)
            self.update() # This updates self.instance.state
            if self.instance.state != "running":
                raise Exception('Unexpected instance state "' + self.instance.state + '"')
            if self._test_ssh(False):
                break
        # Should be okay to run Fabric commands now.

    def _test_ssh(self, throw=True):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            if "port" in self.configuration:
                port = int(self.configuration["port"])
            else:
                port = 22
            sock.settimeout(1)
            sock.connect((self.address, port))
            return True
        except socket.timeout:
            if throw:
                raise
        except socket.error as e:
            # errno 111 is ECONNREFUSED on Linux: nothing is listening yet.
            if throw or e.errno != 111:
                raise
        finally:
            sock.close()
        return False
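
For anyone who wants to try the same idea outside the class above, here is a minimal, self-contained sketch of the port check in Python 3. The names port_is_open and wait_for_port and their parameters are illustrative, not from the original code. It treats any OSError (refused, timed out, unreachable) as "not ready yet", which is more permissive than the errno 111 check above.

```python
import socket
import time


def port_is_open(host, port=22, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection resolves the host, connects, and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def wait_for_port(host, port=22, attempts=60, interval=1.0):
    """Poll until the port accepts connections; return True on success."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(interval)
    return False
```

A caller would check the return value of wait_for_port(address) before invoking any Fabric commands, and give up if it returns False.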

HTH,
Pat


On 06/14/2010 12:16 PM, Matt Calder wrote:
Patrick,

I thought you were on to something there, but alas no. I get the same
behavior using both DNS and IP: the errors without the fixes described,
and correct connections with the fixes.

Matt

On Mon, Jun 14, 2010 at 5:25 PM, Patrick J McNerthney
<pmcnerth...@clearpointmetrics.com>  wrote:
Matt,

Try eliminating the use of DNS, i.e.
"ec2-174-129-96-241.compute-1.amazonaws.com", and instead connect directly
to the IP address, i.e. 174.129.96.241, to see if that has something to do
with it.

Pat


On 06/14/2010 11:16 AM, Matt Calder wrote:
All,

After much debugging I finally found a workaround. I'd like to explain
what I did in the hopes that someone might see what the underlying
problem is.

I don't think I made this point explicit in my previous emails, but, I
am using fabric as a library. For simplicity, say I have two
functions, createInstance, and runStuff. The createInstance function
creates an ec2 instance (using boto) and waits for the instance's
state to be "running". The runStuff function uses fabric to run code
on the instance. So, my program looks like:

createInstance()
runStuff()

If I run it as is, I will get connection failures, inside
fabric/network.py: connect, either a socket error or a timeout. I know
that ec2 instances can report their state as "running" but still not
be ready to take connections. So I added a sleep to my program,

createInstance()
sleep(240)
runStuff()

Now, four minutes may seem excessive, but, with four minutes I still
get connection errors. During my investigations, I made a few
interesting observations. If I place a debugger breakpoint just after
the sleep, I can break and resume, and I will not get connection
errors. If during the sleep period, I ssh into the instance from a
terminal, I will not get connection errors, either in the terminal or
in the program when the sleep passes (yes, really). Lastly, if I run
just createInstance in one process, then after, run just runStuff in
another separate process, I do not get connection errors.

The workaround that I found had two parts. First, I removed the
sleep(240). Instead, I placed a sleep of 20 seconds in
paramiko/client.py, at the very beginning of Client.connect. Then I
added logic to fabric/network.py connect to retry on timeouts and
socket errors up to six times. With these changes, I often connect the
first time (that would include one 20 second sleep), and if not,
always the second time (in the ten or so runs I have done).
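
The retry half of this workaround can be sketched roughly as follows. This is not the actual patch to fabric/network.py or paramiko/client.py. It is a standalone Python 3 illustration, and connect_with_retries and its parameters are hypothetical names. It pauses before every attempt, as with the 20 second sleep described above, and retries only on timeouts and socket errors.

```python
import time


def connect_with_retries(connect, attempts=6, delay=20.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    Sleeps `delay` seconds before every attempt. Only OSError is retried;
    in Python 3 that includes socket.timeout and socket.error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        sleep(delay)
        try:
            return connect()
        except OSError as exc:
            last_error = exc
    # Every attempt failed: surface the most recent error to the caller.
    raise last_error
```

The connect argument is any zero-argument callable that opens the connection, for example lambda: socket.create_connection((host, 22), timeout=10). The sleep parameter exists only so the delay can be replaced in tests.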

Note that the connection errors are occurring prior to any ssh
activities, the connection is just getting a socket to port 22 on the
ec2 instance.

For the record I am running Ubuntu 10.04, however, colleagues report
the same errors on Windows and MacOS.

I hope someone can provide a reason for the behavior I have been
seeing. I don't mind the workaround, but while it works, it is not
based on any real understanding of what the problem is.

Matt





On Thu, Jun 10, 2010 at 8:57 PM, Patrick J McNerthney
<pmcnerth...@clearpointmetrics.com> wrote:

Try using the --disable-known-hosts command line option to see if it has
something to do with a prior use of the same ip address.

On 06/10/2010 01:19 PM, Matt Calder wrote:

Jeff,

On Thu, Jun 10, 2010 at 6:54 PM, Jeff Forcier <j...@bitprophet.org> wrote:


Hi Matt,

Paramiko doesn't have a connection cache that I'm aware of, but Fabric
itself does. However, from your description it sounds like you are
creating a new instance and then connecting to it, so I'm not sure why
a cache would present a problem.



I'm fairly certain fabric's cache is empty, because the code goes into
the network.py : connect function. The reason I suggested a "paramiko
cache" is that, while it is true that just after an instance goes from
"pending" to "running" there is a period when connections fail, that
period is usually very brief (< 10 sec). That is why I do a sleep(60)
after the startup, to give time for that to settle.



If you're rebooting a remote system or doing anything to alter the
networking of an already-connected system, then you can force a
reconnect by manipulating fabric.state.connections. For example, see
what the (master-only) reboot() operation does:



  
http://code.fabfile.org/repositories/entry/fabric/master/fabric/operations.py#L668
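
As a rough illustration of what "manipulating fabric.state.connections" could look like, the sketch below closes and forgets one cached connection. It deliberately does not import Fabric. It only assumes that the cache behaves like a dict mapping host strings to client objects with a close() method, and that assumption should be checked against the Fabric version in use. The function name drop_cached_connection is hypothetical.

```python
def drop_cached_connection(connections, host_string):
    """Close and forget the cached client for host_string, if there is one.

    `connections` is assumed to be dict-like: host string -> client object
    with a close() method. Returns True if an entry was dropped.
    """
    client = connections.pop(host_string, None)
    if client is None:
        return False
    client.close()
    return True
```

If the assumption holds, the next run() against that host would have to open a fresh connection instead of reusing the cached one.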



I will look at that.



If the problem is as straightforward as it sounds, though, I'm
honestly not sure what's up other than "possible Paramiko bug". Are
you getting any prompts or anything when you connect to the new
instance by hand?



I can log in by hand, completely and correctly, from a terminal. I can
do this after the instance is started but before fabric's first run
call. The funny thing is, if I do log in from a terminal, the fabric
run command will work. So, a pseudo code timeline:

# Version 1, this will fail, the run cannot connect to the instance.
startInstance()
sleep(60)
run("ls")

# Version 2, this will succeed in running "ls" on the instance.
startInstance()
sleep(60) # During this sleep, using a terminal, I log into the instance.
run("ls")

Another variation that works is:

# Version 3, this also succeeds.
startInstance()
sleep(60)
# Debugger breakpoint here: using the debugger, look at variables (no
# changes), then proceed.
run("ls")

It is the examples that work that shout out "threading error" or
"caching error" to me.



Another thing to try is to upgrade Paramiko to 1.7.6 if you're using
the bundled 1.7.4.



I will try that. Thanks for taking the time to help!

Matt



-Jeff


On Thu, Jun 10, 2010 at 5:38 PM, Matt Calder <mvcal...@gmail.com> wrote:


Bruno,

No it is in a good group. I can log in using fabric if I restart it
and the instance is already running. I can see that fabric is inside
network.py trying to make the connection. I get one of two errors:
either timeout or low level socket error. In debugging, I added
retries to network.connect and it will fail repeatedly. First it times
out a few times, then gives the "low level socket" error. While it is
doing that, I can ssh into it from a terminal. I wonder, does paramiko
have a connection cache? Maybe it is not really retrying? Thanks for
any help.


Matt

On Thu, Jun 10, 2010 at 5:23 PM, Bruno Clermont
<bruno.clerm...@gmail.com> wrote:


Is your instance in a security group that allows your IP and the port
you're trying to connect to?
If it times out, it's probably blocked by Amazon firewalls.

On Thu, Jun 10, 2010 at 15:07, Matt Calder <mvcal...@gmail.com> wrote:


Hi,

I am having problems using fabric with EC2 instances. I am not
entirely sure fabric is even the source of the problem, but I am
hoping someone on this list can suggest a solution or a path to
investigate. Here is the problem. I start an EC2 instance using boto.
I wait for the instance to report its state as "running". I wait an
additional 60 seconds after that. Then I try to "run" things on the
instance through fabric. At that point I get:

  [ubu...@ec2-174-129-96-241.compute-1.amazonaws.com] run: ls

Fatal error: Timed out trying to connect to
ec2-174-129-96-241.compute-1.amazonaws.com

Aborting.

Now, the interesting thing is this. During that additional 60 second
wait I can log into the instance from a separate terminal. Moreover,
when I do that separate login, the fabric login succeeds.

Obviously, there is not a lot to go on here, but I am not entirely
sure what additional information would be helpful. If anyone has a
suggestion of what I might try to do, I would greatly appreciate it.
Thanks,

Matt

_______________________________________________
Fab-user mailing list
Fab-user@nongnu.org
http://lists.nongnu.org/mailman/listinfo/fab-user




--
Jeff Forcier
Unix sysadmin; Python/Ruby developer
http://bitprophet.org


