Patrick,

Thanks! Your test_ssh function does what I was doing in client.py, but
is a much better solution as it can be done outside the fabric code
itself. I still find it a mystery as to why placing this check before
fabric starts makes such a difference. It is quite possible that it is
due to our code and not fabric. If I manage to discover anything, I'll
be sure to share. Thanks for your help,

Matt
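
For readers who want the same check outside any class, here is a minimal
standalone sketch of the "wait until the SSH port accepts a TCP connection"
idea discussed in this thread. It assumes Python 3. The function name
wait_for_port and all of its parameters and defaults are hypothetical and
are not part of Fabric, Paramiko, or boto.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=1.0,
                  connect_timeout=1.0):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once the port accepts a connection, or False if
    `timeout` seconds pass first. The name and defaults are
    illustrative assumptions, not an existing library API.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection resolves the host and opens the socket.
            sock = socket.create_connection((host, port),
                                            timeout=connect_timeout)
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            pass
        else:
            sock.close()
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In a program shaped like the one described below, a call such as
wait_for_port(address) would sit between the instance reporting "running"
and the first Fabric call. Unlike a fixed sleep, it returns as soon as the
port answers and gives up after a bounded time.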

On Mon, Jun 14, 2010 at 6:38 PM, Patrick J McNerthney
<pmcnerth...@clearpointmetrics.com> wrote:
> Matt,
>
> I use Fabric to orchestrate my EC2 instances also.  What I did though was to
> create a loop that tests for "ssh connectability" before I invoke Fabric
> scripts.  Very roughly copying and pasting the code, it looks something like
> this:
>
>        # The instance state is "running" before entering this loop.
>        while True:
>            time.sleep(1)
>            self.update() # This updates self.instance.state
>            if self.instance.state != "running":
>                raise Exception('Unexpected instance state "' +
>                                self.instance.state + '"')
>            if self._test_ssh(False):
>                break
>        # Should be okay to run Fabric commands now.
>
>    def _test_ssh(self, throw=True):
>        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>        try:
>            if "port" in self.configuration:
>                port = int(self.configuration["port"])
>            else:
>                port = 22
>            sock.settimeout(1)
>            sock.connect((self.address, port))
>            return True
>        except socket.timeout:
>            if throw:
>                raise
>        except socket.error as e:
>            # errno 111 is ECONNREFUSED on Linux: nothing is listening yet.
>            if throw or e.errno != 111:
>                raise
>        finally:
>            sock.close()
>        return False
>
> HTH,
> Pat
>
>
> On 06/14/2010 12:16 PM, Matt Calder wrote:
>>
>> Patrick,
>>
>> I thought you were on to something there, but alas no. I get the same
>> results using both DNS and IP: errors without the fixes described,
>> and correct connections with the fixes.
>>
>> Matt
>>
>> On Mon, Jun 14, 2010 at 5:25 PM, Patrick J McNerthney
>> <pmcnerth...@clearpointmetrics.com>  wrote:
>>
>>>
>>> Matt,
>>>
>>> Try eliminating the use of DNS, i.e.
>>> "ec2-174-129-96-241.compute-1.amazonaws.com", and instead connect
>>> directly to the IP address, i.e. 174.129.96.241, to see if that has
>>> something to do with it.
>>>
>>> Pat
>>>
>>>
>>> On 06/14/2010 11:16 AM, Matt Calder wrote:
>>>
>>>>
>>>> All,
>>>>
>>>> After much debugging I finally found a workaround. I'd like to explain
>>>> what I did in the hopes that someone might see what the underlying
>>>> problem is.
>>>>
>>>> I don't think I made this point explicit in my previous emails, but, I
>>>> am using fabric as a library. For simplicity, say I have two
>>>> functions, createInstance, and runStuff. The createInstance function
>>>> creates an ec2 instance (using boto) and waits for the instance's
>>>> state to be "running". The runStuff function uses fabric to run code
>>>> on the instance. So, my program looks like:
>>>>
>>>> createInstance()
>>>> runStuff()
>>>>
>>>> If I run it as is, I will get connection failures, inside
>>>> fabric/network.py: connect, either a socket error or a timeout. I know
>>>> that ec2 instances can report their state as "running" but still not
>>>> be ready to take connections. So I added a sleep to my program,
>>>>
>>>> createInstance()
>>>> sleep(240)
>>>> runStuff()
>>>>
>>>> Now, four minutes may seem excessive, but even with four minutes I still
>>>> get connection errors. During my investigations, I made a few
>>>> interesting observations. If I place a debugger breakpoint just after
>>>> the sleep, I can break and resume, and I will not get connection
>>>> errors. If, during the sleep period, I ssh into the instance from a
>>>> terminal, I will not get connection errors, either in the terminal or
>>>> in the program when the sleep passes (yes, really). Lastly, if I run
>>>> just createInstance in one process and afterwards run just runStuff in
>>>> a separate process, I do not get connection errors.
>>>>
>>>> The workaround that I found had two parts. First, I removed the
>>>> sleep(240). Instead, I placed a sleep of 20 seconds in
>>>> paramiko/client.py, at the very beginning of Client.connect. Then I
>>>> added logic to fabric/network.py connect to retry on timeouts and
>>>> socket errors up to six times. With these changes, I often connect the
>>>> first time (that would include one 20 second sleep), and if not,
>>>> always the second time (in the ten or so runs I have done).
>>>>
>>>> Note that the connection errors are occurring prior to any ssh
>>>> activities, the connection is just getting a socket to port 22 on the
>>>> ec2 instance.
>>>>
>>>> For the record I am running Ubuntu 10.04, however, colleagues report
>>>> the same errors on Windows and MacOS.
>>>>
>>>> I hope someone can provide a reason for the behavior I have been
>>>> seeing. I don't mind the workaround, but while it works, it is not
>>>> based on any real understanding of what the problem is.
>>>>
>>>> Matt
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jun 10, 2010 at 8:57 PM, Patrick J McNerthney
>>>> <pmcnerth...@clearpointmetrics.com>    wrote:
>>>>
>>>>
>>>>>
>>>>> Try using the --disable-known-hosts command line option to see if it
>>>>> has something to do with a prior use of the same ip address.
>>>>>
>>>>> On 06/10/2010 01:19 PM, Matt Calder wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Jeff,
>>>>>>
>>>>>> On Thu, Jun 10, 2010 at 6:54 PM, Jeff Forcier<j...@bitprophet.org>
>>>>>>  wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Hi Matt,
>>>>>>>
>>>>>>> Paramiko doesn't have a connection cache that I'm aware of, but
>>>>>>> Fabric itself does. However, from your description it sounds like
>>>>>>> you are creating a new instance and then connecting to it, so I'm
>>>>>>> not sure why a cache would present a problem.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I'm fairly certain fabric's cache is empty, because the code goes into
>>>>>> the network.py : connect function. The reason I suggested a "paramiko
>>>>>> cache" is that, while it is true that just after an instance goes from
>>>>>> "pending" to "running" there is a period when connections fail, that
>>>>>> period is usually very brief (< 10 sec). That is why I do a sleep(60)
>>>>>> after the startup, to give time for that to settle.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> If you're rebooting a remote system or doing anything to alter the
>>>>>>> networking of an already-connected system, then you can force a
>>>>>>> reconnect by manipulating fabric.state.connections. For example, see
>>>>>>> what the (master-only) reboot() operation does:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  http://code.fabfile.org/repositories/entry/fabric/master/fabric/operations.py#L668
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I will look at that.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> If the problem is as straightforward as it sounds, though, I'm
>>>>>>> honestly not sure what's up other than "possible Paramiko bug". Are
>>>>>>> you getting any prompts or anything when you connect to the new
>>>>>>> instance by hand?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I can log in by hand, completely and correctly, from a terminal. I can
>>>>>> do this after the instance is started but before fabric's first run
>>>>>> call. The funny thing is, if I do log in from a terminal, the fabric
>>>>>> run command will work. So, a pseudo code timeline:
>>>>>>
>>>>>> # Version 1, this will fail, the run cannot connect to the instance.
>>>>>> startInstance()
>>>>>> sleep(60)
>>>>>> run("ls")
>>>>>>
>>>>>> # Version 2, this will succeed in running "ls" on the instance.
>>>>>> startInstance()
>>>>>> sleep(60) # During this sleep, using a terminal, I log into the instance.
>>>>>> run("ls")
>>>>>>
>>>>>> Another variation that works is:
>>>>>>
>>>>>> # Version 3, this also succeeds.
>>>>>> startInstance()
>>>>>> sleep(60)
>>>>>> # Debugger breakpoint here: using the debugger, look at variables
>>>>>> # (no changes), then proceed.
>>>>>> run("ls")
>>>>>>
>>>>>> It is the examples that work that shout out "threading error" or
>>>>>> "caching error" to me.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Another thing to try is to upgrade Paramiko to 1.7.6 if you're using
>>>>>>> the bundled 1.7.4.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I will try that. Thanks for taking the time to help!
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> -Jeff
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 10, 2010 at 5:38 PM, Matt Calder<mvcal...@gmail.com>
>>>>>>>  wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Bruno,
>>>>>>>>
>>>>>>>> No, it is in a good group. I can log in using fabric if I restart it
>>>>>>>> and the instance is already running. I can see that fabric is inside
>>>>>>>> network.py trying to make the connection. I get one of two errors:
>>>>>>>> either timeout or low level socket error. In debugging, I added
>>>>>>>> retries to network.connect and it will fail repeatedly. First it
>>>>>>>> times out a few times, then gives the "low level socket" error.
>>>>>>>> While it is doing that, I can ssh into it from a terminal. I wonder,
>>>>>>>> does paramiko have a connection cache? Maybe it is not really
>>>>>>>> retrying? Thanks for any help.
>>>>>>>>
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>> On Thu, Jun 10, 2010 at 5:23 PM, Bruno Clermont
>>>>>>>> <bruno.clerm...@gmail.com>      wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Is your instance in a security group that allows your IP and the
>>>>>>>>> port you're trying to connect to?
>>>>>>>>> If it times out, it's probably blocked by Amazon firewalls.
>>>>>>>>>
>>>>>>>>> On Thu, Jun 10, 2010 at 15:07, Matt Calder<mvcal...@gmail.com>
>>>>>>>>>  wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am having problems using fabric with EC2 instances. I am not
>>>>>>>>>> entirely sure fabric is even the source of the problem, but I am
>>>>>>>>>> hoping someone on this list can suggest a solution or a path to
>>>>>>>>>> investigate. Here is the problem. I start an EC2 instance using
>>>>>>>>>> boto. I wait for the instance to report its state as "running".
>>>>>>>>>> I wait an additional 60 seconds after that. Then I try to "run"
>>>>>>>>>> things on the instance through fabric. At that point I get:
>>>>>>>>>>
>>>>>>>>>>  [ubu...@ec2-174-129-96-241.compute-1.amazonaws.com] run: ls
>>>>>>>>>>
>>>>>>>>>> Fatal error: Timed out trying to connect to
>>>>>>>>>> ec2-174-129-96-241.compute-1.amazonaws.com
>>>>>>>>>>
>>>>>>>>>> Aborting.
>>>>>>>>>>
>>>>>>>>>> Now, the interesting thing is this. During that additional 60
>>>>>>>>>> second wait I can log into the instance from a separate terminal;
>>>>>>>>>> moreover, when I do that separate login, the fabric login succeeds.
>>>>>>>>>>
>>>>>>>>>> Obviously, there is not a lot to go on here, but I am not entirely
>>>>>>>>>> sure what additional information would be helpful. If anyone has a
>>>>>>>>>> suggestion of what I might try to do, I would greatly appreciate
>>>>>>>>>> it. Thanks,
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Fab-user mailing list
>>>>>>>>>> Fab-user@nongnu.org
>>>>>>>>>> http://lists.nongnu.org/mailman/listinfo/fab-user
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Forcier
>>>>>>> Unix sysadmin; Python/Ruby developer
>>>>>>> http://bitprophet.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>

_______________________________________________
Fab-user mailing list
Fab-user@nongnu.org
http://lists.nongnu.org/mailman/listinfo/fab-user
