All,

After much debugging I finally found a workaround. I'd like to explain
what I did in the hopes that someone might see what the underlying
problem is.

I don't think I made this point explicit in my previous emails, but I
am using fabric as a library. For simplicity, say I have two
functions, createInstance and runStuff. The createInstance function
creates an ec2 instance (using boto) and waits for the instance's
state to be "running". The runStuff function uses fabric to run code
on the instance. So, my program looks like:

createInstance()
runStuff()

If I run it as is, I get connection failures inside the connect
function in fabric/network.py: either a socket error or a timeout. I
know that ec2 instances can report their state as "running" but still
not be ready to take connections. So I added a sleep to my program:

createInstance()
sleep(240)
runStuff()

Now, four minutes may seem excessive, but even with four minutes I
still get connection errors. During my investigations, I made a few
interesting observations. If I place a debugger breakpoint just after
the sleep, then break and resume, I do not get connection errors. If,
during the sleep period, I ssh into the instance from a terminal, I do
not get connection errors, either in the terminal or in the program
once the sleep has passed (yes, really). Lastly, if I run just
createInstance in one process and afterwards run just runStuff in a
separate process, I do not get connection errors.

The workaround that I found has two parts. First, I removed the
sleep(240) and instead placed a sleep of 20 seconds in
paramiko/client.py, at the very beginning of Client.connect. Second, I
added logic to the connect function in fabric/network.py to retry on
timeouts and socket errors, up to six times. With these changes, I
often connect on the first attempt (which includes one 20 second
sleep), and if not, always on the second (in the ten or so runs I have
done).
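
For anyone who wants to try the same idea without patching the
libraries, the retry part can be sketched roughly as follows. This is
only an illustration, not the actual change made to fabric/network.py:
retry_connect and its parameters are hypothetical names, and connect
stands for any zero-argument callable that opens the connection.

```python
import socket
import time


def retry_connect(connect, attempts=6, delay=20, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    connect is a hypothetical zero-argument callable that opens the
    connection; it stands in for whatever code performs the connect.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        sleep(delay)  # pause before every attempt, including the first
        try:
            return connect()
        except (socket.timeout, socket.error) as exc:
            # Remember the failure and try again on the next pass.
            last_error = exc
    # Every attempt failed: re-raise the most recent error.
    raise last_error
```

The sleep function is a parameter only so that the delay can be
replaced when experimenting; by default it is time.sleep.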

Note that the connection errors occur prior to any ssh activity: the
connection code is just trying to get a socket to port 22 on the ec2
instance.
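
Since the failure happens at the plain socket level, one way to check
it in isolation is to poll port 22 with nothing but the standard
library before handing the host to fabric. The following is a minimal
sketch under that assumption; wait_for_port and its default values are
made up for illustration and are not part of fabric, paramiko, or boto.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=5.0, attempts=12, delay=10.0,
                  sleep=time.sleep):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if every attempt fails. No ssh traffic is exchanged;
    the socket is opened and closed again immediately.
    """
    for _ in range(attempts):
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
        except socket.error:
            # Covers refused connections, timeouts and lookup failures.
            sleep(delay)
        else:
            sock.close()
            return True
    return False
```

Calling wait_for_port with the instance's public DNS name between
createInstance() and runStuff() would show whether the bare TCP
connection fails in the same process even without fabric involved.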

For the record, I am running Ubuntu 10.04; however, colleagues report
the same errors on Windows and MacOS.

I hope someone can provide a reason for the behavior I have been
seeing. I don't mind the workaround, but while it works, it is not
based on any real understanding of what the problem is.

Matt





On Thu, Jun 10, 2010 at 8:57 PM, Patrick J McNerthney
<pmcnerth...@clearpointmetrics.com> wrote:
>
> Try using the --disable-known-hosts command line option to see if it has
> something to do with a prior use of the same ip address.
>
> On 06/10/2010 01:19 PM, Matt Calder wrote:
>>
>> Jeff,
>>
>> On Thu, Jun 10, 2010 at 6:54 PM, Jeff Forcier<j...@bitprophet.org>  wrote:
>>
>>>
>>> Hi Matt,
>>>
>>> Paramiko doesn't have a connection cache that I'm aware of, but Fabric
>>> itself does. However, from your description it sounds like you are
>>> creating a new instance and then connecting to it, so I'm not sure why
>>> a cache would present a problem.
>>>
>>>
>>
>> I'm fairly certain fabric's cache is empty, because the code goes into
>> the network.py : connect function. The reason I suggested a "paramiko
>> cache" is that, while it is true that there is a period just after an
>> instance goes from "pending" to "running" when connections fail, that
>> period is usually very brief (< 10 sec). That is why I do a sleep(60)
>> after the startup, to give time for that to settle.
>>
>>
>>>
>>> If you're rebooting a remote system or doing anything to alter the
>>> networking of an already-connected system, then you can force a
>>> reconnect by manipulating fabric.state.connections. For example, see
>>> what the (master-only) reboot() operation does:
>>>
>>>
>>>  http://code.fabfile.org/repositories/entry/fabric/master/fabric/operations.py#L668
>>>
>>>
>>
>> I will look at that.
>>
>>
>>>
>>> If the problem is as straightforward as it sounds, though, I'm
>>> honestly not sure what's up other than "possible Paramiko bug". Are
>>> you getting any prompts or anything when you connect to the new
>>> instance by hand?
>>>
>>>
>>
>> I can log in by hand, completely and correctly, from a terminal. I can
>> do this after the instance is started but before fabric's first run
>> call. The funny thing is, if I do log in from a terminal, the fabric
>> run command will work. So, a pseudo code timeline:
>>
>> # Version 1, this will fail, the run cannot connect to the instance.
>> startInstance()
>> sleep(60)
>> run("ls")
>>
>> # Version 2, this will succeed in running "ls" on the instance.
>> startInstance()
>> sleep(60) # During this sleep, using a terminal, I log into the instance.
>> run("ls")
>>
>> Another variation that works is:
>>
>> # Version 3, this also succeeds.
>> startInstance()
>> sleep(60)
>> # <Debugger breakpoint here> Using the debugger, look at variables (no changes), then proceed.
>> run("ls")
>>
>> It is the examples that work that shout out "threading error" or
>> "caching error" to me.
>>
>>
>>>
>>> Another thing to try is to upgrade Paramiko to 1.7.6 if you're using
>>> the bundled 1.7.4.
>>>
>>>
>>
>> I will try that. Thanks for taking the time to help!
>>
>> Matt
>>
>>
>>>
>>> -Jeff
>>>
>>>
>>> On Thu, Jun 10, 2010 at 5:38 PM, Matt Calder<mvcal...@gmail.com>  wrote:
>>>
>>>>
>>>> Bruno,
>>>>
>>>> No it is in a good group. I can log in using fabric if I restart it
>>>> and the instance is already running. I can see that fabric is inside
>>>> network.py trying to make the connection. I get one of two errors:
>>>> either timeout or low level socket error. In debugging, I added
>>>> retries to network.connect and it will fail repeatedly. First it times
>>>> out a few times, then gives the "low level socket" error. While it
>>>> is doing that, I can ssh into it from a terminal. I wonder, does
>>>> paramiko have a connection cache? Maybe it is not really retrying?
>>>> Thanks for any help.
>>>>
>>>>
>>>> Matt
>>>>
>>>> On Thu, Jun 10, 2010 at 5:23 PM, Bruno Clermont
>>>> <bruno.clerm...@gmail.com>  wrote:
>>>>
>>>>>
>>>>> Is your instance in a security group that allows your IP and the port
>>>>> you're trying to connect to?
>>>>> If it times out, it's probably blocked by Amazon firewalls.
>>>>>
>>>>> On Thu, Jun 10, 2010 at 15:07, Matt Calder<mvcal...@gmail.com>  wrote:
>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am having problems using fabric with EC2 instances. I am not
>>>>>> entirely sure fabric is even the source of the problem, but I am
>>>>>> hoping someone on this list can suggest a solution or a path to
>>>>>> investigate. Here is the problem. I start an EC2 instance using boto.
>>>>>> I wait for the instance to report its state as "running". I wait an
>>>>>> additional 60 seconds after that. Then I try to "run" things on the
>>>>>> instance through fabric. At that point I get:
>>>>>>
>>>>>>  [ubu...@ec2-174-129-96-241.compute-1.amazonaws.com] run: ls
>>>>>>
>>>>>> Fatal error: Timed out trying to connect to
>>>>>> ec2-174-129-96-241.compute-1.amazonaws.com
>>>>>>
>>>>>> Aborting.
>>>>>>
>>>>>> Now, the interesting thing is this. During that additional 60 second
>>>>>> wait I can log into the instance from a separate terminal. Moreover,
>>>>>> when I do that separate login, the fabric login succeeds.
>>>>>>
>>>>>> Obviously, there is not a lot to go on here, but I am not entirely
>>>>>> sure what additional information would be helpful. If anyone has a
>>>>>> suggestion of what I might try to do, I would greatly appreciate it.
>>>>>> Thanks,
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>> _______________________________________________
>>>>>> Fab-user mailing list
>>>>>> Fab-user@nongnu.org
>>>>>> http://lists.nongnu.org/mailman/listinfo/fab-user
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jeff Forcier
>>> Unix sysadmin; Python/Ruby developer
>>> http://bitprophet.org
>>>
>>>
>>
>>
>
>
>

