Patrick,

I thought you were on to something there, but alas, no. I get the same
behavior using both the DNS name and the IP address: the errors without the
fixes described, and correct connections with the fixes.

Matt

On Mon, Jun 14, 2010 at 5:25 PM, Patrick J McNerthney
<pmcnerth...@clearpointmetrics.com> wrote:
> Matt,
>
> Try eliminating the use of DNS, i.e.
> "ec2-174-129-96-241.compute-1.amazonaws.com", and instead connect directly
> to the IP address, i.e. 174.129.96.241, to see if that has something to do
> with it.
>
> Pat
>
>
> On 06/14/2010 11:16 AM, Matt Calder wrote:
>>
>> All,
>>
>> After much debugging I finally found a workaround. I'd like to explain
>> what I did in the hopes that someone might see what the underlying
>> problem is.
>>
>> I don't think I made this point explicit in my previous emails, but I
>> am using fabric as a library. For simplicity, say I have two
>> functions, createInstance, and runStuff. The createInstance function
>> creates an ec2 instance (using boto) and waits for the instance's
>> state to be "running". The runStuff function uses fabric to run code
>> on the instance. So, my program looks like:
>>
>> createInstance()
>> runStuff()
>>
>> If I run it as is, I will get connection failures, inside
>> fabric/network.py: connect, either a socket error or a timeout. I know
>> that ec2 instances can report their state as "running" but still not
>> be ready to take connections. So I added a sleep to my program,
>>
>> createInstance()
>> sleep(240)
>> runStuff()
>>
>> Now, four minutes may seem excessive, but, with four minutes I still
>> get connection errors. During my investigations, I made a few
>> interesting observations. If I place a debugger breakpoint just after
>> the sleep, I can break and resume, and I will not get connection
>> errors. If, during the sleep period, I ssh into the instance from a
>> terminal, I will not get connection errors, either in the terminal or
>> in the program when the sleep passes (yes, really). Lastly, if I run
>> just createInstance in one process and afterwards run just runStuff in
>> a separate process, I do not get connection errors.
>>
>> The workaround that I found has two parts. First, I removed the
>> sleep(240). Instead, I placed a sleep of 20 seconds in
>> paramiko/client.py, at the very beginning of Client.connect. Then I
>> added logic to fabric/network.py connect to retry on timeouts and
>> socket errors up to six times. With these changes, I often connect the
>> first time (that would include one 20 second sleep), and if not,
>> always the second time (in the ten or so runs I have done).
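
The retry half of the workaround described above can be sketched without
touching fabric or paramiko. This is a minimal sketch, not the actual patch:
connect_with_retry and its parameters are hypothetical names, the six
attempts and the 20 second delay are taken from the description above, and
the delay is applied between attempts here rather than at the start of
paramiko's connect.

```python
import socket
import time


def connect_with_retry(connect, attempts=6, delay=20.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    connect is any zero-argument callable that opens the connection and
    raises socket.timeout or another socket error (OSError) on failure.
    All names here are hypothetical, not part of fabric or paramiko.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as exc:
            # socket.timeout and socket.error are both OSError subclasses
            # in Python 3, so this covers timeouts and socket errors.
            last_error = exc
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed: re-raise the most recent error.
    raise last_error
```

Any zero-argument callable that opens the connection can be passed in, for
example a lambda wrapping socket.create_connection((host, 22), timeout=10).
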
>>
>> Note that the connection errors are occurring prior to any ssh
>> activities, the connection is just getting a socket to port 22 on the
>> ec2 instance.
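
Since the failure described above happens at the plain socket level, one way
to narrow it down might be to probe port 22 with nothing but the standard
library, independent of fabric and paramiko. The following is only a sketch
under that assumption: port_is_open and wait_for_port are hypothetical
helper names, and the timeouts are arbitrary defaults. It checks TCP
reachability only and does not explain the behavior seen in this thread.

```python
import socket
import time


def port_is_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP
        # handshake; the socket is closed again on leaving the block.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures.
        return False


def wait_for_port(host, port=22, timeout=5.0, interval=5.0, max_wait=300.0):
    """Poll until host:port accepts a connection or max_wait seconds pass."""
    deadline = time.monotonic() + max_wait
    while True:
        if port_is_open(host, port, timeout):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_for_port("ec2-174-129-96-241.compute-1.amazonaws.com") between
createInstance() and runStuff() would replace the fixed sleep with a check
against the same port that the later connection uses.
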
>>
>> For the record I am running Ubuntu 10.04, however, colleagues report
>> the same errors on Windows and MacOS.
>>
>> I hope someone can provide a reason for the behavior I have been
>> seeing. I don't mind the workaround, but while it works, it is not
>> based on any real understanding of what the problem is.
>>
>> Matt
>>
>>
>>
>>
>>
>> On Thu, Jun 10, 2010 at 8:57 PM, Patrick J McNerthney
>> <pmcnerth...@clearpointmetrics.com>  wrote:
>>
>>>
>>> Try using the --disable-known-hosts command line option to see if it has
>>> something to do with a prior use of the same ip address.
>>>
>>> On 06/10/2010 01:19 PM, Matt Calder wrote:
>>>
>>>>
>>>> Jeff,
>>>>
>>>> On Thu, Jun 10, 2010 at 6:54 PM, Jeff Forcier<j...@bitprophet.org>
>>>>  wrote:
>>>>
>>>>
>>>>>
>>>>> Hi Matt,
>>>>>
>>>>> Paramiko doesn't have a connection cache that I'm aware of, but Fabric
>>>>> itself does. However, from your description it sounds like you are
>>>>> creating a new instance and then connecting to it, so I'm not sure why
>>>>> a cache would present a problem.
>>>>>
>>>>>
>>>>>
>>>>
>>>> I'm fairly certain fabric's cache is empty, because the code goes into
>>>> the network.py : connect function. The reason I suggested a "paramiko
>>>> cache" is that, while it is true that just after an instance goes from
>>>> "pending" to "running" there is a period when connections fail, that
>>>> period is usually very brief (< 10 sec). That is why I do a sleep(60)
>>>> after the startup, to give time for that to settle.
>>>>
>>>>
>>>>
>>>>>
>>>>> If you're rebooting a remote system or doing anything to alter the
>>>>> networking of an already-connected system, then you can force a
>>>>> reconnect by manipulating fabric.state.connections. For example, see
>>>>> what the (master-only) reboot() operation does:
>>>>>
>>>>>
>>>>>
>>>>>  http://code.fabfile.org/repositories/entry/fabric/master/fabric/operations.py#L668
>>>>>
>>>>>
>>>>>
>>>>
>>>> I will look at that.
>>>>
>>>>
>>>>
>>>>>
>>>>> If the problem is as straightforward as it sounds, though, I'm
>>>>> honestly not sure what's up other than "possible Paramiko bug". Are
>>>>> you getting any prompts or anything when you connect to the new
>>>>> instance by hand?
>>>>>
>>>>>
>>>>>
>>>>
>>>> I can log in by hand, completely and correctly, from a terminal. I can
>>>> do this after the instance is started but before fabric's first run
>>>> call. The funny thing is, if I do log in from a terminal, the fabric
>>>> run command will work. So, a pseudo code timeline:
>>>>
>>>> # Version 1, this will fail, the run cannot connect to the instance.
>>>> startInstance()
>>>> sleep(60)
>>>> run("ls")
>>>>
>>>> # Version 2, this will succeed in running "ls" on the instance.
>>>> startInstance()
>>>> # During this sleep, I log into the instance from a terminal.
>>>> sleep(60)
>>>> run("ls")
>>>>
>>>> Another variation that works is:
>>>>
>>>> # Version 3, this also succeeds.
>>>> startInstance()
>>>> sleep(60)
>>>> # Debugger breakpoint here: inspect variables (no changes), then proceed.
>>>> run("ls")
>>>>
>>>> It is the examples that work that shout out "threading error" or
>>>> "caching error" to me.
>>>>
>>>>
>>>>
>>>>>
>>>>> Another thing to try is to upgrade Paramiko to 1.7.6 if you're using
>>>>> the bundled 1.7.4.
>>>>>
>>>>>
>>>>>
>>>>
>>>> I will try that. Thanks for taking the time to help!
>>>>
>>>> Matt
>>>>
>>>>
>>>>
>>>>>
>>>>> -Jeff
>>>>>
>>>>>
>>>>> On Thu, Jun 10, 2010 at 5:38 PM, Matt Calder<mvcal...@gmail.com>
>>>>>  wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Bruno,
>>>>>>
>>>>>> No it is in a good group. I can log in using fabric if I restart it
>>>>>> and the instance is already running. I can see that fabric is inside
>>>>>> network.py trying to make the connection. I get one of two errors:
>>>>>> either timeout or low level socket error. In debugging, I added
>>>>>> retries to network.connect and it will fail repeatedly. First it times
>>>>>> out a few times, then gives the "low level socket" error. While it
>>>>>> is doing that, I can ssh into it from a terminal. I wonder, does paramiko
>>>>>> have a connection cache? Maybe it is not really retrying? Thanks for
>>>>>> any help.
>>>>>>
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>> On Thu, Jun 10, 2010 at 5:23 PM, Bruno Clermont
>>>>>> <bruno.clerm...@gmail.com>    wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Is your instance in a security group that allows your IP and the port
>>>>>>> you're trying to connect to?
>>>>>>> If it times out, it's probably blocked by Amazon firewalls.
>>>>>>>
>>>>>>> On Thu, Jun 10, 2010 at 15:07, Matt Calder<mvcal...@gmail.com>
>>>>>>>  wrote:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am having problems using fabric with EC2 instances. I am not
>>>>>>>> entirely sure fabric is even the source of the problem, but I am
>>>>>>>> hoping someone on this list can suggest a solution or a path to
>>>>>>>> investigate. Here is the problem. I start an EC2 instance using boto.
>>>>>>>> I wait for the instance to report its state as "running". I wait an
>>>>>>>> additional 60 seconds after that. Then I try to "run" things on the
>>>>>>>> instance through fabric. At that point I get:
>>>>>>>>
>>>>>>>>  [ubu...@ec2-174-129-96-241.compute-1.amazonaws.com] run: ls
>>>>>>>>
>>>>>>>> Fatal error: Timed out trying to connect to
>>>>>>>> ec2-174-129-96-241.compute-1.amazonaws.com
>>>>>>>>
>>>>>>>> Aborting.
>>>>>>>>
>>>>>>>> Now, the interesting thing is this. During that additional 60 second
>>>>>>>> wait I can log into the instance from a separate terminal. Moreover,
>>>>>>>> when I do that separate login, the fabric login succeeds.
>>>>>>>>
>>>>>>>> Obviously, there is not a lot to go on here, but I am not entirely
>>>>>>>> sure what additional information would be helpful. If anyone has a
>>>>>>>> suggestion of what I might try to do, I would greatly appreciate it.
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Fab-user mailing list
>>>>>>>> Fab-user@nongnu.org
>>>>>>>> http://lists.nongnu.org/mailman/listinfo/fab-user
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Jeff Forcier
>>>>> Unix sysadmin; Python/Ruby developer
>>>>> http://bitprophet.org
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>

