Patrick, I thought you were on to something there, but alas, no. I get the same results using either the DNS name or the IP address: connection errors without the fixes I described, and successful connections with them.
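
For anyone following along, the retry half of the workaround quoted below boils down to something like this. It is only a rough sketch, not the actual change to fabric/network.py: wait_for_port and its parameters are names made up for illustration, it assumes Python 3, and it sleeps after each failed attempt rather than before the first one.

```python
import socket
import time


def wait_for_port(host, port=22, attempts=6, delay=20.0, timeout=10.0):
    """Return True once a plain TCP connection to host:port succeeds.

    Hypothetical helper, not part of Fabric or Paramiko. It tries up to
    `attempts` times and sleeps `delay` seconds after each failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
        except OSError:
            # On Python 3, timeouts and low-level socket errors are both
            # OSError subclasses, so one handler covers both cases.
            if attempt < attempts:
                time.sleep(delay)
        else:
            # The port accepted a connection; no ssh traffic is involved.
            sock.close()
            return True
    return False
```

The idea would be to call wait_for_port(hostname) between createInstance() and runStuff(), and only go on to the fabric calls if it returns True.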
Matt

On Mon, Jun 14, 2010 at 5:25 PM, Patrick J McNerthney
<pmcnerth...@clearpointmetrics.com> wrote:
> Matt,
>
> Try eliminating the use of DNS, i.e.
> "ec2-174-129-96-241.compute-1.amazonaws.com", and instead connect directly
> to the IP address, i.e. 174.129.96.241, to see if that has something to do
> with it.
>
> Pat
>
> On 06/14/2010 11:16 AM, Matt Calder wrote:
>>
>> All,
>>
>> After much debugging I finally found a workaround. I'd like to explain
>> what I did in the hopes that someone might see what the underlying
>> problem is.
>>
>> I don't think I made this point explicit in my previous emails, but I
>> am using fabric as a library. For simplicity, say I have two
>> functions, createInstance and runStuff. The createInstance function
>> creates an ec2 instance (using boto) and waits for the instance's
>> state to be "running". The runStuff function uses fabric to run code
>> on the instance. So, my program looks like:
>>
>> createInstance()
>> runStuff()
>>
>> If I run it as is, I will get connection failures, inside
>> fabric/network.py: connect, either a socket error or a timeout. I know
>> that ec2 instances can report their state as "running" but still not
>> be ready to take connections. So I added a sleep to my program,
>>
>> createInstance()
>> sleep(240)
>> runStuff()
>>
>> Now, four minutes may seem excessive, but even with four minutes I still
>> get connection errors. During my investigations, I made a few
>> interesting observations. If I place a debugger breakpoint just after
>> the sleep, I can break and resume, and I will not get connection
>> errors. If, during the sleep period, I ssh into the instance from a
>> terminal, I will not get connection errors, either in the terminal or
>> in the program when the sleep passes (yes, really). Lastly, if I run
>> just createInstance in one process, then afterwards run just runStuff in
>> another separate process, I do not get connection errors.
>>
>> The workaround that I found had two parts. First, I removed the
>> sleep(240). Instead, I placed a sleep of 20 seconds in
>> paramiko/client.py, at the very beginning of Client.connect. Then I
>> added logic to fabric/network.py connect to retry on timeouts and
>> socket errors up to six times. With these changes, I often connect the
>> first time (that would include one 20 second sleep), and if not,
>> always the second time (in the ten or so runs I have done).
>>
>> Note that the connection errors are occurring prior to any ssh
>> activities; the connection is just getting a socket to port 22 on the
>> ec2 instance.
>>
>> For the record, I am running Ubuntu 10.04; however, colleagues report
>> the same errors on Windows and MacOS.
>>
>> I hope someone can provide a reason for the behavior I have been
>> seeing. I don't mind the workaround, but while it works, it is not
>> based on any real understanding of what the problem is.
>>
>> Matt
>>
>> On Thu, Jun 10, 2010 at 8:57 PM, Patrick J McNerthney
>> <pmcnerth...@clearpointmetrics.com> wrote:
>>
>>> Try using the --disable-known-hosts command line option to see if it has
>>> something to do with a prior use of the same ip address.
>>>
>>> On 06/10/2010 01:19 PM, Matt Calder wrote:
>>>
>>>> Jeff,
>>>>
>>>> On Thu, Jun 10, 2010 at 6:54 PM, Jeff Forcier<j...@bitprophet.org>
>>>> wrote:
>>>>
>>>>> Hi Matt,
>>>>>
>>>>> Paramiko doesn't have a connection cache that I'm aware of, but Fabric
>>>>> itself does. However, from your description it sounds like you are
>>>>> creating a new instance and then connecting to it, so I'm not sure why
>>>>> a cache would present a problem.
>>>>
>>>> I'm fairly certain fabric's cache is empty, because the code goes into
>>>> the network.py : connect function. The reason I suggested a "paramiko
>>>> cache" is that, while it is true that just after an instance goes from
>>>> "pending" to "running" there is a period when connections fail, that
>>>> period usually is very brief (< 10 sec). That is why I do a sleep(60)
>>>> after the startup, to give time for that to settle.
>>>>
>>>>> If you're rebooting a remote system or doing anything to alter the
>>>>> networking of an already-connected system, then you can force a
>>>>> reconnect by manipulating fabric.state.connections. For example, see
>>>>> what the (master-only) reboot() operation does:
>>>>>
>>>>> http://code.fabfile.org/repositories/entry/fabric/master/fabric/operations.py#L668
>>>>
>>>> I will look at that.
>>>>
>>>>> If the problem is as straightforward as it sounds, though, I'm
>>>>> honestly not sure what's up other than "possible Paramiko bug". Are
>>>>> you getting any prompts or anything when you connect to the new
>>>>> instance by hand?
>>>>
>>>> I can log in by hand, completely and correctly, from a terminal. I can
>>>> do this after the instance is started but before fabric's first run
>>>> call. The funny thing is, if I do log in from a terminal, the fabric
>>>> run command will work. So, a pseudo code timeline:
>>>>
>>>> # Version 1, this will fail, the run cannot connect to the instance.
>>>> startInstance()
>>>> sleep(60)
>>>> run("ls")
>>>>
>>>> # Version 2, this will succeed in running "ls" on the instance.
>>>> startInstance()
>>>> sleep(60)  # During this sleep, using a terminal, I log into the instance.
>>>> run("ls")
>>>>
>>>> Another variation that works is:
>>>>
>>>> # Version 3, this also succeeds.
>>>> startInstance()
>>>> sleep(60)
>>>> <Debugger breakpoint here> Using debugger, look at variables (no
>>>> changes), proceed
>>>> run("ls")
>>>>
>>>> It is the examples that work that shout out "threading error" or
>>>> "caching error" to me.
>>>>
>>>>> Another thing to try is to upgrade Paramiko to 1.7.6 if you're using
>>>>> the bundled 1.7.4.
>>>>
>>>> I will try that. Thanks for taking the time to help!
>>>>
>>>> Matt
>>>>
>>>>> -Jeff
>>>>>
>>>>> On Thu, Jun 10, 2010 at 5:38 PM, Matt Calder<mvcal...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Bruno,
>>>>>>
>>>>>> No, it is in a good group. I can log in using fabric if I restart it
>>>>>> and the instance is already running. I can see that fabric is inside
>>>>>> network.py trying to make the connection. I get one of two errors:
>>>>>> either timeout or low level socket error. In debugging, I added
>>>>>> retries to network.connect and it will fail repeatedly. First it times
>>>>>> out a few times, then gives the "low level socket" error. While it
>>>>>> is doing that, I can ssh into it from a terminal. I wonder, does
>>>>>> paramiko have a connection cache? Maybe it is not really retrying?
>>>>>> Thanks for any help.
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>> On Thu, Jun 10, 2010 at 5:23 PM, Bruno Clermont
>>>>>> <bruno.clerm...@gmail.com> wrote:
>>>>>>
>>>>>>> Is your instance in a security group that allows your IP and the
>>>>>>> port you're trying to connect to?
>>>>>>> If it times out, it's probably blocked by Amazon firewalls.
>>>>>>>
>>>>>>> On Thu, Jun 10, 2010 at 15:07, Matt Calder<mvcal...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am having problems using fabric with EC2 instances. I am not
>>>>>>>> entirely sure fabric is even the source of the problem, but I am
>>>>>>>> hoping someone on this list can suggest a solution or a path to
>>>>>>>> investigate. Here is the problem. I start an EC2 instance using
>>>>>>>> boto.
>>>>>>>> I wait for the instance to report its state as "running". I wait an
>>>>>>>> additional 60 seconds after that. Then I try to "run" things on the
>>>>>>>> instance through fabric. At that point I get:
>>>>>>>>
>>>>>>>> [ubu...@ec2-174-129-96-241.compute-1.amazonaws.com] run: ls
>>>>>>>>
>>>>>>>> Fatal error: Timed out trying to connect to
>>>>>>>> ec2-174-129-96-241.compute-1.amazonaws.com
>>>>>>>>
>>>>>>>> Aborting.
>>>>>>>>
>>>>>>>> Now, the interesting thing is this. During that additional 60 second
>>>>>>>> wait I can log into the instance from a separate terminal; moreover,
>>>>>>>> when I do that separate login, the fabric login succeeds.
>>>>>>>>
>>>>>>>> Obviously, there is not a lot to go on here, but I am not entirely
>>>>>>>> sure what additional information would be helpful. If anyone has a
>>>>>>>> suggestion of what I might try to do, I would greatly appreciate it.
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Matt

_______________________________________________
Fab-user mailing list
Fab-user@nongnu.org
http://lists.nongnu.org/mailman/listinfo/fab-user