[ansible-project] Re: EC2 slow cloud-init, Ansible SSH connection fails due to race condition (wait_for is not good enough)

Jared Bristow Mon, 09 May 2016 13:06:12 -0700

I am having this same issue.  Did you ever figure out a solution?

I have 3 different images I'm testing against: CentOS 6, CentOS 7, and SLES 12.
The strange thing is that I only seem to have a problem on CentOS 7.


On Monday, January 25, 2016 at 2:07:14 PM UTC-7, James Cuzella wrote:
>
> Hello,
>
> I believe I've found an interesting race condition during EC2 instance
> creation, caused by a slow-running cloud-init process.  cloud-init appears
> to create the initial login user and install the public SSH key on a newly
> started EC2 instance, then restart sshd.  This takes a while, which creates
> a race condition where Ansible cannot connect to the host and the playbook
> run fails.  In my playbook, I'm using the ec2 module, followed by add_host,
> and then wait_for to wait for the SSH port to be open.  I have also
> experimented with a simple "shell: echo host_is_up" command in a retry /
> do-until loop.  However, this also fails because Ansible needs the initial
> SSH connection to succeed, and in this case it does not.  So Ansible does
> not retry :-(
>
> It appears that because the user does not exist until ~3 minutes after the
> instance has booted and sshd is listening on port 22, Ansible cannot connect
> as the initial login user for the CentOS AMI ("centos").  So checking that
> the SSH port is open is not enough: we need to wait for the port to be open
> AND the login user to exist.  The simple echo shell command with a retry
> do/until loop does not work either, because the very first SSH connection
> Ansible makes to run the module also fails.
>
> For some detailed debug info, and a playbook to reproduce the issue, 
> please see this Gist:  
> https://gist.github.com/trinitronx/afd894c89384d413597b
>
> My question is:   Has anyone run into a similar issue with EC2 instances 
> being slow to become available causing Ansible to fail to connect, and also 
> found a solution to this?
>
> I realize that a sleep task is one possible solution (and I may be forced
> to reach for that sledgehammer), but it doesn't feel like the best
> solution, because we really want to wait for both cloud-init to finish
> creating the "centos" user on the instance AND SSH to be up.  So really, the
> only other way I can think of is to somehow tell SSH to retry connecting as
> centos until it succeeds or exceeds a very long timeout.  Is this
> possible?  Are there better ways of handling this?
>
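
The "retry connecting as centos until it succeeds or times out" idea from the
quoted message could be sketched outside Ansible roughly like this.  It is only
a minimal sketch and not something taken from this thread: the function names
ssh_login_ok and wait_until, the default timeout and interval, and the choice of
ssh options (including skipping host key checking for a freshly created
instance) are all assumptions made for illustration.

```python
import subprocess
import time


def ssh_login_ok(host, user="centos", connect_timeout=5):
    """Return True if a non-interactive SSH login as `user` succeeds.

    Hypothetical helper: unlike a port check, this only succeeds once the
    login user exists and its public key has been installed.
    """
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt; fail instead
        "-o", f"ConnectTimeout={connect_timeout}",
        "-o", "StrictHostKeyChecking=no",     # assumption: brand-new host, key unknown
        f"{user}@{host}",
        "true",                               # trivial remote command
    ]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=connect_timeout + 10,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def wait_until(probe, timeout=600, interval=10,
               sleep=time.sleep, clock=time.monotonic):
    """Call `probe` until it returns True or `timeout` seconds have passed.

    Returns the number of attempts made on success, or None on timeout.
    `sleep` and `clock` are injectable so the loop can be exercised
    without real waiting.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        if clock() >= deadline:
            return None
        sleep(interval)
```

With a placeholder address, it would be called as
wait_until(lambda: ssh_login_ok("203.0.113.10")) before handing the host to
Ansible.  The point of the design is that the probe is a full login as the
"centos" user rather than a check that port 22 is open, so it keeps failing
while cloud-init is still creating the user and restarting sshd.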
