[ansible-project] EC2 slow cloud-init, Ansible SSH connection fails due to race condition (wait_for is not good enough)

James Cuzella Mon, 25 Jan 2016 13:07:21 -0800

Hello,

I believe I've found an interesting race condition during EC2 instance 
creation, caused by a slow-running cloud-init process.  cloud-init appears 
to create the initial login user and install the public SSH key on a newly 
started EC2 instance, then restart sshd.  This takes a while, and in the 
meantime Ansible cannot connect to the host, so the playbook run fails.  In 
my playbook, I'm using the ec2 module, followed by add_host, and then 
wait_for to wait for the SSH port to be open.  I have also experimented 
with a simple "shell: echo host_is_up" task with a retry / do-until loop.  
However, this also fails because Ansible needs the initial SSH connection 
to succeed, which it does not in this case.  So Ansible does not retry :-(

It appears that the initial login user for the CentOS AMI ("centos") does
not exist until ~3 minutes after the instance has booted and sshd has
started listening on port 22, so Ansible cannot connect as that user. The
SSH port-open check is therefore not good enough: we need to wait for the
port to be open AND the login user to exist. The simple echo shell command
with a retry do/until loop does not work either, because the very first SSH
connection Ansible makes to run the module also fails.
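
To illustrate why the port check passes too early, here is a rough Python
sketch of what a port-open check boils down to. This is my own
approximation, not the actual wait_for implementation, and the function
name port_is_open is made up for illustration. It returns True as soon as
anything accepts a TCP connection on port 22, regardless of whether the
"centos" user exists yet.

```python
import socket


def port_is_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    This only proves that something is listening on the port. It says
    nothing about whether a particular user can log in yet.
    """
    try:
        # Open a plain TCP connection and close it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, host unreachable, and so on.
        return False
```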

For some detailed debug info, and a playbook to reproduce the issue, please
see this Gist: https://gist.github.com/trinitronx/afd894c89384d413597b

My question is: Has anyone run into a similar issue with EC2 instances
being slow to become available causing Ansible to fail to connect, and also
found a solution to this?

I realize that a sleep task is one possible solution (and I may be forced
to reach for that sledgehammer), but it doesn't feel like the best solution
because we really want to wait for both cloud-init to finish creating the
"centos" user on the instance AND SSH to be up. So really, the only other
way I can think of is to somehow tell SSH to retry connecting as centos
until it succeeds or surpasses a very long timeout. Is this possible? Are
there better ways of handling this?
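
To make the retry idea concrete, here is a minimal sketch of the behaviour
I'm after, written as a standalone Python helper rather than as an Ansible
task. The function names (ssh_login_works, wait_for_login), the retry
count, and the delay are all assumptions of mine for illustration; the ssh
options used (BatchMode, ConnectTimeout, StrictHostKeyChecking) are
standard OpenSSH client options.

```python
import subprocess
import time


def ssh_login_works(host, user="centos", connect_timeout=10):
    """Try one non-interactive SSH login and report whether it succeeded."""
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",             # never prompt for a password
        "-o", f"ConnectTimeout={connect_timeout}",
        "-o", "StrictHostKeyChecking=no",  # new instance, host key not known yet
        f"{user}@{host}",
        "echo host_is_up",
    ]
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def wait_for_login(probe, retries=30, delay=10, sleep=time.sleep):
    """Call probe() until it returns True or the retries are exhausted.

    Returns the number of attempts used; raises TimeoutError on failure.
    """
    for attempt in range(1, retries + 1):
        if probe():
            return attempt
        if attempt < retries:
            sleep(delay)  # give cloud-init more time before the next try
    raise TimeoutError(f"login still failing after {retries} attempts")
```

It would be called as something like
wait_for_login(lambda: ssh_login_works("203.0.113.10")), where the address
is a placeholder. Because the probe is a plain ssh command run from the
control machine, a failed login is just a non-zero exit code that can be
retried, rather than a failed Ansible connection.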

