This is what I do to make sure that SSH comes up, and also to wait until the
user has been created on my instance:
- set_fact:
    # get_instance is a custom filter that looks up the running
    # instance's IP address by name in the given region
    ec2_ip: "{{ ec2_name | get_instance(aws_region, state='running') }}"

- name: Wait for SSH to come up on instance
  wait_for:
    host: "{{ ec2_ip }}"
    port: 22
    delay: 15
    timeout: 320
    state: started

- name: Wait until the ansible user can log into the host
  local_action: command ssh -o StrictHostKeyChecking=no ansible@{{ ec2_ip }} exit
  register: ssh_output
  until: ssh_output.rc == 0
  retries: 20
  delay: 10
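The out-of-band ssh command is the key part: because it runs through
local_action, the retry loop is not defeated by Ansible's own first in-band
connection failing. If you also want to be sure cloud-init has finished
entirely (not just that the login user exists), you can poll for the marker
file cloud-init writes when it completes. A minimal sketch along the same
lines, assuming the default marker path /var/lib/cloud/instance/boot-finished
and the same "ansible" login user:

- name: Wait until cloud-init has finished on the instance
  # rc is non-zero both while the user doesn't exist yet (ssh auth fails)
  # and while cloud-init is still running (marker file absent), so this
  # covers both halves of the race
  local_action: command ssh -o StrictHostKeyChecking=no ansible@{{ ec2_ip }} test -f /var/lib/cloud/instance/boot-finished
  register: cloud_init_done
  until: cloud_init_done.rc == 0
  retries: 30
  delay: 10

As a smaller refinement, the wait_for task can also take
search_regex=OpenSSH so it waits for an actual SSH banner rather than just an
open port, which helps when sshd is restarted mid-boot.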
On Monday, May 9, 2016 at 1:05:11 PM UTC-7, Jared Bristow wrote:
>
> I am having this same issue. Did you ever figure out a solution?
>
> I have 3 different images I'm testing against: CentOS6, CentOS7, Sles12.
> The strange thing is that I only seem to have a problem on CentOS7.
>
> On Monday, January 25, 2016 at 2:07:14 PM UTC-7, James Cuzella wrote:
>>
>> Hello,
>>
>> I believe I've found an interesting race condition during EC2 instance
>> creation due to a slow-running cloud-init process. The issue is that
>> cloud-init appears to create the initial login user, install the public
>> SSH key onto the newly started EC2 instance, and then restart sshd. This
>> takes a while and creates a race condition in which Ansible cannot connect
>> to the host and the playbook run fails. In my playbook, I'm using the ec2
>> module, followed by add_host, and then wait_for to wait for the SSH port
>> to be open. I have also experimented with a simple "shell: echo
>> host_is_up" command in a retry / do-until loop. However, this also fails
>> because Ansible needs the initial SSH connection to succeed, which it does
>> not in this case, so Ansible never retries :-(
>>
>> It appears that because the user does not exist until ~3 minutes after the
>> instance has booted and sshd is listening on port 22, Ansible cannot
>> connect as the initial login user for the CentOS AMI ("centos"). So an
>> open-port check alone is not good enough: we need to wait for the port to
>> be open AND for the login user to exist. The simple echo shell command in
>> a retry do/until loop does not work either, because the very first SSH
>> connection Ansible makes to run the module also fails.
>>
>> For some detailed debug info, and a playbook to reproduce the issue,
>> please see this Gist:
>> https://gist.github.com/trinitronx/afd894c89384d413597b
>>
>> My question is: Has anyone run into a similar issue with EC2 instances
>> being slow to become available causing Ansible to fail to connect, and also
>> found a solution to this?
>>
>> I realize that a sleep task is one possible solution (and I may be forced
>> to reach for that sledgehammer), but it doesn't feel like the best
>> solution, because we really want to wait both for cloud-init to finish
>> creating the "centos" user on the instance AND for SSH to be up. So
>> really, the only other way I can think of is to somehow tell SSH to retry
>> connecting as centos until it succeeds or surpasses a very long timeout.
>> Is this possible? Are there better ways of handling this?
>>
>