I'm not an EC2 user, but I wonder if you could adapt the approach used here, 
which waits for a web service to respond:

https://groups.google.com/forum/#!topic/ansible-project/iLjIbsCASWU

In that case he's using uri with an 'until' loop, plus the 'default' filter 
so that an empty result doesn't count as a task failure: 

- name: run test
  uri:
    url: "https://0.0.0.0:3030/api/canary";
    validate_certs: no
  register: result
  until: result['status']|default(0) == 200

Obviously you'd have to replace uri with something that tells you the 
instance is actually ready to do work, but it might let you get rid of the 
arbitrary pause.
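
For example (just a sketch: ec2_ip and the centos login user are placeholders 
for however your play registers the new instance), you could poll over SSH for 
cloud-init's boot-finished marker (/var/lib/cloud/instance/boot-finished) 
instead of hitting a URL:

- name: wait for cloud-init to finish on the new instance
  # ec2_ip and the 'centos' login user are assumptions about your setup
  local_action: command ssh -o StrictHostKeyChecking=no centos@{{ ec2_ip }} test -f /var/lib/cloud/instance/boot-finished
  register: boot_check
  until: boot_check.rc == 0
  retries: 30   # ~5 minutes with delay 10; tune to your boot times
  delay: 10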

Jon

On Wednesday, July 13, 2016 at 6:40:20 PM UTC+1, Joanna Delaporte wrote:
>
> I just discovered this issue as well, with various random ssh connection 
> or generic authentication/permission failure messages, some occurring after 
> a play had successfully passed a task or two. It occurred very consistently 
> with many CentOS 7 t2.nano hosts. A 2-minute pause after waiting for the 
> ssh listener resolved it for me. The system logs showed the centos user 
> being added about 1 minute into boot time, so I gave it two minutes to be 
> generous:
>
>   - name: wait for instances to listen on port 22
>     wait_for:
>       state: started
>       host: "{{ item }}"
>       port: 22
>     with_items: "{{ aws_eips }}"  # assuming the same list used for known_hosts below
>
>   - name: wait for boot process to finish
>     pause: minutes=2
>
> It also helped to make sure I removed old host keys, even though I have 
> strict checking turned off:
>   - name: remove old host keys
>     known_hosts:
>       name: "{{ item }}"
>       state: absent
>     with_items: "{{ aws_eips }}"
>
>
> Joanna
>
> PS) Some of the errors I saw caused by this:
> "failed to transfer file to /tmp/.ansible"
> "Authentication or permission failure. In some cases, you may have been 
> able to authenticate and did not have permissions on the remote directory. 
> Consider changing the remote temp path in ansible.cfg to a path rooted in 
> \"/tmp\". Failed command was: ( umask 77 && mkdir -p \"` echo 
> /tmp/.ansible/..."
>
> On Monday, May 9, 2016 at 3:20:04 PM UTC-5, Allen Sanabria wrote:
>>
>> This is what I do to make sure that SSH comes up, and also to wait until 
>> the user has been created on my instance:
>>
>> - set_fact:
>>     ec2_ip: "{{ ec2_name | get_instance(aws_region, state='running') }}"
>>
>> - name: Wait for SSH to come up on instance
>>   wait_for:
>>     host: "{{ ec2_ip }}"
>>     port: 22
>>     delay: 15
>>     timeout: 320
>>     state: started
>>
>> - name: Wait until the ansible user can log into the host.
>>   local_action: command ssh -oStrictHostKeyChecking=no ansible@{{ ec2_ip }} exit
>>   register: ssh_output
>>   until: ssh_output.rc == 0
>>   retries: 20
>>   delay: 10
>>
>> On Monday, May 9, 2016 at 1:05:11 PM UTC-7, Jared Bristow wrote:
>>>
>>> I am having this same issue.  Did you ever figure out a solution?
>>>
>>> I have 3 different images I'm testing against: CentOS6, CentOS7, Sles12. 
>>> The strange thing is that I only seem to have a problem on CentOS7.  
>>>
>>> On Monday, January 25, 2016 at 2:07:14 PM UTC-7, James Cuzella wrote:
>>>>
>>>> Hello,
>>>>
>>>> I believe I've found an interesting race condition during EC2 instance 
>>>> creation, caused by a slow-running cloud-init process.  The issue is that 
>>>> cloud-init appears to create the initial login user, install the public 
>>>> SSH key onto the newly started EC2 instance, and then restart sshd.  This 
>>>> takes a while, and it creates a race condition where Ansible cannot 
>>>> connect to the host and the playbook run fails.  In my playbook I'm using 
>>>> the ec2 module, followed by add_host, and then wait_for to wait for the 
>>>> SSH port to be open.  I have also experimented with a simple "shell: echo 
>>>> host_is_up" command in a retry / do-until loop.  However, this also 
>>>> fails, because Ansible wants the initial SSH connection to be successful, 
>>>> which it will not be in this case, so Ansible never retries :-(
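>>>>
>>>> That do-until attempt was roughly this shape (task name and retry counts 
>>>> here are illustrative):
>>>>
>>>> - name: check that the new host answers a trivial command
>>>>   shell: echo host_is_up
>>>>   register: host_up
>>>>   until: host_up.rc == 0
>>>>   retries: 30   # illustrative values
>>>>   delay: 10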
>>>>
>>>> It appears that because the user does not exist until ~3 minutes after 
>>>> the instance boots and sshd is listening on port 22, Ansible cannot 
>>>> connect as the initial login user for the CentOS AMI ("centos").  So the 
>>>> SSH port check alone is not good enough: it waits for the port to be 
>>>> open, but not for the login user to exist.  The simple echo shell command 
>>>> in a retry do/until loop does not work either, because the very first SSH 
>>>> connection Ansible makes to run the module is the one that fails.
>>>>
>>>> For some detailed debug info, and a playbook to reproduce the issue, 
>>>> please see this Gist:  
>>>> https://gist.github.com/trinitronx/afd894c89384d413597b
>>>>
>>>> My question is: has anyone run into a similar issue with EC2 instances 
>>>> being slow to become available, causing Ansible to fail to connect, and 
>>>> found a solution to this?
>>>>
>>>> I realize that a sleep task is one possible solution (and I may be 
>>>> forced to reach for that sledgehammer), but it doesn't feel like the 
>>>> best solution, because we really want to wait for both cloud-init to 
>>>> finish creating the "centos" user on the instance AND SSH to be up.  So 
>>>> really, the only other way I can think of is to somehow tell SSH to keep 
>>>> retrying the connection as centos until it succeeds or a very long 
>>>> timeout is exceeded.  Is this possible?  Are there better ways of 
>>>> handling this?
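>>>>
>>>> For concreteness, something like this is the kind of retry I have in 
>>>> mind (just a sketch; ec2_ip stands in for however the new instance's 
>>>> address gets registered):
>>>>
>>>> - name: retry SSH as centos until it succeeds
>>>>   # ec2_ip is a placeholder for however the address is looked up
>>>>   local_action: command ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 centos@{{ ec2_ip }} true
>>>>   register: ssh_probe
>>>>   until: ssh_probe.rc == 0
>>>>   retries: 60   # with delay 10 this allows roughly ten minutes
>>>>   delay: 10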
>>>>
>>>
