I'm not an EC2 user, but I wonder if it might be possible to adapt the approach used here to wait for a webservice to return:
https://groups.google.com/forum/#!topic/ansible-project/iLjIbsCASWU

In that case he's using uri with an 'until', plus the 'default' filter so that an empty result doesn't count as a task failure:

    - name: run test
      uri:
        url: "https://0.0.0.0:3030/api/canary"
        validate_certs: no
      register: result
      until: result['status']|default(0) == 200

Obviously you'd have to replace uri with something that tells you the instance is ready to start working, but it might give you a way to get rid of the need for the arbitrary pause.
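For example (untested, and assuming the image runs cloud-init and writes its usual /var/lib/cloud/instance/boot-finished marker when it completes; "centos" and "{{ ec2_ip }}" here are just placeholders for the real login user and instance address), something along these lines might stand in for the uri check:

    # Runs from the control host; it only succeeds once sshd is up, the
    # login user exists, and cloud-init has finished its run.
    - name: wait until cloud-init has finished on the new instance
      local_action: command ssh -oStrictHostKeyChecking=no -oBatchMode=yes centos@{{ ec2_ip }} test -f /var/lib/cloud/instance/boot-finished
      register: boot_check
      until: boot_check.rc|default(1) == 0
      retries: 30
      delay: 10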
Jon

On Wednesday, July 13, 2016 at 6:40:20 PM UTC+1, Joanna Delaporte wrote:

I just discovered this issue as well, with various random ssh connection or generic authentication/permission failure messages, some occurring after a play had successfully passed a task or two. It occurred very consistently with many CentOS 7 t2.nano hosts. A 2-minute pause after waiting for the ssh listener resolved it for me. The system logs showed the centos user added about 1 minute into boot time, so I gave it two minutes to be generous:

    - name: wait for instances to listen on port:22
      wait_for:
        state: started
        host: "{{ item }}"
        port: 22

    - name: wait for boot process to finish
      pause: minutes=2

It also helped to make sure I removed old host keys, even though I have strict checking turned off:

    - name: remove old host keys
      known_hosts:
        name: "{{item}}"
        state: absent
      with_items: "{{aws_eips}}"

Joanna

PS) Some of the errors I saw caused by this:

"failed to transfer file to /tmp/.ansible"

"Authentication or permission failure. In some cases, you may have been able to authenticate and did not have permissions on the remote directory. Consider changing the remote temp path in ansible.cfg to a path rooted in \"/tmp\". Failed command was: ( umask 77 && mkdir -p \"` echo /tmp/.ansible/..."

On Monday, May 9, 2016 at 3:20:04 PM UTC-5, Allen Sanabria wrote:

This is what I do to make sure that SSH comes up, but also to wait until the user has been created on my instance:

    - set_fact:
        ec2_ip: "{{ ec2_name | get_instance(aws_region, state='running') }}"

    - name: Wait for SSH to come up on instance
      wait_for:
        host: "{{ ec2_ip }}"
        port: 22
        delay: 15
        timeout: 320
        state: started

    - name: Wait until the ansible user can log into the host.
      local_action: command ssh -oStrictHostKeyChecking=no ansible@{{ ec2_ip }} exit
      register: ssh_output
      until: ssh_output.rc == 0
      retries: 20
      delay: 10

On Monday, May 9, 2016 at 1:05:11 PM UTC-7, Jared Bristow wrote:

I am having this same issue. Did you ever figure out a solution?

I have 3 different images I'm testing against: CentOS6, CentOS7, Sles12. The strange thing is that I only seem to have a problem on CentOS7.

On Monday, January 25, 2016 at 2:07:14 PM UTC-7, James Cuzella wrote:

Hello,

I believe I've found an interesting race condition during EC2 instance creation due to a slow-running cloud-init process. The issue is that cloud-init appears to create the initial login user, install the public SSH key onto the newly started EC2 instance, and then restart sshd. It takes a while to do this, which creates a race condition where Ansible cannot connect to the host and fails the playbook run. In my playbook, I'm using the ec2 module, followed by add_host, and then wait_for to wait for the SSH port to be open.
I have also experimented with using a simple "shell: echo host_is_up" command with a retry / do-until loop. However, this also fails because Ansible wants the initial SSH connection to be successful, which it will not be in this case. So Ansible does not retry :-(

It appears that because the user does not exist until ~3 minutes after the instance is booted and sshd is listening on port 22, Ansible cannot connect as the initial login user for the CentOS AMI ("centos"). So the SSH port open check is not good enough: we need to wait for the port to be open AND the login user to exist. The simple echo shell command with a retry do/until loop also does not work, because the very first SSH connection Ansible tries to make to run the module fails as well.

For some detailed debug info, and a playbook to reproduce the issue, please see this Gist:
https://gist.github.com/trinitronx/afd894c89384d413597b

My question is: has anyone run into a similar issue with EC2 instances being slow to become available, causing Ansible to fail to connect, and also found a solution to this?

I realize that a sleep task is one possible solution (and I may be forced to reach for that sledgehammer), but it doesn't feel like the best solution, because we really want to wait both for cloud-init to finish creating the "centos" user on the instance AND for SSH to be up. So really, the only other way I can think of is to somehow tell SSH to retry connecting as centos until it succeeds or surpasses a very long timeout. Is this possible? Are there better ways of handling this?
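Roughly, the shape of what I have in mind is the task below (untested, and "{{ ec2_ip }}" is just a placeholder for however the new instance's address ends up in a variable). Running ssh from the control host via local_action would sidestep the fact that it is Ansible's own first SSH connection that fails:

    # Retry a bare ssh login as the cloud-init-created user until it works.
    - name: retry ssh as centos until cloud-init has created the user
      local_action: command ssh -oStrictHostKeyChecking=no -oBatchMode=yes centos@{{ ec2_ip }} exit
      register: ssh_retry
      until: ssh_retry.rc == 0
      retries: 30
      delay: 10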
