My understanding of retry files (which could certainly be wrong) is that 
they merely limit the hosts that are included in the run.  I don't think 
that will work for me, although perhaps this indicates that my playbook is 
not set up well.  Here is a simplified version of my site.yml:

- name: copy new files to all nodes
  hosts: all
  tasks:
  - include: tasks/deploy_files.yml

- name: configure and deploy backend type foo
  hosts: tag_foo
  roles:
  - foo

- name: configure and deploy backend type bar
  hosts: tag_bar
  roles:
  - bar

- name: configure and deploy backend type baz
  hosts: tag_baz
  roles:
  - baz

(etc., for 7 backend types in total)

- name: clean up old deployments from all nodes
  hosts: all
  tasks:
  - include: tasks/remove_old_deployments.yml


So, given this structure, suppose the "foo" play went fine, but then some 
task during one of the "bar" backend deployments failed.  Won't the retry 
file just contain that single host (assuming we are running "serial: 1" 
for the play that failed)?  So if I reran using that file as a --limit, I 
might get that "bar" host to deploy correctly, but I would completely miss 
all of the "baz" hosts and every other backend whose deployment plays come 
after the "bar" play.

I suppose one option might be to break this single site.yml up into 7 
different playbooks, one for each backend type, and then execute them in 
order, retrying each one as necessary if any errors occur.  Would that be 
a better setup?  It *seems* a bit silly, but maybe I'm wrong about 
that...
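
To make that concrete, each backend would get its own playbook (file 
names invented here), e.g. a bar.yml containing just:

- name: configure and deploy backend type bar
  hosts: tag_bar
  roles:
  - bar

and then I'd run the playbooks one at a time, rerunning any that fails 
with its .retry file as the --limit before moving on to the next.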

Thanks,
Ian



On Monday, August 10, 2015 at 3:37:32 PM UTC-4, Brian Coca wrote:
>
> You can use the .retry files as a --limit to rerun the plays. 
>
> On Mon, Aug 10, 2015 at 3:29 PM, Ian Rose <[email protected]> wrote: 
> > Hi all - 
> > 
> > I've been pretty happy running Ansible for a few months now.  The one 
> > major thorn in my side is failed tasks.  Our fleet of VMs is not very 
> > large, but apparently is large enough (or our playbook is long enough) 
> > that we hit at least one spurious SSH error (e.g. "SSH Error: 
> > mux_client_hello_exchange: write packet: Broken pipe"), or, more 
> > rarely, I'll hit a spurious 500 from a third party service (e.g. 
> > adding or removing our VMs to/from load balancers via a cloud API). 
> > 
> > What's the best practice for dealing with these kinds of transient 
> > failures?  It seems to me that something like "sleep X seconds, then 
> > retry, up to Y times" would work quite well, but it isn't obvious to 
> > me how to make that happen. 
> > 
> > I'm aware of the wait_for module, but I don't think that really helps 
> > in this situation since the problem isn't that a resource is actually 
> > missing; it's just spurious failures. 
> > 
> > Any suggestions? 
> > 
> > Thanks! 
> > - Ian 
> > 
> -- 
> Brian Coca 
>
