My understanding of retry files (which could certainly be wrong) is that
they merely limit the hosts that are included in the run. I don't think
that will work for me, although perhaps this indicates that my playbook is
not set up well. Here is a simplified version of my site.yml:

- name: copy new files to all nodes
  hosts: all
  tasks:
    - include: tasks/deploy_files.yml

- name: configure and deploy backend type foo
  hosts: tag_foo
  roles:
    - foo

- name: configure and deploy backend type bar
  hosts: tag_bar
  roles:
    - bar

- name: configure and deploy backend type baz
  hosts: tag_baz
  roles:
    - baz

(etc, for 7 total backend types)

- name: clean up old deployments from all nodes
  hosts: all
  tasks:
    - include: tasks/remove_old_deployments.yml

So, given this structure, suppose that the "foo" play went fine, but then
a task failed during one of the "bar" backend deployments. Won't the
retry file contain just that single host (assuming we are running
"serial: 1" for the play that failed)? If I rerun using that file, I
might get that "bar" host to deploy correctly, but I will completely miss
all of the "baz" hosts and every other backend whose play appears after
the "bar" play.

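
For the transient cloud-API errors in particular, one possible way to avoid
ending up with a retry file at all is to retry the failing task in place
using Ansible's until/retries/delay task keywords. A minimal sketch, assuming
a hypothetical add_to_lb command and made-up retry numbers (neither comes
from the playbook above):

```yaml
# Sketch only: the command path and the "5 tries, 10 seconds apart"
# values are hypothetical placeholders, not taken from the real playbook.
- name: add this VM to the load balancer (hypothetical command)
  command: /usr/local/bin/add_to_lb {{ inventory_hostname }}
  register: lb_result
  until: lb_result.rc == 0   # keep retrying until the command exits 0
  retries: 5                 # "up to Y times"
  delay: 10                  # "sleep X seconds" between attempts
```

This likely helps only with tasks that run and then fail (such as a
spurious 500 from an API). It probably does not cover hosts that become
unreachable over SSH, so the broken-pipe errors may need a different fix.
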
I suppose one option might be to break up this single site.yml into 7
different playbooks, one for each backend type, and then execute them in
order, retrying each one as necessary if any errors occur. Would that
be a better setup? That *seems* a bit silly, but maybe I'm wrong about
that...

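
If the split were tried, the single entry point would not have to go away:
a top-level playbook can pull in the per-backend playbooks with
playbook-level include statements. A rough sketch, in which every file name
is hypothetical:

```yaml
# site.yml (sketch; all included file names are hypothetical)
- include: deploy_files.yml   # the "copy new files to all nodes" play
- include: backend_foo.yml    # the play with hosts: tag_foo, roles: [foo]
- include: backend_bar.yml
- include: backend_baz.yml
# ... one include per remaining backend type ...
- include: cleanup.yml        # the "clean up old deployments" play
```

Running "ansible-playbook site.yml" would still deploy everything in order,
while a failed backend could be rerun on its own (for example
"ansible-playbook backend_bar.yml --limit @<retry file>") before continuing
with the remaining playbooks.
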
Thanks,
Ian
On Monday, August 10, 2015 at 3:37:32 PM UTC-4, Brian Coca wrote:
>
> You can use the .retry files as a --limit to rerun the plays.
>
> On Mon, Aug 10, 2015 at 3:29 PM, Ian Rose <[email protected]> wrote:
> > Hi all -
> >
> > I've been pretty happy running Ansible for a few months now. The one major
> > thorn in my side is failed tasks. Our fleet of VMs is not very large, but
> > apparently is large enough (or our playbook is long enough) that we hit at
> > least one spurious SSH error (e.g. "SSH Error: mux_client_hello_exchange:
> > write packet: Broken pipe"), or, more rarely, I'll hit a spurious 500 from a
> > third party service (e.g. adding or removing our VMs to/from load balancers
> > via a cloud API).
> >
> > What's the best practice for dealing with these kinds of transient failures?
> > It seems to me that something like "sleep X seconds, then retry, up to Y
> > times" would work quite well, but it isn't obvious to me how to make that
> > happen.
> >
> > I'm aware of the wait_for module, but I don't think that really helps in
> > this situation since the problem isn't that a resource is actually missing;
> > it's just spurious failures.
> >
> > Any suggestions?
> >
> > Thanks!
> > - Ian
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "Ansible Project" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to [email protected].
> > To post to this group, send email to [email protected].
> > To view this discussion on the web visit
> > https://groups.google.com/d/msgid/ansible-project/e47c3c8a-817f-4933-b429-492a430b277f%40googlegroups.com.
> > For more options, visit https://groups.google.com/d/optout.
>
>
>
> --
> Brian Coca
>