Hi all - I've been pretty happy running Ansible for a few months now. The one major thorn in my side is failed tasks. Our fleet of VMs is not very large, but it is apparently large enough (or our playbook is long enough) that we hit at least one spurious SSH error (e.g. "SSH Error: mux_client_hello_exchange: write packet: Broken pipe"). More rarely, I'll hit a spurious 500 from a third-party service (e.g. when adding or removing our VMs to/from load balancers via a cloud API).
What's the best practice for dealing with these kinds of transient failures? It seems to me that something like "sleep X seconds, then retry, up to Y times" would work quite well, but it isn't obvious to me how to make that happen. I'm aware of the wait_for module, but I don't think it really helps in this situation, since the problem isn't that a resource is actually missing; it's just spurious failures. Any suggestions? Thanks! - Ian
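P.S. To make the "sleep X seconds, then retry, up to Y times" idea concrete, here is a rough Python sketch of the behavior I have in mind. It is not Ansible code: the names `retry`, `action`, `retries`, `delay`, and `flaky` are made up for illustration, and catching every exception is a simplifying assumption (in practice I'd only want to retry errors that look transient).

```python
import time


def retry(action, retries=3, delay=5.0, sleep=time.sleep):
    """Call action() until it succeeds, pausing `delay` seconds between tries.

    Makes at most `retries` attempts and re-raises the last error if all fail.
    All names here are hypothetical; this only illustrates the semantics.
    """
    for attempt in range(1, retries + 1):
        try:
            return action()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the real failure
            sleep(delay)  # sleep X seconds, then try again


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Hypothetical stand-in for a task that fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("Broken pipe")
        return "ok"

    print(retry(flaky, retries=5, delay=0.1))
```

In this sketch a task that keeps failing still fails the run after Y attempts, so a genuinely broken host would not be hidden by the retries.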
